Recent Advances in Intelligent Information Systems, ISBN 978-83-60434-59-8, pages 579–588

Inference Processes on Clustered Partial Decision Rules

Agnieszka Nowak and Beata Zielosko
Institute of Computer Science, University of Silesia
39, Będzińska St., 41-200 Sosnowiec, Poland, http://zsi.tech.us.edu.pl
agnieszka.nowak,[email protected]

Abstract. The aim of the paper is to study the efficiency of the inference process that uses clustered partial decision rules. Partial decision rules are constructed by a greedy algorithm and are clustered with the Agglomerative Hierarchical Clustering (AHC) algorithm. We study how exact and partial decision rules clustered by the AHC algorithm influence the inference process in a knowledge base. Clusters of rules are a way of modularizing knowledge bases in Decision Support Systems. Results of experiments show how different factors (e.g. rule length, number of facts given as input knowledge) influence the efficiency of the inference process.

Keywords: rule clusters, partial decision rules, greedy algorithm, composited knowledge bases, knowledge representation

1 Introduction

Knowledge-based systems are created for specialized domains of competence, in which effective problem solving normally requires human expertise. In recent years, knowledge-based systems technology has proven itself to be a valuable tool for solving hitherto intractable problems in domains such as telecommunications, aerospace, medicine and the computer industry itself. The goals of knowledge-based systems are often more ambitious than those of conventional programs. They frequently perform not only as problem solvers but also as intelligent assistants and training aids; they help their creators and users to better understand their own knowledge (Luger and Stubblefield, 2002). Proper knowledge acquisition is one of the main goals of knowledge-based systems.
Usually they are implemented using knowledge acquired from human experts or discovered in databases (for example using rough set theory (Moshkov et al., 2008a; Pawlak, 1991)). Rules are the most popular method of knowledge representation. Unfortunately, if we use (possibly different) tools for automatic rule acquisition and/or extraction, the number of rules can grow rapidly. For modern problems, knowledge bases can contain hundreds or thousands of rules. For such knowledge bases, the number of possible inference paths is very high. In such cases the knowledge engineer cannot be fully sure that all possible rule interactions are legal and lead to the expected results. This causes problems with inference efficiency and with the interpretation of inference results. For example, if we consider forward reasoning, many fired rules produce many new facts that are sometimes difficult to interpret properly and may be useless for the user (Nowak et al., 2006a, 2008). The problem of efficiency may be important in technical applications, especially in real-time systems. Inference methods are well known; there are well-described algorithms for forward and backward reasoning and some of their modifications (Luger and Stubblefield, 2002). We believe that the increase in efficiency should rely not on the modification of these algorithms but on the modification of the data structures they use. We build the knowledge base from partial decision rules constructed by the greedy algorithm. Partial decision rules consist of a smaller number of attributes than exact rules (Zielosko and Piliszczuk, 2008). We can say that such rules are less complex and easier to understand. This is not the only efficiency improvement we can make. We propose to reorganize the knowledge base from a set of unrelated rules into groups of similar rules (using a cluster analysis method).
Thanks to clustering the conditional parts of similar rules, possibly only a small subset of rules needs to be checked for given facts, which influences the performance of the inference process (Nowak et al., 2006a; Nowak and Wakulicz-Deja, 2008b). The paper consists of five sections. In Section 2 the main notions of partial decision rules and the greedy algorithm are considered. Section 3 describes how inference is performed on clusters of rules. In Section 4 results of experiments with real-life decision tables are considered. Section 5 contains a summary.

2 Partial decision rules

Decision rules can be considered as a way of knowledge representation. In applications we often deal with decision tables which contain noisy data. In this case, exact decision rules can be "over-fitted", i.e. depend essentially on the noise. So, instead of exact decision rules with many attributes, it is more appropriate to work with partial decision rules with a small number of attributes, which separate almost all different rows with different decisions (Zielosko and Piliszczuk, 2008). The problem of constructing decision rules with a minimal number of attributes is NP-hard. Therefore, we should consider approximate polynomial algorithms. In (Moshkov et al., 2008a) we adapted the well-known greedy algorithm for the set cover problem to the construction of partial decision rules. From the obtained bounds on the greedy algorithm accuracy and results proved in (Moshkov et al., 2008b) it follows that, under some natural assumptions on the class NP, the greedy algorithm is close to the best polynomial approximate algorithms for minimization of partial decision rule length. Now, the main notions for partial decision rules are presented. Let T be a table with n rows labeled with decisions and m columns labeled with attributes (names of attributes) a1, ..., am. This table is filled with values of attributes. The table T is called a decision table (Pawlak, 1991).
Two rows are called different if they have different values at the intersection with at least one column ai. Let r = (b1, ..., bm) be a row from T labeled with a decision d. We will denote by U(T, r) the set of rows from T which are different (in at least one column ai) from r and are labeled with decisions different from d. We will use the parameter α to denote a real number such that 0 ≤ α < 1. A decision rule

(ai1 = bi1) ∧ ... ∧ (ait = bit) → d

is an α-decision rule for the row r of decision table T if the attributes ai1, ..., ait separate from r at least ⌈(1 − α)|U(T, r)|⌉ rows from U(T, r). We will say that an attribute ai separates a row r′ ∈ U(T, r) from the row r if the rows r and r′ have different values at the intersection with the column ai. For example, a 0.01-decision rule means that the attributes contained in this rule should separate from r at least 99% of the rows from U(T, r). If α is equal to 0, we have an exact decision rule. Algorithm 1 describes the greedy algorithm with threshold α which constructs an α-decision rule for the row r of decision table T.

Algorithm 1: Greedy algorithm for partial decision rule construction
Input: Decision table T with conditional attributes a1, ..., am, row r = (b1, ..., bm) of T labeled by the decision d, and a real number α, 0 ≤ α < 1.
Output: α-decision rule for the row r of decision table T
  Q ← ∅;
  while attributes from Q separate from r less than (1 − α)|U(T, r)| rows from U(T, r) do
    select ai ∈ {a1, ..., am} with minimal index i such that ai separates from r the maximal number of rows from U(T, r) unseparated by attributes from Q;
    Q ← Q ∪ {ai};
  end
  return ⋀_{ai ∈ Q} (ai = bi) → d;

3 The hierarchical structure of knowledge base

It is known that cluster analysis provides a useful technique for reorganizing the knowledge base into a hierarchical structure of rules.
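As an illustration, Algorithm 1 can be sketched in Python as follows. This is our reading of the pseudocode, with the decision table represented as a list of attribute-value rows plus a parallel list of decisions; the function and variable names are ours, not from the paper:

```python
import math

def greedy_partial_rule(rows, decisions, r_idx, alpha):
    """Construct an alpha-decision rule for row r_idx (sketch of Algorithm 1)."""
    r, d = rows[r_idx], decisions[r_idx]
    # U(T, r): rows differing from r in at least one column, with a different decision
    U = [i for i, (row, dec) in enumerate(zip(rows, decisions))
         if dec != d and row != r]
    # the rule must separate at least ceil((1 - alpha) * |U(T, r)|) rows
    target = math.ceil((1 - alpha) * len(U))
    Q, separated = [], set()
    while len(separated) < target:
        best_a, best_gain = None, -1
        for a in range(len(r)):  # scan a1..am; strict '>' keeps the minimal index on ties
            if a in Q:
                continue
            gain = sum(1 for i in U
                       if i not in separated and rows[i][a] != r[a])
            if gain > best_gain:
                best_a, best_gain = a, gain
        Q.append(best_a)
        separated.update(i for i in U if rows[i][best_a] != r[best_a])
    # conjunction of (attribute index, value) pairs implying decision d
    return [(a, r[a]) for a in Q], d
```

For alpha = 0 the sketch produces an exact rule; a larger alpha stops the loop earlier and yields shorter rules.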
The hierarchy is a very simple and natural form of presenting the real structure and relationships between data in large data sets. Instead of one long list of all rules in the knowledge base, we achieve better results if we build a composited knowledge base as a set of groups of similar rules. That is why we used the agglomerative hierarchical clustering algorithm to build a tree of rule clusters (called a dendrogram). Such a tree has all the features of a binary tree, so we note that the time efficiency of searching such trees is O(log2 n). Such an optimization of the inference process is possible because only a small part of the whole knowledge base is analyzed, i.e. the part most relevant to the given set of facts (input knowledge) (Nowak et al., 2008).

3.1 The knowledge base structure

Having both X as a set of rules in a given knowledge base and Fsim as a similarity function on the set of rules from X, we may build a hierarchically organized model of the knowledge base. Such a model for n rules X = {x1, ..., xn}, where each rule uses a set of attributes A and values of these attributes V (V = ∪a∈A Va is the set of values of attribute a), is represented as a labeled binary tree Tree = {w1, ..., w2n−1} = ∪_{k=1}^{2n−1} {wk}, created by clustering rules with the similarity function. The label of a node in such a tree is a quintuple {d(xk), c(xk), f, i, j}, where c(xk) ∈ V1 × V2 × ... × Vm is the vector of the left-hand side (conditional part) of the rule xk, d(xk) is the vector of the right-hand side (decision part) of that rule, and f = Fsim : X × X → [0, 1] is the value of the similarity between the two clustered rules (or clusters) xi and xj. Elements i and j are the numbers of the children of the currently analyzed k-th node.

3.2 Agglomerative hierarchical clustering of rules

The agglomerative algorithm starts with each object being a separate cluster and successively merges groups according to a distance measure.
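The quintuple labels described above could be encoded, for example, as follows. This is only a sketch with names of our choosing; the paper does not prescribe a concrete data structure:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class RuleNode:
    """Label {d(x_k), c(x_k), f, i, j} of one node w_k in the dendrogram."""
    decision: Tuple              # d(x_k): decision (right-hand-side) vector
    conditions: Tuple            # c(x_k): conditional (left-hand-side) vector, or a cluster centroid
    similarity: float            # f = Fsim(x_i, x_j) in [0, 1] for the two merged children
    left: Optional[int] = None   # i: number of the first child (None for a leaf rule)
    right: Optional[int] = None  # j: number of the second child (None for a leaf rule)

def tree_size(n_rules: int) -> int:
    """A binary dendrogram over n rules has n leaves and n - 1 inner nodes."""
    return 2 * n_rules - 1
```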
The clustering may stop when all objects are in a single group (classical AHC). We could also use a modified AHC, the so-called mAHC, which is widely presented in (Nowak et al., 2006a,b).

Algorithm 2: Agglomerative hierarchical clustering algorithm for rules in knowledge bases
Input: A set O of n objects and a matrix of similarities between the objects.
Output: Clusterings C0, C1, ..., Cn−1 of the input set O
  C0 ← the trivial clustering of the n objects in the input set O;
  k ← 1;
  while |Ck−1| > 1 do
    find ci, cj ∈ Ck−1 for which the similarity s(ci, cj) is maximal;
    Ck ← (Ck−1 \ {ci, cj}) ∪ {ci ∪ cj};
    calculate the similarity s(ci, cj) for all ci, cj ∈ Ck;
    k ← k + 1;
  end
  return C0, C1, ..., Cn−1;

The algorithm gets as input a finite set O of n objects and a matrix of pairwise distances between these objects. This means that the execution of the clustering algorithm is completely independent of how the distances between the objects were computed. The algorithm starts with the trivial clustering C0 with n singleton clusters. At each iteration the two clusters ci, cj with the highest similarity in Ck−1 are found and merged. A new clustering Ck is formed by removing these two clusters and adding the new merged cluster, i.e. Ck is Ck−1 with the clusters ci and cj merged. This continues until there is only one cluster left. The output of the algorithm is the sequence of clusterings C0, C1, ..., Cn−1. A centroid, as the representative of a created group (cluster), is calculated as the average of all objects belonging to the given cluster. Various similarity and distance metrics were analyzed for clustering, because it is very important to use proper metrics (Ćwik and Koronacki, 2008). Analysis of clustering efficiency is presented in (Kaufman and Rousseeuw, 1990). The results show that only Gower's measure is effective for the different types of data that occur in composited knowledge bases.
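A compact sketch of Algorithm 2 follows, assuming similarities are given by a function on cluster centroids. The element-wise mean is used here as the cluster representative; the paper's Gower measure for mixed data is not reproduced:

```python
def ahc(objects, sim):
    """Agglomerative hierarchical clustering (sketch of Algorithm 2).

    objects: list of numeric vectors (e.g. coded rule conditions)
    sim: similarity of two centroids, larger = more similar
    Returns the sequence of clusterings C0, ..., C_{n-1}.
    """
    def centroid(cluster):
        pts = [objects[i] for i in cluster]
        return tuple(sum(col) / len(pts) for col in zip(*pts))

    clustering = [frozenset({i}) for i in range(len(objects))]  # C0: singletons
    history = [list(clustering)]
    while len(clustering) > 1:
        # find the pair of clusters ci, cj with maximal similarity
        ci, cj = max(((a, b) for p, a in enumerate(clustering)
                      for b in clustering[p + 1:]),
                     key=lambda ab: sim(centroid(ab[0]), centroid(ab[1])))
        # C_k = (C_{k-1} \ {ci, cj}) with the merged cluster ci ∪ cj added
        clustering = [c for c in clustering if c not in (ci, cj)] + [ci | cj]
        history.append(list(clustering))
    return history
```

Each entry of `history` corresponds to one clustering Ck; the n − 1 merges also define the inner nodes of the 2n − 1-node dendrogram from Section 3.1.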
3.3 Inference process on a hierarchical knowledge base

In decision support systems without rule clusters, the rule interpreter has to check each rule, one by one, and fire those which exactly match the given observations. This takes O(n) time, for n the number of rules, whereas in decision support systems with a hierarchical knowledge base structure the time is reduced to O(log2(2n − 1)), where 2n − 1 is the number of nodes in the tree of rule clusters. In this situation the rule interpreter does not have to check each rule one by one. The inference process is based on searching the binary tree of rule clusters and choosing the nodes with the highest similarity value. There are various tree searching techniques. We can choose one of two methods of searching trees: the so-called "the best node in all tree" or "the minimal value of coefficient". Both are widely presented in (Nowak et al., 2006a,b).

4 Results of experiments

We performed experiments with 4 datasets ("lymphography", "spect_all", "lenses" and "flags") from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). For each data set we use the greedy algorithm for partial decision rule construction with parameter α ∈ {0.0, 0.001, 0.01, 0.1, 0.2, 0.5}. Next, for every resulting set of rules we build a knowledge base and use the AHC algorithm to construct clusters of rules which are searched during the inference process. Tables 1–4 present results of the inference process for the data sets "lymphography", "spect_all", "lenses" and "flags". For each experimental set we note the number of rules (NR), the number of nodes searched in the dendrogram using the "the best node in all tree" method (Nn) and the number of input facts (Ndata).
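The descent behind the "the best node in all tree" strategy can be sketched as follows; the dictionary-based node layout and the `sim` function are our assumptions, not the paper's implementation:

```python
def find_relevant_cluster(nodes, root, facts, sim):
    """Descend the dendrogram from the root, always moving to the child whose
    representative (centroid) is most similar to the given facts.
    Only one root-to-leaf path is visited instead of the whole rule list."""
    k, visited = root, 0
    while True:
        visited += 1
        node = nodes[k]
        if node["left"] is None:  # leaf: a single rule (or terminal cluster)
            return k, visited
        left, right = node["left"], node["right"]
        k = (left if sim(facts, nodes[left]["centroid"])
                     >= sim(facts, nodes[right]["centroid"])
             else right)
```

On a balanced dendrogram this visits one node per level, i.e. O(log2 n) nodes rather than all n rules.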
For these items we recorded the number of relevant rules in the whole knowledge base (Nrr), the number of relevant rules successfully found (Nrs), the number of searched rules that are not relevant (Nnrs), the Precision, and the percentage of the whole knowledge base that is really searched (%KB). Precision is a measure of the probability of searching only relevant rules during the inference process. It was considered further in (Nowak and Wakulicz-Deja, 2008a). For us, as in Information Retrieval, a perfect precision equal to 1.0 means that every result retrieved by a search was relevant (but says nothing about whether all relevant rules were retrieved, simply because there is no need to find every relevant rule). Based on the results presented in Tables 1–4 we can see that the more rules the knowledge base consists of, the smaller the percentage of the knowledge base that is really searched. Figure 1 presents how the average percentage of the searched knowledge base depends on the number of rules. Tables 5–8 present the minimal, average and maximal length of rules for the data sets "lymphography", "spect_all", "lenses" and "flags". Based on the results presented in Tables 5–8 we can see that, as the parameter α grows, the length of rules decreases.
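For clarity, the reported measures can be written down as follows. The Precision formula follows directly from the text; the %KB denominator of 2·NR − 1 (the number of dendrogram nodes) is our inference, though it reproduces the tabulated values, e.g. 12/295 ≈ 4.07% for lymphography:

```python
def precision(n_rs: int, n_nrs: int) -> float:
    """Fraction of searched rules that are relevant (IR-style precision)."""
    return n_rs / (n_rs + n_nrs)

def pct_kb_searched(n_n: int, n_r: int) -> float:
    """Percent of the knowledge base really searched: nodes visited (Nn)
    over the 2*NR - 1 nodes of the dendrogram built from NR rules.
    (The 2*NR - 1 denominator is our reading, checked against Tables 1-4.)"""
    return 100.0 * n_n / (2 * n_r - 1)
```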
Table 1: Results of inference process for lymphography

NR    Nn   Ndata  Nrr  Nrs  Nnrs  Precision  %KB
148   12   1      7    1    0     1          4.07 %
148   12   2      1    1    0     1          4.07 %
148   16   3      3    1    0     1          5.42 %
148   18   2      2    1    0     1          6.10 %
148   20   1      11   1    0     1          6.78 %
148   14   2      2    1    0     1          4.75 %
148   26   1      11   1    0     1          8.81 %

Table 2: Results of inference process for spect_all

NR    Nn   Ndata  Nrr  Nrs  Nnrs  Precision  %KB
267   16   1      38   1    0     1          3 %
267   12   9      6    1    0     1          2.25 %
267   4    9      6    0    1     0          0.75 %
267   12   9      6    1    0     1          2.25 %
267   24   9      1    1    0     1          4.5 %
267   8    9      1    1    0     1          1.5 %
267   26   1      38   1    0     1          4.87 %
267   16   5      19   1    0     1          3 %
267   16   2      19   1    0     1          3 %
267   16   1      6    1    0     1          3 %
267   14   1      34   1    0     1          2.63 %
267   20   1      11   1    0     1          3.75 %
267   8    2      6    1    0     1          1.5 %

Table 3: Results of inference process for lenses

NR   Nn   Ndata  Nrr  Nrs  Nnrs  Precision  %KB
24   8    1      12   1    0     1          17.02 %
24   10   3      2    1    0     1          21.27 %
24   8    4      1    1    0     1          17.02 %
24   8    1      12   1    0     1          17.02 %
24   8    3      2    1    0     1          17.02 %
24   8    4      1    0    1     0          17.02 %
24   8    1      12   1    0     1          17.02 %
24   8    3      2    1    0     1          17.02 %
24   8    4      1    0    1     0          17.02 %
24   8    1      12   1    0     1          17.02 %
24   8    3      2    1    0     1          17.02 %
24   8    1      12   1    0     1          17.02 %
24   8    2      2    1    0     1          17.02 %
24   8    1      12   1    0     1          17.02 %

Table 4: Results of inference process for flags

NR    Nn   Ndata  Nrr  Nrs  Nnrs  Precision  %KB
194   22   3      2    1    0     1          5.68 %
194   20   1      2    1    0     1          5.16 %
194   20   2      1    1    0     1          5.16 %
194   22   3      2    1    0     1          5.68 %
194   20   1      2    1    0     1          5.16 %
194   20   2      1    1    0     1          5.16 %
194   14   2      2    1    0     1          3.61 %
194   28   1      5    1    0     1          7.23 %
194   14   2      3    1    0     1          3.61 %
194   12   1      5    1    0     1          3.1 %
194   18   1      2    1    0     1          4.65 %
194   12   1      5    1    0     1          3.1 %
194   18   1      2    1    0     1          4.65 %
194   12   1      5    1    0     1          3.1 %
194   18   1      2    1    0     1          4.65 %

Figure 1: Dependence between the number of rules and the % of KB really searched (plot of min, avg and max % of KB really searched against the number of rules: 24, 148, 194, 267)

Table 5: Length of partial decision rules for lymphography: 148 rows, 18 conditional attributes

α       min   avg   max
0.0     1.0   2.13  4.0
0.001   1.0   2.13  4.0
0.01    1.0   2.13  4.0
0.1     1.0   1.49  2.0
0.2     1.0   1.06  2.0

Table 6: Length of partial decision rules for spect_all: 267 rows, 23 conditional attributes

α       min   avg   max
0.0     1.0   3.19  10.0
0.001   1.0   3.19  10.0
0.01    1.0   2.92  10.0
0.1     1.0   1.56  7.0
0.2     1.0   1.30  5.0

Table 7: Length of partial decision rules for lenses: 24 rows, 4 conditional attributes

α       min   avg   max
0.0     1.0   2.13  4.0
0.001   1.0   2.13  4.0
0.01    1.0   2.13  4.0
0.1     1.0   1.92  3.0
0.2     1.0   1.50  2.0

Table 8: Length of partial decision rules for flags: 194 rows, 30 conditional attributes

α       min   avg   max
0.0     1.0   1.33  3.0
0.001   1.0   1.33  3.0
0.01    1.0   1.15  3.0
0.1     1.0   1.03  2.0
0.2     1.0   1.0   1.0

Figure 2: Average length of rules against α for the data sets "flags", "lenses", "lymphography" and "spect_all"

Figure 2 presents the average length of rules for the data sets "lymphography", "spect_all", "lenses" and "flags". Figure 3 presents the minimal, average and maximal length of rules for the data set "spect_all".

Figure 3: Minimal, average and maximal length of rules against α for spect_all

If we cluster rules before the inference process in a given knowledge base, then instead of the whole set of rules only a small percentage of the rules is searched. In large knowledge bases we can significantly reduce the number of rules relevant to the initial inference data. The reduction in the case of backward reasoning is less significant than in the case of forward reasoning; it depends on the selected goal and on the need to search for the solution in a small subset of rules. The results of the experiments show that the inference process is successful if, for a given set of facts, the relevant rules were found and fired without searching the whole knowledge base.
As can be seen in Figure 1, the more rules the knowledge base consists of, the smaller the part of it that is really searched. For the dataset with 267 rules it is as little as 1.5% of the whole knowledge base (Table 2). Different datasets were analyzed, with different types of data and various numbers of attributes and attribute values. Different values of the parameter α were also checked. We can see that as the parameter α grows, the length of rules falls; the smaller the value of α, the longer the rules. For a high value of α (e.g. α = 0.5) the rules are the shortest, which makes the inference process faster: a relevant rule has a small number of conditions (for α = 0.5 each rule consists of only one condition) that must be checked against the set of facts in the given knowledge base. For longer rules the time of rule checking is longer. For α = 0 in a dataset like "spect_all" (Figure 3) the maximal length of a rule is as high as 10, which obviously makes the inference process slower and less efficient. The best results are achieved for large knowledge bases with partial decision rules rather than exact ones, because in this case a small percentage of the whole knowledge base is searched and the inference process performed on the searched rules is really fast.

5 Summary

We proposed to optimize inference processes by using a modular representation of the knowledge base. Thanks to clustering rules we do not have to search all rules in the knowledge base but only the most relevant cluster. This process is based on an efficient method of searching binary trees, called "the best node in all tree", with time efficiency equal to O(log2 n). Exact rules can be over-fitted, i.e. depend essentially on the noise. So, instead of exact rules with many attributes we use partial rules with a smaller number of attributes. Thus the length of rules influences the efficiency of the inference process on the knowledge base.
Based on results from (Moshkov et al., 2008a) we used the greedy algorithm for partial decision rule construction, because it was proved that, under some natural assumptions on the class NP, the greedy algorithm is close to the best polynomial approximate algorithms for minimization of partial decision rule length. It is worth performing experiments on larger datasets with more than 20 attributes to study whether the number of attributes influences the efficiency of the inference process. Experiments on different ways (single linkage, average linkage and complete linkage) of creating clusters of rules are also interesting. We also plan to do some experiments on a modification of the AHC algorithm, the so-called mAHC presented in (Nowak et al., 2006a,b). Comparing both methods for measuring efficiency would be very interesting.

References

Jan Ćwik and Jacek Koronacki (2008), Statistical learning systems [in Polish], chapter 9, pp. 295–301, EXIT.
Leonard Kaufman and Peter J. Rousseeuw (1990), Finding Groups in Data: An Introduction to Cluster Analysis, chapter 5, pp. 199–253, John Wiley & Sons.
George F. Luger and William A. Stubblefield (2002), Artificial Intelligence: Structures and Strategies for Complex Problem Solving, chapter 3, pp. 93–106, Addison Wesley.
Mikhail Ju. Moshkov, Marcin Piliszczuk, and Beata Zielosko (2008a), On Partial Covers, Reducts and Decision Rules, Transactions on Rough Sets, 8:251–288.
Mikhail Ju. Moshkov, Marcin Piliszczuk, and Beata Zielosko (2008b), Partial covers, reducts and decision rules in rough sets: theory and applications, volume 145 of Studies in Computational Intelligence, chapter 1, Springer-Verlag Berlin Heidelberg.
Agnieszka Nowak, Roman Simiński, and Alicja Wakulicz-Deja (2006a), Towards modular representation of knowledge base, Advances in Soft Computing, pp. 421–428, Springer Verlag.
Agnieszka Nowak, Roman Simiński, and Alicja Wakulicz-Deja (2008), Knowledge representation for composited knowledge bases, pp. 405–414, EXIT.
Agnieszka Nowak and Alicja Wakulicz-Deja (2008a), The analysis of inference efficiency in composited knowledge bases [in Polish], pp. 101–108, Uniwersytet Śląski.
Agnieszka Nowak and Alicja Wakulicz-Deja (2008b), The inference processes on composited knowledge bases, pp. 415–422, EXIT.
Agnieszka Nowak, Alicja Wakulicz-Deja, and Sebastian Bachliński (2006b), Optimization of Speech Recognition by Clustering of Phones, Fundamenta Informaticae, 72:283–293.
Zdzisław Pawlak (1991), Rough Sets – Theoretical Aspects of Reasoning about Data, chapter 6, Kluwer Academic Publishers.
Beata Zielosko and Marcin Piliszczuk (2008), Greedy algorithm for attribute reduction, Fundamenta Informaticae, 85:549–561.