Journal of Soft Computing and Applications 2017 No.1 (2017) 44-52
Available online at www.ispacs.com/jsca
Volume 2017, Issue 1, Year 2017, Article ID jsca-00073, 9 Pages
doi:10.5899/2017/jsca-00073

Research Article

Using a blocking approach to preserve privacy in classification rules by inserting dummy transactions

Doryaneh Hossien Afshari (1), Farsad Zamani Boroujeni (1)*
(1) Department of Computer Engineering, Khorasgan (Isfahan) Branch, Islamic Azad University, Isfahan, Iran

Copyright 2017 © Doryaneh Hossien Afshari and Farsad Zamani Boroujeni. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
The increasing rate of data sharing among organizations raises the risk of leaking sensitive knowledge, which in turn increases the importance of privacy preservation in the data-sharing process. This study focuses on privacy preservation in classification rule mining, a common data mining technique. We propose a blocking algorithm for hiding sensitive classification rules, in which rules are hidden by editing the set of transactions that satisfy them. The proposed approach also tries to deceive and block adversaries by inserting dummy transactions. Finally, the solution is evaluated and compared with other available solutions. Results show that limiting the number of attributes blocked in each sensitive rule decreases both the number of lost rules and the production rate of ghost rules.

Keywords: Blocking Approach, Classification Rule Hiding, Dummy Transaction, Privacy Preserving.

1 Introduction
The various data mining techniques and algorithms used to mine useful hidden knowledge from databases have their own strengths and weaknesses.

Production and hosting by ISPACS GmbH.
* Corresponding Author.
Email address: [email protected]; Tel: +98-313-5354001

One of the disadvantages of using these techniques is the disclosure of sensitive information; this jeopardizes the privacy and security of data and can bring disastrous consequences upon its owners. Privacy Preserving Data Mining (PPDM) was proposed to address this risk. Taking advantage of privacy-preserving and data sanitization techniques while extracting new useful patterns minimizes the possibility of gaining access to classified information. The issue of privacy preservation has been studied across different data mining techniques. The present paper concerns privacy preservation in the mining of classification rules. In the proposed solution, after mining the classification rules from the main database and selecting a number of them as sensitive rules, the hiding process is run using a blocking approach: a limited number of attribute values are edited and some dummy transactions are inserted into the database. This study aims to hide all sensitive classification rules while leaving the minimum possible side effects on the insensitive rules. The fundamental concepts of privacy preservation and the related works are presented in Sections 2, 3 and 4. Section 5 presents the proposed algorithm, and Sections 6 and 7 present the discussion and conclusion.

2 Theoretical background
Privacy-preserving techniques prevent sensitive information from being mined by making changes to the main database. The main purpose of preserving privacy in data mining is to sanitize sensitive knowledge while making the minimum possible changes to the main database, in such a way that the sensitive knowledge cannot be mined after the data mining process is run.
Generally, data sanitization (data censoring) is carried out in one of the two following ways:
Data Reconstruction: the main database is not changed directly [1].
Data Modification: changes are made directly in the main database. This method includes two techniques:
Definition 2.1. Data Distortion, which is carried out by changing a data value from 1 to 0 or vice versa [2],[3].
Definition 2.2. Data Blocking, which is carried out by replacing a data value with the unknown value "?" [2],[3].
One downside of the data distortion technique is the placement of wrong values in the database. In some databases, such as medical databases, data distortion cannot be used, because excluding some elements can be highly dangerous and placing wrong values can lead to terrible consequences [4].
Data sanitization, however, has its own problems. Among its side effects, hiding failure, lost rules, and ghost rules are the most common; detailed descriptions can be found in [5],[6]. Various attempts at sensitive knowledge sanitization have been made so far, and different algorithms have been introduced for hiding sensitive association rules or sensitive itemsets using heuristic, border-based, exact, or hybrid approaches [7-9]. This paper concerns classification rule hiding; a number of previously conducted studies in this field are presented below.

3 Related works
Methods introduced for classification rule hiding fall into three general categories: statistical-based, reconstruction-based, and perturbation-based methods. In [10], a reconstruction-based algorithm for hiding sensitive classification rules in categorical databases has been presented.
In this method, to generate a sanitized database, a decision tree containing only insensitive rules is built and used to reconstruct the database. Reference [11] presents another reconstruction-based algorithm (called LSA) which uses the insensitive rules mined from the main database to reconstruct the sanitized database. LSA edits the transactions that support both sensitive and insensitive rules so that they support only the insensitive ones, which decreases the side effects in the sanitized database. Reference [12] includes another algorithm for hiding classification rules; it is based on a data reduction approach and performs the hiding process by completely deleting the selected tuples. The solution proposed in [13] is based on data perturbation: the ROD algorithm is introduced to hide sensitive classification rules in categorical databases. In ROD, the tuples supporting the insensitive rules enter the sanitized database after the separation of sensitive and insensitive rules; then, by replacing some attribute values, every tuple related to a sensitive rule is changed so that it looks the same as an insensitive one. One obstacle of data perturbation approaches is the substitution of correct values in the main database with wrong ones (changing 1 to 0 and vice versa). As mentioned earlier, such substitution may have severe consequences under certain conditions, such as in medical databases [14]. Hence, a data blocking approach was introduced in [15], in which all attribute values of sensitive rules are substituted with the unknown value "?" in all transactions supporting the sensitive rules. This method inspired our proposed solution for hiding classification rules, which performs the hiding process by substituting only a limited number of sensitive attribute values in sensitive transactions.
It also tries to deceive and block adversaries by inserting dummy transactions that include unknown values.

4 Problem definition
Suppose that Rc is a classification rule mined from the main database D:

(A_C1 = V_AC1) ∧ ... ∧ (A_Cn = V_ACn) → (C = V_C)    (4.1)

The left side of the rule includes attributes and their values, where A_C1, A_C2, ..., A_Cn ≠ C, and the right side includes the class attribute and its value. Every classification rule has a degree of support, which is the number of transactions supporting the rule. A transaction in a database supports a classification rule if all attributes and attribute values of the rule appear unchanged in the transaction. Accordingly, the classification rule hiding problem is stated as follows: if R is the set of classification rules mined from the main database D and Rs ⊆ R is the set of sensitive rules, the purpose is to generate a sanitized database from which the sensitive rules cannot be mined while the insensitive rules R − Rs can still be mined as much as possible.

5 Main section: the proposed algorithm (BCR)
The proposed solution substitutes a certain number of attribute values with an unknown value in the transactions supporting sensitive rules, so that these rules cannot be mined from the final database. The method aims both to hide the sensitive rules and to edit the database in such a way that adversaries cannot detect the values that existed before editing; for this purpose, transactions with unknown values are inserted for a number of sensitive rules. In BCR, after mining the classification rules from the database and selecting the sensitive rules, the sensitive rules are sorted in descending order by their number of attributes. The database is then scanned for every sensitive rule, and the transactions supporting the rule are selected for editing.
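Selecting the supporting transactions relies on the support test defined in Section 4. A minimal sketch of that test follows; the dict-based rule and transaction representations are illustrative assumptions, not taken from the paper.

```python
def supports(transaction, rule_lhs, rule_class):
    """True if the transaction supports the rule: every (attribute, value)
    pair on the left side, plus the class attribute's value, must appear
    unchanged in the transaction."""
    class_attr, class_val = rule_class
    if transaction.get(class_attr) != class_val:
        return False
    return all(transaction.get(a) == v for a, v in rule_lhs.items())

def support_count(database, rule_lhs, rule_class):
    """Degree of support: the number of transactions supporting the rule."""
    return sum(supports(t, rule_lhs, rule_class) for t in database)
```

For example, with the rule (years = medium) and (black list = yes) => approval = no, a transaction such as {"years": "medium", "black list": "yes", "approval": "no", ...} supports the rule, while any transaction differing in one of those three values does not.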
To hide a sensitive rule, not all of its attributes but only a certain number of them are blocked. Formula (5.2) determines the number X of attributes to be blocked:

X = ⌈NA / 2⌉    (5.2)

where NA is the number of attributes of the sensitive rule. In the blocking process, the class attribute is always blocked, and the remaining X − 1 values are blocked randomly from the attributes on the left side of the rule. The blocking process for every sensitive rule is conducted according to the rule's degree of support; in other words, substituting original values with unknown values in the transactions supporting a sensitive rule continues until the rule is hidden. During the blocking process, dummy transactions are inserted for a number of sensitive rules. With the sensitive rules sorted as above, ⌈n/2⌉ dummy transactions are inserted for the first sensitive rule (the one with the highest number of attributes), ⌈n/2⌉ − 1 for the second, ⌈n/2⌉ − 2 for the third, and so on. The first transaction supporting a sensitive rule is the first to receive a dummy transaction, and only a certain number of dummy transactions are inserted: for every real transaction whose values have been changed to unknown, one dummy transaction is inserted. (If a sensitive rule has a degree of support of 5 with 5 supporting transactions but only three dummy transactions can be inserted for it, the insertion begins from the first supporting transaction and the last two transactions receive no dummy transactions.) Formula (5.3) gives the total number of dummy transactions that can be inserted:

Total = Σ_{i=0}^{n−1} (n − i)                 for n = 1, 2
Total = Σ_{i=0}^{⌈n/2⌉−1} (⌈n/2⌉ − i)         for n > 2    (5.3)

where n is the number of sensitive rules.
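Formulas (5.2) and (5.3) above can be sketched directly in code. The function names are illustrative assumptions; the formulas themselves follow the paper's definitions, with NA counting the class attribute among the rule's attributes (as in the worked example, where a 3-attribute rule gives X = 2).

```python
import math

def blocked_attribute_count(num_attrs):
    """Formula (5.2): X = ceil(NA / 2) attributes are blocked per rule
    (the class attribute plus X - 1 randomly chosen left-side attributes)."""
    return math.ceil(num_attrs / 2)

def total_dummy_transactions(n):
    """Formula (5.3): the total number of dummy transactions for n
    sensitive rules, where the rules are sorted in descending order by
    number of attributes and the i-th rule receives a decreasing share."""
    if n <= 2:
        return sum(n - i for i in range(n))
    half = math.ceil(n / 2)
    return sum(half - i for i in range(half))
```

For instance, a single sensitive rule yields one dummy transaction (matching the worked example below), two rules yield 2 + 1 = 3, and five rules yield 3 + 2 + 1 = 6.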
The insertion process begins with the first transaction supporting the first sensitive rule: a dummy transaction similar to the original transaction, but containing unknown values, is inserted, with the unknown values displaced with respect to the attributes of the sensitive rule. This prevents the repetition of transactions. Since inserting dummy transactions and displacing unknown values is difficult for sensitive rules with a small number of attributes, the sensitive rules are sorted in descending order by number of attributes, and dummy transactions are inserted in the same order. Pseudo code for the proposed solution is shown in Figure 1. The following presents an example of the proposed solution.

5.1. Example of the proposed algorithm
Table 1 shows the initial database. Every transaction has 5 attributes, with Approval result as the class attribute. To mine the classification rules, the RIPPER algorithm available in WEKA 3.7 (weka.classifiers.rules.JRip) was used. All parameters of the JRip algorithm were left at their defaults except the last one, Use Pruning, which was changed from True to False. Table 2 shows the mined rules. Rule No. 2 is considered the sensitive rule; since only one rule is selected, rule sorting is not performed. The transactions supporting the sensitive rule are selected by scanning the initial database; in this example, transactions No. 6 and 14 support it. The blocking process is performed according to Formula (5.2) with X = 2: two of the rule's 3 attributes, one of which is the class attribute, are blocked. According to Formula (5.3), one dummy transaction is inserted for the sensitive rule; this dummy transaction, in which the unknown values are displaced, is similar to transaction No. 6. The sensitive rule is now hidden.
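The blocking applied to transactions 6 and 14 in this example can be sketched as follows. The dict-based transaction format and function name are illustrative assumptions, not taken from the paper's pseudo code; only the rule (class attribute always blocked, plus X − 1 random left-side attributes) follows the text.

```python
import math
import random

def block_transaction(transaction, rule_attrs, class_attr):
    """Block a supporting transaction per Formula (5.2): the class
    attribute is always set to '?', plus X - 1 randomly chosen left-side
    attributes of the sensitive rule, where X = ceil(NA / 2) and NA
    counts the class attribute as well."""
    x = math.ceil((len(rule_attrs) + 1) / 2)
    blocked = dict(transaction)  # leave the original transaction intact
    blocked[class_attr] = "?"
    for attr in random.sample(rule_attrs, x - 1):
        blocked[attr] = "?"
    return blocked
```

Applied to transaction No. 6 with the sensitive rule's left-side attributes ["years at current work", "black list"] and class attribute "approval result", this yields X = 2 unknowns: the class value plus one randomly chosen left-side value, matching Table 3.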
After running the JRip algorithm again with the same parameters, the only rule mined is rule No. 1. Table 3 shows the sanitized database.

Table 1: The initial database.
No. | Years at current work | Marriage status | Gender | Black list | Approval result
1   | Short  | Married | Female | No  | No
2   | Short  | Married | Female | Yes | No
3   | Long   | Married | Female | No  | Yes
4   | Medium | Divorce | Female | No  | Yes
5   | Medium | Single  | Male   | No  | Yes
6   | Medium | Single  | Male   | Yes | No
7   | Long   | Single  | Male   | Yes | Yes
8   | Short  | Divorce | Female | No  | No
9   | Short  | Single  | Male   | No  | Yes
10  | Medium | Divorce | Male   | No  | Yes
11  | Short  | Divorce | Male   | Yes | Yes
12  | Long   | Divorce | Female | Yes | Yes
13  | Long   | Married | Male   | No  | Yes
14  | Medium | Divorce | Female | Yes | No

Table 2: Classification rules mined from the initial database.
No. | Classification rule
1   | (gender = female) and (years at current work = short) => approval result = No (3.0/0.0)
2   | (years at current work = medium) and (black list = yes) => approval result = No (2.0/0.0)

Table 3: The sanitized database.
No. | Years at current work | Marriage status | Gender | Black list | Approval result
1   | Short  | Married | Female | No  | No
2   | Short  | Married | Female | Yes | No
3   | Long   | Married | Female | No  | Yes
4   | Medium | Divorce | Female | No  | Yes
5   | Medium | Single  | Male   | No  | Yes
6   | Medium | Single  | Male   | ?   | ?
D*  | ?      | Single  | Male   | Yes | ?
7   | Long   | Single  | Male   | Yes | Yes
8   | Short  | Divorce | Female | No  | No
9   | Short  | Single  | Male   | No  | Yes
10  | Medium | Divorce | Male   | No  | Yes
11  | Short  | Divorce | Male   | Yes | Yes
12  | Long   | Divorce | Female | Yes | Yes
13  | Long   | Married | Male   | No  | Yes
14  | ?      | Divorce | Female | Yes | ?

6 Comparative evaluation of the proposed algorithm
To evaluate performance, both the proposed BCR algorithm and the ROD algorithm [13] were run on 3 real databases provided by the UCI repository. Table 4 shows the specifications of the 3 databases.
Table 4: Specifications of the databases.
Dataset       | Instances | Attributes | Rules
Breast-Cancer | 286       | 10         | 14
Mushroom      | 8124      | 23         | 9
Car           | 1728      | 7          | 49

Classification rules were mined using the RIPPER algorithm, and 2 and 3 sensitive rules were randomly selected. Three factors were considered to evaluate the algorithms: hiding failure, lost rules, and ghost rules. Hiding failure is the number of sensitive rules that can still be mined from the database after the sanitization process. Lost rules is the number of insensitive rules that can no longer be mined from the database after sanitization. Ghost rules is the number of rules that do not exist in the initial database but can be mined from the sanitized database. The results of running the proposed algorithm are shown in Figures 1, 2 and 3.

[Figure 1: Comparative evaluation in the Mushroom database — number of side effects (hiding failure, lost rules, ghost rules) for BCR and ROD with 2 and 3 sensitive rules.]

[Figure 2: Comparative evaluation in the Breast Cancer database — number of side effects (hiding failure, lost rules, ghost rules) for BCR and ROD with 2 and 3 sensitive rules.]

[Figure 3: Comparative evaluation in the Car database — number of side effects (hiding failure, lost rules, ghost rules) for BCR and ROD with 2 and 3 sensitive rules.]
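The three side-effect metrics used in this evaluation can be sketched as set operations over the rule sets mined before and after sanitization. Representing each rule as a hashable description is an illustrative assumption.

```python
def side_effects(rules_before, rules_after, sensitive):
    """Compute the three evaluation metrics as set operations:
    - hiding failure: sensitive rules still minable after sanitization
    - lost rules: non-sensitive rules that can no longer be mined
    - ghost rules: rules minable only from the sanitized database"""
    hiding_failure = len(rules_after & sensitive)
    lost = len((rules_before - sensitive) - rules_after)
    ghost = len(rules_after - rules_before)
    return hiding_failure, lost, ghost
```

An ideal sanitization yields (0, 0, 0): all sensitive rules hidden, no insensitive rules lost, and no spurious rules introduced.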
As the diagrams above show, the proposed algorithm hides the sensitive rules with the minimum possible side effects. Selecting a limited number of attributes for the hiding process causes fewer changes in the database and, consequently, decreases the number of lost rules and the generation rate of ghost rules. Sorting the rules by number of attributes and inserting dummy transactions hierarchically, while leaving some sensitive rules without dummy transactions, perturbs the detection of the blocking process and lowers the generation rate of ghost rules. The results also show that in the ROD algorithm the number of side effects grows as the number of sensitive rules increases, whereas this is not the case in the proposed algorithm. Although, as Figure 3 illustrates, inserting transactions has produced some ghost rules, the proposed algorithm still shows better results than ROD; this is due to the limits defined on the insertion of dummy transactions. Because the insertion of dummy transactions is subtractive and hierarchical, the production of ghost rules is controlled. By changing all the attributes that support the non-sensitive rules, the ROD algorithm produces more ghost rules than the proposed method, and this problem gets even worse when the sensitive and non-sensitive rules have equal numbers of attributes. The more attributes the sensitive rules have, the better the proposed algorithm performs compared to ROD, and the more efficient it is at maintaining data quality. In general, to hide sensitive rules it is enough to change only a limited number of attributes; making more changes only results in more side effects. Therefore, the results of the proposed algorithm are better than those of ROD in all the cases mentioned above.
Another factor contributing to the better results of the proposed method compared to ROD is the difference in rule sorting: since ROD sorts rules by number of supports, the sanitization may face problems if there are no non-sensitive rules with a lower number of supports.

6.1. Limitations of the proposed algorithm
One limitation of the proposed algorithm is its time complexity on large databases: because of the inserted transactions, the larger the database, the more costly the sanitization becomes. Further, because the attributes are chosen randomly, if the same attribute is present in many different non-sensitive rules, the number of side effects such as ghost rules or lost rules will increase.

7 Conclusion and future works
The present study introduced an algorithm based on a blocking approach for hiding sensitive classification rules. The proposed solution performs the hiding process as follows: the sensitive rules are sorted in descending order by number of attributes, and hiding starts with the sensitive rule having the highest number of attributes. To hide a sensitive rule, in addition to the class attribute, the values of a limited number of attributes on the left side of the rule are randomly substituted with the "?" value. The proposed method also inserts dummy transactions containing unknown values for a certain number of sensitive rules in order to disturb the database. To control the side effects resulting from transaction insertion, transactions are inserted hierarchically and in descending amounts, and only for a limited number of sensitive rules and transactions.
All results show that the proposed method is free from hiding failure, and in terms of the numbers of lost rules and ghost rules it is more efficient than the algorithm introduced in [15]. Given that the proposed algorithm performs the sanitization with fewer adverse effects, it is more successful at maintaining data quality. The proposed solution can be developed further and used with other classification methods, such as the nearest neighbor classifier and the decision tree. In future work, the manner of choosing the victim attribute could be changed so that it is based on the non-sensitive rules, and this could be combined with the insertion of transactions, so that transactions are inserted according to the non-sensitive rules.

References
[1] Y. Guo, Reconstruction-based association rule hiding, SIGMOD 2007 Ph.D. Workshop on Innovative Database Research, (2007) 51-56.
[2] H. Wang, Association Rule: From Mining to Hiding, Applied Mechanics and Materials, 321 (2013) 2570-2573. https://doi.org/10.4028/www.scientific.net/AMM.321-324.2570
[3] V. S. Verykios, Association rule hiding methods, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 3 (2013) 28-36. https://doi.org/10.1002/widm.1082
[4] V. Garg, A. Singh, D. Singh, A Survey of Association Rule Hiding Algorithms, 2014 Fourth International Conference on Communication Systems and Network Technologies (CSNT), (2014) 404-407. https://doi.org/10.1109/CSNT.2014.86
[5] H. Wang, Quality Measurements for Association Rules Hiding, AASRI Procedia, 5 (2013) 228-234. https://doi.org/10.1016/j.aasri.2013.10.083
[6] A. Gkoulalas-Divanis, J. Haritsa, M. Kantarcioglu, Privacy Issues in Association Rule Mining, in: Frequent Pattern Mining, Springer, (2014) 369-401. https://doi.org/10.1007/978-3-319-07821-2_15
[7] G. V. Moustakides, V. S. Verykios, A MaxMin approach for hiding frequent itemsets, Data & Knowledge Engineering, 65 (2008) 75-89.
https://doi.org/10.1016/j.datak.2007.06.012
[8] T.-P. Hong, C.-W. Lin, K.-T. Yang, S.-L. Wang, Using TF-IDF to hide sensitive itemsets, Applied Intelligence, 38 (2013) 502-510. https://doi.org/10.1007/s10489-012-0377-5
[9] S. Xingzhi, P. S. Yu, A border-based approach for hiding sensitive frequent itemsets, Fifth IEEE International Conference on Data Mining, (2005) 8. https://doi.org/10.1109/ICDM.2005.2
[10] J. Natwichai, X. Li, M. E. Orlowska, A reconstruction-based algorithm for classification rules hiding, 17th Australasian Database Conference, 49 (2006) 49-58.
[11] A. Katsarou, G.-D. Aris, V. S. Verykios, Reconstruction-based Classification Rule Hiding through Controlled Data Modification, in: Artificial Intelligence Applications and Innovations III, Springer, (2009) 449-458. https://doi.org/10.1007/978-1-4419-0221-4_53
[12] J. Natwichai, X. Sun, X. Li, Data reduction approach for sensitive associative classification rule hiding, Nineteenth Conference on Australasian Database, 75 (2008) 23-30.
[13] A. Delis, V. S. Verykios, A. A. Tsitsonis, A data perturbation approach to sensitive classification rule hiding, ACM Symposium on Applied Computing, (2010) 605-609. https://doi.org/10.1145/1774088.1774216
[14] Y. Saygin, V. S. Verykios, C. Clifton, Using unknowns to prevent discovery of association rules, ACM SIGMOD Record, 30 (2001) 45-54. https://doi.org/10.1145/604264.604271
[15] A. Parmar, U. P. Rao, D. R. Patel, Blocking based approach for classification rule hiding to preserve the privacy in database, International Symposium on Computer Science and Society (ISCCS), (2011) 323-326. https://doi.org/10.1109/isccs.2011.103