Secure Incremental Maintenance of Distributed Association Rules Agenda Introduction Secure Technologies Problem Definition Our algorithm Experiments Conclusions Introduction Association Rules – Secure Distributed Association Rules – – A means to identify patterns and trends Privacy is concerned Restricted usage of some information Maintenance of environment – – Association rules with more sites Use past results to reduce workload Secure Data Mining Approach 1: Data Obfuscation – – Association rules from modified data Simple algorithms but may get false rules Approach 2: Secure Protocols – – – Complex communication Difficult and costly algorithms but get accurate rules Balance between cost and privacy Secure Technologies Secure Sum – – – There are n sites Each site holds a private number Compute the sum of a group of sites Secure Union – – – There are n sites Each site holds a private set of items Compute the union of sets Secure Sum Example 10 Upper Bound: 40 Site 1 R = 28 17 – 28 mod 40 = 29 38 11 8 17 38 + 11 mod 40 = 9 Site 2 9 Site 3 Secure Technologies Secure Comparison – – – Two sites A site holds a number a, another holds a number b Check if a >= b without letting anyone knows the value of a and b Problem Definition There are n old sites – There are r new sites – Knows the association rules in these sites Requires update of association rules in new environment Maintain the privacy as well Privacy? What to protect? Different requirements in different situation Basic requirements – – Protect individual transaction Protect individual site information Local large itemsets, counts for itemsets Secure Multi-party computation – The process does not reveal any other useful information except the information that can be derived from own input and the final result Algorithms Secure Incremental Maintenance of Distributed Association Rules (SIMDAR) – Mining association rules with basic privacy level More Secure Incremental Maintenance of Distributed Association Rules (MSIMDAR) – Mining association rules under the definition of Secure Multiparty computation SIMDAR: What we know? (Assumption) Original Large Itemset Lk is available Total count for each old large Itemset is known All sites follow a semi-honest model – They follow the rules, but may try to guess other’s information based on the received data (intermediate messages) No collusion among any sites – Sites do not exchange intermediate information Algorithm - SIMDAR To find the large itemsets – – – – Generate the candidate sets Count on the candidates Summing counts Check for large itemset Check if an association rule holds – Easy with counts available Generate the candidates C1 = I For Ck, – Each new site generates its own candidate set with own (k-1)th locally large and globally large itemsets Secure Union to find the candidate sets from the new sites Union with Lk Summing on candidates Partition into 2 groups – – Pk: in Lk Qk: not in Lk For Pk, we got the original count, just add up the count in new sites using secure sum (no scan on old sites) Summing Count for Qk First summed up in new sites, we get a count If the itemset is large in new sites, send to old sites for scan Otherwise, prune away Information Protected by SIMDAR Individual transaction – Large Itemset of specific site – We never access to individual transaction of others They are input to Secure Union Count of each Itemset on each site – They are input to Secure Sum MSIMDAR: for Higher privacy level Final result: global association rules Input: Site database Other information should be protected Cannot reveal large itemsets? – – Costly checking We treat the large itemsets as part of the result MSIMDAR Target: Global large itemsets and association rules Useful information revealed by SIMDAR – – – Total Counts of itemsets Original results of large itemset to new sites New Candidates at new sites to old sites Add fake itemset to hide the actual supported itemsets MSIMDAR Hiding the total count of an itemset – Do we really need to find out the total count? Protect the large itemsets of the original results – Use a more complex protocol MSIMDAR – Adding Total excess count: – Instead of summing X.counti, we sum the excess count X.excessi – X.excess = X.count – s% |DB| Even revealed, we cannot know the count and database size Checking for large itemsets after Secure Sum – – – Sa (the first site) holds random key Rx Sb (the last site) holds (X.count – s% |DB| + Rx) Secure Comparison between Sa and Sb Storage We can reuse it in future and we need it in the future – – Checking for association rules requires counting information Prepare for next update Storage Commonly used method – Each site holds their own information – Count for each itemset Database size need to calculate the total count each time Storage We first sum the total database size |DB| using Secure Sum – – Su (first site) holds the key of secure sum Rt Sv (last site) get the sum |DB| + Rt For each itemset X, we store also – – The protecting key Rx The protected excess count X.excess + Rx Reusing the count Checking association rules – A.count – c% B.count > 0 – N1 + (-1)N2 + (-c%)N3 + (c%)N4 + (c%-1)s%N5 + (1-c%)N6 Can be derived by six stored numbers N1 = A.excess + Ra N2 = Ra N3 = B.excess + Rb N4 = Rb N5 = |DB| + Rt N6 = Rt Secure sum and secure comparison Avoiding new sites knowing past results Generating the candidates is similar except an old site will join to the Secure Union process For counting, two old sites will join Define: – – – Pk = Lk intersect Ck Qk = Ck – Pk Note that the new sites should not be able to distinguish Pk and Qk Adding counts in new site Adding for Pk Protected excess A B Old sites A Random Key New Sites Sum Secure Compare Adding for Qk 0 A B Old sites A 0 New Sites Sum Secure Compare New site pruning New sites sends the count to an old site to continue We got final excess count for Pk – Comparison means if the itemset is large in all sites We got excess count in new sites for Qk – Comparison means if the itemset is large in new sites Experiments 3 programs – – – Environment – – With privacy but no maintenance (SEC) No Privacy but maintenance (MAN) With privacy and maintenance (MSIDMAR) P4 1.7GHz under Linux Each site is simulated by an individual computer Measure – CPU time CPU Time (in seconds) DB size 3000 2000 1000 0 0 400 800 1200 Database size (in thousands) SEC MAN(old) MAN(new ) MSIMDAR(old) MSIMDAR(new ) Support thousands seconds) Total CPU Time(in Ratio 30 20 10 0 3:12 Ratio of old sites to new site New Sites Old Sites MSIMDAR MAN SEC 6:9 MSIMDAR MAN SEC 9:6 MSIMDAR MAN SEC MSIMDAR MAN SEC 12:3 Number of candidate sets Ratio 16000 12000 new sites old sites 8000 4000 0 0.00 1.00 2.00 3.00 4.00 Ratio of new sites to old sites 5.00 Analysis Process time at new sites takes much longer – About 3 time to 5 times of that of old sites Cost overhead due to secure algorithm – – – At old sites, average 10% of total cost At new sites, average 6% of total cost Both decrease in proportion with increase in db size Conclusion We have proposed algorithms to solve the maintenance problem at different privacy level – All can give a more efficient solution than simply ignoring the past results As the number of sites are most likely to increase – – The load on old sites will be low relatively to new sites High entrance cost but low maintenance cost End
© Copyright 2026 Paperzz