A Privacy Preserving Incremental Technique for

Secure Incremental Maintenance of
Distributed Association Rules
Agenda






Introduction
Secure Technologies
Problem Definition
Our algorithm
Experiments
Conclusions
Introduction

Association Rules
– A means to identify patterns and trends
Secure Distributed Association Rules
– Privacy is a concern
– Restricted usage of some information
Maintenance of environment
– Association rules with more sites
– Use past results to reduce workload
Secure Data Mining

Approach 1: Data Obfuscation
– Association rules from modified data
– Simple algorithms but may get false rules
Approach 2: Secure Protocols
– Complex communication
– Difficult and costly algorithms but get accurate rules
– Balance between cost and privacy
Secure Technologies

Secure Sum
– There are n sites
– Each site holds a private number
– Compute the sum of a group of sites
Secure Union
– There are n sites
– Each site holds a private set of items
– Compute the union of sets
Secure Sum Example
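Upper Bound: 40
Site 1 holds 10 and picks random R = 28; sends (10 + 28) mod 40 = 38 to Site 2
Site 2 holds 11; sends (38 + 11) mod 40 = 9 to Site 3
Site 3 holds 8; sends (9 + 8) mod 40 = 17 to Site 1
Site 1 computes (17 – 28) mod 40 = 29, the sum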
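The ring protocol in this example can be sketched in a few lines. This is a minimal single-process simulation under stated assumptions, not the deployed protocol: the function name `secure_sum` and its `mask` parameter are hypothetical, and the site values 10, 11 and 8 are read off the example (38 = 10 + 28, then + 11, then + 8). The true total must stay below the upper bound for the result to be exact.

```python
import random


def secure_sum(values, upper_bound, mask=None):
    """Simulate a ring-based secure sum.

    values[0] belongs to the initiating site, which also holds the mask R.
    Returns (total, messages), where messages are the masked partial sums
    passed from site to site.
    """
    if mask is None:
        # Assumption: the initiating site draws R uniformly below the bound.
        mask = random.randrange(upper_bound)
    running = (values[0] + mask) % upper_bound
    messages = [running]
    for value in values[1:]:
        # Each later site only sees a masked partial sum and adds its own value.
        running = (running + value) % upper_bound
        messages.append(running)
    # The initiating site removes its mask to obtain the total.
    total = (running - mask) % upper_bound
    return total, messages


total, messages = secure_sum([10, 11, 8], upper_bound=40, mask=28)
print(total, messages)
```

With the example's numbers the masked messages are 38, 9 and 17, and unmasking 17 with R = 28 gives 29.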
Secure Technologies

Secure Comparison
– Two sites
– One site holds a number a, another holds a number b
– Check if a >= b without letting anyone know the values of a and b
Problem Definition

There are n old sites
– The association rules in these sites are known
There are r new sites
– The association rules must be updated for the new environment
Maintain the privacy as well
Privacy? What to protect?

Different requirements in different situations
Basic requirements
– Protect individual transactions
– Protect individual site information: local large itemsets, counts for itemsets
Secure Multi-party computation
– The process does not reveal any other useful information except what can be derived from a site's own input and the final result
Algorithms

Secure Incremental Maintenance of Distributed Association Rules (SIMDAR)
– Mining association rules with basic privacy level
More Secure Incremental Maintenance of Distributed Association Rules (MSIMDAR)
– Mining association rules under the definition of Secure Multiparty computation
SIMDAR: What do we know? (Assumptions)

Original large itemset Lk is available
Total count for each old large itemset is known
All sites follow a semi-honest model
– They follow the rules, but may try to guess others' information based on the received data (intermediate messages)
No collusion among any sites
– Sites do not exchange intermediate information
Algorithm - SIMDAR

To find the large itemsets
– Generate the candidate sets
– Count on the candidates
– Summing counts
– Check for large itemset
Check if an association rule holds
– Easy with counts available
Generate the candidates

C1 = I
For Ck,
– Each new site generates its own candidate set from its own (k-1)th locally large and globally large itemsets
– Secure Union to find the candidate sets from the new sites
– Union with Lk
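The candidate step for k >= 2 can be sketched as follows. This is only an illustration under assumptions: each new site's input is read as the (k-1)-itemsets that are both locally large at that site and globally large, the per-site generation is taken to be an Apriori-style join and prune, and a plain set union stands in for Secure Union (which would hide which site contributed which candidate). The names `apriori_gen` and `global_candidates` are hypothetical.

```python
from itertools import combinations


def apriori_gen(large_prev, k):
    """Join (k-1)-itemsets into k-itemsets and keep those whose
    (k-1)-subsets are all present in large_prev."""
    large_prev = {frozenset(x) for x in large_prev}
    candidates = set()
    for a in large_prev:
        for b in large_prev:
            union = a | b
            if len(union) != k:
                continue
            if all(frozenset(sub) in large_prev
                   for sub in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


def global_candidates(new_site_large_prev, old_large_k, k):
    """Ck = (union of the new sites' candidate sets) union Lk."""
    per_site = [apriori_gen(site, k) for site in new_site_large_prev]
    # Stand-in for Secure Union: a plain union over the new sites.
    from_new_sites = set().union(*per_site)
    return from_new_sites | {frozenset(x) for x in old_large_k}


site1 = ["ab", "ac", "bc"]   # large 2-itemsets at new site 1
site2 = ["bc", "bd", "cd"]   # large 2-itemsets at new site 2
c3 = global_candidates([site1, site2], old_large_k=["abe"], k=3)
print(sorted("".join(sorted(x)) for x in c3))
```

Here site 1 contributes {a, b, c}, site 2 contributes {b, c, d}, and {a, b, e} comes from the old result Lk.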
Summing on candidates

Partition into 2 groups
– Pk: in Lk
– Qk: not in Lk
For Pk, we have the original count; just add the counts from the new sites using Secure Sum (no scan on old sites)
Summing Count for Qk

First sum over the new sites to get a count
If the itemset is large in the new sites, send it to the old sites for a scan
Otherwise, prune it away
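The two counting paths can be sketched together. The pruning is safe because an itemset in Qk was not large in the old sites, so if it is also not large in the new sites it cannot be large overall. The sketch below is an assumption-laden simulation: plain `sum` stands in for Secure Sum, `scan_old_sites` is a hypothetical callback that returns an itemset's count in the old sites, and the support threshold is given as an integer percentage.

```python
def simdar_large_itemsets(candidates, old_counts, new_counts_per_site,
                          scan_old_sites, old_size, new_size, s_percent):
    """Return (large, scanned): the large itemsets with their total counts,
    and the Qk itemsets that had to be scanned at the old sites."""

    def is_large(count, size):
        # count / size >= s%, written with integers only.
        return 100 * count >= s_percent * size

    large, scanned = {}, []
    for x in candidates:
        # Stand-in for Secure Sum over the new sites.
        new_total = sum(site.get(x, 0) for site in new_counts_per_site)
        if x in old_counts:
            # Pk: the old total is already known, no scan on old sites.
            total = old_counts[x] + new_total
        elif is_large(new_total, new_size):
            # Qk and large in the new sites: the old sites must scan.
            scanned.append(x)
            total = scan_old_sites(x) + new_total
        else:
            # Qk and not large in the new sites: prune.
            continue
        if is_large(total, old_size + new_size):
            large[x] = total
    return large, scanned


old_counts = {"ab": 20}                       # counts stored with Lk
new_sites = [{"ab": 3, "cd": 4, "ef": 1}, {"ab": 2, "cd": 3}]
old_db = {"cd": 9, "ef": 2}                   # what a scan would find
large, scanned = simdar_large_itemsets(
    ["ab", "cd", "ef"], old_counts, new_sites, old_db.__getitem__,
    old_size=100, new_size=50, s_percent=10)
print(large, scanned)
```

Only "cd" triggers a scan at the old sites; "ef" is pruned after the new-site sum.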
Information Protected by SIMDAR

Individual transaction
– We never access the individual transactions of others
Large itemset of specific site
– They are input to Secure Union
Count of each itemset on each site
– They are input to Secure Sum
MSIMDAR: for Higher privacy level

Final result: global association rules
Input: Site database
Other information should be protected
Cannot reveal large itemsets?
– Costly checking
– We treat the large itemsets as part of the result
MSIMDAR

Target: Global large itemsets and association rules
Useful information revealed by SIMDAR
– Total counts of itemsets
– Original results of large itemsets to new sites
– New candidates at new sites to old sites: add fake itemsets to hide the actually supported itemsets
MSIMDAR

Hiding the total count of an itemset
– Do we really need to find out the total count?
Protect the large itemsets of the original results
– Use a more complex protocol
MSIMDAR – Adding

Total excess count:
– Instead of summing X.count_i, we sum the excess count X.excess_i
– X.excess = X.count – s% |DB|
– Even if revealed, the count and the database size cannot be recovered from it
Checking for large itemsets after Secure Sum
– Sa (the first site) holds random key Rx
– Sb (the last site) holds (X.count – s% |DB| + Rx)
– Secure Comparison between Sa and Sb
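The excess-count check can be sketched as below. An itemset is large exactly when the summed excess is non-negative, so only the sign of the sum matters. Several things here are assumptions for illustration: the excess is scaled by 100 so that it stays an integer (100·count – s·|DB| for an integer percentage s), no modular wrap-around is used, and a direct `>=` stands in for the Secure Comparison between Sa and Sb. The function names are hypothetical.

```python
import random


def local_excess(count, db_size, s_percent):
    """Scaled local excess: 100 * count - s_percent * db_size."""
    return 100 * count - s_percent * db_size


def masked_excess_sum(site_counts, site_sizes, s_percent, key=None):
    """Return (Rx held by the first site Sa, masked sum held by the last site Sb)."""
    if key is None:
        # Assumption: Sa draws a random key to mask the running sum.
        key = random.randrange(10**9)
    running = key
    for count, size in zip(site_counts, site_sizes):
        # Each site adds only its own excess to the masked running sum.
        running += local_excess(count, size, s_percent)
    return key, running


def is_large(key, masked_sum):
    """Stand-in for Secure Comparison: large iff the total excess >= 0."""
    return masked_sum >= key


key, masked = masked_excess_sum([30, 5, 10], [200, 100, 100], s_percent=10, key=7)
print(key, masked, is_large(key, masked))
```

In this run the total count is 45 out of 400 transactions, above the 10% threshold, and neither value on its own gives the count or the database size.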
Storage

We can reuse it and we will need it in the future
– Checking for association rules requires counting information
– Prepare for next update
Storage

Commonly used method
– Each site holds its own information: count for each itemset, database size
– Need to calculate the total count each time
Storage

We first sum the total database size |DB| using Secure Sum
– Su (first site) holds the key of secure sum Rt
– Sv (last site) gets the sum |DB| + Rt
For each itemset X, we also store
– The protecting key Rx
– The protected excess count X.excess + Rx
Reusing the count

Checking association rules
– A.count – c% B.count > 0
– N1 + (-1)N2 + (-c%)N3 + (c%)N4 + (1-c%)s%N5 + (c%-1)s%N6
Can be derived from six stored numbers
– N1 = A.excess + Ra
– N2 = Ra
– N3 = B.excess + Rb
– N4 = Rb
– N5 = |DB| + Rt
– N6 = Rt
Secure sum and secure comparison
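The algebra behind the six stored numbers can be checked with a short sketch. Substituting X.count = X.excess + s% |DB| into A.count – c% B.count gives (N1 – N2) – c%(N3 – N4) + (1 – c%) s% (N5 – N6). The code below evaluates that expression in the clear purely to verify the identity; in the protocol no single party would hold all six numbers, and the combination would go through secure sum and secure comparison. The function name `rule_margin` and the sample values are assumptions.

```python
from fractions import Fraction


def rule_margin(n1, n2, n3, n4, n5, n6, s, c):
    """A.count - c * B.count rebuilt from the six stored numbers.

    s and c are the support and confidence thresholds as fractions.
    """
    return (n1 - n2) - c * (n3 - n4) + (1 - c) * s * (n5 - n6)


# Hypothetical values, used only to check the identity.
s, c = Fraction(1, 10), Fraction(1, 2)
db_size, a_count, b_count = 1000, 150, 250
ra, rb, rt = 11, 22, 33                  # protecting keys

n1 = (a_count - s * db_size) + ra        # A.excess + Ra
n2 = ra
n3 = (b_count - s * db_size) + rb        # B.excess + Rb
n4 = rb
n5 = db_size + rt                        # |DB| + Rt
n6 = rt

margin = rule_margin(n1, n2, n3, n4, n5, n6, s, c)
assert margin == a_count - c * b_count
print(margin > 0)                        # the rule holds iff the margin is positive
```

Exact fractions are used so that the comparison with A.count – c% B.count is not affected by floating-point rounding.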
Avoiding new sites knowing past results

Generating the candidates is similar, except that an old site joins the Secure Union process
For counting, two old sites join
Define:
– Pk = Lk intersect Ck
– Qk = Ck – Pk
– Note that the new sites should not be able to distinguish Pk and Qk
Adding counts in new site
[Figure: Adding for Pk. Labels: old sites A and B, protected excess, random key, new sites, Sum, Secure Compare]
[Figure: Adding for Qk. Labels: old sites A and B, 0, 0, new sites, Sum, Secure Compare]
New site pruning


New sites send the count to an old site to continue
We get the final excess count for Pk
– The comparison tells whether the itemset is large in all sites
We get the excess count in new sites for Qk
– The comparison tells whether the itemset is large in the new sites
Experiments

3 programs
– With privacy but no maintenance (SEC)
– No privacy but with maintenance (MAN)
– With privacy and maintenance (MSIMDAR)
Environment
– P4 1.7GHz under Linux
– Each site is simulated by an individual computer
Measure
– CPU time
DB size
[Figure: CPU time (in seconds) against database size (in thousands). Series: SEC, MAN(old), MAN(new), MSIMDAR(old), MSIMDAR(new)]
Support
Ratio
[Figure: Total CPU time (in thousands of seconds) at new sites and old sites for MSIMDAR, MAN and SEC, at ratios of old sites to new sites of 3:12, 6:9, 9:6 and 12:3]
Ratio
[Figure: Number of candidate sets at new sites and old sites against the ratio of new sites to old sites]
Analysis

Processing time at new sites is much longer
– About 3 to 5 times that of old sites
Cost overhead due to secure algorithm
– At old sites, on average 10% of total cost
– At new sites, on average 6% of total cost
– Both decrease in proportion as the db size increases
Conclusion

We have proposed algorithms to solve the maintenance problem at different privacy levels
– All give a more efficient solution than simply ignoring the past results
As the number of sites is most likely to increase
– The load on old sites will be low relative to new sites
– High entrance cost but low maintenance cost
End