An integer programming approach for frequent itemset hiding
Aris Gkoulalas-Divanis
Vassilios S. Verykios
CIKM’06
outline
• Introduction
• Basic definitions
• Methodology
• Experimental results
• Conclusions
introduction
• The approach is based on the notion of distance between the
original database and the sanitized database
• Goal: minimize this distance using integer programming while
hiding the sensitive itemsets and minimally affecting the
non-sensitive itemsets
Basic definitions
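Support count of an itemset in the bitmap representation
(sum over transactions of the product of the item bits):
support(ab) = 1*0 + 1*1 = 1
support(ac) = 1*1 + 1*1 = 2
Bitmap (one row per transaction):
a  b  c
1  0  1
1  1  1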
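The support computation on this slide can be sketched in a few lines of Python. This is only an illustration of the sum-of-products rule using the two-transaction bitmap above; the names `ITEMS`, `BITMAP` and `support_count` are hypothetical and do not come from the paper.

```python
# Bitmap from the slide: one row per transaction, one column per item.
ITEMS = ["a", "b", "c"]
BITMAP = [
    [1, 0, 1],
    [1, 1, 1],
]

def support_count(itemset, bitmap=BITMAP, items=ITEMS):
    """Sum over transactions of the product of the bits of the itemset's items."""
    cols = [items.index(i) for i in itemset]
    total = 0
    for row in bitmap:
        prod = 1
        for c in cols:
            prod *= row[c]  # the product is 1 only if every item is present
        total += prod
    return total

print("support(ab) =", support_count("ab"))  # 1*0 + 1*1
print("support(ac) =", support_count("ac"))  # 1*1 + 1*1
```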
Objective: maximize the number of 1s left in D’
Constraint: non-sensitive itemsets must remain frequent in D’
Constraint: sensitive itemsets must become infrequent in D’
(cont.)
• Solving this problem is NP-hard: there are
2^m - 1 inequalities (m: transaction lists)
(cont.)
• SI={e,ae,bc} (sensitive itemsets)
• S={e,bc} (minimal sensitive itemsets)
• SS={e,ae,bc,ce,abc,……} set of all
sensitive itemsets and their supersets
• Ideal case: F’ = F - SS, i.e. the sanitized database D’
contains all the frequent itemsets of D
except the sensitive ones
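A small Python sketch of how S and membership in SS could be derived from SI. It assumes that "minimal" means "has no proper subset in SI" and that an itemset belongs to SS when it contains some sensitive itemset; the helper names `minimal_itemsets` and `in_ss` are hypothetical.

```python
def minimal_itemsets(itemsets):
    """Keep only the itemsets that have no proper subset in the collection."""
    sets = {frozenset(x) for x in itemsets}
    return {x for x in sets if not any(y < x for y in sets)}

def in_ss(candidate, sensitive):
    """True if the candidate is a sensitive itemset or a superset of one."""
    cand = frozenset(candidate)
    return any(frozenset(s) <= cand for s in sensitive)

SI = ["e", "ae", "bc"]
S = minimal_itemsets(SI)  # ae is dropped because it contains e
print(sorted("".join(sorted(x)) for x in S))
print(in_ss("ce", SI), in_ss("abc", SI), in_ss("ab", SI))
```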
(cont.)
• Negative border
ex: acd: infrequent; ac, cd, ad: frequent
=> acd ∈ B-(F)
• Positive border
ex: ac: frequent; ac#: infrequent (#: any item)
=> ac ∈ B+(F)
Border revision
[Figure: itemset lattice over the items A, B, C, D (from null up to ABCD), showing the frequent and infrequent regions, the original border and the revised border]
B+(F) = {AD, BD, ABC}
B-(F) = {CD, ABD}
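The two borders of the lattice example can be recomputed with a short Python sketch. It assumes that F is the downward closure of the B+(F) given on the slide (every subset of AD, BD and ABC), that the positive border holds the maximal frequent itemsets, and that the negative border holds the minimal infrequent ones. The function names are hypothetical.

```python
from itertools import combinations

def fs(names):
    """Turn strings such as 'ABC' into a set of frozensets."""
    return {frozenset(n) for n in names}

def positive_border(frequent):
    """Maximal frequent itemsets: no proper superset is frequent."""
    return {x for x in frequent if not any(x < y for y in frequent)}

def negative_border(frequent, items):
    """Minimal infrequent itemsets: infrequent, all immediate subsets frequent.

    Checking immediate subsets is enough because F is downward closed.
    """
    border = set()
    for size in range(1, len(items) + 1):
        for combo in combinations(sorted(items), size):
            cand = frozenset(combo)
            if cand in frequent:
                continue
            subsets = (frozenset(s) for s in combinations(combo, size - 1))
            # The empty itemset is treated as frequent.
            if all(len(s) == 0 or s in frequent for s in subsets):
                border.add(cand)
    return border

# Assumed F: all non-empty subsets of AD, BD and ABC.
F = fs(["A", "B", "C", "D", "AB", "AC", "BC", "AD", "BD", "ABC"])
print(positive_border(F) == fs(["AD", "BD", "ABC"]))
print(negative_border(F, "ABCD") == fs(["CD", "ABD"]))
```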
Problem size minimization
C: the total set of affected itemsets
Lc: the set of solutions of the corresponding
inequalities
Cover: if keeping the inequality of C2 lets us remove the
inequality of C1 without affecting the global solution of the
system, then C2 covers C1
(cont.)
• Corollary: any itemset belonging to the positive
border of F-SS covers all its subsets
=> B+(F’) covers all itemsets of F’
B-(F’) cover all itemsets of
Ideal solution Lc:
(cont.)
example
• F={A,B,C,D,AB,AC,AD,CD,ACD}
• SI={AB}, S={AB}
• F’={A,B,C,D,AC,AD,CD,ACD}
• B+(F’)={B,ACD}
B:frequent
ACD:frequent
AB:infrequent
msup=0.2
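The example can be checked with a short Python sketch that removes the sensitive itemsets and their supersets from F and then takes the maximal itemsets of what remains. The variable and helper names are hypothetical; only the sets F and S come from the slide.

```python
def fs(names):
    """Turn strings such as 'ACD' into a set of frozensets."""
    return {frozenset(n) for n in names}

F = fs(["A", "B", "C", "D", "AB", "AC", "AD", "CD", "ACD"])
S = fs(["AB"])

# SS restricted to F: the sensitive itemsets and their frequent supersets.
SS = {x for x in F if any(s <= x for s in S)}
F_prime = F - SS

# Positive border of F': itemsets with no proper superset in F'.
B_plus = {x for x in F_prime if not any(x < y for y in F_prime)}

def label(itemsets):
    return sorted("".join(sorted(x)) for x in itemsets)

print("F' =", label(F_prime))
print("must stay frequent:", label(B_plus))
print("must become infrequent:", label(S))
```

Only three inequalities remain (B and ACD frequent, AB infrequent), which is the reduction the covering argument aims for.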
Constraint satisfaction problem
• A solution of a CSP is a complete assignment of
values to the variables that satisfies all the
constraints
• In a CSP we usually wish to maximize or minimize
an objective function subject to a number of
constraints
• To solve this problem we use “binary integer
programming (BIP)”, which transforms the CSP into an
optimization problem
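To make the optimization view concrete, here is a toy Python sketch that solves a tiny instance by exhaustive enumeration of the binary variables. It is not a BIP solver and not the authors' implementation. The 4-transaction database `D`, the count threshold `MIN_COUNT` and all names are hypothetical, and the sketch assumes that only existing 1s may be turned into 0s.

```python
from itertools import product

ITEMS = "abc"
D = [  # hypothetical database, one row per transaction
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
]
MIN_COUNT = 2                      # an itemset is frequent if its count >= 2
SENSITIVE = ["ab"]                 # must become infrequent
MUST_STAY_FREQUENT = ["ac", "bc"]  # positive border of F' for this toy case

def support(db, itemset):
    cols = [ITEMS.index(i) for i in itemset]
    return sum(all(row[c] for c in cols) for row in db)

def solve(db):
    """Try every 0/1 assignment of the existing 1s and keep the best feasible one."""
    ones = [(n, m) for n, row in enumerate(db) for m, bit in enumerate(row) if bit]
    best, best_kept = None, -1
    for bits in product([1, 0], repeat=len(ones)):
        cand = [[0] * len(ITEMS) for _ in db]
        for (n, m), u in zip(ones, bits):
            cand[n][m] = u
        if any(support(cand, s) >= MIN_COUNT for s in SENSITIVE):
            continue  # a sensitive itemset is still frequent
        if any(support(cand, x) < MIN_COUNT for x in MUST_STAY_FREQUENT):
            continue  # a non-sensitive border itemset was lost
        kept = sum(bits)  # objective: number of 1s left
        if kept > best_kept:
            best, best_kept = cand, kept
    return best, best_kept

sanitized, kept = solve(D)
print("1s kept:", kept, "of", sum(map(sum, D)))
print(sanitized)
```

Enumeration only works for a handful of variables; a real instance would be handed to an integer programming solver.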
Binary integer problem
Experimental results
• 10,000 transactions, 10 items, msup=0.1
conclusions
• Defines a new metric to quantify the
distance between the initial database D and
its sanitized version D’
• It has the benefit of being exact when
an ideal solution can be identified
Exact knowledge hiding through
database extension
Aris Gkoulalas-Divanis
Vassilios S. Verykios
TKDE’08
introduction
• The goal of the hiding algorithm is to create a minimal
extension DX of the original database DO, so that the
sanitized database D (DO together with DX) hides the sensitive itemsets
(cont.)
• S={e,ae,bc}
methodology
• P=|D| N=|Do| Q=|Dx|
ex: e:4, ae:3, bc:4 (N=10, mfreq=0.3)
$Q = \lfloor 4/0.3 - 10 \rfloor + 1 = 4$

(cont.)
• The distance between Do and D is measured based on the
extension Dx
(minimize)
(cont.)
Optimal solution set C:
S={e,ae,bc} mfreq=0.3 Q=4
C={e,f,bc,bd,ab,acd}
0.3*(10+4)-4
Safety margin
• The lower bound on Q may, under certain circumstances, be
insufficient to allow for the identification of an exact
solution
• Safety margin (SM): expands the size Q of Dx; it can be
predefined or computed dynamically
Ex: S={abc}, abc:3
$Q = \lfloor 3/0.3 - 10 \rfloor + 1 = 1$
A single transaction is insufficient to
provide an exact solution
(cont.)
• Null transactions in Dx are due to either:
(i) an unnecessarily large safety margin:
they should be removed from Dx
(ii) a large value of Q that is essential for proper hiding:
they need to be validated, since Q denotes the
lower bound on the number of transactions needed to
ensure proper hiding
(cont.)
• To ensure the minimum size of Dx, the hiding
algorithm keeps only k null transactions
Qinv: number of null transactions
V = Q + SM - Qinv
Ex: S={abc}, Q=1, SM=3
k = max(1-3, 0) = 0
Null transaction
Experimental results
(cont.)
conclusions
• Uses a minimal extension of the
original database
• It has the benefit of being exact when
an ideal solution can be identified