Mining Free Itemsets under Constraints

Content
Mining Free Itemsets under
Constraints
z Introduction
z Constrained
z
z
By Jean-Francois Boulicaut and Baptiste Jeudy
International Database Engineering and
Application Symposium
z
z
Apriori revisit
Anti-monotone constrains
Monotone constrains
Generic algorithm
z Frequent
z
z
itemset mining
closed itemset mining
CLOSE algorithm
Incorporating constraints into Apriori
z Conclusion
Introduction
Introduction
z Frequent
z Problems
itemset mining
A set of items is referred to as itemset
#X
z X is an item(or itemset), Support ( X ) =
n
z Support is bounded by a threshold r
z A frequent itemset is an itemset with a support
larger than the minimum support
z Given a database, find all the frequent itemsets
z
with frequent itemset mining
algorithms
The computation may be intractable for a usergiven frequency threshold: the number of
frequent itemsets may explode
z Lack of focus leads to huge output of frequent
itemsets
z
Introduction
z Two
Introduction
issues to tackle these problems
Constraint-based extraction of the frequent
itemsets: only a subset of the collection of
frequent itemsets is interesting.
z Condensed representation of frequent itemsets:
extract a subset of the frequent patterns and
regenerate the whole collection when necessary
z
z
Introduction
z Condensed
z
z
Constraints related to objective measures of
interestingness
z
z
an item must not appear in the itemsets
the itemsets must be frequent
Push constraint checking into algorithms
z Anti-monotone constraints
z Monotone constraints
•Decrease the
size of output
• Improve
user guidance
Introduction
representation of frequent
itemsets
Extract a particular subset of the frequent itemset
collection
z The condensed subset is much smaller than the
original collection
z Can be extracted efficiently
z The whole frequent itemsets can be regenerated
z
Constraint-based extraction of frequent itemsets
z Syntactic constraints
z Main
idea of the paper
Combine the above two approaches into one
algorithm
z This algorithm is based on the structure of
Apriori
z
Content
Summary of paper
z Introduction
z Constrained
z
z
z
z
Apriori revisit
Anti-monotone constrains
Monotone constrains
Generic algorithm
z Frequent
z
z
z
itemset mining
closed itemset mining
CLOSE algorithm
Incorporating constraints into Apriori
z Conclusion
Summary of paper
Definition of constraints
z T : transactional database
Items
z2
: set of all itemsets
z C : constraint
Items
z S : itemset, S∈2
Items
z I : subset of 2
z S satisfies C in T iff C (S , T ) = ture
z SATC (I)={ S ∈ I , S satisfies C }
Items
)
z SATC denotes SATC (2
Summary of paper
TID
Items
Itemset
Support
Frequency
1
ABCD
A
1,2,3,4,6
0.83
2
AC
B
1,4,5,6
0.67
z
3
AC
AB
1,4,5
0.5
4
ABCD
z
AC
1,2,3,4,6
0.83
5
BC
CD
1,4
0.33
6
ABC
ACD
1,4
0.33
C freq ( S ) ≡ F ( S ) ≥ r : an itemset must be at least r frequent.
r
= 0,6 SATC freq = { A, B, C , AC , BC}
Csize ( S ) ≡| S |≤ 2 and Cmiss ( S ) ≡ B ∉ S , then
SATCsize ΛCmiss = { A, C , D, AC , AD, CD} , SATC
size ΛCmiss ΛC freq
= { A, C , AC}
z Constrained
itemset mining
T : transactional database
C : constraint
z Computation of the collection of itemsets that
satisfy C together with their frequecies
z
z
RC = {( S , F ( S )), S ∈ SATC }
Use Apriori for constrained itemset mining
where C is C fre q
Summary of paper - Apriori
z Apriori
z
z
z
z
g
1
z
)
generateapriori ( Lk ) = { A ∪ B, where
A, B ∈ Lk , A and B share the k-1 first
items(in lexicographic order)}
can be changed:
S does not satisfyC , every superset of S
does not satisfy C
z Example: sum( S , price) ≤ v
z A disjunction or conjunction of anti-monotone
constraints is an anti-monotone constraint
am
am
Monotone Constraints
z Definition
Let Cam be an anti-monotone constraint. Step 5 of
What about
Apriori is replaced by Lk := SATC (Ck )
monotone
constraints?
it is still correct and complete.
am
z Apriori
an anti-monotone constraint is a
constraint C such that for all itemsets S, S’:
z If
Anti-monotone constraints
z Apriori
z Definition:
( S ' ⊆ S ΛS satisfy C ) ⇒ S ' satisfy C
Lk := SATC freq (Ck )
5.
g
z 6. Ck +1 := generateapriori ( Lk )
z 7. k := k + 1
k −1
z 8. U i = 0 Li
z
The completeness of
Apriori relies on the
anti-monotonicity of
the constraint
Algorithm
Phase 1 – Candidate safe
pruning
; L0 = {Φcandidates
}
1. C := Items1Eliminate
for
which
a
subset
of
length
k
2. k := 1
is
not frequent
Phase
2frequency
Phase 3 – candidate
constraint
generation
for level k+1,
3.while Ckg ≠ Φ do
(database
scan) that
fuse
two elements
Ckg first
, Lk −1
share the same k-1
4. Ck :=safe-pruning-on(
items
Anti-monotone constraints
can be used to mine constrained
itemsets when the given constraint is antimonotone
z S ∈ Items, Cm ( S ) is
true ⇒ ∀S ' ⊃ S , Cm ( S ') is true
z Example
z sum( S , price) ≥ v
z Given
a monotone constraint Cm , simply
replacing Step 5 in Apriori with L := SAT (C )
leads to the loss of the completeness of
Apriori.
k
Cm
k
The generation step
in Apriori must be
complete: i.e., it
must not miss any
itemset satisfying C
Monotone Constraints
z Example
The pruning step
z Assume C ( S ) ≡ C ∈ S . Itemset ABC should
(Phase 1)be
must be
correct, i.e., it must
generate
apriori from AB and AC but
generated by
not prune an itemset
verify C
since C ( AB) = false, ACB is not generatedthat
whereas
C ( ABC ) = true
z
Assume C ( S ) ≡ A ∈ S . Itemset ABC is correctly
generated by generateapriori from AB and AC but
since C ( AB) = false, ACB is incorrectly pruned
whereas C ( ABC ) = true
Monotone Constraints
z Some
definition in modified generation
procedure
Negative border: If Cam denotes an anti-monotone
constraint, BdC is the collection of the minimal
itemsets that do not satisfy Cam
z Cm denotes a monotone constraint, it is the
negation of Cam , so ¬C 'am equals to Cm
z
am
The generation step and pruning step need to be modified in order to
include monotone constraints
Monotone Constraints
z Generation
Monotone Constraints
procedure
z Pruning
z generate1 (Lk ) = {A U B, where A ∈ Lk and B is a 1-itemset }
• Cam denotes an antimonotone constraint
• ¬C 'am denotes a
Assume C = C Λ¬C '
and ms = Max
|S|
monotone constraint
generatem
• BdCam denotes the
generate ( L ) = Bd ∩ Items
We do not
need toof the
collection
verify the
monotone
minimal
itemsets that do
For k ≥ 1 ,
constraint
thisC
notafter
satisfy
∩ Items
)
z If k<ms, generate ( L ) = generate ( L ) ∪ ( Bd
m
generation
z If k=ms, generate ( L ) = generate ( L )
procedure
z If k>ms,
z generate2 ( Lk ) = { A U B, where A,B ∈ Lk }
z
z
am
m
0
am
S∈BdC '
am
1
C 'am
m
k
1
k
m
k
1
k
C 'am
prunem
For all S ∈ Ckg+1 and for all S ' ⊂ S such that |S’|=k
do if S ' ∉ Lk and Cm ( S ') = true
g
then delete S from Ck +1
z prunem
is correct and complete
k +1
generatem ( Lk ) = generate2 ( Lk )
z
z
procedure
This generation procedure is complete and ensures that
every candidate itemset verifies ¬C 'am ( Cm )
The algorithm is correct because it does not prune any itemset that verify
.Its completeness means that if an itemset is not pruned then
every proper subset of that itemset verify Cam .
C = Cam Λ¬C 'am
Generic Algorithm
z
Generic Algorithm-example
For a constraint C = Cam ΛCm = Cam Λ¬C 'am , the generic algorithm uses the structure of Apriori
and the procedures generatem and pruning m
g
z 1. C1 := BdC ' ∩ Items1 ; L0 = {∅}
Apriori Algorithm
z 2. k:=1
g
z 3. while Ck ≠ ∅ do
1. C1g := Items1; L0 = {∅}
2. k:=1
z 4. Phase 1 – candidate safe pruning
g
3. while Ck ≠ ∅ do
Ck := pruning m (Ckg , Lk −1 )
4. Phase 1–candidate safe pruning
z 5. Phase 2 - anti-monotone constraint checking
C := safe − pruning − on(C g , L )
am
6. Phase 3 – candidate generation for level k+1
Ckg+1 := generatem ( Lk )
z
z
Cam ≡ A ∉ S , Cm ≡ B ∈ S
z BdC = {B, AB}
am
z ms = MaxS∈Bd
Cam
| S |= 2
z C1 = {B}
•
•
•
•
•
k<ms,
k=ms,
k<ms,
C2 = { AB, BC , BD} ∪ { AB} = { AB, BC , BD}
Ckg+1 := generateapriori ( Lk )
z
L2 = {BC , BD}
7. k:=k+1
8. Output ∪ik=−01 Li
z C3 = { ABC , BCD, ABD}
z L3 = {BCD}
ABCD
2
AC
3
AC
4
ABCD
5
BC
6
ABC
generatem ( Lk ) = generate1 ( Lk )
z
5. Phase 2-frequency constraint
1
generatem ( Lk ) = generate1 ( Lk ) ∪ ( Bd C 'am ∩ Itemsk +1 )
Lk := SATC freq (Ck )
k −1
Items
generate1 (Lk ) = {A U B, where A∈Lk and B is 1-itemset}
generate2 ( Lk ) = { A U B, where A,B∈ Lk }
z
k
6. Phase 3-candidate generate
7. k:=k+1
8. output ∪ik=−01 Li
Constraints:
L1 = {B}
k
Lk := SATCam (Ck )
z
z
TID
generatem ( Lk ) = generate2 ( Lk )
For all S ∈ Ck +1 and for all S ' ⊂ S such that |S’|=k
do if S ' ∉ Lk and Cm ( S ') = true
then delete S from C kg+ 1
g
{B,BC,BD,BCD}
CLOSE algorithmfrequent closed itemset mining
Content
z Introduction
z Constrained
z
z
z
z
Apriori revisit
Anti-monotone constrains
Monotone constrains
Generic algorithm
z Frequent
z
z
itemset mining
closed itemset mining
CLOSE algorithm
Incorporating constraints into Apriori
z Conclusion
z The
closure of an itemset S(closure(S)) is the
maximal superset of S which has the same
support as S.
z A closed itemset is an itemset that is equal to
its closure
z The set of closed itemset is a lattice called
the closed itemset lattice
CLOSE algorithm
z We
We can use this constraint in
our generic algorithm
together with other
constraints to achieve
constrained free-set mining
can consider CLOSE as an exploration
of the classical itemset lattice with a new
constraint
z A constraint for CLOSE
C 'Free ( S ) ≡ S ' ⊂ S ⇒ S ⊆/ closure( S ')
z Free
itemsets: itemsets that are not included
in any closure of their proper sub-set.
Equivalently, free itemsets are itemsets that
verify C 'Free
CLOSE algorithm
z The C 'Free
constraint is anti-monotone, it needs
a database pass to be checked
z Checking this constraint seems expensive if
the closure of every subset of S has to be
computed
CLOSE algorithm
The closure of an itemset S(closure(S)) is the maximal
superset of S which has the same support as S
z Example
S ⊆ closure( S ')
Items
1
ABCD
2
AC
3
AC
4
ABCD
5
BC
Closure(AB): items A and B
6
ABC
are simultaneously in transactions 1,4,6. Item C
is the only other item that is also present in these
three transactions, thus closure(AB)=ABC.
z Closure(A)=AC, Closure(B)=BC, AB ⊆
/ closure( A)
and AB ⊆/ closure( B). Therefore C ' ( AB) is true.
={∅ , A , B , D , AB }
z If frequency threshold r = ½,SAT
where ABC means that C' (AB) = true,closure(AB) = ABC
z
Free
CfreqΛC'Free
C
C
C
AC
Free
Incorporating constraints into Apriori
z Directly
using C = CFree ΛCam ΛCm causes two
problems
The closures of some candidates of level k are
not computed => impossible to check CFree at
level k+1
z SATC ΛC ΛC will no longer enables to compute
z
Free
•We can use an equivalent constraint CFree ( S ) ≡ ( S ' ⊂ S Λ | S ' |=| S | −1) ⇒ S ⊆/ closure( S ')
The equivalence means that C 'Free ( S ) is true iff CFree ( S ) is true.
•We need the closure of every subset of S of size |S|-1, then check if
TID
am
SATCam ΛCm
m
C
Incoporating constraints
z
z
Content
z Introduction
Assume we replace
z
C 'Free with
z
CFree
C 'FreeΛCm ( S ) ≡ ( S ' ⊂ S ΛCm ( S ')) ⇒ S ⊆/ closure( S ')
with CFreeΛC (S ) ≡ (S ' ⊂ S Λ | S ' |=| S | −1ΛCm (S ')) ⇒ S ⊆/ closure(S ')
m
Then: the constraints CFreeΛC and C 'FreeΛC are equivalent and
anti-monotone. The set SATC ΛC can be efficiently computed
using the same method as in CLOSE using SATC ΛC ΛC
i.e., the output of the generic algorithm with the constraint
m
m
am
m
am
m
FreeΛCm
C = CFreeΛCm ΛCam ΛCm
z Constrained
z
z
z
z
Apriori revisit
Anti-monotone constrains
Monotone constrains
Generic algorithm
z Frequent
z
Now we can find free-itemsets that verify conjunctions of anti-monotone
and monotone constraints
z
closed itemset mining
CLOSE algorithm
Incorporating constraints into Apriori
z Conclusion
Conclusion
z Frequent
itemset mining can be intractable
for a given support threshold and a particular
database
z Two issues to address this problem:
constraint-based itemset mining and
condensed representation of frequent
itemsets
z The generic algorithm can be used to achieve
constrained free-set mining when C=CamΛCm
itemset mining