EFFICIENT SUBSET-LATTICE ALGORITHMS FOR MINING CLOSED FREQUENT ITEMSETS AND MAXIMAL FREQUENT ITEMSETS IN DATA STREAMS

International Journal of Electrical Engineering, Vol. 20, No. 2, pp. 51-63 (2013)

Short paper

Ye-In Chang, Chia-En Li, Wei-Hau Peng and Syuan-Yun Wang
Dept. of Computer Science and Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan, R.O.C.
ABSTRACT
There are multiple applications of association rules in data streams, such as market analysis, sensor networks and web tracking. Because data streams are continuous, high-speed, unbounded and real-time, each data stream can be scanned only once. Mining closed frequent itemsets is a further step beyond mining association rules, where a closed frequent itemset is a frequent itemset which has no superset with the same support. One well-known algorithm for mining closed frequent itemsets based on the sliding window model is the NewMoment algorithm. However, the NewMoment algorithm cannot efficiently mine closed frequent itemsets in data streams, since it generates many unclosed frequent itemsets in addition to the closed ones. Moreover, when data in the sliding window is incrementally updated, the NewMoment algorithm needs to reconstruct the whole tree structure. On the other hand, a frequent itemset is called maximal if it is not a subset of any other frequent itemset. One well-known algorithm for mining maximal frequent itemsets based on the sliding window model is the MFIoSSW algorithm, which uses a compact structure to mine the maximal frequent itemsets. However, when a new transaction comes, the number of comparisons between the new transaction and the old transactions is too large. Therefore, in this paper, we propose the Subset-Lattice algorithm, which embeds the property of subsets into the lattice structure to efficiently mine closed frequent itemsets and maximal frequent itemsets over a data stream sliding window. Moreover, when data in the sliding window is incrementally updated, our Subset-Lattice algorithms do not reconstruct the whole lattice structure. Our simulation results show that our algorithm for mining closed frequent itemsets outperforms the NewMoment algorithm, and our algorithm for mining maximal frequent itemsets outperforms the MFIoSSW algorithm.
Key words: closed frequent itemsets, data streams, lattice structure, maximal frequent
itemsets, sliding window.
I. INTRODUCTION
In recent years, database and knowledge discovery
communities have focused on a new data model in
which data arrives in the form of continuous streams.
This is called a data stream or streaming data. In many real-world applications, data is more appropriately handled by the data stream model than by traditional static databases. Therefore, mining frequent itemsets in data streams has become a research area of increasing importance [2, 3, 6, 7, 9, 14, 15, 17, 19, 20, 21, 22, 23].
There are three models for data streams, including the
landmark model, tilted-time window model and sliding
window model. The landmark model considers all the
data from a specific time point to the current time. All
data considered is treated equally and added to the mining, but considering all data equally results in bad performance and consumes too much space. The tilted-time
window is a variation of the landmark model. It also
considers all data from a specific time point to the current time, but the tilted-time window does not treat all
data equally. The time period is divided into several time
Manuscript received Feb. 26, 2013; and revised May 13, 2013; accepted May 13, 2013.
This research was supported in part by the National Science Council of Republic of China under Grant No. NSC-101-2221E-110-091-MY2.
slots. Recent time slots have higher weights, while older ones have lower weights. The frequent itemsets found by the tilted-time window are approximate, since it compresses the older data.
The sliding window model focuses on the recent data
from the current moment back to a specified time point,
which in the majority of real-world applications is more
important and relevant than old data [10, 13, 16].
One key issue of mining association rules is to find
the frequent itemsets, where a frequent itemset is a combination of items whose number of occurrences in the dataset is greater than a given threshold [24]. In traditional algorithms for mining association rules, e.g., Apriori [1], all frequent itemsets are kept. The formal definitions are as follows. Let the set of items be I = {i1, i2, ..., in} and the set of transactions be DB = {T1, T2, ..., Tm}. A set of items is referred to as an itemset, and an itemset with m items is written as X = {X1, X2, ..., Xm}. A transaction T is said to support an itemset X ⊂ I if it contains all items of X. The fraction of the transactions in DB that support X is called the support of X, denoted as support(X). An itemset is frequent if its support is above some user-defined minimum support threshold; otherwise, it is infrequent. Take Transaction Database T in Figure 1-(a) as an example: Figure 1-(b) shows the frequent itemsets of Transaction Database T, where the value of the minimum support is 2.
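As an illustration of these definitions, the following sketch enumerates frequent itemsets by brute force over a small database chosen to mirror the running example; the database, function and variable names are ours, not the paper's.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Enumerate every itemset whose support (the number of
    transactions containing it) reaches min_support."""
    items = sorted(set().union(*transactions))
    freq = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                freq[candidate] = support
    return freq

# A small database chosen to mirror the running example: "abd" appears
# twice and "ab" three times, so "ab", "ad", "bd" and "abd" are all
# frequent with the minimum support = 2.
db = [{'a', 'b', 'd'}, {'a', 'b', 'c'}, {'a', 'b', 'd'}, {'c', 'd'}]
freq = frequent_itemsets(db, min_support=2)
```

Such exhaustive enumeration is exponential in the number of items, which is exactly why compact representations such as closed and maximal frequent itemsets are of interest.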
To achieve the same goal of computing the confidence of an association rule, the closed frequent itemset has been proposed [11]. (Note that the confidence of an association rule is defined as support(X ∪ Y)/support(X), which is used to evaluate the strength of such association rules [8].) A closed frequent itemset is a frequent itemset which has no superset with the same support as it [5]. For the example shown in Figure 1-(a), Figure 1-(c) shows the closed frequent itemsets. Frequent itemset "ab" is a closed large itemset, because no other frequent itemset that contains it ("abd") has the same support count as it. However, frequent itemset "bd" is not a closed frequent itemset, since frequent itemset "abd" contains "bd" and has the same support as "bd". Closed frequent itemsets can greatly reduce the number of large itemsets without losing any information.
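The closedness check can be sketched directly from this definition. The supports below are hard-coded to mirror the running example ("ab" occurs three times, "abd" and "bd" twice each), and the function name is ours, not the paper's.

```python
# Illustrative supports mirroring the running example.
freq = {
    ('a',): 3, ('b',): 3, ('c',): 2, ('d',): 3,
    ('a', 'b'): 3, ('a', 'd'): 2, ('b', 'd'): 2,
    ('a', 'b', 'd'): 2,
}

def closed_frequent_itemsets(freq):
    """Keep only the frequent itemsets that have no proper superset
    with the same support (the definition of closedness)."""
    closed = {}
    for itemset, support in freq.items():
        has_equal_superset = any(
            set(itemset) < set(other) and other_support == support
            for other, other_support in freq.items()
        )
        if not has_equal_superset:
            closed[itemset] = support
    return closed

# "bd" is pruned because its superset "abd" has the same support,
# while "ab" survives because "abd" has a smaller support.
closed = closed_frequent_itemsets(freq)
```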
Fig. 1 Example 1: (a) Transaction Database T; (b) frequent itemsets with the minimum support = 2; (c) closed large itemsets.

The Moment algorithm [4] has been proposed for mining closed frequent itemsets over a data stream sliding window. (Note that the sliding window model emphasizes the recent data.) It defines four types of conditions for each candidate node. However, it stores information including the current closed frequent itemsets
and many unclosed frequent itemsets, which consumes
much memory, especially when the support threshold is low. Moreover, the exploration and node type checking are time consuming. The NewMoment algorithm [11] retains the hash table used in the Moment algorithm to check whether each candidate node is a closed
frequent itemset or not. Although the NewMoment algorithm resolves the disadvantage of time and memory
space consumption of the Moment algorithm by using a
technique of bit-patterns to speed up the tree construction, it still has to scan the transactions in the sliding
window to generate the bit-sequence of each item and
generates unclosed frequent itemsets in the search structure. Moreover, the NewMoment algorithm needs to
reconstruct the whole tree structure, when the window
slides.
On the other hand, an itemset is a maximal frequent itemset if none of its immediate supersets is frequent. By the characteristic of frequent itemsets, namely that all subsets of a frequent itemset are also frequent, frequent itemsets "ab", "ad", and "bd" in Figure 1-(b) are obvious and redundant. For the reasons mentioned above, the concept of maximal frequent itemsets is proposed to reduce this redundancy. For the example shown in Figure 1-(a), "abd" is a maximal frequent itemset. As a result, the maximal frequent itemset is a more compact representation
than the frequent itemset. Previous algorithms for mining maximal frequent itemsets in data streams have been
based on many different models. The MFIoSSW algorithm [12] uses a data structure which is similar to
INSTANT [18] but with a little difference. The difference
is that they add the step of deletion. When the window
slides, the deletion should be performed. They use an
array-based data structure A to maintain the maximal
frequent itemsets and other helpful itemsets.
Therefore, in this paper, first, we propose the Subset-Lattice algorithm for mining closed frequent itemsets, which does not insert unclosed nodes into the lattice structure. Rather, we use properties of sets to compute
the closed frequent itemsets directly among transactions.
When the window is sliding, there will not be too much
fluctuation in the Subset-Lattice structure. Second, we
propose the Subset-Lattice algorithm for mining maximal frequent itemsets. Exploiting the relations between sets and their subsets gives three advantages: (1) we do not have to perform as many comparisons; (2) when the support increases, the Subset-Lattice structure remains the same; and (3) we only have to compare with the maximal frequent itemsets stored in Max_set, rather than with all the itemsets in A(i-1) as in the MFIoSSW algorithm. From our simulation results, first, we show
that our Subset-Lattice algorithm for mining closed frequent itemsets needs less memory and less processing
time than the NewMoment algorithm. When window
slides, the execution time could be saved up to 50%.
Second, we show that the Subset-Lattice algorithm for
mining maximal frequent itemsets provides better performance than the MFIoSSW algorithm. Because our
lattice structure can decrease the number of comparisons, our Subset-Lattice algorithm can save time.
The rest of the paper is organized as follows. Section 2 gives a survey of some algorithms. Section 3 presents the proposed Subset-Lattice algorithm to find
closed frequent itemsets. Section 4 presents the proposed
Subset-Lattice algorithm to find maximal frequent itemsets. In Section 5, we study the performance. Finally, we
give a conclusion.
II. RELATED WORK

In this section, we describe the Moment algorithm [4], the NewMoment algorithm [11] and the MFIoSSW algorithm [12].

2.1 The Moment Algorithm

The Moment algorithm proposes an in-memory data structure, the closed enumeration tree (CET), to maintain a dynamically selected small set of itemsets. Similar to a prefix tree, each node ni represents an itemset I. A child node, nj, is obtained by adding a new item to I. The CET only maintains a dynamically selected set of itemsets, which includes (1) closed frequent itemsets, and (2) itemsets that form a boundary between closed frequent itemsets and the rest of the itemsets. The Moment algorithm divides the itemsets on the boundary into two categories: the first corresponds to the boundary between frequent and non-frequent itemsets, and the second corresponds to the boundary between closed and non-closed itemsets.

2.2 The NewMoment Algorithm

In this subsection, we describe the NewMoment algorithm. A bit-vector-based representation of items is used in the NewMoment algorithm to reduce the time and memory needed to slide the windows. A new data structure, NewCET (New Closed Enumeration Tree), based on a prefix tree structure is developed to maintain the essential information of closed frequent itemsets in the recent transactions of a data stream.

In the NewMoment algorithm, the bit-sequence serves three purposes: (1) representation of items; (2) window sliding; and (3) counting support. In the representation phase, for each item X in the current sliding window, a bit-sequence with w bits is constructed. If an item X is in the ith transaction of the current window, the ith bit of Bit(X) is set to 1; otherwise, it is set to 0. For example, given the window size = 4 and itemset "ab" appearing in transactions T1 and T3, the related bit-sequence is represented as 1010. This means that transactions T1 and T3 contain itemset "ab". In the window sliding phase, the process consists of two steps: delete the oldest transaction and append the incoming transaction. The bit-sequences of items are left-shifted one bit to delete the oldest transaction. The concept of bit-sequences can also be used to count support.

The NewCET data structure consists of three parts: (1) the bit-sequences of all 1-itemsets in the current transaction-sensitive window TransSW; (2) a set of closed frequent itemsets in TransSW; and (3) a hash table for checking whether a frequent itemset is closed or not, which stores all closed frequent itemsets with their supports as keys and is similar to the one used in the Moment algorithm [4].

2.3 The MFIoSSW Algorithm

In this subsection, we describe the MFIoSSW algorithm [12]. The MFIoSSW algorithm uses a data structure which is similar to INSTANT [18] but with a small difference: it adds a deletion step. When the window slides, the deletion must be performed. The algorithm uses an array-based data structure A to maintain the maximal frequent itemsets and other helpful itemsets. When a new transaction T arrives, it is added to the sliding window. Transaction T is compared with each itemset I in A(i-1) to generate the intersection S = T ∩ I. If S is not one of the itemsets in A(i), it is inserted into A(i) after all the itemsets J ⊂ S in A(i) are pruned. The earliest arrived transaction in the sliding window is deleted after the new transaction is added. If transaction T is going to be deleted, the algorithm considers three aspects. First, if transaction T is a subset of one of the elements in A(λ), that is, transaction T is a subset of a maximal frequent itemset, it can be ignored. Second, if transaction T is an infrequent itemset, it definitely corresponds to an itemset in A(i) (1 ≤ i < λ). Thus, we need to find and delete it, and sequentially delete all its subsets. Third, if T is a maximal frequent itemset, it is deleted, and all its subsets are recomputed to decide their type.
III. THE SUBSET-LATTICE ALGORITHM FOR
MINING CLOSED FREQUENT ITEMSETS
In this section, we first give basic definitions and notations, and then describe how to use these techniques to find and represent closed frequent itemsets, organizing them in a lattice.
In the Subset-Lattice algorithm, we first transform each transaction into a bit-representation. Take transaction {CD} as an example. We use the number of distinct items as the bit-vector length, with the items ordered lexically. When an item appears in the transaction, we set the corresponding bit to 1 according to its lexical position. The bit-representation of {CD} is thus denoted as Bit(0011), since there are four items {ABCD}. By way of the sliding window approach, we maintain the essential information of closed frequent itemsets in the recent w transactions of a data stream. The sliding process consists of two steps: delete the oldest transaction and append the incoming transaction. Second, we use set-relation checking to deal with each transaction. The variables used in our algorithm are listed in Table 1.
The proposed data structure, called Subset-Lattice,
is an extended Lattice structure. The Subset-Lattice consists of two parts. (1) A lattice node which contains the
information of transaction itemset and transaction ID. (2)
A pointer which represents the relation between the superset and the subset. In the Subset-Lattice, a pointer points from the superset to the subset.
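A minimal sketch of such a lattice node follows; the field names are our own choosing, since the paper does not specify an implementation.

```python
class LatticeNode:
    """One Subset-Lattice node: a transaction itemset plus the Tids of
    the window transactions that contain it.  `children` holds the
    pointers that run from a superset node down to its subset nodes."""

    def __init__(self, itemset, tid=None):
        self.itemset = frozenset(itemset)
        self.tidset = {tid} if tid is not None else set()
        self.children = []            # superset -> subset pointers

    @property
    def support(self):
        # Support of an itemset = number of Tids recorded in its Tidset.
        return len(self.tidset)

# Transaction T1 {CD} inserted directly as the first lattice node.
node_cd = LatticeNode({'C', 'D'}, tid=1)
```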
3.1 Data Insertion
The proposed algorithm is processed as follows.
First, we insert the transaction into the Subset-Lattice
including the transaction itemset X and transaction identifier Tid. When the next transaction comes, we start to
process the subsumption checking. In the subsumption
checking step, conditions are considered based on the
relation between the new transaction NewT and the old
transaction OldT. We use logical AND and XOR operators to make subsumption checking more efficient. (Note that based on the bit-operations, the complexity of the subsumption check is O(1), as compared to O(N log N) when an itemset is represented as an array of size N. The reason for the O(N log N) complexity is that the two itemsets first need to be sorted, after which finding the intersection of the sorted itemsets takes O(N) time.)
In Figure 2, Case equivalent indicates the
bit-operation Bit(NewT) ⊗ Bit(OldT ) = ø; that is,
itemset NewT and itemset OldT are the same itemset.
Case empty indicates the bit-operation Bit(NewT) ∧
Bit(OldT ) = ø; that is, there is no relation between NewT
and OldT . Case superset indicates the bit-operation
Bit(NewT) ∧ Bit(OldT ) = Bit(OldT ); that is, itemset
NewT is the superset of itemset OldT. Case subset indicates the bit-operation Bit(NewT) ∧ Bit(OldT) =
Bit(NewT); that is, itemset NewT is the subset of itemset
OldT . Case intersection indicates the bit-operation
Bit(NewT) ∧ Bit(OldT ) = Bit(NewT ∩ OldT ); that is,
itemset NewT and itemset OldT have some common
items.
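These five bit-operation cases can be sketched as follows, assuming (as in the bit-representation above) that each itemset is encoded as an integer bit mask; the function name is ours.

```python
def classify(new_bits: int, old_bits: int) -> str:
    """Classify the relation between itemsets NewT and OldT,
    each given as an integer bit mask."""
    if new_bits ^ old_bits == 0:
        return "equivalent"          # XOR is zero: same itemset
    common = new_bits & old_bits
    if common == 0:
        return "empty"               # no items in common
    if common == old_bits:
        return "superset"            # NewT contains OldT
    if common == new_bits:
        return "subset"              # NewT is contained in OldT
    return "intersection"            # only some items in common

# {ABCD} = 1111 vs {AB} = 1100: NewT is a superset of OldT.
assert classify(0b1111, 0b1100) == "superset"
# {ABD} = 1101 vs {CD} = 0011: only item D in common.
assert classify(0b1101, 0b0011) == "intersection"
```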
We use the following example to illustrate the process of data insertion. Table 2 shows an example of
stream data and window size = 5. When the data stream
comes, we process Procedure Insert as shown in Figure
3. Transaction T1 is the first transaction of the data
stream. The itemset {CD} and Tid = 1 are inserted into
the Subset-Lattice directly. Transaction T2 will call
Function FindT as shown in Figure 4 to check whether it
exists in the lattice or not. Transaction T2 will call Procedure CheckSet as shown in Figure 5 to check the relation among transaction T2 and previous transactions.
Transaction T2 will process Case empty, because {AB} ∩ {CD} is empty. Thus, a new node for itemset {AB} is created and transaction T2 is inserted into the Subset-Lattice. Figure 6-(a) shows the result of inserting transactions T1 and T2.
Table 1 Variables.
Notation  Description
w         The sliding window size
Tid       The transaction ID of data streams
NewT      A transaction which is going to be inserted into the Subset-Lattice
OldT      A transaction which is already in the Subset-Lattice
InterX    A common subset of transactions NewT and OldT
ODes      A descendant of transaction OldT
ChildT    A child lattice node of transaction OldT
InterY    A common subset of transactions InterX and ChildT
DelT      A transaction which is going to be deleted from the Subset-Lattice
DesT      A child lattice node of transaction DelT
ParT      The parent lattice node of transaction DelT
Tidset    A set of Tids
Table 2 An example for the data stream DS1.
TID  Transactions  Bit-representation  Case
T1   CD            0011                empty
T2   AB            1100                empty
T3   AB            1100                equivalent
T4   ABCD          1111                superset
T5   ABD           1101                subset
T6   AC            1010                subset
Transaction T3 will utilize Function FindT to search
the Subset-Lattice, to check whether or not it can find
the itemset in the Subset-Lattice. In Function FindT, we
will process Case equivalent as shown in Figure 4. By
way of Function FindT, we can find itemset {AB},
which is already in the Subset-Lattice. Hence, we only
update T3 into itemset {AB}’s Tidset as shown in Figure
6-(b). Transaction T4 is the condition of Case superset.
Itemset {ABCD} is the superset of itemset {AB} and
itemset {CD}. Therefore, we will process Procedure
CheckSet as shown in Figure 5. Itemsets {AB} and
{CD} become itemset {ABCD}’s child. We then insert
transaction T4 into the Tidset of itemsets {AB} and
{CD} as shown in Figure 6-(c).
Transaction T5 will pass through conditions of Case
subset and Case intersection. First, itemset {ABD} encounters itemset {ABCD} and by way of Procedure
CheckSet, itemset {ABD} becomes a child of itemset
{ABCD} and a superset of itemset {AB} (Case subset).
We need to insert Tidset of itemset {ABCD} into itemset
{ABD} and insert Tidset of itemset {ABCD} into itemset {AB}. Second, through Procedure CheckSet, itemset
{ABD} and itemset {CD} have a common itemset {D}
(Case intersection). We create a new lattice node for itemset {D}, then insert the Tidset of itemset {ABD} and itemset {CD} into the Tidset of itemset {D} as shown in Figure 6-(d).

Fig. 2 Conditions of subsumption checking: (a) equivalent; (b) empty; (c) superset; (d) subset; (e) intersection.

Fig. 3 Procedure InsertD.

Fig. 4 Function FindT.

Fig. 5 Procedure CheckSet.
3.2 Data Deletion
In this step, we describe how to delete the oldest
transaction from the Subset-Lattice. The procedure is
shown in Figure 7. When a transaction is out of the current window, it should be deleted from the Subset-Lattice. In the Subset-Lattice, we also need to traverse the nodes which are relevant to the deleted transaction, and update their Tidset. If a node is deleted from
Fig. 6 Subset-Lattice after data insertion: (a) transaction T1
{C, D} (Case empty) and transaction T2 {A, B} (Case
empty); (b) transaction T3 {A, B} (Case equivalent); (c)
transaction T4 {A, B, C, D} (Case superset); (d) transaction T5 {A, B, D} (Case subset).
the Subset-Lattice, a subset of the deleted node will be influenced, because the subset lattice node could be created by two itemsets. For example, itemset {D} is created by itemset {ABD} and itemset {CD}. Both itemset {ABD} and itemset {CD} contain itemset {D}; therefore, the support of itemset {D} is larger than that of its supersets, itemset {ABD} and itemset {CD}. In this case, when either of these two itemsets is deleted, the subset lattice node must be handled at the same time.

Fig. 7 Procedure DeleteD.
Deciding whether a lattice node in the Subset-Lattice is a closed frequent itemset, based on the properties of closed frequent itemsets, is very straightforward. (Note that a closed frequent itemset is a frequent itemset which has no superset with the same support as it. In the Subset-Lattice, a pointer represents the relation between the superset and the subset, and the support of an itemset is the number of transaction IDs recorded in the Tidset of a node. Therefore, we can dynamically identify all closed frequent itemsets.) We use an
example, as shown in Figure 8 to describe the procedure
of deletion. In this example, itemset {CD} is the oldest
transaction in the current window. According to the
window table, we can directly find the location of itemset {CD} in Figure 8-(a). (Note that the window table is
a copy of the part of the transaction table where the sliding window covers. It records the information of the
transactions which includes Tid and itemset. The position of itemset {CD} can be directly found, because the
window table records the Tid as an index and links to the
lattice tree.) First, we remove transaction T1 from Tidset
of itemset {CD} and all of its subsets recursively as
shown in Figure 8-(b). Second, we compare the Tidset of itemset {D} with those of its supersets {ABD} and {CD}. There is a critical point in this step. If the Tidset of a subset is equal to that of its superset, then the subset and superset must occur in the same transactions; that is, the subset becomes an unclosed frequent itemset. Therefore, itemset {D} should be deleted from the Subset-Lattice. The itemset {CD} should be deleted in the
same way as shown in Figure 8-(c) and Figure 8-(d).
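This critical point, that Tidset equality implies the subset is unclosed, can be sketched as a simplified check; it is not the paper's Procedure DeleteD, and the Tids below are illustrative, not those of the figure.

```python
def prune_unclosed(subset_tidset, superset_tidsets):
    """Return True when the subset node must be deleted: its Tidset
    equals the Tidset of some superset node, meaning both itemsets
    occur in exactly the same transactions, so the subset is not a
    closed frequent itemset."""
    return any(subset_tidset == tidset for tidset in superset_tidsets)

# Suppose, after the oldest transaction leaves the window, itemset {D}
# has Tidset {4, 5} while its supersets {ABD} and {CD} have Tidsets
# {4, 5} and {4} (illustrative Tids): {D} is unclosed and is deleted.
assert prune_unclosed({4, 5}, [{4, 5}, {4}])
assert not prune_unclosed({1, 4, 5}, [{4, 5}, {4}])
```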
IV. THE SUBSET-LATTICE ALGORITHM FOR
MINING MAXIMAL FREQUENT ITEMSETS
Our algorithm has two steps. In the first step, we
use a bit-sequence to find 1-item frequent itemsets. The w transactions in the current window are transformed into the bit-sequence representation. For each item X in the current window, a bit-sequence with w bits is denoted as Bit(X). If an item X is in the ith transaction of the current window, the ith bit of Bit(X) is set to 1; otherwise, it is set to 0 [11].

Fig. 8 Subset-Lattice after deleting itemset {CD}: (a) the original lattice; (b) updating the Tidset of itemset {CD} and itemset {D}; (c) deleting itemset {D}; (d) deleting itemset {CD}.

Table 3 Transactions Database TD2.
TID  Transactions
T1   a, b, c, f
T2   a, d
T3   a, b, c, d
T4   a, b, c

Table 4 Bit-sequence.
Bit(a) = 1111
Bit(b) = 1011
Bit(c) = 1011
Bit(d) = 0110
Bit(f) = 1000

There are four transactions in Table 3. The bit-sequence transformation is shown in Table 4. (Note that the meaning of this bit-sequence is different from that used in Section 3.) After
the transformation, we delete the infrequent items from
the transactions. Therefore, the items in each transaction
are 1-item frequent itemsets. For the example shown in Table 3, the infrequent item is "f". (Note that the bit-sequences of items are left-shifted one bit to delete the oldest transaction.)
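The bit-sequence construction and the 1-item frequency filter can be sketched as follows; the function name is ours, and the window reproduces Table 3 (TD2).

```python
def bit_sequences(window):
    """Bit(X): the i-th bit (counted from the left) is 1 iff item X
    appears in the i-th transaction of the current window."""
    w = len(window)
    bits = {}
    for i, transaction in enumerate(window):
        for item in transaction:
            # set bit i counted from the most-significant side
            bits[item] = bits.get(item, 0) | (1 << (w - 1 - i))
    return bits

# The window reproduces Table 3 (TD2).
window = [{'a', 'b', 'c', 'f'}, {'a', 'd'}, {'a', 'b', 'c', 'd'}, {'a', 'b', 'c'}]
bits = bit_sequences(window)           # Bit(a) = 1111, Bit(f) = 1000, as in Table 4

# Support of an item = number of 1 bits; items below min_support are dropped.
min_support = 2
frequent_items = {x for x, b in bits.items() if bin(b).count('1') >= min_support}

# Sliding the window: a left shift (masked to w bits) drops the oldest
# transaction, as noted in the text.
mask = (1 << len(window)) - 1
shifted_a = (bits['a'] << 1) & mask    # 1111 -> 1110
```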
In the second step, we check the relation between
the new transaction and the old transaction. There are
five cases of inserting itemsets into Max_set, where
Max_set is a set which stores the current maximal frequent itemsets and related support. (Note that itemsets
recorded in Max_set will be removed, if its superset is
inserted into Max_set. Moreover, the purpose of record-
Y. I. Chang, C. E. Li, W. H. Peng and S. Y. Wang: Efficient Subset-Lattice Algorithms for Mining Closed Frequent Itemsets
and Maximal Frequent Itemsets in Data Streams
ing related support is that it will be needed in data deletion discussed later.) In Case equivalent, we update the
support of the old transaction and insert the tid into the
Tidset of the old transaction. If the condition of the support of the new transaction ≥ min_support is true, we
add the new transaction to Max_set.
In Case subset, the new transaction will be the child
of the old transaction. Then, we apply the new transaction X to Procedure CheckMaxSet as shown in Figure 9,
where Procedure DeleteSubset, called in Procedure CheckMaxSet, deletes the subsets of the newly added itemset X from Max_set.
In Case intersection, there are two situations. In
Case intersection-1, the node which is the intersection of
the new transaction and the old transaction, intersect, is
already in the lattice structure. In this case, the new
transaction will be inserted into the lattice structure and
update the support of the node, intersect, which is the
intersection of the new transaction and the old transaction. Then, we apply the intersection (denoted as intersect) of the new transaction and the old transaction of
Procedure CheckMaxSet as shown in Figure 9. In Case
intersection-2, the other situation is the node which is
the intersection of the new transaction and the old transaction, intersect, which is not in the lattice structure. In
this case, the new transaction and intersect will be inserted into the lattice structure. Then, we apply the old
transaction and intersect to Procedure CheckRelation as
shown in Figure 10. And we then update the Tidset of
the new transaction and the old transaction.
In Case superset, we will update the support of the
old transaction, and the old transaction will be the child
of the new transaction. The new transaction will be inserted into the lattice structure. Then, we apply the old
transaction to Procedure CheckMaxSet.
When the current window becomes full, we then
move the window. We remove the first transaction from
SWTable, which stores all the transactions of the current
window.
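The Max_set bookkeeping described above, inserting a frequent itemset, pruning its subsets, and skipping it when a superset is already present, might be sketched as follows; this is a simplification in our own naming, not the paper's Procedure CheckMaxSet.

```python
def add_to_max_set(max_set, itemset, support, min_support):
    """Insert itemset into max_set (a dict: itemset -> support) if it
    is frequent, removing any of its subsets already recorded there;
    skip it when a superset is already present, since the itemset is
    then not maximal."""
    itemset = frozenset(itemset)
    if support < min_support:
        return
    if any(itemset < other for other in max_set):
        return                        # a superset is already maximal
    for other in [m for m in max_set if m < itemset]:
        del max_set[other]            # subsets are no longer maximal
    max_set[itemset] = support

# Mirrors rows T2-T3 of Table 5: once {a, b, c} and {a, d} reach the
# minimum support, their subset {a} is removed from Max_set.
max_set = {frozenset('a'): 2}
add_to_max_set(max_set, {'a', 'b', 'c'}, 2, 2)
add_to_max_set(max_set, {'a', 'd'}, 2, 2)
assert frozenset('a') not in max_set
```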
Fig. 9 Procedure CheckMaxSet.
Fig. 10 Procedure CheckRelation.
Table 5 Transactions Database TD3.
TID  Transactions  Case          Max_set
T1   a, b, c       empty
T2   a, d          intersection  {a}
T3   a, b, c, d    superset      {a, b, c}, {a, d}
T4   a, b, c       subset        {a, b, c}, {a, d}
T5   a, b, c, d    equivalent    {a, b, c, d}
4.1 Data Insertion
We use an example to present our algorithm. Table
5 shows an example of the data stream. We assume that
the window size is 4. When the window is full, we use
the bit-sequences to find the 1-item frequent itemsets from the current window.
The value of min_support is 2. We process Procedure Insert_T as shown in Figure 11. When the first
transaction T1 {a, b, c} comes, Case empty occurs. So,
the itemset {a, b, c} is inserted into the root’s child as
shown in Figure 12-(a). When the second transaction T2
{a, d} comes, Case intersection occurs. We will call
Function Find_T to check whether the itemset is in the
lattice structure or not. Obviously, the transaction T2 {a,
d} is not in the lattice structure. Then, we will call Procedure CheckCase as shown in Figure 13 to decide
which case of transaction T1 {a, b, c} and transaction T2
{a, d} belong to. We find that transaction T2 {a, d} intersects with transaction T1 {a, b, c}.

Fig. 11 Procedure Insert_T.

The lattice will be built as shown in Figure 12-(b). Because the support of
node {a} is equal to min_support, node {a} is added to
Max_set.
When the third transaction T3 {a, b, c, d} comes,
Case superset occurs. The lattice will be built as shown
in Figure 12-(c). We process Procedure CheckCase, as
shown in Figure 13. After adding transaction T3, the
support of node {a, b, c} and node {a, d} are increased
to 2. Hence, we add node {a, b, c} and node {a, d} to
Max_set. When adding an itemset w to Max_set, we
compare itemset w to the itemsets in Max_set. Node {a}
is in Max_set now, and it is a subset of node {a, b, c}
Fig. 12 Subset-Lattice after data insertion: (a) transaction T1
{a, b, c} (Case empty); (b) transaction T2 {a, d} (Case
intersection); (c) transaction T3 {a, b, c, d} (Case superset); (d) transaction T4 {a, b, c} (Case subset).
and node {a, d}. Therefore, we delete node {a} from
Max_set and add node {a, b, c} and node {a, d} to
Max_set. As a result, node {a, b, c} and node {a, d} are
in Max_set now. When the fourth transaction T4 {a, b,
c} comes, Case subset occurs. The lattice will be built as
shown in Figure 12-(d). We process Function Find_T to
search node {a, b, c} and process Procedure CheckCase.
Also, we insert the Tid of transaction T4 {a, b, c} into the Tidset of node {a}. Because node {a, b, c} is
already in Max_set, we simply increase the support of
node {a, b, c} in Max_set.
Fig. 13 Procedure CheckCase.
4.2 Data Deletion
There are three data deletion steps. We describe
how to delete the oldest transaction from the lattice
structure and the current window. Step 1: We remove the oldest transaction from SWTable, which stores all the transactions of the current window. Since the oldest transaction is the first element of SWTable, we delete the first element of SWTable and then insert the newly arriving transaction. As a result, SWTable stores all
the transactions of the current window.
Step 2: Procedure Delete_lattice is shown in Figure
14. When a transaction is out of the current window, it
should be deleted from the lattice structure. In the lattice
structure, we need to traverse the nodes which are relevant to the deleted transaction. If a node is deleted from
the lattice structure, a subset of the deleted node will be
influenced. The reason is that the subset lattice node
could be created by two itemsets.
Fig. 14 Procedure Delete_lattice.
Step 3: If the itemset which is going to be deleted is already in Max_set, then its support in Max_set is decreased. After the deletion, if the support of the itemset is smaller than min_support, the itemset is deleted from Max_set. If an itemset X is deleted from Max_set, we apply X to Procedure CheckMaxSet.
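Step 3 can be sketched as follows. This is an illustrative Python sketch under the same assumed dictionary representation; `check_max_set` only stands in for the paper's Procedure CheckMaxSet and is passed as a callback here.

```python
def delete_from_max_set(max_set, itemset, support, min_support, check_max_set):
    """Decrease the support of `itemset` when an old transaction leaves the
    window; evict it from max_set once it falls below min_support."""
    itemset = frozenset(itemset)
    if itemset not in max_set:
        return
    max_set[itemset] -= support
    if max_set[itemset] < min_support:
        del max_set[itemset]
        check_max_set(itemset)            # subsets may become maximal again

evicted = []
max_set = {frozenset("abc"): 2, frozenset("ad"): 2}
delete_from_max_set(max_set, "abc", 1, min_support=2,
                    check_max_set=evicted.append)
# {a, b, c} drops to support 1 < 2 and is evicted; {a, d} is untouched
```

The callback makes the control flow explicit: eviction is the only point at which CheckMaxSet needs to run.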
When the fifth transaction T5 {a, b, c, d} comes, the current window is full. We have to remove the oldest transaction from the current window and insert the new transaction into the current window. The first transaction T1 {a, b, c} should be removed from the current window. We remove transaction T1 {a, b, c} and all of its subsets. The lattice structure is shown in Figure 15-(a).
We use Procedure Delete_Max as shown in Figure 16 to process the deletion. We delete the first transaction T1 {a, b, c}, and then decrease the support of node {a, b, c} and of its subset {a}. Because the deleted node {a, b, c} is in Max_set, we decrease its support. After the decrement, the support of node {a, b, c} is still no smaller than min_support. As a result, node {a, b, c} is kept in Max_set.
After the deletion of transaction T1, we insert transaction T5 {a, b, c, d} (Case equivalent) into the lattice structure, as shown in Figure 15-(b). After the increment, the support of transaction T5 reaches min_support, so we add transaction T5 to Max_set. When adding itemsets to Max_set, we compare transaction T5 with the itemsets in Max_set. Itemsets {a, b, c} and {a, d} are in Max_set now, and both of them are subsets of transaction T5 {a, b, c, d}. As a result, itemsets {a, b, c} and {a, d} are removed from Max_set, and transaction T5 {a, b, c, d} is added to Max_set.
Fig. 15 Subset-Lattice when the window slides: (a) deletion of transaction T1 {a, b, c}; (b) insertion of transaction T5 {a, b, c, d}.
Fig. 16 Procedure Delete_Max.
V. PERFORMANCE
In this section, we first show how to generate the synthetic data which will be used in the simulation. Second, we study the performance of the algorithms for mining closed frequent itemsets. Third, we study the performance of the algorithms for mining maximal frequent itemsets.
5.1 The Simulation Model
We describe a way to generate synthetic transaction data with the IBM synthetic data generator developed by Agrawal et al. [1]. These synthetic transaction data sets are used to evaluate the performance of the algorithms. The parameters used in the generation of the synthetic data are shown in Table 6.
First, the length of a transaction is determined by the Poisson distribution with a mean μ equal to |T|. The size of a transaction is between 1 and |MT|. (Note that the worst-case complexity of the number of traversals of the subset-lattice in data insertion/deletion by both of our algorithms is O(2^|MT|), since the total number of an itemset of size |MT| and all of its subsets is bounded by 2^|MT|.) The transaction is repeatedly assigned items from a set of potentially maximal frequent itemsets F, as long as the length of the transaction does not exceed the generated length. Then, the length of an itemset in F is determined according to the Poisson distribution with a mean μ equal to |I|. The size of each potentially frequent itemset is between 1 and |MI|.
To demonstrate the phenomenon that frequent itemsets often have common items, some fraction of the items in each subsequent itemset is chosen from the previously generated itemset. We use an exponentially distributed random variable, with a mean equal to the correlation level, to decide this fraction for each itemset. The correlation level was set to 0.5. The remaining items are chosen randomly. Each itemset in F has an associated weight that determines the probability that this itemset will be chosen. The weight is chosen from an exponential distribution with a mean equal to 1, and the weights are normalized such that they sum to 1.
Data from this generator mimics transactions from retail stores. Here are some of the parameters that we have controlled in the comparison: the size of the sliding window |SW|, the minimum support |S|, and the maximum size of transactions |MT|.
Table 6 Parameters used in the experiment.
Parameter  Meaning
T          The average size of transactions
MT         The maximum size of transactions
I          The average size of maximal potentially frequent itemsets
D          The number of transactions
MI         The maximum size of potentially frequent itemsets
corr       The correlation level
L          The number of the maximal potentially frequent itemsets
N          The number of items
SW         The size of the sliding window
S          The minimum support
5.2 Experimental Results for Mining Closed Frequent Itemsets
We show the comparison between our Subset-Lattice algorithm and the NewMoment algorithm
[11], when using the synthetic datasets as the input.
Figure 17 shows the comparison of the required processing time of loading the first sliding window with different minimum supports. The synthetic data is T5.MT10.I10.MI20.D100k. The minimum support threshold is changed from 1% to 0.1%, and the sliding window size is 100K. When the minimum support decreases, the number of closed frequent itemsets and the processing time increase. If the support is small enough, the number of candidates of the NewMoment algorithm increases rapidly. Therefore, the NewMoment algorithm takes a longer processing time than the Subset-Lattice algorithm.
Figure 18 shows the comparison of the average
processing time of the window sliding in each transaction with different minimum supports. The synthetic data
is T5.MT10.I10.MI20.D100k. The minimum support
threshold is changed from 1% to 0.1%, and the sliding
window size is 10K. Figure 19 shows the comparison of
the average processing time of the window sliding in
each transaction with different sliding window sizes. The
synthetic data is T5.MT10.I10.MI20.D100k. The sliding window size is changed from 1K to 10K, and the minimum support threshold is 1%. The number of items in these two experiments is 1000. The window-sliding times in these two experiments are almost the same. In general, the Subset-Lattice algorithm is faster than the NewMoment algorithm; the processing time of window sliding in the Subset-Lattice algorithm is reduced by about 50%. The reason is that the NewMoment algorithm regenerates the nodes of level k (k > 1) in the CET-Tree twice when the window slides, while the Subset-Lattice algorithm only needs to update the nodes that are already in the lattice structure. (Note that this result also shows the scalability of our algorithm as the window size increases: our performance is stable as the window size grows. Moreover, the difference between loading the first window and window sliding is that data deletion must be considered in the case of window sliding.)
In Figure 20, we compare the number of nodes that
are generated by loading the first sliding window. The
synthetic data is T2.MT10.I10.MI20.D100k with a minimum support of 1%, and the number of items |N| is changed from 10 to 15. When the number of items |N| increases, the number of closed frequent itemsets increases. The number of nodes which need to be constructed in the Subset-Lattice algorithm is much smaller than that of the NewMoment algorithm. (Note that although the number of nodes in our subset-lattice increases noticeably with the number of items before window sliding, the percentage change in the number of nodes is small once the window slides regularly.)
Fig. 17 A comparison of the processing time of loading the first window with different minimum supports.
Fig. 18 A comparison of the average processing time of the window sliding with different minimum supports.
Fig. 19 A comparison of the average processing time of the window sliding with different sliding window sizes.
5.3 Experimental Results for Mining Maximal Frequent Itemsets
Figure 21 shows the comparison of different numbers of items. The minimum support is 1% and the window size is 1000. The synthetic data is T5.MT10.I2.MI4.D30k. The result shows that the Subset-Lattice algorithm is faster than the MFIoSSW algorithm. When the number of items increases, the processing time of the MFIoSSW algorithm decreases, because the number of comparisons decreases. (Note that the processing time of our algorithm is stable as the number of items increases during window sliding. The reason is that data insertion or deletion has only a small local effect on our subset-lattice when the window slides regularly.) Figure 22 shows the comparison of different data sizes. The synthetic data is T5.MT10.I2.MI4.D(25-40)k. We can see that the Subset-Lattice algorithm is faster than the MFIoSSW algorithm. The reason is that when the number of itemsets increases, the MFIoSSW algorithm takes more time to compare with each itemset than the Subset-Lattice algorithm.
Figure 23 shows the comparison of different minimum supports. The window size is 1000 and the synthetic data is T5.MT10.I2.MI4.D30k. This result shows that the Subset-Lattice algorithm is faster than the MFIoSSW algorithm. The processing time of the MFIoSSW algorithm increases with the minimum support, because with a larger minimum support, the number of itemsets created by the MFIoSSW algorithm increases. As a result, the number of comparisons and the processing time increase. (Note that when the minimum support is greater than 1.5%, the processing time of the MFIoSSW algorithm increases sharply.)
Figure 24 shows the comparison of different window sizes. The minimum support is 1%. The synthetic data is T5.MT10.I2.MI4.D10k. The processing time of the MFIoSSW algorithm increases with the window size, because the absolute support count is relative to the window size (i.e., min_support = minimum support * window size). The larger the window size is, the bigger min_support is, and the size of array A will increase. As a result, the processing time of the MFIoSSW algorithm increases when the window size increases. (Note that this result also shows the scalability of our algorithm: our performance is stable as the window size increases.)
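The threshold scaling noted above can be made concrete with a small sketch. The helper name is hypothetical; the 1% ratio and the window sizes match the experiments.

```python
def absolute_min_support(min_support_ratio, window_size):
    """Absolute support count = minimum support ratio * window size."""
    return min_support_ratio * window_size

# With a fixed 1% threshold, enlarging the window raises the required
# count, and with it the amount of state MFIoSSW must track and compare.
counts = [absolute_min_support(0.01, w) for w in (1000, 5000, 10000)]
# counts == [10.0, 50.0, 100.0]
```

This is why the minimum support and the window size cannot be varied independently in the sliding-window model: the effective count threshold grows linearly with the window.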
Fig. 20 A comparison of the number of nodes of loading the first window with different numbers of items.
Fig. 22 A comparison of the processing time under different
data size (T5.MT10.I2.MI4.D(25-40)k).
Fig. 21 A comparison of the processing time under different
number of items (T5.MT10.I2.MI4.D30k).
Fig. 23 A comparison of the processing time under different
minimum supports (T5.MT10.I2.MI4.D30k).
Fig. 24 A comparison of the processing time under different window size (T5.MT10.I2.MI4.D10k).
VI. CONCLUSION
In this paper, we first designed an efficient algorithm for the problem of mining closed frequent itemsets in data streams. Using the subsumption operation, the Subset-Lattice algorithm can directly judge the closed frequent itemsets in the data streams. When the window slides, the Subset-Lattice algorithm does not reconstruct the structure, but the NewMoment algorithm has to reconstruct the whole tree structure. Second, we designed an efficient algorithm for mining maximal frequent itemsets in data streams. When the window slides, the Subset-Lattice algorithm does not reconstruct the structure, but the MFIoSSW algorithm has to take a long time to do the comparisons. The simulation results have shown that the proposed Subset-Lattice algorithm for mining closed frequent itemsets outperforms the NewMoment algorithm in all data settings. Moreover, the results have shown that the proposed Subset-Lattice algorithm for mining maximal frequent itemsets has better performance than the MFIoSSW algorithm.
REFERENCES
[1] R. Agrawal and R. Srikant, “Fast algorithms for
mining association rules in large databases,” Proc.
of the 20th Int. Conf. on Very Large Data Bases, pp.
490–501, 1994.
[2] T. Calders, N. Dexters, and B. Goethals, “Mining
frequent itemsets in a stream,” Proc. of IEEE Int.
Conf. on Data Mining, pp. 83–92, 2007.
[3] J. H. Chang and W. S. Lee, “A sliding window
method for finding recently frequent itemsets over
online data stream,” Journal of Information Science
and Engineering, Vol. 4, No. 11, pp. 753–762, July
2004.
[4] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz, "Moment: maintaining closed frequent itemsets over a stream sliding window," Proc. of the 4th IEEE Int. Conf. on Data Mining, pp. 59–66, 2004.
[5] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz, "Catch the moment: maintaining closed frequent itemsets over a data stream sliding window," Knowledge and Information Systems, Vol. 10, No. 3, pp. 265–294, 2006.
[6] J. Dong and M. Han, "BitTableFI: an efficient mining frequent itemsets algorithm," Knowledge-Based Systems, Vol. 20, No. 4, pp. 329–335, 2007.
[7] J. Han, D. Xin, and X. Yan, "Frequent pattern mining: current status and future directions," Proc. of the 10th Int. Conf. on Data Mining and Knowledge Discovery, pp. 55–86, 2007.
[8] M. S. Chen, J. Han, and P. S. Yu, "Data mining: an overview from a database perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 866–883, 1996.
[9] J. L. Koh and S. N. Shin, "An approximate approach for mining recently frequent itemsets from data streams," Data Warehousing and Knowledge Discovery, LNCS Vol. 4081, pp. 352–362, Sept. 2006.
[10] H. F. Li, C. C. Ho, and S. Y. Lee, "Incremental updates of closed frequent itemsets over continuous data streams," Expert Systems with Applications, Vol. 36, No. 2, pp. 2451–2458, 2009.
[11] H. F. Li, S. Y. Lee, and M. K. Shan, "DSM-PLW: single-pass mining of path traversal patterns over streaming web click-sequences," Proc. of Computer Networks on Web Dynamics, pp. 1474–1487, 2006.
[12] H. Li and N. Zhang, "Mining maximal frequent itemsets over a stream sliding window," Proc. of Information Computing and Telecommunications, pp. 110–113, 2010.
[13] H. F. Li, "Interactive mining of top-k frequent closed itemsets from data streams," Expert Systems with Applications, Vol. 36, No. 7, pp. 10779–10788, 2009.
[14] H. F. Li and S. Y. Lee, "Mining frequent itemsets over data streams using efficient window sliding techniques," Expert Systems with Applications, Vol. 36, No. 2, pp. 1466–1477, 2009.
[15] J. Li and S. Gong, "Top-k-FCI: mining top-k frequent closed itemsets in data streams," Journal of Computational Information Systems, Vol. 7, No. 13, pp. 4819–4826, 2011.
[16] C. H. Lin, D. Y. Chiu, and Y. H. Wu, "Mining frequent itemsets from data streams with a time-sensitive sliding window," Proc. of the Int. Conf. on SDM, pp. 68–79, 2005.
[17] K. C. Lin, I. E. Liao, and Z. S. Chen, "An improved frequent pattern growth method for mining association rules," Expert Systems with Applications, Vol. 38, No. 5, pp. 5154–5161, 2011.
[18] G. Mao, X. Wu, X. Zhu, G. Chen, and C. Liu, "Mining maximal frequent itemsets from data streams," Journal of Information Science, Vol. 33, No. 3, pp. 251–262, 2007.
[19] B. Mozafari, H. Thakkar, and C. Zaniolo, "Verifying and mining frequent patterns from large windows over data streams," Proc. of the Int. Conf. on ICDE, pp. 179–188, 2008.
[20] C. Raissi, P. Poncelet, and M. Teisseire, "Towards a new approach for mining frequent itemsets on data stream," Journal of Intelligent Information Systems, Vol. 28, No. 1, pp. 23–36, Dec. 2007.
[21] S. Tanbeer, C. Ahmed, B. Jeong, and Y. Lee, "Efficient single-pass frequent pattern mining using a prefix-tree," Information Sciences, Vol. 179, No. 5, pp. 559–583, 2009.
[22] S. K. Tanbeer, C. F. Ahmed, B.-S. Jeong, and Y.-K. Lee, "Sliding window-based frequent pattern mining over data streams," Information Sciences, Vol. 179, No. 22, pp. 3843–3865, 2009.
[23] P. S. Tsai, "Mining top-k frequent closed itemsets over data streams using the sliding window model," Expert Systems with Applications, Vol. 37, No. 10, pp. 6968–6973, 2010.
[24] W. Zhang, H. Liao, and N. Zhao, "Research on the FP growth algorithm about association rule mining," Proc. of the 10th Int. Conf. on Business and Information Management, pp. 315–318, 2008.
Ye-In Chang was born in Taipei,
Taiwan, R.O.C. in 1964. She received her B.S. degree in Computer Science and Information
Engineering from National Taiwan
University, Taipei, Taiwan, in
1986. She received her M.S. and
Ph.D. degrees in Computer and Information
Science from Ohio State University, Columbus,
Ohio, in 1987 and 1991, respectively. From August 1991 to July 1999, she joined the
faculty of the department of applied mathematics
at National Sun Yat-Sen University, Kaohsiung,
Taiwan. From August 1997, she has been a Professor in the department of applied mathematics
at National Sun Yat-Sen University, Kaohsiung,
Taiwan. Since August 1999, she has been a Professor in the department of computer science
and engineering at National Sun Yat-Sen University, Kaohsiung, Taiwan. Her research interests
include database systems, distributed systems,
multimedia information systems, mobile information systems and data mining.
Chia-En Li received his B.S. degree in Information Management
from I-Shou University in 2007
and his M.S. degree in computer
science from National Pingtung
University of Education in 2009.
He is currently a Ph.D. student in
Department of Computer Science and Engineering at National Sun Yat-Sen University. His research interests include database systems and
data mining.
Wei-Hau Peng received his B.S.
degree in computer science from
Feng Chia University in 2007, and
his M.S. degree in computer science and engineering from National Sun Yat-Sen University in
2009. He is currently a system design engineer in Taiwan.
Syuan-Yun Wang received her
B.S. degree in computer science
from National Taichung University
of Education in 2010, and her M.S.
degree in computer science and
engineering from National Sun
Yat-Sen University in 2012. She is
currently a system design engineer
in Taiwan.