Mining Frequent Item Sets by Opportunistic Projection
From: ACM SIGKDD 2002 (Special Interest Group on Knowledge Discovery and Data Mining)
Junqiang Liu, Yunhe Pan, Ke Wang, Jiawei Han
Presenter: 阮士峰 (second-year professional master's student, student ID 69121507)

Outline
- How to discover frequent item sets
- Previous works
- Our approach: Mining Frequent Item Sets by Opportunistic Projection
- Opportunistic Projection: observations and heuristics
- Performance evaluations
- Conclusions

What Are Frequent Item Sets
- A frequent item set is a set of items X that occurs together frequently in a database, i.e., support(X) >= a given threshold.
- Example database:

  tid  items
  01   a c d f g i m p
  02   a b c f l m o
  03   b f h j o
  04   b c k p s
  05   a c e f l m n p

- Given support threshold 3, the frequent item sets are:
  a:3, b:3, c:4, f:4, m:3, p:3, ac:3, af:3, am:3, cf:3, cm:3, cp:3, fm:3, acf:3, acm:3, afm:3, cfm:3, acfm:3

How To Discover Frequent Item Sets
- Frequent item sets can be represented by a tree, which is not necessarily materialized.
- [Figure: the frequent item set tree for the example, rooted at a null node ( , ), with first-level nodes (a,3), (b,3), (c,4), (f,4), (m,3), (p,3) and deeper nodes such as (c,3), (f,3), (m,3), (p,3) for the longer item sets.]
- The mining process is a process of tree construction, accompanied by a process of projecting transaction subsets.

Frequent Item Set Tree - FIST
- A FIST is an ordered tree; each node is a pair (item, weight).
- An item order is imposed: items are ordered along each path (top down) and among the children of a node (left to right).
- A frequent item set corresponds to a path starting from the FIST root; its support is the weight of the path's ending node.
- PTS - projected transaction subset: each FIST node has its own PTS, filtered or unfiltered, consisting of all transactions that support the frequent item set represented by the node.

Frequent Item Set Tree (example)
- [Figure: the FIST for the example database; (i,w) denotes a FIST node, and the box attached to a node is its PTS. For instance, node (a,3) has PTS {01: c f m p, 02: b c f m, 05: c f m p}, and node (c,4) under the root has PTS over transactions 01, 02, 04, 05.]

Factors related to Mining Efficiency and Scalability
- The FIST construction strategy: breadth first vs. depth first.
- The PTS representation: memory-based (array-based, tree-based, vertical bitmap, horizontal bitstring, etc.) or disk-based.
- The PTS projecting method and the item counting method.

Our Approach: Mining Frequent Item Sets by Opportunistic Projection
- Philosophy: the algorithm must adapt the construction strategy of the FIST, the representation of PTSs, and the methods of item counting in and projection of PTSs to the features of the PTSs.
- Main points:
  - Mining sparse data by projecting array-based PTSs
  - Intelligently projecting tree-based PTSs for dense data
  - Heuristics for opportunistic projection

Mining sparse data by projecting array-based PTS
- TVLA - threaded varied-length array for a sparse PTS, consisting of a FIL (local frequent item list), LQs (linked queues), and arrays.
- Each local frequent item has a FIL entry that consists of an item, a count, and a pointer.
- Each transaction is stored in an array that is threaded to the FIL by an LQ according to the transaction's heading item in the imposed order.
- [Figure: the filtered TVLA of the original database in the example. FIL entries a:3, b:3, c:4, f:4, m:3, p:3; arrays 01: a c f m p, 02: a b c f m, 05: a c f m p (heading item a) and 03: b f, 04: b c p (heading item b), threaded by LQs.]

How to project TVLA for PTS
- Arrays (transactions) that support a node's first child are threaded by the LQ attached to the first entry of the FIL (see the previous figure).
- The TVLA for a child node's PTS has its own FIL and LQ.
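Since the TVLA is described above only in words, the following is a minimal Python sketch of that layout, assuming plain-list transactions and deque-based linked queues; the names TVLA, fil, lq, and order are mine, not identifiers from the paper. The projection of child TVLAs continues below.

from collections import defaultdict, deque

class TVLA:
    """Minimal sketch of a threaded varied-length array (TVLA) for a sparse PTS.

    Assumptions of this sketch (not from the paper): transactions are plain
    Python lists, the linked queues (LQ) are modeled as deques, and `order`
    is the imposed item order used both for the FIL and to sort transactions.
    """

    def __init__(self, transactions, min_support, order):
        # FIL: one (item, count) entry per local frequent item, in the imposed order.
        counts = defaultdict(int)
        for t in transactions:
            for item in set(t):
                counts[item] += 1
        self.fil = [(i, counts[i]) for i in order if counts[i] >= min_support]
        frequent = {i for i, _ in self.fil}

        # Each transaction becomes an array holding only local frequent items,
        # sorted in the imposed order, and is threaded into the linked queue
        # of its heading item.
        self.lq = {i: deque() for i in frequent}
        for tid, t in enumerate(transactions, start=1):
            arr = [i for i in order if i in t and i in frequent]
            if arr:
                self.lq[arr[0]].append((tid, arr))

# Usage on the example database (threshold 3):
db = [list("acdfgimp"), list("abcflmo"), list("bfhjo"),
      list("bckps"), ["a", "c", "e", "f", "l", "m", "n", "p"]]
tvla = TVLA(db, min_support=3, order=list("abcfmp"))
# tvla.fil == [('a', 3), ('b', 3), ('c', 4), ('f', 4), ('m', 3), ('p', 3)]
# tvla.lq['a'] threads transactions 01, 02, 05; tvla.lq['b'] threads 03, 04.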
- A child TVLA is unfiltered if it shares arrays with its parent, and filtered otherwise.
- [Figure: projecting the parent TVLA (FIL a:3, b:3, c:4, f:4, m:3, p:3) at node a. The unfiltered child TVLA shares the parent's arrays (01: a c f m p, 02: a b c f m, 05: a c f m p); the filtered child TVLA has its own FIL(a) = c:3, f:3, m:3 and its own reduced copies of transactions 01, 02, 05 (each c f m).]

How to project TVLA for PTS (cont.)
- Get the next child's PTS by shifting the transactions threaded in the LQ currently being explored (the current child's PTS) to the LQs of their next heading items.
- [Figure: a sequence of TVLA snapshots on the example, showing the transactions threaded at the current LQ being shifted to later LQs as successive children's PTSs are explored.]

Intelligently projecting tree-based PTSs for dense data
- Tree-based representation of a dense PTS, inspired by FP-Growth.
- Novel projecting methods, entirely different from FP-Growth:
  - Bottom-up pseudo projection
  - Top-down pseudo projection

Tree-based Representation of dense PTS
- TTF - threaded transaction forest.
- IL - item list: each entry consists of an item, a count, and a pointer. Each local item in the PTS has an entry in the IL.
- Forest: each node is labeled by an item and associated with a weight. Each transaction in the PTS is a path starting from a root in the forest; the weight (count) is the number of transactions represented by the path. All nodes carrying the same item are threaded by that item's IL entry.
- A TTF is filtered if only local frequent items appear in it, and unfiltered otherwise.
- [Figure: the filtered TTF of the original database in the example. IL: a:3, b:3, c:4, f:4, m:3, p:3. Root (a,3) with branches c,2 - f,2 - m,2 - p,2 and b,1 - c,1 - f,1 - m,1; root (b,2) with branches f,1 and c,1 - p,1.]

Bottom up pseudo projection of TTF (example)
- [Figure: a sequence of TTF snapshots showing bottom-up pseudo projection on the example, with IL counts and node weights adjusted as the projection proceeds.]

Top down pseudo projection of TTF (example)
- [Figure: a sequence of TTF snapshots showing top-down pseudo projection on the example.]
- [Figure: building the FIST by top-down pseudo projection of the TTF, starting from the null root ( , ) with children (a,3), (b,3), (c,4), (f,4), (m,3), (p,3).]

Opportunistic Projection: Observations and Heuristics
- Observation 1: the upper portion of a FIST can fit in memory; the number of transactions that support length-k item sets decreases sharply once k is greater than 2.
- Heuristic 1: grow the upper portion of a FIST breadth first; grow the lower portion under level k depth first, whenever the reduced transaction set can be represented by a memory-based structure, either a TVLA or a TTF.
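The next observations reason about how well a TTF compresses, so it may help to see the structure concretely. Below is a minimal Python sketch of the threaded transaction forest described earlier, assuming transactions arrive already filtered to local frequent items and sorted in the imposed order; the names TTF, TTFNode, il, and next_same_item are mine, not the paper's.

class TTFNode:
    """A forest node: an item, a weight (number of transactions whose path
    passes through the node), children keyed by item, and a thread link to
    the next node carrying the same item."""

    def __init__(self, item, parent=None):
        self.item, self.weight, self.parent = item, 0, parent
        self.children = {}
        self.next_same_item = None  # maintained via the IL entry's thread


class TTF:
    """Minimal sketch of a threaded transaction forest (TTF) for a dense PTS.

    Assumption of this sketch (not from the paper): the item list (IL) maps
    each item to [count, head of its node thread]."""

    def __init__(self):
        self.roots = {}  # forest roots keyed by their leading item
        self.il = {}     # item -> [count, first node in the thread]

    def insert(self, transaction):
        level, parent = self.roots, None
        for item in transaction:
            node = level.get(item)
            if node is None:
                node = TTFNode(item, parent)
                level[item] = node
                entry = self.il.setdefault(item, [0, None])
                node.next_same_item, entry[1] = entry[1], node  # thread the node
            node.weight += 1
            self.il[item][0] += 1
            parent, level = node, node.children


# Usage on the filtered example transactions:
ttf = TTF()
for t in [list("acfmp"), list("abcfm"), list("bf"), list("bcp"), list("acfmp")]:
    ttf.insert(t)
# ttf.roots['a'].weight == 3; transactions 01 and 05 share one path a-c-f-m-p,
# and ttf.il['c'][0] == 4, matching the IL of the filtered TTF in the figure.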
Opportunistic Projection: Observations and Heuristics (2)
- Observation 2: a TTF compresses well at lower levels or on denser branches, where there are fewer local frequent items in the PTSs and the relative support is larger. A TTF is space-expensive relative to a TVLA if its compression ratio is less than 6 - t/n (t: number of transactions, n: number of items in the PTS).
- Heuristic 2: represent PTSs by TVLAs at high levels of the FIST, unless the estimated compression ratio of the TTF is sufficiently high.

Opportunistic Projection: Observations and Heuristics (3)
- Observation 3: PTSs shrink very quickly at high levels or on sparse branches of the FIST, where filtered PTSs are usually in the form of TVLAs; PTSs at lower levels or on dense branches, where PTSs are represented by TTFs, shrink slowly. Creating a filtered TTF involves expensive pattern matching.
- Heuristic 3: when projecting a parent TVLA, make a filtered copy for the child TVLA as long as there is free memory. When projecting a parent TTF, delimit the pseudo child TTF first, and then make a filtered copy only if it shrinks substantially.

Algorithm OpportuneProject

  OpportuneProject(Database: D)
  begin
    create a null root for frequent item set tree T;
    L = 1;
    D' = BreadthFirst(T, L, D);
    v = the null root of T;
    GuidedDepthFirst(v, D');
  end

Performance Evaluation
- Efficiency on BMS-POS (sparse)
- Efficiency on BMS-WebView1 (sparse)
- Efficiency on BMS-WebView2 (sparse)
- Efficiency on Connect4 (dense)
- Efficiency on T25I20D100kN20kL5k
- Scalability on T25I20D1mN20kL5k
- Scalability on T25I20D10mN20kL5k
- Scalability on T25I20D100k~15mN20kL5k
- [Charts from the evaluation slides are not reproduced here.]

Conclusions
- OpportuneProject maximizes efficiency and scalability for all data features by combining:
  - depth-first with breadth-first search strategies,
  - array-based and tree-based representations for projected transaction subsets,
  - unfiltered and filtered projections.
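To close, here is a small runnable Python sketch that ties the projection idea together end to end. It is deliberately simplified (depth first only, plain list PTSs, alphabetical item order) and is not the OpportuneProject algorithm itself; it only illustrates what "projecting transaction subsets" means. On the five-transaction example with threshold 3 it reproduces the eighteen frequent item sets listed near the start of the deck.

from collections import defaultdict

def mine_by_projection(transactions, min_support, prefix=()):
    """Simplified stand-in for projection-based mining: for each local frequent
    item, report prefix+item and recurse on that item's projected transaction
    subset (PTS)."""
    counts = defaultdict(int)
    for t in transactions:
        for item in set(t):
            counts[item] += 1
    results = {}
    for item in sorted(i for i, c in counts.items() if c >= min_support):
        itemset = prefix + (item,)
        results[itemset] = counts[item]
        # PTS of `item`: transactions containing it, keeping only items that
        # come after `item` in the order, so each item set is enumerated once.
        pts = [[i for i in t if i > item] for t in transactions if item in t]
        results.update(mine_by_projection(pts, min_support, itemset))
    return results

db = [list("acdfgimp"), list("abcflmo"), list("bfhjo"),
      list("bckps"), ["a", "c", "e", "f", "l", "m", "n", "p"]]
found = mine_by_projection(db, min_support=3)
# len(found) == 18 and found[('a', 'c', 'f', 'm')] == 3, reproducing the item
# sets listed on the "What Are Frequent Item Sets" slide.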