CARPENTER Biological Datasets Find Closed Patterns in Long Biological Datasets z Zhiyu Wang Gene expression – Consists of large number of genes Knowledge Discovery and Data Mining Dr. Osmar Zaiane Department of Computing Science University of Alberta 1 2 Biological Datasets z Overview…… Lung Cancer dataset (gene expression) – 181 samples – Each sample is described by 12533 genes z z z z – How can we find frequent patterns in such dataset? CARPENTER – – z z 3 Motivation Problem statement Preliminaries CARPENTER algorithm 4 Transpose table Row enumeration tree Prune methods Performance Comments and Conclusion Motivation z Motivation Challenge to find the closed patterns from biological datasets that contains large number of columns with small number of rows – z Running time of most existing algorithms increases exponentially with increasing average row length2 i – For example, 10,000 – 100,000 columns with 100 – 1,000 rows – For example, in a dataset potential 2i frequent itemsets, where i is the maximum row size. What if i=12533? 212533 5 6.44 ×103772 (Hugh Search Space) 6 Problem Statement z Preliminaries Discover all the frequent closed patterns with respect to user specified support threshold in such biological datasets efficiently. z Features f i – z 8 Items in the dataset Feature support set R(F ′) – 7 = i 1 2 3 4 Maximal set of rows contain a set of features F ′ r_i a, b, c b, c, d b, c, d d Features: {a, b, c, d} Feature support set F’={b,c}, then R(F ′) ={1,2,3} Preliminaries z Row support set F ( R′) – z i 1 2 3 4 Proposed by A. K. H. Tung et.al, in ACM SIGKDD 2003. z Main idea is to find frequent closed pattern in depth-first row-wise enumeration. z Assumption: Assume dataset satisfies the condition: Maximal set of features common to a set of rows There is no superset with the same support value R′ r_i a, b, c b, c, d b, c, d d Row support set R’={1,2}, then F (R′) ={b,c} R << F Frequent Closed patterns: {b,c}, {d}, {b,c,d}…….. 10 CARPENTER z z Frequent closed pattern – 9 CARPENTER algorithm Transpose table There are two phases: 1. Transpose the dataset transpose 2. Row enumeration tree – original table Recursively search in conditional transposed table Projection {2, 3} 23-Conditional transposed table 11 12 transposed table 4 3 7 Row enumeration tree 2 9 10 z z Bottom-up row enumeration tree is based on conditional table. Each node is a conditional table. – 8 Not a real tree structure 14 CARPENTER z 15 6 1 23-conditional table represents node 23. 13 5 Example Recursively generation of conditional transposed table, performing a depth-first traversal of row-enumeration tree in order to find the frequent closed patterns. z 16 Without pruning strategies, minsup=3 Example z Prune methods Frequent closed patterns Minsup=3 a 1,2,3,4 l 1,2,5 aeh z It is obvious that complete traversal of row enumerations tree is not efficient. z CARPENTER proposes 3 prune methods. 2,3,4 17 18 Prune method 1 z 19 Prune out the branch which can never generate closed pattern over minsup threshold If minsup=4, then these branches will prune out 20 Prune method 2 z Prune method 3 If rows appear in all tuples of the conditional transposed table, then such branch needs to prune and reconstruct 21 z 22 Performance CARPENTER is comparing with CHARM and CLOSET – z z 181 samples with 12533 features Two parameters: minsup and length ratio – Length ratio =60%, varying minsup 100000 Both CHARM and CLOSET use column enumeration approach Use lung cancer dataset – z Performance R u n tim e (sec.) z 23 In each node, if corresponding support features is found, prune out the branch. 10000 1000 100 10 1 0.1 4 5 6 7 8 minsup Length ratio is the percentage of column from original dataset carpenter (sec) charm (sec) closet (sec) 24 9 10 Performance z Comments Minsup=4% varying length ratio z Bottom-up approach of CARPENTER is not efficient. Runtime (sec.) 100000 10000 1000 minsup=3 100 10 1 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Length Ratio carpenter (sec) charm (sec) closet (sec) 25 26 Comments z Conclusion TD-Close uses top-down approach. z z minsup=3 z 27 28 CARPENTER is used to find the frequent closed pattern in biological dataset. CARPENTER uses row enumeration instead of column enumeration to overcome the high dimensionality of biological datasets. Not very efficient somehow References z 29 Thank you! Questions? A. K. H. Tung J. Yang F. Pan, G. Cong and M. J. Zaki. CARPENTER: Finding closed patterns in long biological datasets. In In Proc. 2003 ACM SIGKDD Int. Conf. On Knowledge Discovery and Data Mining, 2003. 30
© Copyright 2025 Paperzz