CARPENTER Biological Datasets Biological Datasets Overview……

CARPENTER
Biological Datasets
Find Closed Patterns in Long Biological Datasets
z
Zhiyu Wang
Gene expression
–
Consists of large number of genes
Knowledge Discovery and Data Mining
Dr. Osmar Zaiane
Department of Computing Science
University of Alberta
1
2
Biological Datasets
z
Overview……
Lung Cancer dataset (gene expression)
– 181 samples
– Each sample is described by 12533 genes
z
z
z
z
–
How can we find frequent patterns in such
dataset?
CARPENTER
–
–
z
z
3
Motivation
Problem statement
Preliminaries
CARPENTER algorithm
4
Transpose table
Row enumeration tree
Prune methods
Performance
Comments and Conclusion
Motivation
z
Motivation
Challenge to find the closed patterns from
biological datasets that contains large
number of columns with small number of
rows
–
z
Running time of most existing algorithms
increases exponentially with increasing
average row length2
i
–
For example,
10,000 – 100,000 columns with 100 – 1,000 rows
–
For example, in a dataset
potential 2i frequent itemsets, where i is the
maximum row size.
What if i=12533?
212533
5
6.44 ×103772
(Hugh Search Space)
6
Problem Statement
z
Preliminaries
Discover all the frequent closed patterns with
respect to user specified support threshold in
such biological datasets efficiently.
z
Features f i
–
z
8
Items in the dataset
Feature support set R(F ′)
–
7
=
i
1
2
3
4
Maximal set of rows contain a set of features F ′
r_i
a, b, c
b, c, d
b, c, d
d
Features: {a, b, c, d}
Feature support set
F’={b,c}, then R(F ′) ={1,2,3}
Preliminaries
z
Row support set F ( R′)
–
z
i
1
2
3
4
Proposed by A. K. H. Tung et.al, in ACM
SIGKDD 2003.
z
Main idea is to find frequent closed pattern
in depth-first row-wise enumeration.
z
Assumption: Assume dataset satisfies the
condition:
Maximal set of features common to a set of rows
There is no superset with the same support value R′
r_i
a, b, c
b, c, d
b, c, d
d
Row support set
R’={1,2}, then F (R′) ={b,c}
R << F
Frequent Closed patterns:
{b,c}, {d}, {b,c,d}……..
10
CARPENTER
z
z
Frequent closed pattern
–
9
CARPENTER algorithm
Transpose table
There are two phases:
1.
Transpose the dataset
transpose
2.
Row enumeration tree
–
original table
Recursively search in conditional
transposed table
Projection {2, 3}
23-Conditional transposed table
11
12
transposed table
4
3
7
Row enumeration tree
2
9
10
z
z
Bottom-up row
enumeration tree is
based on conditional
table.
Each node is a
conditional table.
–
8
Not a real tree
structure
14
CARPENTER
z
15
6
1
23-conditional table
represents node 23.
13
5
Example
Recursively generation of conditional
transposed table, performing a depth-first
traversal of row-enumeration tree in order to
find the frequent closed patterns.
z
16
Without pruning strategies, minsup=3
Example
z
Prune methods
Frequent closed
patterns
Minsup=3
a
1,2,3,4
l
1,2,5
aeh
z
It is obvious that complete traversal of row
enumerations tree is not efficient.
z
CARPENTER proposes 3 prune methods.
2,3,4
17
18
Prune method 1
z
19
Prune out the branch which can never
generate closed pattern over minsup threshold
If minsup=4, then these
branches will prune out
20
Prune method 2
z
Prune method 3
If rows appear in all tuples of the conditional
transposed table, then such branch needs to prune
and reconstruct
21
z
22
Performance
CARPENTER is comparing with CHARM and
CLOSET
–
z
z
181 samples with 12533 features
Two parameters: minsup and length ratio
–
Length ratio =60%, varying minsup
100000
Both CHARM and CLOSET use column
enumeration approach
Use lung cancer dataset
–
z
Performance
R u n tim e (sec.)
z
23
In each node, if corresponding support features
is found, prune out the branch.
10000
1000
100
10
1
0.1 4
5
6
7
8
minsup
Length ratio is the percentage of column from
original dataset
carpenter (sec)
charm (sec)
closet (sec)
24
9
10
Performance
z
Comments
Minsup=4% varying length ratio
z
Bottom-up approach of CARPENTER is not efficient.
Runtime (sec.)
100000
10000
1000
minsup=3
100
10
1
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Length Ratio
carpenter (sec)
charm (sec)
closet (sec)
25
26
Comments
z
Conclusion
TD-Close uses top-down approach.
z
z
minsup=3
z
27
28
CARPENTER is used to find the frequent
closed pattern in biological dataset.
CARPENTER uses row enumeration instead of
column enumeration to overcome the high
dimensionality of biological datasets.
Not very efficient somehow
References
z
29
Thank you!
Questions?
A. K. H. Tung J. Yang F. Pan, G. Cong and M. J.
Zaki. CARPENTER: Finding closed patterns in
long biological datasets. In In Proc. 2003 ACM
SIGKDD Int. Conf. On Knowledge Discovery and
Data Mining, 2003.
30