MOTIF MINING ON STRUCTURED AND
SEMI-STRUCTURED BIOLOGICAL DATA
by
WEI SU
Submitted in partial fulfillment of the requirements
For the Degree of Doctor of Philosophy
Dissertation Advisors: Dr. Mehmet Koyuturk, Dr. Jiong Yang
Committee Members: Dr. Andy Podgurski, Dr. Xiang Zhang,
Dr. Wojbor Woyczynski
Department of Electrical Engineering and Computer Science
CASE WESTERN RESERVE UNIVERSITY
May, 2013
CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of Wei Su, candidate for the Doctor of Philosophy degree*.

(signed) Mehmet Koyuturk (chair of committee)
Andy Podgurski
Xiang Zhang
Wojbor Woyczynski

(date) Aug. 8th, 2012

* We also certify that written approval has been obtained for any proprietary material contained therein.
Table of Contents

List of Tables
List of Figures
Acknowledgments
Abstract

Chapter 1. Introduction

Chapter 2. ARCS-Motif: Discovering Motifs with Dependency from Unaligned Biological Sequences
  2.1 Introduction
  2.2 Related Work
  2.3 Preliminaries
  2.4 Algorithm
    2.4.1 Seed Initialization
      2.4.1.1 Sequence Similarity Measure
      2.4.1.2 Segment Clustering
      2.4.1.3 Selection of Seeds
    2.4.2 Growing Procedure
    2.4.3 Complexity Analysis
  2.5 Experimental Results
    2.5.1 Accuracy of ARCS-motif Finder
    2.5.2 Efficiency of ARCS-motif Finder

Chapter 3. Permu-Motif: Discovery of Interchangeable Permutation Motifs with Proximity Constraint
  3.1 Introduction
  3.2 Related Work
  3.3 Preliminaries
  3.4 Algorithm
    3.4.1 Scanning for Frequent Interchangeable Sets and Reachable Cases
    3.4.2 Pruning
    3.4.3 Final Verification
    3.4.4 Correctness of Algorithm
  3.5 Experimental Results
    3.5.1 Effectiveness of Permu-Motif Model
    3.5.2 Efficiency of Permu-Motif Algorithm
    3.5.3 Another Application of Permu-Motif Algorithm
      3.5.3.1 Discovery of Gene Motifs from Four New Genomes
      3.5.3.2 Application of Gene Motifs to Ortholog Prediction

Chapter 4. WIGM: Discovery of Subgraph Motifs in a Large Weighted Graph
  4.1 Introduction
  4.2 Related Work
  4.3 Background
  4.4 Problem Statement
  4.5 1-Extension Property
  4.6 Threshold-WIGM: Mining with Weighted Support
    4.6.1 Main Algorithm
    4.6.2 Subgraph Motif Combination
    4.6.3 Support Computation
    4.6.4 Algorithm Analysis
    4.6.5 Top-K-WIGM: Mining for Top-k Motifs
  4.7 Experimental Results
    4.7.1 Biological Networks
    4.7.2 Synthetic Graphs

Chapter 5. Conclusions and Discussion

Bibliography
List of Tables

2.1 Minimum Edit Distance from Discovered Motifs
2.2 Average Edit Distance from Discovered Motifs
3.1 Support of Each Interchangeable Set in the Running Example
3.2 All Reachable Cases
3.3 Summary of Results on 120 Genome Sequences of 97 Species
3.4 Motif Length Comparison
3.5 Efficiency on genome dataset
3.6 Results on two example target genomes
3.7 Example genes correctly predicted by Permu-Motif but not predicted by BBH
3.8 Example genes correctly predicted by Permu-Motif but not predicted by BBH
4.1 Results on Biological Network (time in sec.)
4.2 Default Parameter Value
List of Figures

2.1 A Set of Unaligned Sequences and An Example of Motif
2.2 Number of Best Motifs Discovered by ARCS-motif and Five Alternative Methods
2.3 Distribution of the Error Rate of the Motifs Discovered by ARCS-motif
2.4 Execution Time of ARCS-motif and Five Alternative Methods
3.1 Example of Permutation Motif
3.2 Flowchart of Permu-Motif Algorithm
3.3 Reachable Cases in a Sequence
3.4 Scalability on Tsup, the Support Threshold
3.5 Scalability on L, the Average Number of Symbols in a Sequence
3.6 Scalability on S, the Number of Sequences
3.7 Scalability on M, the Number of Interchangeable Sets
3.8 Accuracy of Permu-Motif w.r.t. Support
3.9 Accuracy of Permu-Motif w.r.t. Top-k Ortholog Candidates
4.1 Example of a Database Graph G and a Subgraph Motif g
4.2 Example of a Database Graph G and Multiple Subgraph Motifs g1, g2 and g3
4.3 Example of Partitioning a Tree
4.4 Example of Combining Two Graphs
4.5 Execution Time w.r.t. Number of Vertices in G
4.6 Execution Time w.r.t. Average Vertex Degree of G
4.7 Execution Time w.r.t. Distinct Labels in G
4.8 Execution Time w.r.t. K
Acknowledgments

All these years I have been picturing myself writing the acknowledgment page of my dissertation, for I owe my gratitude to many people in my life. They have made this dissertation possible, and because of them the long journey of pursuing a Ph.D. has been a rewarding and cherished experience.
First, I would like to thank my advisor, Dr. Jiong Yang. I have benefited greatly from his outstanding knowledge and talent in data mining and
algorithms. I consider it a great opportunity to do my doctoral research under
his guidance.
My deepest gratitude also goes to my advisor, Dr. Mehmet Koyuturk,
who has been an extraordinary, responsible and patient mentor. His support
has helped me overcome many crisis situations and finish this dissertation.
Many thanks for carefully reading and commenting on countless revisions of
this manuscript.
I must also thank my committee members, Dr. Andy Podgurski, Dr. Xiang Zhang and Dr. Wojbor Woyczynski, for taking their precious time to provide insightful comments and guidance. Their suggestions have made my research more complete and sound.
I am also indebted to Dr. Sun Kim, Dr. Mehmet Dalkilic and Dr. Kwangmin Choi for contributing to various parts of my research and dissertation with their expertise in bioinformatics.
Many thanks to my co-workers from CWRU: Meng Hu, Dr. Shijie
Zhang, Shirong Li, Dr. Jin Wei and Lunde Jiang. Particularly, I would like
to acknowledge Meng Hu, Dr. Shijie Zhang and Shirong Li for many valuable
and stimulating inputs to my research.
I deeply appreciate my dearest friends for their support. Particularly,
many thanks to Yunxiang Zhang, Aunt Jianling Yan and Uncle Xianyou Liu
for their constant prayer and to Jia Peng, one of my best friends, for keeping
me sane through difficult times.
Most importantly, none of this would have been possible without the love of my family. Dad and Mom, thank you for bringing me into this beautiful world, loving me and always having faith in me.
Motif Mining on Structured and Semi-Structured
Biological Data
Abstract
by
WEI SU
Motif discovery in biological data is important to biologists because motifs are conjectured to have biological significance. Despite the considerable effort devoted to this topic, it remains a challenging problem. Advances in biology have produced large volumes of data in both structured (graph) and semi-structured (sequence) forms.
A challenge of motif mining in sequences is the existence of variations, including substitutions and permutations. Taking these two kinds of variations into account, we propose a novel sequential motif model and devise an algorithm to discover such motifs. A reachability property is identified to prune the search space. We also demonstrate that the discovered motifs have biological significance. Motivated by another biological phenomenon called compensational mutation in biological sequences, we propose a second model for motif discovery in biological sequences. Considering compensational mutation adds more degrees of freedom to the search space. A novel algorithm is proposed to solve the problem, and it is shown that the proposed algorithm can discover effective motifs in a timely manner.
Besides sequential data, a huge amount of biological data can be naturally represented as graphs, e.g., protein interaction networks and gene regulatory networks. Most existing graph mining research focuses on mining unweighted graphs; however, weighted graphs are actually more common. To address the problem of motif discovery in weighted graphs, a weighted subgraph motif model is proposed to capture the importance of a subgraph motif in a single large weighted graph. We study two related problems of subgraph motif discovery in a large weighted graph: (1) discovering all motifs with respect to a given minimum weight threshold and (2) finding the top k motifs with the largest weights. We identify a property called the 1-extension property so that a bounded search can be achieved. Last but not least, real and synthetic data sets are used to show the effectiveness and efficiency of the proposed model and algorithm.
Chapter 1

Introduction

Biological data can be represented in two forms: as sequences, e.g., amino acid sequences of proteins and gene sequences, and as graphs, e.g., protein-protein interaction (PPI) networks and gene regulatory networks.
Finding motifs in biological sequences is important to biologists because motifs are conjectured to have biological significance. For example, an occurrence of a conserved motif in a DNA sequence may be a binding site for a transcription factor and influence the regulation of gene expression [19][71]. As another example, an occurrence of a conserved motif in a protein may indicate that this region plays an important role in determining the biological behavior of the protein [52]. The region "may be involved in the interaction with some other proteins, may comprise the active site of an enzyme, or may be important for the tertiary structure of the protein [47]." Motifs can be used in, but are not limited to, protein structure and function prediction and the characterization of protein families [47].
A sequential motif can be represented in many ways. For example, it
can be represented as an ordered sequence of symbols or a matrix. There are
different ways of measuring the importance of motifs in different applications.
For example, if a motif is represented as a string, one way to measure the
importance of motifs is by its frequency. If a motif is represented as a matrix,
we can define the weight of a matrix in terms of the similarity between rows.
In its simplest form, sequential motif finding can be defined as follows: given a set of sequences, find motifs that occur frequently. However, the identification of motifs in biological sequences is not that simple. During evolution, biological sequences may undergo some degree of mutation (variation), including insertions, deletions and substitutions. The function of a motif may not be altered, but the occurrences of the motif in different species may differ because of mutations during evolution. It is desirable to model these mutations and other evolutionary events. Therefore, we propose two motif models considering two different kinds of biological phenomena.
Other than sequential biological data, an important part of biological data is represented as graphs. For example, in a graph representing a protein-protein interaction (PPI) network, a vertex represents a protein and an edge between two proteins exists if these two proteins interact. Subgraph motif mining aims to find subgraph motifs representative of a single large graph or a set of graphs. In reality, edges in biological networks are commonly associated with a weight representing the likelihood of the existence of the edge. Since little literature is available on subgraph motif mining in a single large weighted graph, we propose a novel motif model and algorithm to solve this problem.
Chapter 2
ARCS-Motif: Discovering Motifs with
Dependency from Unaligned Biological
Sequences
2.1 Introduction
The motivation for the ARCS-motif finder is the phenomenon of compensational mutation (also called correlated mutation), which is commonly observed in protein sequences. Compensational mutation is a kind of co-evolution at the microscopic level of protein modules. Co-evolution is the phenomenon that "the change of a biological object triggers the corresponding change of another biological object [92]." In amino acid sequences of proteins, some residues exhibit compensational mutations. It is known that the amino acid sequence of a protein specifies its three-dimensional structure and biological functions [78]. The mutation of a residue may alter the biological functions of the protein. Under the pressure of maintaining its biological functions and structure, when a residue mutates, there is a high probability that another residue mutates correspondingly [29][55][75][86][83][78]. We would like to take this phenomenon into account to discover biologically meaningful motifs.
Recently, [79] proposed a new approach, an aggregated related column scoring (ARCS) scheme, to detect conserved regions in aligned biological sequences. ARCS is a scoring system that not only takes into account the similarity between rows (sequences) but also considers the dependency between columns (positions). In [79], the authors use the ARCS value to measure the importance of motifs (represented as matrices), and demonstrate that the ARCS measure is superior to previous measures.
Although ARCS is a scoring system that can better capture motif importance, it is difficult to apply it to general motif discovery (from aligned and unaligned sequences). First, the method in [79] is designed to detect conserved regions in aligned sequences. However, most global alignment methods [59] align functional domains together in less than 50% of PROSITE families. Thus, it is desirable to extend the ARCS measure to general motif discovery from unaligned sequences.

If we want to use ARCS as the measure of motif importance in unaligned sequences, none of the traditional sequential motif discovery methods is applicable. Moreover, it is impossible to enumerate all candidate motifs, because adding the dependency between positions (bases) of sequences to the measure introduces more degrees of freedom into the search space.
Therefore, we propose the ARCS motif model for general motif discovery. We then develop an ARCS-motif finder to efficiently discover ARCS-motifs from protein sequences. The ARCS-motif finder identifies short seeds and grows the seeds into longer motifs. However, two challenges remain: (1) how to quickly identify seed motifs with high ARCS values, and (2) how to grow the seeds efficiently.
Fortunately, there is a limited number of short segments. In protein sequences, there are twenty elements in the alphabet. Thus, the number of distinct short segments is quite small, e.g., sixty-four million distinct short segments of size six (having six symbols). These short segments are partitioned into groups whose segments may yield high ARCS values, and segments from the same cluster may serve as a seed motif.
2.2 Related Work
There are many motif finding algorithms for biological sequences.

Some of the algorithms try to solve the problem by exhaustive search. Basically, exhaustive search algorithms enumerate all possible motifs and then refine them based on some threshold; they may apply optimizations to prune the search space. The advantage of these algorithms is that they are guaranteed to find the optimal motifs, but the drawback is that they may run in exponential time in the worst case and are therefore only suitable for discovering short motifs.
It is hard to find long motifs using exhaustive search. Some researchers aim to avoid exhaustive search while guaranteeing completeness by starting with a much smaller set of seeds and extending them. The TEIRESIAS algorithm [69] is one of them. TEIRESIAS has two steps. The first step, called the scanning phase, finds all the elementary motifs with sufficient support. The second phase progressively combines those elementary motifs into longer and longer motifs until all the maximal motifs have been generated. In another paper [46], the authors propose the MULTIPROFILER algorithm. It utilizes multiprofiles, which generalize the notion of a profile, to detect subtle motifs that might escape detection by standard profiles.
So far we have mainly described algorithms guaranteed to find the optimal motifs. However, for more complicated types of motifs, we cannot hope to do so. We have to use heuristic approaches that do not necessarily find the optimal motifs but may converge to a local maximum. One example of such techniques is Gibbs sampling [51]. It uses a randomized hill-climbing algorithm to approach a locally optimal solution of un-gapped alignments. In each iteration, a sequence is randomly selected and aligned according to the maximal alignment score. In general, any sequence alignment heuristic (e.g., a conserved region discovery score) may be used in this method. MEME [6] is developed based on the Gibbs sampler. However, none of the algorithms above consider the dependency between residues of the sequences.
2.3 Preliminaries
In this section, we introduce the definitions of the ARCS value and the ARCS motif. Before that, we first introduce the definition of a sequence set and the representation of motifs used in this chapter.
Definition 1. A sequence set S is a set of sequences, S = {s1, s2, . . . , sm}. We define si(x, y) as the subsequence of si from position x to y. Let |S| be the number of sequences in S, i.e., |S| = m. Given a sequence set S, a set of segments P = {si1(a1, a1 + c), si2(a2, a2 + c), . . . , sik(ak, ak + c)} is a motif of S iff

1. 1 ≤ i1 ≤ i2 ≤ . . . ≤ ik ≤ m and i1, i2, . . . , ik are distinct;
2. ∀j, 1 ≤ j ≤ k: 1 ≤ aj and aj + c ≤ |sij|.

Let PD be the number of segments in motif P, and PL be the length of each segment. Therefore, we can convert motif P to a PD × PL matrix MP, where MP[x, y] = six[ax + y], 1 ≤ x ≤ PD, 0 ≤ y < PL.
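For concreteness, the motif-to-matrix conversion of Definition 1 can be sketched in Python. The function name, the 0-based indexing, and the toy sequences below are illustrative choices of ours, not the dissertation's implementation.

```python
def motif_matrix(seqs, occurrences, c):
    """Build the P_D x P_L matrix M_P of a motif.

    `occurrences` lists the segments s_i(a, a+c) as 0-based
    (sequence_index, start) pairs; each segment spans c + 1 symbols,
    so the matrix has P_L = c + 1 columns and P_D = len(occurrences) rows.
    """
    return [list(seqs[i][a:a + c + 1]) for i, a in occurrences]

# A toy sequence set (hypothetical, not from the dissertation):
seqs = ["XMLQWX", "AWQQLB"]
print(motif_matrix(seqs, [(0, 1), (1, 1)], c=3))
# -> [['M', 'L', 'Q', 'W'], ['W', 'Q', 'Q', 'L']]
```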
Example 1. Figure 2.1 shows an unaligned sequence set S composed of five sequences S1, S2, S3, S4, S5.

P = {S1(2, 5), S2(3, 6), S3(1, 4), S5(2, 5)} = {MLQW, MTQL, WLLL, WQQL} is a motif of S.

    MP = [ M L Q W
           M T Q L
           W L L L
           W Q Q L ]

Figure 2.1: A Set of Unaligned Sequences and An Example of Motif
Next, we introduce the definitions of the ARCS value and the ARCS motif. The ARCS value is used to measure the importance of a motif. It is composed of two parts: the LOGOS [73] value is used to measure the degree of order in each column, and functional dependency [28] is used to measure the dependency between columns.
Definition 2. For any column ci in a matrix M, the LOGOS value of the column is defined as

    LOGOS(ci) = HMax − H(ci)

H(ci) represents the disorder degree (entropy) of the i-th column. It is defined as

    H(ci) = −Σe (nie/n) log(nie/n)

where n is the number of rows in matrix M and nie is the observed count of symbol e in column i; nie = Σj δ(M[i, j] = e), where δ(M[i, j] = e) is 1 if M[i, j] = e and 0 otherwise. HMax is defined as

    HMax = log(min(ns, n))

where ns denotes the number of symbols in matrix M.

Example 2. For the matrix

    M = [ M L Q W
          M L Q L
          W L L L
          W L Q L ]

there are 4 rows (i.e., n = 4) with 4 distinct symbols M, L, Q, W (i.e., ns = 4).

    HMax = log(min(ns, n)) = log(min(4, 4)) = 2;
    H(c1) = −(1/2) log(1/2) − (1/2) log(1/2) = 1;
    LOGOS(c1) = 2 − 1 = 1;
    H(c2) = −1 · log 1 = 0;
    LOGOS(c2) = 2 − 0 = 2.
For matrix M, column c2 has a very high degree of order because it holds the same symbol in every row. Column c1 has a lower degree of order than column c2. After calculating their LOGOS values, column c2 has a higher LOGOS value than column c1, which indicates that column c2 has a higher degree of order.
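The LOGOS computation of Definition 2 can be sketched in Python as follows (logarithms base 2, as the numbers in Example 2 imply); the function names are ours, not the dissertation's.

```python
import math
from collections import Counter

def entropy(column):
    """H(c): Shannon entropy (base 2) of the symbol distribution in a column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def logos(column, ns):
    """LOGOS(c) = H_max - H(c), where H_max = log2(min(ns, n));
    ns is the number of distinct symbols in the whole matrix."""
    return math.log2(min(ns, len(column))) - entropy(column)

# Columns c1 and c2 of the matrix in Example 2 (4 rows, ns = 4):
c1 = ["M", "M", "W", "W"]   # two symbols, half each -> H = 1, LOGOS = 1
c2 = ["L", "L", "L", "L"]   # fully conserved       -> H = 0, LOGOS = 2
print(logos(c1, ns=4), logos(c2, ns=4))  # -> 1.0 2.0
```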
Definition 3. The functional dependency (FD) of column j given column i is defined as

    FD(cj | ci) = 1 − H(cj | ci) / log(n)

H(cj | ci) is the conditional entropy of column j given the information of column i, that is,

    H(cj | ci) = Σp (nip/n) Σq (nip,jq/nip) log(nip/nip,jq)

Given a matrix M, for the i-th column and the j-th column in M, nip (resp. njq) is the observed count of symbol p in column i (resp. symbol q in column j), and nip,jq = Σk δ(M[i, k] = p, M[j, k] = q) is the number of rows in which column i holds p and column j holds q.
Example 3. For the matrix

    M1 = [ A B
           A B
           C D
           C D ]

H(c2 | c1) = (1/2)(1 · log 1) + (1/2)(1 · log 1) = 0; FD(c2 | c1) = 1 − 0/log 4 = 1.

For another matrix

    M2 = [ A B
           A C
           A D
           A E ]

H(c2 | c1) = 1 · ((1/4) log 4 + (1/4) log 4 + (1/4) log 4 + (1/4) log 4) = 2; FD(c2 | c1) = 1 − 2/log 4 = 0.
Comparing matrices M1 and M2: in M1, there is a perfect mapping from column c1 to column c2, so column c2 has a high dependency on column c1. In M2, no such mapping from column c1 to column c2 exists. The FD value of column c2 given column c1 is higher in M1 than in M2. The FD value can thus be regarded as an indicator of the existence of such a mapping or dependency.
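The conditional entropy and FD computations of Definition 3 can likewise be sketched in Python (base-2 logs, matching Example 3's log 4 = 2); function names are illustrative.

```python
import math
from collections import Counter

def conditional_entropy(cj, ci):
    """H(cj | ci): entropy of column j given column i (base 2)."""
    n = len(ci)
    ni = Counter(ci)               # n_ip: symbol counts in column i
    nij = Counter(zip(ci, cj))     # n_ip,jq: joint counts over rows
    return sum((ni[p] / n) * (c / ni[p]) * math.log2(ni[p] / c)
               for (p, _q), c in nij.items())

def fd(cj, ci):
    """FD(cj | ci) = 1 - H(cj | ci) / log2(n)."""
    return 1 - conditional_entropy(cj, ci) / math.log2(len(ci))

# Example 3: M1 has a perfect mapping c1 -> c2; M2 has none.
m1_c1, m1_c2 = ["A", "A", "C", "C"], ["B", "B", "D", "D"]
m2_c1, m2_c2 = ["A", "A", "A", "A"], ["B", "C", "D", "E"]
print(fd(m1_c2, m1_c1), fd(m2_c2, m2_c1))  # -> 1.0 0.0
```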
Given a matrix M, ARCS value is calculated from each column of M.
Intuitively, a column has a high ARCS value if its neighboring columns have
a high dependency on it and its neighboring columns and itself have a low
degree of disorder.
Definition 4. Given a matrix M, the Aggregated Related Column Score (ARCS) for column i is defined as

    ARCS(ci) = Σ_{j∈N(i)} FD(ci | cj) · LOGOS(cj) / |N(i)|

where N(i) is the set of neighboring columns of column i. We define a neighborhood size N = |N(i)|; N is usually a small integer. Column j belongs to N(i) if |j − i| ≤ (N − 1)/2. More details about the ARCS measure can be found in [79].
Definition 5. The ARCS value of a matrix M is defined as

    ARCS(M) = Σi ARCS(ci) / ML

where ML is the number of columns in M.

Definition 6. A motif P is represented as a matrix MP. The ARCS value of a motif P is defined as ARCS(MP).
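Putting Definitions 2–5 together, the full ARCS(M) computation might be sketched as follows. This is our reading, not the dissertation's code: we use base-2 logs and take N(i) literally as all columns j with |j − i| ≤ (N − 1)/2, including column i itself, since the definition does not exclude it.

```python
import math
from collections import Counter

def entropy(col):
    n = len(col)
    return -sum((c / n) * math.log2(c / n) for c in Counter(col).values())

def logos(col, ns):
    # LOGOS(c) = H_max - H(c), with H_max = log2(min(ns, n))
    return math.log2(min(ns, len(col))) - entropy(col)

def fd(cj, ci):
    # FD(c_j | c_i) = 1 - H(c_j | c_i) / log2(n)
    n = len(ci)
    ni, nij = Counter(ci), Counter(zip(ci, cj))
    h = sum((ni[p] / n) * (c / ni[p]) * math.log2(ni[p] / c)
            for (p, _q), c in nij.items())
    return 1 - h / math.log2(n)

def arcs(matrix, neighborhood=3):
    """ARCS(M): average over columns of
    ARCS(c_i) = sum_{j in N(i)} FD(c_i | c_j) * LOGOS(c_j) / |N(i)|."""
    cols = list(zip(*matrix))                       # column-major view
    ns = len({sym for row in matrix for sym in row})
    radius = (neighborhood - 1) // 2
    total = 0.0
    for i in range(len(cols)):
        nbrs = [j for j in range(len(cols)) if abs(j - i) <= radius]
        total += sum(fd(cols[i], cols[j]) * logos(cols[j], ns)
                     for j in nbrs) / len(nbrs)
    return total / len(cols)
```

A fully conserved matrix with two distinct symbols, such as `[["A", "B"], ["A", "B"]]`, scores ARCS = 1.0 under these conventions.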
Here, we aim to find non-gapped motifs. Our objective is to find motifs P with high ARCS values. However, if we only pay attention to the ARCS value of motifs, we may end up discovering many motifs with high ARCS values but few sequence occurrences, because a motif P is more likely to have a high ARCS value if MP has fewer rows. For example, if there is only one row in MP, P has a perfect ARCS value, but such motifs are not useful. Therefore, we want to find motifs P satisfying both conditions: (1) P has an ARCS value no smaller than the minimum ARCS value, and (2) the number of rows in MP is no smaller than the minimum number of rows.

Problem Statement: Given a sequence set and two thresholds, a minimum ARCS value T and a minimum number of rows R, the problem of ARCS motif discovery is to find all ARCS motifs. A motif P is called an ARCS motif if (1) ARCS(MP) ≥ T and (2) PD ≥ R, where PD is the number of rows in MP.
2.4 Algorithm
We introduce a heuristic algorithm, the ARCS-motif finder, for ARCS motif discovery. It consists of two steps: (1) generating short seeds via clustering, and (2) growing short seeds into longer motifs.
2.4.1 Seed Initialization
In previous motif searching methods, since no dependency information is considered, it is rather straightforward to generate seeds: one can select short motifs with high frequency as effective seeds. In our model, however, since we need to consider dependency, segments (short subsequences) with lower frequency may also yield high ARCS values. To balance dependency and frequency, we cluster segments that may yield high ARCS values.
2.4.1.1 Sequence Similarity Measure
In any clustering method, the similarity/dissimilarity measure is essential. Given two sequences s1, s2 with |s1| = |s2| and neighborhood size N, we define the dissimilarity between s1 and s2 as D(s1, s2) = DL(s1, s2) + DF(s1, s2), where DL and DF indicate the dissimilarity between rows and the independence between columns, respectively.
We define d(x, y) as

1. d(x, y) = 1 when x ≠ y;
2. d(x, y) = 0 when x = y.

DL(s1, s2) is the Hamming distance between two sequences of the same length: DL = Σi d(s1[i], s2[i]). DL is a good indicator of the LOGOS value: for a matrix M composed of s1 and s2, the higher DL(s1, s2), the lower the sum of the LOGOS values of the columns in M.
For example, for two sequences s1 = abc and s2 = abd, the matrix composed of these two sequences is

    M = [ a b c
          a b d ]

In this matrix, LOGOS(c1) = 1, LOGOS(c2) = 1 and LOGOS(c3) = 0, and DL(s1, s2) = 0 + 0 + 1 = 1.
DF(s1, s2) represents another type of difference between s1 and s2 and is closely related to functional dependency. Considering the functional dependency between column i and column j, there are four scenarios:

1. s1[i] = s2[i] and s1[j] = s2[j];
2. s1[i] = s2[i] and s1[j] ≠ s2[j];
3. s1[i] ≠ s2[i] and s1[j] ≠ s2[j];
4. s1[i] ≠ s2[i] and s1[j] = s2[j].

In the first three cases, the functional dependency of column i on column j is perfect. Only in the last case, where the two sequences agree in column j but disagree in column i, is the functional dependency low. Thus, DF(s1, s2) is essentially the number of occurrences of the fourth case. Formally,

    DF = Σ_{i,j: |i−j|≤N, s1[j]=s2[j]} d(s1[i], s2[i]).
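A direct transcription of the dissimilarity D = DL + DF (with the =/≠ cases read as above) might look as follows; the function names are ours.

```python
def d(x, y):
    """Mismatch indicator: 1 if x != y, else 0."""
    return int(x != y)

def d_l(s1, s2):
    """D_L: Hamming distance between two equal-length segments."""
    return sum(d(a, b) for a, b in zip(s1, s2))

def d_f(s1, s2, neighborhood=3):
    """D_F: number of position pairs (i, j) with |i - j| <= N where the
    sequences agree at j but disagree at i (case 4 above)."""
    n = len(s1)
    return sum(d(s1[i], s2[i])
               for i in range(n) for j in range(n)
               if abs(i - j) <= neighborhood and s1[j] == s2[j])

def dissimilarity(s1, s2, neighborhood=3):
    # D(s1, s2) = D_L(s1, s2) + D_F(s1, s2)
    return d_l(s1, s2) + d_f(s1, s2, neighborhood)

print(d_l("abc", "abd"), d_f("abc", "abd"))  # -> 1 2
```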
2.4.1.2 Segment Clustering
In this subsection, we present the algorithm for clustering small segments into seeds. The main idea is to first find all segments (subsequences) of length l from the sequence set and then cluster these segments based on the dissimilarity measure defined above. Segments from the same cluster are selected as seeds for the growing phase.

Given a sequence s = s[1]s[2]...s[n] and parameter l, the segment set generated from s is

    Seg(s) = ∪_{i=1}^{n−l+1} { s[i]s[i+1]...s[i+l−1] }

There are many ways to set the value of l. In the domain of protein sequences, seed segments of three symbols are commonly used for computational efficiency, e.g., in BLAST [3]. Thus, we set l to three in our method. Given a sequence set S = {s1, s2, . . . , sn}, the segment set generated from S is

    Seg(S) = ∪_{i=1}^{n} Seg(si)
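The segment generation above is a simple sliding window; a minimal sketch (keeping duplicates in a list rather than a mathematical set, since frequency matters for clustering — our choice):

```python
def segments(s, l=3):
    """Seg(s): all length-l windows s[i..i+l-1] of a sequence."""
    return [s[i:i + l] for i in range(len(s) - l + 1)]

def segment_set(seqs, l=3):
    """Seg(S): segments pooled over all sequences in the set."""
    return [seg for s in seqs for seg in segments(s, l)]

print(segments("MLQWA"))  # -> ['MLQ', 'LQW', 'QWA']
```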
After we generate the segment set from the sequence set, we group these segments into a number of clusters. Since the number of clusters may be difficult to determine, we use a clustering method that does not require knowing the exact number of clusters. At the beginning, we choose a large k0 as the initial number of clusters and assign each segment to a cluster at random. The algorithm then proceeds in iterations. In the i-th iteration, we start with the ki−1 centroids from the previous iteration, assign each segment to the closest centroid, and recompute the centroids. If some cluster has fewer than τ segments, we remove the cluster. At the end of the i-th iteration, we have ki clusters and ki centroids. The procedure repeats until the centroids do not change from one iteration to the next. This method has two parameters: k0 and τ. In an ideal scenario, a cluster contains one seed and the seed contains one segment from each sequence. Thus, we set k0 to the average length of the input sequences and τ to the input threshold R. Any general clustering algorithm is applicable here; we use this one because it is easy to implement and we want to incorporate the prior estimate of k.
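The shrinking-k clustering loop might be sketched as follows. Treating a cluster's "centroid" as the collection of its members (as the seed-selection section suggests), using plain Hamming distance on the fixed-length segments, and the exact ordering of the drop/reassign steps are our reading, not a transcription of the dissertation's code.

```python
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def dist_to_cluster(seg, members):
    # distance to a cluster whose centroid is the set of its members
    return sum(hamming(seg, m) for m in members) / len(members)

def cluster_segments(segs, k0, tau, max_iter=50, seed=0):
    rng = random.Random(seed)
    assign = [rng.randrange(k0) for _ in segs]    # random initial assignment
    for _ in range(max_iter):
        clusters = {}
        for s, a in zip(segs, assign):
            clusters.setdefault(a, []).append(s)
        # drop clusters with fewer than tau segments
        clusters = {c: ms for c, ms in clusters.items() if len(ms) >= tau}
        if not clusters:
            return []
        new_assign = [min(clusters, key=lambda c: dist_to_cluster(s, clusters[c]))
                      for s in segs]
        if new_assign == assign:                  # assignments stabilized
            break
        assign = new_assign
    final = {}
    for s, a in zip(segs, assign):
        final.setdefault(a, []).append(s)
    return [ms for ms in final.values() if len(ms) >= tau]
```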
2.4.1.3 Selection of Seeds
A seed consists of a set of short segments of length three. No two segments in the same seed come from the same sequence. After grouping segments into clusters, we randomly select sets of segments from the same cluster as seeds.
If a sequence si has a segment in cluster C, we say that C covers si. At most one segment from a sequence is selected into the seed. For each sequence s covered by C, if there exist j segments {sg1, sg2, ..., sgj} from sequence s in the cluster, the probability of sgx being selected is

    Prob(sgx) = (1 − D(cen(C), sgx) / Σ_{y=1}^{j} D(cen(C), sgy)) / (j − 1)

where cen(C) denotes the centroid of cluster C; here the centroid of a cluster is composed of all segments in the cluster. This ensures that segments closer to the centroid have a higher probability of being selected. After we process all the sequences, if the number of segments in the seed is no less than threshold R and the ARCS value of the seed is above threshold T, the seed is effective; otherwise, we discard the seed. The seed selection algorithm is shown below.
Algorithm 1 Seed Selection
Input: Cluster C, sequence set S, thresholds R and T.
Output: A seed SD.
1: SD ← ∅
2: S′ ← {s | s ∈ S, Seg(s) ∩ C ≠ ∅}
3: Calculate the selection probability of each segment
4: for each sequence s in S′ do
5:    Choose one segment d at random from Seg(s) ∩ C
6:    SD ← SD ∪ {d}
7: end for
8: if SDD ≥ R and ARCS(SD) > T then
9:    return SD
10: end if
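Line 3 of Algorithm 1 computes the selection probabilities defined above. A small sketch, handling the j = 1 case (where the formula's denominator j − 1 vanishes) by selecting the lone segment outright — that fallback and the function names are our assumptions:

```python
import random

def selection_probs(dists):
    """Prob(sg_x) = (1 - D(cen, sg_x) / sum_y D(cen, sg_y)) / (j - 1):
    segments closer to the centroid get higher probability."""
    j = len(dists)
    if j == 1:
        return [1.0]          # single candidate: pick it with certainty
    total = sum(dists)        # assumes not all distances are zero
    return [(1 - dx / total) / (j - 1) for dx in dists]

def pick_segment(segments, dists, rng=random):
    # weighted draw of one segment from a covered sequence
    return rng.choices(segments, weights=selection_probs(dists), k=1)[0]

print(selection_probs([1, 3]))  # -> [0.75, 0.25]
```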
2.4.2 Growing Procedure
After a set of seeds is obtained, they are expanded into longer motifs via a growing algorithm.

The basic strategy is to grow one column at a time on each side of the segments in the seed. Each time, the ARCS value of the new candidate motif is computed. If the ARCS value of the motif with one more column is still higher than or equal to the threshold T, we continue growing. Otherwise, some segment in the candidate motif is removed to increase the ARCS value of the resulting candidate motif.
For a candidate motif P with ARCS(P) < T, suppose the segments contained in P are sg1, sg2, ..., sgPD. We first calculate the ARCS values of the PD motifs with one segment absent, i.e., P − {sg1}, P − {sg2}, . . . , P − {sgPD}. Then we remove the segment sgi with the highest ARCS(P − {sgi}). If the ARCS value of the resulting motif is at least T while PD is still at least R, we continue to grow the new motif column by column. Otherwise, we stop the growing phase and output the motif before P, i.e., the motif with one column less than P.
In the case when some segment in the motif reaches the end of the
sequence, we simply remove the segment from the motif and continue to grow
the motif if PD ≥ R. The pseudo code of the grow algorithm is shown as
follows.
Algorithm 2 Grow Algorithm
Input: A seed SD, sequence set S, thresholds T , R and L.
Output: Motif P .
1: P ← SD
2: while PL ≤ L do
3:   P′ ← P
4:   Add one column to P
5:   if a segment sgi reaches the end of its sequence then
6:     P ← P − {sgi}
7:   end if
8:   if ARCS(P ) ≥ T then
9:     continue
10:  end if
11:  delete the segment sgi with the largest ARCS(P − {sgi}) from P
12:  if ARCS(P ) ≥ T and PD ≥ R then
13:    continue
14:  end if
15:  P ← P′
16:  return P
17: end while
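The control flow of the growing procedure can be sketched as follows; `arcs` and `extend_one_column` are hypothetical stand-ins for the ARCS score and the one-column extension of a segment, so this is a sketch of the backtracking logic rather than the dissertation's implementation.

```python
# Sketch of Algorithm 2 (growing). arcs and extend_one_column are assumed
# helpers: arcs scores a list of segments, extend_one_column grows one
# segment by a column or returns None at the sequence end.

def grow_motif(seed, arcs, extend_one_column, max_length, t_threshold, r_threshold):
    motif = list(seed)
    while motif and len(motif[0]) <= max_length:
        previous = list(motif)                            # remember the motif before growing
        grown = [extend_one_column(sg) for sg in motif]
        motif = [sg for sg in grown if sg is not None]    # drop segments past their sequence end
        if len(motif) < r_threshold:
            return previous
        if arcs(motif) >= t_threshold:
            continue                                      # keep growing
        # remove the segment whose absence maximizes the ARCS value
        best = max(range(len(motif)), key=lambda i: arcs(motif[:i] + motif[i + 1:]))
        motif = motif[:best] + motif[best + 1:]
        if arcs(motif) >= t_threshold and len(motif) >= r_threshold:
            continue
        return previous                                   # output the motif with one column less
    return motif
```

Returning `previous` mirrors step 15 of the pseudocode: when neither growing nor segment removal keeps the score above T, the motif with one column less is the output.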
2.4.3 Complexity Analysis
Let n denote the number of sequences, m the average sequence length, d the alphabet size, and k the number of clusters.
Slicing sequences into segments takes O(nm) time. Since calculating the distance between a segment and a cluster takes O(d) time, clustering all the segments takes O(nmdk) time in each iteration.
Let L denote the maximum length of motifs and N the neighborhood size. For the growing phase of the algorithm, computing the ARCS value of an n × m matrix takes O(nmN ) time. The complexity of recomputing the score after one column expansion is O(nN ). To remove a sequence from a motif, we need to recompute the score n times, but with appropriate data structures each re-computation takes only O(mN ) time, because we remove one sequence from a motif with known ARCS value. Thus, the complexity of segment removal is O(nmN ). The overall complexity of motif growth is O(nmN L).
2.5 Experimental Results
In this section, we analyze the accuracy and efficiency of the ARCS-motif method. To better understand its performance, we apply it to data sets from the PROSITE database [97]. PROSITE is a protein database; for each protein family, it contains a motif determined by lab tests or by other biological knowledge, which we consider the correct motif.
We compare predicted motifs with the PROSITE motifs.
In addition, five alternative methods, CONSENSUS [38], Gibbs sampler [51], MEME [6], SPLASH [15], and DIALIGN-TX [82], are applied to the same data sets. DIALIGN-TX is a multiple sequence local alignment method; when applying it, we first align the sequences via DIALIGN-TX and then find the regions with the highest ARCS value on the aligned sequences. To make the tests fair, the motif length is assumed to be known for all six methods, and all six methods are used to discover motifs without gaps.
There are five parameters for the ARCS-motif method. We set the minimum ARCS value to 1.0, the minimum sequence percentage to 0.4, and the neighborhood size to 5. For every method, ARCS-motif as well as MEME, Gibbs, CONSENSUS, SPLASH and DIALIGN-TX, the motif length is set to the length of the real sites. Finally, we try to discover 5 and 20 motifs from each family using each method.
2.5.1 Accuracy of ARCS-motif Finder
We randomly choose 400 protein families from the PROSITE database and apply the ARCS-motif method and the five alternative methods to these data sets. Each data set is associated with a PROSITE motif, which is considered the correct motif. We measure accuracy by comparing predicted motifs with PROSITE motifs: the more similar a predicted motif is to the PROSITE motif, the more accurate we consider it to be. We measure similarity using edit distance.
Table 2.1: Minimum Edit Distance from Discovered Motifs

        ARCS-motif  MEME   Gibbs  CON.   SPLASH  DIALIGN
Top 5   6.50        8.33   10.07  7.05   13.70   11.21
Top 20  5.52        8.12   8.04   8.02   11.63   9.85

Table 2.2: Average Edit Distance from Discovered Motifs

        ARCS-motif  MEME   Gibbs  CON.   SPLASH  DIALIGN
Top 5   8.99        10.11  11.31  10.04  17.00   13.68
Top 20  9.60        11.08  10.99  10.45  16.47   12.48
We use each method to discover 5 and 20 motifs from each family. The edit distance from a discovered motif to the PROSITE motif of the protein family is used to assess how good the discovered motif is. For each protein family, we consider the motif among the top 5 or 20 discovered motifs with the minimum edit distance from the PROSITE motif to be the "best" motif. Table 2.1 shows the average edit distance from the "best" discovered motifs to the PROSITE motifs. On average, ARCS-motif produces motifs that are at least 10% better than the best of the alternative methods.
In addition, for each protein family, we compute the average edit distance from the top 5 or 20 discovered motifs to the PROSITE motif of the family. Table 2.2 shows the averages over the 400 families. The results are similar to those in Table 2.1. Both tables demonstrate that the ARCS-motif method produces motifs that are more similar to the correct motifs.
Figure 2.2 shows which of the six competing methods, ARCS-motif, MEME, CONSENSUS, Gibbs sampler, SPLASH and DIALIGN-TX, produces motifs most similar to the correct motifs. Figure 2.2(a) shows that, among the 400 families, when each method is set to discover the top five motifs, ARCS-motif produces the best motifs for about 200 families according to the minimum and average edit distance measures, which is much better than the other methods.
(a) Top-Five
(b) Top-Twenty
Figure 2.2: Number of Best Motifs Discovered by ARCS-motif and Five Alternative Methods
We also analyze the error rate of the ARCS-motif method. We define the error rate of a discovered motif as the edit distance from the motif to its corresponding PROSITE motif divided by the length of the motif. Figure 2.3 shows the distribution of the error rate of the motifs discovered by the ARCS-motif method. For most families, the motifs discovered by ARCS-motif have an error rate below 50%.
2.5.2 Efficiency of ARCS-motif Finder
Besides accuracy, we also compare the efficiency of ARCS-motif with the other methods based on two factors: the number of sequences and the average sequence length. Figure 2.4 shows the execution time of all the methods. In Figure 2.4(a), we fix the average sequence length and vary the number of sequences, while in Figure 2.4(b), the number of sequences is fixed and the average sequence length varies. Compared with MEME, Gibbs sampler, CONSENSUS, and DIALIGN-TX, ARCS-motif produces results much faster, and its execution time grows at a slower rate with respect to the number of sequences and the average sequence length. One thing worth noting is that the execution time of SPLASH is shorter than that of ARCS-motif; this is because SPLASH only finds simple sequential motifs, and its accuracy is much lower than that of ARCS-motif. As a result, the ARCS-motif method is suitable for finding motifs with dependency within a reasonable time.
Figure 2.3: Distribution of the Error Rate of the Motifs Discovered by ARCS-motif
(a) Execution Time w.r.t. Number of Sequences
(b) Execution Time w.r.t. Average Sequence Length
Figure 2.4: Execution Time of ARCS-motif and Five Alternative Methods
Chapter 3
Permu-Motif: Discovery of Interchangeable
Permutation Motifs with Proximity Constraint
3.1 Introduction
Another challenge in sequential motif mining is the existence of variations. Variations in related sequences can be classified into two categories: substitutions and permutations. In some applications, some symbols are interchangeable with others. For example, in genetic sequence analysis, orthologous genes [30], which are genes in different species with close evolutionary ancestry and hence similar biological functionality, form such groups. This kind of variation is called substitution. Moreover, in some situations, the order of symbols may be permuted in different sequences. For example, in bacteria, related genes often appear in each other's neighborhood, but the order of the genes may not be the same [30, 61]. This kind of variation is called permutation. These two kinds of variations can prevent biologically interesting motifs from being discovered by most existing frequent sequential motif mining methods, in which a sequence is said to support a motif only if it contains the symbols of the motif in exactly the same order as the motif.
Discovering motifs that exhibit these two kinds of variations is important in biological applications. In many applications, it is desirable to find functionally related genes from the genomes of different species, i.e., genes in different organisms that perform similar biological functions within each organism. First of all, orthologous genes have likely evolved from the same ancestor, and their sequences may diverge in different species during evolution although their biological functionality may remain similar. Furthermore, due to transposition, inversion and duplication events, some genes may change their positions in the genome during evolution. However, as illustrated in [21, 61], in bacteria and archaea, functionally related genes are often closely located in the genomes of different species. Thus, finding physically clustered genes is an effective way to generate candidates for functionally related gene groups. Techniques such as bi-directional best hit (BBH) can find pairs of such genes in different species. Finding groups of such pairs can also reveal functionally related genes in the same species, thereby enabling the identification of protein-protein interactions, protein complexes and functional modules [18, 42, 60, 54].
We assume that the input sequence data set is in the form of <s1 , g1 ,
s2 , g2 , . . . , sn > where si is a symbol and gi is the gap between two consecutive
symbols si and si+1 . For instance, a genome sequence can be represented in this
form where si is a gene and gi is the gap between two consecutive genes in the
genome. The gap between two genes can be the number of base fairs between
them. A motif in this context is a gene cluster (a group of physically clustered
genes). A gene cluster is a good candidate for functionally related gene groups.
In text mining, si can be a keyword and gi can be the number of words between
two keywords si and si+1 . A motif is a group of physically clustered words.
If this motif occurs frequently, it might be useful for summarizing texts and
extracting the meaning of raw texts.
A key observation on the genomic sequences of bacteria is the following: although the order of genes in a motif may be permuted across sequences, the positions of the motif symbols remain close in those sequences where the motif occurs. In [61], the authors examine 10,583 cases of functionally related genes and show that, in the organisms studied, it is reasonable to consider two genes proximate if the gap between them is 300 bp or less. Since the number of genes participating in a biological function may vary from a few to more than a couple hundred, the total portion of the genome that such a group of genes spans may not be bounded. Therefore, we measure proximity by the gap between consecutive genes of an occurrence of a motif in a sequence, not by the gap between the first and last genes of the occurrence.
Based on the above observations, we propose an interchangeable permutation motif model. To represent the substitution of sequence symbols, each position in a motif is an interchangeable set. An interchangeable set can be a single symbol or a set of symbols that are mutually interchangeable. In the analysis of biological sequences, for example genetic sequence analysis, we can consider genes within the same orthologous group as interchangeable.
Since the order of the symbols of a motif can be altered in sequences, we loosen the definition of motifs to allow the symbols of a motif to occur in any order in sequences, as long as they occur in close enough proximity. We call our motif model the interchangeable permutation motif (or permutation motif for short). Its advantage is that it can capture not only the total order of symbols but also permuted orders of symbols.
This problem cannot be solved by algorithms based on the Apriori property because of the proximity constraints applied to the motifs. Due to the loose definition of interchangeable permutation motifs, the downward closure property [1, 2] of sequential motifs does not hold. Therefore, bottom-up growth algorithms such as the well-known Apriori algorithm cannot be directly applied to mining permutation motifs. For this reason, we propose a novel algorithm to discover frequent permutation motifs from large sequence databases. Instead of using the Apriori property, we use an alternative property, called the reachability property and composed of two sub-properties, to prune the candidate set of frequent permutation motifs. The Permu-Motif algorithm is not a level-wise Apriori algorithm, since it does not grow from short motifs to longer motifs and does not use the Apriori property. The advantage of the reachability property and the Permu-Motif algorithm comes from their pruning power, which reduces the size of the candidate set significantly.
3.2 Related Work
Frequent sequential motif mining [2, 35, 34, 44, 80, 89] has been an active research area in the data mining community for years. The problem was first introduced by the authors of [2] in the form of association rule mining, which aims to find associations between items customers usually buy together. For instance, if a customer buys beer, it is very likely that he also buys diapers. Two algorithms, AprioriAll and AprioriSome, are proposed in [2], and a well-known downward closure property, called the Apriori property (sometimes also called the anti-monotonic property), is identified: "A k-itemset is frequent only if all of its sub-itemsets are frequent". More generally, in models holding the Apriori property, a motif is frequent only if all of its sub-motifs are frequent. Therefore, a large portion of candidate motifs can be pruned.
Generalized Sequential Patterns (GSP) [80] is another Apriori-based algorithm for sequential motif mining, which integrates a time constraint and knowledge of taxonomies. Although [80] introduces a time constraint and knowledge of taxonomies, the Apriori property still holds for that problem, which makes it different from the problem studied in this chapter. PrefixSpan [64] describes an algorithm that mines sequential motifs by prefix-projected motif growth. SPADE [94] finds frequent sequences using efficient lattice search techniques and simple joins. In [44], the authors outline a more general formulation of sequential motifs and develop a modification of the GSP algorithm [80] to discover universal sequential motifs. Recently, [35] introduced a parallel algorithm to mine closed sequential motifs on a distributed memory system.
30
However, all the works above assume that neither substitutions nor permutations exist. The model proposed in [91] takes into account substitutions of sequence symbols, but it still does not consider permutations in sequences.
Mining sequential motifs with constraints has received much attention from different angles. In [26], the authors propose to use regular expressions as constraints for sequential motif mining and develop a family of SPIRIT algorithms. [65] systematically studies the problem of pushing various constraints into sequential motif mining; in addition, an efficient algorithm, prefix-growth, is developed to mine sequential motifs with prefix-monotone constraints. In [66], the Frecpo algorithm is proposed to mine frequent closed partial orders from large sequence datasets. However, the problem of mining sequential motifs that we are trying to solve is more complicated than those addressed in these works: first, the proximity constraint is not considered in their models but is in ours; second, we take into account the existence of variations, including substitutions and permutations.
3.3 Preliminaries
In this chapter, we are interested in discovering motifs from a sequence
database in which both permutations and substitutions exist. We assume
that a sequence is an ordered list of symbols along with their positions in
the sequence, which can be represented as {s1 , g1 , s2 , g2 , . . . gl−1 , sl }, where si
denotes a symbol in the sequence, and gi denotes the gap between symbols si
and si+1 . For example, in text mining, si can be a keyword and gi can be the
number of words between two keywords si and si+1 . In genetic sequences, si
can represent a gene and gi is the number of base pairs between two genes si
and si+1 .
In addition to the sequences, we assume that there is a symbol similarity
matrix indicating which symbols are similar. Symbols with high similarity
are likely to be substituted for each other. This similarity matrix can be
obtained in different ways for different applications. For example, in text
mining, the word similarity matrix can be constructed from dictionaries or
lexical databases such as WordNet [23]. In biological sequence analysis, the
gene similarity matrix can be constructed by sequence similarity via BLAST
[95] or other tools. Currently, COG database [85] and the PFAM database [7]
with a large number of orthologs (sets of gene or protein families) are publicly
available. To handle the differences in sequences due to substitutions, we
define the interchangeable symbol set in our model as follows:
Definition 7. An interchangeable symbol set, or interchangeable set for short, is a non-empty set of symbols that are interchangeable with each other.
Once the symbol similarity matrix is obtained, the interchangeable symbol sets can be easily derived by setting a similarity threshold. An interchangeable symbol set can consist of a single symbol or a set of symbols. For example, the interchangeable set {police, cop} means that the words police and cop are synonyms and thus interchangeable. If an interchangeable set contains only one symbol, this symbol is not interchangeable with any other symbol.
For example, the interchangeable set {politics} indicates that the word politics has no synonyms. Each non-interchangeable symbol can be represented as an interchangeable set containing only that symbol. One symbol can belong to multiple interchangeable sets, e.g., a word can have different meanings.
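One simple way to derive interchangeable sets from the thresholded similarity matrix, an assumed reading since the text does not fix the grouping rule, is to take the connected components of the graph whose edges join symbol pairs with similarity at or above the threshold:

```python
# Hypothetical helper: group symbols into interchangeable sets by
# union-find over pairs whose similarity meets the threshold. Strict
# mutual interchangeability would require cliques; connected components
# are used here only to keep the sketch simple.

def interchangeable_sets(symbols, similarity, threshold):
    """similarity: dict mapping frozenset({a, b}) -> score."""
    parent = {s: s for s in symbols}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]        # path compression
            s = parent[s]
        return s

    for a in symbols:
        for b in symbols:
            if a < b and similarity.get(frozenset({a, b}), 0.0) >= threshold:
                parent[find(a)] = find(b)        # merge the two groups
    groups = {}
    for s in symbols:
        groups.setdefault(find(s), set()).add(s)
    return list(groups.values())
```

On the running text's example, {police, cop} and {politics} come out as two separate sets.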
In most existing sequential motif mining applications, a sequence is said to support a motif only when the total order defined by the motif is contained in the total order defined by the sequence. However, in the problem studied here, we are interested in finding groups of symbols appearing in close proximity to each other in many sequences. The exact order is not important, since the order of symbols in sequences can be permuted. To address this problem, we introduce a new definition of support. Before giving this definition, we introduce a new motif model called the interchangeable permutation motif. Interchangeable sets serve as the basic building blocks of interchangeable permutation motifs.
Definition 8. An interchangeable order-permutation motif, or permutation motif for short, is a collection of interchangeable symbol sets of the form P = {MSi | 1 ≤ i ≤ m}, where the MSi are interchangeable sets. The number of interchangeable sets in a permutation motif, m, is called the length of the permutation motif.
Definition 9. An unordered symbol set {si |1 ≤ i ≤ m} is called an unordered instance of a permutation motif P if si ∈ MSi for every i(1 ≤ i ≤
m). An ordered symbol set (si |1 ≤ i ≤ m) is called an ordered instance of
[Figure: the permutation motif {{a}, {c, d}, {f}} with interchangeable set {c, d}, its two unordered instances {a, c, f} and {a, d, f}, its twelve ordered instances (all permutations of the two unordered instances), and example sequences containing ordered instances such as ...a...c...f... and ...f...d...a...]
Figure 3.1: Example of Permutation Motif
P if the set of its symbols is an unordered instance of P ; that is, any permutation of an unordered instance of P is an ordered instance of P . We use {} to represent an unordered instance and ( ) to represent an ordered instance.
Example 4. P = {{a}, {c, d}, {f }} is a permutation motif, which contains
three interchangeable sets: {a}, {c, d} and {f }. P has two unordered instances:
{a, c, f } and {a, d, f }. (a, f, c) is an ordered instance of P , because it is a
permutation of an unordered instance {a, c, f }. P along with all its unordered
instances and ordered instances are illustrated in Figure 3.1.
Definition 10. A sequence S = (s1 , g1 , s2 , g2 , . . . , gl−1 , sl ) supports an interchangeable permutation motif P = {MSi | 1 ≤ i ≤ m} if and only if there exist
integers j1 , j2 , ..., jm , such that
• A subsequence of S, Ssub = (sj1 , sj2 , ..., sjm ), is an ordered instance of P .
• For every ji (1 ≤ i < m) and ji+1 , Σ_{k=ji}^{ji+1−1} gk ≤ Tgap , where Tgap is a pre-defined value called the gap threshold.
Definition 11. We call a motif a frequent permutation motif if at least
Tsup sequences in a sequence database support this motif. Tsup is called the
support threshold.
That is, a sequence supports a permutation motif if it contains m positions, where m is the length of the motif, whose symbols form an ordered instance of the motif, and the distance between any two consecutive of these m symbols is no larger than the gap threshold Tgap . The proximity constraint is imposed on the gap between two consecutive motif members, rather than on the gap between the first and last motif members. This is because the number of members in a motif is not known in advance; applying the proximity constraint to successive motif members allows us to find longer motifs, which are often more interesting.
Example 5. Given a permutation motif P = {{a}, {c, d}, {f }} and Tgap = 500, sequence S1 = (a, 100, f , 300, d) supports P , because S1 contains an ordered instance of P , (a, f, d), as a subsequence, and the gaps between a and f and between f and d are both below 500. However, sequence S2 = (a, 600, f , 300, d) does not support P , because the gap between a and f is 600, which is larger than Tgap .
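Definition 10's support test can be made concrete with a small sketch (a hypothetical helper, exponential in the motif length and meant only to illustrate the definition, not an efficient mining algorithm):

```python
from itertools import permutations

def supports(sequence, motif, t_gap):
    """sequence: [s1, g1, s2, g2, ..., sl]; motif: list of interchangeable
    sets, each a set of symbols. True iff the sequence contains an ordered
    instance of the motif with every pair of consecutive chosen symbols
    within t_gap of each other."""
    symbols = sequence[0::2]
    gaps = sequence[1::2]
    pos = [0]                         # cumulative offsets; gap(i, j) = pos[j] - pos[i]
    for g in gaps:
        pos.append(pos[-1] + g)

    def match(sets, k, prev_idx):
        if k == len(sets):
            return True
        for j in range(prev_idx + 1, len(symbols)):
            if prev_idx >= 0 and pos[j] - pos[prev_idx] > t_gap:
                break                 # every later position is even farther away
            if symbols[j] in sets[k] and match(sets, k + 1, j):
                return True
        return False

    # try the interchangeable sets in every order (any permutation may match)
    return any(match([motif[i] for i in order], 0, -1)
               for order in permutations(range(len(motif))))
```

On Example 5 above, this accepts S1 and rejects S2.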
Definition 12. An interchangeable permutation motif P1 is a super-motif of another permutation motif P2 if every interchangeable set in P2 is a subset of some interchangeable set in P1 .
Example 6. Permutation motif P = {{s1 }, {s2 , s3 }, {s4 }} is a super-motif of motif P1 = {{s1 }, {s4 }}, but is not a super-motif of motif {{s1 }, {s2 , s5 }, {s4 }}.
Definition 13. An interchangeable permutation motif P is called a maximal frequent permutation motif if P is frequent and none of its super-motifs is frequent.
Problem Statement: Given a support threshold Tsup and a gap
threshold Tgap , we want to find all maximal frequent interchangeable permutation motifs in a sequence database.
3.4 Algorithm
In this section, we propose the Permu-Motif algorithm to solve the problem of finding maximal frequent interchangeable permutation motifs in a sequence database. As mentioned earlier, the Apriori property for frequent sequential motifs and association rule mining does not hold for permutation motifs. For example, consider a sequence database with S1 = (a, 100, c, 100, b) and S2 = (b, 100, c, 100, a). If the minimum support threshold Tsup = 2 and the maximum gap threshold Tgap = 150, then {a, b} is not a frequent interchangeable permutation motif, because neither S1 nor S2 supports motif {a, b}: in both sequences the gap between a and b is larger than Tgap . However, its super-motif {a, b, c} is a frequent permutation motif, because in both sequences the symbol c lies between a and b and the gap between each pair of consecutive symbols is below Tgap . Therefore, instead of using Apriori-based algorithms, we need to develop a new algorithm to discover frequent permutation motifs.
In theory, any combination of interchangeable sets can be a candidate for a frequent permutation motif. If we enumerate all candidate motifs by enumerating all combinations of interchangeable sets, a set enumeration lattice can be built: the top-level nodes of the lattice are single interchangeable sets, and lower-level nodes are combinations of single interchangeable sets. We could then systematically search this lattice for frequent permutation motifs. However, the number of candidates grows exponentially with the number of interchangeable sets, so exhaustive search is inefficient, or even infeasible, when the number of interchangeable sets is large. Therefore, we devise a more efficient algorithm to solve the problem.
In our proposed algorithm, possible candidates for frequent permutation motifs are identified in two scans of the database sequences and are then pruned using a reachability property composed of two sub-properties. The algorithm works as follows. First, we record the information of the sequences in a data structure called reachable cases. Armed with the reachability property, we can first identify pairs of interchangeable sets that cannot be in any frequent permutation motif, then use these pairs to prune the set of reachable cases. After pruning some reachable cases, we can prune more pairs of interchangeable sets. This two-way pruning process is conducted iteratively. Finally, the remaining reachable cases are verified to produce all frequent permutation motifs.
[Flowchart: scan the database for frequent interchangeable sets; scan the database for reachable cases; prune interchangeable set pairs; if any pair is pruned, prune reachable cases; if any case is pruned, prune pairs again; otherwise terminate.]
Figure 3.2: Flowchart of Permu-Motif Algorithm
The flowchart of the Permu-Motif algorithm is given in Figure 3.2. The three main phases of the algorithm are explained in detail below with a running example. The example database is composed of three sequences:
S1 = (s2 , 100, s1, 500, s4, 100, s6, 100, s7, 800, s3)
S2 = (s1 , 50, s4, 100, s2, 300, s7, 150, s4, 150, s6)
S3 = (s2 , 100, s1, 150, s4, 500, s5, 100, s6, 100, s7)
In addition to the sequence database, we are given an interchangeable set {s3 , s7 }; each symbol other than s3 and s7 forms an interchangeable set containing only that symbol. The support threshold Tsup is set to 3 and Tgap is 200.
3.4.1 Scanning for Frequent Interchangeable Sets and Reachable Cases
In the first phase of our algorithm, two scans of the database are conducted. In the first scan, the frequency of each interchangeable set in the sequence database is collected. If a symbol belongs to multiple interchangeable sets, its occurrence counts for all of them. Infrequent interchangeable sets, which cannot participate in any frequent permutation motif, are pruned after the first scan.
Running Example 1. After the first scan of the example database, the support of each interchangeable set is collected, as shown in Table 3.1. Since s5 only occurs in one sequence, S3 , and Tsup is 3, s5 is pruned as an infrequent interchangeable set. The interchangeable sets {s1 }, {s2 }, {s3 , s7 }, {s4 } and {s6 } are kept and may serve as building blocks of frequent permutation motifs.
{s1 }  {s2 }  {s3 , s7 }  {s4 }  {s5 }  {s6 }
  3      3        3          3      1      3

Table 3.1: Support of Each Interchangeable Set in the Running Example
Next, we define the terms "reachable" and "reachable case", given a gap threshold Tgap .
Definition 14. In a sequence S = (s1 , g1 , s2 , . . . , gn−1 , sn ), symbols si and sj (1 ≤ i < j ≤ n) are said to be reachable in the sequence if one of the following two cases holds:
• Σ_{k=i}^{j−1} gk ≤ Tgap , which means the gap between si and sj is no larger than Tgap . We say si and sj are directly reachable.
• si and sj are indirectly reachable: there exist integers m1 , m2 , ..., mp (1 ≤ i < m1 < m2 < ... < mp < j ≤ n) such that for each q (1 ≤ q < p), symbols smq and smq+1 are directly reachable; in addition, symbols si and sm1 are directly reachable, and symbols smp and sj are directly reachable. Symbols sm1 , sm2 , ..., smp are called the intermediate set of si and sj . We also say si and sj are reachable through sm1 , sm2 , ..., smp .
Definition 15. Two interchangeable sets MS1 and MS2 are said to be reachable in a sequence if there exist s1 (s1 ∈ MS1 ) and s2 (s2 ∈ MS2 ) such that s1 and s2 are reachable in the sequence.
Definition 16. A reachable case is a case in which two interchangeable sets are directly reachable, or indirectly reachable through some intermediate sets, in a sequence. For each reachable case, we record the pair of reachable interchangeable sets, the intermediate sets, and the id of the sequence in which the case occurs.
The second scan of the sequence database records all reachable cases in all sequences. The pseudo-code for scanning the database for reachable cases is given in Algorithm 3.
Running Example 2. Let us use sequence S1 in the example database as an example: S1 = (s2 , 100, s1 , 500, s4 , 100, s6 , 100, s7 , 800, s3 ). If Tgap is 200, then s4 and s6 are directly reachable, since the distance (gap) between them is 100. For the same reason, s6 and s7 are reachable as well. Meanwhile, this makes s4 and s7 satisfy the second case of Definition 14; thus s4 and s7 are reachable through s6 . In addition, s4 and s7 are also directly reachable, since their distance is 200, satisfying the first case of Definition 14. However, s1 and s4 are not reachable, because their distance is larger than Tgap . Figure 3.3 shows all reachable cases in this sequence.
For each pair of interchangeable sets, we record any intermediate set only once, together with the ids of the sequences in which the reachable case occurs. This saves the memory space needed to store duplicate intermediate sets. For instance, s1 and s2 are directly reachable in all three sequences. The empty set {} is recorded as an intermediate set of the pair s1 and s2 , and sequence ids S1 , S2 and
Algorithm 3 Scan for Reachable Cases
Input: the gap threshold Tgap and a set of sequences seqs
Output: a set of reachable cases globalcases
1: globalcases ← ∅
2: for S ∈ seqs do
3:   tempcases ← ∅
4:   previous symbol ← null
5:   for symbol ∈ S do
6:     gap ← gap between previous symbol and symbol
7:     if symbol is not in a frequent interchangeable set or gap > Tgap then
8:       tempcases ← ∅
9:     else
10:      for case ∈ tempcases do
11:        gap ← Σ_{k = case.tail symbol.pos}^{symbol.pos − 1} gk
12:        if gap ≤ Tgap then
13:          create a new case by adding symbol to case
14:          add the new case to globalcases
15:          add the new case to tempcases
16:        end if
17:      end for
18:      create a new case for symbol and previous symbol
19:      add the new case to tempcases
20:      add the new case to globalcases
21:    end if
22:    previous symbol ← symbol
23:  end for
24: end for
25: return globalcases
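Algorithm 3 can be sketched in Python as follows. A case is represented as (head symbol, tail symbol, list of intermediate symbols); the merging of duplicate intermediate sets across sequences is omitted, and the extra record for a pair that is reachable both through intermediates and directly (like s4 and s7 in Running Example 2) is an assumption made so that the sketch reproduces the five cases of Figure 3.3.

```python
# Sketch of Algorithm 3 for a single sequence, given as parallel lists
# symbols and gaps (gaps[i] separates symbols[i] and symbols[i+1]).

def scan_reachable_cases(symbols, gaps, t_gap, frequent):
    pos = [0]                          # cumulative offsets; gap(i, j) = pos[j] - pos[i]
    for g in gaps:
        pos.append(pos[-1] + g)
    cases = []                         # (head, tail, intermediates)
    temp = []                          # open chains of indices whose consecutive
                                       # members are directly reachable
    for j, s in enumerate(symbols):
        if s not in frequent or (j > 0 and gaps[j - 1] > t_gap):
            temp = []                  # the chain is broken at this symbol
        if s not in frequent:
            continue
        extended = []
        for chain in temp:
            if pos[j] - pos[chain[-1]] <= t_gap:          # directly reachable from the tail
                new_chain = chain + [j]
                inter = [symbols[k] for k in new_chain[1:-1]]
                cases.append((symbols[new_chain[0]], s, inter))
                if inter and pos[j] - pos[new_chain[0]] <= t_gap:
                    cases.append((symbols[new_chain[0]], s, []))   # also directly reachable
                extended.append(new_chain)
        if j > 0 and symbols[j - 1] in frequent and gaps[j - 1] <= t_gap:
            cases.append((symbols[j - 1], s, []))         # new case with the previous symbol
            extended.append([j - 1, j])
        temp += extended
    return cases
```

Applied to S1 of the running example with Tgap = 200, this yields the five reachable cases of Figure 3.3.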
[Figure: sequence S1 = (s2, 100, s1, 500, s4, 100, s6, 100, s7, 800, s3) with Tgap = 200 and its five reachable cases: s2 ↔ s1, s4 ↔ s6, s6 ↔ s7, s4 ↔ s7 (directly), and s4 ↔ s6 ↔ s7.]
Figure 3.3: Reachable Cases in a Sequence
S3 are recorded. Meanwhile, in sequence S2 , interchangeable sets s1 and s2 are also reachable through s4 , so intermediate set {s4 } and sequence id S2 are recorded. All recorded reachable cases are illustrated in Table 3.2 in tabular form. Note that infrequent interchangeable sets pruned in the first scan of the database cannot occur in any frequent permutation motif; therefore, none of them appears in the table.
3.4.2 Pruning
In this phase, we prune the reachable cases shown in Table 3.2.
Given the definitions of support and reachability, a clear observation can be made: if a sequence supports a permutation motif, then in that sequence any two interchangeable sets of the motif must be reachable, either directly or indirectly through other interchangeable sets of the motif. Based on this observation, we infer the following properties.
Property 1. For any two interchangeable sets of a frequent permutation motif
P , they must be reachable in at least Tsup sequences. We call this the
minimum reachability property.

Property 2. If a sequence S supports a permutation motif P , for any two
interchangeable sets of P , they must be either directly reachable or indirectly
reachable using only other interchangeable sets of P as intermediate sets. We
call this the restrained reachability property.
Proof. The proof is straightforward given the definitions of support, reachability and frequent permutation motif. According to the definition of frequent
permutation motif, a frequent permutation motif P must have a support of at
least Tsup . According to the definition of support, if a sequence supports a permutation motif, then in the sequence, any two interchangeable sets of the motif
must be reachable either directly or indirectly through other interchangeable
sets of the permutation motif. Therefore, for any two interchangeable sets
of a frequent permutation motif P , they must be reachable in at least Tsup
sequences. Moreover, if a sequence S supports P , for any two interchangeable sets
of P , they must be either directly reachable or indirectly reachable using only
other interchangeable sets of P as intermediate sets.
Armed with these two properties, we perform a two-way pruning on the
set of candidate motifs. First, based on the minimum reachability property, we
can prune a pair of interchangeable sets if they are reachable in fewer than Tsup
sequences using any interchangeable sets as intermediate sets. Next, after the
first stage of pruning, we can further prune reachable cases. The following
lemma, which is easily derived from the restrained reachability property, supports this
pruning step.
Lemma 1. In a sequence S, if two interchangeable sets of a permutation
motif P are only reachable through certain interchangeable sets not belonging to
P , then S cannot be a support sequence of P .
The above lemma will be used to prune reachable cases. The intuition behind this
pruning is that in order to confirm two interchangeable sets MS1 and MS2 are
in a frequent permutation motif P , we need to identify at least Tsup sequences
supporting P . If we know that another interchangeable set MS3 cannot be in
any frequent permutation motif with MS1 , and in a sequence S, MS1 is only
reachable to MS2 through MS3 , then S cannot be a support sequence of P .
Thus, we can prune the reachable case that MS1 is reachable to MS2 through
MS3 in S.
After pruning some reachable cases, two interchangeable sets may be
reachable in fewer sequences. We can further prune pairs of interchangeable
sets reachable in fewer than Tsup sequences. After identifying those pairs of
interchangeable sets known to be unable to appear in any frequent permutation
motif together, we can continue pruning reachable cases. This two-way pruning is conducted iteratively until no more pairs of
interchangeable sets or reachable cases can be pruned.
         | {s2}                       | {s3,s7} | {s4}                    | {s6}
{s1}     | {} : S1,S2,S3 ; {s4} : S2  | NULL    | {} : S2,S3              | NULL
{s2}     |                            | NULL    | {} : S2 ; {s1} : S3     | NULL
{s3,s7}  |                            |         | {} : S1,S2 ; {s6} : S1  | {} : S1,S3 ; {s4} : S2
{s4}     |                            |         |                         | {} : S1,S2

Table 3.2: All Reachable Cases
Running Example 3. First we prune interchangeable set pairs that are not
reachable in enough sequences. According to Table 3.2, {s1 } and {s6 } are not
reachable in any sequence, thus they cannot be in a frequent permutation motif.
For the same reason, the pairs {s1 } ↔ {s3 , s7 }, {s2 } ↔ {s3 , s7 } and {s2 } ↔
{s6 } cannot be in any frequent permutation motif. In addition, {s1 } and {s4 }
are only reachable in two sequences, which is less than Tsup . Therefore, {s1 }
and {s4 } cannot coexist in a frequent permutation motif either. The pairs
{s2 } ↔ {s4 }, {s3 , s7 } ↔ {s4 } and {s4 } ↔ {s6 } are also pruned through the
same reasoning.
Next we use the interchangeable set pairs pruned in the last round to prune
reachable cases. First we look at the upper-left cell, which records the reachable
cases between {s1} and {s2}. We already know that {s1} and {s4} cannot be
in a frequent permutation motif, thus the second row of this cell, which indicates that {s1} and {s2} are reachable through {s4} in sequence S2, can be pruned.
Then we examine the reachable cases between {s3, s7} and {s6}. They are
reachable in three sequences; however, we know that {s4} and {s6} cannot be
in a frequent permutation motif, thus the reachable case in which {s3, s7} and
{s6} are reachable through {s4} can be pruned. After this pruning, {s3, s7} and
{s6} are only reachable in two sequences, which is below the support threshold.
Therefore, the pair {s3, s7} and {s6} can be pruned.
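The two-way pruning can be sketched in Python as follows. This is an illustrative sketch, not the thesis implementation: each reachable case is a (pair, intermediate sets, sequence id) triple, and a label such as "s3s7" stands for the interchangeable set {s3, s7}.

```python
def two_way_prune(cases, t_sup):
    """Iterative two-way pruning sketch over reachable cases.

    cases: list of (pair, mids, seq_id), where pair is a frozenset of two
    interchangeable-set labels and mids is a frozenset of labels used as
    intermediate sets for that pair in that sequence.
    """
    cases = list(cases)
    while True:
        # count, for each pair, the sequences in which it is still reachable
        seqs_by_pair = {}
        for pair, mids, seq in cases:
            seqs_by_pair.setdefault(pair, set()).add(seq)

        def reachable_enough(p):
            return len(seqs_by_pair.get(p, ())) >= t_sup

        kept = []
        for pair, mids, seq in cases:
            a, b = tuple(pair)
            # keep a case only if its pair is reachable in enough sequences and
            # every intermediate set can still co-occur with both pair members
            ok = reachable_enough(pair) and all(
                reachable_enough(frozenset({m, a})) and
                reachable_enough(frozenset({m, b}))
                for m in mids)
            if ok:
                kept.append((pair, mids, seq))
        if len(kept) == len(cases):
            return kept          # fixpoint: nothing more to prune
        cases = kept
```

On the reachable cases of Table 3.2 with Tsup = 3, this sketch leaves exactly the three direct cases between {s1} and {s2}, matching the running example.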
3.4.3 Final Verification
In the last verification step, for each remaining reachable case, we sort
the interchangeable set labels by their lexical order. By sorting interchangeable
set labels, we can transform the different permutations of a motif into a single
unique representation. For example, the case that interchangeable set MS1
is reachable to MS2 through MS3 and the case that MS3 is reachable to
MS2 through MS1 are both transformed to the same list MS1 MS2 MS3 . Now
we count the number of sequences in which each list occurs by traversing all
reachable cases. If a list of interchangeable sets occurs in at least Tsup sequences, then
these interchangeable sets form a frequent permutation motif.
Running Example 4. After the pruning in the third step, the following reachable cases are kept:

case 1: {s2} and {s1} are reachable directly in S1
case 2: {s1} and {s2} are reachable directly in S2
case 3: {s2} and {s1} are reachable directly in S3
After sorting all interchangeable set labels in each case, we only have
one sorted list ({s1}, {s2}). We know that this list occurs in three sequences
by a traversal of all reachable cases. This means that {{s1}, {s2}} qualifies as
a frequent permutation motif.
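The verification step can be sketched in a few lines of Python; the function name is illustrative.

```python
from collections import defaultdict

def verify_motifs(cases, t_sup):
    """Final-verification sketch: sort each case's interchangeable-set labels
    into a canonical form, then count the distinct sequences supporting each
    sorted list; lists supported by at least t_sup sequences are motifs."""
    seqs_by_list = defaultdict(set)
    for seq_id, labels in cases:
        seqs_by_list[tuple(sorted(labels))].add(seq_id)
    return {motif for motif, seqs in seqs_by_list.items() if len(seqs) >= t_sup}
```

On the three cases of Running Example 4 with Tsup = 3, the single motif ({s1}, {s2}) is reported.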
3.4.4 Correctness of the Algorithm
In this subsection, we analyze the correctness of our algorithm. We
outline the proof as follows. First we prove that our proposed algorithm will not produce any false negatives, that is to say, it will not prune
any frequent permutation motif. Note that there are actually two types of
pruning conducted iteratively during the pruning process. One is to prune the
interchangeable set pairs that cannot be in any frequent permutation motif
together; the other is to prune reachable cases.
For the first type of pruning, we prune interchangeable set pairs that
are reachable, using any interchangeable sets as intermediate sets, in fewer than
Tsup sequences. According to the minimum reachability property proven in
the section above, for any two interchangeable sets of a frequent permutation
motif P, they must be reachable in at least Tsup sequences. Therefore, this
step of pruning will not produce any false negatives.
For the second type of pruning, we prune reachable cases in which a
pair of interchangeable sets are reachable through some interchangeable set
that is known not to be in any frequent permutation motif with either one of
the pair. According to Lemma 1, proven in the section above,
a reachable case like this cannot be an occurrence of any frequent permutation
motif. Therefore, this type of pruning will not produce any false negatives
either.
Next, we prove that our algorithm will not produce any false positives.
In the verification phase, for any candidate frequent permutation motif
generated, we verify whether the number of sequences supporting the motif is at least Tsup .
We only identify a motif as a frequent permutation motif if it has a support
of at least Tsup . Therefore, the algorithm will not produce any false positives.
3.5 Experimental Results
We implement the Permu-Motif algorithm in C++ using the STL (Standard
Template Library). All experiments are run on a Linux PC with a 3.2 GHz
Pentium-4 processor and 1 GB main memory. We use both real and synthetic
data to analyze the performance of the Permu-Motif algorithm.
First, to illustrate the usefulness of the permutation motif model, we
use the Permu-Motif algorithm to discover frequent permutation motifs from 120
genomes of 97 species. Each frequent permutation motif whose members are
genes in genomes can be regarded as a gene cluster. The discovered gene clusters are interpreted by biologists
to show that the permutation motif model can reveal important biological
themes hidden in genomes; thus the Permu-Motif model is of great use in
comparative genome analysis.
Second, we use different synthetic datasets to evaluate the efficiency
of our Permu-Motif algorithm. Two other approaches are compared with our
Permu-Motif algorithm. The first approach is the π motif discovery algorithm
[22]. The other method is adapted from the set enumeration tree search [70].
First the sequence database is scanned for dense segments, which are sequence
segments in which the first and last symbols are reachable. Then we enumerate
all candidate motifs formed by symbols in these segments. A set enumeration
tree can be built and searched for frequent permutation motifs. The size of the
set enumeration tree is bounded since the maximal number of symbols in a
candidate motif cannot exceed the maximal size of a dense segment.
Last but not least, we explore another application of the Permu-Motif
model and algorithm to show its usefulness. In reality, there are many genomes
whose genes have not been classified into COG groups. Therefore, another
application of the Permu-Motif algorithm is to find gene motifs from genome
databases (particularly from prokaryote genomes) when the COG annotation is not available for some genes. What's more, these gene motifs can be
used to predict the orthologous groups of unknown genes.
3.5.1 Effectiveness of Permu-Motif Model
In this section, we present the experimental results on the genome
dataset to demonstrate the usefulness of the permutation motif model. 120
genome sequences of 97 species are downloaded from the NCBI (National Center for Biotechnology Information) website. Each genome contains thousands
of genes along with their positions on the genome. Genes are classified into
Clusters of Orthologous Groups (COGs). Genes within a COG are considered
interchangeable, thus the COGs form the interchangeable sets. Genes that are
not in any COG are considered non-interchangeable. The average number of
genes in these 120 genomes is 6533, and there are 4638 clusters of orthologous
groups containing more than one gene.
In this experiment, the gap threshold Tgap is set to 300 [61]. When the
support threshold Tsup is set to 20, 209 maximal frequent gene clusters can be
discovered in less than 50 seconds. Table 3.3 summarizes the experimental
results for different support thresholds.
Tsup             5       15      20
# of Motifs      1623    711     209
Max. Length      25      13      13
Ave. Length      3.71    3.77    3.44
Discovery Time   80 s    61 s    45 s

Table 3.3: Summary of Results on 120 Genome Sequences of 97 Species
To show the usefulness of the permutation motif model, we analyze
the results by interpreting the biological meanings of the discovered motifs (gene
clusters). Roughly speaking, the discovered gene clusters represent several
biological themes:
• subunits of multi-unit proteins (e.g. ribosomal complex, transcription
factors), which form physical complexes.
• operon components, which form a physical complex or a biochemical
network.
• signal transduction components, which form loose complexes on cell
membrane or biochemical networks.
• de novo amino acid synthesis.
On the other hand, we compare the results of our model against the
traditional sequential motif model, which does not take permutations into account. Since
the order of genes might be permuted during evolution in genomes of different
organisms, our permutation motif model is able to identify more and longer
motifs. Table 3.4 shows that the average and maximal lengths of the discovered
permutation motifs are larger than those of the traditional sequential motifs.
Tsup                       5       15      20
Average    Permu-Motif     3.71    3.77    3.44
Length     Sequential      2.23    2.54    2.61
Maximum    Permu-Motif     25      13      13
Length     Sequential      7       5       4

Table 3.4: Motif Length Comparison
In addition, when the support threshold is set to 20, the sequential
motif model missed 184 motifs (88%) found by the permutation motif model,
178 of which are linked to various biological themes. Moreover, 91%
of the discovered sequential motifs are also found by the permutation motif
model, but with greater lengths. A large portion of the motifs missed by the traditional sequential motif model actually have important biological meanings.
For example,
• One gene cluster {COG0091 COG0092 COG0093 COG0185 COG0186
COG0197} reflects subunits of the ribosomal complex. This gene cluster
is discovered by the Permu-Motif model, but is not found by the traditional
sequential motif model. The COGs in this gene cluster appear in a different
order in every genome, while still keeping proximity.
• The other gene cluster {COG0055 COG0056 COG0224 COG0355 COG0356
COG0636 COG0711 COG0712}, which is discovered by the Permu-Motif
model, forms an ATP synthase chain. Membrane-bound ATP synthases
(F0F1-ATPases) of bacteria serve two important physiological functions.
The ATP synthase of Escherichia coli, which has been the most intensively studied, is composed of eight different subunits, five of
which belong to F1, subunits alpha, beta, gamma, delta, and epsilon
(3:3:1:1:1), and three to F0, subunits a, b, and c (1:2:10 +/- 1) [20]. Our
Permu-Motif model successfully discovered all eight subunits. However,
only shorter segments of this gene cluster, e.g. {COG0055, COG0224,
COG0711}, are discovered by the traditional sequential motif model.
3.5.2 Efficiency of Permu-Motif Algorithm
To further analyze the efficiency of our permutation motif discovery
algorithm, two alternative methods are compared to our Permu-Motif algorithm. The efficiency of the set enumeration method is highly dependent on the
length of the dense segments in the sequences. If the dense segments are long
and contain many different symbols, the set enumeration tree could be very
large. Searching for frequent sets in such an enumeration tree could be very
inefficient.
First we compare the performance of Permu-Motif algorithm on the 120
genome sequences of 97 species used in the previous section against the two
other methods. Table 3.5 shows the execution time of the three algorithms
spent discovering frequent permutation motifs from the 120 genomes under different
settings of the support threshold. The Permu-Motif algorithm is faster than the
other two algorithms in all scenarios in this experiment.
Tsup              5       15      20
Permu-Motif       80 s    61 s    45 s
π motif           120 s   74 s    50 s
set enumeration   757 s   521 s   310 s

Table 3.5: Efficiency on genome dataset
Next we utilize a large set of synthetic data to further analyze the
scalability of our algorithm. A sequence generator is written in Perl to generate
sequences of various scales. First an arbitrary gap threshold Tgap is defined,
then the synthetic data is generated according to the following parameters:
the average number of symbols in a sequence L, the number of sequences
S, and the number of interchangeable sets M.
A series of experiments are conducted on synthetic data to compare the
efficiency of the Permu-Motif algorithm against the two alternative methods.
In these experiments, the default parameters are as follows: the average number
of symbols in a sequence L is 5,000, the number of sequences S is 1,000, and
the number of interchangeable sets M is 3,000. We also set the default support
threshold to 20% of the total number of sequences. The experimental results
show that the Permu-Motif algorithm outperforms the other approaches by a wide
margin.
Figure 3.4: Scalability on Tsup , The Support Threshold
First we evaluate the performance with respect to the support threshold
Tsup . Figure 3.4 shows the average execution time of the three algorithms with
respect to Tsup . Although the response times of the Permu-Motif algorithm and the
set enumeration algorithm both decrease linearly as Tsup increases, the
response time of the Permu-Motif algorithm drops at a faster pace than that of the set
enumeration approach for the following reason. As Tsup increases, more
interchangeable sets are pruned during the first scan of the sequence database,
and fewer reachable cases are recorded for further pruning. Thus the Permu-Motif
algorithm is much more scalable than the set enumeration algorithm as Tsup
decreases.
Figure 3.5: Scalability on L, The Average Number of Symbols in A Sequence
The second aspect that we investigate is the scalability with the average
number of symbols in a sequence L. The empirical results in Figure 3.5
show that the running time of the Permu-Motif algorithm is linearly
proportional to L, while the execution times of the two other algorithms increase
at much faster rates.
Third, we study the effects of the number of sequences S on the results.
As seen in Figure 3.6, S affects the three algorithms in much the same way as the
average number of symbols in a sequence L. The Permu-Motif algorithm clearly benefits from
the pruning power of the reachability property.
Figure 3.6: Scalability on S, The Number of Sequences
Last, we examine the response time with varying numbers of interchangeable sets M. Increasing the number of interchangeable sets has a
similar effect on all three algorithms, since a similar approach is used to
handle interchangeable sets in each of them. However, in every case the
Permu-Motif algorithm outperforms the other two.

Figure 3.7: Scalability on M, The Number of Interchangeable Sets
3.5.3 Another Application of Permu-Motif Algorithm
In reality, there are many genomes whose genes have not been classified
into families (i.e. COG groups). For example, among the more than 300 genomes
available at GenBank, only about 100 are well characterized with COG
assignments. This requires putative family classification of genes as well as gene
motif discovery. Therefore, another application of the Permu-Motif algorithm is
to find gene motifs from genome databases (prokaryote genomes) when the
family classification information (i.e. COG annotation) is not available for
some genes. What's more, these gene motifs can be used to predict the COG
annotation of unknown genes.
There are several well-developed algorithms for predicting gene motifs
in a pair of genomes, such as FISH [14] and DAGchainer [32]. These algorithms
formulate an optimization problem which is solved using dynamic
programming. For example, FISH finds all maximal k-clumps in
a directed acyclic graph whose vertices are genes. If two genes are within a
neighborhood distance, then there is an edge between the corresponding vertices. Since all gene matches are treated equally, the score of a clump is simply
the number of points (gene matches); thus this problem can be easily formulated as a recursive scoring function, which can be solved using a dynamic
programming technique. Unfortunately, extending these algorithms to multiple genomes is not trivial due to the rapid increase in time and space
complexity.
On the other hand, in [61] and [60], the authors present a method to
detect conserved clusters of genes, called PCBBH (Pairs of Close Bidirectional
Best Hits), which are pairs of genes that appear close to each other in multiple
genomes. However, the problem with this approach is that it only identifies pairs
of genes that are clustered together across genomes. The number of discovered
pairs is too large to interpret.
In this section, we apply the Permu-Motif algorithm to genome sequences where the COG annotation of some genes is not available. We will
show that gene motifs discovered by the Permu-Motif algorithm in such
genome sequences still carry important biological information. To examine the
effectiveness of the Permu-Motif algorithm further, we propose an ortholog prediction method based on our Permu-Motif algorithm, compare it
to the bi-directional best hit (BBH) technique, and show that our ortholog
prediction method outperforms BBH in ortholog identification, which demonstrates the
usefulness of the gene motifs discovered by our Permu-Motif algorithm.
In the experiments in the previous section, two genes are considered
interchangeable if they belong to the same orthologous family (e.g., COG).
In the case that this type of annotation information is unavailable, sequence
comparison can be used for this purpose. If the pairwise sequence similarity
of two genes is above a certain threshold, these two genes are considered
interchangeable. A more relaxed requirement may be applied for assessing
whether two genes are interchangeable. For example, the triangle merge in
[84] can be used for this purpose. If genes g1 , g2 and g3 from three genomes
are interchangeable, then they form a triangle. If genes g1 , g2 and g4 form
another such triangle, then these two triangles can be merged, so genes g1 ,
g2 , g3 and g4 will be put into the same interchangeable gene set. Under
this relaxation, triangle merges are conducted until no triangles can be
merged. The interchangeable gene sets are therefore maximal. This relaxation
gives better flexibility for computing interchangeable gene sets. On the
other hand, one gene can be in several interchangeable sets, which enables the
proposed algorithm to handle multi-domain genes.
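The triangle merge described above can be sketched as follows. This is an illustrative quadratic fixpoint loop, not the implementation from [84]; two groups are merged whenever they share an edge, i.e. at least two genes.

```python
def merge_triangles(triangles):
    """Triangle-merge sketch: repeatedly merge gene groups that share at
    least two genes (an edge) until a fixpoint is reached, yielding maximal
    interchangeable gene sets."""
    groups = [set(t) for t in triangles]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if len(groups[i] & groups[j]) >= 2:   # shared edge
                    groups[i] |= groups[j]
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return groups
```

On the example from the text, the triangles (g1, g2, g3) and (g1, g2, g4) are merged into the single set {g1, g2, g3, g4}.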
3.5.3.1 Discovery of Gene Motifs from Four New Genomes
We perform our experiments using four genomes: Azotobacter vinelandii,
Bdellovibrio bacteriovorus, Myxococcus xanthus, and Rhodospirillum centenum.
These four genomes are well suited for testing our algorithm since they belong to
the Proteobacteria and they are not well characterized, that is, COG annotation may not be available. One genome, Rhodospirillum centenum, is a new
genome that has not been published, obtained through our collaboration at
Indiana University. Since COG assignments are not available for these four
new genomes, we use the best-hit and triangle-merging methods to construct the
interchangeable gene sets in this case. After forming the interchangeable groups,
we discover gene motifs from these genomes using our Permu-Motif algorithm.
Our algorithm successfully predicts many biologically meaningful gene
motifs according to the current annotations of the genomes. Predicted gene
motifs can be used either to confirm functions of genes that are predicted by
similarity match or to predict de novo functions, especially for genes annotated as hypothetical or of unknown function. For example, one gene motif
covers three cell division proteins, FtsZ, FtsA and FtsQ, among three genomes.
Another gene motif contains electron transfer flavoprotein subunits which appear in Bdellovibrio bacteriovorus, Myxococcus xanthus, and Rhodospirillum
centenum.
3.5.3.2 Application of Gene Motifs to Ortholog Prediction
To further show the effectiveness of our gene motif model, we use the
discovered gene motifs to predict orthologous groups of genes.
Our method of ortholog discovery consists of three steps. First, pairwise protein sequence comparisons are conducted to assign genes
whose family information is unknown to candidate orthologous groups.
Then, using these candidate orthologous groups as interchangeable sets, we
discover gene motifs from the genome sequences with our Permu-Motif algorithm.
Finally, the discovered motifs are used to predict the orthologous groups of
genes. These three phases are explained in detail as follows.
1. In the first step, the BLAST program [95] is used to perform pairwise
protein sequence comparison. The unassigned genes, whose gene family
information is unknown, are classified into candidate orthologous
groups according to protein sequence similarity. Two types of orthologous groups are used as candidate groups for an unassigned gene.
First, the orthologous groups of the unassigned gene's best hit on any
genome sequence will be used. Second, if the unassigned gene is the
best hit of a gene of another genome, that gene's orthologous groups will
also be used. In this way, an unassigned gene is classified into multiple
existing orthologous groups. We choose the top-K most common groups
as the candidate groups of this gene, where K is usually set to a small
number (3 to 5). If K is set to 1, then each gene will be assigned to the
annotation to which the gene is most similar. However, in many cases,
the gene should not be assigned to the most similar annotation. These
candidate orthologous groups will be further pruned using the gene motifs
discovered in the next step.
2. After assigning each gene to its candidate orthologous groups, we use
these candidate orthologous groups as interchangeable sets in our Permu-Motif algorithm. We run the Permu-Motif algorithm to discover frequent
gene motifs from these genomes. For each interchangeable set in the
discovered gene motifs, we also record which gene appears in each genome
for that interchangeable set.
3. Once the gene motifs are discovered from the genome sequences, we analyze
these gene motifs to assign genes to their orthologous groups. If a gene
appears in a gene motif as a member of an orthologous group, then this
gene is assigned to that orthologous group. Other candidate orthologous
groups previously assigned to this gene will be pruned. If a gene
appears in multiple gene motifs as different orthologous groups, we use
the following rules to resolve conflicts:
• If multiple motifs cover the same gene, the longest motif will be
used for prediction.

• If all motifs are of the same size, the motif with the highest support
will be used for prediction.
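The candidate-group selection of step 1 and the conflict-resolution rules above can be sketched as follows. This is an illustrative Python sketch; the function names and the (length, support, group) tuple layout are assumptions for illustration.

```python
from collections import Counter

def candidate_groups(hit_groups, k=3):
    """Step 1 sketch: the top-K most common orthologous groups among an
    unassigned gene's best hits become its candidate groups (K of 3-5)."""
    return {group for group, _ in Counter(hit_groups).most_common(k)}

def resolve_group(motif_hits):
    """Step 3 sketch: for a gene appearing in several motifs under different
    groups, prefer the longest motif, then the highest support.

    motif_hits: list of (motif_length, support, group) tuples.
    """
    best = max(motif_hits, key=lambda m: (m[0], m[1]))
    return best[2]
```

For example, a gene covered by motifs of lengths 3 and 4 is assigned by the length-4 motif; between two length-4 motifs, the one with the higher support wins.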
We discover that in many cases, the gene will be assigned to an annotation to which it is not most similar. If a gene never participates in any gene
motif (this is possible since the gene motifs will not cover every gene in the
genomes), then we use bi-directional best hits (BBH) to assign this gene to an
orthologous group. In BBH, to predict the orthologous group of a gene, we
count the COG assignments of this gene's bi-directional best hits in
other genomes. The most common COG assignment of its BBHs is taken
as the predicted orthologous group.
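The BBH voting step just described can be sketched in a couple of lines (an illustrative sketch; the function name is an assumption):

```python
from collections import Counter

def predict_by_bbh(bbh_cogs):
    """BBH baseline sketch: predict the most common COG assignment among a
    gene's bi-directional best hits in other genomes."""
    return Counter(bbh_cogs).most_common(1)[0][0]
```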
As BBH only uses the best bi-directional hits as the ortholog predictions, it misses a significant number of orthologs. By introducing the gene
motifs, we allow more ortholog candidates at the first stage for each gene to
be predicted. The gene motifs help to decide which ortholog candidate is more
likely to be the correct prediction. The role of gene motifs is important since they
filter out incorrect predictions using the intrinsic biological meaning of these
motifs. As the experimental results show, the gene motifs correct a large
portion of the mistakes made by the BBH method.
In this section, we present the experimental results of our proposed computational ortholog discovery approach based on gene motifs. To validate the
effectiveness of this approach, the Clusters of Orthologous Groups (COGs) were
used as the benchmark. We compare our results against those of Bi-directional
Best Hits (BBH), another computational method for ortholog prediction, which is used at the first stage of COG construction. In the experiments, we remove the
COG information of certain genes, then use our method and BBH to assign
these genes to orthologous groups. We evaluate the results using the original COG assignments in terms of recall and precision. The recall is defined as
C_correct / C_removed , where C_correct is the number of genes correctly assigned by
our method, and C_removed is the total number of genes whose COG information is removed. The precision is defined as C_correct / C_assigned , where C_assigned
is the total number of ortholog assignments our method predicts. In essence,
the recall measures how likely it is that a gene can be correctly annotated, while
the precision measures how likely it is that an assignment is correct. In our method,
since we use the above rules to resolve conflicts, each gene is assigned to exactly
one orthologous group; therefore the recall of our method is the same as the
precision.
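As a sketch, the recall and precision defined above can be computed as follows (an illustrative helper; the name and argument layout are assumptions):

```python
def recall_precision(assigned, truth, removed):
    """Recall = C_correct / C_removed and precision = C_correct / C_assigned,
    following the definitions in the text.

    assigned: dict mapping gene -> predicted orthologous group
    truth:    dict mapping gene -> original COG assignment
    removed:  set of genes whose COG information was removed before prediction
    """
    correct = sum(1 for gene, group in assigned.items() if truth.get(gene) == group)
    return correct / len(removed), correct / len(assigned)
```

Note that when every removed gene receives exactly one prediction, the two denominators coincide and recall equals precision, as stated above.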
Two types of experiments were conducted to evaluate our proposed
method. The first scenario was to remove the COG information of randomly
chosen genes from multiple genomes, then try to recover their orthologous
groups. The second scenario was to remove the COG assignments of all genes
in a single genome, then try to predict the orthologous groups of every gene.
In both scenarios, we download ten genomes from the NCBI website. (Each
genome contains thousands of genes along with their positions on the genome.)
These genes are pairwise compared by the BLAST program. For the Permu-Motif
algorithm, the gap threshold Tgap is set to 300 base pairs and the support
threshold Tsup is set to 3.
Next, we present the experimental results in both scenarios.
Predicting Genes on Multiple Genomes In this setting, we randomly
remove the COGs of 2000 genes from any genome. BBH results in an average
recall of 87%, while our model achieves a recall of 91.2%. As stated above,
our approach also has a precision of 91.2%, while BBH’s precision is 87%.
Predicting Genes on a Single Genome In this setting, we remove the
COGs of all genes on one genome (called the target genome), and keep the
COG information of the other genomes as reference genomes. The approach proposed above is then used to predict the orthologous groups of genes on the
target genome. We conduct the experiments with different target
genomes and compute the average recall and precision of both BBH and our
method. On average our method achieves a recall/precision of 90.1% while
BBH has a recall and precision of 87.3%. The average improvement in recall
and precision of our method is thus about 3%.
Parameters In this subsection, we examine how the parameter settings affect the results of ortholog prediction. Two parameters can be varied in the ortholog discovery approach proposed above: the support threshold for frequent
permutation motifs (gene motifs) Tsup and the top-K value used when choosing the
candidate orthologous groups for a gene.

Varying the support threshold Tsup varies the number of gene motifs
discovered: the lower Tsup is, the more gene motifs will be discovered. The
recall and precision of our proposed ortholog discovery approach for
different support thresholds Tsup are plotted in Figure 3.8. In this test, a
gene is mapped to one ortholog group, thus the recall is equal to the precision.
Figure 3.8: Accuracy of Permu-Motif w.r.t. support.
To assign a gene to its candidate orthologous groups, we choose the
top-K most common orthologous groups of this gene's hits on other genomes.
When K is set to 1, this is essentially the same as BBH. When K is larger, more
candidate groups will be assigned to each gene. The effect of the K value on
the precision and recall of the ortholog discovery method is shown in Figure
3.9. The recall/precision increases when K changes from 1 to 4. However, the
recall and precision decrease slightly when K becomes larger than 4. The
reason for this is that assigning each gene to too many candidate orthologous
groups introduces too much noise. Even when the pruning step is conducted
later by applying the discovered motifs, the errors introduced by the noise
will not be fully removed. In this experiment, BBH is used as the baseline
model. Since BBH always assigns a gene to the COG group with the best hits,
67
the recall of BHH stays constant. (BHH also has the same recall and precision
since it assigns a gene to only one COG.) The benefit of our method comes
from the fact that the gene motif can be used as a tool in finding the correct
orthologous group when a gene could be grouped into multiple groups, i.e., the gene has high similarity with genes in multiple groups. When K = 1, each gene is assigned to one orthologous group, so there is no benefit from our method.
However, with a larger K, i.e., a gene may be associated with multiple groups,
our method can prune out the incorrect group assignment. For instance, if
gene g has higher similarity with genes in group A than genes in group B, the
BBH will assign A to g. However, g’s true annotation may be B. In this case,
the gene motif may be able to correctly assign g to B based on the location of
gene g.
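The top-K candidate selection described above can be sketched as follows. This is a minimal illustration under an assumed input format: a flat list of the COG group labels of a gene's best hits on other genomes.

```python
from collections import Counter

def candidate_groups(hit_groups, k):
    """Return the top-k most common orthologous groups among a gene's
    best hits on other genomes.  `hit_groups` (a hypothetical input
    format) lists one COG group label per hit genome."""
    return [group for group, _ in Counter(hit_groups).most_common(k)]

# With k = 1 this reduces to a single BBH-style assignment; a larger k
# admits more candidate groups, to be pruned later using gene motifs.
hits = ["COG0624", "COG0624", "COG1881", "COG0624", "COG1881", "COG0167"]
print(candidate_groups(hits, 1))  # ['COG0624']
print(candidate_groups(hits, 3))  # ['COG0624', 'COG1881', 'COG0167']
```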
Figure 3.9: Accuracy of Permu-Motif w.r.t. Top-k Ortholog Candidates
Comparison on Example Genomes Here we present a detailed comparison of our method and BBH on a few example target genomes. The first genome tested is the complete genome of Mycobacterium tuberculosis H37Rv (NC 000962). The second genome is Bordetella pertussis Tohama I
(NC 002929). The summary of the results is presented in Table 3.6.
In the complete genome of Mycobacterium tuberculosis H37Rv there are 3927 genes in total, of which 2756 genes have COG assignments. BBH correctly predicted 2394 genes out of 2756, resulting in a recall of 86.8%. Our approach outperformed BBH by successfully predicting 2576 genes, which leads to a recall of 93.5%. 186 genes were correctly predicted by our method but not by BBH. Table 3.7 lists some example genes correctly predicted by our method but not by BBH.
In the complete genome of Bordetella pertussis Tohama I, there are 3436 genes in total, of which 2723 genes have COG assignments. BBH had a recall of 89.4% (2436 correctly predicted genes out of 2723), while our approach predicted 2492 genes correctly out of 2723, for a recall of 91.5%. Table 3.8 lists some example genes correctly predicted by our method but not by BBH.
Table 3.6: Results on two example target genomes

  Genome                        NC 000962   NC 002929
  No. of genes                  2756        2723
  Recall of BBH                 86.8%       89.4%
  Recall of Permu-Motif         93.5%       91.0%
  Precision of BBH              86.8%       89.4%
  Precision of Permu-Motif      93.5%       91.0%
  No. of errors by BBH          186         106
  No. of errors by Permu-Motif  66          61
Table 3.7: Example genes correctly predicted by Permu-Motif but not predicted by BBH

  Gene (id): Rv2141c (id: 57116951); COG: COG0624; Motif: {COG0167, COG0624, COG1881}; Occurrence: NC 002755, NC 002945, NC 000962, NC 002935
  Gene (id): Rv0558 (id: 57116754); COG: COG2226; Motif: {COG0438, COG2226}; Occurrence: NC 002944, NC 002945, NC 004369, NC 002755, NC 000962
  Gene (id): Rv1319c (id: 15608459); COG: COG2114; Motif: {COG1637, COG2114, COG2114, COG2114}; Occurrence: NC 002755, NC 002945, NC 000962
  Gene (id): Rv1034c (id: 15608174); COG: COG3039; Motif: {COG0642, COG0745, COG2156, COG2216, COG3039}; Occurrence: NC 002755, NC 002945, NC 000962
  Gene (id): nrp (id: 15607243); COG: COG3320; Motif: {COG0227, COG0474, COG0523, COG0664, COG2217, COG3320, COG3336}; Occurrence: NC 002755, NC 002945, NC 000962
Table 3.8: Example genes correctly predicted by Permu-Motif but not predicted by BBH

  Gene (id): BP0202 (id: 33591446), BP0203 (id: 33591447), BP0210 (id: 33591454), BP0211 (id: 33591455); COG: COG2801; Motif: {COG2801, COG2801}; Occurrence: NC 005090, NC 004369, NC 003295, NC 004741, NC 002662, NC 002929, NC 003902, NC 004337
  Gene (id): BP0153 (id: 33591402); COG: COG3565; Motif: {COG0111, COG0583, COG0642, COG3019, COG3565}; Occurrence: NC 002929, NC 002928, NC 002927
  Gene (id): BP0778 (id: 33591402); COG: COG0318; Motif: {COG0318, COG0604, COG1802, COG4625}; Occurrence: NC 002929, NC 002928, NC 002927
Chapter 4
WIGM: Discovery of Subgraph Motifs in a
Large Weighted Graph
4.1 Introduction
In addition to sequential biological data, an important part of biological data is represented as graphs. A large amount of data mining research has been devoted to analyzing such data, e.g., subgraph motif mining and subgraph indexing and matching. These graph analysis tools can discover important inherent motifs or characteristic graphs that represent biological networks. Many of these tools have proven very useful in application domains ranging from chemical structure search in chemical compounds and bug detection in software engineering to community identification in social networks. However, most current research has focused on unweighted graphs.
In reality, weighted graphs are common in many applications such as biological networks. Among the most common uses of weighted graphs in biology is the construction of protein-protein interaction (PPI) networks, weighted graphs that endeavor to capture and help explain physical and functional interactions among proteins in the cell [53, 17]. Each vertex in a PPI network represents a protein, and an edge weight is the likelihood, or a logarithmic transformation of the likelihood, that the two proteins physically interact or are functionally related. This likelihood is generally obtained through the integration of experimental, textual, and electronically inferred information that indicates two proteins are functionally related [17]. A motif, in this scenario, might be a subgraph that is frequent in a collection of networks. Such subgraphs may represent functionally modular processes that are conserved throughout evolution, e.g., a metabolic pathway, i.e., a series of metabolic reactions [45], or a signaling pathway, i.e., a series of interactions that transfer an external signal into the cell to activate a cellular response [74, 9].
To solve these problems, a weighted subgraph motif model is more suitable than an unweighted subgraph motif. In many applications, the weight of an edge represents the interestingness of the edge or the probability of the existence of the edge. From a biological perspective, the importance of a motif g should be proportional to the weights of the occurrences of g, since the weights represent the likelihood that the subgraph exists. In a large graph, there are two issues worth considering when defining the model: (1) When a motif has more edges, the occurrences of this motif will have a larger number of edges, and thus the occurrences would carry a larger weight. This could exaggerate the importance of motifs with more edges. (2) In a large graph, the matches or occurrences of a motif may heavily overlap with each other. For instance, in Figure 4.1, the motif g has three edges: (v0, v1), (v1, v2) and (v2, v3). g occurs three times in the database graph G, and these occurrences differ at only one edge. Thus, two edges are shared by all three occurrences: the edges (u2, u4) and (u4, u5) in G will each be counted three times. As a result, the weights of these overlapping edge occurrences can be over-counted.
Several methods [87, 13, 24, 76, 43] have been proposed to address the problem of quantifying the importance (support) of a motif in a set of unweighted graphs, a single unweighted graph, or a set of weighted graphs. Details about the definitions of support in these models are provided in the related work section. However, since they are not designed for a single weighted graph, these models may not be directly applicable to the problem studied in this chapter. Therefore, we design a new support model, called the normalized weight model. Intuitively, the importance of a pattern in a single weighted graph should be equal to the aggregated weights of its occurrences in the graph. However, many occurrences of the pattern in the graph may overlap, and our support measure handles overlaps in a special way. Moreover, since we do not want to exaggerate the importance of a pattern of large size, we normalize the support of a pattern by its size. We explain the motivations and characteristics of the new support model in more detail later.
Although the proposed weighted subgraph motif model is meaningful, it does not possess the useful downward-closure property (i.e., the apriori or anti-monotone property). If a subgraph support model possesses the downward-closure property, then whenever the support of a graph is above a threshold, the support of any of its subgraphs must also be above the threshold. Many existing subgraph mining algorithms utilizing the downward-closure property prune the search space effectively by starting the search from smaller subgraphs
Figure 4.1: Example of A Database Graph G and A Subgraph Motif g
[Figure: the motif g is a path on vertices v0, v1, v2, v3; the database graph G has vertices u0–u5 with edge weights 1.0, 1.2, 0.9, 0.1, and 2.5.]
and extending a subgraph only if its support is above the threshold. However,
in this case, those algorithms utilizing the downward-closure property cannot
be directly applied to this problem. Fortunately, we are able to identify an
alternative, weaker property: the 1-extension property. Namely, we call a motif with weight over a given threshold a strong motif. In contrast to the anti-monotone property, the 1-extension property states that a strong subgraph motif can be partitioned into two motifs, one of which is strong and the other of which is either a strong motif or a 1-extension subgraph motif, where a 1-extension subgraph motif can be obtained by adding an edge to a strong motif. This
novel property can be used to prune the search space and guide the mining process efficiently. In this chapter, we describe a novel algorithm, called
WIGM, to mine the weighted subgraph motifs. Two versions of WIGM are
designed: threshold-WIGM (t-WIGM for short) is designed for mining motifs
with weights above a user-defined threshold and top-k-WIGM (k-WIGM for
short) is designed to discover the top k motifs with the largest weights.
4.2 Related Work
There are three categories of work related to the problem studied in this
chapter: (1) subgraph motif mining from a set of graphs, (2) subgraph motif
mining in a single graph, and (3) subgraph motif mining in a set of weighted
graphs. In recent years, a large number of algorithms have been proposed for
frequent subgraph mining (e.g., [90][40][11][58][49]). These algorithms focus
on mining frequent subgraph motifs from a set of graphs not carrying weights.
Since there are a large number of subgraph mining algorithms, we will not elaborate on the work in this category.
The second category of related work is on mining frequent subgraph motifs from a single large graph. The main challenge lies in computing the support of a motif. Support measures that simply count the occurrences of a motif may violate the anti-monotonic property (i.e., the Apriori property), since occurrences of the motif may overlap with each other. In [48], a support
measure called maximum independent set support measure (MIS) is proposed.
In this model, the support of a subgraph motif g is the maximum number of
non-overlapping occurrences of g. It has been proven in [87] that this support
measure possesses the anti-monotonic property. The authors of [24] propose a variant of the MIS support measure; the only difference is the definition of occurrence overlap. Two occurrences of a motif are defined as overlapping if they share any common vertex. In [87], the authors provide a tentative support measure that counts each occurrence partially, depending on how many overlaps it has with other occurrences. An overlapping graph is built based on the overlap of occurrences: each node in the overlapping graph represents an occurrence of the motif, and if two occurrences of the motif overlap in G, there is an edge in the overlapping graph between the respective occurrences (nodes). The weight of each node is the reciprocal of its degree, and the support of the motif is the total weight of all nodes in the overlapping graph. In this model, the degree of overlap (the number of edges in the overlap and the weight on these edges) is not considered. In [13], for each vertex v in g, let M(v) be the number of distinct vertices that v is mapped to; the support of g is the minimum M(v) over all v in g. Although several of these models have been shown to be useful and hold the anti-monotone property, none of them can be applied to the scenario of subgraph motif mining in a single weighted graph.
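The minimum-image-based measure of [13] can be sketched as follows, assuming occurrences are already available as vertex mappings (the enumeration itself is out of scope here):

```python
def min_image_support(occurrences):
    """Minimum-image-based support in the spirit of [13]: for each
    motif vertex v, count the distinct database vertices that v is
    mapped to across all occurrences, and take the minimum count.
    `occurrences` is a list of dicts mapping motif vertices to
    database-graph vertices (a hypothetical input format)."""
    if not occurrences:
        return 0
    return min(
        len({occ[v] for occ in occurrences})
        for v in occurrences[0]
    )

# Three occurrences of a path motif (v0, v1, v2): v1 and v2 are always
# mapped to the same database vertices, so the support is 1 even
# though the motif matches three times.
occs = [
    {"v0": "u0", "v1": "u2", "v2": "u4"},
    {"v0": "u1", "v1": "u2", "v2": "u4"},
    {"v0": "u3", "v1": "u2", "v2": "u4"},
]
print(min_image_support(occs))  # 1
```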
Recently, researchers have been working on mining subgraph motifs from a set of weighted graphs. In [76], each database graph has internal weights associated with each vertex and an external weight representing the importance of the graph itself. Accordingly, a motif g has an external weight, defined as the accumulated external weights of the database graphs to which g is subgraph isomorphic, and an internal weight, which is generated by counting only the occurrences with the highest aggregated internal weights. The weighted support of a motif g is either the weighted sum of the two weights or its external weight under an internal weight constraint. In [43], the weight of a motif g is defined as the sum of the weights of the graphs containing g. These methods are proposed for computing the weighted support of a motif in a set of weighted graphs; however, they cannot be applied directly to the context of a single large weighted graph.
Overall, there exists research on mining subgraph motifs in a single graph or in a set of weighted graphs. However, not much has been done on discovering subgraph motifs in a single weighted graph, which is the focus of this chapter. Due to the difficulty of this problem, we develop a new model and algorithm.
4.3 Background
This section provides some terminology related to subgraph motif mining. Generally speaking, a graph is an abstract representation of a set of objects (referred to as vertices or nodes) where some pairs of vertices are connected by links (referred to as edges).
Definition 17. A labeled graph G is a five-element tuple G = (V, E, ΣV, ΣE, LG) where V is the set of vertices and E ⊆ V × V is a set of edges. ΣV is the set of vertex labels and ΣE is the set of edge labels. The labeling function LG defines the mappings V → ΣV and E → ΣE.
Definition 18. Given two graphs G1 = (V1 , E1 , ΣV1 , ΣE1 , LG1 ) and G2 =
(V2 , E2 , ΣV2 , ΣE2 , LG2 ), G1 is a subgraph of G2 if both of the following conditions are satisfied.
• V1 ⊆ V2 and ∀v ∈ V1 , LG1 (v) = LG2 (v)
• E1 ⊆ E2 and ∀(u, v) ∈ E1 , LG1 (u, v) = LG2 (u, v)
Likewise, G2 is called a supergraph of G1 .
Definition 19. A graph G1 = (V1, E1, ΣV1, ΣE1, LG1) is isomorphic to another graph G2 = (V2, E2, ΣV2, ΣE2, LG2), denoted as G1 ≈ G2, if and only if a
bijection f : V1 → V2 exists, such that
• ∀u ∈ V1 , LG1 (u) = LG2 (f (u)),
• ∀u, v ∈ V1 , (u, v) ∈ E1 ↔ (f (u), f (v)) ∈ E2 , and
• ∀(u, v) ∈ E1 , LG1 (u, v) = LG2 (f (u), f (v)).
Definition 20. A graph G1 is subgraph isomorphic to another graph G2, denoted as G1 ⊆ G2, if and only if G1 is isomorphic to some subgraph g of G2. The subgraph g is called a matching or occurrence of G1 in G2.
Definition 21. A weighted graph is the same as a labeled graph with one addition: a real-number weight w_e is associated with each edge e in G.
For example, in protein-protein interaction (PPI) networks, the weight
of an edge may be the logarithmic transformation of the likelihood that the
two proteins interact. Similarly, in social networks, the weight of an edge may
represent the strength of the relationship between two people. The weight of a weighted labeled graph G, representing the logarithmic transformation of the likelihood of G and denoted as W(G), is equal to the sum of the weights of all edges in G.
In a similar way, a labeled graph g can be defined as isomorphic
(or subgraph isomorphic) to a weighted labeled graph G by ignoring the
weights in the weighted labeled graph. For example, in Figure 4.1, graph g is
subgraph isomorphic to graph G.
Here is a list of terms we will use in this chapter.
• An undirected graph is a graph in which edges have no orientation. Each edge in an undirected graph is an unordered pair of vertices; edge (a, b) is the same as edge (b, a).
• A path in an undirected graph is an ordered sequence of vertices such that there is an edge connecting each pair of consecutive vertices in the sequence.
• A connected graph is a graph in which there is at least one path between any two vertices.
• A tree is an undirected connected graph in which any two vertices are connected by exactly one path.
4.4 Problem Statement
In this section, we define the problem we are trying to solve and present
some preliminaries. Without loss of generality, graphs are assumed to be
undirected since it is very easy to extend the problem setting to directed
graphs. In addition, we focus on the discovery of connected subgraph motifs
since in most applications, only connected subgraph motifs are interesting to
the users.
There are two general formulations of the problem of subgraph motif mining: (1) subgraph motif mining from a set of graphs and (2) subgraph motif mining in a single large graph. In the former, the input is a set of graphs of relatively small size; in the latter, the input is a single large graph.
The importance of a subgraph motif is measured by its support. The way the support of a motif is defined depends on the nature of the problem. For example, in the context of discovering frequent subgraph motifs from a set of unweighted graphs, the support of a subgraph motif g is usually defined as the number of input graphs to which g is subgraph isomorphic, regardless of how many times the subgraph motif actually occurs in a particular input graph. More commonly, it is defined as the fraction of database graphs to which g is subgraph isomorphic.
However, as we have mentioned, in many real applications graphs have weights associated with their edges, and an edge weight is a real number representing the probability, or a logarithmic transformation of the probability, of the existence of the edge. We adopt the term weighted graphs for graphs having edge weights in this chapter. Our aim in this chapter is to discover subgraph motifs from a single large weighted graph. Most support measures defined for other settings are not suitable for this problem. To avoid confusion, it is worth mentioning that the input graph (sometimes referred to as the database graph in this chapter) has a weight associated with each of its edges, but the subgraph motif does not, because the weight of a subgraph motif depends on the edge weights of its occurrences in the input graph.
In the context of subgraph mining from a single large graph, the most
straightforward and traditionally used support measure is occurrence based.
The support of a subgraph motif g in a single large graph G is determined by
the number of occurrences of g in G. However, many occurrences of g may
overlap. This could cause a problem if the overlap is high. As in Figure 4.1, the edges (u2, u4) and (u4, u5) may be counted three times and their frequencies could be over-amplified. As discussed in the previous section, although several models have been proposed to quantify the support of a motif in a single graph, none of them can be applied directly to the scenario of a single large weighted graph. Thus, in this chapter, a new support model is proposed. The union of all edges in all occurrences of a motif g forms a support set for g. Therefore, the weight of every edge e in all occurrences makes the same contribution to the overall importance of a subgraph motif, regardless of how many occurrences e participates in.
Definition 22. Given a weighted labeled graph G and a labeled graph g, the support set of g in G, denoted as Sup(G, g), is the set of distinct subgraphs in G which are isomorphic to g. These subgraphs are called occurrences of g in G. Two occurrences are considered distinct if they differ in at least one vertex or one edge. The support edge set of g in G, denoted as Sup_edge(G, g), is defined as the union of the edge sets of all subgraphs in Sup(G, g).
Notice that the subgraph motif g is not weighted, i.e., there is no weight associated with any edge in g, while the large graph G is weighted. In the example of Figure 4.1, Sup(G, g) consists of the three occurrences of g, while Sup_edge(G, g) consists of five edges: (u0, u2), (u1, u2), (u2, u3), (u2, u4), and (u4, u5).
Definition 23. Given a weighted labeled graph G and a connected labeled graph g, the weighted support of g in G, denoted as WSup(G, g), is the sum of the weights of all edges in Sup_edge(G, g), i.e., Σ_{e∈Sup_edge(G,g)} W(e).
For example, in Figure 4.1, the weighted support of g in G is 5.7. However, with this definition of weighted support, we may give unreasonably high weights to motifs with more edges. For example, if a pattern with one edge occurs 100 times, then its support edge set has 100 edges. On the other hand, if a pattern with 200 edges occurs once, then its support edge set has 200 edges. As a result, a motif with more edges is more likely to have a higher weighted support than motifs with fewer edges. Thus, the overall weighted support of a motif should be normalized by the number of edges in the motif.
Definition 24. Given a weighted labeled graph G and a connected labeled graph g, the normalized weighted support of g in G, denoted NWSup(G, g), is equal to WSup(G, g)/|E(g)|, where |E(g)| is the number of edges in g.
In Figure 4.1, the normalized weighted support of g in G is 5.7/3 = 1.9.
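The two definitions above can be sketched as follows. The occurrence lists mirror Figure 4.1, but the assignment of individual weights to specific edges is assumed for illustration; the text fixes only the totals (WSup = 5.7, NWSup = 1.9).

```python
def weighted_support(occurrences, weights):
    """WSup(G, g): sum of the weights over the *union* of the edges of
    all occurrences, so an edge shared by several occurrences is
    counted only once (Definitions 22 and 23)."""
    support_edge_set = set().union(*(set(occ) for occ in occurrences))
    return sum(weights[e] for e in support_edge_set)

def normalized_weighted_support(occurrences, weights, motif_edge_count):
    """NWSup(G, g): the weighted support divided by |E(g)|."""
    return weighted_support(occurrences, weights) / motif_edge_count

# Figure 4.1: the 3-edge motif g occurs three times in G, and the
# occurrences share edges (u2, u4) and (u4, u5).
w = {("u0", "u2"): 1.0, ("u1", "u2"): 1.2, ("u2", "u3"): 0.9,
     ("u2", "u4"): 0.1, ("u4", "u5"): 2.5}
occs = [
    [("u0", "u2"), ("u2", "u4"), ("u4", "u5")],
    [("u1", "u2"), ("u2", "u4"), ("u4", "u5")],
    [("u2", "u3"), ("u2", "u4"), ("u4", "u5")],
]
print(round(weighted_support(occs, w), 6))                # 5.7
print(round(normalized_weighted_support(occs, w, 3), 6))  # 1.9
```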
Problem Statement: In this chapter, we aim to solve the following two problems. Given a weighted labeled graph G, the first problem is to find all connected labeled subgraphs whose normalized weighted support in G is larger than or equal to some user-specified threshold t. Since it may be difficult to specify the threshold t in some applications, an alternative problem formulation is provided as follows: given an integer k, find the k connected subgraphs which have the highest normalized weighted supports in G.
4.5 1-Extension Property
The anti-monotonic property (i.e., the Apriori property) has been one of the most widely applied properties to guide data mining algorithms. It enables effective pruning of the search space by bounding the support statistic of larger (more specific) motifs based on the statistics of smaller (more
Figure 4.2: Example of a Database Graph G and Multiple Subgraph Motifs
g1 , g2 and g3
general) sub-motifs. Unfortunately, the normalized weighted support model does not possess this property. For example, in Figure 4.2, g1 is a supergraph of g2. g1 occurs four times with WSup(G, g1) = 13.6, while g2 occurs twice with WSup(G, g2) = 3.6. Since g1 has two edges and g2 has one edge, NWSup(G, g1) = 6.8 and NWSup(G, g2) = 3.6. This violates the anti-monotonicity property since g2 is a subgraph of g1 but the support of g2 is smaller than that of g1.
Fortunately, the weighted support model possesses another, weaker property, called the 1-extension property. Before presenting the property, we first define some terminology.
Definition 25. Given a weighted labeled graph G and a normalized weighted support threshold t, a connected labeled subgraph motif g is called strong if NWSup(G, g) ≥ t; otherwise, g is called a weak motif. A motif g (with at least two edges) is called a 1-extension motif of a strong motif (1-extension motif for short) if (1) g is a weak motif and (2) there exists a connected subgraph g′ of g such that g′ has one less edge than g and g′ is a strong motif. Any weak graph motif with a single edge is also defined as a 1-extension motif.
In other words, a 1-extension motif can be obtained by adding one edge to a strong motif. For example, in Figure 4.2, with threshold t = 8, g3 is a strong motif because NWSup(G, g3) = 10. g2 is a 1-extension motif since it consists of only one edge and is a weak motif. g1 is also a 1-extension motif because it can be obtained by adding edge (b, d) to g3.
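The classification under Definition 25 can be sketched as follows. The check for a strong subgraph with one less edge is passed in by the caller, since it would require subgraph enumeration not shown here:

```python
def is_strong(nwsup, t):
    """Definition 25: a motif is strong iff NWSup(G, g) >= t."""
    return nwsup >= t

def is_one_extension(nwsup, n_edges, t, has_strong_sub_one_edge_less):
    """A 1-extension motif is weak and is either a single edge or is
    obtained by adding one edge to a strong motif.  The last condition
    is supplied by the caller (it needs subgraph enumeration)."""
    if is_strong(nwsup, t):
        return False
    return n_edges == 1 or has_strong_sub_one_edge_less

# Figure 4.2 with t = 8: g3 (NWSup = 10) is strong; g2 (one weak edge,
# NWSup = 3.6) and g1 (NWSup = 6.8, g3 plus one edge) are 1-extension.
t = 8
print(is_strong(10, t))                    # True
print(is_one_extension(3.6, 1, t, False))  # True  (g2)
print(is_one_extension(6.8, 2, t, True))   # True  (g1)
```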
Let Cont(G, g, E) be the sum of the weights of all edges in the edge set E, where E is a subset of the edge set of g (E ⊆ E(g)), and we count the weight of an occurrence of an edge e only if the occurrence of e lies within an occurrence of g. For example, in Figure 4.2, the support edge set of g1 in G is {(u0, u2), (u1, u2), (u2, u4), (u4, u5), (u5, u6), (u5, u7)}. Let E = {(v1, v2)}. Among these six edges in G, (v1, v2) accounts for two edges, (u2, u4) and (u4, u5). As a result, Cont(G, g1, E) = 3.6. In summary, we calculate Cont(G, g, E) as follows. First, we compute Sup_edge(G, g), the support edge set of g in G. Then we compute Sup_edge(G, E), the support edge set of E in G, where E ⊆ E(g). Next, we take the intersection of the two edge sets, Sup_edge(G, g) ∩ Sup_edge(G, E). Cont(G, g, E) is the sum of the weights of all edges in this intersection.
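The three-step computation of Cont can be sketched directly as a set intersection. The split of the 3.6 total between the two shared edges of Figure 4.2 is assumed for illustration, since only the total is given in the text:

```python
def cont(sup_edge_g, sup_edge_E, weights):
    """Cont(G, g, E): total weight of the edges that occur both inside
    occurrences of g and as occurrences of E, i.e. the weight of the
    intersection of the two support edge sets."""
    return sum(weights[e] for e in sup_edge_g & sup_edge_E)

# Figure 4.2 example with E = {(v1, v2)}.
w = {("u2", "u4"): 1.6, ("u4", "u5"): 2.0}  # assumed split of 3.6
sup_edge_g1 = {("u0", "u2"), ("u1", "u2"), ("u2", "u4"),
               ("u4", "u5"), ("u5", "u6"), ("u5", "u7")}
sup_edge_E = {("u2", "u4"), ("u4", "u5")}
print(round(cont(sup_edge_g1, sup_edge_E, w), 6))  # 3.6
```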
It is obvious that Cont(G, g, E(g)) = WSup(G, g). To facilitate the proof of the 1-extension property, we first prove the following lemmas.
Lemma 2. For a given weighted labeled graph G and a connected subgraph motif g, if g′ is a connected subgraph of g and E is a subset of the edges in g′, then Cont(G, g, E) ≤ Cont(G, g′, E).
Proof. Since g′ is a subgraph of g, each occurrence of g must contain an occurrence of g′. Therefore, given that E ⊆ E(g′), we have
(Sup_edge(G, g) ∩ E) ⊆ (Sup_edge(G, g′) ∩ E)   (4.1)
Therefore, Cont(G, g, E) ≤ Cont(G, g′, E).
Lemma 3. For a given weighted labeled graph G and a connected subgraph
motif g, if g1 and g2 are two connected subgraphs of g and the set of edges
in g is equal to the union of the edges in g1 and g2 , then W Sup(G, g) ≤
W Sup(G, g1) + W Sup(G, g2).
Proof. Since g1 and g2 are both subgraphs of g, each occurrence of g in G must contain an occurrence of g1 and an occurrence of g2. Moreover, we are given that the set of edges in g is equal to the union of the edges in g1 and g2. Thus, the support edge set of g must be a subset of the union of the support edge set of g1 and the support edge set of g2:
Sup_edge(G, g) ⊆ Sup_edge(G, g1) ∪ Sup_edge(G, g2)   (4.2)
Thus, WSup(G, g) ≤ WSup(G, g1) + WSup(G, g2).
Lemma 4. Given a weighted labeled graph G, a normalized weighted support
threshold t and a connected subgraph motif g, let g1 and g2 be two connected
subgraphs of g satisfying the following conditions (1) there is no overlapping
edge between g1 and g2 , (2) the set of edges in g is equal to the union of the
edges in g1 and g2 . Then if g is a strong motif, it is impossible that g1 and g2
are both weak motifs.
Proof. We prove the lemma by contradiction. Assume that g1 and g2 are both weak. Then we have
NWSup(G, g1) = WSup(G, g1)/|E(g1)| < t   (4.3)
and
NWSup(G, g2) = WSup(G, g2)/|E(g2)| < t   (4.4)
By Lemma 3, we have
WSup(G, g) ≤ WSup(G, g1) + WSup(G, g2)   (4.5)
Given that there is no overlapping edge between g1 and g2 , and the set of edges
in g is equal to the union of the edges in g1 and g2 , we have
|E(g)| = |E(g1)| + |E(g2)|   (4.6)
Then, we have
NWSup(G, g) = WSup(G, g)/|E(g)| ≤ (WSup(G, g1) + WSup(G, g2))/(|E(g1)| + |E(g2)|) < t   (4.7)
This contradicts the given condition that g is strong. Therefore, at least one of g1 and g2 must be strong.
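The last step of the proof combines the two per-part bounds through the mediant inequality, which may be worth stating explicitly:

```latex
% Mediant inequality (positive denominators b, d):
% from a < tb and c < td it follows that
%   (a + c)/(b + d) < (tb + td)/(b + d) = t.
\frac{a}{b} < t \quad\text{and}\quad \frac{c}{d} < t
\;\Longrightarrow\;
\frac{a+c}{b+d} < t, \qquad b,\, d > 0.
```

Here a = WSup(G, g1), b = |E(g1)|, c = WSup(G, g2), and d = |E(g2)|.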
Property 3. (1-Extension Property:) Given a weighted labeled graph G, a
normalized weighted support threshold t and a connected strong subgraph motif
g, there must exist two connected subgraphs g1 and g2 of g satisfying all of the
following conditions:
• there is no overlapping edge between g1 and g2 ,
• the set of edges in g is equal to the union of the edges in g1 and g2 , and
• either g1 and g2 are both strong motifs, or one is a strong motif and the
other is a 1-extension motif.
Proof. The proof of this property is somewhat tedious. We therefore give a formal proof for the case where g is a tree and a sketch of the proof for the case where g is a general graph.
Let g be a tree and r be the root of g. g is partitioned into x disjoint branches (motifs) g1, g2, . . . , gx, where x = deg(r). Let gi be the motif with the lowest Cont(G, g, E(gi))/|E(gi)| among the x motifs (1 ≤ i ≤ x).
Now we partition g into two motifs: gi and g′ = g − gi. Figure 4.3 shows an example of gi and g′ for a tree gT. Both gi and g′ are connected because both of them are subtrees of g.
First, we prove that if Cont(G, g, E(gi))/|E(gi)| ≥ t, then both g′ and gi are strong.
Since gi is the motif with the lowest Cont(G, g, E(gi))/|E(gi)| and Cont(G, g, E(gi))/|E(gi)| ≥ t, we have
Cont(G, g, E(g′))/|E(g′)| ≥ Cont(G, g, E(gi))/|E(gi)| ≥ t   (4.8)
By Lemma 2, we have
Cont(G, g′, E(g′)) ≥ Cont(G, g, E(g′))   (4.9)
and
Cont(G, gi, E(gi)) ≥ Cont(G, g, E(gi))   (4.10)
By (4.8), (4.9), (4.10) and the definition of normalized weighted support,
NWSup(G, g′) = WSup(G, g′)/|E(g′)| = Cont(G, g′, E(g′))/|E(g′)| ≥ Cont(G, g, E(g′))/|E(g′)| ≥ t   (4.11)
and
NWSup(G, gi) = WSup(G, gi)/|E(gi)| = Cont(G, gi, E(gi))/|E(gi)| ≥ Cont(G, g, E(gi))/|E(gi)| ≥ t   (4.12)
Therefore, if Cont(G, g, E(gi))/|E(gi)| ≥ t, then both g′ and gi are strong, and the property holds.
Otherwise, if gi is a weak motif, g′ must be strong according to Lemma 4. In that case, we travel down branch gi from its root and recursively divide gi.
Figure 4.3: Example of Partitioning a Tree
In the remainder of the proof, we let u be the root of gi and v be the child of u (u has only one child because gi is always a branch). There are three cases based on the degree of v: deg(v) = 1, deg(v) = 2, and deg(v) > 2.
When deg(v) = 1, v is a leaf and there is only one edge in branch gi.
Thus, g′ is a strong motif and gi is a 1-extension motif by definition. Therefore, the property holds.
When deg(v) = 2, we move the edge (u, v) from gi to g′. If gi is now strong, then g′ is either a strong motif or a 1-extension motif, since g′ was strong before edge (u, v) was added. If gi is still weak, g′ must be strong according to Lemma 4. Then, we continue traversing down gi.
For deg(v) > 2, there are at least two downward branches starting at vertex v in addition to the edge (u, v). First, edge (u, v) is moved from gi to g′. If gi becomes strong, the property holds. Otherwise, gi is partitioned based on its downward branches. The branch with the lowest normalized weighted support remains in gi while the other branches are moved to g′. Since gi is still weak, g′ must remain strong according to Lemma 4. The procedure continues.
In this procedure, both gi and g′ are always connected, and one of two termination conditions must eventually occur: (1) gi becomes strong, or (2) deg(v) = 1. In case (1), gi is strong and g′ is either strong or a 1-extension motif (we have explicitly shown that g′ remains either strong or a 1-extension motif at each step of the procedure). The property holds. In case (2), gi has a single edge, which is a 1-extension motif by definition. The property also holds.
Next, we give a sketch of the proof for the case where g is a general connected graph. Let gT be a spanning tree of g. A process similar to the one for the tree is then performed on gT, with three modifications to keep g′ and gi connected subgraphs. (1) gi and g′ are graphs instead of trees. (2) When taking one branch of the motif, we need to take both the branch in the spanning tree and the edges having both endpoints in the branch; the edges between gi and g′ are assigned to g′. (3) After moving edge (u, v) from gi to g′, we need to move all edges between u and vertices in gi from the subgraph gi to the subgraph g′, one at a time, before traversing downward.
During this procedure, when an edge is moved from gi to g′, if g′ changes from a strong motif to a weak motif, then gi must have become strong and g′ is a 1-extension motif. Otherwise, when gi contains only one edge, it is a 1-extension motif. Therefore, the property holds.
4.6 Threshold-WIGM: Mining with Weighted Support
To find motifs whose normalized weighted support is above a user-
specified threshold t, one of the straight-forward methods is to start from a
small motif and grow by adding one edge at a time. However, the search space
is exponential and it is not straightforward to effectively prune out the search
space by bounding the normalized weighted support of larger motifs based on
their sub-motifs. In particular, due to the lack of the anti-monotonicity property, it is possible that the normalized weighted support of a connected motif
with m edges may be larger than any of its connected sub-motifs with m − 1
edges, which does not provide any termination condition for the search. For example, in Figure 4.1, NWSup(g) = 1.9, while the normalized weighted supports of the two connected two-edge subgraphs of g are 1.6 and 1.3. In fact, we can only say that for a connected motif P with m edges, there exists a connected sub-motif P′ with ⌈m/2⌉ edges such that the normalized weighted support of P′ is larger than or equal to that of P. For this reason, in order to use the existing depth-first subgraph mining methods, the following modification has to be made: when a strong motif (a connected subgraph motif whose support is larger than or equal to t) with m edges is found, it has to be grown one edge at a time, and the search on this motif can terminate only if none of its connected super-motifs with up to 2m edges is strong. We name this method the base algorithm (t-base). It is obvious that t-base is an inefficient algorithm; the base algorithms are compared with our proposed WIGM algorithms empirically in a later section.
In this section, a more efficient algorithm that finds all motifs with normalized weighted support at least t is presented; it is referred to as threshold-WIGM (t-WIGM for short). The formal description of this algorithm is presented in Algorithm 4.
4.6.1 Main Algorithm
Since all strong motifs can be generated by combining two strong motifs
or combining a strong motif with a 1-extension motif, the following procedure
is employed. The t-WIGM algorithm proceeds iteratively. The main data
structure in this algorithm consists of four sets: S, W, SN, and WN. S stores
all strong motifs while W stores all 1-extension motifs discovered so far. SN
and WN store the newly generated strong and 1-extension motifs discovered
in the previous round, respectively.
Initially, the normalized weighted support of every edge is computed.
If the edge is strong, then it is put into both S and SN. Otherwise, it is put
into W and WN since every weak single-edge graph motif is defined as a 1-extension motif. Notice that initially S = SN and W = WN; in later rounds, S is a superset of SN while W is a superset of WN.
In each of the later rounds, we first generate new strong motifs. A
strong motif may be obtained in two ways: (1) combining two strong motifs
or (2) combining a 1-extension motif and another strong motif. The first case
is equivalent to combining a motif in SN with a motif in S, while the second
case is to combine a motif in S with a motif in WN or combine a motif in SN
with a motif in W. It is not necessary to combine motifs in SN with motifs in
WN because SN is a subset of S. The combination procedure is described in
a later subsection.
For each of these newly generated candidate motifs g, we first test
whether g is already in S, which can be done by using the canonical form of g.
There are many types of canonical forms and any canonical form would work
here. Without loss of generality, the canonical form used in [39] is chosen.
Each graph g can be represented as an adjacency matrix M. Slightly different
from ordinary adjacency matrices, each diagonal entry of M is filled with the
Algorithm 4 Threshold-WIGM
Input: Graph G, minimum normalized weighted support t
Output: A set S of motifs whose normalized weighted support in G is greater than or equal to t.
1: S ← ∅, W ← ∅, SN ← ∅, WN ← ∅
2: for each unique edge e in G do
3:     Calculate NWSup(G, e)
4:     if NWSup(G, e) ≥ t then
5:         Add e into S and SN
6:     else
7:         Add e into W and WN
8:     end if
9: end for
10: while either SN or WN is not empty do
11:     WN′ ← ∅, SN′ ← ∅
12:     for each pair of motifs (p1, p2) in (SN, S), (S, WN), and (SN, W) do
13:         CP ← combine(p1, p2)
14:         for each candidate motif g in CP do
15:             if g is not in S and NWSup(G, g) ≥ t then
16:                 Add g into SN′
17:             end if
18:         end for
19:     end for
20:     for each motif g in SN do
21:         SE ← the set of edges that can be added to g
22:         for each edge e in SE do
23:             Obtain g′ by adding e to g
24:             if g′ is in neither S nor W then
25:                 if NWSup(G, g′) < t then
26:                     Add g′ into WN′
27:                 else
28:                     Add g′ into SN′
29:                 end if
30:             end if
31:         end for
32:     end for
33:     SN ← SN′; WN ← WN′
34:     Add motifs in SN into S; add motifs in WN into W
35: end while
36: return S
label of the corresponding node. Each matrix M is encoded as a sequence by concatenating the lower-triangular entries of M, including the diagonal entries, and matrix codes are compared lexicographically. Since a graph can be represented by multiple matrices, its canonical form is defined as the maximum code among all its possible codes. In theory, graph canonicalization is NP-hard; in practice, when the graph is relatively small, the computation is inexpensive, as the existing literature widely confirms. If g does not exist in S, NWSup(G, g) is calculated. If NWSup(G, g) < t, then g is discarded; otherwise, it is added into SN′, the set of strong motifs generated during this round.
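For small motifs, the canonical form described above can be sketched by brute force. The following Python sketch is only an illustration of the encoding, not the implementation of [39] used here; it assumes integer node and edge labels and takes the maximal code over all vertex orderings:

```python
from itertools import permutations

def graph_code(labels, edges, order):
    """Concatenate the lower-triangular entries (diagonal = node label,
    off-diagonal = edge label or 0) of the adjacency matrix induced by
    the vertex ordering `order`."""
    pos = {v: i for i, v in enumerate(order)}
    n = len(order)
    mat = [[0] * n for _ in range(n)]
    for v in order:
        mat[pos[v]][pos[v]] = labels[v]
    for (u, v), elab in edges.items():
        i, j = pos[u], pos[v]
        mat[i][j] = mat[j][i] = elab
    return tuple(mat[i][j] for i in range(n) for j in range(i + 1))

def canonical_form(labels, edges):
    """Maximal code over all vertex orderings; exponential in general,
    but cheap for the small motifs handled in each round."""
    return max(graph_code(labels, edges, p) for p in permutations(labels))

# Two isomorphic labeled triangles receive the same canonical code.
g1 = ({'x': 1, 'y': 1, 'z': 2}, {('x', 'y'): 9, ('y', 'z'): 9, ('x', 'z'): 9})
g2 = ({'a': 2, 'b': 1, 'c': 1}, {('a', 'b'): 9, ('b', 'c'): 9, ('a', 'c'): 9})
print(canonical_form(*g1) == canonical_form(*g2))  # True
```

Storing these codes in a hash set then provides the "is g already in S" test used above.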
Next, new 1-extension motifs are generated. By definition, a 1-extension motif can be obtained by adding one edge to a strong motif. It is unnecessary to extend all motifs in S, since many of them were already extended in previous rounds; thus only motifs in SN are extended. For each motif g in SN, one more edge is added to g: either an edge connecting two vertices in g, or an edge connecting a vertex in g to a vertex not in g. For each newly extended motif g′, we check whether g′ is in S or W. If not, NWSup(G, g′) is computed; if NWSup(G, g′) < t, g′ is appended to WN′, and otherwise g′ is added to SN′.
The final step in each round is to replace SN and WN with the newly generated SN′ and WN′, respectively. In addition, S and W are updated to include these new motifs. The process terminates when WN = SN = ∅.
4.6.2 Subgraph Motif Combination
One of the main difficulties in t-WIGM is combining two subgraphs g1 and g2. Since we require the resulting motifs to be connected, at least one vertex of each subgraph must share a label with a vertex of the other; if none of the vertex labels in g1 appears in g2, the result of the combination is the empty set. Otherwise, for each vertex v in g1, we find the vertices u in g2 that have the same label as v; such a pair u and v can be merged into one vertex of the new motif. A data structure M maintains the mapping of all pairs of vertices in g1 and g2 that share a label.
Assume that g1 has three vertices v0, v1, and v2 with labels A, A, and C, while g2 consists of three vertices u0, u1, and u2 with labels A, B, and C, as shown in Figure 4.4. The mapping then includes the pairs (v0, u0), (v1, u0), and (v2, u2).
A new combined motif g includes one or more combined vertices. We first generate new motifs with one combined vertex, then with two combined vertices, and so on. The maximum number of combined vertices in a new motif equals the number of disjoint pairs in M, which can be determined by computing a maximum matching in an undirected bipartite graph: the vertices of g1 form one side, the vertices of g2 form the other, and each pair (u, v) in M contributes an edge between u and v. The maximum bipartite matching problem can be reduced to a maximum-flow problem, and we use the Ford-Fulkerson method to find a maximum matching. There are three new motifs with one combined vertex, g1′, g2′, and g3′,
in Figure 4.4(b), and two motifs g4′ and g5′ with two combined vertices. The formal description of the combination algorithm is in Algorithm 5.
In the worst case, the combination algorithm is exponential in the number of pairs of vertices sharing a label. However, our experimental results show that in real applications the algorithm is much more efficient on average.
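The matching step itself is standard. A compact augmenting-path sketch (unit-capacity Ford-Fulkerson on the bipartite label-compatibility graph; the function name is ours) reproduces the bound for the running example above:

```python
def max_label_matching(pairs):
    """Maximum matching in the bipartite graph whose edges are the
    label-compatible vertex pairs (u from g1, v from g2), found by
    repeated augmenting-path search (unit-capacity Ford-Fulkerson)."""
    adj = {}
    for u, v in pairs:
        adj.setdefault(u, []).append(v)
    match = {}  # v -> u currently matched to v

    def augment(u, seen):
        for v in adj.get(u, []):
            if v in seen:
                continue
            seen.add(v)
            # v is free, or its current partner can be re-routed
            if v not in match or augment(match[v], seen):
                match[v] = u
                return True
        return False

    return sum(augment(u, set()) for u in adj)

# Running example of Figure 4.4: M = {(v0,u0), (v1,u0), (v2,u2)}.
pairs = [('v0', 'u0'), ('v1', 'u0'), ('v2', 'u2')]
print(max_label_matching(pairs))  # 2: at most two combined vertices
```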
Figure 4.4: Example of Combining Two Graphs. (a) Two Graphs to Be Combined. (b) Five New Motifs after Combination.
4.6.3 Support Computation
The computation of the normalized weighted support of a motif g is
at the heart of the t-WIGM algorithm. Since it is invoked many times, it is
essential that this computation is performed efficiently. The main difficulty is
to locate all occurrences of a subgraph motif. Subgraph indexing is used to
find occurrences of a motif since it has been shown to accelerate the match
Algorithm 5 Combining Two Motifs
Input: Motifs g1 and g2
Output: A set of new combined motifs CP
1: M ← ∅
2: for each vertex u in g1 do
3:     Find the set of vertices SV in g2 having the same label as u
4:     for each vertex v in SV do
5:         Add (u, v) into M
6:     end for
7: end for
8: l ← maxflow(g1, g2, M)
9: i ← 1
10: while i ≤ l do
11:     for each set of i vertices in g1 do
12:         SU ← the set formed by these i vertices
13:         SV ← the set of (distinct) vertices mapped from the vertices in SU based on M
14:         if |SV| = i then
15:             p ← a new motif generated by combining g1 and g2 through merging the mapped vertices in SV and SU
16:             CP ← CP ∪ {p}
17:         end if
18:     end for
19:     i ← i + 1
20: end while
21: return CP
process dramatically. Without loss of generality, GADDI [93] is chosen as the indexing structure for the large weighted graph. After the matches of a subgraph motif are discovered, the weights of the matched edges are obtained and the normalized weight of the motif is computed based on the definition.
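To make this computation concrete, the sketch below enumerates occurrences by naive label-aware backtracking and sums the matched edge weights. It is only an illustration: the dissertation uses the GADDI index [93] instead of this naive search, the normalization of the totals into NWSup follows the earlier definition (not repeated here), and automorphic re-matchings of the motif are not deduplicated:

```python
def occurrences(host, motif):
    """Enumerate occurrences of `motif` in `host` by naive label-aware
    backtracking.  A graph is a pair (labels, edges): `labels` maps
    vertex -> label, `edges` maps frozenset({u, v}) -> weight."""
    hl, he = host
    ml, me = motif
    mverts = list(ml)

    def consistent(mapping, mv, hv):
        # labels must agree and the embedding must stay injective
        if hl[hv] != ml[mv] or hv in mapping.values():
            return False
        # every motif edge whose other endpoint is already mapped
        # must be present in the host graph
        for pair in me:
            if mv in pair:
                a, b = tuple(pair)
                other = b if a == mv else a
                if other in mapping and frozenset((hv, mapping[other])) not in he:
                    return False
        return True

    def search(i, mapping):
        if i == len(mverts):
            yield dict(mapping)
            return
        mv = mverts[i]
        for hv in hl:
            if consistent(mapping, mv, hv):
                mapping[mv] = hv
                yield from search(i + 1, mapping)
                del mapping[mv]

    return list(search(0, {}))

def occurrence_weights(host, motif):
    """Total weight of the matched edges for each occurrence; turning
    these totals into NWSup follows the definition in the text."""
    he, me = host[1], motif[1]
    return [sum(he[frozenset((m[a], m[b]))]
                for a, b in (tuple(p) for p in me))
            for m in occurrences(host, motif)]
```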
4.6.4 Algorithm Analysis
In this subsection, the correctness of the t-WIGM algorithm is first proven; then its time complexity is analyzed. To prove correctness, we show by induction that every strong motif is enumerated by the algorithm. All single-edge strong motifs and 1-extension motifs are generated in the initialization step. Assume that all 1-extension motifs and strong motifs with i or fewer edges have been enumerated. A strong motif p with i + 1 edges can be constructed either by combining two connected strong motifs with i or fewer edges, or by combining a strong motif and a 1-extension motif with i or fewer edges; therefore, p will be enumerated. In addition, a 1-extension subgraph with i + 1 edges is generated by extending a strong motif with i edges. Since all strong motifs with i edges are enumerated, all 1-extension motifs with i + 1 edges are also enumerated. Thus the threshold-WIGM algorithm finds all strong motifs.
The complexity of the basic algorithm depends heavily on the number of strong motifs discovered. In the worst case, this number could be exponential if all or most subgraph motifs are strong. Let n denote the total number of strong motifs and l denote the number
of distinct edges in G. The number of 1-extension motifs discovered is at most n × l. Since each strong motif needs to be combined with each strong motif and each 1-extension motif, the total number of combinations is at most O(n²l). The complexity of combining two motifs depends on the number of vertices sharing the same label in the two motifs.
In the worst case, the number of iterations is l, since in each iteration the size (the number of edges) of the largest strong motif discovered so far increases by at least one, and the largest possible size of a strong motif is l.
In short, the worst-case time complexity of the basic algorithm is exponential if all or most subgraph motifs are strong, but in practice the algorithm is much more efficient on average; we empirically characterize its efficiency in the next section. The worst-case space complexity is also exponential because all discovered strong motifs and 1-extension motifs must be kept.
t-WIGM has two main shortcomings. First, it is difficult for an end user to set a support threshold. Second, the algorithm may not be efficient, since it has to keep all discovered strong motifs and 1-extension motifs in memory and there is no bound on their number. To address these problems, we propose an alternative model: top-k motifs.
4.6.5 Top-K-WIGM: Mining for Top-k Motifs
When finding top-k motifs, the minimum support threshold t is unknown.
Algorithm 6 k-WIGM-Addition
Input: Graph G, the number k, motif sets S, W, SN, WN
Output: None
1: t ← the kth largest normalized weighted support in S
2: Remove motifs with normalized weighted support less than t from S and SN
3: W ← ∅, WN ← ∅
4: for each motif p in SN do
5:     for each edge e in G do
6:         Initialize a map M ← ∅ and a candidate motif set CP ← ∅
7:         CP ← combine(p, e)
8:         for each candidate motif g in CP do
9:             if NWSup(G, g) < t then
10:                 Add g into W and WN
11:             end if
12:         end for
13:     end for
14: end for
15: for each motif p in S and not in SN do
16:     for each edge e in G do
17:         CP ← combine(p, e)
18:         for each candidate motif g in CP do
19:             if NWSup(G, g) < t then
20:                 Add g into W
21:             end if
22:         end for
23:     end for
24: end for
Therefore, an iterative approach, called top-k-WIGM (k-WIGM for
short), is devised. The top-k motif discovery process is the same as t-WIGM
with one modification: the minimum normalized weighted support threshold
is updated at the end of each round of iteration. At the end of the ith round,
S contains a set of strong motifs. The normalized weighted support of the kth strongest motif is computed and chosen as the minimum support threshold ti for the next round. Motifs with support less than ti are pruned from S and SN. W and WN are updated based on the new S and SN. First, both W and WN
are set to empty. Next, for each motif p in SN, p is extended by one edge and
the new extended motifs are put into WN and W if they are not strong. For
any subgraph motif q in S but not in SN, q is also extended by adding a new
edge and the new extended motifs are included in W if they are not strong.
Since there are at most k distinct motifs in S and SN together, the computation time to generate motifs in WN and W is not significant. k-WIGM
is the same as t-WIGM with one exception: the k-WIGM-Addition procedure
is invoked at the end of each round (between lines 34 and 35 of Algorithm 4). The formal description of
the k-WIGM-Addition procedure is in Algorithm 6.
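The threshold update in lines 1-2 of Algorithm 6 amounts to selecting the kth largest support in S and pruning everything below it. A hedged Python illustration (function names are ours, not the dissertation's):

```python
import heapq

def kth_largest_support(supports, k):
    """kth largest normalized weighted support among the motifs in S;
    None if S holds fewer than k motifs (caller keeps the previous t)."""
    if len(supports) < k:
        return None
    return heapq.nlargest(k, supports)[-1]

def prune_to_top_k(motif_support, k, old_t=0.0):
    """Mirror lines 1-2 of Algorithm 6: raise t to the kth largest
    support and drop motifs below it.  `motif_support` maps
    motif -> NWSup."""
    t = kth_largest_support(list(motif_support.values()), k)
    if t is None:
        return dict(motif_support), old_t
    return {m: s for m, s in motif_support.items() if s >= t}, t

S = {'g1': 2.5, 'g2': 1.9, 'g3': 1.6, 'g4': 1.3}
kept, t = prune_to_top_k(S, k=2)
print(t, sorted(kept))  # 1.9 ['g1', 'g2']
```

Ties at t leave more than k motifs in S, matching the parenthetical remark below, and t never decreases across rounds because S only ever gains motifs whose support is at least the previous threshold.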
In each round, with newly discovered strong and 1-extension motifs, the minimum support threshold ti increases monotonically: ti is chosen as the support of the kth strongest motif in S, the set of strong patterns, and the support of any motif in S is at least ti−1. As a result, the number of motifs in S is kept at k. (S may hold more than k motifs if multiple motifs have the same normalized weighted support.) Due to this
fact, the memory requirement of k-WIGM is quite small. When the algorithm
terminates (i.e., SN is empty), the motifs in S are returned.
k-WIGM is correct because all potential strong motifs are combined with potential strong motifs and potential 1-extension motifs; therefore, no strong motif is missed. Let ti denote the minimum support threshold in the ith round of iteration. For a strong subgraph motif P with normalized weighted support above ti, there exist two subgraph motifs P1 and P2 such that one of them is strong (support greater than ti) and the other is either a strong motif or a 1-extension motif. P1 and P2 are discovered and put into S and/or W in previous rounds, since the threshold in previous rounds is less than or equal to ti. Therefore, P is discovered.
In k-WIGM, we keep only k motifs in the strong motif set S. Thus, there are at most k × l 1-extension motifs in the 1-extension motif set W. The number of combinations per round is at most k²l, and the number of rounds is at most l; therefore, the total number of combinations is at most O(k²l²). The complexity of combining two motifs depends heavily on the number of shared vertex labels in the two subgraphs.
4.7 Experimental Results
We analyze the effectiveness and efficiency of our weighted subgraph
mining models and methods in this section. WIGM algorithms deal with
weighted graphs. To the best of our knowledge, although much work has been done on mining motifs in a set of weighted graphs and in a single non-weighted graph, little literature is available on discovering subgraph motifs in a single large weighted graph. Thus, we could not compare our methods with existing alternative models; instead, we compare WIGM with
a baseline algorithm. In the baseline algorithm, the 1-extension property is not employed; instead, in each round, one more edge is added to the existing motifs, as described in the algorithm section. The two versions of WIGM and the two versions of the baseline algorithm (t-base and k-base) are implemented in C++. All experiments are conducted on a Dell PowerEdge 2950 with two 3.33 GHz quad-core CPUs and 32 GB of main memory, running Linux 2.6.18-92.e15-smp.
4.7.1 Biological Networks
The biological network used in this experiment, Gbio = (Vbio, Ebio, ΣVbio, ΣEbio, LGbio), is constructed from the experimental data of [17] on the fruit fly (Drosophila melanogaster); specifically, Gbio is built from protein-protein interaction data. Vbio is a set of fruit fly genes, and an edge in Gbio is a possible (potential) interaction between two genes. The weight on an edge is the sum of the likelihoods of the multi-modal experimental data supporting a functional relationship between the two genes, which represents the probability of the interaction between the two proteins; an edge exists if this sum is above a determined threshold. We are interested in finding functional subgraph motifs, i.e., patterns of interactions among known functional categories. For this purpose, we use the Gene Ontology, a standardized dictionary of biological processes, molecular functions, and cellular components; specifically, GO Biological Process terms [27] (GO:BP) are used to label the vertices of the target subgraph patterns, so each node in the PPI network is labeled with the biological processes with which the respective protein is annotated. Note that a protein may be involved in multiple biological processes, i.e., a node can have multiple labels.
In summary, there are 7496 vertices (|Vbio| = 7496) with 515 distinct labels and 25408 edges (|Ebio| = 25408); the average vertex degree is 6.78. We apply both WIGM algorithms to this data set with different thresholds. The value of t is chosen so that the same number of motifs is discovered by all four methods: the two WIGM algorithms and the two baseline algorithms.
Table 4.1 shows the execution times of the t-WIGM, k-WIGM, t-base, and k-base algorithms. Here, t-WIGM (threshold-WIGM) is the version of WIGM used to discover all subgraph patterns above a user-specified threshold, and k-WIGM (top-k-WIGM) finds the top-k patterns with the largest weights; t-base and k-base are the baseline versions of t-WIGM and k-WIGM, respectively, without the 1-extension property.
It is clear that t-WIGM saves about 5% to 10% of the execution time compared to k-WIGM. In each round, k-WIGM sets the minimum weight threshold t to the kth largest normalized weighted support in the set of strong motifs
Table 4.1: Results on Biological Network (time in sec.)

  k     t   k-WIGM   t-WIGM   k-base   t-base
  4   176       23       23       47       44
  8    67       67       65      159      149
 16   126      158      149      468      451
 32   115      389      359     1332     1287
 50   107      723      668     3345     3192
100   100     1854     1715    11081    10454
generated. Thus, in the earlier rounds, t is set to a small value and more weak motifs are generated by k-WIGM, which prolongs the execution time. It takes about 6.5 minutes to find the 32 motifs with the highest weights. The execution time of the WIGM algorithms is about 1/2 to 1/6 that of the t-base and k-base methods, because the base algorithms have two shortcomings: (1) the termination condition is loose, so more iterations are needed, and (2) edges are inserted one at a time, so more candidate motifs are generated. (The base algorithms do have one advantage: adding one edge to a motif can be done more efficiently than combining two motifs.) Overall, the base algorithms need a longer execution time than the WIGM algorithms, which makes the pruning power of the 1-extension property evident.
4.7.2 Synthetic Graphs
To better analyze the performance of the WIGM algorithms with respect to different aspects of the input data, a set of synthetic graphs is employed. The input graphs are generated by a tool called gengraph_win [96] based on four parameters: the number of vertices, the number of vertex labels, the average vertex degree, and the standard deviation of the edge weights. In all experiments, we assume that the weight on an edge follows a normal distribution with an average of 10. Table 4.2 shows the default values of these parameters. The degree of a vertex in the input graph G follows an exponential distribution with rate parameter λ = 1/d, where d is the average degree.
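Since gengraph_win [96] is an external tool, the following rough stand-in only illustrates the roles of the stated parameters (exponential degrees with rate 1/d via a configuration-model-style pairing, normally distributed weights with mean 10); it is not the generator used in the experiments:

```python
import random

def synthetic_graph(n=5000, n_labels=500, avg_degree=10,
                    weight_mean=10.0, weight_sd=1.0, seed=0):
    """Rough stand-in for the gengraph_win generator [96]: vertex labels
    drawn uniformly, vertex degrees from an exponential distribution
    with rate 1/avg_degree, edge weights from a normal distribution."""
    rng = random.Random(seed)
    labels = {v: rng.randrange(n_labels) for v in range(n)}
    degree = {v: max(1, round(rng.expovariate(1 / avg_degree)))
              for v in range(n)}
    # Pair endpoint "stubs" at random (configuration-model pairing).
    stubs = [v for v, d in degree.items() for _ in range(d)]
    rng.shuffle(stubs)
    edges = {}
    for u, v in zip(stubs[0::2], stubs[1::2]):
        if u != v:  # drop self-loops; parallel edges collapse in the dict
            edges[frozenset((u, v))] = rng.gauss(weight_mean, weight_sd)
    return labels, edges
```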
There are two extra parameters in this set of experiments: the number of motifs (k) in k-WIGM and the minimum normalized weighted support threshold (t) in t-WIGM. To make the comparisons fair, t is set according to k such that all methods discover the same number of motifs. The default value of k is also shown in Table 4.2. In this section, these parameters are varied one at a time to show their effects on all methods.
Table 4.2: Default Parameter Values

Parameter                  Default Value
Number of Vertices in G    5000
Number of Labels           500
Average Degree of G        10
k                          100
In all experiments, we find that t-WIGM takes 5% to 10% less time than k-WIGM to find the same set of motifs. At the beginning, the threshold used in k-WIGM may be much lower than the true threshold; as a result, many “useless” motifs are discovered by k-WIGM in early rounds, only to be discarded under the higher thresholds of later rounds. In addition, the WIGM algorithms outperform the base algorithms due to the pruning power of the 1-extension property, which reduces both the number of candidate motifs and the number of iterations. When the number of vertices varies from 1000 to 10000, as shown in Figure 4.5, the execution time of both versions of WIGM increases linearly with the number of vertices, while the execution time of the base methods increases exponentially: the number of potential motifs grows with the number of vertices, so the pruning power of the 1-extension property becomes more evident.
Figure 4.5: Execution Time w.r.t. Number of Vertices in G
When the average degree of G increases, the execution time of WIGM increases exponentially, as shown in Figure 4.6. The main reason is that with a higher degree in G, the discovered motifs also have a higher average degree; in such a case, the cost of combining motifs is higher, which leads to a longer execution time. As in the previous figure, a higher degree means more potential candidate motifs, so the 1-extension property can prune more motifs and yields greater execution-time savings over the base algorithms.
Figure 4.6: Execution Time w.r.t. Average Vertex Degree of G
With more distinct label types, the execution time also increases, as illustrated in Figure 4.7, because more candidate motifs are generated in each round. For the base methods, on the other hand, although the number of motifs increases, the pace of the increase is modest; therefore, the improvement of WIGM over the base algorithms remains more or less constant across different numbers of label types.
Figure 4.7: Execution Time w.r.t. Distinct Labels in G
In Figure 4.8, the execution time increases when more motifs are requested. In this case, all four methods take more rounds and the mining process takes more time. With more iterations, the pruning power of the 1-extension property is more significant, and the disparity between WIGM and the base methods is larger.
On average, the discovered motifs in the synthetic graphs consist of around seven vertices and 15 edges. Overall, our proposed WIGM algorithms can efficiently discover strong motifs in graphs with hundreds of thousands of edges and average degrees in the range of 30 or 40. Some very large social networks have millions of vertices and edges, and the WIGM algorithms may not be able to handle such graphs. On the other hand, many real data sets, e.g., biological networks and small social networks, fall into this range. Therefore,
Figure 4.8: Execution Time w.r.t. K
our WIGM algorithms can be used to find important motifs for these graphs
with a manageable execution time.
Chapter 5
Conclusions and Discussion
In the ARCS-motif finder, we use ARCS as the measure of motif importance to discover biologically interesting motifs from unaligned protein sequences. The ARCS measure was originally proposed to detect conserved regions in aligned protein sequences, but it cannot be applied directly to motif discovery in unaligned sequences. We propose a novel algorithm to find ARCS-motifs from unaligned protein sequences. Applying it to real protein data sets, we show that on many data sets our algorithm discovers motifs of better quality in less time than alternative methods.
Currently, we apply the ARCS-motif model and algorithm only to protein sequences. We do not apply the model to DNA sequences because their alphabet size is too small: with an alphabet of size four, dependence between columns may arise by chance, so ARCS may not be an appropriate measure of motif importance in DNA sequences.
In the Permu-Motif algorithm, motivated by the variations, including substitutions and permutations, that arise in many applications such as genetic sequence analysis and text mining, we propose a novel motif model called the interchangeable permutation motif to capture the characteristics of sequences exhibiting these kinds of variations. We also propose a new algorithm, based on a newly defined reachability property, to discover permutation motifs efficiently from sequence databases.
Unlike existing algorithms that predict motifs in a pair of genomes using dynamic programming, our algorithm is highly scalable in discovering motifs from multiple genomes. To show the usefulness of the Permu-Motif model and algorithm in genetic sequence analysis (especially for prokaryote genomes), we apply them to the genome sequences of 97 species and show that the discovered motifs (gene clusters) reveal various biological themes, some of which are not discovered by methods that ignore these kinds of variations or consider only one kind of variation.
Next, we demonstrate the performance of the Permu-Motif algorithm by applying it to a large collection of synthetic data sets.
Last but not least, we illustrate another application of the Permu-Motif model and algorithm: it can be used to detect orthology more accurately, as gene motifs provide context for gene matching. In an experiment detecting orthologies in 10 genomes, our algorithm achieved an average 3% increase in recall compared to BBH without sacrificing the precision of ortholog detection.
In WIGM, we study the problem of subgraph motif mining in a large weighted graph. As far as we know, little literature and few algorithms are available for this problem, so we propose a new subgraph motif model for a single large weighted graph. Although the anti-monotonicity property does not hold for the new model, we identify a weaker property, the 1-extension property, to prune the search space. Based on this property, we propose the t-WIGM algorithm to find all subgraph motifs with weight larger than a user-specified threshold t. Since it is difficult for end users to choose an appropriate threshold t, we propose another version of WIGM, called k-WIGM, to find the top-k subgraph motifs.
Overall, our proposed WIGM algorithms can efficiently discover motifs from graphs with thousands of edges. Many real data sets, such as biological networks and small social networks, fall into this range, and our algorithms can find motifs in these graphs in a manageable execution time. However, some very large social networks have millions of vertices and edges, and the WIGM algorithms may not be able to handle such graphs.
Bibliography
[1] R. Agrawal, R. Srikant. Fast algorithms for mining association rules in large databases.
VLDB, 1994.
[2] R. Agrawal, R. Srikant. Mining Sequential Patterns. ICDE, 1995.
[3] S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, vol. 25, pp. 3389-3402, 1997.
[4] A. Apostolico, M. Comin, L. Parida. Conservative Extraction of Over-represented Motifs, Bioinformatics, vol. 21, No.1, 9-18, 2005.
[5] A. Apostolico, L. Parida. Incremental Paradigms of Motif Discovery. Journal of Computational Biology, vol. 11, no. 1, 15-25, 2004.
[6] T. L. Bailey, C. Elkan. Fitting a mixture model by expectation maximization to discover
motifs in biopolymers. Proc. of Intelligent Systems for Molecular Biology, 28-36, 1994.
[7] A. Bateman, L. Coin, R. Durbin, R. D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E. L. L. Sonnhammer, D. J. Studholme, C. Yeats, and S. R. Eddy. The Pfam protein families database. Nucleic Acids Research, 2004.
[8] Y. Barash, G. Friedman, and T. Kaplan. Modeling dependencies in protein-DNA binding sites, RECOMB, 28-37, 2003.
[9] G. Bebek, J. Yang. PathFinder: mining signal transduction pathway segments from
protein-protein interaction networks, BMC Bioinformatics, 2007.
[10] A. Bergeron, J. Stoye. On the Similarity of Sets of Permutations and Its Applications
to Genome Comparison. COCOON 2003.
[11] C. Borgelt and M. Fiedler. Graph Mining: Repository vs. Canonical Form. Proc. of
Annual Conference of the German Classification Society, 2007.
[12] A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert. Approaches to the automatic
discovery of patterns in biosequences. Journal of Computational Biology, vol 5, 279-305,
1998.
[13] B. Bringmann, and S. Nijssen: What Is Frequent in a Single Graph? PAKDD, 2008.
[14] P. Calabrese, S. Chakravarty, and T. J. Vision. Fast identification and statistical evaluation of segmental homologies in comparative maps. Bioinformatics, 19: 74-80, 2003.
[15] A. Califano. SPLASH: structural pattern localization analysis by sequential histograms.
Bioinformatics, vol. 16, 341-357, 2000.
[16] S. Cong, J. Han, and D.A. Padua. Parallel Mining of Closed Sequential Patterns.
KDD, 2005.
[17] J. Costello, M. Dalkilic, S. Beason, R. Patwardhan, S. Middha, B. Eads, and J. Andrews. Gene networks in Drosophila melanogaster: integrating experimental data to predict gene function. Genome Biology, vol. 10, no. 9, 2009.
[18] T. Dandekar, B. Snel, M. Huynen, and P. Bork. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci, 23: 324-328, 1998.
[19] M. K Das, H. Dai. A survey of DNA motif finding algorithms. BMC Bioinformatics
2007.
[20] G. Deckers-Hebestreit, K. Altendorf. The F0F1-type ATP synthases of bacteria: structure and function of the F0 complex. Annual Review of Microbiology, vol. 50: 791-824.
[21] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis :
Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
[22] R. Eres, G. M. Landau, and L. Parida. Permutation Pattern Discovery in Biosequences. Journal of Computational Biology, 11(6): 1050-1060, 2004. doi:10.1089/cmb.2004.11.1050.
[23] C. Fellbaum. WordNet: An Electronic Lexical Database. The MIT Press, 1998.
[24] M. Fiedler, C. Borgelt. Support computation for mining frequent subgraphs in a single
graph. MLG, 2007.
[25] Y. Gao, K. Mathee, G. Narasimhan, and X. Wang. Motif detection in protein sequences. Proc. of SPIRE, 63-72, 1999.
[26] M.N. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential Pattern Mining with
Regular Expression Constraints. VLDB, 1999.
[27] The Gene Ontology Consortium. The Gene Ontology project in 2008. Nucleic Acids
Research, vol. 36, 2008.
[28] C. Giannella, E. Robertson. On approximation measures for functional dependencies.
Information Systems, 483-507, 2004.
[29] U. Gobel, C. Sander, R. Schneider, A. Valencia. Correlated mutations and residue
contacts in proteins. Proteins: Struct. Funct. Genet, 1994.
[30] D. Graur, W.H. Li. Fundamentals of Molecular Evolution. Ed. Sinauer Associates,
Inc., 1991.
[31] W. Grundy, T. Bailey, C. Elkan and M. Baker. Meta-MEME: Motif-based Hidden Markov Models of Biological Sequences. Computer Applications in the Biosciences,
13(4): 397-406, 1997.
[32] B. Haas, A. Delcher, J. Wortman, and S. Salzberg. DAGchainer: a tool for mining
segmental genome duplications and synteny. Bioinformatics, 20: 3643-3646, 2004.
[33] E. Halperin, J. Buhler, R. Karp, R. Krauthgamer, and B. Westover. Detecting Protein
Sequence Conservation via Metric Embeddings. Bioinformatics, vol. 19, no. 1, 122-122,
2003.
[34] J. Han, J. Pei. Mining Frequent Patterns by Pattern-growth: Methodology and Implications. Proc. of KDD, 2000.
[35] S. Hannenhalli, and L. Wang. Enhanced position weight matrices using mixture models,
ISMB, 204-212, 2005.
[36] X. He, M. Sarma, X. Ling, B. Chee, C. Zhai, and B. Schatz. Identifying overrepresented
concepts in gene lists from literature: a statistical approach based on Poisson mixture
model. BMC Bioinformatics, vol. 11, 2010.
[37] S. Heber, J. Stoye. Finding All Common Intervals of k Permutations. CPM, 2001.
[38] G. Hertz and G. Stormo. Identification of consensus patterns in unaligned DNA and
protein sequences: a large-deviation statistical basis for penalizing gaps. Proc. of Bioinformatics and Genome Research, 201-216, 1995.
[39] J. Huan, W. Wang, J. Prins. Efficient Mining of Frequent Subgraph in the Presence of
Isomorphism. ICDM, 2003.
[40] J. Huan, W. Wang, J. Prins, and J. Yang. SPIN: mining maximal frequent subgraphs
from graph databases. Proc. of KDD, 2004.
[41] R. Hughey, A. Krogh. Hidden Markov models for sequence analysis: extension and
analysis of the basic method, Computer Applications in the Biosciences, 12(2): 95-107,
1996.
[42] M. Huynen, B. Snel, W. Lathe, 3rd, P. Bork. Predicting protein function by genomic
context: quantitative evaluation and qualitative inferences. Genome Res., 10: 1204-1210,
2000.
[43] C. Jiang, F. Coenen, M. Zito. Frequent Sub-graph Mining on Edge Weighted Graphs.
DAWAK, 2010.
[44] M. Joshi, G. Karypis, V. Kumar. A Universal Formulation of Sequential Patterns.
KDD workshop on Temporal Data Mining, 2001.
[45] M. Kanehisa and S. Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic
Acids Research, vol. 28, 2000.
[46] U. Keich, P. Pevzner. Finding motifs in the twilight zone. Bioinformatics, 18(10), 2002.
[47] S. A. Krawetz, D. D. Womble. Introduction to Bioinformatics: A Theoretical And
Practical Approach. Humana Press, 2003.
[48] M. Kuramochi, G. Karypis. Finding Frequent Patterns in a Large Sparse Graph. Data
Mining and Knowledge Discovery, 2005.
[49] M. Kuramochi, and G. Karypis, Finding Frequent Patterns in a Large Sparse Graph.
DMKD, 2005.
[50] G.M. Landau, L. Parida, O. Weimann. Gene Proximity Analysis Across Whole Genomes
via PQ Trees. Journal of Computational Biology, 12(10), pp 1289–1306, 2005.
[51] C. Lawrence, S. Altschul, M. Boguski, J. Liu, A. Neuwald, J. Wootton. Detecting
Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Science,
262: 208-214, 1993.
[52] D. S. Lieber, O. Elemento, S. Tavazoie. Large-Scale Discovery and Characterization of
Protein Regulatory Motifs in Eukaryotes. PLoS One 2010.
[53] E. Marcotte, M. Pellegrini, M. Thompson, T. Yeates, and D. Eisenberg. A combined
algorithm for genome-wide prediction of protein function. Nature, vol. 402, 1999.
[54] C. von Mering, M. Huynen, D. Jaeggi, S. Schmidt, P. Bork, B. Snel. STRING: a
database of predicted functional associations between proteins. Nucleic Acids Res.,
31(1): 258-261, 2003.
[55] E. Neher. How frequent are correlated changes in families of protein sequences? Proc.
Natl Acad. Sci. 1994.
[56] A. Neuwald, J. Liu, D. Lipman, and C. Lawrence. Extracting protein alignment models
from the sequence database. Nucleic Acids Research, 25(9): 1665-1667, 1998.
[57] C. Nevill-Manning, T. Wu, and D. Brutlag, Highly Specific Protein Sequence Motifs
for Genome Analysis. Proc. of Natl. Acad. Sci., 95(11): 5865-5871, 1998.
[58] S. Nijssen, J. Kok. A quickstart in frequent structure mining can make a difference,
Proc of KDD, 2004.
[59] C. Notredame, D. Higgins, J. Heringa. T-Coffee: A novel method for fast and
accurate multiple sequence alignment. J Mol Biol, 302(1): 205-217, 2000.
[60] R. Overbeek, M. Fonstein, M. D'Souza, G. D. Pusch, N. Maltsev. Use of contiguity
on the chromosome to predict functional coupling. In Silico Biol., 1(2): 93-108,
1999.
[61] R. Overbeek, M. Fonstein, M. D'Souza, G. D. Pusch, N. Maltsev. The Use of Gene
Clusters to Infer Functional Coupling. Proc. Natl. Acad. Sci. U.S.A., 96(6): 2896-2901,
1999.
[62] L. Parida. Pattern Discovery in Bioinformatics: Theory and Algorithms, Chapman
and Hall CRC, 2007.
[63] L. Parida, I. Rigoutsos, A. Floratos, D. Platt, Y. Gao. Pattern discovery on character
sets and real-valued data: linear bound on irredundant motifs and an efficient polynomial
time algorithm. Proc. of SODA, 297-308, 2000.
[64] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE,
2001.
[65] J. Pei, J. Han, W. Wang. Mining Sequential Patterns with Constraints in Large
Databases. CIKM, 2002.
[66] J. Pei, J. Liu, H. Wang, K. Wang, P.S. Yu, J. Wang. Efficiently Mining Frequent
Closed Partial Orders. ICDM, 2005.
[67] P. Pevzner and S. Sze. Combinatorial algorithm for finding subtle signals in DNA
sequences. Proc. of ISMB, 269-278, 2000.
[68] S. Rajasekaran, S. Balla and C. H. Huang. Exact Algorithms for Planted Motif Problems. Journal of Computational Biology, vol. 12, no. 8, 1117-1128, 2005.
[69] I. Rigoutsos and A. Floratos. Combinatorial pattern discovery in biological sequences:
the TEIRESIAS algorithm. Bioinformatics, 14, 55-67, 1998.
[70] R. Rymon. Search Through Systematic Set Enumeration. Proc. of Third Int’l Conf.
on Principles of Knowledge Representation and Reasoning, 1992.
[71] G. K. Sandve, F. Drablos. A survey of motif discovery methods in an integrated
framework. Biol Direct, 2006.
[72] T. Schmidt, J. Stoye. Quadratic Time Algorithms for Finding Common Intervals in
Two and More Sequences. CPM, 2004.
[73] T. Schneider, R. Stephens. Sequence Logos: a new way to display consensus sequences.
Nucleic Acids Research, 18: 6097-6100, 1990.
[74] J. Scott, T. Ideker, R. M. Karp, R. Sharan. Efficient algorithms for detecting signaling
pathways in protein interaction networks. Journal of Computational Biology, 2005.
[75] I. N. Shindyalov, N. A. Kolchanov, C. Sander. Can three-dimensional contacts in
protein structures be predicted by analysis of correlated mutations? Protein Eng., 1994.
[76] M. Shinoda, T. Ozaki, and T. Ohkawa. Weighted Frequent Subgraph Mining in
Weighted Graph Databases. ICDM workshop on Domain Driven Data Mining, 2009.
[77] M. Singh, B. Berger, P. Kim, J. Berger, and A. Cochran. Computational learning
reveals coiled coil-like motifs in histidine kinase linker domains. Proc. Natl. Acad. Sci.
USA, 95: 2738-2743, 1998.
[78] M. Socolich, S.W. Lockless, W.P. Russ, H. Lee, K.H. Gardner, R. Ranganathan. Evolutionary information for specifying a protein fold. Nature, 2005.
[79] B. Song, J. Choi, G. Chen, J. Szymanski, G. Zhang, A. Tung, J. Kang, S. Kim, and
J. Yang. ARCS: An Aggregated Related Column Scoring Scheme for Aligned Sequences.
Bioinformatics, 22(19): 2326-2332, 2006.
[80] R. Srikant, R. Agrawal. Mining Sequential Patterns: Generalization and Performance
Improvements. EDBT, 1996.
[81] G. Stormo. DNA binding sites: representation and discovery. Bioinformatics, vol. 16,
16-23, 2000.
[82] A. R. Subramanian, M. Kaufmann, and B. Morgenstern. DIALIGN-TX: greedy and
progressive approaches for segment-based multiple sequence alignment. Algorithms for
Molecular Biology, 2008, 3:6.
[83] G.M. Suel, et al. Evolutionarily conserved networks of residues mediate allosteric
communication in proteins. Nat. Struct. Biol., 2003.
[84] R.L. Tatusov, E.V. Koonin, D.J. Lipman. A genomic perspective on protein families.
Science, 278(5338): 631-637, 1997.
[85] R. L. Tatusov, N. D. Fedorova, J. D. Jackson, A. R. Jacobs, B. Kiryutin, E. V. Koonin,
D. M. Krylov, R. Mazumder, S. L. Mekhedov, A. N. Nikolskaya, B. S. Rao, S. Smirnov, A.
V. Sverdlov, S. Vasudevan, Y. I. Wolf, J. J. Yin, and D. A. Natale. The COG database:
an updated version includes eukaryotes. BMC Bioinformatics, 2003, 4:41.
[86] W. R. Taylor, K. Hatrick. Compensating changes in protein multiple sequence alignments. Protein Eng., 1994.
[87] N. Vanetik, S. E. Shimony and E. Gudes. Support measures for graph data. Data Min.
Knowl. Discov., 2006.
[88] K. Wang, Y. Xu, J. X. Yu. Scalable Sequential Pattern Mining for Biological Sequences. CIKM, 2004.
[89] W. Wang and J. Yang. Mining Sequential Patterns from Large Data Sets. Kluwer
Publisher, 2005.
[90] X. Yan, J. Han. gSpan: graph-based substructure pattern mining, Proc. of ICDM,
2002.
[91] J. Yang, W. Wang, P. Yu, and J. Han. Mining Long Sequential Patterns in a Noisy
Environment. SIGMOD, 2002.
[92] K. Y. Yip, P. Patel, P. M. Kim, D. M. Engelman, D. McDermott, and M. Gerstein.
An integrated system for studying residue coevolution in proteins. Bioinformatics, 2008.
[93] S. Zhang, S. Li, and J. Yang. GADDI: distance index based subgraph matching in
biological networks. Proc. of EDBT, 2009.
[94] M. Zaki. SPADE: an efficient algorithm for mining frequent sequences. Machine Learning, 42(1/2):31-60, 2001.
[95] BLAST. Available at http://ncbi.nih.gov/BLAST.
[96] gengraph win. Available at
http://www.cs.sunysb.edu/~algorith/implement/viger/distrib/.
[97] PROSITE database. Available at http://www.expasy.org/prosite.