Pseudo-periodic partitions of biological sequences

Vol. 20 no. 3 2004, pages 295–306
DOI: 10.1093/bioinformatics/btg404
BIOINFORMATICS
Lugang Li1,2,† , Renchao Jin2,3,4,† , Poh-Lin Kok2 and
Honghui Wan2,5,6, ∗
1 Department
of Protective Medicine, Nanjing Army Medical College, The Second Army
Medical University, Nanjing, Jiangsu 210099, China, 2 Laboratory of Bioinformatics,
Maryland Institute of Dynamic Genomics, 3910 Jeffry Street, Silver Spring, MD 20906,
USA, 3 Department of Computer and Information Sciences, Temple University,
Philadelphia, PA 19139, USA, 4 School of Computer Science and Technology,
Huazhong University of Science and Technology, Wuhan, Hubei 430074, China,
5 Global Bioinformatics Laboratory, National Center for Genome Resources,
2935 Rodeo Park Drive East, Santa Fe, NM 87505, USA and 6 National Center for
Toxicogenomics, National Institute of Environmental Health Sciences, National
Institutes of Health, P.O. Box 12233, Mail Drop F1-05, 111 T. W. Alexander Drive,
Research Triangle Park, NC 27709, USA
Received on February 2, 2003; revised on August 3, 2003; accepted on August 5, 2003
ABSTRACT
Motivation: Algorithm development for finding typical patterns
in sequences, especially multiple pseudo-repeats (pseudo-periodic regions), is at the core of many problems arising in
biological sequence and structure analysis. In fact, one of
the most significant features of biological sequences is their
high quasi-repetitiveness. Variation in the quasi-repetitiveness
of genomic and proteomic texts demonstrates the presence
and density of different biologically important information. It is
very important to develop sensitive automatic computational
methods for the identification of pseudo-periodic regions of
sequences through which we can infer, describe and understand biological properties, and seek precise molecular details
of biological structures, dynamics, interactions and evolution.
Results: We develop a novel, powerful computational tool
for partitioning a sequence into pseudo-periodic regions. The
pseudo-periodic partition is defined as a partition, which intuitively has the minimal bias to some perfect-periodic partition
of the sequence based on the evolutionary distance. We
devise a quadratic time and space algorithm for detecting a
pseudo-periodic partition for a given sequence, which actually corresponds to the shortest path in the main diagonal
of the directed (acyclic) weighted graph constructed by the
Smith–Waterman self-alignment of the sequence. We use several typical examples to demonstrate the utilization of our
algorithm and software system in detecting functional or structural domains and regions of proteins. A big advantage of our
software program is that there is a parameter, the granularity
factor, associated with it and we can freely choose a biological sequence family as a training set to determine the best
parameter. In general, we choose all repeats (including many
pseudo-repeats) in the SWISS-PROT amino acid sequence
database as a typical training set. We show that the best granularity factor for this training set is 0.52 and the average agreement accuracy of
pseudo-periodic partitions, detected by our software for all
pseudo-repeats in the SWISS-PROT database, is as high
as 97.6%.
Availability: The program is available upon request from
Honghui Wan and will also be available at http://www.mindgen.org
Contact: [email protected]
∗ To whom correspondence should be addressed.
† The authors wish it to be known that, in their opinion, the first two authors
should be regarded as joint First Authors.
Bioinformatics 20(3) © Oxford University Press 2004; all rights reserved.
1 INTRODUCTION
DNA and protein sequences are neither ‘complex’, nor
‘simple’ (Li et al., 2003). They do not in general resemble
random strings of letters, but rather consist of a heterogeneous mixture of local regions with distinct genetic functions
and evolutionary origins. These regions show many different
compositional characteristics and types of sequence patterns,
as if written in a mosaic of different languages. Genomic
sequences show, e.g. coding sequences, untranslated regions,
introns, exons, intergenic regions, promoters, terminators,
regulatory signals, RNA genes, direct or inverted repeats
of widely different sizes in tandem or interspersed arrangements, microsatellites, CpG islands, centromeres, telomeres
and origins of replication.
At the genomic level, many important genomes, especially
eukaryotes (higher-order organisms whose DNA is enclosed in
a cell nucleus), contain numerous ‘pseudo-periodic regions’,
such as human, pathogens, parasites and maize. About 10–25%
of total DNA in higher eukaryotes consists of short (5–10 nt)
sequences that are randomly repeated thousands of times.
These DNA segments have a different natural density due to
the base composition differences. Over a third of the human
genome consists of interspersed repetitive sequences that are
primarily degenerate copies of transposable elements (Smit,
1996). In particular, most of the human Y chromosome consists
of pseudo-periodic segments, and overall families of reiterated
sequences account for about one-third of the human genome
(Wan and Song, 2002, http://www.computer.org/proceedings/
ipdps/1573/workshops/15730187babs.htm). Most plant and
animal genomes consist largely of repetitive DNA—perhaps
30 sequence motifs, typically 1–10 000 nt long, present many
hundreds or thousands of times in the genome, which may
be located at a few defined chromosomal sites or widely
dispersed.
Amino acid repeats are common in many proteins. All the
protein sequences from SWISS-PROT database contain many
single amino acid repeats, tandem oligo-peptide repeats and
periodically conserved amino acids. Single amino acid repeats
of glutamine, serine, glutamic acid, glycine and alanine seem
to be tolerated to a considerable extent in a lot of proteins.
Tandem oligo-peptide repeats of different types with varying
levels of conservation have been detected in several proteins
and found to be conspicuous, particularly in structural and
cell surface proteins (Katti et al., 2000). Although the significance of the amino acid repeats in protein structure and
function has been demonstrated in some proteins, it remains
largely unclear. Recently, they have gained much attention
due to association of several neuro-degenerative disorders
with unstable poly-glutamine repeats in affected proteins. It
appears that repeated sequence patterns may be a mechanism
that provides regular arrays of spatial and functional groups,
useful for structural packing or for one to one interactions with
target molecules.
The detection of multiple repeats in biological sequences or
sequence databases is an important problem in bioinformatics
and computational biology. Repeats occur frequently in biological sequences, yet they are seldom exact. Hence, we focus
our attention on approximately repeated patterns (pseudo-periodic regions). Many algorithms have been developed for
finding pseudo-repeats of sequences, most of which are based
on computing alignment scores. The best algorithm is due
to Schmidt (1998); it uses weighted grid digraphs to discover all locally optimal approximate repeats within a sequence
of length n in O(n^2 log n) time and O(n^2) space. Wan and Song (2002) used module incidence
matrices to find quasiperiods in biological sequences in linear
time and space, but all quasiperiodic units must have
the same length except for the last one.
Almost all existing algorithms for finding multiple repeats
are based on the determination of some consensus pattern
units of the same length within a sequence. However,
there does not always exist a consensus pattern within
a sequence. For example, the protein sequence S =
ACDACDEACDEFACDEFGCDEFGAEFDCFDFD can be partitioned into eight consecutive subsequences: ACD, ACDE,
ACDEF, ACDEFG, CDEFG, AEFD, CFD, FD. The partition
{ACD, ACDE, ACDEF, ACDEFG, CDEFG, AEFD, CFD, FD}
is called a pseudo-periodic partition of S. It can be clearly
seen that the pseudo-periodic pattern unit ‘ACD’ gradually
and smoothly changes into the pseudo-periodic pattern unit
‘FD’, but no fixed pattern unit exists and most of the pattern units
have different sizes. In this case, none of the existing algorithms can be used to find a pseudo-periodic partition of S. In this
paper, we apply the dynamic programming approach to a special type of self-alignment with a granularity factor to develop
an O(n^2)-complexity algorithm for detecting a ‘best possible’
pseudo-periodic partition of a sequence of length n.
In Section 2, we first mathematically define a perfect-periodic sequence by means of the consecutive edit distance.
Then we define a pseudo-periodic partition of a sequence
associated with a granularity factor, which is a natural generalization of the perfect-periodicity. After that, we develop
a quadratic time and space algorithm for detecting a pseudo-periodic partition for a given sequence, which actually corresponds to the shortest path in the main diagonal of the directed
(acyclic) weighted graph (DWG) constructed by the Smith–
Waterman self-alignment of the sequence. In Section 3, we
use several typical examples to demonstrate the utilization of
our algorithm and software program in detecting functional or
structural domains and regions of proteins. A great advantage
of our software system is that there is a parameter, the granularity factor, associated with it and we can freely choose a
biological sequence family as a training set to determine the
best parameter. In general, we choose all repeats (including
pseudo-repeats) in the SWISS-PROT amino acid sequence
database as a typical training set. We find that the granularity
factor is 0.52 and the average agreement accuracy of pseudo-periodic partitions, detected by our software for all pseudo-repeats in the SWISS-PROT database, is as high as 97.6%.
2 METHODS AND ALGORITHMS
2.1 Periodic partition
Let A = {a1 , a2 , . . . , am } be a finite alphabet of m letters in
which ai is called the letter of type i (1 ≤ i ≤ m). Symbolic sequences are characterized by A and (usually) by a
finite length n. One-dimensional strings play an important
role in various fields, such as informatics, dynamical systems,
biology, communication theory, linguistics and psychology.
In particular, the digital information that underlies genetics,
genomics, proteomics, biochemistry, cell biology and development can be represented by a simple string on a 4-letter
alphabet (four nucleotides: A, C, G, T) or a 20-letter alphabet
(20 amino acids: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S,
T, V, W, Y), which can be regarded as the ‘data structure’ of
life. We denote by F the set of all finite sequences over A.
A sequence in F is typically written as s = s1 s2 · · · sn , where
si ∈ A, and we denote by |s| = n the length of the sequence.
For any 1 ≤ i ≤ n, s1 s2 · · · si is a prefix of s and si si+1 · · · sn
is a suffix of s.
Given a sequence s ∈ F, a partition of s is a sequence
{s (1) , s (2) , . . . , s (k) } of contiguous segments of s, such that
s = s (1) s (2) · · · s (k) , where 1 ≤ k ≤ n, |s (1) | + |s (2) | + · · · +
|s (k) | = n, and 1 ≤ |s (i) | ≤ n for each i with 1≤i ≤ k.
The segments s (1) , s (2) , . . . , s (k) are the parts of the
partition. For example, a protein sequence s =
AACDDEFGGHHHHIKL has many partitions, and three of
them are {AAC,DDEF,GGHH,HHIKL}, {AACDDEFGGHHHHIKL}
and {A,A,C,D,D,E,F,G,G,H,H,H,H,I,K,L}. For sequence s = s1 s2 · · · sn , there are two special
partitions. One is {s1 s2 · · · sn }, consisting of only one part,
s itself. This partition is called the least partition of s. The
other is {s1 , s2 , . . . , sn }, consisting of n parts, each of which is
a single letter. We call this partition the largest partition of
s. Furthermore, for any k with 1 ≤ k ≤ n, if we partition
the sequence s into k consecutive subsequences, we have $\binom{n-1}{k-1}$
different ways to choose k − 1 cut-positions from n − 1 possible cut-positions, where $\binom{n-1}{k-1} = (n-1)!/[(k-1)!\,(n-k)!]$
is the binomial coefficient. Thus, there are $\binom{n-1}{k-1}$ different partitions of s into k parts for any k, 1 ≤ k ≤ n. We denote by P (s) the set
of all partitions of sequence s. We have:
Proposition 1. The total number of partitions of a sequence s
is given by:
$$|P(s)| = \sum_{k=1}^{n} \binom{n-1}{k-1} = 2^{n-1}, \qquad (1)$$
where n is the length of s.
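Proposition 1 can be checked by brute force on short sequences. The following Python sketch enumerates every partition by choosing cut-positions and confirms the count for a four-letter example; the function name `partitions` is illustrative and does not come from the paper.

```python
from itertools import combinations
from math import comb


def partitions(s):
    """Yield every partition of s into contiguous, non-empty parts."""
    n = len(s)
    for k in range(1, n + 1):
        # Choose k - 1 cut-positions among the n - 1 gaps between letters.
        for cuts in combinations(range(1, n), k - 1):
            bounds = (0, *cuts, n)
            yield [s[a:b] for a, b in zip(bounds, bounds[1:])]


s = "ACDE"
all_parts = list(partitions(s))
assert len(all_parts) == 2 ** (len(s) - 1)  # Proposition 1
for k in range(1, len(s) + 1):
    # Exactly C(n - 1, k - 1) partitions have k parts.
    assert sum(len(p) == k for p in all_parts) == comb(len(s) - 1, k - 1)
```

Because the count doubles with every added letter, exhaustive enumeration is only practical for very short sequences, which is why an efficient search is needed later.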
Definition 1. For a sequence s of length n and a positive integer p < n, we say that s is p-periodic if there
exists a partition π = {s (1) , s (2) , . . . , s (k) } of s such that
k = ⌈n/p⌉, s (1) = s (2) = · · · = s (k−1) and s (k) is a prefix of s (k−1) . Here ⌈x⌉ denotes the ceiling of x. The
partition π is called the p-periodic partition of s.
The last part of the partition may be an incomplete
periodic unit and in this case |s (k) | < p. For example,
if s = HLPRVHLPRVHLPRVHLPRVHLP, then {HLPRV,
HLPRV,HLPRV,HLPRV,HLP} is a 5-periodic partition
of s, while {HLPRVHLPRV,HLPRVHLPRV,HLP} is a
10-periodic partition of s. Here, HLP is an incomplete periodic unit of length 3. Generally, if π = {s (1) , s (2) , . . . , s (k) }
is a p-periodic partition of s, then |s (i) | = p for 1 ≤ i ≤
k − 1, where every s (i) is a complete periodic unit, and
1 ≤ |s (k) | ≤ p, i.e. only s (k) is allowed to be an incomplete
periodic unit. If n is a multiple of p, then s (k−1) = s (k) .
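As a minimal sketch of Definition 1, the Python function below cuts s into blocks of length p and checks the two conditions of the definition. The name `periodic_partition` is an assumption made for illustration.

```python
from math import ceil


def periodic_partition(s, p):
    """Return the p-periodic partition of s, or None if s is not p-periodic."""
    n = len(s)
    if not 0 < p < n:
        raise ValueError("Definition 1 requires 0 < p < n")
    # Blocks of length p; only the last block may be shorter.
    parts = [s[i:i + p] for i in range(0, n, p)]
    assert len(parts) == ceil(n / p)
    complete, last = parts[:-1], parts[-1]
    # All complete units must be equal and the last part must be a prefix.
    if all(u == complete[0] for u in complete) and complete[-1].startswith(last):
        return parts
    return None


s = "HLPRVHLPRVHLPRVHLPRVHLP"
print(periodic_partition(s, 5))   # five parts, the last one incomplete
print(periodic_partition(s, 10))  # three parts
print(periodic_partition(s, 4))   # None: s is not 4-periodic
```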
By Definition 1, we know that a sequence s = s1 s2 · · · sn
is p-periodic if and only if si = si+p holds for every
i, 1 ≤ i ≤ n − p. If s is p-periodic and kp < n for some
positive integer k, then s is also kp-periodic. Moreover, we
have the following propositions, which demonstrate the structure of the set of periods of a sequence. Detailed proofs are
provided in Appendix A.
Proposition 2. If a sequence s = s1 s2 · · · sn is p1 -periodic
and p2 -periodic and n ≥ p1 + p2 , then s is also gcd(p1 , p2 )-periodic, where p1 and p2 are two positive integers, and gcd(p1 , p2 ) denotes the greatest common divisor of p1 and p2 .
Proposition 3. For a p-periodic sequence s of length n
with p ≤ n/2, there exists a positive integer p0 ≤ n/2
such that for any positive integer m with 1 ≤ m ≤ n/2, s is
m-periodic if and only if m is a multiple of p0 .
Proposition 3 does not hold for m > n/2. For example,
s = ACAACA. s is both 3-periodic and 5-periodic, but 5 is
not a multiple of 3. This is caused by the fact that ACAACA
does not completely repeat twice. Intuitively, a period should
repeat at least twice. So we require that m ≤ n/2.
We call such p0 in Proposition 3 the atomic period (length)
of s. The periodic partition of s corresponding to p0 is called
the atomic periodic partition of s. Proposition 3 demonstrates
the significance of the atomic periodic partition in the study
of periodic partitions of s. Thus, we mainly concentrate on
the atomic periodic partition of s whenever s is p-periodic for
some p < n.
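The characterization si = si+p and the atomic period can be illustrated with a short Python sketch. The helper names `is_periodic` and `atomic_period` are assumptions for illustration; the atomic period is taken here to be the smallest period not exceeding n/2, which is what Proposition 3 implies.

```python
def is_periodic(s, p):
    """True if s[i] == s[i + p] for every valid i (intended for 0 < p < len(s))."""
    return all(s[i] == s[i + p] for i in range(len(s) - p))


def atomic_period(s):
    """Smallest period m with 1 <= m <= n/2, or None if there is none."""
    for m in range(1, len(s) // 2 + 1):
        if is_periodic(s, m):
            return m
    return None


s = "ACAACA"
periods = [p for p in range(1, len(s)) if is_periodic(s, p)]
print(periods)            # includes 5, which exceeds n/2 and is not a multiple of 3
print(atomic_period(s))   # restricted to m <= n/2
```

Restricting the search to m ≤ n/2 mirrors the requirement in the text that a period should repeat at least twice.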
Definition 2. Let π = {s (1) , s (2) , . . . , s (k) } be a partition
of a sequence s, where |s (1) | = |s (2) | = · · · = |s (k−1) | = q ≥
|s (k) | = r. Then the consecutive edit distance (CE-distance)
of the partition π is defined as:
$$D_{CE}(\pi) = \sum_{i=1}^{k-1} d(s^{(i)}, s^{(i+1)}), \qquad (2)$$
where d(s (i) , s (i+1) ) is the edit distance between s (i) and s (i+1)
for 1 ≤ i ≤ k − 2, and d(s (k−1) , s (k) ) = min{d|d is the edit
distance between s (k) and a prefix of s (k−1) }.
Note that by convention, DCE (π ) = 0 for k = 1. For
k > 1, d(s (k−1) , s (k) ) is adapted for dealing with the cases
when s (k) is an incomplete periodic unit. Obviously, d(s (i) ,
s (i+1) ) ≥ 0 for 1 ≤ i ≤ k − 1. Moreover, d(s (i) , s (i+1) ) = 0
whenever s (i) = s (i+1) for 1 ≤ i ≤ k − 2, and
d(s (k−1) , s (k) ) = 0 whenever s (k) is a prefix of s (k−1) . These
facts imply that the periodicity of a sequence can be characterized in terms of the consecutive edit distance, as stated in the
following proposition.
Proposition 4. For a sequence s of length n and a positive integer p < n, s is p-periodic if and only if there
exists a partition π = {s (1) , s (2) , . . . , s (⌈n/p⌉) } of s such that
DCE (π ) = 0.
Coward and Drablos (1998) partitioned a sequence s into
consecutive subsequences s (1) , s (2) , . . . , s (k) of fixed length p
each, and defined a measure of the ‘mutual agreement’
between the subsequences as:
$$M_p(s) = \sum_{j<i} d(s^{(i)}, s^{(j)}),$$
where d is a metric on the alphabet A. Mp (s) has the same
property as DCE (π ): s is p-periodic if and only if Mp (s) = 0.
However, an insertion or a deletion in s influences Mp (s)
greatly.
Our DCE (π ) has better ‘stability’ than Mp (s).
In particular, if s is almost periodic except for a few substitutions or indels, then DCE (π ) will be quite small. In
addition, the existence of an incomplete last periodic unit
does not influence DCE (π ) at all. Due to this characteristic,
DCE (π ) offers a better measure of the weak-periodicity of
sequences. For example, s = ABCDEABCDEABCDEABCDE
is 5-periodic. In this case, M5 (s) = 0 and DCE (π ) = 0
for π = {ABCDE,ABCDE,ABCDE,ABCDE}. If an insertion
happens, say, a ‘C’ is inserted between the second ‘B’ and
‘C’ in the sequence, we get a new sequence of length 21:
s′ = ABCDEABCCDEABCDEABCDE. For M5 , the last letter is
discarded and M5 (s′ ) = 10. However, DCE (π′ ) = 2 for
π′ = {ABCDE,ABCCDE,ABCDE,ABCDE}.
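The CE-distance of Definition 2 can be sketched in Python as follows. This is an illustrative sketch that assumes the unit-cost Levenshtein distance as the edit distance d; the helper names `edit_row` and `ce_distance` are not from the paper, and the equal-length condition on the first k − 1 parts is not enforced so that partitions containing an indel can also be scored.

```python
def edit_row(a, b):
    """Return [d(a, b[:j]) for j in 0..len(b)] under unit-cost Levenshtein."""
    prev = list(range(len(b) + 1))            # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                             # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,       # delete ca
                           cur[j - 1] + 1,    # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev


def ce_distance(parts):
    """Consecutive edit distance of a partition given as a list of strings."""
    if len(parts) < 2:
        return 0                              # convention for k = 1
    # Plain edit distance between consecutive parts, up to part k - 1.
    total = sum(edit_row(x, y)[-1] for x, y in zip(parts[:-2], parts[1:-1]))
    # Last part: best match against any prefix of the previous part.
    return total + min(edit_row(parts[-1], parts[-2]))


print(ce_distance(["ABCDE", "ABCDE", "ABCDE", "ABCDE"]))
print(ce_distance(["ABCDE", "ABCCDE", "ABCDE", "ABCDE"]))
```

A single insertion changes only the two distances that involve the affected part, which is the stability property described above.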
2.2 Pseudo-periodic partition
Now, we suppose that a sequence s of length n is not
p-periodic for any positive integer p < n. We must turn to
investigating its pseudo-periodicity, which is a natural generalization of the perfect periodicity defined in Definition 1.
Motivated by Proposition 4, it is tempting to use the
minimal consecutive evolutionary edit distance as a measure of the distance of a sequence to the ‘nearest periodic
partition’. But we will see that this measure does not work
properly. As an example, we look at the sequence s =
AKLAKMAKNAKP. The following are three of the partitions of s:
π1 = {AKL,AKM,AKN,AKP}, π2 = {AKLAKM,AKNAKP}
and π3 = {AKLAKMAKNAK,P}. Intuitively, π1 reveals the
most approximate periodicity of s and π3 the least. However, DCE (π1 ) = 3, DCE (π2 ) = 2 and DCE (π3 ) = 1
if we take the edit distance as the Levenshtein distance
(Levenshtein, 1966), in which matching letters score 0 and
deletions/insertions/substitutions of letters score 1. So the
Levenshtein distance cannot be directly used to detect the
pseudo-periodic regions and we must find out what factor we
have neglected.
For a partition π = {s (1) , s (2) , . . . , s (k) } of a sequence s,
the two end parts of s are less involved than the internal parts
of s in computing DCE (π ). One end is the first subsequence
in π , i.e. s (1) . The other end is more complicated to identify. It is not just the final subsequence s (k) , because we
allow an incomplete final periodic unit. Actually, it is the suffix
of s starting from the letter that is next to the letter aligned
with sn in the alignment corresponding to d(s (k−1) , s (k) ). We
denote this suffix of s by ŝ (k) . The sum |s (1) | + |ŝ (k) | is
an important factor that makes π1 better than π2 and π3 , even though
DCE (π1 ) > DCE (π2 ) > DCE (π3 ). Let g(π ) = (|s (1) | +
|ŝ (k) |)/2, and let c > 0 denote a constant, which is the
cost of a single indel (gap) in a self-alignment of s. g(π )
is called the granularity of π , and c the granularity factor.
We have:
Definition 3. For a sequence s, let Bc (s) =
min{DCE (π ) + c · g(π )|π ∈ P (s)}. Then Bc (s) is called the
minimal distance to periodic partitions of s. A partition πq =
{s (1) , s (2) , . . . , s (k) } of s is called the pseudo-periodic partition
of s if DCE (πq ) + c · g(πq ) = Bc (s).
In the above example regarding s = AKLAKMAKNAKP,
g(π1 ) = (3 + 3)/2 = 3, g(π2 ) = (6 + 6)/2 = 6, g(π3 ) =
(11 + 11)/2 = 11. Taking c = 1, Bc (s) = 6 = DCE (π1 ) + c · g(π1 ). Thus,
according to Definition 3, π1 is the pseudo-periodic partition
of s for c = 1.
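The effect of the granularity factor on this example can be reproduced with a few lines of Python. The sketch below only compares the three candidate partitions discussed in the text, using the DCE and g values stated there; it does not search all partitions of s, and the names `candidates`, `score` and `best` are illustrative.

```python
# (D_CE, g) pairs as stated in the text for s = AKLAKMAKNAKP.
candidates = {"pi1": (3, 3), "pi2": (2, 6), "pi3": (1, 11)}


def score(d_ce, g, c):
    """Objective of Definition 3: D_CE(pi) + c * g(pi)."""
    return d_ce + c * g


def best(c):
    """Name of the candidate with the smallest objective for a given c."""
    return min(candidates, key=lambda name: score(*candidates[name], c))


for c in (1, 0.1):
    scores = {name: score(d, g, c) for name, (d, g) in candidates.items()}
    print(c, scores, best(c))
```

Among these three candidates, π1 beats π2 exactly when 3 + 3c < 2 + 6c, i.e. c > 1/3, and beats π3 when c > 1/4. A very small c therefore favours coarse partitions, which is why the factor has to be tuned.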
Definition 4. If π ≠ {s} is a pseudo-periodic partition of s
and DCE (π ) = 0, then π is called the atomic periodic partition
of s and g(π ) is the atomic period length of s.
The concept of pseudo-periodic partition is a generalization
of that of atomic periodic partition. Starting from this generalized concept, we can investigate the pseudo-periodicity of
a sequence that has no fixed weak-periodic pattern and no
fixed period length. For example, π = {AKY, AKYV, AKYVN,
AKYVN, FKYVN, FCDEY, FVY, FNY, FY} is the pseudo-periodic partition of s = AKYAKYVAKYVNAKYVNFKYVNFCDEYFVYFNYFY.
In this case, s has nine weak-periodic patterns: AKY, AKYV, AKYVN, AKYVN, FKYVN, FCDEY, FVY,
FNY, FY. We can clearly see from this partition that
the pseudo-periodic patterns change gradually along the
sequence.
2.3 Algorithm design
Our task is to develop an efficient algorithm for detecting the
pseudo-periodic partition for a given sequence. We start from
the DWG constructed by the Smith–Waterman self-alignment.
For the DNA sequence s = AGAGA, the DWG made by the
Smith–Waterman self-alignment of s is shown in Figure 1.
In fact, every edge goes from the bottom-left cell to the top-right cell. Before the modification, every vertical or horizontal
edge is assigned a gap weight while every diagonal edge has
a weight of the substitution cost between the corresponding
letters. Then the shortest path from point (0, 0) to point (5, 5)
is the main diagonal from (0, 0) to (5, 5), which corresponds
to the optimal alignment of sequence s and itself, i.e. every
letter aligns with itself.
To solve the problem, we need to modify this graph. First, change the weights of all edges included
in or crossed by the boundary of the shadowed area into +∞.
This means that we never allow a shortest path to go through
the shadowed area. Next, change the weights of all leftmost
vertical edges and all topmost horizontal edges to c/2. The
thick black line in Figure 1 shows the new shortest path from
(0, 0) to (5, 5). It corresponds to the alignment:
--AGAGA
AGAGA--
This alignment can be divided into four parts:
-- AG AG A
AG AG A- -
Finally, we get a partition of s: π = {AG,AG,A}. It is a
pseudo-periodic partition of s, and is also the atomic periodic
partition of s.
[Figure 1: grid graph with horizontal axis i and vertical axis j, each labeled with the letters A, G, A, G, A; source s = (0, 0), target t = (5, 5); the leftmost vertical and topmost horizontal edges carry weight c/2.]
Fig. 1. Directed (acyclic) weighted graph (DWG) constructed by the
Smith–Waterman self-alignment of the DNA sequence s = AGAGA.
Now, we show that this outcome is not accidental. For a sequence s = s1 s2 · · · sn , we construct a DWG with
(n + 1) × (n + 1) vertices: G = ⟨V , E, f ⟩, where
$$V = \{(i, j) \mid 0 \le i, j \le n\},$$
$$E = \{\langle (i, j), (i, j+1) \rangle \mid 0 \le i \le n,\ 0 \le j < n\} \cup \{\langle (i, j), (i+1, j) \rangle \mid 0 \le i < n,\ 0 \le j \le n\} \cup \{\langle (i, j), (i+1, j+1) \rangle \mid 0 \le i, j < n\},$$
$$f : E \to \mathbb{R},$$
$$f(\langle (i, j), (i, j+1) \rangle) = \begin{cases} c/2, & \text{if } i = 0; \\ cost(-, b_{j+1}), & \text{if } 0 < i < j; \\ +\infty, & \text{otherwise,} \end{cases}$$
$$f(\langle (i, j), (i+1, j) \rangle) = \begin{cases} c/2, & \text{if } j = n; \\ cost(b_{i+1}, -), & \text{if } 0 < i < j - 1; \\ +\infty, & \text{otherwise,} \end{cases}$$
$$f(\langle (i, j), (i+1, j+1) \rangle) = \begin{cases} cost(b_{i+1}, b_{j+1}), & \text{if } i < j; \\ +\infty, & \text{otherwise.} \end{cases}$$
Using the standard dynamic programming algorithm, we
can find the shortest path from (0, 0) to (n, n) and a corresponding alignment between the sequence s and itself. Our
purpose is to find a partition π = {s (1) , s (2) , . . . , s (k) } of s such
that DCE (π ) + c · g(π ) equals the length of the shortest path
from (0, 0) to (n, n).
In a pairwise alignment of two sequences s and t, we denote
by s ∗ and t ∗ the two strings at the top line and bottom line of
the alignment, respectively. That is, s ∗ and t ∗ are sequences
obtained, respectively, from the two sequences s and t by inserting some ‘−’s. We say that the letters or strings in the same
columns of the alignment are aligned with each other. For a subsequence s (i)∗ of s ∗ at the top line of the alignment, we denote
by A∗ (s (i)∗ ) the string at the bottom line that is aligned with
s (i)∗ . In addition, we denote by A(s (i)∗ ) the sequence obtained
by deleting all ‘−’s from A∗ (s (i)∗ ). If the pairwise alignment is a self-alignment of s, then for a subsequence s (i)∗ of
s ∗ , A(s (i)∗ ) is a subsequence of s. For example, the following
is a self-alignment of the sequence s = AMNPQRS:
--AM-NPQRS
AMN-PQRS--
Let s (1)∗ = AM − N, s (2)∗ = PQRS. Then A∗ (s (1)∗ ) =
N − PQ, A(s (1)∗ ) = NPQ, A∗ (s (2)∗ ) = RS − − and
A(s (2)∗ ) = RS.
For a sequence s = s1 s2 · · · sn , we suppose that the
alignment corresponding to the shortest path from (0, 0) to
(n, n) is:
$$AL = \begin{pmatrix} - & - & \cdots & - & s_1 & \cdots & s_r & \cdots & s_n \\ s_1 & s_2 & \cdots & s_r & \cdots & \cdots & \cdots & \cdots & - \end{pmatrix}. \qquad (3)$$
There are r ‘−’s before s1 at the top line (r ≥ 1). We use the
following algorithm Partition(AL) to find a partition π of s.
Algorithm Partition(AL):
Set s (1) = s (1)∗ = s1 s2 · · · sr and i = 1;
While A(s (i)∗ ) ≠ φ do:
{
s (i+1)∗ = A∗ (s (i)∗ );
s (i+1) = A(s (i)∗ );
i = i + 1.
}
Output π = {s (1) , s (2) , . . . , s (i) }.
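The procedure Partition(AL) can be sketched in Python as below. This is an illustrative reading of the pseudocode, not the authors' C implementation: the function name `partition_from_alignment` is hypothetical, the alignment is passed as two equal-length strings with '-' as the gap symbol, and gap columns on the top line are assumed to belong to the segment of the letter that precedes them.

```python
def partition_from_alignment(top, bottom, gap="-"):
    """Read a partition of s off a self-alignment given as two gapped lines."""
    if len(top) != len(bottom):
        raise ValueError("both alignment lines must have the same length")
    if top.replace(gap, "") != bottom.replace(gap, ""):
        raise ValueError("not a self-alignment: the two lines spell different sequences")
    # Column of each letter of s on the top line.
    top_cols = [col for col, ch in enumerate(top) if ch != gap]
    n = len(top_cols)
    r = top_cols[0] if top_cols else len(top)   # number of leading gaps on top
    parts = []
    start, end = 0, r          # columns under the leading gaps give s(1)
    consumed = 0               # letters of s already assigned to a part
    while True:
        part = bottom[start:end].replace(gap, "")   # A(...) of the current segment
        if not part:
            break
        parts.append(part)
        first = consumed
        consumed += len(part)
        # Top-line segment holding this part: from its first letter up to
        # (not including) the first letter of the following part.
        start = top_cols[first]
        end = top_cols[consumed] if consumed < n else len(top)
    return parts


print(partition_from_alignment("--AGAGA", "AGAGA--"))
```

Each new part is the string aligned below the previous part, so the loop walks along the alignment one pseudo-periodic unit at a time and stops when only gaps remain below the current segment.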
Algorithm Partition(AL) is at the heart of finding a pseudo-periodic partition of a sequence s. The following theorem is
the main algorithmic result of this paper, which demonstrates
the feasibility and effectiveness of algorithm Partition(AL).
A detailed proof is given in Appendix B.
Theorem 1. If AL is a self-alignment of s corresponding to the shortest path from (0, 0) to (n, n) in the graph
G = ⟨V , E, f ⟩, then the partition π = {s (1) , s (2) , . . . , s (k) }
generated by Partition(AL) is a pseudo-periodic partition of s.
It is not difficult to see that algorithm Partition(AL) can
be executed in quadratic time and space, just like the standard dynamic programming approach to pairwise sequence
alignment. We have implemented the algorithm in
C in a UNIX environment. The code, under
the name ‘Aperiod’, takes one parameter, the granularity factor c described in Definition 3. It is portable to most
computers with a C compiler and can be easily used by
biologists to do complex trait analysis of biological sequences.
The program is available from the authors.
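The quadratic-time dynamic programme over the modified graph can be sketched in Python as follows. This is not the Aperiod code. It is a minimal illustration that assumes unit substitution cost, an interior gap cost equal to c, and the edge weights as read from the definition of f above; the function name `shortest_self_alignment` is hypothetical.

```python
import math


def shortest_self_alignment(s, c=1.0):
    """Return (length, top, bottom) for a shortest path from (0, 0) to (n, n)."""
    n = len(s)
    inf = math.inf

    def vertical(i, j):      # (i, j) -> (i, j + 1): gap on the top line
        if i == 0:
            return c / 2
        return c if 0 < i < j else inf

    def horizontal(i, j):    # (i, j) -> (i + 1, j): gap on the bottom line
        if j == n:
            return c / 2
        return c if 0 < i < j - 1 else inf

    def diagonal(i, j):      # (i, j) -> (i + 1, j + 1): letter against letter
        if i < j:
            return 0.0 if s[i] == s[j] else 1.0   # assumed unit substitution cost
        return inf

    dist = [[inf] * (n + 1) for _ in range(n + 1)]
    back = [[None] * (n + 1) for _ in range(n + 1)]
    dist[0][0] = 0.0
    # Row-by-row order visits every vertex after all of its predecessors.
    for i in range(n + 1):
        for j in range(n + 1):
            if dist[i][j] == inf:
                continue
            steps = []
            if j < n:
                steps.append((i, j + 1, vertical(i, j), "V"))
            if i < n:
                steps.append((i + 1, j, horizontal(i, j), "H"))
            if i < n and j < n:
                steps.append((i + 1, j + 1, diagonal(i, j), "D"))
            for a, b, w, move in steps:
                if dist[i][j] + w < dist[a][b]:
                    dist[a][b] = dist[i][j] + w
                    back[a][b] = move

    # Trace the path back from (n, n) and spell out the two alignment lines.
    top, bottom = [], []
    i = j = n
    while (i, j) != (0, 0):
        move = back[i][j]
        if move == "V":
            j -= 1
            top.append("-")
            bottom.append(s[j])
        elif move == "H":
            i -= 1
            top.append(s[i])
            bottom.append("-")
        else:
            i -= 1
            j -= 1
            top.append(s[i])
            bottom.append(s[j])
    return dist[n][n], "".join(reversed(top)), "".join(reversed(bottom))


print(shortest_self_alignment("AGAGA", c=1.0))
```

The table has (n + 1) × (n + 1) entries and each entry relaxes at most three edges, which gives the quadratic time and space bound stated above. For protein sequences a substitution matrix would presumably replace the unit cost used here.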
3 RESULTS
3.1 Applications to proteins
The algorithm developed in the previous section for generating
pseudo-periodic partitions of sequences is useful for detecting
functional or structural domains and regions of proteins. We
use several typical examples to demonstrate the utilization of
our algorithm and software program.
It is important to discover subtle repeating (pseudo-periodic) properties in proteins, especially those representing a physicochemical property, such as hydrophobicity
or polarity. Segmentation of the globular and non-globular
(coiled-coil) domains of the seryl-tRNA synthetase protein on
the basis of sequence complexity is shown in Figure 2 of Wan
et al. (2003). In this example, the N-terminal non-globular
region of Thermus thermophilus seryl-tRNA synthetase is
known from a high-resolution crystal structure determination to be mostly an extended, antiparallel, two-stranded
coiled-coil (PDB: 1SRY) (Biou et al., 1994). The sequence
consisting of 421 amino acids was segmented automatically
by the DSR algorithm (Wan et al., 2003). The corresponding
parts of the protein structure (from X-ray crystallography) are
shown by red (‘simple’ region, comprising 105 amino acids,
non-globular domain) and green (‘complex’ region, globular
domain). It is well known that the non-globular domain of the
seryl-tRNA synthetase protein has a weak 7-residue repeat.
In fact, using the Aperiod program, we detect a pseudo-periodic partition of this non-globular region as shown in
Table 1. The results demonstrate that the region indeed has
a pseudo-period of 7.
Table 1. Pseudo-periodic partition of non-globular region of 1SRY
Pseudo-periodic unit    Position    Length
MVDLKRLR                1–8         8
QEPEVFHR                9–16        8
AIREKGVA                17–24       8
LDLEALLA                25–32       8
LDREVQEL                33–40       8
KKRLQEVQ                41–48       8
TERNQVA                 49–55       7
KRVPKAP                 56–62       7
PEEKEAL                 63–69       7
IARGKAL                 70–76       7
GEEAKRL                 77–83       7
EEALRE                  84–89       6
KEARLE                  90–95       6
ALLLQV                  96–101      6
PLPP                    102–105     4

The second example is apo(a) (APOA_HUMAN), whose
SWISS-PROT accession number is P08519. Apo(a) is the
main constituent of lipoprotein(a) [Lp(a)]. It has serine proteinase activity and is capable of autoproteolysis; it also inhibits
tissue-type plasminogen activator 1. Lp(a) may be a ligand
for megalin/Gp 330 (McLean et al., 1987). Apo(a) is known
to be proteolytically cleaved, leading to the formation of
the so-called mini-Lp(a). Apo(a) fragments accumulate in
atherosclerotic lesions, where they may promote thrombogenesis. O-Glycosylation may limit the extent of proteolytic
fragmentation.
Elevated plasma concentrations of apo(a) and its naturally
occurring proteolytic fragments are correlated with atherosclerosis. APOA_HUMAN belongs to peptidase family S1;
also known as the trypsin family and plasminogen subfamily. It contains 38 kringle domains (37 of type IV and one
of type V), each of which is approximately 110 amino acids
in length and of high complexity (Martin, 1999). Homology
with plasminogen kringles IV and V is thought to underlie
the atherogenicity of the protein, because the fragments are
competing with plasminogen for fibrin(ogen) binding. In fact,
kringles (Castellino and Beals, 1987; Ikeo et al., 1991; Patthy,
1985) are triple-looped, disulfide cross-linked domains found
in a varying number of copies, in some serine proteases and
plasma proteins.
Kringle domains are thought to play a role in binding mediators, such as membranes, other proteins or phospholipids,
and in the regulation of proteolytic activity. Using the Aperiod
program, we have exactly detected all 38 kringle domains in
APOA_HUMAN as shown in Table 2. In the pseudo-periodic
partition of apolipoprotein A in APOA_HUMAN generated
by Aperiod, each pseudo-periodic unit corresponds precisely
to a kringle domain. There are 28 perfect periodic regions
(exact repeats) of length 114 in the pseudo-periodic partition,
which starts at position 131 and ends at position 3322 in the
sequence.
Table 2. Pseudo-periodic partition of apolipoprotein A in APOA_HUMAN (38 copies)
Position | Length | Pseudo-periodic unit
20–130 | 111 | EQSHVVQDCYHGDGQSYRGTYSTTVTGRTCQAWSSMTPHQHNRTTENYPNAGLIMNYCRNPDAVAAPYCYTRDPGVRWEYCNLTQCSDAEGTAVAPPTVTPVPSLEAPSEQ
131–244, 245–358, 359–472, 473–586, 587–700, 701–814, 815–928, 929–1042, 1043–1156, 1157–1270, 1271–1384, 1385–1498, 1499–1612, 1613–1726, 1727–1840, 1841–1954, 1955–2068, 2069–2182, 2183–2296, 2297–2410, 2411–2524, 2525–2638, 2639–2752, 2753–2866, 2867–2980, 2981–3094, 3095–3208, 3209–3322 | 114 | APTEQRPGVQECYHGNGQSYRGTYSTTVTGRTCQAWSSMTPHSHSRTPEYYPNAGLIMNYCRNPDAVAAPYCYTRDPGVRWEYCNLTQCSDAEGTAVAPPTVTPVPSLEAPSEQ
3323–3436 | 114 | APTEQRPGVQECYHGNGQSYRGTYSTTVTGRTCQAWSSMTPHSHSRTPEYYPNAGLIMNYCRNPDPVAAPYCYTRDPSVRWEYCNLTQCSDAEGTAVAPPTITPIPSLEAPSEQ
3437–3550 | 114 | APTEQRPGVQECYHGNGQSYQGTYFITVTGRTCQAWSSMTPHSHSRTPAYYPNAGLIKNYCRNPDPVAAPWCYTTDPSVRWEYCNLTRCSDAEWTAFVPPNVILAPSLEAFFEQ
3551–3664 | 114 | ALTEETPGVQDCYYHYGQSYRGTYSTTVTGRTCQAWSSMTPHQHSRTPENYPNAGLTRNYCRNPDAEIRPWCYTMDPSVRWEYCNLTQCLVTESSVLATLTVVPDPSTEASSEE
3665–3770 | 106 | APTEQSPGVQDCYHGDGQSYRGSFSTTVTGRTCQSWSSMTPHWHQRTTEYYPNGGLTRNYCRNPDAEISPWCYTMDPNVRWEYCNLTQCPVTESSVLATSTAVSEQ
3771–3884 | 114 | APTEQSPTVQDCYHGDGQSYRGSFSTTVTGRTCQSWSSMTPHWHQRTTEYYPNGGLTRNYCRNPDAEIRPWCYTMDPSVRWEYCNLTQCPVMESTLLTTPTVVPVPSTELPSEE
3885–3998 | 114 | APTENSTGVQDCYRGDGQSYRGTLSTTITGRTCQSWSSMTPHWHRRIPLYYPNAGLTRNYCRNPDAEIRPWCYTMDPSVRWEYCNLTRCPVTESSVLTTPTVAPVPSTEAPSEQ
3999–4112 | 114 | APPEKSPVVQDCYHGDGRSYRGISSTTVTGRTCQSWSSMIPHWHQRTPENYPNAGLTENYCRNPDSGKQPWCYTTDPCVRWEYCNLTQCSETESGVLETPTVVPVPSMEAHSEA
4113–4226 | 114 | APTEQTPVVRQCYHGNGQSYRGTFSTTVTGRTCQSWSSMTPHRHQRTPENYPNDGLTMNYCRNPDADTGPWCFTMDPSIRWEYCNLTRCSDTEGTVVAPPTVIQVPSLGPPSEQ
4227–4327 | 101 | DCMFGNGKGYRGKKATTVTGTPCQEWAAQEPHRHSTFIPGTNKWAGLEKNYCRNPDGDINGPWCYTMNPRKLFDYCDIPLCASSSFDCGKPQVEPKKCPGS

We now apply our Aperiod program to a typical protein sequence studied by Coward and Drablos (1998):
Acyl-[acyl-carrier-protein] (UDP-N-acetylglucosamine
O-acyltransferase, SwissProt entry: LPXA_ECOLI; accession no. P10440). UDP-N-acetylglucosamine 3-O-acyltransferase (LpxA) catalyzes the transfer of (R)-3-hydroxymyristic
acid from its acyl carrier protein thioester to UDP-N-acetylglucosamine. LpxA is the first enzyme in the lipid A
biosynthetic pathway and is a target for the design of antibiotics. The X-ray crystal structure of LpxA was determined
by Raetz and Roderick (1995) to 2.6 Å resolution and reveals
a domain motif composed of parallel beta strands, termed a
left-handed parallel beta helix (L beta H). This unusual fold
displays repeated violations of the protein folding constraint
requiring right-handed crossover connections between strands
of parallel beta sheets and may be present in other enzymes
that share amino acid sequence homology to the repeated
hexapeptide motif of LpxA.
The conversion of tetrahydrodipicolinate and succinyl-CoA
to N-succinyltetrahydrodipicolinate and CoA is catalyzed
by tetrahydrodipicolinate N-succinyltransferase (THDP succinyltransferase) and is the
committed step in the succinylase pathway by which bacteria
synthesize L-lysine and meso-diaminopimelate, a component of peptidoglycan. The X-ray crystal structure of THDP
succinyltransferase was determined by Beaman et al. (1997)
to 2.2 Å resolution and was refined to a crystallographic
R-factor of 17.0%. The enzyme is trimeric and displays
the left-handed parallel beta-helix (L beta H) structural motif
encoded by the ‘hexapeptide repeat’ amino acid sequence
motif (Raetz and Roderick, 1995). The approximate location
of the active site of THDP succinyltransferase was suggested by the proximity of binding sites for two inhibitors,
p-(chloromercuri)benzenesulfonic acid and cobalt ion, both
of which bind to the L beta H domain.
LPXA_ECOLI belongs to the transferase hexapeptide
repeat family and the LpxA subfamily, as shown in Figure 2 (PDB
code 1LXA). The Aperiod program was tested on the
left-handed parallel β helix (LβH) domain of LpxA, which
displays a distinctive sequence pattern that is likely to result
from structural constraints. We find a pseudo-periodic partition of the sequence of LPXA_ECOLI as shown in Table 3.
The sequence of the LβH domain, which is the subsequence of
LPXA_ECOLI from the first position to the 177th position,
consists of 11 pseudo-repeats in which the dominating repeats
are of length 18 and 15. In fact, the sequences of the LβH
domain of members of the LpxA family, as demonstrated
in Figure 2C, consist of the imperfect tandem repetition of
hexapeptide units (Vaara, 1992; Vuorio et al., 1994; Raetz and
Roderick, 1995; Parisi and Echave, 2001). These imperfect
tandem repeats have been accurately detected by the Aperiod
program. The results with Aperiod, as demonstrated
in Table 4, show strong agreement with the experimentally determined
positions of coils and loops. The hexapeptides
are characterized by a high degree of conservation of the third
position, which usually displays I, L or V (a one-letter code
is used to designate amino acids). Hexapeptide position 1
is also significantly conserved, although less so than position 3, whereas the other four hexapeptide sites (2, 4, 5 and 6)
are not conserved. Figure 2B shows that the residues of conserved sites 1 and 3 point toward the inside of the β helix,
whereas those in variable positions point toward the outside.
The LpxA family belongs to a larger superfamily of LβH
acyltransferases.

Fig. 2. LpxA of Escherichia coli and LpxA family (adapted from Parisi and Echave, 2001). (A) Cartoon view of the LβH domain of
the LpxA of E. coli. PDB entry 1LXA. The left-handed β helix is formed by nine triangular coils (C1–C9). Each coil is formed by three
hexapeptides, colored red, yellow and blue, respectively. Loops are colored gray. (B) Detailed view of coil C2 of A. Amino acids at conserved
hexapeptide positions 1 (A18, A24 and C30) and 3 (I20, I26 and V32) are labeled. Panels A and B were prepared with the program MOLMOL
(Koradi et al., 1996). (C) Multiple-sequence alignment of the LβH domain of the members of the LpxA family. The alignment was obtained
using CLUSTAL W (Thompson et al., 1994). Sequences are identified using the SwissProt/TrEMBL codes (http://www.expasy.ch/sprot/sprottop.html). Conserved substitutions are shaded in black if the whole column is conserved, and they are shaded in gray if 75% is conserved. The
following classes were used to judge conservation: aliphatic (ACILMV), aromatic (FHWY), polar (NQST), charged positive (KR), charged
negative (DE) and special (GP). Colors of the first line of C relate the alignment to the structure shown in A and B. In addition, different coils
(C1–C9) and hexapeptide third positions (dots) are explicitly indicated.

Table 3. Pseudo-periodic partition of LPXA_ECOLI sequence

Pseudo-periodic unit | Position | Length
MIDKSAFVHPTAIVEEGA | 1–18 | 18
SIGANAHIGPFCIVGPHV | 19–36 | 18
EIGEGTVLKSHVVVNGHT | 37–54 | 18
KIGRDNEIYQFASI | 55–68 | 14
GEVNQDLKYAGEPTR | 69–83 | 15
VEIGDRNRIRESVTI | 84–98 | 15
HRGTVQGGGL | 99–108 | 10
TKVGSDNLLMINAHIAHD | 109–126 | 18
CTVGNRCILANNATLAGH | 127–144 | 18
VSVDDFAIIGGMTAVHQF | 145–162 | 18
CIIGAHVMVGGCSGV | 163–177 | 15

Table 4. Comparison of Aperiod segmentation and LβH domain of LPXA_ECOLI

Aperiod segmentation | Experimental coils/loops | Comments
1–18 | 1–17 | Coil1
19–36 | 18–35 | Coil2
37–54 | 36–53 | Coil3
55–68 | 54–68 | Coil4
69–83 | 69–83 | Loop1
84–98 | 84–98 | Coil5
99–108 | 99–108 | Loop2
109–126 | 109–126 | Coil6
127–144 | 127–144 | Coil7
145–162 | 145–162 | Coil8
163–177 | 163–177 | Coil9

3.2 Determination of the granularity factor
The Aperiod program has one significant parameter, the granularity factor c
described in Definition 3. Finding the best value of c is key for applications
in biological sequence and structure analysis. We can see that the
sizes of the pseudo-periodic units in a pseudo-periodic partition of a sequence s are close to the pseudo-period length of s.
We look for a pseudo-periodic partition π of s with the minimum of DCE(π) + c · g(π), where g(π) is strongly related to the
sizes of the pseudo-periodic units in the partition. A smaller c
results in more pseudo-periodic units and a stricter approximation of the periodicity of the sequence. However, if c is
too small, the algorithm tends to find partitions with
too large a number of pseudo-periodic units (e.g. more than n/2 units), and it may then overlook some important
pseudo-periodic partition with small period length. Therefore, we need a good trade-off between the strictness of the
pseudo-periodicity and the number of pseudo-periodic units in
the partition.
We now turn to determining the best parameter c with
Aperiod for biological applications. To this end, we need
to define mathematically an objective function (Wan et al.,
2003), the ‘agreement accuracy (sensitivity)’, to measure the
efficiency of the Aperiod program for the particular investigation of pseudo-periodic partitions of protein sequences.
For any sequence s = s1 s2 · · · sn ∈ F, suppose that s[i..j]
is a pseudo-periodic unit of s detected by Aperiod, which
corresponds to a known true repeat unit s[k..l]. Then s[r..t]
is called a mutual region of s, where r = max{i, k} and
t = min{j, l}. Let p denote the total length of all mutual regions
of s. The agreement accuracy ω of the Aperiod partition is defined
as ω(s) = p/n. For example, as shown in Table 4, the
sequence s of LPXA_ECOLI has 11 pseudo-periodic units
in the Aperiod segmentation: s[1..18], s[19..36], s[37..54],
s[55..68], s[69..83], s[84..98], s[99..108], s[109..126],
s[127..144], s[145..162], s[163..177]. On the other hand,
s has 11 known repeat units, each of which corresponds to
an experimental coil or loop: s[1..17], s[18..35], s[36..53],
s[54..68], s[69..83], s[84..98], s[99..108], s[109..126],
s[127..144], s[145..162], s[163..177]. Then, s has 11 mutual
regions: s[1..17], s[19..35], s[37..53], s[55..68], s[69..83],
s[84..98], s[99..108], s[109..126], s[127..144], s[145..162],
s[163..177]. In this case, p = 174 and n = 177. Therefore, the agreement accuracy is ω = 174/177 ≈ 98.3%.
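The LPXA_ECOLI calculation above can be reproduced with a few lines of Python. This is only an illustrative sketch and not part of the Aperiod program: the function name agreement_accuracy is hypothetical, and it assumes that detected and known units are already paired in order, with 1-based inclusive positions as in Table 4.

```python
def agreement_accuracy(detected, known, n):
    """Return omega(s) = p / n for units given as 1-based inclusive (start, end).

    Assumption for this sketch: detected[i] corresponds to known[i].
    """
    if len(detected) != len(known):
        raise ValueError("detected and known units must be paired one-to-one")
    p = 0
    for (i, j), (k, l) in zip(detected, known):
        r, t = max(i, k), min(j, l)   # mutual region s[r..t]
        if t >= r:                    # skip pairs that do not overlap
            p += t - r + 1
    return p / n


# Positions taken from Table 4 (LPXA_ECOLI, n = 177).
aperiod_units = [(1, 18), (19, 36), (37, 54), (55, 68), (69, 83), (84, 98),
                 (99, 108), (109, 126), (127, 144), (145, 162), (163, 177)]
experimental_units = [(1, 17), (18, 35), (36, 53), (54, 68), (69, 83), (84, 98),
                      (99, 108), (109, 126), (127, 144), (145, 162), (163, 177)]

omega = agreement_accuracy(aperiod_units, experimental_units, 177)
print(f"omega = {omega:.3f}")  # 174/177, i.e. about 98.3%
```

Only the first three units differ from the experimental coils, each by one residue, which accounts for the three positions missing from p.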
We now consider the agreement accuracy when the Aperiod program is applied to
a sequence database. Let D denote a protein sequence database
or a protein sequence family. Then the agreement accuracy
ω_D(c) of the Aperiod partition for D, associated with the parameter c (the granularity factor), is defined as the average
of the agreement accuracy values of all individual sequences in D:

\[ \omega_D(c) = \frac{1}{|D|} \sum_{s \in D} \omega(s). \]

To discover the best parameter c for D, we should mathematically find an ‘extreme point’ c∗ at which ω_D(c)
achieves its maximum: c∗ = arg max_{c ∈ C} ω_D(c), where
‘arg max’ stands for the argument for which the function involved takes its maximum value, and C = {0.01,
0.02, . . . , 0.99, 1.00, 1.01, . . . , 1.99, 2.00} is a 200-element
set. A large number of computational examples show that
for c ≥ 2, all pseudo-periodic partitions of protein sequences
are identical. Thus, it is enough to deal with the cases
c ∈ C. We have developed a software program for finding an
extreme point c∗ in the set C and calculating the maximum
value ω_D(c∗) for D.
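The search for c∗ amounts to an exhaustive scan of the 200 candidate values. The sketch below shows the idea under stated assumptions: accuracy_fn(s, c) is a hypothetical callable that would run the partition with granularity factor c and return ω(s); here it is replaced by a toy function, since the Aperiod program itself is not reproduced. All function names are illustrative.

```python
from statistics import mean


def candidate_grid():
    """The 200 candidate values 0.01, 0.02, ..., 2.00."""
    return [i / 100 for i in range(1, 201)]


def database_accuracy(c, sequences, accuracy_fn):
    """Average of the per-sequence agreement accuracies for a given c."""
    return mean(accuracy_fn(s, c) for s in sequences)


def best_granularity(sequences, accuracy_fn):
    """Return (c_star, omega_star): the grid value maximizing the average."""
    grid = candidate_grid()
    scores = [database_accuracy(c, sequences, accuracy_fn) for c in grid]
    best = max(range(len(grid)), key=scores.__getitem__)
    return grid[best], scores[best]


def toy_accuracy(s, c):
    """Toy stand-in for omega(s) at granularity c; NOT the real Aperiod score."""
    return 1.0 - abs(c - 0.5) * len(s) / 100


c_star, omega_star = best_granularity(["ABAB", "ABCABC"], toy_accuracy)
print(c_star, omega_star)
```

The grid is built from integers divided by 100 rather than by repeated addition of 0.01, which avoids accumulated floating-point drift in the candidate values.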
Users can freely choose, as a training set, a biological sequence database
to which a query sequence of interest belongs.
The cut-off value c may be determined based on simulations
with random sequences (Benson, 1999). However, biological sequences, especially pseudo-periodic sequences, are
extremely different from truly random strings (Wan and
Wootton, 1999, 2000, 2002; Wan et al., 2003; Li et al.,
2003). Therefore, random strings may not be suitable
as a training set for discovering a good cut-off granularity factor for finding pseudo-periodic regions
and subsequences in biological sequences. In general, we
choose all repeats (including pseudo-repeats) in the SWISS-PROT amino acid sequence database (Release 40.22 of
24 June 2002) as a typical training set.
The query ‘[libs = {swiss_prot}-keywords: Repeat]’ found
8782 entries (http://us.expasy.org/cgi-bin/getentries?KW=Repeat&SRS=Perform&db=sp). Our computational experiments on all these 8782 repeats have shown that in this case
the extreme point is c∗ = 0.52 and the maximum is ω_D(c∗) = 0.976. In
other words, the best parameter for the automatic partition of
amino acid sequences into pseudo-repeats by the Aperiod
program is c∗ = 0.52. The average agreement accuracy (sensitivity) of the pseudo-periodic partitions generated by Aperiod
for all repeats in the SWISS-PROT database is 97.6%.
4 DISCUSSION
This paper has developed an efficient algorithm and a software
program, Aperiod, associated with a parameter c (the granularity factor), to find pseudo-periodic regions or sequences
within protein databases. The resulting segmentation corresponds very well to intuitive views of the detection of these
‘quasi-repeats’. The Aperiod program provides a useful bioinformatics tool for delineating functional and structural features of pseudo-periodic sequences. We have applied
Aperiod to five typical examples to demonstrate its utility for detecting protein domains, coils and loops
whenever they are quasi-repeats. We have also used these techniques to evaluate the abundance of quasi-repeating sequences
within the SWISS-PROT database. We find that the average
agreement accuracy of pseudo-periodic partitions, detected by
Aperiod for all quasi-repeats in the SWISS-PROT database,
is as high as 97.6%.
Pseudo-periodic regions are well defined, and are different from ‘simple’ (low-complexity) regions. Low-complexity
regions are regions of biased composition and
are often mosaics of a small number of amino acids. These
regions have been shown to be functionally important in some
proteins, but they are generally not very well understood.
SEG (Wootton and Federhen, 1993) and CAST (Promponas
et al., 2000) are two widely used, powerful tools for the
complexity analysis of biological sequence tracts. Recently,
we (Wan et al., 2003) proposed a new complexity function,
called the reciprocal complexity. Based on this complexity
measure, we developed an efficient algorithm and a software program, ‘DSR’, for classifying and analyzing simple
segments of protein and nucleotide sequence databases associated with scoring schemes. The significant difference between
DSR and both SEG and CAST is that DSR is much more general
and is associated with scoring schemes, while SEG and CAST
are not. DSR therefore has more
applications than SEG and CAST do. It is clear that, by
definition, only those lowest-complexity regions are pseudo-periodic, and those pseudo-periodic regions consisting of very
short pseudo-periodic units have low complexity. In contrast,
pseudo-periodic regions have very high complexity when they
comprise long pseudo-periodic units.
The approach developed in this paper is a general, efficient
methodology in bioinformatics and computational biology to
find biological functions from pseudo-periodic segments of
gene or protein sequences, which are remarkably informative
for inferring, describing and understanding biological properties. It can be utilized to look for precise molecular details
of biological structures, dynamics, interactions and evolution.
However, these important details cannot be inferred by using
sequence alignment for a large proportion of genomic and
deduced protein sequences for which relevant experimental
data or homologous precedents are lacking.
We are continuing to use the program Aperiod to reveal
pseudo-periodic features of protein sequence databases. For
instance, we use it to detect specific functional domains or
subdomains in proteins and to reveal the abundance of pseudo-periodic segments within protein sequences. We also apply
it to analyze whole genomes and proteomes in order to extract
the distribution of pseudo-periodic unit lengths and the number of pseudo-periodic units within the sequences. We then
want to find the correlation between this distribution and the
degree of order of the organisms, and to explore the role of
pseudo-periodic genes, proteins or regions in the molecular
evolution of modern organisms.
ACKNOWLEDGEMENTS
We are very grateful to three anonymous referees for many
valuable comments. This work was supported in part by NIH
grant R01-GM00028 and NSF grant DBI 0078307.
REFERENCES
Beaman,T.W., Binder,D.A., Blanchard,J.S. and Roderick,S.L.
(1997) Three-dimensional structure of tetrahydrodipicolinate N-succinyltransferase. Biochemistry, 36, 489–494.
Benson,G. (1999) Tandem repeats finder: a program to analyze DNA
sequences. Nucleic Acids Res., 27, 573–580.
Biou,V. et al. (1994) The 2.9 Å crystal structure of T. thermophilus
seryl-tRNA synthetase complexed with tRNA(Ser). Science, 263,
1404–1410.
Castellino,F.J. and Beals,J.M. (1987) The genetic relationships
between the kringle domains of human plasminogen, prothrombin, tissue plasminogen activator, urokinase, and coagulation
factor XII. J. Mol. Evol., 26, 358–369.
Coward,E. and Drablos,F. (1998) Detecting periodic patterns in
biological sequences. Bioinformatics, 14, 498–507.
Ikeo,K., Takahashi,K. and Gojobori,T. (1991) Evolutionary origin of
numerous kringles in human and simian apolipoprotein(a). FEBS
Lett., 287, 146–148.
Katti,M.V. et al. (2000) Amino acid repeat patterns in protein
sequences: their diversity and structural–functional implications.
Protein Sci., 9, 1203–1209.
Kobe,B. and Deisenhofer,J. (1993) Crystal structure of porcine ribonuclease inhibitor, a protein with leucine-rich repeats. Nature,
366, 751–756.
Koradi,R., Billeter,M. and Wuthrich,K. (1996) MOLMOL: a program for display and analysis of macromolecular structures.
J. Mol. Graph., 14, 29–32, 51–55.
Levenshtein,V.I. (1966) Binary codes capable of correcting deletions,
insertions, and reversals. Soviet Phys. Dokl., 10, 707–710.
Li,G., Kok,P. and Wan,H. (2003) Biological sequences are neither
simple, nor complex. in press.
Martin,J. (1999) Kringle domain. In T.E.Creighton (ed), Encyclopedia of Molecular Biology. John Wiley, p. 1353.
McLean,J.W. et al. (1987) cDNA sequence of human apolipoprotein(a) is homologous to plasminogen. Nature, 330, 132–137.
Parisi,G. and Echave,J. (2001) Structural constraints and emergence
of sequence patterns in protein evolution. Mol. Biol. Evol., 18,
750–756.
Patthy,L. (1985) Evolution of the proteases of blood coagulation and
fibrinolysis by assembly from modules. Cell, 41, 657–663.
Promponas,V.J. et al. (2000) CAST: an iterative algorithm for the
complexity analysis of sequence tracts. Complexity analysis of
sequence tracts. Bioinformatics, 16, 915–922.
Raetz,C.R.H. and Roderick,S.L. (1995) A left-handed parallel beta
helix in the structure of UDP-N-acetylglucosamine acyltransferase. Science, 270, 997–1000.
Schmidt,J.P. (1998) All highest scoring paths in weighted grid graphs
and their application to finding all repeats in strings. SIAM J.
Comput., 27, 972–992.
Smit,A.F. (1996) The origin of interspersed repeats in the human
genome. Curr. Opin. Genet. Dev., 6, 743–748.
Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W:
improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties
and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
Vaara,M. (1992) Eight bacterial proteins, including UDP-N-acetylglucosamine acyltransferase (LpxA) and three other transferases of Escherichia coli, consist of a six-residue periodicity
theme. FEMS Microbiol. Lett., 76, 249–254.
Vuorio,R. et al. (1994) The novel hexapeptide motif found in
the acyltransferases LpxA and LpxD of lipid A biosynthesis is
conserved in various bacteria. FEBS Lett., 337, 289–292.
Wan,H., Li,L., Federhen,S. and Wootton,J.C. (2003) Discovering
simple regions in biological sequences associated with scoring
schemes. J. Comput. Biol., 2, 171–185.
Wan,H. and Song,E. (2002) Quasiperiodic biosequences and modulo
incidence matrices. Proceedings of the International Parallel and
Distributed Processing Symposium: IPDPS 2002 Workshops.
Wan,H. and Wootton,J.C. (1999) Axiomatic foundations of complexity functions of biological sequences. Ann. Comb., 3, 105–127.
Wan,H. and Wootton,J.C. (2000) A global compositional complexity
measure for biological sequences: AT-rich and GC-rich genomes
encode less complex proteins. Comput. Chem., 24, 67–88.
Wan,H. and Wootton,J.C. (2002) Algorithms for computing lengths
of chains in integral partition lattices generated by sequences.
Theoret. Comput. Sci., 289, 783–800.
Wootton,J.C. and Federhen,S. (1993) Statistics of local complexity
in amino acid sequences and sequence databases. Comput. Chem.,
17, 149–163.
APPENDIX
A Proofs of Propositions 2 and 3
Proof of Proposition 2. Let $r = \gcd(p_1, p_2)$. We prove
Proposition 2 for a fixed $r$ by induction on $p_1 + p_2$. Clearly,
the proposition is true in the base case $p_1 = p_2 = r$. Suppose it is true whenever
the sum of the two periods is less than $p_1 + p_2$. Without loss
of generality, we assume that $p_1 > p_2$ and write $s = tu$, where
$t = t_1 t_2 \cdots t_{p_1 - r}$. Since $s$ is both $p_1$-periodic and $p_2$-periodic,
we have $t_i = s_i = s_{i+p_1} = s_{i+p_1-p_2} = t_{i+p_1-p_2}$. This means
that $t$ is $(p_1 - p_2)$-periodic. Note that $t$ is $p_2$-periodic and
$\gcd(p_1 - p_2, p_2) = r$. By the induction hypothesis, $t$ is
$r$-periodic. Since $t$ is a prefix of $s$, and $s$ is $p_2$-periodic, where
$r \mid p_2$, we immediately deduce that $s$ is $r$-periodic.
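The statement proved above can be checked by brute force on short strings. The sketch below rests on two assumptions that are not taken from the text of Proposition 2 itself: a string s of length n is called p-periodic when s_i = s_{i+p} for every valid i, and only periods p ≤ n/2 are considered (the bound used in the proof of Proposition 3). The function names are illustrative.

```python
from itertools import product
from math import gcd


def is_periodic(s, p):
    """True if s[i] == s[i + p] for every index i where both sides exist."""
    return all(s[i] == s[i + p] for i in range(len(s) - p))


def find_counterexample(max_len=10, alphabet="ab"):
    """Search all strings up to max_len for periods p1, p2 <= n/2 such that
    gcd(p1, p2) is NOT a period. Returns None if no such string exists."""
    for n in range(2, max_len + 1):
        for letters in product(alphabet, repeat=n):
            s = "".join(letters)
            periods = [p for p in range(1, n // 2 + 1) if is_periodic(s, p)]
            for p1 in periods:
                for p2 in periods:
                    if not is_periodic(s, gcd(p1, p2)):
                        return s, p1, p2
    return None


counterexample = find_counterexample()
print(counterexample)  # None: no violation among binary strings up to length 10
```

An exhaustive check over a small alphabet is of course not a proof, but it is a quick way to confirm that the period definition and the length bound have been read correctly.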
Proof of Proposition 3. Let $p_0$ be the minimum period
of $s$. Obviously, $p_0 \le p \le n/2$. If $s$ is $m$-periodic with
$1 \le m \le n/2$, then by Proposition 2 $s$ is also $\gcd(p_0, m)$-periodic. Since $\gcd(p_0, m) \le p_0$
and $p_0$ is the minimum period, we must have $p_0 = \gcd(p_0, m)$. Thus $p_0 \mid m$.
In other words, $m$ is a multiple of $p_0$.
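A small worked example makes the conclusion concrete. The sketch below lists every period up to n/2 of a toy string and confirms that each is a multiple of the smallest one. It assumes the same period definition as above (s_i = s_{i+p} for every valid i); the helper names and the toy string are illustrative only.

```python
def is_periodic(s, p):
    """True if s[i] == s[i + p] for every index i where both sides exist."""
    return all(s[i] == s[i + p] for i in range(len(s) - p))


def periods_up_to_half(s):
    """All periods p of s with 1 <= p <= len(s) / 2, in increasing order."""
    return [p for p in range(1, len(s) // 2 + 1) if is_periodic(s, p)]


s = "abababababab"                 # toy string of length 12
periods = periods_up_to_half(s)
p0 = periods[0]                    # minimum period
print(periods, all(m % p0 == 0 for m in periods))
```

For this string the periods not exceeding n/2 = 6 are 2, 4 and 6, all multiples of the minimum period 2.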
B Feasibility and effectiveness of algorithm Partition(AL)
The following lemma gives a precise characterization of the
empty substring $A(s^{(i)*})$, and a necessary and sufficient
condition for the termination of algorithm Partition(AL).
Lemma 1. $A(s^{(i)*}) = \phi$ if and only if $s^{(i)}$ is a suffix of $s$.
Proof. We have $A(s^{(i)*}) = \phi$ whenever the shortest
path, corresponding to the alignment AL, must pass through
the point $(j, j)$, where $j$ is the length of the sequence
$s^{(1)} s^{(2)} \cdots s^{(i)}$. But there are only two such points that we allow
the shortest path to reach: $(0, 0)$ and $(n, n)$. Clearly $j > 0$.
Thus $j = n$, i.e. $s^{(i)}$ is a suffix of $s$.
The following lemma tells us that algorithm Partition(AL)
really generates a partition.
Lemma 2. Let $\pi = \{s^{(1)}, s^{(2)}, \ldots, s^{(k)}\}$ be the output
generated by Partition(AL); then $\pi$ is a real partition of $s$.
Proof. Note that $s^{(1)} = s_1 s_2 \cdots s_r$ at the bottom line of AL
is a prefix of $s$ that is aligned with all ‘−’s before $s^{(1)}$ at the top
line. Thus, $s^{(2)} = A(s^{(1)*})$ is a prefix of $s - s^{(1)}$, and $s^{(1)} s^{(2)}$ is a
prefix of $s$; $s^{(3)} = A(s^{(2)*})$ is a prefix of $s - s^{(1)} s^{(2)}$, and $s^{(1)} s^{(2)} s^{(3)}$
is a prefix of $s$; and so on. Here, $s - s'$ denotes the subsequence
of $s$ obtained by removing the prefix $s'$. Finally, we have
$s^{(1)} s^{(2)} \cdots s^{(k)} = s$.
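The property established in Lemma 2 (consecutive units, each a prefix of what remains, concatenating to the whole sequence) can be illustrated with the units of Table 3. The sketch below does not implement Partition(AL); it only derives the 1-based positions implied by concatenating the listed units and compares them with the positions printed in Table 3. The function name unit_positions is hypothetical.

```python
def unit_positions(units):
    """Return 1-based inclusive (start, end) positions of consecutive units."""
    positions, start = [], 1
    for unit in units:
        end = start + len(unit) - 1
        positions.append((start, end))
        start = end + 1               # next unit begins right after this one
    return positions


# Pseudo-periodic units and positions as listed in Table 3 (LPXA_ECOLI).
table3_units = [
    "MIDKSAFVHPTAIVEEGA", "SIGANAHIGPFCIVGPHV", "EIGEGTVLKSHVVVNGHT",
    "KIGRDNEIYQFASI", "GEVNQDLKYAGEPTR", "VEIGDRNRIRESVTI", "HRGTVQGGGL",
    "TKVGSDNLLMINAHIAHD", "CTVGNRCILANNATLAGH", "VSVDDFAIIGGMTAVHQF",
    "CIIGAHVMVGGCSGV",
]
table3_positions = [(1, 18), (19, 36), (37, 54), (55, 68), (69, 83), (84, 98),
                    (99, 108), (109, 126), (127, 144), (145, 162), (163, 177)]

positions = unit_positions(table3_units)
print(positions == table3_positions, len("".join(table3_units)))
```

The units tile positions 1 to 177 with no gaps or overlaps, as a partition of the LβH domain should.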
For a self-alignment AL of $s$, we define Cost(AL) as
the length of the corresponding path from $(0, 0)$ to $(n, n)$
in $G$, the DWG made by AL. Interestingly, for the optimal
AL corresponding to the shortest path from $(0, 0)$ to $(n, n)$
in $G$, Cost(AL) achieves the minimal distance to periodic
partitions of $s$, as described in the following lemma.
Lemma 3. Let AL be the optimal alignment corresponding
to the shortest path from $(0, 0)$ to $(n, n)$ in $G$. Then $B_c(s) =$
Cost(AL).
Proof. Let $\pi = \{s^{(1)}, s^{(2)}, \ldots, s^{(k)}\}$ be the partition of $s$
generated by Partition(AL). We can see that $D(\pi) + c \cdot
g(\pi) =$ Cost(AL). By the definition of $B_c(s)$, we have
$D(\pi) + c \cdot g(\pi) \ge B_c(s)$. If $D(\pi) + c \cdot g(\pi) > B_c(s)$, then
there must exist a partition $\pi_0$ of $s$ such that $D(\pi) + c \cdot g(\pi) >
D(\pi_0) + c \cdot g(\pi_0)$. So we can construct a self-alignment AL′
of $s$ such that Cost(AL′) $= D(\pi_0) + c \cdot g(\pi_0) <$ Cost(AL).
A path of length Cost(AL′) is thus found from $(0, 0)$ to
$(n, n)$ in $G$. This contradicts the assumption that AL is an optimal alignment corresponding to the shortest path from $(0, 0)$ to $(n, n)$
in $G$.
Lemma 3 demonstrates that any partition of $s$ generated
from an optimal self-alignment AL using algorithm Partition(AL) is a
pseudo-periodic partition of $s$. Combining Lemmas 1, 2 and 3
yields Theorem 1.