BIOINFORMATICS Vol. 20 no. 3 2004, pages 295–306
DOI: 10.1093/bioinformatics/btg404

Pseudo-periodic partitions of biological sequences

Lugang Li1,2,†, Renchao Jin2,3,4,†, Poh-Lin Kok2 and Honghui Wan2,5,6,∗

1 Department of Protective Medicine, Nanjing Army Medical College, The Second Army Medical University, Nanjing, Jiangsu 210099, China
2 Laboratory of Bioinformatics, Maryland Institute of Dynamic Genomics, 3910 Jeffry Street, Silver Spring, MD 20906, USA
3 Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19139, USA
4 School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
5 Global Bioinformatics Laboratory, National Center for Genome Resources, 2935 Rodeo Park Drive East, Santa Fe, NM 87505, USA
6 National Center for Toxicogenomics, National Institute of Environmental Health Sciences, National Institutes of Health, P.O. Box 12233, Mail Drop F1-05, 111 T. W. Alexander Drive, Research Triangle Park, NC 27709, USA

Received on February 2, 2003; revised on August 3, 2003; accepted on August 5, 2003

ABSTRACT

Motivation: Algorithm development for finding typical patterns in sequences, especially multiple pseudo-repeats (pseudo-periodic regions), is at the core of many problems arising in biological sequence and structure analysis. In fact, one of the most significant features of biological sequences is their high quasi-repetitiveness. Variation in the quasi-repetitiveness of genomic and proteomic texts demonstrates the presence and density of different biologically important information. It is very important to develop sensitive automatic computational methods for the identification of pseudo-periodic regions of sequences through which we can infer, describe and understand biological properties, and seek precise molecular details of biological structures, dynamics, interactions and evolution.
Results: We develop a novel computational tool for partitioning a sequence into pseudo-periodic regions. A pseudo-periodic partition is defined as a partition that, intuitively, has the minimal bias to some perfect-periodic partition of the sequence, measured by the evolutionary distance. We devise a quadratic time and space algorithm for detecting a pseudo-periodic partition of a given sequence, which corresponds to the shortest path in the main diagonal of the directed (acyclic) weighted graph constructed by the Smith–Waterman self-alignment of the sequence. We use several typical examples to demonstrate the use of our algorithm and software system in detecting functional or structural domains and regions of proteins. A major advantage of our software program is that it has a parameter, the granularity factor, and we can freely choose a biological sequence family as a training set to determine the best value of this parameter. In general, we choose all repeats (including many pseudo-repeats) in the SWISS-PROT amino acid sequence database as a typical training set. We show that the granularity factor is 0.52 and that the average agreement accuracy of the pseudo-periodic partitions detected by our software for all pseudo-repeats in the SWISS-PROT database is as high as 97.6%.

Availability: The program is available upon request from Honghui Wan and will also be available at http://www.mindgen.org

Contact: [email protected]

∗ To whom correspondence should be addressed.
† The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Bioinformatics 20(3) © Oxford University Press 2004; all rights reserved.

1 INTRODUCTION

DNA and protein sequences are neither 'complex' nor 'simple' (Li et al., 2003). They do not in general resemble random strings of letters, but rather consist of a heterogeneous mixture of local regions with distinct genetic functions and evolutionary origins.
These regions show many different compositional characteristics and types of sequence patterns, as if written in a mosaic of different languages. Genomic sequences contain, for example, coding sequences, untranslated regions, introns, exons, intergenic regions, promoters, terminators, regulatory signals, RNA genes, direct or inverted repeats of widely different sizes in tandem or interspersed arrangements, microsatellites, CpG islands, centromeres, telomeres and origins of replication.

At the genomic level, many important genomes, such as those of human, pathogens, parasites and maize, contain numerous 'pseudo-periodic regions'; this is especially true of eukaryotes (higher-order organisms whose DNA is enclosed in a cell nucleus). About 10–25% of total DNA in higher eukaryotes consists of short (5–10 nt) sequences that are randomly repeated thousands of times. These DNA segments have a different natural density due to differences in base composition. Over a third of the human genome consists of interspersed repetitive sequences that are primarily degenerate copies of transposable elements (Smit, 1996). In particular, most of the human Y chromosome consists of pseudo-periodic segments, and overall families of reiterated sequences account for about one-third of the human genome (Wan and Song, 2002, http://www.computer.org/proceedings/ipdps/1573/workshops/15730187babs.htm). Most plant and animal genomes consist largely of repetitive DNA: perhaps 30 sequence motifs, typically 1–10 000 nt long, present many hundreds or thousands of times in the genome, which may be located at a few defined chromosomal sites or widely dispersed.

Amino acid repeats are common in many proteins. The protein sequences in the SWISS-PROT database contain many single amino acid repeats, tandem oligo-peptide repeats and periodically conserved amino acids. Single amino acid repeats of glutamine, serine, glutamic acid, glycine and alanine seem to be tolerated to a considerable extent in many proteins.
Tandem oligo-peptide repeats of different types with varying levels of conservation have been detected in several proteins and found to be conspicuous, particularly in structural and cell surface proteins (Katti et al., 2000). Although the significance of amino acid repeats in protein structure and function has been demonstrated in some proteins, it remains largely unclear. Recently, they have gained much attention due to the association of several neuro-degenerative disorders with unstable poly-glutamine repeats in the affected proteins. It appears that repeated sequence patterns may be a mechanism that provides regular arrays of spatial and functional groups, useful for structural packing or for one-to-one interactions with target molecules.

The detection of multiple repeats in biological sequences or sequence databases is an important problem in bioinformatics and computational biology. Repeats occur frequently in biological sequences, yet they are seldom exact. Hence, we focus our attention on approximately repeated patterns (pseudo-periodic regions). Many algorithms have been developed for finding pseudo-repeats in sequences, most of which are based on computing alignment scores. The best algorithm is due to Schmidt (1998); it uses weighted grid digraphs to discover all locally optimal approximate repeats within a sequence of length $n$ in $O(n^2 \log n)$ time and $O(n^2)$ space. Wan and Song (2002) used module incidence matrices to find quasiperiods in biological sequences in linear time and space, but all quasiperiodic units must have the same length except for the last one.

Almost all existing algorithms for finding multiple repeats are based on the determination of some consensus pattern units of the same length within a sequence. However, a consensus pattern does not always exist within a sequence.
For example, the protein sequence S = ACDACDEACDEFACDEFGCDEFGAEFDCFDFD can be partitioned into eight consecutive subsequences: ACD, ACDE, ACDEF, ACDEFG, CDEFG, AEFD, CFD, FD. The partition {ACD, ACDE, ACDEF, ACDEFG, CDEFG, AEFD, CFD, FD} is called a pseudo-periodic partition of S. It can be clearly seen that the pseudo-periodic pattern unit 'ACD' gradually and smoothly changes into the pseudo-periodic pattern unit 'FD', but no fixed pattern unit exists and most pattern units have different sizes. In this case none of the existing algorithms can be used to find a pseudo-periodic partition of S.

In this paper, we apply the dynamic programming approach to a special type of self-alignment with a granularity factor to develop an $O(n^2)$-complexity algorithm for detecting a 'best possible' pseudo-periodic partition of a sequence of length $n$. In Section 2, we first mathematically define a perfect-periodic sequence by means of the consecutive edit distance. Then we define a pseudo-periodic partition of a sequence associated with a granularity factor, which is a natural generalization of perfect periodicity. After that, we develop a quadratic time and space algorithm for detecting a pseudo-periodic partition of a given sequence, which corresponds to the shortest path in the main diagonal of the directed (acyclic) weighted graph (DWG) constructed by the Smith–Waterman self-alignment of the sequence. In Section 3, we use several typical examples to demonstrate the use of our algorithm and software program in detecting functional or structural domains and regions of proteins. A major advantage of our software system is that it has a parameter, the granularity factor, and we can freely choose a biological sequence family as a training set to determine the best value of this parameter. In general, we choose all repeats (including pseudo-repeats) in the SWISS-PROT amino acid sequence database as a typical training set.
We find that the granularity factor is 0.52 and that the average agreement accuracy of the pseudo-periodic partitions detected by our software for all pseudo-repeats in the SWISS-PROT database is as high as 97.6%.

2 METHODS AND ALGORITHMS

2.1 Periodic partition

Let $A = \{a_1, a_2, \ldots, a_m\}$ be a finite alphabet of $m$ letters in which $a_i$ is called the letter of type $i$ ($1 \le i \le m$). Symbolic sequences are characterized by $A$ and (usually) by a finite length $n$. One-dimensional strings play an important role in various fields, such as informatics, dynamical systems, biology, communication theory, linguistics and psychology. In particular, the digital information that underlies genetics, genomics, proteomics, biochemistry, cell biology and development can be represented by a simple string on a 4-letter alphabet (four nucleotides: A, C, G, T) or a 20-letter alphabet (20 amino acids: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y), which can be regarded as the 'data structure' of life.

We denote by $F$ the set of all finite sequences over $A$. A sequence in $F$ is typically written as $s = s_1 s_2 \cdots s_n$, where $s_i \in A$, and we denote by $|s| = n$ the length of the sequence. For any $1 \le i \le n$, $s_1 s_2 \cdots s_i$ is a prefix of $s$ and $s_i s_{i+1} \cdots s_n$ is a suffix of $s$. Given a sequence $s \in F$, a partition of $s$ is a sequence $\{s^{(1)}, s^{(2)}, \ldots, s^{(k)}\}$ of contiguous segments of $s$ such that $s = s^{(1)} s^{(2)} \cdots s^{(k)}$, where $1 \le k \le n$, $|s^{(1)}| + |s^{(2)}| + \cdots + |s^{(k)}| = n$, and $1 \le |s^{(i)}| \le n$ for each $i$ with $1 \le i \le k$. The segments $s^{(1)}, s^{(2)}, \ldots, s^{(k)}$ are the parts of the partition. For example, the protein sequence s = AACDDEFGGHHHHIKL has many partitions, and three of them are {AAC, DDEF, GGHH, HHIKL}, {AACDDEFGGHHHHIKL} and {A, A, C, D, D, E, F, G, G, H, H, H, H, I, K, L}. For a sequence $s = s_1 s_2 \cdots s_n$, there are two special partitions. One is $\{s_1 s_2 \cdots s_n\}$, consisting of only one part, the sequence itself.
This partition is called the least partition of $s$. The other is $\{s_1, s_2, \ldots, s_n\}$, consisting of $n$ parts, each of which is a single letter. We call this partition the largest partition of $s$. Furthermore, for any $k$ with $1 \le k \le n$, if we partition the sequence $s$ into $k$ consecutive subsequences, there are $\binom{n-1}{k-1}$ different ways to choose $k-1$ cut positions from the $n-1$ possible cut positions, where

\[ \binom{n-1}{k-1} = \frac{(n-1)!}{(k-1)!\,(n-k)!} \]

is the binomial coefficient. Thus, there are $\binom{n-1}{k-1}$ different partitions of $s$ into $k$ parts for any $k$, $1 \le k \le n$. We denote by $P(s)$ the set of all partitions of the sequence $s$. We have:

Proposition 1. The total number of partitions of a sequence $s$ is given by

\[ |P(s)| = \sum_{k=1}^{n} \binom{n-1}{k-1} = 2^{n-1}, \qquad (1) \]

where $n$ is the length of $s$.

Definition 1. For a sequence $s$ of length $n$ and a positive integer $p < n$, we say that $s$ is $p$-periodic if there exists a partition $\pi = \{s^{(1)}, s^{(2)}, \ldots, s^{(k)}\}$ of $s$ such that $k = \lceil n/p \rceil$, $s^{(1)} = s^{(2)} = \cdots = s^{(k-1)}$ and $s^{(k)}$ is a prefix of $s^{(k-1)}$. Here $\lceil x \rceil$ denotes the ceiling function of $x$. The partition $\pi$ is called the $p$-periodic partition of $s$. The last part of the partition may be an incomplete periodic unit, in which case $|s^{(k)}| < p$.

For example, if s = HLPRVHLPRVHLPRVHLPRVHLP, then {HLPRV, HLPRV, HLPRV, HLPRV, HLP} is a 5-periodic partition of s, while {HLPRVHLPRV, HLPRVHLPRV, HLP} is a 10-periodic partition of s. Here, HLP is an incomplete periodic unit of length 3. Generally, if $\pi = \{s^{(1)}, s^{(2)}, \ldots, s^{(k)}\}$ is a $p$-periodic partition of $s$, then $|s^{(i)}| = p$ for $1 \le i \le k-1$, where every such $s^{(i)}$ is a complete periodic unit, and $1 \le |s^{(k)}| \le p$, i.e. only $s^{(k)}$ is allowed to be an incomplete periodic unit. If $n$ is a multiple of $p$, then $s^{(k-1)} = s^{(k)}$. By Definition 1, a sequence $s = s_1 s_2 \cdots s_n$ is $p$-periodic if and only if $s_i = s_{i+p}$ holds for every $i$, $1 \le i \le n-p$. If $s$ is $p$-periodic and $kp < n$ for some positive integer $k$, then $s$ is also $kp$-periodic.
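The characterization in Definition 1 (a sequence is p-periodic exactly when every letter equals the letter p positions later) translates directly into a few lines of code. The following Python sketch is illustrative only and is not part of the authors' software; the function names `is_periodic`, `periodic_partition` and `periods` are hypothetical.

```python
def is_periodic(s: str, p: int) -> bool:
    """True if s is p-periodic: 0 < p < n and s[i] == s[i + p] for all valid i."""
    n = len(s)
    return 0 < p < n and all(s[i] == s[i + p] for i in range(n - p))


def periodic_partition(s: str, p: int) -> list[str]:
    """Return the p-periodic partition of s (ceil(n / p) parts).

    The last part may be an incomplete periodic unit shorter than p.
    """
    if not is_periodic(s, p):
        raise ValueError(f"sequence is not {p}-periodic")
    return [s[i:i + p] for i in range(0, len(s), p)]


def periods(s: str) -> list[int]:
    """All p < n for which s is p-periodic, in increasing order."""
    return [p for p in range(1, len(s)) if is_periodic(s, p)]


s = "HLPRVHLPRVHLPRVHLPRVHLP"
print(periodic_partition(s, 5))   # ['HLPRV', 'HLPRV', 'HLPRV', 'HLPRV', 'HLP']
print(periodic_partition(s, 10))  # ['HLPRVHLPRV', 'HLPRVHLPRV', 'HLP']
print(periods(s))                 # [5, 10, 15, 20]
```

The check runs in O(n) time for a single p, so listing all periods this way is quadratic in the worst case; that is adequate for short illustrative sequences.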
Moreover, we have the following propositions, which describe the structure of the set of periods of a sequence. Detailed proofs are provided in Appendix A.

Proposition 2. If a sequence $s = s_1 s_2 \cdots s_n$ is $p_1$-periodic and $p_2$-periodic and $n \ge p_1 + p_2$, then $s$ is also $\gcd(p_1, p_2)$-periodic, where $p_1$ and $p_2$ are two positive integers, and $\gcd(p_1, p_2)$ denotes the greatest common divisor of $p_1$ and $p_2$.

Proposition 3. For a $p$-periodic sequence $s$ of length $n$ with $p \le n/2$, there exists a positive integer $p_0 \le n/2$ such that, for any positive integer $m$ with $1 \le m \le n/2$, $s$ is $m$-periodic if and only if $m$ is a multiple of $p_0$.

Proposition 3 does not hold for $m > n/2$. For example, s = ACAACA is both 3-periodic and 5-periodic, but 5 is not a multiple of 3. This is because the unit ACAAC does not repeat completely twice in ACAACA. Intuitively, a period should repeat at least twice, so we require that $m \le n/2$. We call the integer $p_0$ in Proposition 3 the atomic period (length) of $s$. The periodic partition of $s$ corresponding to $p_0$ is called the atomic periodic partition of $s$. Proposition 3 demonstrates the significance of the atomic periodic partition in the study of periodic partitions of $s$. Thus, we mainly concentrate on the atomic periodic partition of $s$ whenever $s$ is $p$-periodic for some $p < n$.

Definition 2. Let $\pi = \{s^{(1)}, s^{(2)}, \ldots, s^{(k)}\}$ be a partition of a sequence $s$, where $|s^{(1)}| = |s^{(2)}| = \cdots = |s^{(k-1)}| = q \ge |s^{(k)}| = r$. Then the consecutive edit distance (CE-distance) of the partition $\pi$ is defined as

\[ D_{CE}(\pi) = \sum_{i=1}^{k-1} d(s^{(i)}, s^{(i+1)}), \qquad (2) \]

where $d(s^{(i)}, s^{(i+1)})$ is the edit distance between $s^{(i)}$ and $s^{(i+1)}$ for $1 \le i \le k-2$, and $d(s^{(k-1)}, s^{(k)}) = \min\{d \mid d$ is the edit distance between $s^{(k)}$ and a prefix of $s^{(k-1)}\}$.

Note that, by convention, $D_{CE}(\pi) = 0$ for $k = 1$. For $k > 1$, the definition of $d(s^{(k-1)}, s^{(k)})$ is adapted to deal with the case in which $s^{(k)}$ is an incomplete periodic unit. Obviously, $d(s^{(i)}, s^{(i+1)}) \ge 0$ for $1 \le i \le k-1$.
Moreover, $d(s^{(i)}, s^{(i+1)}) = 0$ whenever $s^{(i)} = s^{(i+1)}$ for $1 \le i \le k-2$, and $d(s^{(k-1)}, s^{(k)}) = 0$ whenever $s^{(k)}$ is a prefix of $s^{(k-1)}$. It follows that the periodicity of a sequence can be characterized in terms of the consecutive edit distance, as stated in the following proposition.

Proposition 4. For a sequence $s$ of length $n$ and a positive integer $p < n$, $s$ is $p$-periodic if and only if there exists a partition $\pi = \{s^{(1)}, s^{(2)}, \ldots, s^{(\lceil n/p \rceil)}\}$ of $s$ such that $D_{CE}(\pi) = 0$.

Coward and Drablos (1998) partitioned a sequence $s$ into consecutive subsequences $s^{(1)}, s^{(2)}, \ldots, s^{(k)}$ of fixed length $p$ each, and defined a measure of the 'mutual agreement' between the subsequences as

\[ M_p(s) = \sum_{j < i} d(s^{(i)}, s^{(j)}), \]

where $d$ is a metric on the alphabet $A$. $M_p(s)$ has the same property as $D_{CE}(\pi)$: $s$ is $p$-periodic if and only if $M_p(s) = 0$. However, an insertion or a deletion in $s$ influences $M_p(s)$ greatly. Our $D_{CE}(\pi)$ is more stable than $M_p(s)$. In particular, if $s$ is almost periodic except for a few substitutions or indels, then $D_{CE}(\pi)$ will be quite small. In addition, the existence of an incomplete last periodic unit does not influence $D_{CE}(\pi)$ at all. Because of this, $D_{CE}(\pi)$ offers a better measure of the weak periodicity of sequences.

For example, s = ABCDEABCDEABCDEABCDE is 5-periodic. In this case, $M_5(s) = 0$ and $D_{CE}(\pi) = 0$ for π = {ABCDE, ABCDE, ABCDE, ABCDE}. If an insertion happens, say a 'C' is inserted between the second 'B' and 'C' in the sequence, we get a new sequence of length 21: s′ = ABCDEABCCDEABCDEABCDE. To form blocks of length 5, the last letter is then discarded, and $M_5(s') = 10$. However, $D_{CE}(\pi') = 2$ for π′ = {ABCDE, ABCCDE, ABCDE, ABCDE}.

2.2 Pseudo-periodic partition

Now, we suppose that a sequence $s$ of length $n$ is not $p$-periodic for any positive integer $p < n$. We must then investigate its pseudo-periodicity, which is a natural generalization of the perfect periodicity defined in Definition 1.
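The consecutive edit distance of Definition 2 can be made concrete with a short Python sketch. It assumes the Levenshtein distance (unit cost for each substitution, insertion and deletion) as the edit distance d, and it takes the parts of a partition as a list of strings. The names `levenshtein` and `ce_distance` are hypothetical and are not taken from the authors' software. As in the ABCDE example, parts of unequal length are accepted, and the last part is compared against every non-empty prefix of the part before it.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance between a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def ce_distance(parts: list[str]) -> int:
    """Consecutive edit distance D_CE of a partition given as a list of parts."""
    if len(parts) <= 1:
        return 0  # convention: D_CE = 0 for a one-part partition
    # Full edit distance between consecutive parts, except for the last pair.
    total = sum(levenshtein(parts[i], parts[i + 1]) for i in range(len(parts) - 2))
    # Last pair: best distance between the final part and a prefix of its predecessor.
    before_last, last = parts[-2], parts[-1]
    total += min(levenshtein(before_last[:m], last)
                 for m in range(1, len(before_last) + 1))
    return total


print(ce_distance(["ABCDE", "ABCDE", "ABCDE", "ABCDE"]))   # 0
print(ce_distance(["ABCDE", "ABCCDE", "ABCDE", "ABCDE"]))  # 2
print(ce_distance(["HLPRV", "HLPRV", "HLP"]))              # 0
```

The third call illustrates that an incomplete last unit contributes nothing to the distance, because HLP is a prefix of HLPRV.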
Motivated by Proposition 4, one might use the minimal consecutive evolutionary edit distance as a measure of the distance of a sequence to the 'nearest periodic partition'. However, this measure does not work properly. As an example, consider the sequence s = AKLAKMAKNAKP. The following are three partitions of s: π1 = {AKL, AKM, AKN, AKP}, π2 = {AKLAKM, AKNAKP} and π3 = {AKLAKMAKNAK, P}. Intuitively, π1 reveals the most approximate periodicity of s and π3 the least. However, $D_{CE}(\pi_1) = 3$, $D_{CE}(\pi_2) = 2$ and $D_{CE}(\pi_3) = 1$ if we take the edit distance to be the Levenshtein distance (Levenshtein, 1966), in which matching letters score 0 and deletions, insertions and substitutions of letters score 1. So the Levenshtein distance cannot be used directly to detect pseudo-periodic regions, and we must find out which factor we have neglected.

For a partition $\pi = \{s^{(1)}, s^{(2)}, \ldots, s^{(k)}\}$ of a sequence $s$, the two end parts of $s$ are less involved than the internal parts of $s$ in counting $D_{CE}(\pi)$. One end is the first subsequence in $\pi$, i.e. $s^{(1)}$. The other end is more complicated to identify. It is not just the final subsequence $s^{(k)}$, because we allow an incomplete final periodic unit. Rather, it is the suffix of $s$ starting from the letter that is next to the letter aligned with $s_n$ in the alignment corresponding to $d(s^{(k-1)}, s^{(k)})$. We denote this suffix of $s$ by $\tilde{s}^{(k)}$. The sum $|s^{(1)}| + |\tilde{s}^{(k)}|$ is an important factor that makes π1 better than π2 and π3, even though $D_{CE}(\pi_1) > D_{CE}(\pi_2) > D_{CE}(\pi_3)$. Let $g(\pi) = (|s^{(1)}| + |\tilde{s}^{(k)}|)/2$, and let $c > 0$ denote a constant, which is the cost of a single indel (gap) in a self-alignment of $s$. $g(\pi)$ is called the granularity of $\pi$, and $c$ the granularity factor. We have:

Definition 3. For a sequence $s$, let $B_c(s) = \min\{D_{CE}(\pi) + c \cdot g(\pi) \mid \pi \in P(s)\}$. Then $B_c(s)$ is called the minimal distance to periodic partitions of $s$. A partition $\pi_q = \{s^{(1)}, s^{(2)}, \ldots, s^{(k)}\}$ of $s$ is called the pseudo-periodic partition of $s$ if $D_{CE}(\pi_q) + c \cdot g(\pi_q) = B_c(s)$.

In the above example regarding s = AKLAKMAKNAKP, $g(\pi_1) = (3 + 3)/2 = 3$, $g(\pi_2) = (6 + 6)/2 = 6$ and $g(\pi_3) = (11 + 11)/2 = 11$. With $c = 1$, the three partitions therefore score $3 + 3 = 6$, $2 + 6 = 8$ and $1 + 11 = 12$, respectively, and $B_c(s) = 6 = D_{CE}(\pi_1) + g(\pi_1)$. Thus, according to Definition 3, π1 is the pseudo-periodic partition of s for $c = 1$.

Definition 4. If $\pi = \{s\}$ is a pseudo-periodic partition of $s$ and $D_{CE}(\pi) = 0$, then $\pi$ is called the atomic periodic partition of $s$ and $g(\pi)$ is the atomic period length of $s$.

The concept of a pseudo-periodic partition is a generalization of that of an atomic periodic partition. Starting from this generalized concept, we can investigate the pseudo-periodicity of a sequence that has no fixed weak-periodic pattern and no fixed period length. For example, π = {AKY, AKYV, AKYVN, AKYVN, FKYVN, FCDEY, FVY, FNY, FY} is the pseudo-periodic partition of s = AKYAKYVAKYVNAKYVNFKYVNFCDEYFVYFNYFY. In this case, s has nine weak-periodic patterns: AKY, AKYV, AKYVN, AKYVN, FKYVN, FCDEY, FVY, FNY, FY. We can clearly see from this partition that the pseudo-periodic patterns change gradually along the sequence.

2.3 Algorithm design

Our task is to develop an efficient algorithm for detecting the pseudo-periodic partition of a given sequence. We start from the DWG constructed by the Smith–Waterman self-alignment. For the DNA sequence s = AGAGA, the DWG made by the Smith–Waterman self-alignment of s is shown in Figure 1. Every edge goes from the bottom-left cell to the top-right cell. Before the modification, every vertical or horizontal edge is assigned a gap weight, while every diagonal edge has a weight equal to the substitution cost between the corresponding letters. The shortest path from point (0, 0) to point (5, 5) is then the main diagonal from (0, 0) to (5, 5), which corresponds to the optimal alignment of the sequence s with itself, i.e. every letter aligns with itself. To solve our problem, we need to make some modifications to this graph.
First, change the weights of all edges included in or crossed by the boundary of the shadowed area to $+\infty$. This means that a shortest path is never allowed to pass through the shadowed area. Next, change the weights of all leftmost vertical edges and all topmost horizontal edges to $c/2$. The thick black line in Figure 1 shows the new shortest path from (0, 0) to (5, 5). It corresponds to the alignment:

    --AGAGA
    AGAGA--

Fig. 1. Directed (acyclic) weighted graph (DWG) constructed by the Smith–Waterman self-alignment of the DNA sequence s = AGAGA. (Horizontal axis i, vertical axis j; source s = (0, 0), sink t = (5, 5); the leftmost vertical edges and the topmost horizontal edges carry weight c/2.)

This alignment can be divided into four parts:

    -- | AG | AG | A
    AG | AG | A- | -

Finally, we get a partition of s: π = {AG, AG, A}. It is a pseudo-periodic partition of s, and is also the atomic periodic partition of s.

Now, we show that this procedure does not succeed by accident. For a sequence $s = s_1 s_2 \cdots s_n$, we construct a DWG with $(n+1) \times (n+1)$ vertices, $G = \langle V, E, f \rangle$, where

\[ V = \{(i, j) \mid 0 \le i, j \le n\}, \]

\[ E = \{\langle (i, j), (i, j+1) \rangle \mid 0 \le i \le n,\ 0 \le j < n\} \cup \{\langle (i, j), (i+1, j) \rangle \mid 0 \le i < n,\ 0 \le j \le n\} \cup \{\langle (i, j), (i+1, j+1) \rangle \mid 0 \le i, j < n\}, \]

and $f : E \to \mathbb{R}$ is given by

\[ f(\langle (i, j), (i, j+1) \rangle) = \begin{cases} c/2, & \text{if } i = 0; \\ \mathrm{cost}(-, b_{j+1}), & \text{if } 0 < i < j; \\ +\infty, & \text{otherwise,} \end{cases} \]

\[ f(\langle (i, j), (i+1, j) \rangle) = \begin{cases} c/2, & \text{if } j = n; \\ \mathrm{cost}(b_{i+1}, -), & \text{if } 0 < i < j - 1; \\ +\infty, & \text{otherwise,} \end{cases} \]

\[ f(\langle (i, j), (i+1, j+1) \rangle) = \begin{cases} \mathrm{cost}(b_{i+1}, b_{j+1}), & \text{if } i < j; \\ +\infty, & \text{otherwise.} \end{cases} \]

Here $b_i$ denotes the letter associated with the $i$-th position along either axis of the graph, i.e. the $i$-th letter of $s$.

Using the standard dynamic programming algorithm, we can find the shortest path from (0, 0) to $(n, n)$ and a corresponding alignment between the sequence $s$ and itself. Our purpose is to find a partition $\pi = \{s^{(1)}, s^{(2)}, \ldots, s^{(k)}\}$ of $s$ such that $D_{CE}(\pi) + c \cdot g(\pi)$ equals the length of the shortest path from (0, 0) to $(n, n)$.

In a pairwise alignment of two sequences $s$ and $t$, we denote by $s^*$ and $t^*$ the two strings on the top line and the bottom line of the alignment, respectively. That is, $s^*$ and $t^*$ are the sequences obtained from $s$ and $t$, respectively, by inserting some '−'s. We say that letters or strings in the same columns of the alignment are aligned with each other. For a subsequence $s^{(i)*}$ of $s^*$ on the top line of the alignment, we denote by $A^*(s^{(i)*})$ the string on the bottom line that is aligned with $s^{(i)*}$. In addition, we denote by $A(s^{(i)*})$ the sequence obtained by deleting all '−'s from $A^*(s^{(i)*})$. If the pairwise alignment is a self-alignment of $s$, then for a subsequence $s^{(i)*}$ of $s^*$, $A(s^{(i)*})$ is a subsequence of $s$. For example, the following is a self-alignment of the sequence s = AMNPQRS:

    --AM-NPQRS
    AMN-PQRS--

Let $s^{(1)*}$ = AM−N and $s^{(2)*}$ = PQRS. Then $A^*(s^{(1)*})$ = N−PQ, $A(s^{(1)*})$ = NPQ, $A^*(s^{(2)*})$ = RS−− and $A(s^{(2)*})$ = RS.

For a sequence $s = s_1 s_2 \cdots s_n$, we suppose that the alignment corresponding to the shortest path from (0, 0) to $(n, n)$ is

\[ AL = \begin{pmatrix} - & - & \cdots & - & s_1 & \cdots & s_r & \cdots & \cdots & s_n \\ s_1 & s_2 & \cdots & s_r & \cdots & \cdots & \cdots & \cdots & s_n & - \end{pmatrix}. \qquad (3) \]

There are $r$ '−'s before $s_1$ on the top line ($r \ge 1$). We use the following algorithm Partition(AL) to find a partition $\pi$ of $s$.

Algorithm Partition(AL):
    Set $s^{(1)} = s^{(1)*} = s_1 s_2 \cdots s_r$ and $i = 1$;
    While $A(s^{(i)*}) \ne \varphi$ do: {
        $s^{(i+1)*} = A^*(s^{(i)*})$;
        $s^{(i+1)} = A(s^{(i)*})$;
        $i = i + 1$.
    }
    Output $\pi = \{s^{(1)}, s^{(2)}, \ldots, s^{(i)}\}$.

Algorithm Partition(AL) is at the heart of finding a pseudo-periodic partition of a sequence $s$. The following theorem is the main algorithmic result of this paper; it demonstrates the feasibility and effectiveness of algorithm Partition(AL). A detailed proof is given in Appendix B.

Theorem 1. If AL is a self-alignment of $s$ corresponding to the shortest path from (0, 0) to $(n, n)$ in the graph $G = \langle V, E, f \rangle$, then the partition $\pi = \{s^{(1)}, s^{(2)}, \ldots, s^{(k)}\}$ generated by Partition(AL) is a pseudo-periodic partition of $s$.
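The construction above can be sketched in Python as a shortest-path computation on the modified graph followed by the block-by-block reading of the resulting alignment. This is a minimal illustration under stated assumptions, not the authors' Aperiod code: the function names are hypothetical, the substitution cost is taken to be 0 for a match and 1 for a mismatch, interior gaps cost 1, and the way column blocks are delimited in `partition_from_alignment` (each block runs until the required number of top-line letters has been consumed, so gap columns on the top line are attached to the block that follows them) is one reading of Partition(AL).

```python
import math


def shortest_self_alignment(s, c=1.0, gap=1.0,
                            sub=lambda a, b: 0.0 if a == b else 1.0):
    """Shortest (0,0)->(n,n) path in the modified graph; returns (length, top, bottom)."""
    n = len(s)
    dist = [[math.inf] * (n + 1) for _ in range(n + 1)]  # dist[i][j]
    back = [[""] * (n + 1) for _ in range(n + 1)]        # type of the best incoming edge
    dist[0][0] = 0.0

    def relax(i, j, value, edge):
        if value < dist[i][j]:
            dist[i][j], back[i][j] = value, edge

    # Row-major order visits every predecessor of (i, j) before (i, j) itself.
    for i in range(n + 1):
        for j in range(n + 1):
            d = dist[i][j]
            if d == math.inf:
                continue
            if j < n:   # vertical edge: gap on the top line, letter s[j] on the bottom
                w = c / 2 if i == 0 else (gap if 0 < i < j else math.inf)
                relax(i, j + 1, d + w, "V")
            if i < n:   # horizontal edge: letter s[i] on the top line, gap on the bottom
                w = c / 2 if j == n else (gap if 0 < i < j - 1 else math.inf)
                relax(i + 1, j, d + w, "H")
            if i < n and j < n and i < j:   # diagonal edge: s[i] over s[j]
                relax(i + 1, j + 1, d + sub(s[i], s[j]), "D")

    # Trace the path back from (n, n) to recover the two alignment lines.
    top, bottom, i, j = [], [], n, n
    while (i, j) != (0, 0):
        edge = back[i][j]
        if edge == "V":
            top.append("-"); bottom.append(s[j - 1]); j -= 1
        elif edge == "H":
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append(s[i - 1]); bottom.append(s[j - 1]); i -= 1; j -= 1
    return dist[n][n], "".join(reversed(top)), "".join(reversed(bottom))


def partition_from_alignment(top, bottom, gap="-"):
    """Read a partition off a self-alignment, in the spirit of Partition(AL)."""
    r = len(top) - len(top.lstrip(gap))      # number of leading gaps on the top line
    if r == 0 or len(top) != len(bottom):
        raise ValueError("expected equal-length lines and at least one leading gap")
    parts = [bottom[:r]]                     # s(1) = s_1 ... s_r
    col, need = r, r                         # next unread column; letters in current block
    while True:
        start, seen = col, 0
        while col < len(top) and seen < need:   # consume `need` top-line letters
            seen += top[col] != gap
            col += 1
        nxt = bottom[start:col].replace(gap, "")  # letters aligned with the block
        if not nxt:
            return parts
        parts.append(nxt)
        need = len(nxt)


length, top, bottom = shortest_self_alignment("AGAGA", c=1.0)
print(length, top, bottom)                    # 2.0 --AGAGA AGAGA--
print(partition_from_alignment(top, bottom))  # ['AG', 'AG', 'A']
```

For s = AGAGA and c = 1 the path length 2 agrees with D_CE(π) + c · g(π) = 0 + (2 + 2)/2 for π = {AG, AG, A}. Both the distance table and the traceback use quadratic space, in line with the complexity stated for the method.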
It is not difficult to see that algorithm Partition(AL) can be executed in quadratic time and space, just like the standard dynamic programming approach to pairwise sequence alignment. We have implemented the algorithm in C in a UNIX environment. The code, named 'Aperiod', takes one parameter, the granularity factor $c$ described in Definition 3. It is portable to most computers with a C compiler and can easily be used by biologists to carry out complex trait analysis of biological sequences. The program is available from the authors.

3 RESULTS

3.1 Applications to proteins

The algorithm developed in the previous section for generating pseudo-periodic partitions of sequences is very useful for detecting functional or structural domains and regions of proteins. We use several typical examples to demonstrate the use of our algorithm and software program. The method is particularly important for the discovery of subtle repeating (pseudo-periodic) properties in proteins, especially those representing a physicochemical property, such as hydrophobicity or polarity.

Segmentation of the globular and non-globular (coiled-coil) domains of the seryl-tRNA synthetase protein on the basis of sequence complexity is shown in Figure 2 of Wan et al. (2003). In this example, the N-terminal non-globular region of Thermus thermophilus seryl-tRNA synthetase is known from a high-resolution crystal structure determination to be mostly an extended, antiparallel, two-stranded coiled-coil (PDB: 1SRY) (Biou et al., 1994). The sequence, consisting of 421 amino acids, was segmented automatically by the DSR algorithm (Wan et al., 2003). The corresponding parts of the protein structure (from X-ray crystallography) are shown in red ('simple' region, comprising 105 amino acids, non-globular domain) and green ('complex' region, globular domain). It is well known that the non-globular domain of the seryl-tRNA synthetase protein has a weak 7-residue repeat.
In fact, using the Aperiod program, we detect a pseudo-periodic partition of this non-globular region, as shown in Table 1. The results demonstrate that the region does indeed have a pseudo-period of 7.

Table 1. Pseudo-periodic partition of non-globular region of 1SRY

Pseudo-periodic unit | Position | Length
MVDLKRLR | 1–8 | 8
QEPEVFHR | 9–16 | 8
AIREKGVA | 17–24 | 8
LDLEALLA | 25–32 | 8
LDREVQEL | 33–40 | 8
KKRLQEVQ | 41–48 | 8
TERNQVA | 49–55 | 7
KRVPKAP | 56–62 | 7
PEEKEAL | 63–69 | 7
IARGKAL | 70–76 | 7
GEEAKRL | 77–83 | 7
EEALRE | 84–89 | 6
KEARLE | 90–95 | 6
ALLLQV | 96–101 | 6
PLPP | 102–105 | 4

The second example is apo(a) (APOA_HUMAN), whose SWISS-PROT accession number is P08519. Apo(a) is the main constituent of lipoprotein(a) [Lp(a)]. It has serine proteinase activity and is capable of autoproteolysis; it also inhibits tissue-type plasminogen activator 1. Lp(a) may be a ligand for megalin/Gp 330 (McLean et al., 1987). Apo(a) is known to be proteolytically cleaved, leading to the formation of the so-called mini-Lp(a). Apo(a) fragments accumulate in atherosclerotic lesions, where they may promote thrombogenesis. O-Glycosylation may limit the extent of proteolytic fragmentation. Elevated plasma concentrations of apo(a) and its naturally occurring proteolytic fragments are correlated with atherosclerosis. APOA_HUMAN belongs to peptidase family S1, also known as the trypsin family, and to the plasminogen subfamily. It contains 38 kringle domains (37 of type IV and one of type V), each of which is approximately 110 amino acids in length and of high complexity (Martin, 1999). Homology with plasminogen kringles IV and V is thought to underlie the atherogenicity of the protein, because the fragments compete with plasminogen for fibrin(ogen) binding. Kringles (Castellino and Beals, 1987; Ikeo et al., 1991; Patthy, 1985) are triple-looped, disulfide cross-linked domains found in a varying number of copies in some serine proteases and plasma proteins.
Kringle domains are thought to play a role in binding mediators, such as membranes, other proteins or phospholipids, and in the regulation of proteolytic activity. Using the Aperiod program, we have exactly detected all 38 kringle domains in APOA_HUMAN, as shown in Table 2. In the pseudo-periodic partition of apolipoprotein A in APOA_HUMAN generated by Aperiod, each pseudo-periodic unit corresponds precisely to a kringle domain. There are 28 perfect periodic regions (exact repeats) of length 114 in the pseudo-periodic partition, which start at position 131 and end at position 3322 in the sequence.

Table 2. Pseudo-periodic partition of apolipoprotein A in APOA_HUMAN (38 copies)

Pseudo-periodic unit | Position | Length
EQSHVVQDCYHGDGQSYRGTYSTTVTGRTCQAWSSMTPHQHNRTTENYPNAGLIMNYCRNPDAVAAPYCYTRDPGVRWEYCNLTQCSDAEGTAVAPPTVTPVPSLEAPSEQ | 20–130 | 111
APTEQRPGVQECYHGNGQSYRGTYSTTVTGRTCQAWSSMTPHSHSRTPEYYPNAGLIMNYCRNPDAVAAPYCYTRDPGVRWEYCNLTQCSDAEGTAVAPPTVTPVPSLEAPSEQ | 131–244, 245–358, 359–472, 473–586, 587–700, 701–814, 815–928, 929–1042, 1043–1156, 1157–1270, 1271–1384, 1385–1498, 1499–1612, 1613–1726, 1727–1840, 1841–1954, 1955–2068, 2069–2182, 2183–2296, 2297–2410, 2411–2524, 2525–2638, 2639–2752, 2753–2866, 2867–2980, 2981–3094, 3095–3208, 3209–3322 | 114
APTEQRPGVQECYHGNGQSYRGTYSTTVTGRTCQAWSSMTPHSHSRTPEYYPNAGLIMNYCRNPDPVAAPYCYTRDPSVRWEYCNLTQCSDAEGTAVAPPTITPIPSLEAPSEQ | 3323–3436 | 114
APTEQRPGVQECYHGNGQSYQGTYFITVTGRTCQAWSSMTPHSHSRTPAYYPNAGLIKNYCRNPDPVAAPWCYTTDPSVRWEYCNLTRCSDAEWTAFVPPNVILAPSLEAFFEQ | 3437–3550 | 114
ALTEETPGVQDCYYHYGQSYRGTYSTTVTGRTCQAWSSMTPHQHSRTPENYPNAGLTRNYCRNPDAEIRPWCYTMDPSVRWEYCNLTQCLVTESSVLATLTVVPDPSTEASSEE | 3551–3664 | 114
APTEQSPGVQDCYHGDGQSYRGSFSTTVTGRTCQSWSSMTPHWHQRTTEYYPNGGLTRNYCRNPDAEISPWCYTMDPNVRWEYCNLTQCPVTESSVLATSTAVSEQ | 3665–3770 | 106
APTEQSPTVQDCYHGDGQSYRGSFSTTVTGRTCQSWSSMTPHWHQRTTEYYPNGGLTRNYCRNPDAEIRPWCYTMDPSVRWEYCNLTQCPVMESTLLTTPTVVPVPSTELPSEE | 3771–3884 | 114
APTENSTGVQDCYRGDGQSYRGTLSTTITGRTCQSWSSMTPHWHRRIPLYYPNAGLTRNYCRNPDAEIRPWCYTMDPSVRWEYCNLTRCPVTESSVLTTPTVAPVPSTEAPSEQ | 3885–3998 | 114
APPEKSPVVQDCYHGDGRSYRGISSTTVTGRTCQSWSSMIPHWHQRTPENYPNAGLTENYCRNPDSGKQPWCYTTDPCVRWEYCNLTQCSETESGVLETPTVVPVPSMEAHSEA | 3999–4112 | 114
APTEQTPVVRQCYHGNGQSYRGTFSTTVTGRTCQSWSSMTPHRHQRTPENYPNDGLTMNYCRNPDADTGPWCFTMDPSIRWEYCNLTRCSDTEGTVVAPPTVIQVPSLGPPSEQ | 4113–4226 | 114
DCMFGNGKGYRGKKATTVTGTPCQEWAAQEPHRHSTFIPGTNKWAGLEKNYCRNPDGDINGPWCYTMNPRKLFDYCDIPLCASSSFDCGKPQVEPKKCPGS | 4227–4327 | 101

We now apply our Aperiod program to a typical protein sequence dealt with by Coward and Drablos (1998): Acyl-[acyl-carrier-protein]-UDP-N-acetylglucosamine O-acyltransferase (SwissProt entry: LPXA_ECOLI; accession no. P10440). UDP-N-acetylglucosamine 3-O-acyltransferase (LpxA) catalyzes the transfer of (R)-3-hydroxymyristic acid from its acyl carrier protein thioester to UDP-N-acetylglucosamine. LpxA is the first enzyme in the lipid A biosynthetic pathway and is a target for the design of antibiotics. The X-ray crystal structure of LpxA was determined by Raetz and Roderick (1995) to 2.6 Å resolution and reveals a domain motif composed of parallel beta strands, termed a left-handed parallel beta helix (L beta H). This unusual fold displays repeated violations of the protein folding constraint requiring right-handed crossover connections between strands of parallel beta sheets, and may be present in other enzymes that share amino acid sequence homology with the repeated hexapeptide motif of LpxA.

The conversion of tetrahydrodipicolinate and succinyl-CoA to N-succinyltetrahydrodipicolinate and CoA is catalyzed by tetrahydrodipicolinate N-succinyltransferase and is the committed step in the succinylase pathway by which bacteria synthesize L-lysine and meso-diaminopimelate, a component of peptidoglycan. The X-ray crystal structure of THDP succinyltransferase was determined by Beaman et al. (1997) to 2.2 Å resolution and was refined to a crystallographic R-factor of 17.0%.
The enzyme is trimeric and displays the left-handed parallel beta-helix (L beta H) structural motif encoded by the ‘hexapeptide repeat’ amino acid sequence motif (Raetz and Roderick, 1995). The approximate location of the active site of THDP succinyltransferase was suggested by the proximity of binding sites for two inhibitors, p-(chloromercuri)benzenesulfonic acid and cobalt ion, both of which bind to the L beta H domain. LPXA_ECOLI belongs to the transferase hexapeptide repeat family and LpxA subfamily, as shown in Figure 2 (PDB code 1LXA).

Fig. 2. LpxA of Escherichia coli and LpxA family (adapted from Parisi and Echave, 2001). (A) Cartoon view of the LβH domain of the LpxA of E.coli. PDB entry 1LXA. The left-handed β helix is formed by nine triangular coils (C1–C9). Each coil is formed by three hexapeptides, colored red, yellow and blue, respectively. Loops are colored gray. (B) Detailed view of coil C2 of A. Amino acids at conserved hexapeptide positions 1 (A18, A24 and C30) and 3 (I20, I26 and V32) are labeled. Panels A and B were prepared with the program MOLMOL (Koradi et al., 1996). (C) Multiple-sequence alignment of the LβH domain of the members of the LpxA family. The alignment was obtained using CLUSTAL W (Thompson et al., 1994). Sequences are identified using the SwissProt/TrEMBL codes (http://www.expasy.ch/sprot/sprottop.html). Conserved substitutions are shaded in black if the whole column is conserved, and they are shaded in gray if 0.75% is conserved. The following classes were used to judge conservation: aliphatic (ACILMV), aromatic (FHWY), polar (NQST), charged positive (KR), charged negative (DE) and special (GP). Colors of the first line of C relate the alignment to the structure shown in A and B. In addition, different coils (C1–C9) and hexapeptide third positions (dots) are explicitly indicated.

The Aperiod program was tested on the left-handed parallel β helix (LβH) domain of LpxA, which displays a distinctive sequence pattern that is likely to result from structural constraints. We find a pseudo-periodic partition of the sequence of LPXA_ECOLI as shown in Table 3. The sequence of the LβH domain, which is the subsequence of LPXA_ECOLI from position 1 to position 177, consists of 11 pseudo-repeats in which the dominating repeats are of length 18 and 15. Indeed, the sequences of the LβH domain of members of the LpxA family, as demonstrated in Figure 2C, consist of the imperfect tandem repetition of hexapeptide units (Vaara, 1992; Vuorio et al., 1994; Raetz and Roderick, 1995; Parisi and Echave, 2001). These imperfect tandem repeats have been accurately detected by the Aperiod program. In fact, the results with Aperiod, as demonstrated in Table 4, show a strong agreement in predicting the exact positions of experimental coils and loops. The hexapeptides are characterized by a high degree of conservation of the third position, which usually displays I, L or V (a one-letter code is used to designate amino acids). Hexapeptide position 1 is also significantly conserved, although less so than position 3, whereas the other four hexapeptide sites (2, 4, 5 and 6) are not conserved. Figure 2B shows that the residues of conserved sites 1 and 3 point toward the inside of the β helix, whereas those in variable positions point toward the outside. The LpxA family belongs to a larger superfamily of LβH acyltransferases.

Table 3. Pseudo-periodic partition of LPXA_ECOLI sequence

Pseudo-periodic unit | Position | Length
MIDKSAFVHPTAIVEEGA | 1–18 | 18
SIGANAHIGPFCIVGPHV | 19–36 | 18
EIGEGTVLKSHVVVNGHT | 37–54 | 18
KIGRDNEIYQFASI | 55–68 | 14
GEVNQDLKYAGEPTR | 69–83 | 15
VEIGDRNRIRESVTI | 84–98 | 15
HRGTVQGGGL | 99–108 | 10
TKVGSDNLLMINAHIAHD | 109–126 | 18
CTVGNRCILANNATLAGH | 127–144 | 18
VSVDDFAIIGGMTAVHQF | 145–162 | 18
CIIGAHVMVGGCSGV | 163–177 | 15

Table 4. Comparison of Aperiod segmentation and LβH domain of LPXA_ECOLI

Aperiod segmentation | Experimental coils/loops | Comments
1–18 | 1–17 | Coil1
19–36 | 18–35 | Coil2
37–54 | 36–53 | Coil3
55–68 | 54–68 | Coil4
69–83 | 69–83 | Loop1
84–98 | 84–98 | Coil5
99–108 | 99–108 | Loop2
109–126 | 109–126 | Coil6
127–144 | 127–144 | Coil7
145–162 | 145–162 | Coil8
163–177 | 163–177 | Coil9

3.2 Determination of the granularity factor

There is a significant parameter, the granularity factor c described in Definition 3, associated with the Aperiod program. Finding the best value of c is key for applications in biological sequence and structure analysis. We can see that the sizes of the pseudo-periodic units in a pseudo-periodic partition of a sequence s are close to the pseudo-period length of s. We look for a pseudo-periodic partition π of s with the minimum of DCE(π) + c · g(π), where g(π) is strongly related to the sizes of the pseudo-periodic units in the partition. A smaller c results in more pseudo-periodic units and a stricter approximation of the periodicity of the sequence. However, if c is too small, the algorithm would tend to find partitions with too large pseudo-periodic units, e.g. number of the pseudo-periodic units >n/2, and it may then overlook some important pseudo-periodic partition with small period length. Therefore, we need a good trade-off between the strictness of the pseudo-periodicity and the number of pseudo-periodic units in the partition.

We now turn to determining the best parameter c with Aperiod for biological applications. To this end, we need to define mathematically an objective function (Wan et al., 2003), the ‘agreement accuracy (sensitivity)’, to measure the efficiency of the Aperiod program for the particular investigation of pseudo-periodic partitions of protein sequences. For any sequence s = s_1 s_2 ··· s_n ∈ F, suppose that s[i..j] is a pseudo-periodic unit of s detected by Aperiod, which corresponds to a known true repeat unit s[k..l]. Then s[r..t] is called a mutual region of s, where r = max{i, k} and t = min{j, l}.
Let p denote the total length of all mutual regions of s. The agreement accuracy ω of the Aperiod partition is defined as ω(s) = p/n. For example, as shown in Table 4, the sequence s of LPXA_ECOLI has 11 pseudo-periodic units in the Aperiod segmentation: s[1..18], s[19..36], s[37..54], s[55..68], s[69..83], s[84..98], s[99..108], s[109..126], s[127..144], s[145..162], s[163..177]. On the other hand, s has 11 true repeat units, each of which corresponds to an experimental coil or loop: s[1..17], s[18..35], s[36..53], s[54..68], s[69..83], s[84..98], s[99..108], s[109..126], s[127..144], s[145..162], s[163..177]. Then s has 11 mutual regions: s[1..17], s[19..35], s[37..53], s[55..68], s[69..83], s[84..98], s[99..108], s[109..126], s[127..144], s[145..162], s[163..177]. In this case, p = 174 and n = 177. Therefore, the agreement accuracy is ω = 174/177 = 98.3%.

Now we turn to the agreement accuracy when the Aperiod program is used for a sequence database. Let D denote a protein sequence database or a protein sequence family. The agreement accuracy ω_D(c) of the Aperiod partition for D, associated with the parameter c (the granularity factor), is defined as the average of the agreement accuracy values of all individual sequences in D:

\[ \omega_D(c) = \frac{1}{|D|} \sum_{s \in D} \omega(s). \]

To discover the best parameter c for D, we should find an ‘extreme point’ c∗ at which ω_D(c) achieves its maximum:

\[ c^* = \arg\max_{c \in \Lambda} \omega_D(c), \]

where ‘argmax’ stands for the argument for which the function involved takes its maximum value, and Λ = {0.01, 0.02, ..., 0.99, 1.00, 1.01, ..., 1.99, 2.00} is a 200-element set. A large number of computational examples show that for c ≥ 2, all pseudo-periodic partitions of protein sequences are identical. Thus, it is enough to deal with those cases when c ∈ Λ. We have developed a software program for finding an extreme point c∗ in the set Λ and calculating the maximum value ω_D(c∗) for D.
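The definitions of mutual region, agreement accuracy and extreme point can be illustrated with a minimal Python sketch. It is not the Aperiod software: the function names are ours, detected units are assumed to be paired with known units in order of appearance (the text does not state how the correspondence is established), and `partition_fn` is a hypothetical stand-in for a call to Aperiod with granularity factor c.

```python
from typing import Callable, List, Sequence, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive positions, as in Table 4


def mutual_regions(detected: Sequence[Interval],
                   known: Sequence[Interval]) -> List[Interval]:
    """Return s[r..t] with r = max(i, k) and t = min(j, l) for each paired unit.

    Assumption: the i-th detected unit corresponds to the i-th known unit.
    """
    regions = []
    for (i, j), (k, l) in zip(detected, known):
        r, t = max(i, k), min(j, l)
        if r <= t:  # skip pairs that do not overlap at all
            regions.append((r, t))
    return regions


def agreement_accuracy(detected: Sequence[Interval],
                       known: Sequence[Interval], n: int) -> float:
    """omega(s) = p / n, where p is the total length of all mutual regions."""
    p = sum(t - r + 1 for r, t in mutual_regions(detected, known))
    return p / n


def best_granularity(database: Sequence[Tuple[str, Sequence[Interval]]],
                     partition_fn: Callable[[str, float], Sequence[Interval]],
                     grid: Sequence[float] = ()) -> float:
    """Grid search for c* maximizing the mean agreement accuracy over a database.

    `database` holds (sequence, known_units) pairs; `partition_fn(seq, c)` is a
    hypothetical stand-in for Aperiod and returns the detected units.
    """
    if not grid:
        grid = [k / 100 for k in range(1, 201)]  # 0.01, 0.02, ..., 2.00

    def omega_d(c: float) -> float:
        total = sum(agreement_accuracy(partition_fn(seq, c), known, len(seq))
                    for seq, known in database)
        return total / len(database)

    return max(grid, key=omega_d)  # first maximizer if several values tie


# Table 4 data for LPXA_ECOLI (n = 177)
aperiod_units = [(1, 18), (19, 36), (37, 54), (55, 68), (69, 83), (84, 98),
                 (99, 108), (109, 126), (127, 144), (145, 162), (163, 177)]
true_units = [(1, 17), (18, 35), (36, 53), (54, 68), (69, 83), (84, 98),
              (99, 108), (109, 126), (127, 144), (145, 162), (163, 177)]
print(f"{agreement_accuracy(aperiod_units, true_units, 177):.1%}")  # 98.3%
```

With the Table 4 intervals the sketch reproduces p = 174 and ω = 98.3%. The exhaustive search over 200 candidate values mirrors the argmax over the 200-element set described above; any smarter search strategy would be an addition not described in the text.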
Users can freely choose a biological sequence database as the training set to which a query sequence of interest belongs. The cut-off value c may be determined based on simulations with random sequences (Benson, 1999). However, biological sequences, especially pseudo-periodic sequences, are extremely different from truly random strings (Wan and Wootton, 1999, 2000, 2002; Wan et al., 2003; Li et al., 2003). Therefore, random strings may not be suitable as a training set for discovering a good cut-off granularity factor for finding pseudo-periodic regions and subsequences in biological sequences. In general, we choose all repeats (including pseudo-repeats) in the SWISS-PROT amino acid sequence database (Release 40.22 of 24 June 2002) as a typical training set. The query ‘[libs = {swiss_prot}-keywords: Repeat]’ found 8782 entries (http://us.expasy.org/cgi-bin/getentries?KW=Repeat&SRS=Perform&db=sp). Our computational experiments on all these 8782 repeats have shown that in this case the extreme point is c∗ = 0.52 and the maximum is ω_D(c∗) = 0.976. In other words, the best parameter for automatic partition of amino acid sequences into pseudo-repeats by the Aperiod program is c∗ = 0.52, and the average agreement accuracy (sensitivity) of pseudo-periodic partitions generated by Aperiod for all repeats in the SWISS-PROT database is 97.6%.

4 DISCUSSION

This paper has developed an efficient algorithm and a software program, Aperiod, associated with a parameter c (the granularity factor), to find pseudo-periodic regions or sequences within protein databases. The resulting segmentation corresponds very well to intuitive views of the detection of these ‘quasi-repeats’. The Aperiod program provides a useful bioinformatics tool for delineating functional and structural features of pseudo-periodic sequences. We have applied Aperiod to five typical examples to demonstrate its utility for detecting protein domains, coils and loops whenever they are quasi-repeats.
We have also used these techniques to evaluate the abundance of quasi-repeating sequences within the SWISS-PROT database. We find that the average agreement accuracy of pseudo-periodic partitions, detected by Aperiod for all quasi-repeats in the SWISS-PROT database, is as high as 97.6%.

Pseudo-periodic regions are well defined, and are different from ‘simple’ (low-complexity) regions. Low-complexity regions are regions of biased composition, often mosaics of a small number of amino acids. They have been shown to be functionally important in some proteins, but they are generally not very well understood. SEG (Wootton and Federhen, 1993) and CAST (Promponas et al., 2000) are two widely used, powerful tools for the complexity analysis of biological sequence tracts. Recently, we (Wan et al., 2003) proposed a new complexity function, called the reciprocal complexity. Based on this complexity measure, we developed an efficient algorithm and a software program, ‘DSR’, for classifying and analyzing simple segments of protein and nucleotide sequence databases associated with scoring schemes. The significant difference between DSR and SEG or CAST is that DSR is much more general and is associated with scoring schemes, whereas SEG and CAST are not. DSR therefore has more applications than SEG and CAST do. It is clear that, by definition, only those lowest-complexity regions are pseudo-periodic, and those pseudo-periodic regions consisting of very short pseudo-periodic units have low complexity. In contrast, pseudo-periodic regions have very high complexity when they comprise long pseudo-periodic units.

The approach developed in this paper is a general, efficient methodology in bioinformatics and computational biology to find biological functions from pseudo-periodic segments of gene or protein sequences, which are remarkably informative for inferring, describing and understanding biological properties.
It can be utilized to look for precise molecular details of biological structures, dynamics, interactions and evolution. However, these important details cannot be inferred by using sequence alignment for a large proportion of genomic and deduced protein sequences for which relevant experimental data or homologous precedents are lacking.

We are continuing to use the program Aperiod to show pseudo-periodic features of protein sequence databases. For instance, we use it to detect specific functional domains or subdomains in proteins and to reveal the abundance of pseudo-periodic segments within protein sequences. We also apply it to analyze whole genomes and proteomes, extracting the distribution of pseudo-periodic unit lengths and the number of pseudo-periodic units within the sequences. We then want to find the correlation between this distribution and the degree of order of the organisms, and explore the role of those pseudo-periodic genes and proteins or regions in the molecular evolution of modern organisms.

ACKNOWLEDGEMENTS

We are very grateful to three anonymous referees for many valuable comments. This work was supported in part by NIH grant R01-GM00028 and NSF grant DBI 0078307.

REFERENCES

Beaman,T.W., Binder,D.A., Blanchard,J.S. and Roderick,S.L. (1997) Three-dimensional structure of tetrahydrodipicolinate N-succinyltransferase. Biochemistry, 36, 489–494. Benson,G. (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res., 27, 573–580. Biou,V. et al. (1994) The 2.9 Å crystal structure of T. thermophilus seryl-tRNA synthetase complexed with tRNA(Ser). Science, 263, 1404–1410. Castellino,F.J. and Beals,J.M. (1987) The genetic relationships between the kringle domains of human plasminogen, prothrombin, tissue plasminogen activator, urokinase, and coagulation factor XII. J. Mol. Evol., 26, 358–369. Coward,E. and Drablos,F. (1998) Detecting periodic patterns in biological sequences.
Bioinformatics, 14, 498–507. Ikeo,K., Takahashi,K. and Gojobori,T. (1991) Evolutionary origin of numerous kringles in human and simian apolipoprotein(a). FEBS Lett., 287, 146–148. Katti,M.V. et al. (2000) Amino acid repeat patterns in protein sequences: their diversity and structural–functional implications. Protein Sci., 9, 1203–1209. Kobe,B. and Deisenhofer,J. (1993) Crystal structure of porcine ribonuclease inhibitor, a protein with leucine-rich repeats. Nature, 366, 751–756. Koradi,R., Billeter,M. and Wuthrich,K. (1996) MOLMOL: a program for display and analysis of macromolecular structures. J. Mol. Graph., 14, 29–32, 51–55. Levenshtein,V.I. (1966) Binary codes capable of correcting deletions, insertions, and reversals. Soviet Phys. Dokl., 10, 707–710. Li,G., Kok,P. and Wan,H. (2003) Biological sequences are neither simple, nor complex. in press. Martin,J. (1999) Kringle domain. In T.E.Creighton (ed), Encyclopedia of Molecular Biology. John Wiley, p. 1353. McLean,J.W. et al. (1987) cDNA sequence of human apolipoprotein(a) is homologous to plasminogen. Nature, 330, 132–137. Parisi,G. and Echave,J. (2001) Structural constraints and emergence of sequence patterns in protein evolution. Mol. Biol. Evol., 18, 750–756. Patthy,L. (1985) Evolution of the proteases of blood coagulation and fibrinolysis by assembly from modules. Cell, 41, 657–663. Promponas,V.J. et al. (2000) CAST: an iterative algorithm for the complexity analysis of sequence tracts. Complexity analysis of sequence tracts. Bioinformatics, 16, 915–922. Raetz,C.R.H. and Roderick,S.L. (1995) A left-handed parallel beta helix in the structure of UDP-N-acetylglucosamine acyltransferase. Science, 270, 997–1000. Schmidt,J.P. (1998) All highest scoring paths in weighted grid graphs and their application to finding all repeats in strings. SIAM J. Comput., 27, 972–992. Smit,A.F. (1996) The origin of interspersed repeats in the human genome. Curr. Opin. Genet. Dev., 6, 743–748. Thompson,J.D., Higgins,D.G. 
and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680. Vaara,M. (1992) Eight bacterial proteins, including UDP-N-acetylglucosamine acyltransferase (LpxA) and three other transferases of Escherichia coli, consist of a six-residue periodicity theme. FEMS Microbiol. Lett., 76, 249–254. Vuorio,R. et al. (1994) The novel hexapeptide motif found in the acyltransferases LpxA and LpxD of lipid A biosynthesis is conserved in various bacteria. FEBS Lett., 337, 289–392. Wan,H., Li,L., Federhen,S. and Wootton,J.C. (2003) Discovering simple regions in biological sequences associated with scoring schemes. J. Comput. Biol., 2, 171–185. Wan,H. and Song,E. (2002) Quasiperiodic biosequences and modulo incidence matrices. Proceedings of the International Parallel and Distributed Processing Symposium: IPDPS 2002 Workshops. Wan,H. and Wootton,J.C. (1999) Axiomatic foundations of complexity functions of biological sequences. Ann. Comb., 3, 105–127. Wan,H. and Wootton,J.C. (2000) A global compositional complexity measure for biological sequences: AT-rich and GC-rich genomes encode less complex proteins. Comput. Chem., 24, 67–88. Wan,H. and Wootton,J.C. (2002) Algorithms for computing lengths of chains in integral partition lattices generated by sequences. Theoret. Comput. Sci., 289, 783–800. Wootton,J.C. and Federhen,S. (1993) Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem., 17, 149–163.

APPENDIX

A Proofs of Propositions 2 and 3

Proof of Proposition 2. Let r = gcd(p1, p2). We prove Proposition 2 for a fixed r by induction on p1 + p2. Clearly, the proposition is true for p1 + p2 = 1. Suppose it is true for all positive integers < p1 + p2. Without loss of generality, we assume that p1 > p2 and write s = tu, where t = t_1 t_2 ··· t_{p1−r}. Since s is both p1-periodic and p2-periodic, we have t_i = s_i = s_{i+p1} = s_{i+p1−p2} = t_{i+p1−p2}. This means that t is (p1 − p2)-periodic. Note that t is p2-periodic and gcd(p1 − p2, p2) = r. By the induction hypothesis, t is r-periodic. Since t is a prefix of s, and s is p2-periodic, where r | p2, we immediately deduce that s is r-periodic.

Proof of Proposition 3. Let p0 be the minimum period of s. Obviously, p0 ≤ p ≤ n/2. If s is m-periodic, 1 ≤ m ≤ n/2, then by Proposition 2 s is gcd(p0, m)-periodic. Since gcd(p0, m) ≤ p0 and p0 is the minimum period, we have p0 = gcd(p0, m). Thus p0 | m. In other words, m is a multiple of p0.

B Feasibility and effectiveness of algorithm Partition(AL)

The following lemma gives a precise characterization of the empty substring A(s^(i)*), and a necessary and sufficient condition for the termination of algorithm Partition(AL).

Lemma 1. A(s^(i)*) = φ if and only if s^(i) is a suffix of s.

Proof. We have A(s^(i)*) = φ whenever the shortest path, corresponding to the alignment AL, must pass through the point (j, j), where j is the length of the sequence s^(1) s^(2) ··· s^(i). But there are only two points of this form that we allow the shortest path to reach: (0, 0) and (n, n). Clearly j > 0. Thus j = n, i.e. s^(i) is a suffix of s.

The following lemma tells us that algorithm Partition(AL) really generates a partition.

Lemma 2. Let π = {s^(1), s^(2), ..., s^(k)} be the output generated by Partition(AL). Then π is a real partition of s.

Proof. Note that s^(1) = s_1 s_2 ··· s_r at the bottom line of AL is a prefix of s that is aligned with all ‘−’s before s^(1) at the top line. Thus, s^(2) = A(s^(1)*) is a prefix of s − s^(1), so s^(1) s^(2) is a prefix of s; s^(3) = A(s^(2)*) is a prefix of s − s^(1) s^(2), so s^(1) s^(2) s^(3) is a prefix of s; and so on. Here, s − s′ denotes the subsequence of s obtained by removing the prefix s′. Finally, we have s^(1) s^(2) ··· s^(k) = s.
For a self-alignment AL of s, we define Cost(AL) as the length of the corresponding path from (0, 0) to (n, n) in G, the DWG made by AL. Interestingly, for the optimal AL corresponding to the shortest path from (0, 0) to (n, n) in G, Cost(AL) achieves the minimal distance to periodic partitions of s, as described in the following lemma.

Lemma 3. Let AL be the optimal alignment corresponding to the shortest path from (0, 0) to (n, n) in G. Then B_c(s) = Cost(AL).

Proof. Let π = {s^(1), s^(2), ..., s^(k)} be the partition of s generated by Partition(AL). We can see that D(π) + c · g(π) = Cost(AL). By the definition of B_c(s), we have D(π) + c · g(π) ≥ B_c(s). If D(π) + c · g(π) > B_c(s), then there must exist a partition π0 of s such that D(π) + c · g(π) > D(π0) + c · g(π0). So we can construct a self-alignment AL′ of s such that Cost(AL′) = D(π0) + c · g(π0) < Cost(AL). A path of length Cost(AL′) is thus found from (0, 0) to (n, n) in G. This contradicts the assumption that AL is an optimal alignment corresponding to the shortest path from (0, 0) to (n, n) in G.

Lemma 3 demonstrates that any partition of s generated from an optimal self-alignment AL using algorithm Partition(AL) is a pseudo-periodic partition of s. Combining Lemmas 1, 2 and 3 yields Theorem 1.
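As a numerical sanity check of Proposition 2 in Appendix A, the following Python sketch searches exhaustively for a counterexample among short strings. It assumes that a sequence s of length n is called p-periodic when 1 ≤ p ≤ n/2 and s_i = s_{i+p} for every valid i; this reading is inferred from the bounds used in the proof of Proposition 3 and may differ in detail from the definition given in the main text. The function names are ours.

```python
from itertools import product
from math import gcd
from typing import Optional, Tuple


def is_periodic(s: str, p: int) -> bool:
    """Assumed definition: 1 <= p <= len(s)/2 and s[i] == s[i+p] for all valid i."""
    if p < 1 or 2 * p > len(s):
        return False
    return all(s[i] == s[i + p] for i in range(len(s) - p))


def find_counterexample(max_len: int = 10,
                        alphabet: str = "ab") -> Optional[Tuple[str, int, int]]:
    """Look for s that is p1- and p2-periodic but not gcd(p1, p2)-periodic.

    Returns None when every string over `alphabet` up to length `max_len`
    satisfies Proposition 2 under the assumed definition.
    """
    for n in range(2, max_len + 1):
        for letters in product(alphabet, repeat=n):
            s = "".join(letters)
            periods = [p for p in range(1, n // 2 + 1) if is_periodic(s, p)]
            for p1 in periods:
                for p2 in periods:
                    if not is_periodic(s, gcd(p1, p2)):
                        return s, p1, p2
    return None


print(find_counterexample())  # None
```

A result of None for small alphabets and lengths is of course no proof, but it agrees with Proposition 2, and the list `periods` computed for each string contains only multiples of its smallest element, as Proposition 3 states.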