Biochimica et Biophysica Acta 1724 (2005) 394–403
http://www.elsevier.com/locate/bba

Minireview

Hidden Markov Model-derived structural alphabet for proteins: The learning of protein local shapes captures sequence specificity

A.C. Camproux*, P. Tufféry

Equipe de Bioinformatique Génomique et Moléculaire, INSERM U726, Université Paris 7, case 7113, 2 place Jussieu, 75251 Paris, France

Received 1 March 2005; received in revised form 10 May 2005; accepted 11 May 2005. Available online 15 June 2005

Abstract

Understanding and predicting protein structures depend on the complexity and the accuracy of the models used to represent them. We have recently set up a Hidden Markov Model to optimally compress protein three-dimensional conformations into a one-dimensional series of letters of a structural alphabet. Such a model learns simultaneously the shape of representative structural letters describing the local conformation and the logic of their connections, i.e. the transition matrix between the letters. Here, we move one step further and report some evidence that such a model of protein local architecture also captures some accurate amino acid features. All the letters have specific and distinct amino acid distributions. Moreover, we show that words of amino acids can have significant propensities for some letters. Perspectives point towards the prediction of the series of letters describing the structure of a protein from its amino acid sequence.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Hidden Markov Model; Structural alphabet; Protein structural organization; Sequence-structure relationship

1. Introduction

The recent genome sequencing projects [1] have provided sequence information for a large number of proteins. In most cases, an accurate three-dimensional (3D) structural knowledge of the proteins is necessary for a detailed functional characterization of these sequences.
However, even in the days of high-throughput techniques, experimental determination of protein structures by X-ray crystallography or NMR remains quite time consuming. Thus, there is an increasing gap between the number of available protein sequences and that of experimentally derived protein structures, which makes it even more important to improve the methods for predicting protein 3D structures. The structural biology community has long focused on the very hard task of developing algorithms for solving the ab initio protein folding problem, namely, predicting protein structure from sequence.

* Corresponding author. E-mail address: [email protected] (A.C. Camproux). doi:10.1016/j.bbagen.2005.05.019

An obvious direction towards some simplification is to consider that recurrent structural motifs exist at all levels of organization of protein structures [2]. Recent years have seen the re-emergence of an old concept: the identification of libraries of canonical 3D structural fragments that span the space of local structures, i.e. a representative but finite set of generic protein fragments [3]. These libraries of fragments are classically constructed by clustering fragments from a collection obtained from a set of dissimilar proteins. Many clustering approaches have been used to extract sets of representative fragments able to represent adequately all known local protein structures [4-14]. Although such libraries provide an accurate approximation of protein conformations, their identification teaches us little about the way protein structures are organized: they do not consider, during the learning step, the rules that govern the assembly of the local fragments into a protein structure.
An obvious means of overcoming such limitations is to consider that the series of representative fragments describing protein structures are in fact not independent but governed by a Markovian process. For this purpose, we have shown that Hidden Markov Models (HMM) [15] are relevant to identify a structural alphabet (HMM-SA). Recently, we set up an optimal Hidden Markov Model-derived structural alphabet for proteins, which describes the local shapes of proteins and the logic of their assembly using 27 letters [16]. Such a structural alphabet is able to optimally compress 3D information into a unique 1D representation. We have since shown that such an alphabet, which provides an accurate description of protein conformations, can be used to search for structural similarities [17]. However, one important question is to assess to what extent such a structural alphabet is suited for structure prediction. Fragment libraries classically face the dilemma [3] that some balance has to be found between the accuracy of the structural description and a reasonable fragment library size: (i) to keep a good representativity of 3D structural fragments, optimising the relevance of the 1D-3D relationship to ensure the quality of the prediction, and (ii) to obtain a reasonable complexity for 3D reconstruction. In this paper, we address the ability of HMM-SA, a structural alphabet that was learnt using exclusively geometrical and logical information (i.e. taking into account the geometry of the letters and their transitions), to capture a clear sequence-conformation relationship. We perform an a posteriori analysis of the dependence between the local amino acid sequence and the local shapes of proteins encoded as series of letters of the structural alphabet.

1.1. Materials

Our analysis was performed on a collection of non-redundant protein structures presenting less than 50% sequence identity. Only proteins at least 30 amino acids long, having no chain breaks, and obtained by X-ray diffraction with a resolution better than 2.5 Å were retained. This resulted in a collection of 3427 protein chains (denoted Id50) and a subset of 1429 protein chains having less than 30% sequence identity (denoted Id30). The collection of 3427 proteins represents a total of 809,638 amino acids and 799,357 four-residue fragments. The subset of 1429 proteins contains 336,780 amino acids and 332,493 four-residue fragments.

1.2. Model learning

Proteins are described by their alpha-carbons only (see Fig. 1.a1) and are decomposed into series of overlapping fragments of four-residue length (see Fig. 1.a2). Each four-residue fragment h is described by a 4-descriptor vector: the three distances between the non-consecutive alpha-carbons, d1(h) = d[Cα1(h), Cα3(h)], d2(h) = d[Cα1(h), Cα4(h)], d3(h) = d[Cα2(h), Cα4(h)], and the oriented projection P4 of the last alpha-carbon Cα4(h) onto the plane formed by the first three [15], as shown in Fig. 1.a3.

Fig. 1. Illustration of the HMM-SA encoding process. The left block, called "3D space", represents the polypeptide chain of protein 8abp (a1), scanned in overlapping windows encompassing 4 successive alpha-carbons (a2), thereby producing a series of four-residue fragments. Each fragment is described by a vector of four descriptors (a3). Panels b1 and b2 illustrate the optimal HMM-SA, corresponding to 27 average four-residue fragments associated with 27 letters, and the main trajectories between letters. The right block, called "1D HMM-SA space", represents the corresponding encoded chain of 8abp (c1), coloured according to secondary structures, together with the corresponding HMM-SA letter series.
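The four descriptors d1, d2, d3 and P4 described in Section 1.2 can be computed directly from alpha-carbon coordinates. The following sketch assumes one common convention for the oriented projection (the signed distance of Cα4 to the plane of the first three alpha-carbons); the exact sign convention of [15] may differ.

```python
from math import sqrt

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])
def norm(a): return sqrt(dot(a, a))

def fragment_descriptors(c1, c2, c3, c4):
    """4-descriptor vector of a four-Calpha fragment: the three distances
    between non-consecutive alpha-carbons, plus the signed (oriented)
    distance of the 4th alpha-carbon to the plane of the first three."""
    d1 = norm(sub(c3, c1))               # d(Ca1, Ca3)
    d2 = norm(sub(c4, c1))               # d(Ca1, Ca4)
    d3 = norm(sub(c4, c2))               # d(Ca2, Ca4)
    n = cross(sub(c2, c1), sub(c3, c1))  # normal of plane (Ca1, Ca2, Ca3)
    p4 = dot(sub(c4, c1), n) / norm(n)   # oriented projection of Ca4
    return d1, d2, d3, p4
```

Sliding this function along a chain of n + 3 alpha-carbons yields the n four-dimensional vectors that the model treats as emissions.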
Suppose that polypeptide chains are made up of representative fragments of R different types S1, ..., SR; we then assume that the model has R letters. Each letter Si is associated with a multi-normal density b_Si(y) of parameters θi, given by the mean μi and the variability Σi of the descriptors corresponding to the set of fragments generated by letter i. Two types of model were considered to identify the R letters: a process without memory (order 0), assuming independence of the R letters (training by simple finite Mixture Models, MM0), or a process with a memory of order 1. The model with memory takes into account the dependence between letters using a Hidden Markov Model of order one (HMM1). We assume a common letter-dependence process for all polypeptide chains, governed by a Markov chain of order one. The evolution of the Markov chain is completely described by: (1) the law ν_p of the initial letter of each polypeptide chain p, i.e. the probability that each polypeptide chain starts in the different letters, and (2) the matrix of transition probabilities π between the letters of the Markov chain, where π_ii' = P(X_{t+1} = S_i' | X_t = S_i) is the probability of evolving from letter S_i to S_i' at any position t. Here, the hidden sequence of letters {x1, x2, ..., xN} emits the series of vectors {y1, y2, ..., yN} describing consecutive overlapping fragments of the proteins, resulting in an HMM1. Our ultimate goal is to reconstruct the unobserved (hidden) letter sequence {x1, x2, ..., xN} of the polypeptide chains, given the corresponding emitted four-dimensional vectors of descriptors {y1, y2, ..., yN}, and to provide a classification of the successive fragments into the R letters. For a given 3D conformation and a selected model (fixed number R of letters), the corresponding best letter sequence among all the possible paths in {S1, ..., SR}^N can be reconstructed by a dynamic programming algorithm based on the Markovian process (Viterbi algorithm, [18]). For a given set of proteins and a given number R of letters, the unknown parameters λ = (π, ν, θ) of the selected model were estimated with an Expectation-Maximization (EM) algorithm [19] applied to the complete likelihood. The complete likelihood of the n four-residue fragments {y1, y2, ..., yn} describing a protein of n + 3 residues is

\[ V_\lambda(y_1, y_2, \ldots, y_n) = \sum_{\{x_1, x_2, \ldots, x_n\}} \nu(x_1)\, b_{x_1}(y_1) \prod_{t=1}^{n-1} \pi_{x_t x_{t+1}}\, b_{x_{t+1}}(y_{t+1}) \tag{1} \]

For an overview of the basic theory of HMM, see [18]; for practical details on the application to protein structures, see [15]. Structural alphabets of different sizes R were learnt on two independent learning sets of proteins using HMM1 and MM0, by progressively increasing R from 12 to 33 letters, and compared using statistical criteria such as the Bayesian Information Criterion (BIC) [20].

1.3. HMM-SA

We briefly recall the main characteristics of HMM-SA [16]. The optimal structural alphabet (HMM-SA), selected using the BIC, corresponds to 27 letters and their transition matrix. The identified letters, denoted structural letters [a, A, B, ..., Y, Z], are presented by increasing stretches (see Fig. 1.b1). Concerning the description of protein architecture logic, as suggested by the large influence of the Markovian process on the BIC, we observed a strong dependence between letters. This results in the existence of only a few pathways between the letters, obeying precise and unidirectional rules (see Fig. 1.b2). Letters associated with close shapes can have different logical roles. For instance, the two geometrically closest letters [A, a], both close to the canonical alpha helix, are distinguished by different preferred input and output letters. To simplify the description of the HMM-SA, we compare the 27 letters to the usual secondary structures: alpha-helices (38%), extended structures (19%) and coils (43%).
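The Viterbi reconstruction of the hidden letter sequence can be sketched as follows. This is a toy illustration in log space with a generic R-letter model; the initial, transition and emission log-probabilities below are made up for the example and are not the actual HMM-SA parameters.

```python
def viterbi(obs_logprob, log_init, log_trans):
    """Most probable hidden letter sequence (Viterbi, in log space).
    obs_logprob[t][s]: log emission probability of observation t under letter s;
    log_init[s]: log initial probability; log_trans[k][s]: log P(s | k)."""
    n, r = len(obs_logprob), len(log_init)
    delta = [[0.0] * r for _ in range(n)]   # best log score ending in letter s at t
    back = [[0] * r for _ in range(n)]      # backpointers
    for s in range(r):
        delta[0][s] = log_init[s] + obs_logprob[0][s]
    for t in range(1, n):
        for s in range(r):
            best = max(range(r), key=lambda k: delta[t - 1][k] + log_trans[k][s])
            back[t][s] = best
            delta[t][s] = delta[t - 1][best] + log_trans[best][s] + obs_logprob[t][s]
    # backtrack from the best final letter
    path = [max(range(r), key=lambda s: delta[n - 1][s])]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

In the real setting, `obs_logprob[t][s]` would be the log of the multi-normal density b_Ss evaluated on the t-th four-descriptor vector.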
More precisely, the occurrence of a letter in a secondary structure corresponds to the occurrence of its associated four-residue fragments whose third residue is assigned to this secondary structure. Letters [A, a, V, W] appear almost exclusively in alpha-helices (more than 92% of associated fragments assigned to alpha-helices), while letters [Z, B, C, E, H] are more split between alpha-helices and coils. Five letters [L, N, M, T, X] are mostly located in extended structures (from 47% to 78% of associated fragments assigned to beta-strands). The other letters are mostly associated with coils; for instance, letters [D, S, Q, F, U] have more than 90% of their fragments assigned to coils.

1.4. Structure encoding

To encode the structures, a vector of 4 descriptors describes each fragment of 4-residue length (see Fig. 1.a3, [15]). For a protein of n + 3 residues, given the series of n four-dimensional descriptor vectors and the HMM-SA, it is possible to encode it as a series of letters of the alphabet, using, for example, the Viterbi or the forward-backward algorithms. The Viterbi algorithm identifies the best letter sequence (or optimal trajectory) among all the possible paths in {a, A, ..., Z}^n by dynamic programming, taking into account the Markovian dependence between consecutive letters. We used this approach to optimally describe each structure as a series of letters of HMM-SA. This process of compression of a 3D protein conformation into its 1D HMM-SA representation is illustrated in Fig. 1 on the structure of an L-arabinose binding protein (8abp). This alpha/beta protein is coloured first (Fig. 1.a1) according to its secondary structure assignment and, secondly, after HMM-SA compression, according to its corresponding series of letters (Fig. 1.c1).

1.5. Sequence specificity of HMM-SA

To study how the amino acids are distributed among the letters of HMM-SA, we extracted from our collection of encoded proteins the four-residue fragments associated with each type of letter and the corresponding n-uplets (n varying from 1 to 4) of amino acids. The amino acid distribution associated with each letter was compared to that observed in the Id30 set by relative entropy [21]. This measure, also known as the Kullback-Leibler asymmetric divergence, denoted Kdl(S_i) for letter S_i, quantifies the sequence specificity over the four positions of letter i as:

\[ K_{dl}(S_i) = \sum_{1 \le l \le 4}\; \sum_{1 \le a \le 20} p_{a,l,i} \ln\!\left[ \frac{p_{a,l,i}}{p_{a,l}} \right] \tag{2} \]

where a denotes a given amino acid (1 ≤ a ≤ 20), l denotes one of the four positions (1 ≤ l ≤ 4), p_{a,l,i} is the frequency of amino acid a observed at position l of letter i, and p_{a,l} is the frequency of amino acid a at position l in the global database. This measure is expected to be close to 0 for an unspecific amino acid distribution and to increase with sequence dependence. The Kdl values can be assessed by a chi-square test, since the quantity N_i Kdl(S_i), with N_i the number of four-residue fragments associated with letter i, follows a chi-square distribution. Thus, letters associated with a specific amino acid distribution have significant Kdl values. To compare the sequence specificity of HMM-SA alphabets of different sizes R, we use the global sequence information K_R, obtained as the weighted average relative entropy over the R letters. Larger values correspond to a stronger global sequence specificity of the corresponding HMM-SA, while values close to 0 correspond to letters without sequence signature.
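Eq. (2) can be computed directly from the position-specific frequencies. A minimal sketch (frequencies are assumed precomputed; terms with a zero frequency are taken as 0, the usual convention):

```python
from math import log

def kdl(p_letter, p_global):
    """Relative entropy of Eq. (2). p_letter[l][a]: frequency of amino acid a
    at position l (of 4) among fragments of one letter; p_global[l][a]: the
    corresponding frequency over the whole database."""
    total = 0.0
    for l in range(4):
        for a in range(len(p_letter[l])):
            if p_letter[l][a] > 0:                      # 0 * ln 0 = 0 by convention
                total += p_letter[l][a] * log(p_letter[l][a] / p_global[l][a])
    return total
```

A letter whose amino acid distribution matches the database profile gives Kdl = 0; any deviation makes Kdl positive, and N_i * Kdl can then be compared to a chi-square threshold as described above.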
To check the divergence of sequence specificity between the R letters, we use the Kullback-Leibler symmetric divergence, whose expression for letters S_i and S_i' is:

\[ K_{div}(S_i, S_{i'}) = \sum_{1 \le l \le 4} \left[ \sum_{1 \le a \le 20} p_{a,l,i} \ln\frac{p_{a,l,i}}{p_{a,l,i'}} + \sum_{1 \le a \le 20} p_{a,l,i'} \ln\frac{p_{a,l,i'}}{p_{a,l,i}} \right] \tag{3} \]

where p_{a,l,i} and p_{a,l,i'} denote the frequency of amino acid a observed at position l of letter S_i or S_i', respectively. This measure is expected to be close to 0 when the sequence distributions do not differ and to increase with sequence dependence. Thus, positions with highly specific amino acid distributions are associated with significant values. A correspondence analysis is used to visualize the main relationships between the 20 amino acids and the 27 letters. Then, to analyse the position-by-position dependence between amino acids and the fragments associated with each letter, Z-score matrices are computed for the 27 letters. For each letter, the Z-scores are computed from the amino acid distributions observed at each of the four positions of the associated fragments of 4-residue length, independently, normalized by the distribution of the amino acids in Id50. Hence, positive (respectively negative) Z-scores correspond to over-represented (respectively under-represented) amino acids at some position of one letter. To analyse amino acid word preferences for letters, the occurrences of amino acid words are computed and normalized for each letter into a Z-score, taking into account the expected number of occurrences of the word within this letter.

Fig. 2. Evolution of the amino acid distribution specificity as a function of the number of letters of the HMM-SA, measured by the Kullback-Leibler asymmetric divergence K_R and the symmetric divergence K_divR. As R increases, these measures are expected to increase; they were computed for increasing R on letters randomly picked from the data set with letter frequencies identical to those of SA-R. At the optimum from a statistical point of view (R = 27), all letters have significant amino acid specificity compared to the amino acid profile of the whole learning database (Id30), i.e. each letter has a significant Kullback-Leibler asymmetric divergence (P < 0.001): [a]: 0.190, [A]: 0.397, [V]: 0.324, [W]: 0.235, [Z]: 0.150, [B]: 0.168, [C]: 0.316, [D]: 1.172, [O]: 0.100, [S]: 0.201, [R]: 0.593, [Q]: 0.384, [I]: 0.307, [F]: 0.228, [F]: 0.511, [U]: 0.598, [P]: 0.264, [H]: 0.683, [J]: 0.281, [Y]: 0.398, [J]: 0.749, [K]: 0.413, [L]: 0.135, [N]: 0.318, [M]: 0.408, [T]: 0.254, [X]: 0.184.

Fig. 3. Correspondence analysis between HMM-SA and amino acids: representation of the main associations between the 20 amino acids and the 27 letters. Here, the amino acid preference of the third residue of the four-residue fragments associated with each letter is presented. The first four eigenvalues account for 96% of the variance. The first factorial plane shows the predominant role of glycine and asparagine, associated with coil letters [D, U, J, R, F], and of proline, associated with coil letters [K, Y, H, P]. The second factorial plane shows the amino acid antagonism between the regular secondary structures, with the extended letters [M, N, T, X] on the right of the third factorial axis and the helical ones [A, a, V, W] on the left.

2. Results

2.1. Conformational learning is coupled with increasing sequence specificity

As shown in Fig.
2, the K_R values increase with the number of letters R from 12 to 33 on the learning set Id30, which implies that the distribution of the 4-plets of amino acids associated with each letter becomes more and more specific as the number of letters of the HMM-SA increases. For randomly selected letters, the increase is 0.002 from 12 to 33 letters, versus 0.063 for the HMM-SA. We also observe an increase in the difference of specificity between letters, as measured by the K_div values: 0.0003 for randomly selected letters versus 0.88 for the HMM-SA. In the following, we only discuss HMM-SA, the alphabet of 27 letters identified as optimal, i.e. with no over-fitting of the model parameters, using the statistical BIC criterion [16]. At the optimum, i.e. HMM-SA-27, the K_R value is 0.34 and all letters have significant amino acid specificity compared to the amino acid profile of the whole learning set (Id30). This confirms that, although we learnt HMM-SA using only geometric and sequential information, we have not over-split sequence information.

2.2. Amino acid distributions are highly specific to each letter of HMM-SA

First, we assess whether the distribution of amino acids among the different letters of HMM-SA is significant. This was performed on the Id30 set. Since each letter represents a set of four-residue fragments, we have checked that, for each of the four positions, the distributions of the amino acids at that position differ from one letter to another, using a chi-square test (P < 0.0001).

Fig. 4. Z-score matrices for the 27 letters, computed from the amino acid distributions. The frequency of occurrence of each amino acid at each of the four positions of the fragments associated with each letter is normalized into Z-scores relative to the frequencies of amino acids and of states in the Id30 protein set. Letters are sorted by increasing stretches. The amino acids, represented by their one-letter code (I, V, L, M, A, F, Y, W, C, P, G, H, S, T, Q, D, N, K, R, E), are plotted on the y-axis and the positions within each segment (1-4) on the x-axis. The colour of each square indicates the level of occurrence of each amino acid at each position. Higher levels are indicated by dark colours (absolute Z-scores > 4.4), still-significant values by light colours (2 < absolute Z-scores < 4.4), and low levels by white (absolute Z-scores < 2). Blue squares designate over-represented amino acids (Z-scores > 2) and pink squares under-represented amino acids (Z-scores < -2). (4a) Helical letters. (4b) Extended letters. (4c) Coil letters. (4d) Coil letters.

The visualization of the main associations between the 27 letters and the 20 amino acids is presented in Fig. 3, for position 3 of the associated fragments, using a correspondence analysis. The first two eigenvalues account for 80% of the variance, and the first four for 96%. The first factorial plane illustrates the predominant role of glycine and asparagine (first factorial axis), associated with coil letters [D, U, J, R, F], and of proline (second factorial axis), associated with coil letters [K, Y, H, P]. The second factorial plane illustrates the antagonism between the regular secondary structures, with the extended letters [M, N, T, X] on the right of the third factorial axis and the helical ones [A, a, V, W] on the left. In agreement with the known amino acid preferences in terms of secondary structures [9,22-26], there are large preferences of the helical letters [A, a, V, W] for hydrophobic (alanine, leucine, and methionine) and charged (glutamic acid and arginine) amino acids, as well as glutamine.
Strand letters [X, L, M, N, T] present a strong association with valine, isoleucine, threonine, phenylalanine and tyrosine. Letters [E, O, S, Z] present little specificity; the weakest dependences measured by the Kullback-Leibler asymmetric divergence, observed for letters [E] and [Z], are nevertheless still very significant (P < 0.001). Interestingly, letters [F, J, R], the fuzziest in terms of geometry, are among the most specific letters in terms of amino acid preference. This confirms, as suggested by the analysis of the transitions (see [16]), that these fuzzy letters are not "trash" letters clustering "outliers". In more detail, Fig. 4 shows the Z-score matrices for the 27 letters. For each letter, the Z-scores are computed from the amino acid distributions observed at each of the four positions of the 4-residue-length fragments independently, normalized by the distribution of the amino acids in the whole protein set. Despite the large number of letters, significant values are observed for all positions of all letters, including the coil letters. For letters that are repeated to form regular secondary structures [A, a, M, N, T], the same significance pattern tends to be propagated through the four positions. Some exceptions occur, such as the over-representation of glycine at the first position of letters [N] or [M], or the under-representation of valine at positions 3 and 4 of [A]. Letters [A, a], which are close in terms of geometry, also appear close in terms of sequence dependence. Still, different amino acid preferences are observed at some positions of [A, a]: apart from valine, which is under-represented at the first two positions of [a], isoleucine is over-represented in [A] but not in [a], histidine is under-represented only in [A], and glycine and phenylalanine are not under-represented at, respectively, the last position of [a] and the first two positions of [a].
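The paper does not spell out the exact normalization behind the Z-score matrices of Fig. 4; a common choice, assumed here, is the binomial approximation Z = (observed - N p) / sqrt(N p (1 - p)), with N the number of fragments of the letter and p the database frequency of the amino acid.

```python
from math import sqrt

def zscore_matrix(counts, global_freq):
    """Per-position amino-acid Z-scores for one letter (binomial sketch).
    counts[l][a]: occurrences of amino acid a at position l among the letter's
    fragments; global_freq[a]: database frequency of amino acid a."""
    z = []
    for l in range(4):
        n = sum(counts[l])                     # fragments observed at this position
        row = []
        for a, p in enumerate(global_freq):
            expected = n * p
            if expected > 0:
                row.append((counts[l][a] - expected) / sqrt(expected * (1 - p)))
            else:
                row.append(0.0)
        z.append(row)
    return z
```

With this convention, |Z| > 2 marks the significant over- or under-representations highlighted in Fig. 4.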
Note that the Z-score matrices average over the ways letters are connected: all possible transitions are considered simultaneously, so it is not easy to analyse in depth the over- and under-representations along the transitions between the letters.

2.3. Some amino acid words are specific to HMM-SA letters

We now turn to the analysis of the distribution of words of amino acids within each of the 27 letters. Given the size of Id50 (779,537 observations), one can in theory expect 4.9 occurrences of each of the 160,000 possible words of 4 amino acids (4-words). Here, we observe only 134,878 such words (i.e. 84% of the possible 4-words are observed from 1 to 100 times), 25,122 being not observed. For words of size 2 (2-words), all of them occur, and for words of size 3 (3-words), only 6 do not occur. We observe that for 20 of the 27 letters, the distribution of the 2-words differs significantly from that expected when considering the positions independently.
For all letters except [a, V, W, Z, C, E, I], the average Z-score over the 400 possible amino acid pairs is significant (P < 0.0001, to take test repetition into account). For a P value of 0.001, all letters except [a, E] have significant average Z-score values. We observe the same tendency for 3-words (not shown). For 4-words, only 225 words occur more than 40 times in Id50.

Table 1
Distribution over HMM-SA letters of the 225 4-words of amino acids observed more than 40 times in Id50

T = 60%: 12 'specific' 4-words; 3 letters. A: [AKAA, LRAA, LEAA, AALE, EAAK, EALR]; S: [GDSG, LGLP]; D: [ALGL, ELGL, ADGS, ADGT]
T = 50%: 53 'specific' 4-words; 5 letters. A: [LAAA, LEAA, VEAA, AIAA, AKAA, ALAA, TLAA, ARAA, LRAA, AAEA, LAEA, LLEA, AREA, AAKA, LEKA, LKKA, ALKA, LLKA, QALA, EKLA, ALLA, EVLA, ALRA, ALAE, KAIE, AALE, EALE, ALRE, SAAK, EAAK, LAAL, AEAL, LEAL, EKAL, ALAL, LLAL, LEEL, LKEL, LARL]; S: [AAAR, EAAR, EALR, LGLP, GDSG, AGAD, LEEI]; D: [ADGS, ADGT, ALGL, ELGL, EAGA]; K: [GKPL]; F: [KDGK]
T = 40%: 113 'specific' 4-words; 8 letters: A, S, D, K, F, S, R, Y
T = 30%: 181 'specific' 4-words; 13 letters: A, S, D, K, F, R, Y, J, M, B, I, L, W

A word is considered 'specific' to a letter if its probability of occurrence in that letter exceeds a fixed minimal probability threshold T. Each row gives the threshold T, the corresponding number of specific 4-words, the number of letters presenting at least one specific 4-word, and the corresponding letters [with their specific 4-words for T = 60% and T = 50%].
For 96% of them, the occurrence frequencies differ significantly from the expected ones, using the chi-square test. Only 4% of these 4-words do not show any statistical preference for a particular letter. Examining whether some 4-words are specific to one particular letter (Table 1), we observe that 12, 53, 113, and 181 of the 225 4-words occur with a probability of more than 60%, 50%, 40%, and 30%, respectively, in one particular letter. Some of these 4-words can be considered as 'specific' signatures of some letters: respectively, 3, 5, 8, and 13 letters have at least one 4-word associated with a probability of more than 60%, 50%, 40%, and 30%. Some 4-words have a clear preference for two letters. The word 'LKPG' is associated with [Y] (21/46, ~46% of occurrences) and [K] (14/46, ~30% of occurrences), and the word 'KDGK' with [D] (27/49, ~55% of occurrences) and [F] (17/49, ~35% of occurrences). Such 4-words can be associated with structurally unrelated letters: for instance, the word 'LTAA' is split mostly between [A] and [Y] (around 26% of occurrences each), which are not geometrically close. Finally, we also observe that words differing by only one amino acid can have marked preferences for letters presenting no related conformations. For instance, the words 'ALLA', 'ALRA' and 'ALRE' occur mostly in [A] (more than 40%), while the word 'ALVE' is mostly found in [Y].

3. Discussion, perspectives

In the present paper, we have started analysing whether an alphabet such as HMM-SA is suitable for deciphering some local sequence-structure relationships. Our results clearly show that sequence information specific to each letter of the structural alphabet exists and that we have not over-split sequence information. In terms of protein structure prediction, simple local libraries, such as the I-sites [27,28], introduced as local constraints on the conformations available to proteins, have shown their usefulness for improving ab initio prediction [29,30]. Our results tend to prove that the accurate HMM-SA is well suited for prediction. First, we observe that the way the letters are identified, for embedded alphabets of increasing size, is correlated with an increased specificity of the associated amino acid sequences. Interestingly, the large number of letters of HMM-SA (27) is classically mentioned as a problem for obtaining specific sequence information [3]. Here, we observe that all of the 27 letters present some sequence signature. Finally, we also observe that the relationship between letters and amino acid sequence can be informative at the level of words of amino acids. Such results are not surprising if one considers that some sequence motifs of length up to 9 have been shown to be specific to some particular local structures [10]. However, some novelty lies in the fact that some specificity occurs in a general manner for almost all letters. This opens the door to building a prediction strategy based on more accurate information than considering independently the amino acids occurring at consecutive positions. A classical way to lower the impact of this limitation is to sum the information over a window encompassing the position at which the prediction is performed (see, for example, [31]). With HMM-SA, it seems possible to directly use at least doublets or triplets of amino acids (words of size 2 or 3) to perform the prediction of letters; presently, the size of the protein dataset limits the relevance of the information to words of size 3, even though we observe a clear dependence between letters and some words of size 4. More recently, it has been suggested [32] that learning the sequence-structure relationship while taking into account the way local conformations are connected may lead to prediction improvement. In the present study, we have not analysed how much the Markovian process associated with HMM-SA could help take into account the specificity of the amino acid sequence information. For amino acid words having preferences for several letters associated with clearly different local shapes, the Markovian transitions could help choose among the different preferred letters. Work is in progress in this direction.

References

[1] R.H. Waterston, E.S. Lander, J.E. Sulston, On the sequencing of the human genome, Proc. Natl. Acad. Sci. U. S. A. 99 (2002) 3712-3716.
[2] T.A. Jones, S. Thirup, Using known substructures in protein model building and crystallography, EMBO J. 5 (1986) 819-822.
[3] A.G. de Brevern, A.C. Camproux, S. Hazout, C. Etchebest, P. Tuffery, Beyond the secondary structures: the structural alphabets, Recent Adv. In Prot. Eng. 1 (2001) 319-331.
[4] R. Unger, D. Harel, S. Wherland, J.L. Sussman, A 3D building blocks approach to analyzing and predicting structure of proteins, Proteins 5 (1989) 355-373.
[5] M.J. Rooman, J. Rodriguez, S.J. Wodak, Automatic definition of recurrent local structure motifs in proteins, J. Mol. Biol. 213 (1990) 327-336.
[6] S.J. Prestrelski, A.L. Williams, M.N. Liebman, Generation of a substructure library for the description and classification of protein secondary structure: I. Overview of the methods and results, Proteins 14 (1992) 430-439.
[7] M. Levitt, Accurate modeling of protein conformation by automatic segment matching, J. Mol. Biol. 226 (1992) 507-533.
[8] J. Schuchhardt, G. Schneider, J. Reichelt, D. Schomburg, P. Wrede, Local structural motifs of protein backbones are classified by self-organizing neural networks, Protein Eng. 9 (1996) 833-842.
[9] J.S. Fetrow, M.J. Palumbo, G. Berg, Patterns, structures, and amino acid frequencies in structural building blocks, a protein secondary structure classification scheme, Proteins 27 (1997) 249-271.
[10] C. Bystroff, D. Baker, Prediction of local structure in proteins using a library of sequence-structure motifs, J. Mol. Biol. 281 (1998) 565-577.
[11] A.G. de Brevern, C. Etchebest, S. Hazout, Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks, Proteins 41 (2000) 271-287.
[12] C. Micheletti, F. Seno, A. Maritan, Recurrent oligomers in proteins: an optimal scheme reconciling accurate and concise backbone representations in automated folding and design studies, Proteins 40 (2000) 662-674.
[13] R. Kolodny, P. Koehl, L. Guibas, M. Levitt, Small libraries of protein fragments model native protein structures accurately, J. Mol. Biol. 323 (2002) 297-307.
[14] C.G. Hunter, S. Subramaniam, Protein fragment clustering and canonical local shapes, Proteins 50 (2003) 580-588.
[15] A.C. Camproux, P. Tuffery, J.P. Chevrolat, J.F. Boisvieux, S. Hazout, Hidden Markov model approach for identifying the modular framework of the protein backbone, Protein Eng. 12 (1999) 1063-1073.
[16] A.C. Camproux, R. Gautier, P. Tuffery, A hidden Markov model derived structural alphabet for proteins, J. Mol. Biol. 339 (2004) 591-605.
[17] F. Guyon, A.C. Camproux, J. Hochez, P. Tuffery, SA-Search: a web tool for protein structure mining based on a structural alphabet, Nucleic Acids Res. 32 (2004) W545-W548.
[18] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77 (1989) 257-285.
[19] L.E. Baum, T. Petrie, G. Soules, N. Weiss, A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, Ann. Math. Stat. 41 (1970) 164-171.
[20] G. Schwartz, Estimating the dimension of a model, Ann. Stat. 6 (1978) 461-464.
[21] S. Kullback, R.A. Leibler, On information and sufficiency, Ann. Math. Stat. 22 (1951) 79-86.
403 [22] P. Argos, J. Palau, Amino acid distribution in protein secondary structures, Int. J. Pept. Protein Res. 19 (1982) 380 – 393. [23] J.S. Richardson, D.C. Richardson, Amino acid preferences for specific locations at the ends of a-helices, Science 240 (1988) 1648 – 1652. [24] L. Presta, Protein structure analysis and development of databases, Protein Eng. 2 (1989) 395 – 397. [25] R. Aurora, R. Srinivasan, G.D. Rose, Rule for a helix termination by glycine, Science 264 (1994) 1126 – 1130. [26] J.W. Seales, R. Srinivasan, G.D. Rose, Sequence determinants of the capping box, a stabilizing motif at the N-termini of a-helices, Prot. Sci. 3 (1994) 1741 – 1745. [27] C. Bystroff, V. Thorsson, D. Baker, HMMSTR: a hidden Markov model for local sequence – structure correlations in proteins, J. Mol. Biol. 301 (2000) 173 – 190. [28] C. Bystroff, Y. Shao, Fully automated ab initio protein structure prediction using I-SITES, HMMSTR and ROSETTA, Bioinformatics 18 (2002) S54 – S61. [29] K. Simons, R. Bonneau, I. Ruczinki, D. Baker, Ab initio protein structure prediction of Casp III targets using ROSETTA, Proteins 5 (1999) 355 – 373. [30] R. Bonneau, J. Tsai, I. Ruczinki, D. Chivian, C. Rohl, C.E.M. Strauss, D. Baker, Rosetta in CASP4: progress in ab initio protein structure prediction, Proteins 37 (2001) 119 – 126. [31] R. Karchin, M. Cline, Y. Mandel-Gutfreund, K. Karplus, Hidden Markov models that use predicted local structure for fold recognition: alphabets on backbone geometry, Proteins 51 (2003) 504 – 514. [32] A.G. de Brevern, H. Valadie, S. Hazout, C. Etchebest, Extension of a local backbone description using a structural alphabet: a new approach to the sequence – structure relationship, Protein Sci. 11 (2002) 2871 – 2886.
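Appendix note (added for illustration). The word propensities discussed above, the significant preferences of amino acid words of size 2 or 3 for particular structural letters, can be sketched with a simple log-odds estimator. The snippet below is not the authors' estimator; the anchoring of each word at the position of its paired letter, the toy two-letter data, and all function names are illustrative assumptions.

```python
import math
from collections import Counter

def word_propensities(aa_seqs, letter_seqs, k=2):
    """Log-odds propensity log[P(word | letter) / P(word)] of each
    amino-acid word of size k for each structural letter.

    aa_seqs and letter_seqs are parallel lists of strings; each word
    aa[i:i+k] is paired with the structural letter at position i
    (an illustrative anchoring convention, not the paper's)."""
    word_counts = Counter()       # overall word frequencies
    joint_counts = Counter()      # (word, letter) co-occurrence counts
    letter_counts = Counter()     # how many words were paired with each letter
    for aa, letters in zip(aa_seqs, letter_seqs):
        for i in range(len(aa) - k + 1):
            word, letter = aa[i:i + k], letters[i]
            word_counts[word] += 1
            joint_counts[(word, letter)] += 1
            letter_counts[letter] += 1
    total = sum(word_counts.values())
    # Positive score: the word is enriched in positions described by the letter.
    return {
        (word, letter): math.log((c / letter_counts[letter])
                                 / (word_counts[word] / total))
        for (word, letter), c in joint_counts.items()
    }

# Toy data: 'GG' words co-occur with letter 'h', 'AA' words with 'c'.
prop = word_propensities(["GGGGAAAA"], ["hhhhcccc"], k=2)
```

In line with the discussion, such per-word scores could serve as emission terms in a Viterbi-style decoding combined with the HMM-SA transition matrix, letting the Markovian transitions arbitrate when a word prefers several letters.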