Biochimica et Biophysica Acta 1724 (2005) 394 – 403
http://www.elsevier.com/locate/bba
Minireview
Hidden Markov Model-derived structural alphabet for proteins: The
learning of protein local shapes captures sequence specificity
A.C. Camproux*, P. Tufféry
Equipe de Bioinformatique Génomique et Moléculaire, INSERM U726, Université Paris 7, case 7113, 2 place Jussieu, 75251 Paris, France
Received 1 March 2005; received in revised form 10 May 2005; accepted 11 May 2005
Available online 15 June 2005
Abstract
Understanding and predicting protein structures depend on the complexity and the accuracy of the models used to represent them.
We have recently set up a Hidden Markov Model to optimally compress protein three-dimensional conformations into a one-dimensional
series of letters of a structural alphabet. Such a model learns simultaneously the shape of representative structural letters describing the
local conformation and the logic of their connections, i.e. the transition matrix between the letters. Here, we move one step further and
report some evidence that such a model of protein local architecture also captures some accurate amino acid features. All the letters have
specific and distinct amino acid distributions. Moreover, we show that words of amino acids can have significant propensities for some
letters. Perspectives point towards the prediction of the series of letters describing the structure of a protein from its amino acid
sequence.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Hidden Markov Model; Structural alphabet; Protein structural organization; Sequence – structure relationship
1. Introduction
The recent genome sequencing projects [1] have provided sequence information for a large number of proteins.
In most cases, an accurate three-dimensional (3D) structural
knowledge of the proteins is necessary for a detailed
functional characterization of these sequences. However,
even in the days of high-throughput techniques, experimental determination of protein structures by X-ray
crystallography or NMR remains quite time consuming.
Thus, there is an increasing gap between the number of
available protein sequences and experimentally derived
protein structures, which makes it even more important to
improve the methods for predicting protein 3D structures.
The structural biology community has long focused on the
very hard task of developing algorithms for solving the ab
initio protein folding problem—namely, predicting protein
* Corresponding author.
E-mail address: [email protected] (A.C. Camproux).
0304-4165/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.bbagen.2005.05.019
structure from sequence. An obvious direction to get some
simplification is to consider that recurrent structural motifs
exist at all levels of organization of protein structures [2].
Recent years have seen the re-emergence of an old
concept: the identification of libraries of canonical 3D
structural fragments, i.e. a finite but representative set of
generic protein fragments that spans the space of local
structures [3]. These libraries of fragments are classically
constructed by clustering fragments from a collection
obtained from a set of dissimilar proteins. Many clustering
approaches have been used to extract sets of representative
fragments able to represent adequately all known local
protein structures [4– 14]. Despite the fact that such libraries
provide an accurate approximation of protein conformations, their identification teaches us little about the way
protein structures are organized: during the learning step,
they do not consider the rules that govern how the local
fragments assemble to produce a protein structure. An
obvious means of overcoming such limitations is to
consider that the series of representative fragments that can
describe protein structures are in fact not independent but
governed by a Markovian process. For this purpose, we
have shown that Hidden Markov Models (HMMs) [15] are
relevant for identifying a structural alphabet (HMM-SA).
Recently, we have set up an optimal Hidden Markov Model-derived
structural alphabet for proteins, which describes the
local shape of proteins and the logic of their assembly using
27 letters [16]. Such a structural alphabet is able to
optimally compress 3D information into a unique 1D
representation. We have since shown that this alphabet,
which provides an accurate description of protein conformations, can be used to search for structural similarities [17].
However, one important question is to assess to what extent
such a structural alphabet is suited for structural prediction.
Fragment libraries classically face the dilemma [3] that
some balance has to be found between the accuracy of the
structural description and reasonable fragment library size: (i)
to keep a good representativity of 3D structural fragments
optimising the relevance of the 1D-3D relationship to ensure
the quality of the prediction, and (ii) to obtain a reasonable
complexity for 3D reconstruction.
In this paper, we address the question of the ability of
HMM-SA, a structural alphabet that was learnt using
exclusively geometrical and logical information (i.e. taking
into account geometry of the letters and their transitions), to
capture a clear sequence-conformation relationship. We perform an a posteriori analysis of the dependence between the
local amino acid sequence and the local shapes of proteins
encoded as series of letters of the structural alphabet.
1.1. Materials
Our analysis was performed on a collection of non-redundant protein structures presenting less than 50%
sequence identity. Only proteins at least 30 amino acids
long, having no chain breaks, and obtained by X-ray
diffraction with a resolution better than 2.5 Å were retained.
This resulted in a collection of 3427 protein chains (denoted
as Id50) and a subset of 1429 protein chains having less than
30% sequence identity (denoted as Id30). The collection of
3427 proteins represents a total of 809,638 amino acids and
799,357 four-residue fragments. The subset of 1429 proteins
contains 336,780 amino acids and 332,493 four-residue
fragments.
1.2. Model learning
Proteins are described by their alpha carbons only (see Fig.
1.a1) and are decomposed as series of overlapping fragments of
Fig. 1. Illustration of the HMM-SA encoding process. The left block, called "3D space", represents the polypeptide chain of protein 8abp (a1) scanned in
overlapping windows encompassing 4 successive alpha-carbons (a2), thereby producing a series of four-residue fragments. Each fragment is described by a
vector of four descriptors (a3). Panels b1 and b2 illustrate the optimal HMM-SA, corresponding to 27 average four-residue fragments associated with 27
letters, and the main trajectories between letters. The right block, called "1D HMM-SA space", represents the corresponding encoded chain of 8abp (c1),
coloured according to secondary structures, and the corresponding HMM-SA letter series.
four-residue length (see Fig. 1.a2). Each four-residue fragment h
is described by a 4-descriptor vector: the three distances between
non-consecutive alpha-carbons, d1(h) = d(Cα1(h), Cα3(h)),
d2(h) = d(Cα1(h), Cα4(h)), d3(h) = d(Cα2(h), Cα4(h)),
and the oriented projection P4 of the last alpha-carbon
Cα4(h) onto the plane formed by the first three [15], as
shown in Fig. 1.a3.
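As an illustration, the four descriptors can be computed from the coordinates of four consecutive alpha-carbons. This is a minimal sketch; the function name and the exact sign convention for the oriented projection are our assumptions, not taken from [15]:

```python
import numpy as np

def fragment_descriptors(ca: np.ndarray) -> np.ndarray:
    """Four descriptors of a 4-Calpha fragment (hypothetical helper):
    three non-consecutive Calpha-Calpha distances d1, d2, d3 and the
    signed (oriented) distance of the 4th Calpha to the plane of the
    first three. `ca` has shape (4, 3)."""
    c1, c2, c3, c4 = ca
    d1 = np.linalg.norm(c1 - c3)
    d2 = np.linalg.norm(c1 - c4)
    d3 = np.linalg.norm(c2 - c4)
    # Unit normal of the plane through c1, c2, c3; its sign orients the projection.
    n = np.cross(c2 - c1, c3 - c1)
    n /= np.linalg.norm(n)
    p4 = float(np.dot(c4 - c1, n))  # signed distance of c4 to the plane
    return np.array([d1, d2, d3, p4])
```

Applied over all overlapping windows of a chain, this yields the series of 4-descriptor vectors that the model emits.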
Suppose that polypeptide chains are made up of
representative fragments of R different types S_1, ..., S_R;
we then assume that the model has R letters. Each
letter S_i is associated with a multivariate normal density b_{S_i}(y) of
parameters θ_i, namely the mean μ_i and the
covariance Σ_i of the descriptors of the set of
fragments generated by letter i. Two types of model were
considered to identify the R letters: a process without memory
(order 0), assuming independence of the R letters (training by
simple finite Mixture Models, MM0), or a process with a
memory of order 1, which takes into account the dependence
between letters using a Hidden Markov Model of order one
(HMM1). We assume a common letter-dependence process for
all polypeptide chains, governed by a Markov chain of order
one. The evolution of the
Markov chain of proteins is completely described by: (1) the
law ν_p of the initial letter of each polypeptide chain p, i.e. the
probability that each polypeptide chain starts in the different
letters, and (2) the matrix of transition probabilities φ
between the letters of the Markov chain, where
φ_{ii'} = P(X_{t+1} = S_{i'} | X_t = S_i) is the probability,
at any position t, of evolving from letter S_i to letter S_{i'}. Here,
the hidden sequence of letters {x_1, x_2, ..., x_N} emits the series
of vectors {y_1, y_2, ..., y_N} describing the consecutive overlapping
fragments of the proteins, resulting in an HMM1. Our
ultimate goal is to reconstruct the unobserved (hidden) letter
sequence {x_1, x_2, ..., x_N} of the polypeptide chains, given the
corresponding emitted four-dimensional descriptor vectors
{y_1, y_2, ..., y_N}, and thereby to provide a classification of
successive fragments into R letters. For a given 3D conformation
and a selected model (fixed number R of letters), the
corresponding best letter sequence among all possible
paths in {S_1, ..., S_R}^N can be reconstructed by a dynamic
programming algorithm based on the Markovian process (Viterbi
algorithm, [18]). For a given set of proteins and a given
number R of letters, the unknown parameters λ = (φ, ν, θ) of the
selected model were estimated with an Expectation-Maximization
(EM) algorithm [19] applied to the complete likelihood.
The complete likelihood of the n four-residue fragments {y_1,
y_2, ..., y_n} describing a protein of n + 3 residues is

$$V_\lambda(y_1, y_2, \ldots, y_n) = \sum_{\{x_1, x_2, \ldots, x_n\}} \nu(x_1)\, b_{x_1}(y_1) \prod_{t=1}^{n-1} \phi_{x_t x_{t+1}}\, b_{x_{t+1}}(y_{t+1}) \qquad (1)$$
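The sum in Eq. (1) runs over all R^n hidden paths, but it can be evaluated in O(nR^2) with the standard forward recursion. A minimal sketch, assuming the emission densities b_{S_i}(y_t) have been precomputed into an array (names are ours):

```python
import numpy as np

def hmm_likelihood(nu, phi, B):
    """Forward-algorithm evaluation of Eq. (1): `nu` is the initial-letter
    law (R,), `phi` the letter transition matrix (R, R), and B[t, i] the
    emission density b_{S_i}(y_t) of fragment t under letter i
    (precomputed, e.g. from each letter's multivariate normal)."""
    alpha = nu * B[0]                    # alpha_1(i) = nu(i) b_i(y_1)
    for t in range(1, len(B)):
        # alpha_{t+1}(j) = sum_i alpha_t(i) phi_ij b_j(y_{t+1})
        alpha = (alpha @ phi) * B[t]
    return float(alpha.sum())            # V_lambda(y_1, ..., y_n)
```

In practice the recursion is usually run in log space (or with per-step rescaling) to avoid underflow on long chains.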
For an overview of the basic theory of HMM, see [18],
and for practical details on application to protein struc-
tures, see [15]. Structural alphabets of different sizes (R)
were learnt on two independent learning sets of proteins
using HMM1 and MM0 by progressively increasing R
from 12 to 33 letters and compared using statistical criteria
such as the Bayesian Information Criterion (BIC) [20].
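For reference, the BIC penalizes the maximized log-likelihood by the number of free parameters. The sketch below uses the generic formula together with an assumed parameter count for an R-letter HMM1 with 4-dimensional full-covariance Gaussian emissions; this bookkeeping is ours, not the paper's:

```python
import math

def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """BIC in the 'larger is better' convention often used for model
    selection: ln L - (k/2) ln n. Generic formula, not taken from [20]."""
    return log_likelihood - 0.5 * n_params * math.log(n_obs)

def hmm_param_count(R: int) -> int:
    """Assumed free-parameter count for an R-letter HMM1 with 4-D
    full-covariance Gaussians: R(R-1) transitions, R-1 initial
    probabilities, and R x (4 means + 10 covariance terms) emissions."""
    return R * (R - 1) + (R - 1) + R * (4 + 10)
```

Under this convention, the model maximizing the BIC over R = 12, ..., 33 is retained.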
1.3. HMM-SA
We briefly recall the main characteristics of HMM-SA
[16]. The optimal structural alphabet (HMM-SA), using
statistical criterion BIC, corresponds to 27 letters and their
transition matrix. The identified letters are denoted as
structural letters: [a, A, B,. . ., Y, Z] and presented by
increasing stretches (see Fig. 1.b1). Concerning protein
architecture logic description, as suggested by the large
influence of the Markovian process on the BIC, we observed
strong dependence between letters. This results in the
existence of only a few pathways between the letters, obeying
precise and unidirectional rules (see Fig. 1.b2). The
letters associated with close shapes have different logical
roles. For instance, the two closest letters [A, a] in terms of
geometry, close to canonical alpha helix, are distinguished
by different preferred input and output letters. To simplify
the description of the HMM-SA, we compare the 27 letters to
the usual secondary structures: α-helices (38%), extended
structures (19%) and coils (43%). More precisely, the
occurrence of a letter in a secondary structure corresponds
to the occurrence of its corresponding four-residue fragments
with the third residue assigned to this secondary structure.
Letters [A, a, V, W] appear almost exclusively in α-helices
(more than 92% of associated fragments assigned to α-helices),
while letters [Z, B, C, E, H] are split between α-helices and
coils. Five letters [L, N, M, T, X] are mostly
located in extended structures (from 47% to 78% of
associated fragments assigned to β-strands). The other letters
are mostly associated with coils; for instance, letters [D, S,
Q, F, U] have more than 90% of fragments assigned to coils.
1.4. Structure encoding
To encode the structures, a vector of 4 descriptors
describes each fragment of 4-residue length (see Fig. 1.a3,
[15]). For one protein of n + 3 residues, given the series of
4*n descriptors and the HMM-SA, it is possible to encode it
as a series of letters of the alphabet using the Viterbi or the
forward-backward algorithms, for example. The Viterbi
algorithm identifies the best letter sequence (or optimal
trajectory) among all the possible paths in {a, A, ..., Z}^n by
dynamic programming taking into account the Markovian
dependence between the consecutive letters. We used this
approach to optimally describe each structure as a series of
letters of HMM-SA. This process of compression of a 3D
protein conformation into 1D HMM-SA is illustrated in Fig.
1 on the structure of an L-arabinose binding protein (8abp).
This αβ protein is coloured first (Fig. 1.a1) according to its
secondary structure assignment and, secondly, after HMM-SA
compression, according to its corresponding series of
letters (Fig. 1.c1).
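The Viterbi decoding step can be sketched as follows: log-space dynamic programming over the R = 27 letters. This is an illustrative implementation, not the authors' code:

```python
import numpy as np

def viterbi(log_nu, log_phi, log_B):
    """Most probable letter sequence, given the log initial law (R,),
    the log transition matrix (R, R) and the log emission scores
    log_B[t, i] of fragment t under letter i."""
    n, R = log_B.shape
    delta = log_nu + log_B[0]
    back = np.zeros((n, R), dtype=int)   # backpointers
    for t in range(1, n):
        scores = delta[:, None] + log_phi         # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    # Backtrack from the best final letter.
    path = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Running this over the descriptor series of a chain yields its HMM-SA letter string, the Markovian dependence between consecutive letters being enforced through the transition term.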
1.5. Sequence specificity of HMM-SA
To study how the amino acids are distributed into the
letters of HMM-SA, we extracted from our collection of
encoded proteins the four-residue fragments associated with
each type of letter and the corresponding n-tuples (n varying
from 1 to 4) of amino acids. The amino acid distribution
associated with each letter was compared to that observed in
the Id30 set by relative entropy [21]. This measure, also
known as the Kullback – Leibler asymmetric divergence
measure, denoted Kdl(S_i) for letter S_i, quantifies the sequence
specificity over the four positions of letter i as:
"
#
X X
pa;l;i
Kdl ðSiÞ ¼
pa;l;i ln
;
ð2Þ
pa;l
1VlV4 1VaV20
where a denotes a given amino acid (1 ≤ a ≤ 20), l denotes
the position among the four residues (1 ≤ l ≤ 4), p_{a,l,i} the
frequency of amino acid a observed at position l of letter i,
and p_{a,l} the frequency of amino acid a at position l in the
global database. This measure is expected to be close to 0 for
an unspecific amino acid distribution and to increase with
sequence dependence.
The Kdl values can be assessed by a chi-square test, since the
quantity N_i · Kdl(S_i), with N_i being the number of four-residue
fragments associated with letter i, follows a chi-square
distribution. Thus, letters associated with a specific amino acid
distribution have significant Kdl values. To compare the
sequence specificity of HMM-SA of different sizes, R, we use
the global sequence information K R , obtained as the weighted
average relative entropy over R letters. Larger values
correspond to stronger global sequence specificity of the
corresponding HMM-SA, while values close to 0 correspond
to letters without sequence signature. To check the divergence of sequence specificity between R letters, we use the
Kullback-Leibler symmetric divergence, whose expression
for letters S_i and S_{i'} is:

$$Kdiv(S_i, S_{i'}) = \sum_{1 \le l \le 4} \sum_{1 \le a \le 20} \left[ p_{a,l,i} \ln\!\left(\frac{p_{a,l,i}}{p_{a,l,i'}}\right) + p_{a,l,i'} \ln\!\left(\frac{p_{a,l,i'}}{p_{a,l,i}}\right) \right] \qquad (3)$$
where p_{a,l,i} and p_{a,l,i'} denote the frequency of
amino acid a observed at position l of letter S_i or S_{i'},
respectively. This measure is expected to be close to 0
when the sequence distributions do not differ and to increase
with their divergence. Thus, pairs of letters with highly
specific amino acid distributions are associated with
significant values.
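Given positionwise frequency tables, Eqs. (2) and (3) reduce to a few array operations. A minimal sketch; the handling of zero-frequency cells is our assumption, as the paper does not specify it:

```python
import numpy as np

def kdl(p_letter, p_background):
    """Relative entropy of Eq. (2): p_letter[l, a] and p_background[l, a]
    are amino acid frequencies at position l (4 positions x 20 amino
    acids). Cells with zero frequency in the first argument are skipped,
    a common convention (assumed here)."""
    mask = p_letter > 0
    return float(np.sum(p_letter[mask] * np.log(p_letter[mask] / p_background[mask])))

def kdiv(p_i, p_j):
    """Symmetric divergence of Eq. (3) between two letters."""
    return kdl(p_i, p_j) + kdl(p_j, p_i)
```

For Eq. (2), p_background would hold the positionwise amino acid frequencies of the whole Id30 set; for Eq. (3), both arguments are letter-specific tables.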
A correspondence analysis is used to visualize the main
relationship between 20 amino acids and 27 letters. Then, to
analyse position-by-position dependence between amino
acids and fragments associated to each letter, Z-score
matrices are computed for the 27 letters. For each letter, the
Z-scores are computed from the amino acid distributions
observed at each of the four positions of the associated
Fig. 2. Evolution of the amino acid distribution specificity as a function of the number of letters of the HMM-SA. The evolution of the amino acid distribution
specificity of HMM-SA of different sizes, R, measured by the Kullback-Leibler asymmetric divergence K_R and the Kullback-Leibler symmetric divergence
K_divR. As R increases, these measures are expected to increase; as a control, they were also computed, for increasing R, on letters randomly picked in the
data set with the same letter frequencies as SA-R. At the optimum from a statistical point of view (R = 27), all letters have significant amino acid specificity compared to the
amino acid profile of the whole learning database (Id30), i.e. each letter has a significant Kullback – Leibler asymmetric divergence measure ( P < 0.001): [a]:
0.190, [A]: 0.397, [V]: 0.324, [W]: 0.235, [Z]: 0.150, [B]: 0.168, [C]: 0.316, [D]: 1.172, [O]: 0.100, [S]: 0.201, [R]: 0.593, [Q]: 0.384, [I]: 0.307, [F]: 0.228,
[F]: 0.511, [U]: 0.598, [P]: 0.264, [H]: 0.683, [J]: 0.281, [Y]: 0.398, [J]: 0.749, [K]: 0.413, [l]: 0.135, [N]: 0.318, [M]: 0.408, [T]: 0.254, [X]: 0.184.
398
A.C. Camproux, P. Tufféry / Biochimica et Biophysica Acta 1724 (2005) 394 – 403
Fig. 3. Correspondence analysis between HMM-SA and amino acids. Representation of the main associations between 20 amino acids and 27 letters obtained
using correspondence analysis. Here, the amino acid preference of the third residue of four-residue fragments associated to each letter is presented. The first
four eigenvalues account for 96% of the variance. The first factorial plane shows the predominant role of glycine and asparagine associated with coil letters [D,
U, J, R, F] and proline associated with coil letters [K, Y, H, P]. The second factorial plane shows the amino acid antagonism between the regular secondary
structures with the extended structures [M, N, T, X] on the right of the third factorial axis and the helical ones [A, a, V, W] on the left.
fragments of 4-residue length, independently, normalized by
the distribution of the amino acids in Id50. Hence, positive
(respectively negative) Z-scores correspond to over-represented
(respectively under-represented) amino acids at some position
of a letter. To analyse amino acid word preferences for
letters, the occurrences of amino acid words are computed
and normalized, for each letter, into a Z-score taking into
account the expected number of occurrences of the word
within this letter.
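A simple way to obtain such Z-scores is the normal approximation of the binomial; the paper does not give its exact normalization, so the following is an assumed sketch:

```python
import numpy as np

def position_zscores(counts, background):
    """Z-scores for amino acid occurrence: counts[l, a] are the observed
    counts of amino acid a at position l over a letter's fragments, and
    background[a] the database amino acid frequencies. Uses the normal
    approximation of the binomial (our assumption, not the paper's
    stated formula)."""
    n = counts.sum(axis=1, keepdims=True)             # fragments per position
    expected = n * background                          # expected counts
    std = np.sqrt(n * background * (1.0 - background))
    return (counts - expected) / std
```

With the thresholds of Fig. 4, entries with |Z| > 2 would be flagged as significant and |Z| > 4.4 as strong.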
2. Results
2.1. Conformational learning is coupled with increasing
sequence specificity
As shown in Fig. 2, K R values increase with the number
of letters R from 12 to 33 on the learning set Id30, which
implies that the distribution of the 4-plets of amino acids
associated with each letter becomes more and more specific
as the number of letters of the HMM-SA increases. For
randomly selected letters, the increase is 0.002 from 12 to
33 letters versus 0.063 for the HMM-SA. We also observe
an increase in the difference of specificity between letters, as
measured by K div values: 0.0003 for randomly selected
letters versus 0.88 for the HMM-SA. In the following, we
only discuss HMM-SA, an alphabet of 27 letters identified
as optimal, i.e. with no over-fitting of the model parameters,
using the statistical BIC criterion [16]. At the optimum, i.e.
HMM-SA-27, the K_R value is 0.34 and all letters have
significant amino acid specificity compared to the amino
acid profile of the whole learning set (Id30). This confirms
that, although HMM-SA was learnt using only geometric
and transition information, we have not over-split sequence information.
2.2. Amino acid distributions are highly specific of each
letter of HMM-SA
First, we assess whether the distribution of amino acids into the
different letters of HMM-SA is significant. This was
performed on the Id30 set. Since each letter represents a
set of four-residue fragments, we have checked that, for each
of the four positions, the distributions of the amino acids at
Fig. 4. Z-score matrices for the 27 letters computed from the amino acid distributions. The frequency of amino acid occurrence at each of the four positions of
the fragments associated with each letter, normalized into Z-scores relative to the frequency of amino acids and to the frequency of states in the Id30 protein set.
Letters are sorted by increasing stretches. The amino acids, represented by their one-letter code (I, V, L, M, A, F, Y, W, C, P, G, H, S, T, Q, D, N, K, R, E), are
plotted on the y-axis and the positions within each segment (1-4) on the x-axis. The colour of each square indicates the level of occurrence of each amino acid
at each position. Higher levels are indicated by dark colours (absolute Z-scores > 4.4), still-significant values by light colours (2 < absolute Z-scores < 4.4),
and low levels by white (absolute Z-scores < 2). Blue squares designate over-represented amino acids (Z-score > 2) and pink squares the under-represented
amino acids (Z-score < -2). (4a) More helical letters. (4b) Extended letters. (4c) Coil letters. (4d) Coil letters.
that position differ from one letter to another, using a chi-square
test (P < 0.0001). The visualization of the main associations
between the 27 letters and 20 amino acids is performed in
Fig. 3, for position 3 of associated fragments, using a
correspondence analysis. The first two eigenvalues account
for 80% of the variance, and the first four, for 96%. The first
factorial plane illustrates the predominant role of glycine
and asparagine (first factorial axis), associated with coil
letters [D, U, J, R, F], and proline (second factorial axis),
associated with coil letters [K, Y, H, P]. The second factorial
plane illustrates the antagonism between the regular
secondary structures with the extended structures [M, N,
T, X] on the right of the third factorial axis and the helical
ones [A, a, V, W] on the left. In agreement with the known
amino acid preferences in terms of secondary structures
[9,22 –26], there are large preferences of helical letters [A, a,
V, W] for hydrophobic (alanine, leucine, and methionine)
and charged (glutamic acid and arginine) amino acids, as
well as glutamine. Strand letters [X, L, M, N, T] present a
strong association with valine, isoleucine, threonine, phenylalanine and tyrosine. Letters [E, O, S, Z] present little
specificity. The weakest dependences, measured by the
Kullback –Leibler asymmetric divergence measure observed
for letters [E] and [Z] are still very significant ( P < 0.001).
Interestingly, letters [F, J, R], the fuzziest in terms of
geometry, are among the most specific letters in terms of amino
acid preference. This confirms, as suggested by the analysis
of the transitions (see [16]), that these fuzzy letters are not
‘‘trash’’ letters clustering ‘‘outliers’’.
More in detail, Fig. 4 shows Z-score matrices for the 27
letters. For each letter, the Z-scores are computed from the
amino acid distributions observed at each of the four
positions of the 4-residue-length fragments independently,
normalized by the distribution of the amino acids in the
whole protein set. Despite the large number of letters,
significant values are observed for all the positions of all the
letters, including coil letters. For letters that are repeated to
form regular secondary structures [A, a, M, N, T], the same
significance pattern tends to be propagated through the four
positions. Some exceptions occur, such as the over-representation of glycine at the first position of letters [N] or [M],
or the under-representation of valine at positions 3 and 4 of
[A]. Letters [A, a] that are close in terms of geometry appear
also close in terms of sequence dependence. Still, different
amino acid preferences are observed at some positions of [A,
a]: apart from valine, under-represented at the first two
positions of letter [a], isoleucine is over-represented in letter
[A] but not in [a], histidine is under-represented only in [A],
and glycine and phenylalanine are not under-represented at,
respectively, the last position of [a] and the first two positions of
[a]. Note that the Z-score matrices average the way letters are
connected: all possible transitions are considered simultaneously,
so it is not straightforward to analyse the over- and
under-representations of the letters along the transitions between them.
2.3. Some amino acid words are specific of HMM-SA letters
We now turn to the analysis of the distribution of words
of amino acids into each of the 27 letters. Due to the size of
Id50 (779,537 observations), one can in theory expect 4.9
occurrences of each of the 160,000 possible words of 4
amino acids (4-words). Here, we observe only 134,878 such
words (i.e. 84% of the possible 4-words are observed from 1
to 100 times), while 25,122 are not observed. For words of size
2 (2-words), all of them occur, and for words of size 3 (3-words),
only 6 do not occur. We observe that, for 20 of the 27 letters,
the distribution of the 2-words differs significantly from that
expected when considering the positions independently. For all
letters except [a, V, W, Z, C, E, I], the average Z-score over
the 400 possible amino acid pairs is significant (P < 0.0001
Table 1
Distribution over HMM-SA letters of 225 4-words of amino acids observed more than 40 times in Id50

T = 60%: 12 specific 4-words, 3 letters. A: [AKAA, LRAA, LEAA, AALE, EAAK, EALR]; S: [GDSG, LGLP]; D: [ALGL, ELGL, ADGS, ADGT]
T = 50%: 53 specific 4-words, 5 letters. A: [LAAA, LEAA, VEAA, AIAA, AKAA, ALAA, TLAA, ARAA, LRAA, AAEA, LAEA, LLEA, AREA, AAKA, LEKA, LKKA, ALKA, LLKA, QALA, EKLA, ALLA, EVLA, ALRA, ALAE, KAIE, AALE, EALE, ALRE, SAAK, EAAK, LAAL, AEAL, LEAL, EKAL, ALAL, LLAL, LEEL, LKEL, LARL]; S: [AAAR, EAAR, EALR, LGLP, GDSG, AGAD, LEEI]; D: [ADGS, ADGT, ALGL, ELGL, EAGA]; K: [GKPL]; F: [KDGK]
T = 40%: 113 specific 4-words, 8 letters: A, S, D, K, F, S, R, Y
T = 30%: 181 specific 4-words, 13 letters: A, S, D, K, F, R, Y, J, M, B, I, L, W

A word is considered as "specific" of a letter if its probability of occurrence in the letter exceeds a fixed minimal probability threshold T. For each threshold T, the row gives the corresponding number of specific 4-words, the number of letters presenting at least one specific 4-word, and the corresponding letters [with, for T = 60% and T = 50%, the specific 4-words].
to take into account test repetition). For a P value of 0.001,
all letters except [a, E] have significant average Z-score
values. We observe the same tendency for 3-words
(not shown). For 4-words, only 225 words occur more than
40 times in Id50. For 96% of them, the occurrence
frequencies differ significantly from the expected ones,
using a chi-square test. Only 4% of these 4-words do not
show any statistical preference for a particular letter.
Looking at whether some 4-words are specific of one particular
letter (Table 1), we observe that 12, 53, 113, and 181 of the
225 4-words occur with a probability of more than 60%,
50%, 40%, and 30%, respectively, in one particular letter.
Some of these 4-words can be considered as "specific"
signatures of some letters: respectively, 3, 5, 8,
and 13 letters have at least one 4-word associated with a
probability of more than 60%, 50%, 40%, and 30%. Some 4-words
have a clear preference for two letters. The word
"LKPG" is associated with [Y] (21/46, ~46% of occurrences)
and [K] (14/46, ~30% of occurrences), and the
word "KDGK" is associated with [D] (27/49, ~55% of
occurrences) and [F] (17/49, ~35% of occurrences). Such 4-words
can be associated with structurally unrelated letters.
For instance, the word "LTAA" is dispatched mostly over [A,
Y] (around 26% of occurrences each), which are not geometrically
close. Finally, we also observe that words that differ
by only one amino acid can have marked preferences for
letters presenting no related conformations. For instance, the
words "ALLA", "ALRA" and "ALRE" mostly occur in
[A] (more than 40%), while the word "ALVE" is mostly
found in [Y].
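The word-specificity screen of Table 1 amounts to counting, for each amino acid 4-word, how its occurrences distribute over the letters of the encoded proteins, and flagging words whose occurrences concentrate in one letter above a threshold T. A counting sketch; the data layout and names are ours:

```python
from collections import Counter, defaultdict

def specific_words(pairs, threshold=0.6, min_count=40):
    """`pairs` is an iterable of (four_letter_word, structural_letter)
    tuples taken from the encoded proteins. Returns {word: letter} for
    words seen at least `min_count` times whose occurrences fall in a
    single letter with probability >= `threshold`."""
    per_word = defaultdict(Counter)
    for word, letter in pairs:
        per_word[word][letter] += 1
    out = {}
    for word, letters in per_word.items():
        total = sum(letters.values())
        if total < min_count:
            continue  # too rare to assess, as for the 40-occurrence cut-off
        letter, count = letters.most_common(1)[0]
        if count / total >= threshold:
            out[word] = letter
    return out
```

Lowering `threshold` from 0.6 to 0.3 reproduces the progression of Table 1, from the most stringent signatures to the loosest ones.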
3. Discussion, perspectives

In the present paper, we have started analysing whether an
alphabet such as HMM-SA is suitable for deciphering some
local sequence-structure relationship. Our results clearly
show that sequence information specific of each letter of the
structural alphabet exists and that we have not over-split
sequence information.

In terms of protein structure prediction, simple local
libraries, such as the I-sites [27,28], introduced as local
constraints on the conformations available to a protein, have shown
their usefulness for improving ab initio prediction [29,30].
Our results tend to prove that HMM-SA is well
suited for prediction. First, we observe that the way the
letters are identified, for embedded alphabets with an
increasing number of letters, is correlated with an increased
specificity of the associated amino acid sequence. Interestingly,
the large number of letters of HMM-SA (27) is
classically mentioned as an obstacle to obtaining specific
sequence information [3]. Here, we observe that all of the
27 letters present some sequence signature. Finally, we also
observe that the relationship between a letter and the amino acid
sequence can be informative at the level of words of amino
acids. Such results are not surprising if one considers that
some sequence motifs of length up to 9 have been shown to be
specific of some particular local structures [10]. However,
some novelty lies in the fact that some specificity
occurs in a general manner for almost all letters. This opens
the door to building a prediction strategy based on more
accurate information than independently considering the
amino acids occurring at consecutive positions. A
classical way to lower the impact of this limitation is to
sum the information over a window encompassing the
position at which the prediction is performed (see, for
example, [31]). With HMM-SA, it seems possible to
directly use at least doublets or triplets of amino acids
(words of size 2 or 3) to perform letter prediction; presently,
the size of the protein data set limits the relevance
of the information to words of size 3, even if we observe a
clear dependence between letters and some words of size 4.
More recently, it has been suggested [32] that learning the
sequence-structure relationship while taking into account the
way local conformations are connected may lead to
prediction improvement. In the present study, we have not
analysed how much the Markovian process associated with
HMM-SA could help take into account the specificity of
the amino acid sequence information. For amino acid words
having preferences for several letters associated with clearly
different local shapes, the Markovian transitions could help
choose among the different preferred letters. Work is in
progress in this direction.

References

[1] R.H. Waterston, E.S. Lander, J.E. Sulston, On the sequencing of
the human genome, Proc. Natl. Acad. Sci. U. S. A. 99 (2002)
3712-3716.
[2] T.A. Jones, S. Thirup, Using known substructures in protein model
building and crystallography, EMBO J. 5 (1986) 819-822.
[3] A.G. de Brevern, A.C. Camproux, S. Hazout, C. Etchebest, P. Tuffery,
Beyond the secondary structures: the structural alphabets, Recent Adv.
In Prot. Eng. 1 (2001) 319-331.
[4] R. Unger, D. Harel, S. Wherland, J.L. Sussman, A 3D building blocks
approach to analyzing and predicting structure of proteins, Proteins 5
(1989) 355-373.
[5] M.J. Rooman, J. Rodriguez, S.J. Wodak, Automatic definition of
recurrent local structure motifs in proteins, J. Mol. Biol. 213 (1990)
327-336.
[6] S.J. Prestrelski, A.L. Williams, M.N. Liebman, Generation of a
substructure library for the description and classification of protein
secondary structure: I. Overview of the methods and results, Proteins
14 (1992) 430-439.
[7] M. Levitt, Accurate modeling of protein conformation by automatic
segment matching, J. Mol. Biol. 226 (1992) 507-533.
[8] J. Schuchhardt, G. Schneider, J. Reichelt, D. Schomburg, P. Wrede,
Local structural motifs of protein backbones are classified by self-organizing
neural networks, Protein Eng. 9 (1996) 833-842.
[9] J.S. Fetrow, M.J. Palumbo, G. Berg, Patterns, structures, and amino
acid frequencies in structural building blocks, a protein secondary
structure classification scheme, Proteins 27 (1997) 249-271.
[10] C. Bystroff, D. Baker, Prediction of local structure in proteins using
a library of sequence-structure motifs, J. Mol. Biol. 281 (1998)
565-577.
[11] A.G. de Brevern, C. Etchebest, S. Hazout, Bayesian probabilistic
approach for predicting backbone structures in terms of protein blocks,
Proteins 41 (2000) 271 – 287.
[12] C. Micheletti, F. Seno, A. Maritan, Recurrent oligomers in proteins: an
optimal scheme reconciling accurate and concise backbone representations in automated folding and design studies, Proteins 40 (2000)
662 – 674.
[13] R. Kolodny, P. Koehl, L. Guibas, M. Levitt, Small libraries of protein
fragments model native protein structures accurately, J. Mol. Biol. 323
(2002) 297 – 307.
[14] C.G. Hunter, S. Subramaniam, Protein fragment clustering and
canonical local shapes, Proteins 50 (2003) 580 – 588.
[15] A.C. Camproux, P. Tuffery, J.P. Chevrolat, J.F. Boisvieux, S.
Hazout, Hidden Markov model approach for identifying the
modular framework of the protein backbone, Protein Eng. 12
(1999) 1063 – 1073.
[16] A.C. Camproux, R. Gautier, P. Tuffery, A hidden Markov model
derived structural alphabet for proteins, J. Mol. Biol. 339 (2004)
591 – 605.
[17] F. Guyon, A.C. Camproux, J. Hochez, P. Tuffery, SA-search: a web
tool for protein structure mining based on structural alphabet, Nucleic Acids Res. 32
(2004) W545 – W548.
[18] L.R. Rabiner, A tutorial on hidden Markov models and selected
applications in speech recognition, Proc. IEEE 77 (1989) 257 – 285.
[19] L.E. Baum, T. Petrie, G. Soules, N. Weiss, A maximization technique
occurring in the statistical analysis of probabilistic functions of
Markov chains, Ann. Math. Stat. 41 (1970) 164 – 171.
[20] G. Schwarz, Estimating the dimension of a model, Ann. Stat. 6 (1978)
461 – 464.
[21] S. Kullback, R.A. Leibler, On information and sufficiency, Ann. Math.
Stat. 22 (1951) 79 – 86.
[22] P. Argos, J. Palau, Amino acid distribution in protein secondary
structures, Int. J. Pept. Protein Res. 19 (1982) 380 – 393.
[23] J.S. Richardson, D.C. Richardson, Amino acid preferences for specific
locations at the ends of α-helices, Science 240 (1988) 1648 – 1652.
[24] L. Presta, Protein structure analysis and development of databases,
Protein Eng. 2 (1989) 395 – 397.
[25] R. Aurora, R. Srinivasan, G.D. Rose, Rules for α-helix termination by glycine, Science 264 (1994) 1126 – 1130.
[26] J.W. Seale, R. Srinivasan, G.D. Rose, Sequence determinants of the capping box, a stabilizing motif at the N-termini of α-helices, Protein Sci. 3 (1994) 1741 – 1745.
[27] C. Bystroff, V. Thorsson, D. Baker, HMMSTR: a hidden Markov
model for local sequence – structure correlations in proteins, J. Mol.
Biol. 301 (2000) 173 – 190.
[28] C. Bystroff, Y. Shao, Fully automated ab initio protein structure
prediction using I-SITES, HMMSTR and ROSETTA, Bioinformatics
18 (2002) S54 – S61.
[29] K. Simons, R. Bonneau, I. Ruczinski, D. Baker, Ab initio protein structure prediction of CASP III targets using ROSETTA, Proteins Suppl. 3 (1999) 171 – 176.
[30] R. Bonneau, J. Tsai, I. Ruczinski, D. Chivian, C. Rohl, C.E.M. Strauss, D. Baker, Rosetta in CASP4: progress in ab initio protein structure prediction, Proteins Suppl. 5 (2001) 119 – 126.
[31] R. Karchin, M. Cline, Y. Mandel-Gutfreund, K. Karplus, Hidden
Markov models that use predicted local structure for fold
recognition: alphabets on backbone geometry, Proteins 51 (2003)
504 – 514.
[32] A.G. de Brevern, H. Valadie, S. Hazout, C. Etchebest, Extension of a
local backbone description using a structural alphabet: a new
approach to the sequence – structure relationship, Protein Sci. 11
(2002) 2871 – 2886.
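As an illustration of the transition-based disambiguation evoked in the discussion, the sketch below decodes a most probable series of structural letters from per-word letter propensities with the Viterbi algorithm. It is not the authors' implementation: the three-letter alphabet, the propensities P(letter | word) and the transition matrix are invented for the example, and a uniform initial distribution is assumed.

```python
# Hedged sketch: choosing among the structural letters preferred by
# successive amino-acid words, using the HMM transition matrix to
# resolve local ambiguity.  All numbers below are invented.
from math import log

letters = ["a", "b", "c"]              # toy structural alphabet

# P(next letter | current letter), rows and columns indexed like `letters`.
trans = [[0.8, 0.1, 0.1],
         [0.2, 0.6, 0.2],
         [0.3, 0.3, 0.4]]

# Propensities P(letter | amino-acid word); taken alone, the ambiguous
# word w2 slightly prefers letter c over letter b.
emis = {"w1": [0.9, 0.05, 0.05],
        "w2": [0.05, 0.45, 0.5],
        "w3": [0.1, 0.8, 0.1]}

def viterbi(words):
    """Most probable letter series for a series of amino-acid words
    (uniform initial distribution over letters, hence omitted)."""
    n = len(letters)
    v = [log(p) for p in emis[words[0]]]          # log-scores per letter
    back = []                                     # backpointers
    for w in words[1:]:
        scores = [[v[i] + log(trans[i][j]) + log(emis[w][j])
                   for i in range(n)] for j in range(n)]
        back.append([max(range(n), key=lambda i: scores[j][i])
                     for j in range(n)])
        v = [max(scores[j]) for j in range(n)]
    path = [max(range(n), key=lambda j: v[j])]
    for ptr in reversed(back):                    # backtrack
        path.append(ptr[path[-1]])
    path.reverse()
    return [letters[i] for i in path]

# Letter by letter, w2 would be assigned c; the transitions flip it to b,
# which chains better with the letters preferred by w1 and w3.
print(viterbi(["w1", "w2", "w3"]))  # ['a', 'b', 'b']
```

The example makes the point of the discussion concrete: the word-level propensities leave w2 ambiguous, and it is the learned transition logic of the alphabet that selects the letter compatible with the flanking local shapes.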