Relations of the numbers of protein sequences, families and folds

Protein Engineering vol.10 no.7 pp.757–761, 1997
Chun-Ting Zhang
Department of Physics, Tianjin University, Tianjin 300072, China
The relations among the numbers of protein sequences,
families and folds have been studied theoretically. It is
found that the number of families is related to the natural
logarithm of the number of sequences. The logarithmic
relation should not be changed regardless of what value of
the homology threshold is applied in the protein sequence
comparison routines. To study the relation between the
numbers of families and folds, the degenerate degree of a
fold has been introduced. The degenerate degree of a fold
is the number of protein families which adopt the same
fold. The distribution of the degenerate degrees of folds
has been found to be very likely exponential. Based on the
distribution, the average degenerate degree d is calculated.
The number of folds is simply equal to that of families
divided by the average degenerate degree of folds. It is
shown that d is an increasing function of time. The current
value of d is about 2. It will continue to increase and reach
the value of at least 3.3 in some years. By using the above
result, the numbers of protein folds for four species have
been estimated. In particular, the number of folds for
human proteins is estimated to be ≤5200.
Keywords: degeneracy/degenerate degree/distribution of
degenerate degrees/numerical relations/protein families/protein
folds/protein sequences
Introduction
Protein sequence pairs with more than 30% residue identity
are clustered together into superfamilies, or 30SEQ families
(Orengo et al., 1994). For convenience, the 30SEQ family is
also called family hereafter in this paper. It is well established
that in most cases each family adopts a unique fold structure,
while in the other cases different families may adopt the same
fold structure (Sander and Schneider, 1991; Holm et al., 1992;
Pascarella and Argos, 1992; Flores et al., 1993; Hilbert et al.,
1993; Holm and Sander, 1993; Orengo et al., 1993; Yee and
Dill, 1993; Lessel and Schomburg, 1994; Rufino and Blundell,
1994). This implies that the number of protein folds should
be less than the number of the families, which should be less
than the number of proteins. Therefore, it is reasonable to ask
how many folds there are in nature. In other words, there
exists an upper limit for the number of the unique folds. The
question was probably first raised by Chothia (1992), who
estimated the figure to be about 1000. Since then, several
research groups have tackled this problem again. However,
different results were reported. Blundell and Johnson (1993)
estimated the number to be less than 1000, in agreement with
the estimate of Chothia (1992), but Alexandrov and Go (1994)
and Orengo et al. (1994) reported much larger figures than
previously estimated, 6700 and 7920, respectively. Recently,
Wang (1996) gave a very low estimate of probably only 400.
Obviously, this is an ongoing controversial issue. The fact that
there exists a limited number of protein folds is in agreement
with the principles of stereochemistry. Ptitsyn and Finkelstein
have pointed out that due to the stereochemical constraints,
the possible number of globular protein folds is limited (Ptitsyn
and Finkelstein, 1980; Finkelstein and Ptitsyn, 1987). This
conclusion is most welcome to researchers in the area of
protein structure prediction. The prediction of protein tertiary
structure from amino acid sequences based on the principle of
free energy minimization has not yet been successful. In this
case, the knowledge-based approach to predicting the tertiary
structures of proteins, such as the threading and profile methods
(Bowie et al., 1991; Jones et al., 1992), seems to be one of
the most promising approaches. The fact that there exists a
limited number of protein folds provides a solid basis for such
an approach. Therefore, further discussion on the above issue
is necessary and meaningful. In this paper, the relations among
the numbers of protein sequences, families and folds are
studied theoretically. The numbers of folds for proteins in four
species are estimated based on the theory established.
Result of analysis
Logarithmic relation
Three quantities are concerned in our case, i.e. the numbers
of protein sequences, families and folds, denoted by s, fa and
fo, respectively. Note that all three quantities are functions of
time. For example, s(t) indicates the cumulative number of
protein sequences found through the year t. fa(t) and fo(t)
indicate the cumulative numbers of protein families and folds
found through the year t, respectively. It is first important to
study the relationships among these quantities. Suppose that
there is a protein set consisting of s protein sequences. Let s
have an increment ∆s. Accordingly, fa has also an increment
∆fa. Obviously, for given s we should have
$$\Delta f_a \sim \Delta s \tag{1}$$
Now, for given ∆s, suppose that
$$\Delta f_a \sim \frac{1}{s_0 + s} \tag{2}$$
where s0 is a constant to be determined later. Equation 2
should be explained. Since the 30SEQ families are based on
the identity of residues, for given ∆s, the larger the quantity
s, the lower is the probability of finding the new family
members, i.e. the smaller the quantity ∆fa. Consequently,
we have
$$\Delta f_a = k\,\frac{\Delta s}{s_0 + s} \tag{3}$$
where k is a proportionality constant. Integrating both sides
from t0 to t, we find
Fig. 1. Number of protein 30SEQ families fa versus the number of protein
sequences s. Data were obtained from Orengo et al. (1994). The solid curve
was drawn using the equation fa = k ln(1 + s/s0) with k = 20 795 and
s0 = 58 412. See the text for more details.
$$f_a(t) = f_a(t_0) + k \ln\left(\frac{s_0 + s(t)}{s_0 + s(t_0)}\right) \tag{4}$$
where fa(t) and fa(t0) are the cumulative numbers of protein
families found through the years t and t0, respectively, s(t) and
s(t0) are the corresponding cumulative numbers of protein
sequences, and t > t0. Choosing an appropriate t0 such that
fa(t0) = s(t0) = 0, we find
$$f_a(t) = k \ln\left(1 + \frac{s(t)}{s_0}\right) \tag{5}$$
The data for fa(t) and s(t) from the year 1960 through the
year 1992 were given by Orengo et al. (1994). Figure 1 shows
the relation between fa and s. The data points are fitted to
Equation 5 by using the nonlinear least-squares method. The
constants k and s0 are found to be
$$k = 20\,795, \quad s_0 = 58\,412 \tag{6}$$
The solid curve in Figure 1 is drawn according to Equations
5 and 6. We can see that the logarithmic relation shows good
validity, indicating the correctness of Equation 5. Note that
the last data point in Figure 1, i.e. s = 28 000 and fa = 7700,
was also given by Orengo et al. (1994).
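As a quick numerical check of Equation 5, the following minimal sketch evaluates the fitted curve with the constants of Equation 6. The function name `families_from_sequences` is an illustrative choice and is not part of the original analysis.

```python
import math

K = 20795.0    # k from Equation 6
S0 = 58412.0   # s0 from Equation 6


def families_from_sequences(s, k=K, s0=S0):
    """Equation 5: fa = k * ln(1 + s / s0)."""
    return k * math.log(1.0 + s / s0)


# Last data point of Figure 1: s = 28 000, for which fa = 7700 was observed.
predicted = families_from_sequences(28000)
print(f"predicted fa at s = 28000: {predicted:.0f}")
```

With these constants the curve gives a value of roughly 8100 at s = 28 000, which may be compared with the observed 7700.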
It was correctly pointed out by Eisenhaber et al. (1995) that
the number of protein families critically depends on the
value of homology threshold applied in the protein sequence
comparison routines. However, the logarithmic form of Equation 5 should not be changed regardless of what value of the
homology threshold is applied. Actually, only the parameters
k and s0 in Equation 5 depend on the value of homology
threshold applied.
Fig. 2. Distribution of the degenerate degrees of folds. The integers on the
abscissa indicate the degenerate degrees of folds. The height of the bar
along the ordinate indicates the number of folds which have the same
degenerate degree as shown under the bar.
Degeneracy, degenerate degree and the distribution of
degenerate degrees
As pointed out previously, different families may adopt the
same fold structure. In other words, one fold may be associated
with more than one family. This phenomenon of protein
structure is called degeneracy. The number of families associated with one fold structure is called the degenerate degree of
the fold concerned. Generally, the degenerate degree is a
positive integer > 1. For convenience, the degenerate degree
may also be equal to 1, in which case there is no degeneracy
at all.
Recently, a Structural Classification of Proteins Database
(SCOP) has been established by Murzin et al. (1995). Based
on the sequence alignment, followed by clustering together of
structures with more than 30% residue identity, 559 protein
families and 286 folds were found in August 1995 (including
pre-release). For more details, see also the paper by Wang
(1996). The distribution of degenerate degrees for different
folds is shown in Figure 2. The 286 folds are divided into 13
degenerate classes. The degenerate degree of the first class
consisting of 197 folds is 1, i.e. no degeneracy. The degenerate
degree of the second class consisting of 38 folds is 2 and so
forth. Based on this result, the density matrix D for the
distribution of the degenerate degrees is defined as
$$D = \begin{pmatrix} d_1 & d_2 & \cdots & d_n \\ p_1 & p_2 & \cdots & p_n \end{pmatrix} \tag{7}$$
where d1, d2, ..., dn are the degenerate degrees and p1, p2, ...,
pn are the frequencies of occurrence for the first, second, ...
and nth degenerate class, respectively. Obviously,
$$\sum_{i=1}^{n} p_i = 1 \tag{8}$$
It has been reported that n = 13 through the year 1995 (Murzin
et al., 1995; Wang, 1996), so
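$$D = \begin{pmatrix}
1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 14 & 15 & 19 & 20 \\
\frac{197}{286} & \frac{38}{286} & \frac{20}{286} & \frac{12}{286} & \frac{5}{286} & \frac{2}{286} & \frac{1}{286} & \frac{2}{286} & \frac{4}{286} & \frac{2}{286} & \frac{1}{286} & \frac{1}{286} & \frac{1}{286}
\end{pmatrix} \tag{9}$$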
The average degenerate degree d is calculated by
$$d = \sum_{i=1}^{n} d_i p_i \tag{10}$$
which might be an important parameter for the study of protein
structure. Substituting Equation 9 into 10, we find
$$d = 1.955 \tag{11}$$
The variance of the degenerate degrees based on the density
matrix in Equation 7 is calculated as usual:
$$(\Delta d)^2 = \sum_{i=1}^{n} p_i (d_i - d)^2 \tag{12}$$
where (∆d)^2 represents the variance. Using the data in Equations
9 and 11, we find
$$\Delta d = 2.441 \tag{13}$$
where ∆d is the standard deviation. d and ∆d are two main
quantities describing the statistical characteristics of the
distribution in Equation 7.
Note that d is not a constant; generally, d is a function of
time. Based on the definition of d (Equation 10), the relation
between fa(t) and fo(t) may be simply written as
$$\frac{f_a(t)}{f_o(t)} = d(t) \tag{14}$$
where fa(t) and fo(t) are the cumulative numbers of protein
families and folds found through the year t and d(t) is the
average degenerate degree of the folds associated with the
year t.
Orengo et al. (1994) introduced a very useful quantity a(t),
defined as
$$a(t) = \frac{\Delta f_o(t)}{\Delta f_a(t)} \tag{15}$$
where
$$\Delta f_o(t) = f_o(t) - f_o(t-1), \quad \Delta f_a(t) = f_a(t) - f_a(t-1) \tag{16}$$
fo(t – 1) and fa(t – 1) are the numbers of folds and families,
respectively, found through the year t – 1. According to their
explanation (Orengo et al., 1994), the quantity a(t) is the
percentage of newly determined non-homologous proteins
which adopt novel folds in the year t. Equation 15 may be
rewritten as
$$a(t) = \frac{\Delta f_o(t)/\Delta t}{\Delta f_a(t)/\Delta t} \tag{17}$$
where the numerator (denominator) indicates the increasing
rate of the number of folds (families) and ∆t is the time
increment. That is, a(t) is the ratio of the two rates. In other
words, a(t) is the increasing rate ratio of fold/family. The fact
that (Orengo et al., 1994)
$$0 \le a(t) \le 1 \tag{18}$$
indicates that the increasing rate of families is faster than that
of folds. This result is in agreement with the result that
d(t) > 1.
Simple mathematical inference shows that d(t) will continue to
increase, at least in the coming years. From Equation 14, we have
$$\Delta f_a(t) = d(t)\,\Delta f_o(t) + f_o(t)\,\Delta d(t) \tag{19}$$
Using Equation 15, we find
$$\Delta d(t) = \frac{\Delta f_a(t)}{f_o(t)}\,\bigl(1 - d(t)\,a(t)\bigr) \tag{20}$$
Since ∆fa(t) > 0, the condition that ∆d(t) > 0 is
$$d(t) < \frac{1}{a(t)} \tag{21}$$
As we know from Equation 11, d(1995) = 1.955. According
to Orengo et al. (1994), a(1995) ≈ 0.3. Hence Equation 21 is
valid. The value of d(t) will continue to increase in the future
until the following condition is satisfied:
$$d(t^*) = \frac{1}{a(t^*)} \tag{22}$$
where ∆d(t*) = 0, i.e. d(t*) begins to reach its maximum
value dmax in the year t*. We think that dmax is an important
quantity for the study of protein structure. At present, we can
say that dmax ≥ 3.3 (i.e. 1/0.3).
Estimates of the numbers of protein families and folds for
four species
As pointed out in the Introduction, the estimate of the number
of possible protein folds, i.e. fo, is an important yet controversial
issue (Chothia, 1992; Blundell and Johnson, 1993; Alexandrov
and Go, 1994; Orengo et al., 1994; Wang, 1996). The theory
established above provides an alternative approach for
estimating this figure. As is well known, there are (0.5–1.0) × 10^5
protein sequences for the human species (Chothia, 1992).
Taking the middle value, we obtain s = 0.75 × 10^5 for humans.
Substituting this figure into Equations 5 and 6, we find
$$f_a = 17\,175 \ \text{for humans} \tag{23}$$
that is, based on the 30SEQ families (Orengo et al., 1994),
the number of protein families for humans is about 17 175.
To estimate the number of folds for humans, we find by using
Equation 14
$$f_o = f_a/d \le 17\,175/3.3 = 5200 \ \text{for humans} \tag{24}$$
that is, the number of folds of human proteins is ≤5200.
The numbers of families and the upper limit numbers of
folds for proteins in the species of Escherichia coli, yeast,
Caenorhabditis elegans and humans calculated by this method
are listed in Table I. An interesting question may be raised:
are the folds for one species relevant to those of another
species? An equivalent form of this question is: are there some
overlaps among the sets of folds for different species? The
answer seems to be ‘yes’. The principle that governs the
folding topologies of proteins is probably independent of
the species.
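The estimates for the four species can be reproduced with a short script. This is a minimal sketch, assuming the gene numbers quoted from Chothia (1992), the constants of Equation 6 and the lower bound dmax ≥ 3.3; the names `GENES` and `estimate` are illustrative only.

```python
import math

K, S0 = 20795.0, 58412.0   # k and s0 from Equation 6
D_MAX_LOWER = 3.3          # lower bound on dmax, i.e. 1/0.3

# Number of genes per species (Chothia, 1992), used as s in Equation 5.
GENES = {"E.coli": 4000, "Yeast": 7000, "C.elegans": 15000, "Human": 75000}


def estimate(s):
    """Return (families, upper limit of folds) for s sequences."""
    fa = K * math.log(1.0 + s / S0)   # Equation 5
    fo = fa / D_MAX_LOWER             # fo = fa / d with d = 3.3
    return fa, fo


for species, s in GENES.items():
    fa, fo = estimate(s)
    print(f"{species:10s} s={s:6d}  fa={fa:8.0f}  fo<={fo:7.0f}")
```

The printed values are unrounded; the figures quoted in Table I appear to be the same numbers rounded to the nearest 5 or 10.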
Discussion
The degenerate degree may be any positive integer. Looking
at Figure 2 or Equation 9, we find that some
Table I. Estimates of the numbers of protein families and folds for four
species

Species      Number of genes(a)   Number of families   Number of folds(b)
E.coli        4000                 1380                  420
Yeast         7000                 2350                  710
C.elegans    15000                 4750                 1440
Human        75000                17175                 5200

(a) Data obtained from Chothia (1992).
(b) The upper limit values.
integers, e.g. 10, 11, 12, 13, 16, 17 and 18, between 1 and 20
are absent. It seems that there is no reason for the absence of
these numbers. This is probably due to the bias of experimental
work. Furthermore, the maximum degenerate degree is unlikely
to be only 20. The fold with the degenerate degree of 20 is
the so-called β/α (TIM)-barrel (Murzin et al., 1995; Wang,
1996). If one more family is found in the future which
adopts the same fold of β/α (TIM)-barrel, then the maximum
degenerate degree is 21 in this case. Generally, the degenerate
degrees have the values of positive integers from 1 through
dmax successively. One of the most curious questions in the
protein-folding studies is: dmax = ? Recently, based on a simple
lattice model of protein folding, Li et al. (1996) introduced a
new concept called the designability of protein structures. The
concept is quantified by counting the number of sequences
that uniquely fold into the same particular structure. The larger
the number, the more designable is the structure. Furthermore,
it was shown that the highly designable structures are more
stable against sequence mutations and thermal fluctuations and
also possess more secondary structure elements and tertiary
symmetries (Li et al., 1996). Although the concept was derived
from a simple model only, it has some implication for the real
protein folding. The degenerate degree of a fold proposed here
is actually the measurement of the designability of this fold.
The larger the degenerate degree, the more designable is the
fold. The β/α (TIM)-barrel is the most highly designable fold
we know so far. Therefore, the question dmax = ? is equivalent
to asking the maximum designability degree of folds.
The average degenerate degree d defined in Equation 10 is
an important parameter describing the distribution represented
by the density matrix Equation 7. A reliable conclusion of this
paper is that d will increase continuously in future years.
Furthermore, we conclude that in some years d should be
greater than at least 3.3. The present value of d is 1.955, as
shown in Equation 11. Holm and Sander (1996) estimated that
there will be 1600 families and 400 folds found by the end of
1997. If their estimate is correct, d will reach the value of 4.0
by the end of 1997. It is reasonable to predict that d will
continue to increase even after the year 1997. This prediction
may be confirmed by the following consideration. Using
Equation 17, we have
$$a(1997) = \frac{(400 - 286)/2}{(1600 - 559)/2} = 0.11 \tag{25}$$
Since d(1997) = 4.0 < 1/a(1997) = 9.1, the above prediction
follows immediately. Based on these figures, we may estimate
roughly the lower limit of t*. Equation 22 is far from being
satisfied in the year 1997. Therefore, the equation will be
satisfied at the earliest in the year 1998, i.e. t* ≥ 1998. Even in
1998, it seems that Equation 22 will not be satisfied. It is
thought that t* > 2000 is very likely.
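The arithmetic behind Equation 25 and the test of Equation 21 can be sketched as follows. The 1997 figures are the projections of Holm and Sander (1996) quoted above, and the function name `rate_ratio` is an illustrative choice.

```python
def rate_ratio(fo_new, fo_old, fa_new, fa_old, years):
    """Equation 17: a = (delta fo / delta t) / (delta fa / delta t)."""
    fold_rate = (fo_new - fo_old) / years
    family_rate = (fa_new - fa_old) / years
    return fold_rate / family_rate


# 1995 (SCOP): 559 families and 286 folds.
# 1997 (projected): 1600 families and 400 folds.
a_1997 = rate_ratio(400, 286, 1600, 559, years=2)
d_1997 = 1600 / 400                      # Equation 14
still_increasing = d_1997 < 1 / a_1997   # condition of Equation 21

print(f"a(1997) = {a_1997:.2f}, d(1997) = {d_1997:.1f}, "
      f"1/a(1997) = {1 / a_1997:.1f}, increasing: {still_increasing}")
```

Since the time increment cancels in Equation 17, the result does not depend on the choice of two years as the interval.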
A fold with degenerate degree ≥3 was defined as a superfold
(Orengo et al., 1994). The introduction of the concept of
superfold is important. We do not know why Orengo et al.
(1994) chose the integer 3 as the threshold value for defining
the superfold. We consider a suitable threshold value to define
the superfold is 2. Probably the folds with degenerate degree
2 had not been observed when the paper by Orengo et al.
(1994) was written. By our definition, except for the singlets,
in which there is no degeneracy at all, all the degenerate folds
are superfolds. Based on this definition, the percentage of the
superfolds over the folds is 31.1% by using the current density
matrix Equation 9. In other words, about one third of folds
are superfolds. By the concept of designability, the superfolds
are those folds which are probably more designable. The
tertiary structures for nine superfolds were shown in the paper
by Orengo et al. (1994), in which the TIM barrel and the
up–down 4 α-helical bundle, etc., were included. Interestingly,
there are really more secondary structure elements (helix and
sheet) and tertiary symmetries in these superfolds. If the
concept of designability is correct, it is expected that these
superfolds should possess more thermodynamic stability
against thermal fluctuations and other perturbations. Furthermore, the superfolds should fold more readily kinetically. All
of these could be examined experimentally.
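The dependence of the superfold fraction on the chosen threshold can be illustrated directly from the fold counts of Equation 9. This is a minimal sketch; the dictionary `FOLD_COUNTS` and the function `superfold_fraction` are illustrative names.

```python
# Degenerate degree -> number of folds with that degree (Equation 9, SCOP 1995).
FOLD_COUNTS = {1: 197, 2: 38, 3: 20, 4: 12, 5: 5, 6: 2, 7: 1,
               8: 2, 9: 4, 14: 2, 15: 1, 19: 1, 20: 1}


def superfold_fraction(counts, threshold):
    """Fraction of folds whose degenerate degree is >= threshold."""
    total = sum(counts.values())
    hits = sum(n for degree, n in counts.items() if degree >= threshold)
    return hits / total


for threshold in (2, 3):
    frac = superfold_fraction(FOLD_COUNTS, threshold)
    print(f"threshold {threshold}: {100 * frac:.1f}% of folds are superfolds")
```

With the threshold of 2 proposed here the fraction is 31.1%; with the threshold of 3 it falls to about 17.8%.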
The distribution represented by the density matrix Equation
9 is another interesting problem that needs to be discussed
further. To what distribution does Equation 9 correspond? Is
it normal or exponential? It is unlikely that Equation 9 is a
normal distribution; rather, it could be characterized well as
an exponential decay. We will discuss the implication of
the exponential distribution below by comparing the two
distributions. If the distribution is normal, the density function
g(d) is
$$g(d) = \begin{cases} 2 \times \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(d-1)^2}{2\sigma^2}}, & d \ge 1 \\ 0, & d < 1 \end{cases} \tag{26}$$
where σ is the standard deviation and d is the degenerate
degree. If the distribution is exponential, the density function
e(d) is
$$e(d) = \begin{cases} \lambda e^{-\lambda(d-1)}, & d \ge 1 \\ 0, & d < 1 \end{cases} \tag{27}$$
where λ = 1/σ. Based on these equations, it is clear that when
d > 1 + 2σ, then g(d) < e(d). Using σ = ∆d = 2.44 (see
Equation 13), we find when d > 6, the probability of occurrence
of an event in the normal distribution Equation 26 is less than
that of the exponential Equation 27. Actually, when d > 3σ
or d > 8, the probability of occurrence of an event in the
normal distribution Equation 26 is very small. In contrast,
when d > 8, the probability of the exponential distribution
Equation 27 is still considerably large compared with the
normal Equation 26. When d = 20, the maximum degenerate
degree observed so far, the probability associated with this
event in the normal distribution Equation 26 is only 1.6 ×
10^-10 of that of the exponential Equation 27. In other words,
it is almost impossible for the event of d = 20 to take place
if the distribution is normal. It is the exponential rather than
the normal distribution that makes the events of d > 8 take
place with a higher probability. Hence our overall feeling is
that the distribution Equation 9 is very likely exponential.
However, it is still too early to draw a definite conclusion
before more data are available.
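The comparison of the two tails can be checked numerically. The sketch below assumes the half-normal density g(d) = 2/(√(2π)σ) exp(–(d – 1)^2/(2σ^2)) and the exponential density e(d) = λ exp(–λ(d – 1)) with λ = 1/σ, both defined for d ≥ 1, and σ = 2.44. The function names are illustrative, and the exact ratio at d = 20 depends on the normalization prefactor that is assumed.

```python
import math

SIGMA = 2.44  # standard deviation of the degenerate degrees (Equation 13)


def g(d, sigma=SIGMA):
    """Half-normal density on d >= 1 (assumed form of Equation 26)."""
    if d < 1:
        return 0.0
    norm = 2.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return norm * math.exp(-((d - 1) ** 2) / (2.0 * sigma ** 2))


def e(d, sigma=SIGMA):
    """Exponential density on d >= 1 with lambda = 1/sigma (Equation 27)."""
    if d < 1:
        return 0.0
    lam = 1.0 / sigma
    return lam * math.exp(-lam * (d - 1))


for d in (2, 6, 8, 20):
    print(f"d = {d:2d}: g/e = {g(d) / e(d):.3g}")
```

Under these assumptions the ratio g/e drops below 1 by d = 6 and is of the order of 10^-10 at d = 20, in line with the argument above.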
The distribution of the degenerate degrees may depend on
the database used. To compare with that based on SCOP, the
collections of related folds in the Sali–Overington database
(Sali and Overington, 1994) are analyzed here. The 105
alignments collected in this database are viewed as 105 folds.
Although some folds in the Sali–Overington database are
identical with those in SCOP, the former is by no means a
subset of the latter. There are 162 30SEQ families in the
Sali–Overington database, which are associated with the 105
folds. We find seven degenerate classes, i.e. d = 1, 2, 3, 4, 5,
8 and 16. The corresponding density matrix denoted by D1 is
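$$D_1 = \begin{pmatrix}
1 & 2 & 3 & 4 & 5 & 8 & 16 \\
\frac{82}{105} & \frac{14}{105} & \frac{2}{105} & \frac{3}{105} & \frac{2}{105} & \frac{1}{105} & \frac{1}{105}
\end{pmatrix} \tag{28}$$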
Accordingly, the average degenerate degree d1 = 1.54 and the
standard deviation ∆d1 = 1.76, which may be compared with
d = 1.955 and ∆d = 2.44 in SCOP. Generally, the two
distributions in Equations 9 and 28 are similar.
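The averages and standard deviations quoted for the two databases follow from Equations 10 and 12 applied to the fold counts. This is a minimal sketch; the names `SCOP_COUNTS`, `SALI_COUNTS` and `mean_and_sd` are illustrative.

```python
import math

# Degenerate degree -> number of folds (Equation 9, SCOP, 286 folds).
SCOP_COUNTS = {1: 197, 2: 38, 3: 20, 4: 12, 5: 5, 6: 2, 7: 1,
               8: 2, 9: 4, 14: 2, 15: 1, 19: 1, 20: 1}
# Degenerate degree -> number of folds (Equation 28, Sali-Overington, 105 folds).
SALI_COUNTS = {1: 82, 2: 14, 3: 2, 4: 3, 5: 2, 8: 1, 16: 1}


def mean_and_sd(counts):
    """Average degenerate degree (Eq. 10) and standard deviation (Eq. 12)."""
    n_folds = sum(counts.values())
    mean = sum(d * c for d, c in counts.items()) / n_folds
    variance = sum(c * (d - mean) ** 2 for d, c in counts.items()) / n_folds
    return mean, math.sqrt(variance)


for name, counts in (("SCOP", SCOP_COUNTS), ("Sali-Overington", SALI_COUNTS)):
    mean, sd = mean_and_sd(counts)
    print(f"{name}: d = {mean:.3f}, sd = {sd:.3f}")
```

Note that the sum of d_i times the fold count recovers the number of families in each database (559 and 162), so the mean is simply the number of families divided by the number of folds, as in Equation 14.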
Conclusion
The relations among the numbers of protein sequences, families
and folds have been studied. A logarithmic relation between
the numbers of sequences and families has been found. It is
important to point out that the logarithmic form should not be
changed regardless of what value of the homology threshold
is applied to define the families. On the other hand, the relation
between the numbers of families and folds is much more
complicated than that between the sequences and families.
One of the contributions of this paper is that the concept of
the degenerate degree of a fold has been introduced. Based on
this, the distribution of the degenerate degrees has been studied
and found to be very likely exponential. The formalism
presented in this paper seems to provide a basis to facilitate
the further study of related problems of protein structures.
Data and materials
The data analyzed in this paper were based on SCOP, Release
of August 95 (Murzin et al., 1995), which were obtained via
URL:http://scop.mrc-lmb.cam.ac.uk/scop/. The SCOP, Release
of August 95, analyzed all released PDB entries available at
that time. The data on the numbers of sequences and families
were obtained from Dr Janet Thornton via e-mail, which were
based on the analysis of the SWISS-PROT database, Release
27 (Orengo et al., 1994).
Acknowledgments
The author thanks Dr Z.-X.Wang for some stimulating discussions. He is also
grateful to Dr Janet Thornton for sending the data used in Figure 1. This
study was supported in part by the Pandeng Project of China and grant
19577104 from the China Natural Science Foundation.
References
Alexandrov,N.N. and Go,N. (1994) Protein Sci., 3, 866–875.
Blundell,T.L. and Johnson,M.S. (1993) Protein Sci., 2, 877–883.
Bowie,J.U., Luthy,R. and Eisenberg,D. (1991) Science, 253, 164–170.
Chothia,C. (1992) Nature, 357, 543–544.
Eisenhaber,F., Persson,B. and Argos,P. (1995) CRC Crit. Rev. Biochem. Mol.
Biol., 30, 1–94.
Finkelstein,A.V. and Ptitsyn,O.B. (1987) Prog. Biophys. Mol. Biol., 50,
171–190.
Flores,T.P., Orengo,C.A., Moss,D. and Thornton,J.M. (1993) Protein Sci., 2,
1811–1826.
Hilbert,M., Bohm,G. and Jaenicke,R. (1993) Proteins, 17, 138–151.
Holm,L. and Sander,C. (1993) Nucleic Acids Res., 22, 3600–3609.
Holm,L. and Sander,C. (1996) Science, 273, 595–602.
Holm,L., Ouzounis,C., Sander,C., Tuparev,G. and Vriend,G. (1992) Protein
Sci., 1, 1691–1698.
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) Nature, 358, 86–89.
Lessel,U. and Schomburg,D. (1994) Protein Engng, 7, 1175–1187.
Li,H., Helling,R., Tang,C. and Wingreen,N. (1996) Science, 273, 666–669.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol.,
247, 536–540.
Orengo,C.A., Flores,T.P., Taylor,W.R. and Thornton,J.M. (1993) Protein
Engng, 6, 485–500.
Orengo,C.A., Jones,D.T. and Thornton,J.M. (1994) Nature, 372, 631–634.
Pascarella,S. and Argos,P. (1992) Protein Engng, 5, 121–137.
Ptitsyn,O.B. and Finkelstein,A.V. (1980) Q. Rev. Biophys., 13, 339–386.
Rufino,S.D. and Blundell,T.L. (1994) J. Comput.-Aided Mol. Des., 8, 5–27.
Sali,A. and Overington,J.P. (1994) Protein Sci., 3, 1582–1596.
Sander,C. and Schneider,R. (1991) Proteins, 9, 56–68.
Wang,Z.-X. (1996) Proteins, 26, 186–191.
Yee,D.P. and Dill,K.A. (1993) Protein Sci., 2, 884–899.
Received December 11, 1996; revised March 12, 1997; accepted March
14, 1997