Multi-domain Proteins in the Three Kingdoms of Life

doi:10.1016/j.jmb.2005.02.007
J. Mol. Biol. (2005) 348, 231–243
Multi-domain Proteins in the Three Kingdoms of
Life: Orphan Domains and Other Unassigned Regions
Diana Ekman, Åsa K. Björklund, Johannes Frey-Skött and
Arne Elofsson*
Stockholm Bioinformatics
Center, Stockholm University
SE-106 91 Stockholm, Sweden
Comparative studies of the proteomes from different organisms have
provided valuable information about protein domain distribution in the
kingdoms of life. Earlier studies have been limited by the fact that only
about 50% of the proteomes could be matched to a domain. Here, we have
extended these studies by including less well-defined domain definitions,
Pfam-B and clustered domains, MAS, in addition to Pfam-A and SCOP
domains. It was found that a significant fraction of these domain families
are homologous to Pfam-A or SCOP domains. Further, we show that all
regions that do not match a Pfam-A or SCOP domain contain a significantly
higher fraction of disordered structure. These unstructured regions may
be contained within orphan domains or function as linkers between
structured domains. Using several different definitions we have
re-estimated the number of multi-domain proteins in different organisms
and found that several methods all predict that eukaryotes have
approximately 65% multi-domain proteins, while the prokaryotes consist
of approximately 40% multi-domain proteins. However, these numbers are
strongly dependent on the exact choice of cut-off for domains in unassigned
regions. In conclusion, all eukaryotes have similar fractions of multi-domain proteins and disorder, whereas a high fraction of repeating domains
is seen only in multicellular eukaryotes. This implies a role for
repeats in cell–cell contacts while the other two features are important for
intracellular functions.
© 2005 Published by Elsevier Ltd.
*Corresponding author
Keywords: protein domains; multi-domain protein; comparative genomics;
kingdoms of life; proteome
Introduction
Proteins are modular entities where “domain” is
a general designation for recurrent protein fragments with distinct structure, function and/or
evolutionary history. Protein domains may exist
alone, forming small single-domain proteins, but
are frequently part of larger polypeptide chains,
assembled by successive events of gene fusion.
During evolution multi-domain proteins have
often changed their domain arrangements and
compositions.
D.E. and A.K.B. contributed equally to this article.
Abbreviations used: SCOP, the structural classification
of proteins database; Pfam, the protein family database;
MAS, Mkdom after SCOP assignments; HMM, Hidden
Markov Model.
E-mail address of the corresponding author:
[email protected]
0022-2836/$ - see front matter © 2005 Published by Elsevier Ltd.
Domains are often defined either using a structural definition as “independently folding units” or
with an evolutionary based definition as “independently evolving units”. Although the two definitions are quite different, in many cases the
resulting domain families are equivalent, indicating
that domains are a fundamental unit of both protein
structure and evolution.1 A number of resources,
such as the structurally based databases, SCOP,2
CATH3 and FSSP4 or evolutionary based databases,
SMART,5 ProDom6 and Pfam,7 provide information
about protein domains. These databases are regularly updated and can be used to detect domains.
Today the complete sequences of more than 150
genomes, of which about 20 are eukaryotic, are
available. By comparative analysis between the
genomes, better understanding of the differences
between the three kingdoms of life has been
obtained. In particular, it has been noticed that the
proteomes of higher eukaryotes are significantly
more complex than the prokaryotic proteomes.
Since the first complete genome sequences were
made available, a number of groups have studied
the distribution of domains in different genomes8–10
and a particular interest has been directed toward
differences between prokaryotes and eukaryotes.9
One notable observation is that the fraction
of multi-domain proteins is distinctly larger in
eukaryota, ranging from two thirds to 80%.9
Already in the first studies of protein domains, it
was shown that the distribution of domain families
exhibits power-law behavior, i.e. that a few
domains exist in many copies while most exist in
only a few.11 Such a distribution can be explained by
the use of a simple model based on gene-duplication events.12 A similar distribution is seen for the
number of combination partners, where a few
domain families combine with many different
domains while most domain families have only
one or a few partners.9,13
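As a rough illustration of what such a distribution looks like in practice, the sketch below estimates the exponent of a family-size distribution from the slope of a log-log plot. It is not taken from the cited studies: the function name and the toy data are hypothetical, and ordinary least squares on log-transformed counts is only a crude way of fitting a power law.

```python
import math
from collections import Counter


def loglog_slope(family_sizes):
    """Least-squares slope of log(number of families) against log(family size).

    family_sizes holds one entry per domain family: its number of copies.
    At least two distinct sizes are needed for the slope to be defined.
    """
    histogram = Counter(family_sizes)  # size -> number of families of that size
    xs = [math.log(size) for size in histogram]
    ys = [math.log(count) for count in histogram.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical toy data: 64 families with one copy, 16 with two,
# 4 with four and 1 with eight copies.
toy_sizes = [1] * 64 + [2] * 16 + [4] * 4 + [8]
slope = loglog_slope(toy_sizes)
```

A strongly negative slope corresponds to the pattern described above: many small families and a few very large ones.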
Orphans14 (or ORFans15) are proteins with no (or
very few) homologs, i.e. proteins without any
known domains. As the number of completely
sequenced genomes has increased the number of
orphan genes has also grown, while the fraction
they represent is slowly diminishing. It was recently
reported in a study covering 60 microbial genomes
that about 14% of the genes are orphans.16 There are
two opposing explanations for how the large number
of orphan proteins has arisen. Either new proteins
are created spontaneously14 or the orphan proteins
have evolved too far away from their closest
neighbors to be detected.17 Both explanations create
fundamental questions about protein evolution.
For the first scenario the process behind the
spontaneous creation is certainly not well understood, while for the second scenario it is not clear
what has happened to the intermediate proteins
during the evolutionary process. In addition to
orphan proteins there are segments of proteins that
cannot be assigned to a known domain family, but
they may still contain a domain and are here
designated orphan domains.
Besides the two possible explanations as to why
no known domains are found in orphan proteins,
another reason why a sequence is not matched to a
domain is that the region might actually belong to
one of the neighboring domains but is not aligned
to the HMM. This boils down to the problem of
defining what determines the borders of a domain,
especially when the domain contains non-conserved elements at the C or N-termini. It is also
well known that domains might undergo quite
substantial changes,18 for instance significantly
increasing/decreasing the length of loops or
adding/deleting complete secondary structure
elements, without domain rearrangements. Therefore, short unassigned regions might belong to the
bordering domains but have either been missed in
the assignment procedure or belong to a non-conserved part of the domain family. Several
examples of domains that have been inserted within
other domains also exist19 which may complicate
domain assignments. Another complication is that
regions where no domains can be detected may
have low sequence complexity or disordered
structure that may be under a different evolutionary
pressure.
As the unassigned regions constitute a major part
of the genomes they affect the outcome of comparative genomics studies. For instance the predicted ratio of single versus multi-domain proteins
is strongly dependent on how unassigned regions
are treated.
We try to complement earlier studies by comparing different methods for handling the problem
of unassigned regions and by using two different
domain definitions, Pfam and SCOP. We compare
structural and evolutionary aspects of regions of the
proteomes assigned to well-defined domains, less
well-defined domains and the orphan regions.
Finally we provide a more detailed estimate of the
fraction of single and multi-domain proteins in
the three kingdoms of life.
Results
Protein coverage
We have analyzed the proteomes from 21
completely sequenced organisms, seven eukaryotes, seven bacteria and seven archaea. Domains
were assigned in all species using both the Pfam7
database and the Superfamily database20 for SCOP
domains.2 For an overview of the assignment
procedure, see Figure 1. Out of the more than
170,000 proteins studied, 65% were assigned one or
more SCOP domains and in 70% a Pfam-A domain
could be detected. To extend the Pfam-A assignments we have included the less well characterized
Pfam-B domain families and the parts that were not
assigned to SCOP domains were clustered using the
Mkdom-2.0 domain finding algorithm.21 These
clustered domains are below referred to as
Mkdom After SCOP (MAS) domains. The SCOP+MAS
assignments cover 86% of the proteins, and
Pfam-A+B cover 90%, see Table 1. About 3% of the
proteins were assigned by SCOP+MAS but not by
Pfam, hence in total 93% of the proteins could be
assigned to a domain. Clustering of the unassigned
parts after Pfam-A and Pfam-B increased the total
coverage by less than one percentage point, and
was therefore ignored.
On average, the assignments gave 1.6 SCOP, 1.9
Pfam-A, 2.8 SCOP+MAS and 2.8 Pfam-A+B
domains per protein. The average number of
domains per protein is higher in eukaryotes than
in prokaryotes, which may be a consequence of
eukaryotic proteins being longer (average lengths
460 versus 300 amino acid residues). As can be
expected, longer proteins contain more domains
than shorter, with an average of more than five
domains in proteins longer than 600 residues, while
about half of all proteins shorter than 100 residues
have no domain assignments. Domain assignments
Figure 1. The procedure for domain assignment. (i) To each protein sequence, domains are assigned using Pfam-A or
SCOP. (ii) Remaining unassigned regions are further assigned with HMM and Blast to Pfam-B or with clustering to MAS
(Mkdom after SCOP) domains. (iii) Unassigned regions are divided into orphan domains (OD) if they are longer than the
cut-off and short domain adjacent regions (DARs), if they are shorter (here using cut-off 100 residues). If a region is twice
as long as the cut-off, two OD are assigned. Proteins with no domains assigned are termed orphan proteins, where the
number of domains is counted as number of OD that fit into the sequence. (iv) Finally, the domain assignments are
analysed. The number of domains is calculated for each of the two assignment methods, including orphan domains. The
same calculations are also done without Pfam-B/MAS assignments, only calculating number of Pfam-A/SCOP domains
and orphan domains.
Table 1. Summary of domain assignments
                     Pfam-A                 Pfam-A+B               SCOP                   SCOP+MAS
Kingdom   Proteins   AssP(%) CAssP(%) AvD   AssP(%) CAssP(%) AvD   AssP(%) CAssP(%) AvD   AssP(%) CAssP(%) AvD
Archaea   13202      71      54       1.4   89      82       1.8   63      46       1.2   79      66       1.6
Bacteria  20443      74      57       1.4   89      82       1.8   61      43       1.3   76      61       1.7
Eukarya   142672     69      33       2.0   90      68       3.0   66      30       1.7   88      65       3.0
All       176317     70      38       1.9   90      71       2.8   65      33       1.6   86      65       2.8
Four different methods, Pfam-A, Pfam-A+B, SCOP and SCOP+MAS, displaying the fraction of proteins with a domain assignment
(AssP), the fraction of proteins that are completely covered by assigned domains with no unassigned stretch larger than 100 residues
(CAssP) and the average number of assigned domains per assigned protein (AvD).
are provided, together with predicted secondary
structures available on the web†.
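Step (iii) of the procedure in Figure 1 can be sketched in a few lines of Python. This is only an illustration, not the code used in the study. The function name and the representation of domains as (start, end) residue intervals are assumptions made here. So are the floor-division rule for how many orphan domains fit into a region and the choice to count a protein in which nothing fits as one domain.

```python
def count_domains(protein_length, assigned, cutoff=100):
    """Count assigned domains plus orphan domains (OD) in one protein.

    assigned: list of (start, end) residue intervals, 1-based inclusive.
    Each unassigned stretch contributes length // cutoff orphan domains,
    so a stretch twice as long as the cut-off gives two OD. Shorter
    stretches are treated as domain adjacent regions (DARs) and add nothing.
    """
    n_orphan = 0
    position = 1  # first residue not yet accounted for
    for start, end in sorted(assigned):
        # max(0, ...) guards against overlapping assignments
        n_orphan += max(0, start - position) // cutoff
        position = max(position, end + 1)
    n_orphan += max(0, protein_length - position + 1) // cutoff
    total = len(assigned) + n_orphan
    # Assumption: a protein in which no domain fits still counts as one domain.
    return max(total, 1)
```

For example, a hypothetical 500-residue protein with domains at residues 1-100 and 350-450 has a 249-residue unassigned stretch between them, which with a cut-off of 100 is counted as two orphan domains, giving four domains in total.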
Amino acid coverage
Although a majority of the proteins are predicted
to contain one or more domains, on the residue level a fairly large fraction of the proteomes remains
unassigned. We have divided the proteomes into
different groups; (i) assigned to a Pfam-A or SCOP
domain, (ii) assigned to a Pfam-B or MAS domain,
(iii) unassigned short (<100 residues) domain
adjacent regions (DARs), (iv) long (>100 residues)
unassigned regions (orphan domains) and (v)
proteins without any domains assigned (orphan
proteins). In Figure 2 it can be seen that about half of
the residues are assigned to a Pfam-A or SCOP
domain and that Pfam-A covers a larger fraction
than SCOP in prokaryotic, but not in eukaryotic
genomes. Including the assignments from Pfam-B
† http://www.sbc.su.se/~arne/domains
or MAS increases the coverage to 65–75%. Interestingly, the highest coverage is found in proteins with
300–500 residues, since many of the shorter proteins
have no domain assignments, while the longer
proteins are only partially covered. Also, a larger
fraction of the longest proteins are covered by Pfam-B/MAS domains.
After domain assignments with Pfam-A+B or
SCOP+MAS, 22–36% of the proteomes remain
completely unassigned. Domain-adjacent regions
constitute 9–13% of the residues with a similar
fraction in all kingdoms. The orphan domains, on
the other hand, are more abundant in eukaryotes,
especially when Pfam assignments are considered.
Finally, for all assignments 5–6% of the residues are
found in orphan proteins, except after the SCOP+MAS
assignments in prokaryotes where they are
twice as abundant.
Domain lengths
A domain is assigned to a protein after alignment
of the protein sequence with a hidden Markov
Figure 2. Amino acid coverage with Pfam-A+B or SCOP+MAS is shown for the three kingdoms. The unassigned
parts are divided into short domain adjacent regions (DAR), long unassigned regions (orphan domains) and unassigned
proteins (orphan proteins). DARs are regions shorter than 100 residues that remain after domain assignments while
orphan domains are longer regions and orphan proteins are whole proteins without any assigned domains.
model representing the domain family. As the
domain is assumed to cover the aligned region,
which is often shorter than the HMM, many domain
assignments are shorter than their corresponding
models, see Figure 3. In fact, the Pfam-A assignments are on average 7% shorter than the HMMs
while the SCOP assignments are 10% shorter.
However, in the minuscule fraction of domains
known to contain inserts, the assigned domains are
sometimes considerably longer than the corresponding HMMs.
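The comparison between assignment length and model length can be written down as follows. This is a minimal sketch with hypothetical names. It assumes that the average shortening is computed as the mean of the per-assignment relative differences, which is one of several possible definitions and is not stated explicitly in the text.

```python
def residue_differences(pairs):
    """Assigned length minus HMM length for each (assigned, hmm) pair.

    Negative values mean that the assignment is shorter than the HMM.
    """
    return [assigned - hmm for assigned, hmm in pairs]


def mean_relative_shortening(pairs):
    """Mean of (hmm - assigned) / hmm over all (assigned, hmm) pairs."""
    return sum((hmm - assigned) / hmm for assigned, hmm in pairs) / len(pairs)


# Hypothetical toy data: two truncated assignments and one with an insert.
toy_pairs = [(90, 100), (45, 50), (120, 100)]
differences = residue_differences(toy_pairs)
shortening = mean_relative_shortening(toy_pairs)
```

An assignment containing an insert gives a positive residue difference and pulls the mean shortening towards zero, which is why the small fraction of domains with inserts is mentioned separately above.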
On average the MAS domains are the shortest
(median length 78 residues/average length 105
residues) followed by Pfam-B (81/116), while the
better studied SCOP (114/143) and Pfam-A
(155/194) domains are longer, see Figure 4. The
lengths of the domains from SCOP, MAS and Pfam-B all follow a similar pattern with a peak close to 60
Figure 3. The difference between
length of domain assignments and
length of HMM seeds, where alignments shorter than the HMM have
a negative residue difference. As
can be seen, many Pfam-A and
SCOP domain assignments are
shorter than the corresponding
HMM.
Table 2. Fraction of PRC hits between domain family
HMMs for three different E-values (10⁻¹, 10⁻³ and 10⁻⁵)
PRC E-value                   10⁻⁵ (%)   10⁻³ (%)   10⁻¹ (%)
Pfam-A to Pfam-A              3.5        9.1        27.3
Pfam-B to Pfam-A              3.8        8.5        23.1
SCOP to SCOP (Superfamily)    2.4        4.4        9.5
SCOP to SCOP (Fold)           1.0        2.3        6.1
MAS TM to SCOP                0.9        3.8        33.0
MAS non-TM to SCOP            9.6        14.5       18.3
PRC alignments comparing Pfam-B to Pfam-A, Pfam-A to Pfam-A, SCOP to SCOP and MAS to SCOP were done. SCOP hits are
displayed as number of hits to a family from another fold and
another superfamily. The MAS domains were divided into
transmembrane families (TM) and non-transmembrane (non-TM). Many MAS TM families had high hit-ratios to SCOP at
higher E-values; this can be accounted for by hits between
transmembrane helices.41
residues, while Pfam-A domains have a sharp peak
at 25. A few very abundant repeating domain
families such as the C2H2/C2HC zinc fingers can
largely explain this peak. It can also be noted that
there are fewer long (>200) domains assigned by
Pfam-B and MAS. After domain assignments with
SCOP and Pfam, about half of the remaining
unassigned regions were shorter than 50 residues
(49–55%). With the inclusion of Pfam-B/MAS this
fraction increased to 79–81%, while only 10–12% were longer than 100 residues (data not shown).
Detection of distant homologs
During this study, it was noted that many Pfam-B/MAS families show weak similarity to a Pfam-A/SCOP family. Therefore, the PRC program for
aligning and comparing HMMs was used to
evaluate distant homology between domain
families. According to the SCOP definitions, two
domains that belong to different superfamilies
should not be homologous,2 but there are a few
exceptions.20 To estimate the false positive rate we
applied PRC to SCOP superfamilies and found that
4.4% (2.3%) of the superfamilies (folds) show
significant similarity (E-value <10⁻³) to another
superfamily (fold), see Table 2. As we know that
some of these high-scoring pairs are likely to be
distantly related, see Materials and Methods, we
decided to keep using an E-value threshold of 10⁻³
to estimate the number of distantly related domain
families.
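The kind of summary reported in Table 2 can be illustrated with the sketch below. It does not run PRC itself. It assumes that the best (lowest) E-value of each family against the other library has already been extracted from the program output, and the function name, the data layout and the toy values are all hypothetical.

```python
def hit_fractions(best_evalue, thresholds=(1e-5, 1e-3, 1e-1)):
    """Fraction of families whose best hit falls below each E-value threshold.

    best_evalue maps a family identifier to the lowest E-value obtained
    against any family in the other library, or None if there was no hit.
    """
    n_families = len(best_evalue)
    fractions = {}
    for threshold in thresholds:
        n_hits = sum(
            1 for e in best_evalue.values() if e is not None and e < threshold
        )
        fractions[threshold] = n_hits / n_families
    return fractions


# Hypothetical toy data for four families.
toy = {"B1": 1e-7, "B2": 5e-4, "B3": 0.05, "B4": None}
fractions = hit_fractions(toy)
```

Applying the same function to a set of families that should not be related, as done above for SCOP superfamilies and folds, gives an estimate of the false positive rate at each threshold.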
All Pfam-B HMMs were aligned against the
Pfam-A library and for 8.5% of the Pfam-B domains
significant homology to a Pfam-A family was
found, see Table 2. It was also noted that a similar
fraction of the Pfam-A families are homologous to
another Pfam-A family, confirming the well-known
fact that several Pfam-A families are related.
Further, the 500 largest MAS domain families
were aligned against the superfamily database.
Since few transmembrane (TM) protein structures have been resolved, SCOP is strongly
underrepresented in transmembrane domains;
consequently MAS domains frequently contain
TM-regions. Therefore, the MAS domains were
divided into two groups: TM and non-TM domains.
14.5% of the non-TM domains show similarity to a
SCOP superfamily, while as expected, only a few
TM-domains were detected as homologs. With a
less strict threshold (10⁻¹), more Pfam-B and MAS
domains give hits (15–25%). However, only 22% of
all SCOP families give hits to another member of the
same superfamily at the threshold 10⁻³ (data not
shown), therefore the estimates of distant homology
made above should be seen as a lower limit.
Discussion
A domain family in Pfam-A is based on a
manually curated multiple sequence alignment
and the detection of novel domains is often based
on clustering of unassigned regions. Therefore, the
definitions of domain boundaries are evolutionary
based, i.e. domains that have been seen alone or in
combination with other domains will be identified
as unique domains, and domain boundaries which
are not well conserved are often not included in the
HMMs. In contrast, the definitions of domain
families in SCOP are based on structure. Although
they have different origins, the domain definitions
of Pfam-A and SCOP are frequently quite similar.1
There is one significant distinction, however, in the
fact that domain families in SCOP are organized in a
hierarchical fashion as folds, superfamilies and
families. According to the definitions in SCOP,
two domains that belong to different superfamilies
do not have features that “suggest that a common
evolutionary origin is probable”.2 In contrast, two
domains belonging to different Pfam-A families
could very well be homologs, see Table 2. In
addition, the assignment of domains with the two
methods is different, as each Pfam-A family is
represented by one HMM while one SCOP superfamily may be represented by several HMMs.20
The difference is notable in the clustering level of
Pfam-A and SCOP as demonstrated by the number
of domain families that we detect (5385 and 1177,
respectively), i.e. a single SCOP domain superfamily frequently corresponds to several Pfam-A
domain families.
The same domain families are sometimes represented differently by Pfam-A and SCOP. The length
of a domain family may vary between the two
definitions and Pfam-A domain families are on
average longer than SCOP domain families. It is not
uncommon that one domain family from Pfam-A
has no counterpart in SCOP, as SCOP only contains
domain families of known structure. This is
especially evident for membrane proteins which
are poorly represented in SCOP.1 Still, for a general
comparison, it is of marginal importance if
homology or structure is used to define a domain
and, in spite of some differences, our estimates of
coverage and number of multi-domain proteins are
Figure 4. Distribution of the
lengths of assigned domain
families (Pfam-A, Pfam-B, SCOP
and MAS). There are more short
family assignments with Pfam-A
than with the others and more
families of a length greater than
180 residues are assigned with
Pfam-A and SCOP than with
Pfam-B and MAS.
similar using either domain definition, see Figures 2,
4, 6 and 7. When repeating domains are considered,
however, the choice of domain definition has a
greater impact on the results, due to the differences
in clustering level, see Figure 8.
Pfam-B and MAS domains
Most earlier genome-comparative studies of
protein domains are based on the well studied
Pfam-A or SCOP domains that only cover less than
half of the proteomes. In contrast to Pfam-A and
SCOP, the families from Pfam-B and MAS are not
manually curated and are of varying quality. Many
of these assignments probably do not correspond to
unique and complete domain families. The families
are on average shorter than in Pfam-A and SCOP,
but show similar length distribution to that of SCOP
with a peak around 60 residues, see Figure 4.
We noted that in many cases the domains
detected by Pfam-B or MAS are homologous to
a Pfam-A or SCOP domain family. This can
be exemplified by the human protein
ENSP00000276208 that has 14 domains, according
both to Pfam-A+B and to SCOP+MAS, of which
the majority belong to the Immunoglobulin (Ig)
family (PF00047/b.1.1), see Figure 5. Some Pfam-A
Ig domains are not recognized as members of the
SCOP Ig-family or vice versa but are instead
assigned to a Pfam-B or MAS family. Therefore, it
seems likely that the Pfam-B/MAS domains are
distantly related to the Ig-domain family, but have
diverged too far to be detected. We have attempted
to quantify the similarity of Pfam-B families to
Pfam-A using a sensitive HMM-HMM alignment
program (PRC) and we noted that at least 8% of the
Pfam-B domain families are distantly related to a
Pfam-A family, see Table 2. In contrast, many of the
most common MAS families are found in membrane proteins, of which several have not been
structurally solved, and hence do not have any
counterpart in SCOP, while more than 14% of the
globular MAS domains are homologous to SCOP
families. Whether orphan domains and orphan
proteins also contain a large fraction of distant
homologs to better studied domains is obviously
not known but in a recent study by Siew & Fischer22
it was shown that most of the recently solved
structures of orphan proteins show similarity to
already known protein domains.
Structural features
The secondary structure of the unassigned
regions is of particular interest as it might reveal
clues about function and evolutionary origin. To
obtain a better understanding of the differences
between the classes of assigned and unassigned
regions in Figure 2, structural features were
predicted using state of the art methods. Each
residue was assigned to belong to one out of six
“structural” states assigned in the following order,
(i) transmembrane regions, (ii) disordered
regions, (iii) low-complexity regions, (iv) α-helices,
(v) β-sheets and (vi) coils, see Figure 6.
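The ordered assignment of structural states can be sketched as below. The example assumes that each prediction method has already produced a set of residue positions, and that a residue not claimed by any of the first five predictions is labelled coil. The state names, the data layout and the function name are illustrative choices, not part of the original analysis.

```python
# Priority order of the first five states; coil is the fallback (assumption).
PRIORITY = ("transmembrane", "disorder", "low_complexity", "helix", "sheet")


def assign_states(length, predictions):
    """Assign one structural state to each residue of a protein.

    predictions maps a state name to the set of 0-based residue indices
    predicted to be in that state. When several predictions overlap, the
    state that comes first in PRIORITY wins.
    """
    states = []
    for i in range(length):
        for state in PRIORITY:
            if i in predictions.get(state, ()):
                states.append(state)
                break
        else:
            # No predictor claimed this residue.
            states.append("coil")
    return states


# Hypothetical four-residue example with overlapping predictions.
toy = assign_states(
    4, {"disorder": {0, 1}, "transmembrane": {1}, "helix": {1, 2}}
)
```

In the toy example residue 1 is predicted as transmembrane, disordered and helical at the same time, and it is labelled transmembrane because that state has the highest priority.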
Transmembrane regions are clearly under-represented in SCOP domains (2%) but they are instead
included in the MAS domains (15%). In archaea and
bacteria, transmembrane residues constitute as
much as 22% of the MAS domains. Prokaryotic
proteins contain a larger fraction of transmembrane
residues than eukaryotes, which may explain why
SCOP covers less of the proteins than Pfam-A in
prokaryotes. In general, the unassigned regions
resemble the assigned regions quite well with
regard to transmembrane content, but the Pfam-A domains seem to be slightly over-represented in
transmembrane regions.
As can be expected, increased portions of
disordered and low-complexity regions are found
in the sequences not related to the well studied
Pfam-A or SCOP domains. Orphan proteins contain
Figure 5. Domain assignment for the human protein ENSP00000276208. The upper assignments are Pfam-A (Ig) and Pfam-B (B1–B4) while the lower is SCOP (Ig) and MAS
(M1–M3). Both Pfam and SCOP/MAS give 14 domains, most of which belong to the Immunoglobulin (Ig) family. The Pfam-B/MAS domains (B1, B3 and M1) that correspond to
SCOP/Pfam-A Ig-domains are likely to be distantly related members of the Ig-family.
Figure 6. Structural features for regions assigned by
Pfam-A/SCOP or Pfam-B/MAS and unassigned
sequences: DARs, orphan domains (OD) and orphan
proteins (OP). The features are depicted in the following
order: disorder, low complexity, transmembrane region,
coil, β-sheet and α-helix (from top to bottom). The
distribution in eukaryotes is displayed in (a), and for
prokaryotes in (b).
the same fraction of disorder as do most proteins
with domain assignments. Disordered regions in
orphan proteins could be located both within
domains and in linker regions, hence we do not
know if there are structured domains in orphan
proteins. As has been documented earlier,23,24 we
found that disordered regions are much more
frequent in eukaryotes, including yeast, than in
genomes of the other kingdoms, with as much as
48% in eukaryotic orphan domains, compared to
12%–13% in prokaryotes, and even the well
defined SCOP/Pfam-A domains contain more
disorder in eukaryota. It has been shown that
disordered regions may function as flexible connecting loops between domains.23 As can be seen in
Figure 6, the sequence class with most disorder in
prokaryotes is the domain adjacent regions (DAR),
leading us to believe that these are linkers between
domains.
In eukaryotes, on the other hand, a similar
frequency of disordered stretches is found in the
orphan domains and in the domain adjacent
regions, as well as Pfam-B/MAS domains. This may
indicate that some of the disordered regions in
eukaryotes are linkers between domains while
others are part of domains. In many cases disordered regions have been found within domains
where they perform important functions such as
binding to other molecules.24,25 Disordered regions
have also been shown to evolve faster than other
parts of the proteomes,25,26 especially in higher
eukaryotes. This could explain why orphan
domains which have a large fraction of disorder,
have no or few detected homologs. Others have
suggested that many disordered protein regions are
actually remnants of protein evolution, which are
non-functional but have not yet been lost through
selection.27
Eukaryotes contain more multi-domain proteins
Next, we wanted to estimate the number of
proteins containing a specific number of domains in
each kingdom of life including information about
the unassigned regions. In order to predict the
number of domains in a protein it is necessary to
decide when to define an unassigned region as an
individual domain. Short regions between domains
may belong to linkers or connecting loops, but they
may also be part of the bordering domain. It is
difficult to define the domain borders, and many
non-conserved C or N-terminal elements might be
missed. This is the case for many Pfam-A domain
families where non-conserved elements close to the
domain borders are not included in the HMMs,1 but
also for the many families where the obtained
assignments are shorter than the HMMs. As
mentioned in Results, Pfam-A and SCOP assignments are on average 7% and 10% shorter than their
corresponding HMMs, hence the domain assignments have missed many non-conserved regions at
the ends. It is also well known that domains might
increase and decrease in size without the need for
domain rearrangements.18 Naturally, we cannot
rule out the possibility that these short unassigned
segments contain small domains, as for instance has
been seen by the insertion of a minidomain in
N-acetylglucosamine-6-phosphate deacetylase from
T. maritima.28 We believe, however, that it is an
exception rather than the rule that these short
regions constitute a whole domain in its traditional
meaning.
In a previous study by Apic et al.,9 based on SCOP
domain assignments, the number of multi-domain
proteins was predicted to be 80% in eukaryotes and
65% in bacteria and archaea. Similar results were
obtained by Liu et al.10 using both evolutionary and
structural domains. Levitt and Gerstein8 on the
other hand, predicted that two thirds of the
eukaryotic proteomes consist of multi-domain
proteins. None of these estimates is robust,
however, since they are obtained only from proteins
containing assigned domains, corresponding to
31–69% of all proteins. In contrast, we have
assigned domains in up to 90% of all proteins. In
the earlier studies, quite conservative cut-offs
(30–70 amino acid residues) were used to assume
that an unassigned region constitutes a domain.8,9
Here, the predicted number of domains in each
protein has been studied using different cut-offs
ranging from 30–200 residues. We have also used
four different domain representations, (i) Pfam-A
assignments and unassigned regions larger than the
cut-off, (ii) Pfam-A+Pfam-B assignments and unassigned regions, (iii) SCOP and unassigned regions
or (iv) SCOP+MAS and unassigned regions.
In Figure 7 it can be seen how the estimated
number of multi-domain proteins increases as the
cut-off length decreases. The fraction of eukaryotic
proteins that are predicted to have two or more
domains using Pfam decreases from 90% to 50%
when the cut-off is increased from 30 to 200 amino
acid residues. Interestingly, cut-off lengths of 70 and
100 residues give similar results regardless of which
of the four strategies we use, in spite of the
differences in average domain lengths and coverage. With shorter or longer cut-offs the variation is
larger, especially for archaea and bacteria.
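The dependence on the cut-off can be made concrete with a small sketch. Each protein is reduced to its number of assigned domains and the lengths of its unassigned regions. The data layout, the function name and the toy proteome are hypothetical, and the rule that a region contributes as many orphan domains as whole cut-off lengths fit into it is an assumption based on the description in Figure 1.

```python
def multi_domain_fraction(proteins, cutoff):
    """Fraction of proteins with two or more domains at a given cut-off.

    proteins: list of (n_assigned, unassigned_lengths) tuples, where
    unassigned_lengths lists the lengths of all unassigned regions.
    """
    n_multi = 0
    for n_assigned, unassigned_lengths in proteins:
        n_orphan = sum(length // cutoff for length in unassigned_lengths)
        if n_assigned + n_orphan >= 2:
            n_multi += 1
    return n_multi / len(proteins)


# Hypothetical toy proteome of five proteins.
toy_proteome = [(1, [40]), (1, [120]), (2, []), (0, [90]), (1, [60, 65])]
sweep = {
    cutoff: multi_domain_fraction(toy_proteome, cutoff)
    for cutoff in (30, 50, 70, 100, 200)
}
```

Because a longer cut-off can only reduce the number of orphan domains counted in a region, the estimated fraction of multi-domain proteins never increases as the cut-off grows.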
With a cut-off at 100 residues, we estimate that
eukaryotes have 35% single-domain proteins, 20%
two-domain proteins and 45% with three or more
domains, see Figure 7(d). Archaea and bacteria
have very similar distributions of multi-domain
proteins with corresponding numbers 60%, 20%
and 20%. Our results correlate well with findings of
Levitt & Gerstein,8 but indicate that the proteomes
contain fewer multi-domain proteins than claimed
in other studies.9,10,29 We believe this to be a more
accurate prediction since larger fractions of the
proteomes are covered and because more attention
has been invested in evaluating unassigned regions.
However, depending on what cut-off for unassigned regions is used, this estimate varies. With
cut-off 50 the fraction of multi-domain proteins
increases to around 80% in eukaryotes and 60% in
prokaryotes, while with cut-off 200 these numbers
go down to 40–60% and 30–40%, respectively.
As databases with known domain families are
growing continuously and more orphan domains
are found, we may actually get coverage in all
regions where so far no domains have been
detected. Not until then can we truly estimate the
number of multi-domain proteins.
More than 8% of the multicellular proteomes
consists of domain repeats
A notable feature of many multi-domain proteins
is that they consist of repeats, i.e. two or more
domains from the same family adjacent to each
other. These repeats may have evolved through
fusion of domains, like other domain combinations,
but their frequencies and lengths suggest a particular mechanism for repeat formation. Some domain
repeats, such as the HEAT repeats, have complex
evolutionary patterns involving duplications of one
or several repeats as a cassette, but also various
deletions.30 Since the repeating domains behave
Figure 7. Estimated numbers of multi-domain proteins using five different cut-offs for unassigned domains (30, 50, 70,
100 and 200) and four different domain representations: Pfam-A, Pfam-A+B, SCOP and SCOP+MAS. The results for
the different kingdoms are shown for (a) archaea, (b) bacteria and (c) eukarya. (d) Fraction of single-domain and multi-domain proteins in each kingdom (A, B, and E) using a cut-off of 100 residues and the four different assignment methods,
Pfam-A, Pfam-A+B, SCOP and SCOP+MAS. Histogram displays the fraction of the proteomes that have been
predicted to contain one domain (black), two domains (grey) and more than two domains (white).
Figure 8. Distribution of repetitions in archaea, bacteria, yeast and multicellular organisms (Multi), showing the
fraction of all residues that are contained within a two-domain repeat, a three domain repeat, etc. Results for Pfam-A and
SCOP domain assignments are displayed. Repetitions of Pfam-B and MAS are nearly non-existent due to the low
clustering level, hence are not shown.
240
differently from other domains, we have chosen to
study them further.
Our study confirms previous results that there is
clearly a higher frequency of repeats in multicellular organisms, compared to the unicellular
organisms,9,31 see Figure 8, which suggests that
repeats have arisen late in evolution. The higher
frequency of repeats in multicellular organisms has
been suggested to provide them with an extra
source of variability to compensate for low generation rates.31 Repeats have an important function in
large structural complexes, cell adhesion and
signaling, and hence may have evolved more extensively in multicellular organisms.9 This is demonstrated by the fact that yeast is more similar to other
unicellular organisms than to the other eukaryotes
with regard to repeats. The repeats in yeast cover a
similar fraction of residues as bacteria and archaea,
but more long repeats are found in yeast.
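The repeat definition used in this section (two or more domains from the same family adjacent to each other) and the per-residue coverage of the kind shown in Figure 8 can be sketched as follows. The sketch assumes that "adjacent" means consecutive in the ordered list of assigned domains, regardless of the linker length between them; that reading, the function names and the family labels FamA and FamB are illustrative assumptions.

```python
from itertools import groupby
from typing import List, Tuple

Domain = Tuple[str, int, int]  # (family, start, end), 1-based and inclusive


def find_repeats(domains: List[Domain]) -> List[List[Domain]]:
    """Return runs of two or more consecutive domains from the same family."""
    ordered = sorted(domains, key=lambda d: d[1])
    runs = []
    # groupby only merges neighbouring items, so a family that reappears
    # after a different domain starts a new run.
    for _, group in groupby(ordered, key=lambda d: d[0]):
        run = list(group)
        if len(run) >= 2:
            runs.append(run)
    return runs


def repeat_residue_fraction(length: int, domains: List[Domain]) -> float:
    """Fraction of a protein's residues that lie inside repeated domains."""
    covered = sum(end - start + 1
                  for run in find_repeats(domains)
                  for _, start, end in run)
    return covered / length


# Hypothetical protein of 450 residues with a three-domain repeat of FamA
example = [
    ("FamA", 10, 32), ("FamA", 40, 62), ("FamA", 70, 92),
    ("FamB", 120, 380), ("FamA", 400, 422),
]
runs = find_repeats(example)
print([len(run) for run in runs])               # lengths of the repeat runs
print(repeat_residue_fraction(450, example))    # residues in repeats / length
```

Summing the covered residues over all proteins of a proteome, grouped by run length, would give a distribution of the type plotted in Figure 8.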
Although repetitions constitute a large fraction of
the proteomes in multicellular organisms, the larger
number of multi-domain proteins could not be
explained by repeating domains. The ratio of single
versus multi-domain proteins was not altered
markedly if repeating domains were ignored (data
not shown). An effect is only observed on proteins
containing many domains as they frequently contain repeats, e.g. repeats occur in 89% (Pfam-A) and
87% (SCOP) of eukaryotic proteins with more than
five assigned domains, compared to 21%/28% in
two-domain proteins.
In total, similar proportions of the proteomes are
covered with repeats using either SCOP or Pfam-A,
but the number of repeating units is higher with
Pfam, while SCOP gives more two-domain repeats
and on average longer repeating domains. The
differences in number of repeating domains can
largely be explained by the differences between
SCOP and Pfam domain definitions. Several common repeats, for instance the C2H2/C2HC zinc
fingers, are assigned to longer regions with SCOP
(on average 48 residues long) often overlapping
several Pfam domains (average length 23 residues),
hence making the number of repeating Pfam
domains larger. On the other hand, the grouping
of multiple Pfam families into one SCOP superfamily has an impact on the number of domain
families that form repeats. One such example is the
most versatile SCOP superfamily, the P-loop containing nucleotide triphosphate hydrolases (c.37.1),
which constitutes a large fraction of all repeats with
SCOP, but in Pfam it corresponds to several families
and is frequently not detected in repeats. To get a
higher clustering level, Pfam families of common
evolutionary origin were grouped together using
the Pfam clans†. However, since most of the
repeating domain families are not included in the
clans, the number of repeats was increased by less
than one percent (data not shown).
† ftp://ftp.sanger.ac.uk/pub/databases/Pfam/
Conclusions
Our understanding of different genomes is to a
large extent based on analysis of the parts that are
related to a known domain family, i.e. the parts that
match a Pfam-A or SCOP domain. However, such
an analysis leaves out more than half of the
proteome and the origin, structure and function of
these regions are to a large degree unknown. These
parts can be divided into four different groups:
domains that match a Pfam-B or MAS domain,
short domain-adjacent regions, orphan domains
and orphan proteins. Here, we show that all these
groups contain a higher fraction of low-complexity
and disorder than the better studied Pfam-A and
SCOP domains. Interestingly, regions neighboring
Pfam-A/SCOP domains seem to have slightly more
disorder than orphan proteins, which have a similar
fraction of disorder as partially assigned proteins,
while sequences matching Pfam-B or MAS are
slightly less disordered.
Between 10% and 14% of the proteins were
considered orphans, and as has been observed
earlier,16 many orphan proteins are short. In
addition to orphan proteins, the proteomes consist
of 5–15% orphan domains. The origin of the orphan
proteins and domains is not known: either they
have evolved too far from their nearest neighbor to
be assigned to a domain family, or they have been
created by some de novo mechanism. However, a
sensitive search method indicates that many of the
Pfam-B/MAS families are distantly related to a
Pfam-A/SCOP family, supporting the assumption
that many of these domains have evolved too far to
be detected by current methodologies. It remains
unknown whether many of the orphan regions are also
distant homologs of already well-studied domains.
Identifying new domains in unassigned areas and
determining their boundaries is an intriguing task
for further investigations that requires more structural information, improved search methods as well
as genome comparisons on a larger scale.
On a general level, the different domain definitions (SCOP, SCOP+MAS, Pfam-A or Pfam-A+B)
provide a similar picture of domain distribution
in the genomes. When repeats are studied, however,
there are large differences between SCOP and Pfam.
The portion that is covered with repeats is similar
for the two methods, but the repeats with Pfam-A
are longer and more proteins with repeats are
found.
In addition, we have estimated the number of
multi-domain proteins in 21 fully sequenced
species with the domain definitions of Pfam and
SCOP with coverage in 65–70% of all proteins. This
fraction increased to 86–90% with the inclusion of
Pfam-B and MAS domains. Regardless of what
domain definition was used, 65% of the proteins in
eukaryotes and 40% in bacteria and archaea were
predicted to contain two or more domains, when
each unassigned region with 100 residues was
assumed to contain a domain. These results confirm
that multi-domain proteins are more common in
eukaryotic genomes.
Interestingly, the yeast genomes have similar
fractions of multi-domain proteins and disordered
regions as the other eukaryotes, whereas they
resemble the prokaryotes with regard to repetitions. This suggests that repetitions are important
for multicellularity, e.g. in cell–cell contacts, while
complex domain architectures and disordered
regions have intracellular functions important for
all eukaryotic organisms.
Materials and Methods
Species
We have analyzed the proteomes of 21 species, seven
from each kingdom. Eukarya: Homo sapiens, Mus
musculus, Drosophila melanogaster, Caenorhabditis elegans,
Arabidopsis thaliana, Saccharomyces cerevisiae, Schizosaccharomyces pombe. Bacteria: Escherichia coli O157:H7,
Pseudomonas aeruginosa, Bacillus subtilis, Rickettsia conorii,
Mycoplasma pulmonis, Prochlorococcus marinus, Treponema
pallidum. Archaea: Aeropyrum pernix, Methanococcus
jannaschii, Nanoarchaeum equitans, Pyrococcus abyssi,
Thermoplasma volcanium, Archaeoglobus fulgidus, Methanosarcina mazei.
The species in each kingdom have been chosen from
distant taxonomic groups. The archaeal organisms come
from different taxonomic lineages with life requirements
differing from the hyperthermophilic aerobe A. pernix to
the methylotrophic marine methanogen M. mazei.
Bacteria have also been chosen from different parts of
the tree with species belonging to proteobacteria, cyanobacteria, firmicutes and spirochaetes. The eukaryotes can
be further divided between unicellular and multicellular
organisms, where an insect, a plant, a worm and two
vertebrates represent the latter.
The microbial sequences have been collected from the
National Center for Biotechnology Information (NCBI)†
and the eukaryotic genomes come from Ensembl‡ except
for D. melanogaster which comes from FlyBase32§ and
S. cerevisiae from Saccharomyces Genome Database33¶.
SCOP and Pfam-A assignments
The domain definitions used in this study come from
Pfam7 and from the Structural Classification of Proteins
(SCOP) database.2 For both databases the domain assignments were made by scanning libraries of HMMs against
the protein sequences using HMMER-2.0‖. For the
Pfam-A assignments HMMs from Pfam version 12 were
used, while for the SCOP assignments the superfamily
database20 corresponding to SCOP version 1.63 was used.
A domain was assigned to a region of a gene if a match to
a domain HMM with an E-value better than 0.1 was
observed. It should be noted that the results differed very
little when a stricter cut-off (10⁻³) was used. For some
eukaryotes (H. sapiens, M. musculus, C. elegans and
A. thaliana) the Pfam assignments were collected from
Ensembl (footnote a).
† ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/
‡ ftp://ftp.ensembl.org/
§ ftp://flybase.net/
¶ ftp://ftp.yeastgenome
‖ http://hmmer.wustl.edu/
Pfam-B assignments
Regions without Pfam-A assignments were searched
against the Pfam-B database. Pfam-B families are generated from ProDom,6 which is automatically created
through clustering of proteins in SwissProt and TrEMBL
and any overlap with Pfam-A families has been removed.
Multiple alignments of Pfam-B families consisting of at
least five sequences were used to build HMMs using
HMMER-2.0. Smaller Pfam-B families were collected in a
database and regions without a hit to either Pfam-A or
Pfam-B HMMs were searched against this database using
BLAST.34
Mkdom after SCOP assignments (MAS)
SCOP assignments were complemented with sequence
based clustering. For this Mkdom 2.021 was used, i.e. the
same clustering procedure that is employed to generate
the ProDom database which is the basis for Pfam-B.
Mkdom is based on recursive homology search with PSI-BLAST, starting with the shortest sequence and extracting
all its homologs. Regions of more than 50 residues that
remain unassigned after SCOP assignments were
extracted and clustered with Mkdom 2.0. All clusters of
less than three sequences were discarded. We name these
domain families MAS (Mkdom After SCOP).
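The two thresholds in this procedure (regions of more than 50 residues, clusters of at least three sequences) can be sketched as below. The clustering itself is done by Mkdom and is not reproduced here; the function names, the toy sequence and the cluster dictionary are hypothetical and only illustrate the filtering steps.

```python
from typing import Dict, List, Tuple

Region = Tuple[int, int]  # (start, end), 1-based and inclusive


def long_unassigned_regions(sequence: str, domains: List[Region],
                            longer_than: int = 50) -> List[Tuple[int, int, str]]:
    """Extract stretches longer than `longer_than` residues that lack a domain."""
    regions = []
    pos = 1
    # A sentinel just past the C-terminus lets the loop also emit the last gap.
    sentinel = (len(sequence) + 1, len(sequence) + 1)
    for start, end in sorted(domains) + [sentinel]:
        if start - pos > longer_than:
            regions.append((pos, start - 1, sequence[pos - 1:start - 1]))
        pos = max(pos, end + 1)
    return regions


def discard_small_clusters(clusters: Dict[str, List[str]],
                           min_size: int = 3) -> Dict[str, List[str]]:
    """Keep only clusters with at least `min_size` member sequences."""
    return {name: members for name, members in clusters.items()
            if len(members) >= min_size}


# Hypothetical 200-residue protein with one SCOP assignment at 61-120
toy_sequence = "A" * 200
segments = long_unassigned_regions(toy_sequence, [(61, 120)])
print([(s, e) for s, e, _ in segments])

# Hypothetical cluster output; only cluster_1 would become a MAS family
toy_clusters = {"cluster_1": ["p1", "p2", "p3"], "cluster_2": ["p4", "p5"]}
print(sorted(discard_small_clusters(toy_clusters)))
```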
Insertions
Domains are usually adjacent to each other but there
are also domains that are inserted into other “parental”
domains. Inserts have been predicted to occur in 9%
of non-redundant PDB sequences.19 Overlap was
not allowed in this study and inserts were hence
excluded. Assignments retrieved from Ensembl included
inserts in approximately 0.1% of the proteins, but were
ignored.
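The assignment rules described in these sections (an E-value better than 0.1 and no overlap between assigned domains) can be sketched as follows. How competing overlapping hits were resolved is not stated in the text; the sketch assumes a greedy choice in which the hit with the better E-value wins, so a domain inserted into a parental domain is dropped whenever the parent scores better. The Hit record and the example hits are hypothetical, and parsing of HMMER output is not shown.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    family: str
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    evalue: float


def assign_domains(hits: List[Hit], max_evalue: float = 0.1) -> List[Hit]:
    """Keep significant hits and resolve overlaps greedily by E-value.

    Assumption: when two hits overlap, the one with the lower E-value is kept.
    """
    significant = [h for h in hits if h.evalue < max_evalue]
    accepted: List[Hit] = []
    for hit in sorted(significant, key=lambda h: h.evalue):
        if all(hit.end < a.start or hit.start > a.end for a in accepted):
            accepted.append(hit)
    return sorted(accepted, key=lambda h: h.start)


# Hypothetical hits on one protein
toy_hits = [
    Hit("A", 1, 100, 1e-20),
    Hit("B", 90, 180, 1e-5),    # overlaps A and has a worse E-value
    Hit("C", 200, 300, 0.05),
    Hit("D", 310, 400, 0.5),    # not significant at 0.1
]
print([h.family for h in assign_domains(toy_hits)])
print([h.family for h in assign_domains(toy_hits, max_evalue=1e-3)])
```

Running the same function with the stricter threshold gives a direct way to check how sensitive the assignments are to the cut-off.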
Structural features
Proteins with assignments were divided into Pfam-A/SCOP domains, Pfam-B/MAS domains and
unassigned regions. The unassigned regions were further
divided into short (<100 residues) domain-adjacent regions
(DAR) and orphan domains. The secondary structure of
these sequences was examined using different prediction
programs. Transmembrane regions were predicted with
TMHMM 2.0,35 secondary structure with PSIPRED 2.3,36
disordered regions with DISOPRED 2.124 and low-complexity regions with SEG.37 The features were predicted in the
following order: (i) transmembrane, (ii) disorder,
(iii) low complexity and (iv) secondary structure. Default
parameters were used with all programs. All predicted
structural features, as well as the domain assignments,
are available on the web (footnote b).
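One way to read the prediction order above is as a precedence rule: a residue predicted as transmembrane is not relabeled as disordered, and so on down the list. That reading is an assumption, as are the label names and the function below; the sketch only shows how such a precedence could be applied to per-residue predictions that have already been produced by the individual programs.

```python
from collections import Counter
from typing import Dict, List, Tuple

Region = Tuple[int, int]  # (start, end), 1-based and inclusive

# Highest precedence first, following the order given in the text.
PRIORITY = ["transmembrane", "disorder", "low_complexity", "secondary_structure"]


def label_residues(length: int, predictions: Dict[str, List[Region]]) -> List[str]:
    """Assign one label per residue, letting higher-precedence features win."""
    labels = ["other"] * length
    # Paint the lowest-precedence feature first so later ones overwrite it.
    for feature in reversed(PRIORITY):
        for start, end in predictions.get(feature, []):
            for i in range(start - 1, end):
                labels[i] = feature
    return labels


# Hypothetical predictions for a 10-residue stretch
toy_predictions = {
    "transmembrane": [(1, 3)],
    "disorder": [(3, 6)],
    "secondary_structure": [(1, 10)],
}
toy_labels = label_residues(10, toy_predictions)
print(Counter(toy_labels))
```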
Detection of distant homologs
The HMM-HMM alignment program PRC 1.3.1 (footnote c) was
used to detect distant homology between domain
families. For the 500 largest MAS domain families we
created multiple sequence alignments using Clustal W,38
then hidden Markov models were built using HMMER in
a standard way. E-values were not reported from PRC,
but instead calculated from the distribution of the scores
from non-related sequences in the same way as in
FASTA39 using no correction for the length of the domain
families. In an earlier study40 it was found that the exact
choice of method did not influence the results
significantly.
a www.ensembl.org
b http://www.sbc.su.se/~arne/domains
c http://www.supfam.org/PRC/
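The idea of deriving E-values from the score distribution of unrelated comparisons can be illustrated with a small sketch. It assumes that the null scores follow an extreme-value (Gumbel) distribution whose parameters are estimated by the method of moments, with no length correction; this is a stand-in chosen for brevity and is not necessarily the exact procedure used in the study or in FASTA. The function names and the toy scores are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Tuple

EULER_GAMMA = 0.5772156649015329


def fit_gumbel(null_scores: List[float]) -> Tuple[float, float]:
    """Method-of-moments estimate of Gumbel location (mu) and scale (beta)."""
    beta = pstdev(null_scores) * math.sqrt(6) / math.pi
    mu = mean(null_scores) - EULER_GAMMA * beta
    return mu, beta


def evalue(score: float, mu: float, beta: float, n_comparisons: int) -> float:
    """Expected number of unrelated comparisons scoring at least `score`."""
    z = max((score - mu) / beta, -700.0)   # clamp to avoid overflow in exp
    p = -math.expm1(-math.exp(-z))         # P(S >= score) = 1 - exp(-exp(-z))
    return n_comparisons * p


# Hypothetical scores from comparisons between unrelated families
toy_null = [10.0, 12.0, 11.0, 13.0, 9.0, 10.0, 11.0, 12.0]
mu, beta = fit_gumbel(toy_null)
for s in (12.0, 20.0, 30.0):
    print(s, evalue(s, mu, beta, n_comparisons=1000))
```

With real data the null scores would come from comparisons between families known to be unrelated, and n_comparisons would be the number of profile pairs searched.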
The MAS domains were divided into membrane and
non-membrane proteins by predicting the number of
transmembrane (TM) helices using TMHMM35 for each
member of the domain family. TM containing MAS
domains were defined as domains that on average
contained one or more TM regions.
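The classification rule for TM-containing MAS domains reduces to a mean over the family members, as in this minimal sketch. The per-member helix counts would come from TMHMM; here they are hypothetical numbers, and the function and family names are illustrative only.

```python
from statistics import mean
from typing import Dict, List, Tuple


def is_tm_family(helix_counts: List[int]) -> bool:
    """A family is TM-containing if its members average one or more TM helices."""
    return mean(helix_counts) >= 1


def split_by_membrane(families: Dict[str, List[int]]) -> Tuple[List[str], List[str]]:
    """Return (TM-containing, non-membrane) family names, each sorted."""
    tm = sorted(name for name, counts in families.items() if is_tm_family(counts))
    non_tm = sorted(name for name, counts in families.items() if not is_tm_family(counts))
    return tm, non_tm


# Hypothetical predicted TM-helix counts for the members of three families
toy_families = {
    "MAS_1": [0, 0, 1],
    "MAS_2": [2, 1, 0, 1],
    "MAS_3": [7, 7, 6],
}
tm_families, non_tm_families = split_by_membrane(toy_families)
print(tm_families, non_tm_families)
```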
Many of the SCOP hits can be false, but there are also
some superfamilies that are actually related.20 In our
results there were some folds that gave many hits at low
E-values between different superfamilies and folds, such
as the three folds: 6-bladed beta-propeller (b.68), 7-bladed
beta-propeller (b.69) and 8-bladed beta-propeller (b.70).
Another example is the fold α–α superhelix (a.118) where
there were high scoring hits between several of the
superfamilies.
Acknowledgements
This work was supported by grants from the
Swedish Natural Sciences Research Council, the
Carl Trygger foundation, and the European Union.
References
1. Elofsson, A. & Sonnhammer, E. L. L. (1999). A
comparison of sequence and structure protein
domain families as a basis for structural genomics.
Bioinformatics, 15, 480–500.
2. Murzin, A., Brenner, S., Hubbard, T. & Chothia, C.
(1995). SCOP: a structural classification of proteins
database for the investigation of sequences and
structures. J. Mol. Biol. 247, 536–540.
3. Orengo, C., Michie, A., Jones, S., Jones, D., Swindells,
M. B. & Thornton, J. (1997). Cath-a hierarchical
classification of protein domain structures. Structure,
5, 1093–1108.
4. Holm, L. & Sander, C. (1994). The fssp database of
structurally aligned protein fold families. Nucl. Acids
Res. 22, 3600–3609.
5. Schultz, J., Milpetz, F., Bork, P. & Ponting, C. P. (1998).
Smart, a simple modular architecture research tool:
identification of signaling domains. Proc. Natl Acad.
Sci. USA, 95, 5857–5864.
6. Servant, F., Bru, C., Carrère, S., Courcelle, E., Gouzy, J.,
Peyruc, D. & Kahn, D. (2002). Prodom: automated
clustering of homologous domains. Brief. Bioinfor. 3,
246–251.
7. Sonnhammer, E., Eddy, S. & Durbin, R. (1997). Pfam:
a comprehensive database of protein domain families
based on seed alignments. Proteins: Struct. Funct.
Genet. 28, 405–420.
8. Gerstein, M. & Levitt, M. (1998). Comprehensive
assessment of automatic structural alignment against
a manual standard, the scop classification of proteins.
Protein Sci. 7, 445–456.
9. Apic, G., Gough, J. & Teichmann, S. A. (2001). Domain
combinations in archaeal, eubacterial and eukaryotic
proteomes. J. Mol. Biol. 310, 311–325.
10. Liu, J. & Rost, B. (2004). Chop proteins into structural
domain-like fragments. Proteins: Struct. Funct. Bioinfor. 55, 678–688.
11. Gerstein, M. (1997). A structural census of genomes:
comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure. J. Mol. Biol. 274,
562–576.
12. Qian, J., Luscombe, N. M. & Gerstein, M. (2001).
Protein family and fold occurrence in genomes:
power-law behaviour and evolutionary model.
J. Mol. Biol. 313, 673–681.
13. Apic, G., Gough, J. & Teichmann, S. A. (2001). An
insight into domain combinations. Bioinformatics, 17,
S83–S89.
14. Rost, B. (2002). Did evolution leap to create the protein
universe? Curr. Opin. Struct. Biol. 12, 409–416.
15. Fischer, D. & Eisenberg, D. (1999). Finding families for
genomic orfans. Bioinformatics, 15, 759–762.
16. Siew, N. & Fischer, D. (2003). Analysis of singleton
orfans in fully sequenced microbial genomes. Proteins:
Struct. Funct. Genet. 53, 241–251.
17. Ramani, A. K. & Marcotte, E. M. (2003). Exploiting the
co-evolution of interacting proteins to discover
interaction specificity. J. Mol. Biol. 327, 273–284.
18. Grishin, N. V. (2001). Fold change in evolution of
protein structures. J. Struct. Biol. 134, 167–185.
19. Aroul-Selvam, R., Hubbard, T. & Sasidharan, R.
(2004). Domain insertions in protein structures.
J. Mol. Biol. 338, 633–641.
20. Gough, J., Karplus, K., Hughey, R. & Chothia, C.
(2001). Assignment of homology to genome sequences
using a library of hidden Markov models that
represent all proteins of known structure. J. Mol.
Biol. 313, 903–919.
21. Gouzy, J., Corpet, F. & Kahn, D. (1999). Whole
genome protein domain analysis using a new method
for domain clustering. Comp. Chem. 23, 333–340.
22. Siew, N. & Fischer, D. (2004). Structural biology sheds
light on the puzzle of genomic orfans. J. Mol. Biol. 342,
369–373.
23. Liu, J., Tan, H. & Rost, B. (2002). Loopy proteins
appear conserved in evolution. J. Mol. Biol. 322, 53–64.
24. Ward, J. J., Sodhi, J. S., McGuffin, L. J., Buxton, B. F. &
Jones, D. T. (2004). Prediction and functional analysis
of native disorder in proteins from the three kingdoms of life. J. Mol. Biol. 337, 635–645.
25. Brown, C., Takayama, S., Campen, A., Vise, P.,
Marshall, T., Oldfield, C. et al. (2002). Evolutionary
rate heterogeneity in proteins with long disordered
regions. J. Mol. Evol. 55, 104–110.
26. Pandey, N., Ganapathi, M., Kumar, K., Dasgupta, D.,
Sutar, S. & Dash, D. (2004). Comparative analysis of
protein unfoldedness in human housekeeping and
non-housekeeping proteins. Bioinformatics, 20,
2904–2910.
27. Lovell, S. (2003). Are non-functional, unfolded proteins (“junk proteins”) common in the genome? FEBS
Letters, 554, 237–239.
28. Bradley, P., Chivian, D., Meiler, J., Misura, K. M., Rohl,
C. A., Schief, W. R. et al. (2003). Rosetta predictions in
casp5: successes, failures, and prospects for complete
automation. Proteins: Struct. Funct. Genet. 6, 457–468.
29. Teichmann, S., Park, J. & Chothia, C. (1998). Structural
assignments to the mycoplasma genitalium proteins
show extensive gene duplications and domain
rearrangements. Proc. Natl Acad. Sci. USA, 95, 14658–14663.
30. Andrade, M., Petosa, C., O’Donoghue, S. I., Muller,
C. W. & Bork, P. (2001). Comparison of arm and heat
protein repeats. J. Mol. Biol. 309, 1–8.
31. Marcotte, E. M., Pellegrini, M., Yeates, T. O. &
Eisenberg, D. (1999). A census of protein repeats.
J. Mol. Biol. 293, 151–160.
32. Consortium, T. F. (2003). The flybase database of the
drosophila genome projects and community literature. Nucl. Acids Res. 31, 172–175.
33. Dolinski, K., Balakrishnan, R., Christie, K. R.,
Costanzo, M. C., Dwight, S. S., Engel, S. R. et al.
(2004). Saccharomyces genome database. Methods
Enzymol. 266, 554–571.
34. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. &
Lipman, D. J. (1990). Basic local alignment search tool.
J. Mol. Biol. 215, 403–410.
35. Sonnhammer, E., von Heijne, G. & Krogh, A. (1998).
A hidden markov model for predicting transmembrane helices in protein sequences. Proc. Int. Conf.
Intell. Syst. Mol. Biol. 6, 175–182.
36. Jones, D. (1999). Protein secondary structure prediction based on position-specific scoring matrices.
J. Mol. Biol. 292, 195–202.
37. Wootton, J. & Federhen, S. (1996). Analysis of
compositionally biased regions in sequence databases. Methods Enzymol. 266, 554–571.
38. Thompson, J. D., Higgins, D. & Gibson, T. (1994).
Clustal W: improving the sensitivity of progressive
multiple sequence alignment through sequence
weighting, positions-specific gap penalties and
weight matrix choice. Nucl. Acids Res. 22, 4673–4680.
39. Pearson, W. R. & Lipman, D. J. (1988). Improved tools
for biological sequence analysis. Proc. Natl Acad. Sci.
USA, 85, 2444–2448.
40. Wallner, B., Fang, H., Ohlson, T., Frey-Skött, J. &
Elofsson, A. (2004). Using evolutionary information
for the query and target improves fold recognition.
Proteins: Struct. Funct. Genet. 54, 342–350.
41. Hedman, M., Deloof, H., von Heijne, G. & Elofsson, A.
(2002). Improved detection of homologous membrane
proteins by inclusion of information from topology
prediction. Protein Sci. 11, 652–658.
Edited by J. Thornton
(Received 27 September 2004; received in revised form 31 January 2005; accepted 2 February 2005)