doi:10.1016/j.jmb.2005.02.007 J. Mol. Biol. (2005) 348, 231–243

Multi-domain Proteins in the Three Kingdoms of Life: Orphan Domains and Other Unassigned Regions

Diana Ekman, Åsa K. Björklund, Johannes Frey-Skött and Arne Elofsson*

Stockholm Bioinformatics Center, Stockholm University SE-106 91 Stockholm, Sweden

Comparative studies of the proteomes from different organisms have provided valuable information about protein domain distribution in the kingdoms of life. Earlier studies have been limited by the fact that only about 50% of the proteomes could be matched to a domain. Here, we have extended these studies by including less well-defined domain families, Pfam-B and clustered domains (MAS), in addition to Pfam-A and SCOP domains. It was found that a significant fraction of these domain families are homologous to Pfam-A or SCOP domains. Further, we show that all regions that do not match a Pfam-A or SCOP domain contain a significantly higher fraction of disordered structure. These unstructured regions may be contained within orphan domains or function as linkers between structured domains. Using several different definitions we have re-estimated the number of multi-domain proteins in different organisms and found that several methods all predict that eukaryotes have approximately 65% multi-domain proteins, while prokaryotes have approximately 40% multi-domain proteins. However, these numbers are strongly dependent on the exact choice of cut-off for domains in unassigned regions. In conclusion, all eukaryotes have similar fractions of multi-domain proteins and disorder, whereas a high fraction of repeating domains is found only in multicellular eukaryotes. This implies a role for repeats in cell–cell contacts while the other two features are important for intracellular functions.

© 2005 Published by Elsevier Ltd.
*Corresponding author

Keywords: protein domains; multi-domain protein; comparative genomics; kingdoms of life; proteome

D.E. and A.K.B. contributed equally to this article.

Abbreviations used: SCOP, the structural classification of proteins database; Pfam, the protein family database; MAS, Mkdom after SCOP assignments; HMM, hidden Markov model.

E-mail address of the corresponding author: [email protected]

0022-2836/$ - see front matter © 2005 Published by Elsevier Ltd.

Introduction

Proteins are modular entities where “domain” is a general designation for recurrent protein fragments with distinct structure, function and/or evolutionary history. Protein domains may exist alone, forming small single-domain proteins, but are frequently part of larger polypeptide chains, assembled by successive events of gene fusion. During evolution multi-domain proteins have often changed their domain arrangements and compositions.

Domains are often defined either using a structural definition as “independently folding units” or with an evolutionary based definition as “independently evolving units”. Although the two definitions are quite different, in many cases the resulting domain families are equivalent, indicating that domains are a fundamental unit of both protein structure and evolution.1 A number of resources, such as the structurally based databases, SCOP,2 CATH3 and FSSP4 or evolutionary based databases, SMART,5 ProDom6 and Pfam,7 provide information about protein domains. These databases are regularly updated and can be used to detect domains.

Today the complete sequences of more than 150 genomes, of which about 20 are eukaryotic, are available. By comparative analysis between the genomes, a better understanding of the differences between the three kingdoms of life has been obtained. In particular, it has been noticed that the proteomes of higher eukaryotes are significantly more complex than the prokaryotic proteomes.
Since the first complete genome sequences were made available, a number of groups have studied the distribution of domains in different genomes8–10 and a particular interest has been directed toward differences between prokaryotes and eukaryotes.9 One notable observation is that the fraction of multi-domain proteins is distinctly larger in eukaryotes, ranging from two thirds to 80%.9 Already in the first studies of protein domains, it was shown that the distribution of domain families exhibits power-law behavior, i.e. that a few domains exist in many copies while most exist in only a few.11 Such a distribution can be explained by the use of a simple model based on gene-duplication events.12 A similar distribution is seen for the number of combination partners, where a few domain families combine with many different domains while most domain families have only one or a few partners.9,13

Orphans14 (or ORFans15) are proteins with no (or very few) homologs, i.e. proteins without any known domains. As the number of completely sequenced genomes has increased, the number of orphan genes has also grown, while the fraction they represent is slowly diminishing. It was recently reported in a study covering 60 microbial genomes that about 14% of the genes are orphans.16 There are two opposing explanations for how the large number of orphan proteins has evolved. Either new proteins are created spontaneously14 or the orphan proteins have evolved too far away from their closest neighbors to be detected.17 Both explanations raise fundamental questions about protein evolution. For the first scenario the process behind the spontaneous creation is certainly not well understood, while for the second scenario it is not clear what has happened to the intermediate proteins during the evolutionary process. In addition to orphan proteins there are segments of proteins that cannot be assigned to a known domain family, but they may still contain a domain and are here designated orphan domains.
Besides the two possible explanations as to why no known domains are found in orphan proteins, another reason why a sequence is not matched to a domain is that the region might actually belong to one of the neighboring domains but is not aligned to the HMM. This boils down to the problem of defining what determines the borders of a domain, especially when the domain contains non-conserved elements at the C or N termini. It is also well known that domains might undergo quite substantial changes,18 for instance significantly increasing/decreasing the length of loops or adding/deleting complete secondary structure elements, without domain rearrangements. Therefore, short unassigned regions might belong to the bordering domains but have either been missed in the assignment procedure or belong to a non-conserved part of the domain family. Several examples of domains that have been inserted within other domains also exist19 which may complicate domain assignments. Another complication is that regions where no domains can be detected may have low sequence complexity or disordered structure that may be under a different evolutionary pressure.

As the unassigned regions constitute a major part of the genomes they affect the outcome of comparative genomics studies. For instance the predicted ratio of single versus multi-domain proteins is strongly dependent on how unassigned regions are treated. We try to complement earlier studies by comparing different methods for handling the problem of unassigned regions and by using two different domain definitions, Pfam and SCOP. We compare structural and evolutionary aspects of regions of the proteomes assigned to well-defined domains, less well-defined domains and the orphan regions. Finally we provide a more detailed estimate of the fraction of single and multi-domain proteins in the three kingdoms of life.
Results

Protein coverage

We have analyzed the proteomes from 21 completely sequenced organisms: seven eukaryotes, seven bacteria and seven archaea. Domains were assigned in all species using both the Pfam7 database and the Superfamily database20 for SCOP domains.2 For an overview of the assignment procedure, see Figure 1. Out of the more than 170,000 proteins studied, 65% were assigned one or more SCOP domains and in 70% a Pfam-A domain could be detected. To extend the Pfam-A assignments we have included the less well characterized Pfam-B domain families, and the parts that were not assigned to SCOP domains were clustered using the Mkdom-2.0 domain finding algorithm.21 These clustered domains are below referred to as Mkdom After SCOP (MAS) domains. The SCOP+MAS assignments cover 86% of the proteins, and Pfam-A+B cover 90%, see Table 1. About 3% of the proteins were assigned by SCOP+MAS but not by Pfam, hence in total 93% of the proteins could be assigned to a domain. Clustering of the unassigned parts after Pfam-A and Pfam-B increased the total coverage by less than one percentage point, and was therefore ignored.

On average, the assignments gave 1.6 SCOP, 1.9 Pfam-A, 2.8 SCOP+MAS and 2.8 Pfam-A+B domains per protein. The average number of domains per protein is higher in eukaryotes than in prokaryotes, which may be a consequence of eukaryotic proteins being longer (average lengths 460 versus 300 amino acid residues). As can be expected, longer proteins contain more domains than shorter ones, with an average of more than five domains in proteins longer than 600 residues, while about half of all proteins shorter than 100 residues have no domain assignments. Domain assignments are provided, together with predicted secondary structures, on the web†.

Figure 1. The procedure for domain assignment. (i) To each protein sequence, domains are assigned using Pfam-A or SCOP. (ii) Remaining unassigned regions are further assigned with HMM and Blast to Pfam-B or with clustering to MAS (Mkdom after SCOP) domains. (iii) Unassigned regions are divided into orphan domains (OD) if they are longer than the cut-off and short domain adjacent regions (DARs) if they are shorter (here using a cut-off of 100 residues). If a region is twice as long as the cut-off, two OD are assigned. Proteins with no domains assigned are termed orphan proteins, where the number of domains is counted as the number of OD that fit into the sequence. (iv) Finally, the domain assignments are analysed. The number of domains is calculated for each of the two assignment methods, including orphan domains. The same calculations are also done without Pfam-B/MAS assignments, only calculating the number of Pfam-A/SCOP domains and orphan domains.

Table 1. Summary of domain assignments

                      Pfam-A                   Pfam-A+B                 SCOP                     SCOP+MAS
Kingdom   Proteins    AssP(%)  CAssP(%)  AvD   AssP(%)  CAssP(%)  AvD   AssP(%)  CAssP(%)  AvD   AssP(%)  CAssP(%)  AvD
Archaea   13202       71       54        1.4   89       82        1.8   63       46        1.2   79       66        1.6
Bacteria  20443       74       57        1.4   89       82        1.8   61       43        1.3   76       61        1.7
Eukarya   142672      69       33        2.0   90       68        3.0   66       30        1.7   88       65        3.0
All       176317      70       38        1.9   90       71        2.8   65       33        1.6   86       65        2.8

Results for four different methods, Pfam-A, Pfam-A+B, SCOP and SCOP+MAS, showing the fraction of proteins with a domain assignment (AssP), the fraction of proteins that are completely covered by assigned domains with no unassigned stretch larger than 100 residues (CAssP) and the average number of assigned domains per assigned protein (AvD).

Amino acid coverage

Although a majority of the proteins are predicted to contain one or more domains, on the residue level a fairly large fraction of the proteomes remains unassigned.
We have divided the proteomes into different groups: (i) assigned to a Pfam-A or SCOP domain, (ii) assigned to a Pfam-B or MAS domain, (iii) unassigned short (<100 residues) domain adjacent regions (DARs), (iv) long (>100 residues) unassigned regions (orphan domains) and (v) proteins without any domains assigned (orphan proteins). In Figure 2 it can be seen that about half of the residues are assigned to a Pfam-A or SCOP domain and that Pfam-A covers a larger fraction than SCOP in prokaryotic, but not in eukaryotic genomes. Including the assignments from Pfam-B or MAS increases the coverage to 65–75%. Interestingly, the highest coverage is found in proteins with 300–500 residues, since many of the shorter proteins have no domain assignments, while the longer proteins are only partially covered. Also, a larger fraction of the longest proteins are covered by Pfam-B/MAS domains.

After domain assignments with Pfam-A+B or SCOP+MAS 22–36% of the proteomes remain completely unassigned. Domain-adjacent regions constitute 9–13% of the residues with a similar fraction in all kingdoms. The orphan domains, on the other hand, are more abundant in eukaryotes, especially when Pfam assignments are considered. Finally, for all assignments 5–6% of the residues are found in orphan proteins, except after the SCOP+MAS assignments in prokaryotes where they are twice as abundant.

† http://www.sbc.su.se/~arne/domains

Figure 2. Amino acid coverage with Pfam-A+B or SCOP+MAS is shown for the three kingdoms. The unassigned parts are divided into short domain adjacent regions (DAR), long unassigned regions (orphan domains) and unassigned proteins (orphan proteins). DARs are regions shorter than 100 residues that remain after domain assignments while orphan domains are longer regions and orphan proteins are whole proteins without any assigned domains.

Domain lengths

A domain is assigned to a protein after alignment of the protein sequence with a hidden Markov
model representing the domain family. As the domain is assumed to cover the aligned region, which is often shorter than the HMM, many domain assignments are shorter than their corresponding models, see Figure 3. In fact, the Pfam-A assignments are on average 7% shorter than the HMMs while the SCOP assignments are 10% shorter. However, in the very small fraction of domains known to contain inserts, the assigned domains are sometimes considerably longer than the corresponding HMMs.

Figure 3. The difference between length of domain assignments and length of HMM seeds, where alignments shorter than the HMM have a negative residue difference. As can be seen, many Pfam-A and SCOP domain assignments are shorter than the corresponding HMM.

Table 2. Fraction of PRC hits between domain family HMMs for three different E-values (10^-5, 10^-3 and 10^-1)

PRC E-value                   10^-5 (%)   10^-3 (%)   10^-1 (%)
Pfam-A to Pfam-A              3.5         9.1         27.3
Pfam-B to Pfam-A              3.8         8.5         23.1
SCOP to SCOP (Superfamily)    2.4         4.4         9.5
SCOP to SCOP (Fold)           1.0         2.3         6.1
MAS TM to SCOP                0.9         3.8         33.0
MAS non-TM to SCOP            9.6         14.5        18.3

PRC alignments comparing Pfam-B to Pfam-A, Pfam-A to Pfam-A, SCOP to SCOP and MAS to SCOP were done. SCOP hits are displayed as hits to a family from another superfamily and from another fold. The MAS domains were divided into transmembrane families (TM) and non-transmembrane families (non-TM). Many MAS TM families had high hit-ratios to SCOP at higher E-values; this can be accounted for by hits between transmembrane helices.41

On average the MAS domains are the shortest (median length 78 residues/average length 105 residues) followed by Pfam-B (81/116), while the better studied SCOP (114/143) and Pfam-A (155/194) domains are longer, see Figure 4. The lengths of the domains from SCOP, MAS and Pfam-B all follow a similar pattern with a peak close to 60 residues, while Pfam-A domains have a sharp peak at 25.
A few very abundant repeating domain families such as the C2H2/C2HC zinc fingers can largely explain this peak. It can also be noted that there are fewer long (>200 residues) domains assigned by Pfam-B and MAS. After domain assignments with SCOP and Pfam, about half of the remaining unassigned regions were shorter than 50 residues (49–55%). With the inclusion of Pfam-B/MAS this fraction increased to 79–81%, while only 10–12% were longer than 100 residues (data not shown).

Detection of distant homologs

During this study, it was noted that many Pfam-B/MAS families show weak similarity to a Pfam-A/SCOP family. Therefore, the PRC program for aligning and comparing HMMs was used to evaluate distant homology between domain families. According to the SCOP definitions, two domains that belong to different superfamilies should not be homologous,2 but there are a few exceptions.20 To estimate the false positive rate we applied PRC to SCOP superfamilies and found that 4.4% (2.3%) of the superfamilies (folds) show significant similarity (E-value < 10^-3) to another superfamily (fold), see Table 2. As we know that some of these high-scoring pairs are likely to be distantly related, see Materials and Methods, we decided to keep using an E-value threshold of 10^-3 to estimate the number of distantly related domain families.

All Pfam-B HMMs were aligned against the Pfam-A library and for 8.5% of the Pfam-B domains significant homology to a Pfam-A family was found, see Table 2. It was also noted that a similar fraction of the Pfam-A families are homologous to another Pfam-A family, confirming the well-known fact that several Pfam-A families are related. Further, the 500 largest MAS domain families were aligned against the Superfamily database. Since few transmembrane (TM) protein structures have been resolved, SCOP is strongly underrepresented in transmembrane domains; consequently MAS domains frequently contain TM-regions.
Therefore, the MAS domains were divided into two groups: TM and non-TM domains. Of the non-TM domains, 14.5% show similarity to a SCOP superfamily, while, as expected, only a few TM domains were detected as homologs. With a less strict threshold (10^-1), more Pfam-B and MAS domains give hits (15–25%). However, only 22% of all SCOP families give hits to another member of the same superfamily at the threshold 10^-3 (data not shown); therefore the estimates of distant homology made above should be seen as a lower limit.

Discussion

A domain family in Pfam-A is based on a manually curated multiple sequence alignment and the detection of novel domains is often based on clustering of unassigned regions. Therefore, the definitions of domain boundaries are evolutionary based, i.e. domains that have been seen alone or in combination with other domains will be identified as unique domains, and domain boundaries which are not well conserved are often not included in the HMMs. In contrast, the definitions of domain families in SCOP are based on structure. Although they have different origins, the domain definitions of Pfam-A and SCOP are frequently quite similar.1 There is one significant distinction, however, in the fact that domain families in SCOP are organized in a hierarchical fashion as folds, superfamilies and families. According to the definitions in SCOP, two domains that belong to different superfamilies do not have features that “suggest that a common evolutionary origin is probable”.2 In contrast, two domains belonging to different Pfam-A families could very well be homologs, see Table 2. In addition, the assignment of domains with the two methods is different, as each Pfam-A family is represented by one HMM while one SCOP superfamily may be represented by several HMMs.20 The difference is notable in the clustering level of Pfam-A and SCOP as demonstrated by the number of domain families that we detect (5385 and 1177, respectively), i.e.
a single SCOP domain superfamily frequently corresponds to several Pfam-A domain families.

The same domain families are sometimes represented differently by Pfam-A and SCOP. The length of a domain family may vary between the two definitions and Pfam-A domain families are on average longer than SCOP domain families. It is not uncommon that one domain family from Pfam-A has no counterpart in SCOP, as SCOP only contains domain families of known structure. This is especially evident for membrane proteins which are poorly represented in SCOP.1 Still, for a general comparison, it is of marginal importance if homology or structure is used to define a domain and, in spite of some differences, our estimates of coverage and number of multi-domain proteins are similar using either domain definition, see Figures 2, 4, 6 and 7. When repeating domains are considered, however, the choice of domain definition has a greater impact on the results, due to the differences in clustering level, see Figure 8.

Figure 4. Distribution of the lengths of assigned domain families (Pfam-A, Pfam-B, SCOP and MAS). There are more short family assignments with Pfam-A than with the others and more families of a length greater than 180 residues are assigned with Pfam-A and SCOP than with Pfam-B and MAS.

Pfam-B and MAS domains

Most earlier genome-comparative studies of protein domains are based on the well studied Pfam-A or SCOP domains that cover less than half of the proteomes. In contrast to Pfam-A and SCOP, the families from Pfam-B and MAS are not manually curated and are of varying quality. Many of these assignments probably do not correspond to unique and complete domain families. The families are on average shorter than in Pfam-A and SCOP, but show a length distribution similar to that of SCOP with a peak around 60 residues, see Figure 4. We noted that in many cases the domains detected by Pfam-B or MAS are homologous to a Pfam-A or SCOP domain family.
This can be exemplified by the human protein ENSP00000276208 that has 14 domains, according both to Pfam-A+B and to SCOP+MAS, of which the majority belong to the Immunoglobulin (Ig) family (PF00047/b.1.1), see Figure 5. Some Pfam-A Ig domains are not recognized as members of the SCOP Ig-family or vice versa but are instead assigned to a Pfam-B or MAS family. Therefore, it seems likely that the Pfam-B/MAS domains are distantly related to the Ig-domain family, but have diverged too far to be detected. We have attempted to quantify the similarity of Pfam-B families to Pfam-A using a sensitive HMM-HMM alignment program (PRC) and we noted that at least 8% of the Pfam-B domain families are distantly related to a Pfam-A family, see Table 2. In contrast, many of the most common MAS families are found in membrane proteins, of which several have not been structurally solved, and hence do not have any counterpart in SCOP, while more than 14% of the globular MAS domains are homologous to SCOP families. Whether orphan domains and orphan proteins also contain a large fraction of distant homologs to better studied domains is obviously not known, but in a recent study by Siew & Fischer22 it was shown that most of the recently solved structures of orphan proteins show similarity to already known protein domains.

Structural features

The secondary structure of the unassigned regions is of particular interest as it might reveal clues about function and evolutionary origin. To obtain a better understanding of the differences between the classes of assigned and unassigned regions in Figure 2, structural features were predicted using state-of-the-art methods. Each residue was assigned to one of six “structural” states, assigned in the following order: (i) transmembrane regions, (ii) disordered regions, (iii) low-complexity regions, (iv) α-helices, (v) β-sheets and (vi) coils, see Figure 6.
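The ordered assignment of structural states can be read as a priority rule: a residue predicted to be in several states receives the first state in the list that applies. A minimal Python sketch of that reading follows. The input format (a dictionary mapping each state name to a set of residue indices) and the state names are hypothetical, and treating residues with no prediction as coil is an assumption rather than something stated in the text.

```python
# Priority order as listed in the text: the first matching state wins.
PRIORITY = [
    "transmembrane",
    "disorder",
    "low_complexity",
    "helix",
    "sheet",
    "coil",
]


def assign_states(length, predictions):
    """Return one structural state per residue.

    length      -- number of residues in the protein
    predictions -- dict mapping a state name to the set of residue indices
                   (0-based) predicted to be in that state (hypothetical format)
    """
    states = []
    for i in range(length):
        for state in PRIORITY:
            if i in predictions.get(state, ()):
                states.append(state)
                break
        else:
            # Assumption: residues without any prediction default to coil.
            states.append("coil")
    return states


example = assign_states(
    5,
    {"transmembrane": {0, 1}, "disorder": {1, 2}, "helix": {2, 3}},
)
print(example)
```

In the example, residue 1 is predicted as both transmembrane and disordered and is counted as transmembrane, while residue 2 is predicted as both disordered and helical and is counted as disordered.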
Transmembrane regions are clearly under-represented in SCOP domains (2%) but they are instead included in the MAS domains (15%). In archaea and bacteria, transmembrane residues constitute as much as 22% of the MAS domains. Prokaryotic proteins contain a larger fraction of transmembrane residues than eukaryotes, which may explain why SCOP covers less of the proteins than Pfam-A in prokaryotes. In general, the unassigned regions resemble the assigned regions quite well with regard to transmembrane content, but the Pfam-A domains seem to be slightly over-represented in transmembrane regions.

As can be expected, increased portions of disordered and low-complexity regions are found in the sequences not related to the well studied Pfam-A or SCOP domains. Orphan proteins contain the same fraction of disorder as do most proteins with domain assignments. Disordered regions in orphan proteins could be located both within domains and in linker regions, hence we do not know if there are structured domains in orphan proteins.

Figure 5. Domain assignment for the human protein ENSP00000276208. The upper assignments are Pfam-A (Ig) and Pfam-B (B1–B4) while the lower are SCOP (Ig) and MAS (M1–M3). Both Pfam and SCOP/MAS give 14 domains, most of which belong to the Immunoglobulin (Ig) family. The Pfam-B/MAS domains (B1, B3 and M1) that correspond to SCOP/Pfam-A Ig-domains are likely to be distantly related members of the Ig-family.

Figure 6. Structural features for regions assigned by Pfam-A/SCOP or Pfam-B/MAS and unassigned sequences: DARs, orphan domains (OD) and orphan proteins (OP). The features are depicted in the following order: disorder, low complexity, transmembrane region, coil, β-sheet and α-helix (from top to bottom). The distribution in eukaryotes is displayed in (a), and for prokaryotes in (b).
As has been documented earlier,23,24 we found that disordered regions are much more frequent in eukaryotes, including yeast, than in genomes of the other kingdoms, with as much as 48% in eukaryotic orphan domains, compared to 12–13% in prokaryotes, and even the well defined SCOP/Pfam-A domains contain more disorder in eukaryotes. It has been shown that disordered regions may function as flexible connecting loops between domains.23 As can be seen in Figure 6, the sequence class with most disorder in prokaryotes is the domain adjacent regions (DAR), leading us to believe that these are linkers between domains. In eukaryotes, on the other hand, a similar frequency of disordered stretches is found in the orphan domains and in the domain adjacent regions, as well as Pfam-B/MAS domains. This may indicate that some of the disordered regions in eukaryotes are linkers between domains while others are part of domains. In many cases disordered regions have been found within domains where they perform important functions such as binding to other molecules.24,25 Disordered regions have also been shown to evolve faster than other parts of the proteomes,25,26 especially in higher eukaryotes. This could explain why orphan domains, which have a large fraction of disorder, have no or few detected homologs. Others have suggested that many disordered protein regions are actually remnants of protein evolution, which are non-functional but have not yet been lost through selection.27

Eukaryotes contain more multi-domain proteins

Next, we wanted to estimate the number of proteins containing a specific number of domains in each kingdom of life including information about the unassigned regions. In order to predict the number of domains in a protein it is necessary to decide when to define an unassigned region as an individual domain. Short regions between domains may belong to linkers or connecting loops, but they may also be part of the bordering domain.
It is difficult to define the domain borders, and many non-conserved C or N-terminal elements might be missed. This is the case for many Pfam-A domain families where non-conserved elements close to the domain borders are not included in the HMMs,1 but also for the many families where the obtained assignments are shorter than the HMMs. As mentioned in Results, Pfam-A and SCOP assignments are on average 7% and 10% shorter than their corresponding HMMs, hence the domain assignments have missed many non-conserved regions at the ends. It is also well known that domains might increase and decrease in size without the need for domain rearrangements.18 Naturally, we cannot rule out the possibility that these short unassigned segments contain small domains, as for instance has been seen by the insertion of a minidomain in N-acetylglucosamine-6-phosphate deacetylase from T. maritima.28 We believe, however, that it is an exception rather than the rule that these short regions constitute a whole domain in its traditional meaning.

In a previous study by Apic et al.,9 based on SCOP domain assignments, the number of multi-domain proteins was predicted to be 80% in eukaryotes and 65% in bacteria and archaea. Similar results were obtained by Liu et al.10 using both evolutionary and structural domains. Levitt and Gerstein8 on the other hand, predicted that two thirds of the eukaryotic proteomes consist of multi-domain proteins. None of these estimates is robust, however, since they are obtained only from proteins containing assigned domains, corresponding to 31–69% of all proteins. In contrast, we have assigned domains in up to 90% of all proteins. In the earlier studies, quite conservative cut-offs (30–70 amino acid residues) were used to assume that an unassigned region constitutes a domain.8,9 Here, the predicted number of domains in each protein has been studied using different cut-offs ranging from 30–200 residues.
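The dependence of the domain count on the cut-off can be illustrated with a short Python sketch of the counting rule given in the caption to Figure 1: every unassigned region contributes as many orphan domains as fit into it, and shorter regions contribute none. The function names and the interval format are hypothetical. Two details are assumptions, since the text does not settle them: a region exactly as long as the cut-off counts as one orphan domain, and an orphan protein shorter than the cut-off is still counted as a single domain.

```python
def count_domains(length, assigned, cutoff=100):
    """Estimate the number of domains in one protein.

    length   -- protein length in residues
    assigned -- non-overlapping (start, end) half-open intervals of
                assigned domains (hypothetical input format)
    cutoff   -- minimum length for an unassigned region to count as
                an orphan domain
    """
    if not assigned:
        # Orphan protein: as many orphan domains as fit into the sequence.
        # Assumption: a protein shorter than the cut-off counts as one domain.
        return max(1, length // cutoff)
    n = len(assigned)
    pos = 0
    for start, end in sorted(assigned):
        # Unassigned stretch before this domain; stretches below the cut-off add 0.
        n += max(0, start - pos) // cutoff
        pos = max(pos, end)
    # Unassigned C-terminal stretch.
    n += max(0, length - pos) // cutoff
    return n


def multi_domain_fraction(proteins, cutoff):
    """Fraction of proteins predicted to have two or more domains."""
    counts = [count_domains(length, doms, cutoff) for length, doms in proteins]
    return sum(c >= 2 for c in counts) / len(counts)


# Toy data, not taken from the study: (length, assigned domain intervals).
toy = [(300, [(0, 100)]), (90, []), (250, [(10, 110), (120, 240)]), (400, [])]
for cutoff in (30, 100, 250):
    print(cutoff, multi_domain_fraction(toy, cutoff))
```

On this toy set the multi-domain fraction falls as the cut-off grows, which is the qualitative behavior the cut-off analysis is meant to probe.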
We have also used four different domain representations: (i) Pfam-A assignments and unassigned regions larger than the cut-off, (ii) Pfam-A+Pfam-B assignments and unassigned regions, (iii) SCOP and unassigned regions or (iv) SCOP+MAS and unassigned regions. In Figure 7 it can be seen how the estimated number of multi-domain proteins increases as the cut-off length decreases. The fraction of eukaryotic proteins that are predicted to have two or more domains using Pfam decreases from 90% to 50% when the cut-off is increased from 30 to 200 amino acid residues. Interestingly, cut-off lengths of 70 and 100 residues give similar results regardless of which of the four strategies we use, in spite of the differences in average domain lengths and coverage. With shorter or longer cut-offs the variation is larger, especially for archaea and bacteria.

With a cut-off at 100 residues, we estimate that eukaryotes have 35% single-domain proteins, 20% two-domain proteins and 45% with three or more domains, see Figure 7(d). Archaea and bacteria have very similar distributions of multi-domain proteins with corresponding numbers 60%, 20% and 20%. Our results correlate well with findings of Levitt & Gerstein,8 but indicate that the proteomes contain fewer multi-domain proteins than claimed in other studies.9,10,29 We believe this to be a more accurate prediction since larger fractions of the proteomes are covered and because more attention has been invested in evaluating unassigned regions. However, depending on what cut-off for unassigned regions is used, this estimate varies. With cut-off 50 the fraction of multi-domain proteins increases to around 80% in eukaryotes and 60% in prokaryotes, while with cut-off 200 these numbers go down to 40–60% and 30–40%, respectively. As databases with known domain families are growing continuously and more orphan domains are found, we may actually get coverage in all regions where so far no domains have been detected.
Not until then can we truly estimate the number of multi-domain proteins.

Figure 7. Estimated numbers of multi-domain proteins using five different cut-offs for unassigned domains (30, 50, 70, 100 and 200) and four different domain representations: Pfam-A, Pfam-A+B, SCOP and SCOP+MAS. The results for the different kingdoms are shown for (a) archaea, (b) bacteria and (c) eukarya. (d) Fraction of single-domain and multi-domain proteins in each kingdom (A, B, and E) using a cut-off of 100 residues and the four different assignment methods, Pfam-A, Pfam-A+B, SCOP and SCOP+MAS. The histogram displays the fraction of the proteomes that have been predicted to contain one domain (black), two domains (grey) and more than two domains (white).

Figure 8. Distribution of repetitions in archaea, bacteria, yeast and multicellular organisms (Multi), showing the fraction of all residues that are contained within a two-domain repeat, a three-domain repeat, etc. Results for Pfam-A and SCOP domain assignments are displayed. Repetitions of Pfam-B and MAS are nearly non-existent due to the low clustering level, hence are not shown.

More than 8% of the multicellular proteomes consist of domain repeats

A notable feature of many multi-domain proteins is that they consist of repeats, i.e. two or more domains from the same family adjacent to each other. These repeats may have evolved through fusion of domains, like other domain combinations, but their frequencies and lengths suggest a particular mechanism for repeat formation. Some domain repeats, such as the HEAT repeats, have complex evolutionary patterns involving duplications of one or several repeats as a cassette, but also various deletions.30 Since the repeating domains behave differently from other domains, we have chosen to study them further.
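The definition of a repeat used here (two or more domains from the same family adjacent to each other) can be sketched in a few lines of Python. This is only an illustration of the definition, assuming the domains of a protein are given as a list of family identifiers in N- to C-terminal order; the function name and the example identifiers are hypothetical, and any unassigned residues between neighboring domains are ignored.

```python
from itertools import groupby


def find_repeats(families):
    """Return (family, number_of_units) for each run of two or more
    adjacent domains from the same family.

    families -- domain family identifiers in sequence order
    """
    repeats = []
    for family, run in groupby(families):
        units = len(list(run))
        if units >= 2:
            repeats.append((family, units))
    return repeats


# Hypothetical domain order for a single protein.
print(find_repeats(["Ig", "Ig", "Ig", "B1", "Ig", "Ig", "zf"]))
```

Grouping by family identifier also shows why the clustering level matters: if two related families carry different identifiers, a run of their domains is not detected as a repeat.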
Our study confirms previous results that there is clearly a higher frequency of repeats in multicellular organisms, compared to the unicellular organisms,9,31 see Figure 8, which suggests that repeats have arisen late in evolution. The higher frequency of repeats in multicellular organisms has been suggested to provide them with an extra source of variability to compensate for low generation rates.31 Repeats have an important function in large structural complexes, cell adhesion and signaling, hence may have evolved more extensively in multi-cellular organisms.9 This is demonstrated by the fact that yeast is more similar to other unicellular organisms than to the other eukaryotes with regards to repeats. The repeats in yeast cover a similar fraction of residues as bacteria and archaea, but more long repeats are found in yeast. Although repetitions constitute a large fraction of the proteomes in multicellular organisms, the larger number of multi-domain proteins could not be explained by repeating domains. The ratio of single versus multi-domain proteins was not altered markedly if repeating domains were ignored (data not shown). An effect is only observed on proteins containing many domains as they frequently contain repeats, e.g. repeats occur in 89% (Pfam-A) and 87% (SCOP) of eukaryotic proteins with more than five assigned domains, compared to 21%/28% in two-domain proteins. In total, similar proportions of the proteomes are covered with repeats using either SCOP or Pfam-A, but the number of repeating units is higher with Pfam, while SCOP gives more two-domain repeats and on average longer repeating domains. The differences in number of repeating domains can largely be explained by the differences between SCOP and Pfam domain definitions. 
Several common repeats, for instance the C2H2/C2HC zinc fingers, are assigned to longer regions with SCOP (on average 48 residues long) often overlapping several Pfam domains (average length 23 residues), hence making the number of repeating Pfam domains larger. On the other hand the congregation of multiple Pfam families into one SCOP superfamily has an impact on the number of domain families that form repeats. One such example is the most versatile SCOP superfamily, the P-loop containing nucleotide triphosphate hydrolases (c.37.1), which constitutes a large fraction of all repeats with SCOP, but in Pfam it corresponds to several families and is frequently not detected in repeats. To get a higher clustering level, Pfam families of common evolutionary origin were grouped together using the Pfam clans†. However, since most of the repeating domain families are not included in the clans, the number of repeats was increased by less than one percent (data not shown).

† ftp://ftp.sanger.ac.uk/pub/databases/Pfam/

Conclusions

Our understanding of different genomes is to a large extent based on analysis of the parts that are related to a known domain family, i.e. the parts that match a Pfam-A or SCOP domain. However, such an analysis leaves out more than half of the proteome and the origin, structure and function of these regions are to a large degree unknown. These parts can be divided into four different groups: domains that match a Pfam-B or MAS domain, short domain-adjacent regions, orphan domains and orphan proteins. Here, we show that all these groups contain a higher fraction of low-complexity and disorder than the better studied Pfam-A and SCOP domains. Interestingly, regions neighboring Pfam-A/SCOP domains seem to have slightly more disorder than orphan proteins, which have a similar fraction of disorder as partially assigned proteins, while sequences matching Pfam-B or MAS are slightly less disordered.
Between 10% and 14% of the proteins were considered orphans, and as has been observed earlier,16 many orphan proteins are short. In addition to orphan proteins, the proteomes consist of 5–15% orphan domains. The origin of the orphan proteins and domains is not known: either they have evolved too far from their nearest neighbor to be assigned to a domain family, or they have been created by some de novo mechanism. However, a sensitive search method indicates that many of the Pfam-B/MAS families are distantly related to a Pfam-A/SCOP family, supporting the assumption that many of these domains have evolved too far to be detected by current methodologies. It remains unknown whether many of the orphan regions are also distant homologs of already well-studied domains. Identifying new domains in unassigned areas and determining their boundaries is an intriguing task for further investigations that requires more structural information, improved search methods as well as genome comparisons on a larger scale. On a general level, the different domain definitions (SCOP, SCOP+MAS, Pfam-A or Pfam-A+B) provide a similar picture of domain distribution in the genomes. When repeats are studied, however, there are large differences between SCOP and Pfam. The portion that is covered with repeats is similar for the two methods, but the repeats with Pfam-A are longer and more proteins with repeats are found. In addition, we have estimated the number of multi-domain proteins in 21 fully sequenced species with the domain definitions of Pfam and SCOP with coverage in 65–70% of all proteins. This fraction increased to 86–90% with the inclusion of Pfam-B and MAS domains. Regardless of what domain definition was used, 65% of the proteins in eukaryotes and 40% in bacteria and archaea were predicted to contain two or more domains, when each unassigned region with 100 residues was assumed to contain a domain.
These results confirm that multi-domain proteins are more common in eukaryotic genomes. Interestingly, the yeast genomes have similar fractions of multi-domain proteins and disordered regions as the other eukaryotes, whereas they resemble the prokaryotes with regard to repetitions. This suggests that repetitions are important for multicellularity, e.g. in cell–cell contacts, while complex domain architectures and disordered regions have intracellular functions important for all eukaryotic organisms.

Materials and Methods

Species

We have analyzed the proteomes of 21 species, seven from each kingdom.

Eukarya: Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, Saccharomyces cerevisiae, Schizosaccharomyces pombe.
Bacteria: Escherichia coli O157:H7, Pseudomonas aeruginosa, Bacillus subtilis, Rickettsia conorii, Mycoplasma pulmonis, Prochlorococcus marinus, Treponema pallidum.
Archaea: Aeropyrum pernix, Methanococcus jannaschii, Nanoarchaeum equitans, Pyrococcus abyssi, Thermoplasma volcanium, Archaeoglobus fulgidus, Methanosarcina mazei.

The species in each kingdom have been chosen from distant taxonomic groups. The archaeal organisms come from different taxonomic lineages with life requirements differing from the hyperthermophilic aerobe A. pernix to the methylotrophic marine methanogen M. mazei. Bacteria have also been chosen from different parts of the tree with species belonging to proteobacteria, cyanobacteria, firmicutes and spirochaetes. The eukaryotes can be further divided between unicellular and multicellular organisms, where an insect, a plant, a worm and two vertebrates represent the latter. The microbial sequences have been collected from the National Center for Biotechnology Information (NCBI)† and the eukaryotic genomes come from Ensembl‡ except for D. melanogaster which comes from FlyBase32§ and S. cerevisiae from Saccharomyces Genome Database33¶.
SCOP and Pfam-A assignments

The domain definitions used in this study come from Pfam7 and from the Structural Classification of Proteins (SCOP) database.2 For both databases the domain assignments were made by scanning libraries of HMMs against the protein sequences using HMMER-2.0‖. For the Pfam-A assignments HMMs from Pfam version 12 were used, while for the SCOP assignments the superfamily database20 corresponding to SCOP version 1.63 was used. A domain was assigned to a region of a gene if a match to a domain HMM with an E-value better than 0.1 was observed. It should be noted that the results differed very little when a stricter cut-off (10⁻³) was used. For some eukaryotes (H. sapiens, M. musculus, C. elegans and A. thaliana) the Pfam assignments were collected from Ensembl^a.

† ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/
‡ ftp://ftp.ensembl.org/
§ ftp://flybase.net/
¶ ftp://ftp.yeastgenome
‖ http://hmmer.wustl.edu/

Pfam-B assignments

Regions without Pfam-A assignments were searched against the Pfam-B database. Pfam-B families are generated from ProDom,6 which is automatically created through clustering of proteins in SwissProt and TrEMBL and any overlap with Pfam-A families has been removed. Multiple alignments of Pfam-B families consisting of at least five sequences were used to build HMMs using HMMER-2.0. Smaller Pfam-B families were collected in a database and regions without a hit to either Pfam-A or Pfam-B HMMs were searched against this database using BLAST.34

Mkdom after SCOP assignments (MAS)

SCOP assignments were complemented with sequence based clustering. For this Mkdom 2.021 was used, i.e. the same clustering procedure that is employed to generate the ProDom database which is the basis for Pfam-B. Mkdom is based on recursive homology search with PSI-BLAST, starting with the shortest sequence and extracting all its homologs.
Regions of more than 50 residues that remain unassigned after SCOP assignments were extracted and clustered with Mkdom 2.0. All clusters of less than three sequences were discarded. We name these domain families MAS (Mkdom After SCOP).

Insertions

Domains are usually adjacent to each other but there are also domains that are inserted into other “parental” domains. Inserts have been predicted to occur in 9% of non-redundant PDB sequences.19 Overlap was not allowed in this study and inserts were hence excluded. Assignments retrieved from Ensembl included inserts in approximately 0.1% of the proteins, but were ignored.

Structural features

Proteins with assignments were divided into Pfam-A/SCOP domains, Pfam-B/MAS domains and unassigned regions. The unassigned regions were further divided between short (<100) domain adjacent regions (DAR) and orphan domains. The secondary structure of these sequences was examined using different prediction programs. Transmembrane regions were predicted with TMHMM 2.0,35 secondary structure with PSIPRED 2.3,36 disordered regions with DISOPRED 2.124 and low complexity with SEG.37 They were predicted in the following order: (i) transmembrane, (ii) disorder, (iii) low complexity and (iv) secondary structure. Default parameters were used with all programs. All predicted structural features, as well as the domain assignments, are available on the web^b.

^a www.ensembl.org
^b http://www.sbc.su.se/~arne/domains
^c http://www.supfam.org/PRC/

Detection of distant homologs

The HMM-HMM alignment program PRC 1.3.1^c was used to detect distant homology between domain families. For the 500 largest MAS domain families we created multiple sequence alignments using Clustal W,38 then hidden Markov models were built using HMMER in a standard way. E-values were not reported from PRC, but instead calculated from the distribution of the scores from non-related sequences in the same way as in FASTA39 using no correction for the length of the domain families.
In an earlier study40 it was found that the exact choice of method did not influence the results significantly. The MAS domains were divided into membrane and non-membrane proteins by predicting the number of transmembrane (TM) helices using TMHMM35 for each member of the domain family. TM containing MAS domains were defined as domains that on average contained one or more TM regions. Many of the SCOP hits can be false, but there are also some superfamilies that are actually related.20 In our results there were some folds that gave many hits at low E-values between different superfamilies and folds, such as the three folds: 6-bladed beta-propeller (b.68), 7-bladed beta-propeller (b.69) and 8-bladed beta-propeller (b.70). Another example is the fold α–α superhelix (a.118) where there were high scoring hits between several of the superfamilies.

Acknowledgements

This work was supported by grants from the Swedish Natural Sciences Research Council, the Carl Trygger foundation, and the European Union.

References

1. Elofsson, A. & Sonnhammer, E. L. L. (1999). A comparison of sequence and structure protein domain families as a basis for structural genomics. Bioinformatics, 15, 480–500.
2. Murzin, A., Brenner, S., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
3. Orengo, C., Michie, A., Jones, S., Jones, D., Swindells, M. B. & Thornton, J. (1997). Cath-a hierarchical classification of protein domain structures. Structure, 5, 1093–1108.
4. Holm, L. & Sander, C. (1994). The fssp database of structurally aligned protein fold families. Nucl. Acids Res. 22, 3600–3609.
5. Schultz, J., Milpetz, F., Bork, P. & Ponting, C. P. (1998). Smart, a simple modular architecture research tool: identification of signaling domains. Proc. Natl Acad. Sci. USA, 95, 5857–5864.
6. Servant, F., Bru, C., Carrère, S., Courcelle, E., Gouzy, J., Peyruc, D. & Kahn, D. (2002). Prodom: automated clustering of homologous domains. Brief. Bioinfor. 3, 246–251.
7. Sonnhammer, E., Eddy, S. & Durbin, R. (1997). Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins: Struct. Funct. Genet. 28, 405–420.
8. Gerstein, M. & Levitt, M. (1998). Comprehensive assessment of automatic structural alignment against a manual standard, the scop classification of proteins. Protein Sci. 7, 445–456.
9. Apic, G., Gough, J. & Teichmann, S. A. (2001). Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol. 310, 311–325.
10. Liu, J. & Rost, B. (2004). Chop proteins into structural domain-like fragments. Proteins: Struct. Funct. Bioinfor. 55, 678–688.
11. Gerstein, M. (1997). A structural census of genomes: comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure. J. Mol. Biol. 274, 562–576.
12. Qian, J., Luscombe, N. M. & Gerstein, M. (2001). Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model. J. Mol. Biol. 313, 673–681.
13. Apic, G., Gough, J. & Teichmann, S. A. (2001). An insight into domain combinations. Bioinformatics, 17, S83–S89.
14. Rost, B. (2002). Did evolution leap to create the protein universe? Curr. Opin. Struct. Biol. 12, 409–416.
15. Fischer, D. & Eisenberg, D. (1999). Finding families for genomic orfans. Bioinformatics, 15, 759–762.
16. Siew, N. & Fischer, D. (2003). Analysis of singleton orfans in fully sequenced microbial genomes. Proteins: Struct. Funct. Genet. 53, 241–251.
17. Ramani, A. K. & Marcotte, E. M. (2003). Exploiting the co-evolution of interacting proteins to discover interaction specificity. J. Mol. Biol. 327, 273–284.
18. Grishin, N. V. (2001). Fold change in evolution of protein structures. J. Struct. Biol. 134, 167–185.
19. Aroul-Selvam, R., Hubbard, T. & Sasidharan, R. (2004). Domain insertions in protein structures. J. Mol. Biol. 338, 633–641.
20. Gough, J., Karplus, K., Hughey, R. & Chothia, C. (2001). Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol. 313, 903–919.
21. Gouzy, J., Corpet, F. & Kahn, D. (1999). Whole genome protein domain analysis using a new method for domain clustering. Comp. Chem. 23, 333–340.
22. Siew, N. & Fischer, D. (2004). Structural biology sheds light on the puzzle of genomic orfans. J. Mol. Biol. 342, 369–373.
23. Liu, J., Tan, H. & Rost, B. (2002). Loopy proteins appear conserved in evolution. J. Mol. Biol. 322, 53–64.
24. Ward, J. J., Sodhi, J. S., McGuffin, L. J., Buxton, B. F. & Jones, D. T. (2004). Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J. Mol. Biol. 337, 635–645.
25. Brown, C., Takayama, S., Campen, A., Vise, P., Marshall, T., Oldfield, C. et al. (2002). Evolutionary rate heterogeneity in proteins with long disordered regions. Mol. Evol. 55, 104–110.
26. Pandey, N., Ganapathi, M., Kumar, K., Dasgupta, D., Sutar, S. & Dash, D. (2004). Comparative analysis of protein unfoldedness in human housekeeping and non-housekeeping proteins. Bioinformatics, 20, 2904–2910.
27. Lovell, S. (2003). Are non-functional, unfolded proteins (“junk proteins”) common in the genome? FEBS Letters, 554, 237–239.
28. Bradley, P., Chivian, D., Meiler, J., Misura, K. M., Rohl, C. A., Schief, W. R. et al. (2003). Rosetta predictions in casp5: successes, failures, and prospects for complete automation. Proteins: Struct. Funct. Genet. 6, 457–468.
29. Teichmann, S., Park, J. & Chothia, C. (1998). Structural assignments to the mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements. Proc. Natl Acad. Sci. USA, 95, 14658–14663.
30. Andrade, M., Petosa, C., O’Donoghue, S. I., Muller, C. W. & Bork, P. (2001). Comparison of arm and heat protein repeats. J. Mol. Biol. 309, 1–8.
31. Marcotte, E. M., Pellegrini, M., Yeates, T. O. & Eisenberg, D. (1999). A census of protein repeats. J. Mol. Biol. 293, 151–160.
32. Consortium, T. F. (2003). The flybase database of the drosophila genome projects and community literature. Nucl. Acids Res. 31, 172–175.
33. Dolinski, K., Balakrishnan, R., Christie, K. R., Costanzo, M. C., Dwight, S. S., Engel, S. R. et al. (2004). Saccharomyces genome database. Methods Enzymol. 266, 554–571.
34. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
35. Sonnhammer, E., von Heijne, G. & Krogh, A. (1998). A hidden markov model for predicting transmembrane helices in protein sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. 6, 175–182.
36. Jones, D. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292, 195–202.
37. Wootton, J. & Federhen, S. (1996). Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 266, 554–571.
38. Thompson, J. D., Higgins, D. & Gibson, T. (1994). Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22, 4673–4680.
39. Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence analysis. Proc. Natl Acad. Sci. USA, 85, 2444–2448.
40. Wallner, B., Fang, H., Ohlson, T., Frey-Skött, J. & Elofsson, A. (2004). Using evolutionary information for the query and target improves fold recognition. Proteins: Struct. Funct. Genet. 54, 342–350.
41. Hedman, M., Deloof, H., von Heijne, G. & Elofsson, A. (2002). Improved detection of homologous membrane proteins by inclusion of information from topology prediction. Protein Sci. 11, 652–658.

Edited by J. Thornton

(Received 27 September 2004; received in revised form 31 January 2005; accepted 2 February 2005)