ã 2003 Oxford University Press Nucleic Acids Research, 2003, Vol. 31, No. 2 653±660 DOI: 10.1093/nar/gkg156 The phylogenetic diversity of eukaryotic transcription Richard M. R. Coulson* and Christos A. Ouzounis Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK Received September 16, 2002; Revised October 30, 2002; Accepted November 18, 2002 ABSTRACT Eukaryotic transcription is a highly regulated process involving interactions between large numbers of proteins. To analyse the phylogenetic distribution of the components of this process, six crown eukaryote group genomes were queried with a reference set of transcription-associated (TA) proteins. On average, one in 10 proteins encoded by these genomes were found to be homologous to sequences in the reference set. Analysis of families identi®ed using an accurate sequence clustering algorithm and containing both TA proteins and eukaryotic sequences showed that in two-thirds of the families the homologues originate from a single kingdom. Furthermore, in only 15% of the fungalspeci®c clusters are the homologues present in both budding and ®ssion yeast, as compared with the metazoan-speci®c clusters where 53% of the homologues originate from two or more species. Families whose members comprise general transcription factor or RNA polymerase subunits exhibit a low degree of taxon speci®city, suggesting that the transcription initiation complex is highly conserved. This contrasts with transcriptional regulator families, that are primarily taxon-speci®c, indicating proteins controlling gene activation exhibit considerable sequence diversity across the eukaryotic domain. requirements for chromatin-remodelling events and speci®c histone modi®cations (4). Although the general mechanistic principles of gene regulation in eukaryotes are well established (5), the phylogenetic diversity of the transcriptional machinery has not yet been comprehensively explored in the light of entire eukaryotic genome sequences. Previously, four entire archaeal genomes were pro®led using a set of known transcription-associated (TA) proteins, extracted from the protein sequence databases via keyword searches (6). This showed that transcription in Archaea is highly divergent, as represented by the presence of substantially different sets of TA protein homologues in the four species examined. Subsequently, the full clustering of all known TA proteins showed TA protein families to be primarily taxon-speci®c up to the domain level, with very little sharing between Archaea, Bacteria and Eukarya (7). Furthermore, ~45% of Arabidopsis thaliana transcription factors were found to belong to families that are speci®c to plants (8). To assess the conservation of proteins participating in the eukaryotic transcriptional process, a TA protein sequence reference set was used to identify homologues in the entire genome sequences of six species of the crown eukaryote group: Schizosaccharomyces pombe (9), Saccharomyces cerevisiae (10), A.thaliana (11), Caenorhabditis elegans (12), Drosophila melanogaster (13) and Homo sapiens (14,15). TA proteins from both eukaryotes and prokaryotes and their homologues in the above species were further considered. The degree of divergence and species distribution of proteins involved in the eukaryotic transcriptional process was then assessed by full clustering of both the TA proteins and their identi®ed homologues in the six species examined. INTRODUCTION MATERIALS AND METHODS Mechanisms controlling transcriptional activation are fundamentally different between prokaryotes and eukaryotes (1). In eukaryotes, protein coding genes are usually transcribed by RNA polymerase II (RNAP II) and typically contain common core-promoter elements recognised by general transcription initiation factors (GTFs) and gene-speci®c DNA elements recognised by regulatory factors (2). Initiation of transcription requires assembly of a pre-initiation complex (PIC); this assembly is nucleated by binding of the TATA-box binding polypeptide subunit of TFIID (TBP). After TBP binding, there follows concerted recruitment of TFIIB, a complex of RNAP II with TFIIF, TFIIE and TFIIH (3). The exact timing of PIC formation varies at different promoters, as do the Extraction of the reference set TA proteins were obtained using the DESCRIPTION and KEYWORD records of Swiss-Prot and the DESCRIPTION record of SP-TrEMBL (16) containing the word `transcription', as previously described (7). Given the high degree of manual curation of the Swiss-Prot database, this keywordbased extraction step recovers a wide range of proteins involved in all aspects of transcription. Data extraction, linking and indexing was performed by the Sequence Retrieval System (SRS), version 6.0 and the Icarus language (17). All data were stored in a MySQL relational database system. *To whom correspondence should be addressed. Tel: +44 1223 494417; Fax: +44 1223 494471; Email: [email protected] 654 Nucleic Acids Research, 2003, Vol. 31, No. 2 Table 1. BLAST search using the TA protein reference set against six eukaryotic genomes Species Sizea TA homologues % Genome TA proteinsb % TA data set H.sapiens D.melanogaster C.elegans A.thaliana S.cerevisiae S.pombe 29 304 13 710 19 099 27 406 6292 4882 3471 1370 1504 3273 605 525 11.8 10.0 7.9 11.9 9.6 10.8 4519 4277 3711 2870 2257 2154 45.8 43.3 37.6 29.1 22.9 21.8 Columns: Size, number of protein sequence entries; TA homologues, number of TA protein homologues identi®ed using BLAST; % Genome, percentage of the genome entries identi®ed as TA homologues; TA proteins, sequences in the TA protein reference set having a homologue in the corresponding species; % TA data set, percentage of the TA reference set with a homologue in the corresponding species. Table 1 is sorted using the values of the last two columns. aNo. of database targets: 100 693. bNo. of query sequences: 9874 (869 678 hits were obtained with an E-value cut-off <10±6). Sequence comparison The TA protein sequence reference set was ®ltered for composition bias using CAST (18) before being used to search against the six complete eukaryotic genomes. The search was performed using BLAST version 2.0 (19) and sequence similarities with an E-value threshold <10±6 were considered as signi®cant. Only TA proteins with eukaryotic homologues were then further considered. Validation 330 sequences in the Saccharomyces Genome Database (SGD) (20) contain `transcription' in their Gene Ontology (GO) annotations (21) and 74% (244) of these sequences have signi®cant matches to the TA protein reference set, corresponding to the majority of known transcription factors. The remaining 86 sequences belonging to the GO class `transcription' have not been identi®ed as homologues. Of the 605 budding yeast TA protein homologues identi®ed (Table 1), 329 are assigned to other GO classes (mostly related with gene expression) and 32 are unassigned. This suggests that the BLAST searches of complete genomes performed (Table 1) identify the majority of the members of TA protein families within the six eukaryotic genomes. Furthermore, annotations from the TRANSFAC database of transcription factors (22) were used to validate the clustering (see below). Sequence clustering Eukaryotic homologues and TA proteins from the reference set were selected for further clustering, using TRIBE-MCL (23), at in¯ation values 2.0 and 5.0. The TRIBE-MCL clustering algorithm has been used to detect families of related protein sequences on the basis of sequence similarity. This algorithm uses the Markov Clustering aLgorithm (MCL) for graph ¯ow simulation, operating on all-against-all matrices containing similarity values obtained by fast database search algorithms, such as BLAST. Thus, TRIBE-MCL avoids expensive sequence comparison operations, used to delineate the consistency of sequence similarity searches as previously shown (24). TRIBE-MCL is ideally suited to cluster very large sequence data sets, such as the one described herein. Only clusters containing both TA proteins and eukaryotic homologues were further considered (Table 2). Table 2. Distribution of protein families containing TA proteins and their eukaryotic homologues Distribution 2.0 # Viridiplantae Fungi Metazoa Speci®c to a kingdom 90 178 249 (517) 12.1 23.9 33.4 69.3 120 171 378 (669) 13.9 19.8 43.7 77.3 Fungi::Metazoa Fungi::Viridiplantae Metazoa::Viridiplantae Fungi::Metazoa::Viridiplantae Shared between kingdoms 34 13 42 140 (229) 4.6 1.7 5.6 18.8 30.7 39 10 46 101 (196) 4.5 1.2 5.3 11.7 22.7 TA proteins and eukaryotic familiesa TA-protein-only familiesb Eukaryotic-only families Total familiesc 746 17 188 951 100.0 865 40 418 1323 100.0 % 5.0 # % Columns: Distribution, kingdom distribution of the identi®ed families; 2.0 and 5.0, in¯ation values controlling cluster granularity in the two clustering operations; # and %, the absolute and relative percentage values for the families with members in TA proteins and the six eukaryotic genomes. aAnalysed in Figure 1. bThese families were not further considered because they did not meet the clustering criteria. cNo. of sequences clustered: 16 568 (10 748 eukaryotic and 5820 TA proteins). All data are available at: http://www.ebi.ac.uk/research/ cgg/transcription/crown_eukaryotes/. RESULTS To estimate the proportion of each eukaryotic genome that encodes TA proteins, 9874 known TA protein sequences were extracted from Swiss-Prot and SP-TrEMBL (16) and subsequently used as queries to search the entire genomes (100 693 sequences in total) of the six aforementioned species. In total, 5820 TA proteins identify 10 748 eukaryotic homologues. The TA proteins without eukaryotic homologues correspond to archaeal and bacterial sequences. The percentage of the genome that encodes for TA proteins is on average 10%, with some slight variations across species (Table 1). Interestingly, Nucleic Acids Research, 2003, Vol. 31, No. 2 655 Figure 1. Taxonomic distribution of TRIBE-MCL families. The sectors of the Venn diagram represent the seven possible eukaryotic kingdom groupings from which TA protein homologues within a cluster could derive. The number shown in each sector is the number of low granularity clusters in which the constituent homologues originate from the kingdom/s that comprise the grouping. The panels show (except for the one bounded in grey) the number of times a particular species distribution of homologues is observed within a kingdom groupingÐthe ®gure in square brackets is the percentage of clusters within the grouping that show the pattern. The colour of the box (displaying the number of clusters in the group) in these panels is speci®c for a particular group of kingdoms and is identical in colour to the corresponding sector of the Venn diagram. The percentage next to the box is the proportion of clusters out of the total number of clusters (746) that fall into the grouping. there is more variation within the animal kingdom, ranging from 8% for worms to 12% for humans, compared with the fungi. The degree of conservation between the eukaryotic TA homologues was assessed by clustering a data set consisting of the TA proteins with homologues in the six target genomes (5820 sequences) and their corresponding homologues (10 748 sequences). Clustering was performed using TRIBE-MCL, an ef®cient algorithm for the large-scale detection of protein families, previously used to identify protein families within the draft human genome (23). Two clustering operations were performed with in¯ation values of 2.0 and 5.0 to produce clusters of different granularity levels. In¯ation value determines cluster granularity, with lower values producing larger clusters, containing more distantly-related sequences. The clustering operation performed at the lower in¯ation value generates 951 families of which 78% contain both eukaryotic sequences and TA proteins (Table 2), with the remaining proportion representing either clusters of very remote sequences or false positive BLAST hits (data not shown). At the higher in¯ation value, 65% of the 1323 families similarly contain sequences from both sets (Table 2). The precision of the clustering algorithm TRIBE-MCL has been estimated to lie between 87 and 98% for InterPro families (using two different criteria), and over 80% for in¯ation value 2 and up to 87% for in¯ation value 5 with SCOP families (23). Complete genome sequences used in this analysis provide a representative view of the TA protein content of a genome, as opposed to just the inclusion of all available sequences from the database where biases may occur. At both granularity levels, the majority of protein families contain eukaryotic sequences that originate from a single kingdom, while only 30% of these families are shared across any of the three eukaryotic kingdoms (Table 2). This observation suggests that the sharing of TA proteins in the eukaryotic domain is limited, consistent with previous observations (7,8). It is also striking that for the families shared by at least two different kingdoms, more than half of them are present in all three kingdoms, suggesting that once a TA protein family is shared, it is more likely to be widely spread within the eukaryotic domain (Table 2). To investigate the species speci®city of eukaryotic transcription, the clusters of low granularity (encompassing the higher number of protein family members) were further analysed (Fig. 1). Only clusters containing at least one TA protein and at least one eukaryotic sequence were selected. This criterion was necessary in order to obtain reliable descriptions of the clusters, supported by functional annotations from the Swiss-Prot database records. For families shared between kingdoms, the most common species distributions are the ones in which all species are represented. For example, out of 140 families shared between all three kingdoms, 102 (73%) contain sequences from all six species (Fig. 1, top left-hand panel). This pattern is similar to the one observed at the kingdom level, suggesting that if a TA protein family is shared across kingdoms, then it is likely to have members present in all the species within the kingdom. In the kingdom-speci®c clusters, the pattern of TA protein family sharing is less pronounced with 30.5% of the families present in all three metazoan species (Fig. 1, bottom right- 656 Nucleic Acids Research, 2003, Vol. 31, No. 2 Figure 2. Phylogenetic diversity of human GTF subunit families. The 12 subunits of RNAP II (Rpb-1 to -12) are indicated as white-bounded ovals. The subunit components of the basal transcription factors (TBP and TFII-B, -E, -F, -H) and cofactors (TFIIA and hTAFIIs) are shown as white-bounded circles. The colour used to ®ll the circles and ovals indicates the kingdom origins of the homologues within these human subunit-containing clusters (see Fig. 1 for the colour codes). hTAFII68 is displayed in white as the sequence is absent from the analysis because its corresponding Swiss-Prot entry does not meet the TA protein extraction criteria. There are only two clusters encoding the three subunits of TFIIA, as TFIIAa and TFIIAb are produced post-translationally from the same precursor polypeptide. The total number of clusters encompassing the proteins shown is 40. hand panel) and only 14.6% present in both fungal species (Fig. 1, top middle panel). Additionally, the degree of species speci®city is markedly different for yeasts and animals: 85.4% of the fungal-speci®c TA protein families are present either in S.pombe or S.cerevisiae, whereas only 47% of the metazoan families are restricted to a single species. Hence, it appears that the degree of sharing of kingdom-speci®c TA protein families is related to the evolutionary distance between the species within a kingdom and not their level of cellular complexity. In summary, 33.4, 23.9 and 12.1% of the TA protein families, contain homologues derived solely from metazoa, fungi and viridiplantae, respectively (Fig. 1). The comparatively low percentage of viridiplantae-speci®c clusters may result from the relative lack of experimental data on plant transcription and fewer plant-speci®c proteins in the TA protein reference set. The amount of taxon sharing was then investigated in TA protein clusters containing either GTF components or regulatory factors. Human sequences present in the extracted TA protein sequence set that participate in transcription initiation and mRNA synthesis were identi®ed by their gene description entries. Analysis of the taxonomic distribution of the TA protein homologues within the low granularity (larger) clusters containing these human sequences, showed that the 12 families containing the subunits of human RNAP II also contain homologues from all three kingdoms (Fig. 2). However, in the 15 families containing the human proteins that comprise the `basal' transcription factors (TBP, TFIIB, TFIIE, TFIIF and TFIIH), only in 53% of the clusters do the homologues originate from all three eukaryotic kingdoms (Fig. 2, silver panel). In three of these families, the homologues are observed only in metazoa and fungi, and in four of them they are metazoan-speci®c. The homologues present in 69% of the 13 families containing the TFIIA subunits and the coactivator subunits of TFIID, TBPassociated factors (hTAFIIs), are detected in all of the three kingdoms (Fig. 2, gold panel). Of the above 40 families, none are species-speci®c: for any H.sapiens sequence belonging to one of the clusters, there exist homologues either in the other metazoan species or even in other kingdoms. This indicates the PIC is highly conserved in eukaryotes. The degree of taxon restrictedness of regulatory proteins was determined by analysing the phylogenetic distribution of the eukaryotic sequences present in low granularity families whose TA protein members are also present in the TRANSFAC database of transcription factors (22). This procedure identi®ed 136 clusters of which 125 (92%) are linked to only one TRANSFAC family (Fig. 3). In only 21% Figure 3. (Opposite) Association of TRANSFAC transcription factor classi®cations with TA protein homologues. The ®ve TRANSFAC superclasses (labelled top-centre in the black framed boxes) are divided into classes and the classes are divided into families. The decimal classi®cation number and description of a family is listed if any of its members are also present in the low granularity TRIBE-MCL clusters. If these TRANSFAC families are further divided into subfamilies, the last digits of the subfamily decimal classi®cation number is displayed top-left of the pie chart. The description and full classi®cation number (last digits are in bold) of the subfamily is shown in the grey rectangle. The colour of each segment of the pie chart represents the species from which the TA protein homologues in the cluster originate (see KEY box for colour codes). The radius of each segment is proportional to the number of occurrences of these homologues normalised by genome size. The bottom of the KEY box shows the relationship between the normalised number of occurrences and segment radius. Pie charts with TRIBE-MCL cluster identi®ers beneath them indicate that TA proteins in the cluster are present in other TRANSFAC families and/or subfamilies. The classi®cations of these families and subfamilies, along with the cluster identi®er, are noted in the grey rectangles. Nucleic Acids Research, 2003, Vol. 31, No. 2 657 658 Nucleic Acids Research, 2003, Vol. 31, No. 2 (28) of the families do the TA protein homologues originate from multiple kingdoms and of these families, almost half (13) contain homologues present in all three kingdoms (with eight of these families containing sequences present in all six species). The remaining 108 (79%) of the TRANSFAC-linked TA protein families are identi®ed as kingdom-speci®c: nine (6%) have homologues found uniquely in plants, 30 (22%) in yeasts and 69 (51%) in animals, possibly re¯ecting some bias in the experimental analyses of transcription in these organisms. The degree of sharing of the kingdom-speci®c homologues across species is markedly different for fungi and metazoa. Two-thirds (46) of the metazoan-speci®c families are shared between at least two species, whereas only one tenth (3) of the fungal-speci®c families occur in both yeasts. Hence, the levels of taxon speci®city observed for families containing well-characterised eukaryotic transcription factors are similar to those observed for the entire data set (Fig. 1). The levels of abundance of the TA protein homologues (present in the TRANSFAC-linked clusters; Fig. 3) within the eukaryotic genomes were assessed by comparing the number of occurrences of the homologues, normalised by genome size, per 10 000 genes. The kingdom-speci®c TA protein homologues are present in each of the genomes on average 3.4 (66.2) times in every 10 000 genes, whereas the homologues present in multiple kingdoms are nearly twice as abundant (6.7 6 19.3). Additionally, the variation in abundance of the kingdom-shared families is 3.1 times higher. For example, in the three clusters that contain members of the heteromeric CCAAT factor family (Fig. 3, Family 4.8.1), the levels of occurrence of the homologues range from 0.5 to 4.7 per 10 000 genes. However, for a C2H2-zinc ®nger family (Fig. 3, Family 2.3.1) the range is from 0.4 to 185.6 and for a MYB family 1.0 to 55.5 (Fig. 3, Family 3.5.1), per 10 000 genes. This increased range in variation primarily results from the relative underand over-representation, respectively, of the plant homologues in these clusters (Fig. 3). The percentage of metazoan genomes encoding TA protein homologues increases as the level of cellular complexity of the organism increases (Table 1). To examine if this trend is also observed with transcriptional regulators, the percentage of the genome that encodes the TA protein homologues present in the 51 TRANSFAC-linked families containing sequences from all three animal species was determined. This analysis revealed that the trend is similar with 1.7% of the worm genome encoding sequences homologous to TRANSFAC entries and 3.1% of the ¯y and 4.4% of the human genomes encoding such sequences. This absolute increase of 2.7% from worms to humans accounts for 69.2% of the overall increase of 3.9% in the proportion of the human genome encoding TA proteins as compared with the worm genome, possibly also re¯ecting the extent of experimental work on gene expression in vertebrates (Table 1). These increases are also observed in Figure 3, where the average difference between the number of H.sapiens and C.elegans sequences (normalised by genome size) in a family is 186.4%. The average difference between D.melanogaster and C.elegans is 93.3% and between H.sapiens and D.melanogaster, 76.8%. Thus, the increase in the number of homologues of these DNA-binding proteins within the genomes of these species appears to correlate with their increasing cellular diversity. DISCUSSION A salient feature of the protein families identi®ed by the clustering procedure described above, is the degree of taxon speci®city displayed by the TA protein homologues within the families. A prior expectation based on previous studies (7,8) of the outcome of this analysis may have been that a ~10% of families would contain homologues present in all three eukaryotic kingdoms. Of the remaining families, a minority would be speci®c to unicellular organisms with the majority being speci®c to multicellular organisms, with a large proportion of the animal homologues being species-speci®c. Figure 1 shows 14% of the low granularity clusters contain homologues present in all six eukaryotes (in agreement with the above expectation), implying that proteins closely related to these sequences are widely dispersed throughout the eukaryotic domain. In 69% of the families identi®ed, the TA protein homologues within a family are unique to one of the three kingdoms. As these data are based on sequences identi®ed by complete genome searches, this suggests that the sharing of TA proteins at the eukaryotic kingdom level is limited and taxon-speci®c, although the number of complete eukaryotic genomes is still limited. However, the degree of species speci®city differs markedly in the fungal and animal kingdoms: 53% of metazoan-speci®c clusters contain homologues present in multiple species and this contrasts greatly with the fungal-speci®c clusters in which only 15% of them contain homologues shared between both yeasts. Given the similarities of the biologies of budding and ®ssion yeast (25) in comparison with the very substantial differences in the levels of cellular diversi®cation between nematodes and vertebrates, the observation that fungal TA protein families show greater species speci®city than do the metazoan TA protein families is unexpected. The taxonomic origins of the homologues present in the clusters whose TA protein members play a role in transcription initiation are far more uniform than the origins of the regulatory factor homologues. Homologues from all three eukaryotic kingdoms are represented in each of the 12 human RNAP II subunit families, suggesting that RNAP II is highly conserved across the eukaryotic domain. In addition, 53% of families encoding basal transcription factor components and 69% encoding transcriptional cofactors contain homologues originating from the three eukaryotic kingdoms, implying that proteins encoding these functions show higher rates of evolution than do the subunits of RNAP II. This increased level of divergence may be expected of the cofactors as they mediate gene-speci®c transcription, but not of the basal components whose role in transcription initiation is generic. In only 9% of the well-characterised transcriptional regulator families (i.e. those containing TA proteins linked to TRANSFAC) do the homologues originate from the three kingdoms. The percentage of metazoan-speci®c, regulator families in which the homologues originate from two or more species is 67% (compared with 53% for the entire data set), whereas in only 10% (15% overall) of the fungal-speci®c families are the homologues present in both yeasts. Hence, the taxon distribution patterns observed for the transcriptional regulator families are similar, but slightly more pronounced, than those observed for the entire data set. Nucleic Acids Research, 2003, Vol. 31, No. 2 The TA protein families present solely in animals display 3.6 times (6.7 times for the transcriptional regulator families) the level of homologue sharing between species than do fungi, despite their considerable differences in cellular complexity. Additionally, the proportion of each metazoan genome encoding TA proteins increases with complexity and 70% of this increase from C.elegans to H.sapiens can be accounted for by sequences encoding regulators. However, there is strong evidence that evolutionary changes in developmental gene regulation have shaped large-scale differences in animal body plans and parts, and regulatory DNA is the predominant source of genetic diversity underlying this variation (26). The low in¯ation clustering operation identi®ed 56, 77 and 170 Hox genes in C.elegans, D.melanogaster and H.sapiens, respectively. These homeobox genes are involved in the determination of the animal body plan. After normalising for genome size, it would appear that humans have 2.0 times more Hox genes than worms (58.0 versus 29.3) and ¯ies 1.9 times the number (56.2 versus 29.3), whereas the vertebrate genome encodes only 3% more than the insect genome (Fig. 3, Family 3.1.1). The normalised number of MYB genes in C.elegans, D.melanogaster and H.sapiens is, respectively, 1.1, 5.1 and 2.4 and in A.thaliana this is substantially elevated to 55.5 (Fig. 3, Family 3.5.1). Hence, TRIBE-MCL identi®es 152 MYB-containing sequences from the predicted plant proteome, compared with 190 members of the MYB superfamily using the entire set of genomic sequences (8). However, both data sets show that the Arabidopsis MYB family is highly ampli®ed in comparison with the animal families. MYB proteins are widely spread throughout the eukaryotic domain and control cell proliferation and differentiation, with the additional Arabidopsis genes controlling many aspects of plant secondary metabolism as well (27). The small variation in normalised MYB gene number across the three metazoan species is similar to that observed with the Hox genes and supports the idea that it is primarily the evolution of gene expression that underlies morphological diversity. An important caveat however is the possibility of a higher degree of differential splicing and post-translational modi®cation, as well as the presence of unidenti®ed TA-protein-coding genes with complex gene structure, in higher animals concealing TA protein diversity. An explanation for the very low amount of homologue sharing between budding and ®ssion yeast in the fungalspeci®c TA protein families could lie in their evolutionary distances: C.elegans is closer to H.sapiens than S.pombe is to S.cerevisiae (28). This analysis suggests that the eukaryotic transcriptional process, despite (or because of) it having a fundamental role is highly diversi®ed in the six species of the crown eukaryote group examined, as exempli®ed by the presence or absence of TA protein homologues in their entire genomes. Possible reasons for this high degree of speci®city may include the different needs to respond to environmental and genetic stimuli as well as some form of `genetic immunity': it may be bene®cial for a species to retain a speci®c repertoire of TA proteins, possibly avoiding catastrophic changes in genetic control by laterally transferred genes from other species. Speciation itself may be largely due to subtle changes in gene regulation via the selective 659 acquisition or loss of TA proteins, as well as changes in their speci®city. Metabolic enzymes are in general highly conserved in sequence and phylogenetic distribution (29), yet a direct comparison of the phylogenetic distribution of TA proteins in eukaryotes with other biological processes, including metabolism, is not currently feasible using the same approach. It would be interesting to examine the extent to which other fundamental biological processes, such as translation or DNA repair, exhibit similar patterns of taxon speci®city. To achieve this, a more elaborate classi®cation of the roles of gene products for eukaryotes is required, such as the one provided by the GO classi®cation scheme (21). Once suf®cient coverage of eukaryotic genomes has been achieved, it would be interesting to perform comparative studies across species and other biological processes. ACKNOWLEDGEMENTS R.M.R.C. and C.A.O. would like to thank the Medical Research Council for supporting this work through a Special Training Fellowship in Bioinformatics to R.M.R.C. REFERENCES 1. Struhl,K. (1999) Fundamentally different logic of gene regulation in eukaryotes and prokaryotes. Cell, 98, 1±4. 2. Roeder,R.G. (1996) The role of general initiation factors in transcription by RNA polymerase II. Trends Biochem. Sci., 21, 327±335. 3. Woychik,N.A. and Hampsey,M. (2002) The RNA polymerase II machinery: structure illuminates function. Cell, 108, 453±463. 4. Emerson,B.M. (2002) Speci®city of gene regulation. Cell, 109, 267±270. 5. Ptashne,M. and Gann,A. (2002) Genes and Signals. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY. 6. Kyrpides,N.C. and Ouzounis,C.A. (1999) Transcription in archaea. Proc. Natl Acad. Sci. USA, 96, 8545±8550. 7. Coulson,R.M., Enright,A.J. and Ouzounis,C.A. (2001) Transcriptionassociated protein families are primarily taxon-speci®c. Bioinformatics, 17, 95±97. 8. Riechmann,J.L., Heard,J., Martin,G., Reuber,L., Jiang,C., Keddie,J., Adam,L., Pineda,O., Ratcliffe,O.J., Samaha,R.R. et al. (2000) Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes. Science, 290, 2105±2110. 9. Wood,V., Gwilliam,R., Rajandream,M.A., Lyne,M., Lyne,R., Stewart,A., Sgouros,J., Peat,N., Hayles,J., Baker,S. et al. (2002) The genome sequence of Schizosaccharomyces pombe. Nature, 415, 871±880. 10. Goffeau,A., Barrell,B.G., Bussey,H., Davis,R.W., Dujon,B., Feldmann,H., Galibert,F., Hoheisel,J.D., Jacq,C., Johnston,M. et al. (1996) Life with 6000 genes. Science, 274, 546, 563±547. 11. The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the ¯owering plant Arabidopsis thaliana. Nature, 408, 796±815. 12. The C. elegans Sequencing Consortium (1998) Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282, 2012±2018. 13. Adams,M.D., Celniker,S.E., Holt,R.A., Evans,C.A., Gocayne,J.D., Amanatides,P.G., Scherer,S.E., Li,P.W., Hoskins,R.A., Galle,R.F. et al. (2000) The genome sequence of Drosophila melanogaster. Science, 287, 2185±2195. 14. Lander,E.S., Linton,L.M., Birren,B., Nusbaum,C., Zody,M.C., Baldwin,J., Devon,K., Dewar,K., Doyle,M., FitzHugh,W. et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860±921. 15. Venter,J.C., Adams,M.D., Myers,E.W., Li,P.W., Mural,R.J., Sutton,G.G., Smith,H.O., Yandell,M., Evans,C.A., Holt,R.A. et al. (2001) The sequence of the human genome. Science, 291, 1304±1351. 660 Nucleic Acids Research, 2003, Vol. 31, No. 2 16. Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45±48. 17. Etzold,T. and Argos,P. (1993) SRSÐan indexing and retrieval tool for ¯at ®le data libraries. Comput. Appl. Biosci., 9, 49±57. 18. Promponas,V.J., Enright,A.J., Tsoka,S., Kreil,D.P., Leroy,C., Hamodrakas,S., Sander,C. and Ouzounis,C.A. (2000) CAST: an iterative algorithm for the complexity analysis of sequence tracts. Complexity analysis of sequence tracts. Bioinformatics, 16, 915±922. 19. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389±3402. 20. Dwight,S.S., Harris,M.A., Dolinski,K., Ball,C.A., Binkley,G., Christie,K.R., Fisk,D.G., Issel-Tarver,L., Schroeder,M., Sherlock,G. et al. (2002) Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res., 30, 69±72. 21. Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T. et al. (2000) Gene ontology: tool for the uni®cation of biology. The Gene Ontology Consortium. Nature Genet., 25, 25±29. 22. Wingender,E., Dietze,P., Karas,H. and Knuppel,R. (1996) TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res., 24, 238±241. 23. Enright,A.J., Van Dongen,S. and Ouzounis,C.A. (2002) An ef®cient algorithm for large-scale detection of protein families. Nucleic Acids Res., 30, 1575±1584. 24. Enright,A.J. and Ouzounis,C.A. (2000) GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics, 16, 451±457. 25. Sipiczki,M. (2000) Where does ®ssion yeast sit on the tree of life? Genome Biol., 1, r1011.1011±r1011.1014. 26. Carroll,S.B. (2000) Endless forms: the evolution of gene regulation and morphological diversity. Cell, 101, 577±580. 27. Stracke,R., Werber,M. and Weisshaar,B. (2001) The R2R3-MYB gene family in Arabidopsis thaliana. Curr. Opin. Plant Biol., 4, 447±456. 28. Feng,D.F., Cho,G. and Doolittle,R.F. (1997) Determining divergence times with a protein clock: update and reevaluation. Proc. Natl Acad. Sci. USA, 94, 13028±13033. 29. Doolittle,R.F., Feng,D.F., Tsang,S., Cho,G. and Little,E. (1996) Determining divergence times of the major kingdoms of living organisms with a protein clock. Science, 271, 470±477.
© Copyright 2026 Paperzz