The phylogenetic diversity of eukaryotic transcription

© 2003 Oxford University Press
Nucleic Acids Research, 2003, Vol. 31, No. 2 653-660
DOI: 10.1093/nar/gkg156
Richard M. R. Coulson* and Christos A. Ouzounis
Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation,
Cambridge CB10 1SD, UK
Received September 16, 2002; Revised October 30, 2002; Accepted November 18, 2002
ABSTRACT
Eukaryotic transcription is a highly regulated process involving interactions between large numbers
of proteins. To analyse the phylogenetic distribution of the components of this process, six crown
eukaryote group genomes were queried with a reference set of transcription-associated (TA) proteins.
On average, one in 10 proteins encoded by these genomes was found to be homologous to sequences in
the reference set. Analysis of families identified using an accurate sequence clustering algorithm
and containing both TA proteins and eukaryotic sequences showed that in two-thirds of the families
the homologues originate from a single kingdom. Furthermore, in only 15% of the fungal-specific
clusters are the homologues present in both budding and fission yeast, as compared with the
metazoan-specific clusters where 53% of the homologues originate from two or more species.
Families whose members comprise general transcription factor or RNA polymerase subunits exhibit
a low degree of taxon specificity, suggesting that the transcription initiation complex is highly
conserved. This contrasts with transcriptional regulator families, which are primarily
taxon-specific, indicating that proteins controlling gene activation exhibit considerable sequence
diversity across the eukaryotic domain.
INTRODUCTION
Mechanisms controlling transcriptional activation are fundamentally different between prokaryotes
and eukaryotes (1). In eukaryotes, protein-coding genes are usually transcribed by RNA polymerase II
(RNAP II) and typically contain common core-promoter elements recognised by general transcription
initiation factors (GTFs) and gene-specific DNA elements recognised by regulatory factors (2).
Initiation of transcription requires assembly of a pre-initiation complex (PIC); this assembly is
nucleated by binding of the TATA-box binding polypeptide subunit of TFIID (TBP). After TBP binding,
there follows concerted recruitment of TFIIB, a complex of RNAP II with TFIIF, TFIIE and TFIIH (3).
The exact timing of PIC formation varies at different promoters, as do the requirements for
chromatin-remodelling events and specific histone modifications (4). Although the general
mechanistic principles of gene regulation in eukaryotes are well established (5), the phylogenetic
diversity of the transcriptional machinery has not yet been comprehensively explored in the light of
entire eukaryotic genome sequences.
Previously, four entire archaeal genomes were profiled using a set of known transcription-associated
(TA) proteins, extracted from the protein sequence databases via keyword searches (6). This showed
that transcription in Archaea is highly divergent, as represented by the presence of substantially
different sets of TA protein homologues in the four species examined. Subsequently, the full
clustering of all known TA proteins showed TA protein families to be primarily taxon-specific up to
the domain level, with very little sharing between Archaea, Bacteria and Eukarya (7). Furthermore,
~45% of Arabidopsis thaliana transcription factors were found to belong to families that are
specific to plants (8). To assess the conservation of proteins participating in the eukaryotic
transcriptional process, a TA protein sequence reference set was used to identify homologues in the
entire genome sequences of six species of the crown eukaryote group: Schizosaccharomyces pombe (9),
Saccharomyces cerevisiae (10), A.thaliana (11), Caenorhabditis elegans (12), Drosophila
melanogaster (13) and Homo sapiens (14,15). TA proteins from both eukaryotes and prokaryotes and
their homologues in the above species were further considered. The degree of divergence and species
distribution of proteins involved in the eukaryotic transcriptional process was then assessed by
full clustering of both the TA proteins and their identified homologues in the six species examined.
MATERIALS AND METHODS
Extraction of the reference set
TA proteins were obtained using the DESCRIPTION and KEYWORD records of Swiss-Prot and the
DESCRIPTION record of SP-TrEMBL (16) containing the word 'transcription', as previously
described (7). Given the high degree of manual curation of the Swiss-Prot database, this
keyword-based extraction step recovers a wide range of proteins involved in all aspects of
transcription. Data extraction, linking and indexing were performed using the Sequence Retrieval
System (SRS), version 6.0, and the Icarus language (17). All data were stored in a MySQL relational
database system.
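The keyword-based extraction step can be illustrated with a minimal Python sketch. It is not the SRS/Icarus procedure used in this work: it assumes a Swiss-Prot-style flat file in which each entry starts with an ID line, carries its description and keywords on DE and KW lines and ends with a `//` line. The function name and the two toy entries below are invented for illustration.

```python
def extract_ta_entries(flatfile_text, keyword="transcription"):
    """Return entry names whose DE or KW lines mention the keyword (case-insensitive)."""
    selected = []
    entry_id = None
    matched = False
    for line in flatfile_text.splitlines():
        code = line[:2]
        if code == "ID":
            # Start of a new entry: the entry name is the first token after the line code.
            entry_id = line[2:].split()[0]
            matched = False
        elif code in ("DE", "KW"):
            if keyword.lower() in line[2:].lower():
                matched = True
        elif code == "//":
            # End of entry: keep it if any DE or KW line matched.
            if entry_id is not None and matched:
                selected.append(entry_id)
            entry_id = None
    return selected


# Two made-up entries, not real database records.
SAMPLE = """\
ID   TOYA_HUMAN     STANDARD;      PRT;   100 AA.
DE   Hypothetical transcription initiation factor (toy entry).
KW   Transcription regulation; DNA-binding.
//
ID   TOYB_YEAST     STANDARD;      PRT;   200 AA.
DE   Hypothetical actin-like protein (toy entry).
KW   Structural protein; Cytoskeleton.
//
"""

print(extract_ta_entries(SAMPLE))
```

Matching on both DE and KW lines mirrors the use of the DESCRIPTION and KEYWORD records for Swiss-Prot; for a DESCRIPTION-only source such as SP-TrEMBL the `("DE", "KW")` tuple would be reduced to `("DE",)`.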
*To whom correspondence should be addressed. Tel: +44 1223 494417; Fax: +44 1223 494471; Email: [email protected]
Table 1. BLAST search using the TA protein reference set against six eukaryotic genomes

Species          Size(a)   TA homologues   % Genome   TA proteins(b)   % TA data set
H.sapiens        29 304    3471            11.8       4519             45.8
D.melanogaster   13 710    1370            10.0       4277             43.3
C.elegans        19 099    1504             7.9       3711             37.6
A.thaliana       27 406    3273            11.9       2870             29.1
S.cerevisiae       6292     605             9.6       2257             22.9
S.pombe            4882     525            10.8       2154             21.8

Columns: Size, number of protein sequence entries; TA homologues, number of TA protein homologues
identified using BLAST; % Genome, percentage of the genome entries identified as TA homologues; TA
proteins, sequences in the TA protein reference set having a homologue in the corresponding species;
% TA data set, percentage of the TA reference set with a homologue in the corresponding species.
Table 1 is sorted using the values of the last two columns.
(a) No. of database targets: 100 693.
(b) No. of query sequences: 9874 (869 678 hits were obtained with an E-value cut-off <10^-6).
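The two percentage columns of Table 1 follow directly from the counts. The short Python sketch below recomputes them from the values in the table; the variable and function names are illustrative only.

```python
# species: (genome size, TA homologues, TA proteins with a homologue), values from Table 1
TABLE1 = {
    "H.sapiens": (29304, 3471, 4519),
    "D.melanogaster": (13710, 1370, 4277),
    "C.elegans": (19099, 1504, 3711),
    "A.thaliana": (27406, 3273, 2870),
    "S.cerevisiae": (6292, 605, 2257),
    "S.pombe": (4882, 525, 2154),
}
N_QUERIES = 9874  # size of the TA protein reference set (Table 1, footnote b)


def percentages(size, homologues, ta_with_hit, n_queries=N_QUERIES):
    """Return (% Genome, % TA data set), each rounded to one decimal place."""
    pct_genome = round(100.0 * homologues / size, 1)
    pct_ta = round(100.0 * ta_with_hit / n_queries, 1)
    return pct_genome, pct_ta


for species, row in TABLE1.items():
    pct_genome, pct_ta = percentages(*row)
    print(f"{species:15s} {pct_genome:5.1f} {pct_ta:5.1f}")

# The genome sizes add up to the number of database targets (Table 1, footnote a).
print(sum(size for size, _, _ in TABLE1.values()))
```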
Sequence comparison
The TA protein sequence reference set was filtered for composition bias using CAST (18) before being
used to search against the six complete eukaryotic genomes. The search was performed using BLAST
version 2.0 (19) and sequence similarities with an E-value threshold <10^-6 were considered as
significant. Only TA proteins with eukaryotic homologues were then further considered.
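The E-value filter can be sketched in Python as follows. The sketch assumes that the BLAST hits are available in the 12-column tab-separated layout (query, subject, ..., E-value in column 11, bit score in column 12); the output format actually used in this work is not stated, and the hit lines below are invented.

```python
E_VALUE_CUTOFF = 1e-6  # significance threshold given in the text


def significant_hits(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Map each query to the set of subjects hit with an E-value below the cutoff."""
    hits = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue < cutoff:
            hits.setdefault(query, set()).add(subject)
    return hits


# Invented hits: query, subject, % identity, length, mismatches, gap opens,
# q.start, q.end, s.start, s.end, E-value, bit score.
SAMPLE_HITS = [
    "ta1\tgeneA\t45.0\t200\t110\t2\t1\t200\t5\t204\t1e-30\t120",
    "ta1\tgeneB\t28.0\t90\t65\t1\t10\t99\t1\t90\t2e-04\t40",
    "ta2\tgeneA\t39.0\t150\t90\t3\t1\t150\t20\t169\t3e-12\t70",
]

print(significant_hits(SAMPLE_HITS))
```

Queries with no entry in the returned dictionary correspond to TA proteins without a significant homologue, which are not considered further.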
Validation
A total of 330 sequences in the Saccharomyces Genome Database (SGD) (20) contain 'transcription' in
their Gene Ontology (GO) annotations (21), and 74% (244) of these sequences have significant matches
to the TA protein reference set, corresponding to the majority of known transcription factors. The
remaining 86 sequences belonging to the GO class 'transcription' have not been identified as
homologues. Of the 605 budding yeast TA protein homologues identified (Table 1), 329 are assigned to
other GO classes (mostly related to gene expression) and 32 are unassigned. This suggests that the
BLAST searches of complete genomes performed (Table 1) identify the majority of the members of TA
protein families within the six eukaryotic genomes. Furthermore, annotations from the TRANSFAC
database of transcription factors (22) were used to validate the clustering (see below).
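The validation amounts to measuring how much of an independently annotated set is recovered by the search. A minimal Python sketch is given below; the gene identifiers are invented, whereas the three counts at the end are those reported above.

```python
def recall(annotated, recovered):
    """Return (found, missed, percentage found) for an annotated reference set."""
    annotated = set(annotated)
    found = annotated & set(recovered)
    return len(found), len(annotated) - len(found), 100.0 * len(found) / len(annotated)


# Toy identifiers, invented for illustration only.
go_transcription = {"geneA", "geneB", "geneC", "geneD"}
blast_homologues = {"geneA", "geneB", "geneC", "geneX", "geneY"}
print(recall(go_transcription, blast_homologues))

# Counts reported in the text for budding yeast.
GO_ANNOTATED, GO_RECOVERED, ALL_HOMOLOGUES = 330, 244, 605
print(round(100.0 * GO_RECOVERED / GO_ANNOTATED))  # % of GO 'transcription' entries recovered
print(GO_ANNOTATED - GO_RECOVERED)                 # GO entries without a significant match
print(ALL_HOMOLOGUES - GO_RECOVERED)               # homologues outside the GO class (329 + 32)
```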
Sequence clustering
Eukaryotic homologues and TA proteins from the reference set were selected for further clustering,
using TRIBE-MCL (23), at inflation values 2.0 and 5.0. The TRIBE-MCL clustering algorithm has been
used to detect families of related protein sequences on the basis of sequence similarity. This
algorithm uses the Markov Clustering algorithm (MCL) for graph flow simulation, operating on
all-against-all matrices containing similarity values obtained by fast database search algorithms,
such as BLAST. Thus, TRIBE-MCL avoids expensive sequence comparison operations, used to delineate
the consistency of sequence similarity searches as previously shown (24). TRIBE-MCL is ideally
suited to cluster very large sequence data sets, such as the one described herein. Only clusters
containing both TA proteins and eukaryotic homologues were further considered (Table 2).
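The flow simulation behind MCL alternates two operations on a column-stochastic matrix: expansion (matrix squaring) and inflation (entry-wise powering followed by column re-normalisation). The pure-Python sketch below illustrates this on a toy similarity matrix. It is a didactic re-implementation under simplifying assumptions (dense lists, unit self-loops, a fixed convergence tolerance), not the TRIBE-MCL code, and the similarity values are invented.

```python
def normalise_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [[0.0] * n for _ in range(n)]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = m[i][j] / total if total else 0.0
    return out


def expand(m):
    """Expansion step: square the matrix (flow along paths of length two)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def inflate(m, r):
    """Inflation step: raise entries to the power r, then re-normalise columns."""
    return normalise_columns([[x ** r for x in row] for row in m])


def mcl(similarity, inflation=2.0, max_iterations=50, tol=1e-9):
    """Cluster the nodes of a symmetric similarity matrix; returns sorted member lists."""
    n = len(similarity)
    # Add unit self-loops (an assumption of this sketch) and make columns stochastic.
    m = normalise_columns(
        [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    )
    for _ in range(max_iterations):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Rows that retain flow are attractors; their non-zero columns form a cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy graph: sequences 0-2 are mutually similar, as are sequences 3-4.
TOY = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(mcl(TOY, inflation=2.0))
```

In this scheme the inflation parameter is what controls granularity: a larger exponent sharpens the contrast between strong and weak flows before re-normalisation.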
Table 2. Distribution of protein families containing TA proteins and their eukaryotic homologues

                                          Inflation 2.0      Inflation 5.0
Distribution                              #        %         #        %
Viridiplantae                             90       12.1      120      13.9
Fungi                                     178      23.9      171      19.8
Metazoa                                   249      33.4      378      43.7
Specific to a kingdom                     (517)    69.3      (669)    77.3

Fungi::Metazoa                            34       4.6       39       4.5
Fungi::Viridiplantae                      13       1.7       10       1.2
Metazoa::Viridiplantae                    42       5.6       46       5.3
Fungi::Metazoa::Viridiplantae             140      18.8      101      11.7
Shared between kingdoms                   (229)    30.7      (196)    22.7

TA proteins and eukaryotic families(a)    746      100.0     865      100.0
TA-protein-only families(b)               17                 40
Eukaryotic-only families                  188                418
Total families(c)                         951                1323

Columns: Distribution, kingdom distribution of the identified families; 2.0 and 5.0, inflation
values controlling cluster granularity in the two clustering operations; # and %, the absolute
number and the percentage of the families with members in TA proteins and the six eukaryotic
genomes.
(a) Analysed in Figure 1.
(b) These families were not further considered because they did not meet the clustering criteria.
(c) No. of sequences clustered: 16 568 (10 748 eukaryotic and 5820 TA proteins).
All data are available at: http://www.ebi.ac.uk/research/cgg/transcription/crown_eukaryotes/.
RESULTS
To estimate the proportion of each eukaryotic genome that encodes TA proteins, 9874 known TA protein
sequences were extracted from Swiss-Prot and SP-TrEMBL (16) and subsequently used as queries to
search the entire genomes (100 693 sequences in total) of the six aforementioned species. In total,
5820 TA proteins identify 10 748 eukaryotic homologues. The TA proteins without eukaryotic
homologues correspond to archaeal and bacterial sequences. The percentage of the genome that encodes
TA proteins is on average 10%, with some slight variations across species (Table 1). Interestingly,
there is more variation within the animal kingdom, ranging from 8% for worms to 12% for humans,
compared with the fungi.
Figure 1. Taxonomic distribution of TRIBE-MCL families. The sectors of the Venn diagram represent the seven possible eukaryotic kingdom groupings from which TA protein homologues within a cluster could derive. The number shown in each sector is the number of low granularity clusters in which the constituent homologues originate from the kingdom/s that comprise the grouping. The panels show (except for the one bounded in grey) the number of times a particular species distribution of homologues is observed within a kingdom grouping; the figure in square brackets is the percentage of clusters within the grouping that show the pattern. The colour of the box (displaying the number of clusters in the group) in these panels is specific for a particular group of kingdoms and is identical in colour to the corresponding sector of the Venn diagram. The percentage next to the box is the proportion of clusters out of the total number of clusters (746) that fall into the grouping.
The degree of conservation between the eukaryotic TA homologues was assessed by clustering a data
set consisting of the TA proteins with homologues in the six target genomes (5820 sequences) and
their corresponding homologues (10 748 sequences). Clustering was performed using TRIBE-MCL, an
efficient algorithm for the large-scale detection of protein families, previously used to identify
protein families within the draft human genome (23). Two clustering operations were performed with
inflation values of 2.0 and 5.0 to produce clusters of different granularity levels. The inflation
value determines cluster granularity, with lower values producing larger clusters, containing more
distantly related sequences. The clustering operation performed at the lower inflation value
generates 951 families of which 78% contain both eukaryotic sequences and TA proteins (Table 2),
with the remaining proportion representing either clusters of very remote sequences or false
positive BLAST hits (data not shown). At the higher inflation value, 65% of the 1323 families
similarly contain sequences from both sets (Table 2). The precision of the clustering algorithm
TRIBE-MCL has been estimated to lie between 87 and 98% for InterPro families (using two different
criteria), and over 80% for inflation value 2 and up to 87% for inflation value 5 with SCOP
families (23).
Complete genome sequences used in this analysis provide a representative view of the TA protein
content of a genome, as opposed to just the inclusion of all available sequences from the database
where biases may occur. At both granularity levels, the majority of protein families contain
eukaryotic sequences that originate from a single kingdom, while no more than about 30% of these
families (30.7 and 22.7% at inflation values 2.0 and 5.0, respectively) are shared between two or
more of the three eukaryotic kingdoms (Table 2). This observation suggests that the sharing of TA
proteins in the eukaryotic domain is limited, consistent with previous observations (7,8). It is
also striking that for the families shared by at least two different kingdoms, more than half of
them are present in all three kingdoms, suggesting that once a TA protein family is shared, it is
more likely to be widely spread within the eukaryotic domain (Table 2).
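Assigning a family to one of the kingdom groupings of Table 2 only requires mapping the species of its eukaryotic members to kingdoms. The Python sketch below shows one way to do this; the species-to-kingdom mapping follows the six genomes analysed, while the toy clusters and function name are invented for illustration.

```python
from collections import Counter

KINGDOM = {
    "S.pombe": "Fungi",
    "S.cerevisiae": "Fungi",
    "A.thaliana": "Viridiplantae",
    "C.elegans": "Metazoa",
    "D.melanogaster": "Metazoa",
    "H.sapiens": "Metazoa",
}


def kingdom_grouping(species_in_cluster):
    """Return a label such as 'Fungi::Metazoa' for the kingdoms present in a cluster."""
    kingdoms = sorted({KINGDOM[s] for s in species_in_cluster})
    return "::".join(kingdoms)


# Toy clusters, each given as the species of origin of its eukaryotic members.
toy_clusters = [
    ["H.sapiens", "D.melanogaster"],
    ["S.pombe"],
    ["S.cerevisiae", "H.sapiens", "A.thaliana"],
    ["C.elegans", "H.sapiens", "H.sapiens"],
]
print(Counter(kingdom_grouping(c) for c in toy_clusters))

# Share of the kingdom-shared families that span all three kingdoms (Table 2).
print(round(100.0 * 140 / 229, 1), round(100.0 * 101 / 196, 1))
```

The last line reproduces the observation that more than half of the shared families are present in all three kingdoms at both inflation values.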
To investigate the species specificity of eukaryotic transcription, the clusters of low granularity
(encompassing the higher number of protein family members) were further analysed (Fig. 1). Only
clusters containing at least one TA protein and at least one eukaryotic sequence were selected. This
criterion was necessary in order to obtain reliable descriptions of the clusters, supported by
functional annotations from the Swiss-Prot database records. For families shared between kingdoms,
the most common species distributions are the ones in which all species are represented. For
example, out of 140 families shared between all three kingdoms, 102 (73%) contain sequences from all
six species (Fig. 1, top left-hand panel). This pattern is similar to the one observed at the
kingdom level, suggesting that if a TA protein family is shared across kingdoms, then it is likely
to have members present in all the species within the kingdom.
In the kingdom-specific clusters, the pattern of TA protein family sharing is less pronounced, with
30.5% of the families present in all three metazoan species (Fig. 1, bottom right-hand panel) and
only 14.6% present in both fungal species (Fig. 1, top middle panel). Additionally, the degree of
species specificity is markedly different for yeasts and animals: 85.4% of the fungal-specific TA
protein families are present in either S.pombe or S.cerevisiae but not both, whereas only 47% of the
metazoan families are restricted to a single species. Hence, it appears that the degree of sharing
of kingdom-specific TA protein families is related to the evolutionary distance between the species
within a kingdom and not their level of cellular complexity. In summary, 33.4, 23.9 and 12.1% of the
TA protein families contain homologues derived solely from metazoa, fungi and viridiplantae,
respectively (Fig. 1). The comparatively low percentage of viridiplantae-specific clusters may
result from the relative lack of experimental data on plant transcription and fewer plant-specific
proteins in the TA protein reference set.
Figure 2. Phylogenetic diversity of human GTF subunit families. The 12 subunits of RNAP II (Rpb-1 to -12) are indicated as white-bounded ovals. The subunit components of the basal transcription factors (TBP and TFII-B, -E, -F, -H) and cofactors (TFIIA and hTAFIIs) are shown as white-bounded circles. The colour used to fill the circles and ovals indicates the kingdom origins of the homologues within these human subunit-containing clusters (see Fig. 1 for the colour codes). hTAFII68 is displayed in white as the sequence is absent from the analysis because its corresponding Swiss-Prot entry does not meet the TA protein extraction criteria. There are only two clusters encoding the three subunits of TFIIA, as TFIIAα and TFIIAβ are produced post-translationally from the same precursor polypeptide. The total number of clusters encompassing the proteins shown is 40.
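The degree of species sharing within kingdom-specific clusters can be expressed as the percentage of clusters whose members come from two or more species. The Python sketch below computes this for a handful of invented clusters and then compares the percentages reported above for metazoa and fungi; the function name is illustrative.

```python
def share_multi_species(clusters):
    """Percentage of clusters whose members originate from two or more species."""
    multi = sum(1 for species in clusters if len(set(species)) >= 2)
    return 100.0 * multi / len(clusters)


# Invented fungal-specific clusters, each listed by the species of its members.
toy_fungal = [
    ["S.pombe"],
    ["S.cerevisiae", "S.cerevisiae"],
    ["S.pombe", "S.cerevisiae"],
    ["S.cerevisiae"],
]
print(share_multi_species(toy_fungal))

# Percentages reported in the text.
METAZOA_SINGLE_SPECIES = 47.0  # % of metazoan-specific families in one species only
FUNGI_BOTH_SPECIES = 14.6      # % of fungal-specific families present in both yeasts
metazoa_multi = 100.0 - METAZOA_SINGLE_SPECIES
print(metazoa_multi, round(metazoa_multi / FUNGI_BOTH_SPECIES, 1))
```

The ratio printed last is the fold difference in species sharing between metazoan-specific and fungal-specific families.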
The amount of taxon sharing was then investigated in TA protein clusters containing either GTF
components or regulatory factors. Human sequences present in the extracted TA protein sequence set
that participate in transcription initiation and mRNA synthesis were identified by their gene
description entries. Analysis of the taxonomic distribution of the TA protein homologues within the
low granularity (larger) clusters containing these human sequences showed that the 12 families
containing the subunits of human RNAP II also contain homologues from all three kingdoms (Fig. 2).
However, in the 15 families containing the human proteins that comprise the 'basal' transcription
factors (TBP, TFIIB, TFIIE, TFIIF and TFIIH), only in 53% of the clusters do the homologues
originate from all three eukaryotic kingdoms (Fig. 2, silver panel). In three of these families, the
homologues are observed only in metazoa and fungi, and in four of them they are metazoan-specific.
The homologues present in 69% of the 13 families containing the TFIIA subunits and the coactivator
subunits of TFIID, TBP-associated factors (hTAFIIs), are detected in all of the three kingdoms
(Fig. 2, gold panel). Of the above 40 families, none are species-specific: for any H.sapiens
sequence belonging to one of the clusters, there exist homologues either in the other metazoan
species or even in other kingdoms. This indicates that the PIC is highly conserved in eukaryotes.
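The family counts in this paragraph are mutually consistent, as the following short Python check shows. The counts of families found in all three kingdoms for the basal factors and the cofactors are not stated explicitly; they are derived here from the figures given in the text, and the variable names are illustrative.

```python
RNAP_FAMILIES = 12      # RNAP II subunit families, all shared by three kingdoms
BASAL_FAMILIES = 15     # TBP, TFIIB, TFIIE, TFIIF and TFIIH families
COFACTOR_FAMILIES = 13  # TFIIA and hTAFII families

# Basal factors: three families are metazoa + fungi only, four are metazoan-specific.
basal_all_three = BASAL_FAMILIES - 3 - 4
print(basal_all_three, round(100.0 * basal_all_three / BASAL_FAMILIES))

# Cofactors: 69% of 13 families corresponds to nine families (derived, not stated).
cofactor_all_three = round(0.69 * COFACTOR_FAMILIES)
print(cofactor_all_three, round(100.0 * cofactor_all_three / COFACTOR_FAMILIES))

# Total number of families considered in Figure 2.
print(RNAP_FAMILIES + BASAL_FAMILIES + COFACTOR_FAMILIES)
```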
The degree of taxon restrictedness of regulatory proteins was determined by analysing the
phylogenetic distribution of the eukaryotic sequences present in low granularity families whose TA
protein members are also present in the TRANSFAC database of transcription factors (22). This
procedure identified 136 clusters, of which 125 (92%) are linked to only one TRANSFAC family
(Fig. 3). In only 21% (28) of the families do the TA protein homologues originate from multiple
kingdoms and, of these families, almost half (13) contain homologues present in all three kingdoms
(with eight of these families containing sequences present in all six species). The remaining 108
(79%) of the TRANSFAC-linked TA protein families are identified as kingdom-specific: nine (6%) have
homologues found uniquely in plants, 30 (22%) in yeasts and 69 (51%) in animals, possibly reflecting
some bias in the experimental analyses of transcription in these organisms. The degree of sharing of
the kingdom-specific homologues across species is markedly different for fungi and metazoa.
Two-thirds (46) of the metazoan-specific families are shared between at least two species, whereas
only one-tenth (3) of the fungal-specific families occur in both yeasts. Hence, the levels of taxon
specificity observed for families containing well-characterised eukaryotic transcription factors are
similar to those observed for the entire data set (Fig. 1).
Figure 3. Association of TRANSFAC transcription factor classifications with TA protein homologues. The five TRANSFAC superclasses (labelled top-centre in the black framed boxes) are divided into classes and the classes are divided into families. The decimal classification number and description of a family are listed if any of its members are also present in the low granularity TRIBE-MCL clusters. If these TRANSFAC families are further divided into subfamilies, the last digits of the subfamily decimal classification number are displayed top-left of the pie chart. The description and full classification number (last digits are in bold) of the subfamily are shown in the grey rectangle. The colour of each segment of the pie chart represents the species from which the TA protein homologues in the cluster originate (see KEY box for colour codes). The radius of each segment is proportional to the number of occurrences of these homologues normalised by genome size. The bottom of the KEY box shows the relationship between the normalised number of occurrences and segment radius. Pie charts with TRIBE-MCL cluster identifiers beneath them indicate that TA proteins in the cluster are present in other TRANSFAC families and/or subfamilies. The classifications of these families and subfamilies, along with the cluster identifier, are noted in the grey rectangles.
The levels of abundance of the TA protein homologues (present in the TRANSFAC-linked clusters;
Fig. 3) within the eukaryotic genomes were assessed by comparing the number of occurrences of the
homologues, normalised by genome size, per 10 000 genes. The kingdom-specific TA protein homologues
are present in each of the genomes on average 3.4 (±6.2) times in every 10 000 genes, whereas the
homologues present in multiple kingdoms are nearly twice as abundant (6.7 ± 19.3). Additionally, the
variation in abundance of the kingdom-shared families is 3.1 times higher. For example, in the three
clusters that contain members of the heteromeric CCAAT factor family (Fig. 3, Family 4.8.1), the
levels of occurrence of the homologues range from 0.5 to 4.7 per 10 000 genes. However, for a
C2H2-zinc finger family (Fig. 3, Family 2.3.1) the range is from 0.4 to 185.6 and for a MYB family
1.0 to 55.5 (Fig. 3, Family 3.5.1), per 10 000 genes. This increased range in variation primarily
results from the relative under- and over-representation, respectively, of the plant homologues in
these clusters (Fig. 3).
The percentage of metazoan genomes encoding TA protein homologues increases as the level of cellular
complexity of the organism increases (Table 1). To examine if this trend is also observed with
transcriptional regulators, the percentage of the genome that encodes the TA protein homologues
present in the 51 TRANSFAC-linked families containing sequences from all three animal species was
determined. This analysis revealed that the trend is similar, with 1.7% of the worm genome encoding
sequences homologous to TRANSFAC entries and 3.1% of the fly and 4.4% of the human genomes encoding
such sequences. This absolute increase of 2.7% from worms to humans accounts for 69.2% of the
overall increase of 3.9% in the proportion of the human genome encoding TA proteins as compared with
the worm genome, possibly also reflecting the extent of experimental work on gene expression in
vertebrates (Table 1). These increases are also observed in Figure 3, where the average difference
between the number of H.sapiens and C.elegans sequences (normalised by genome size) in a family is
186.4%. The average difference between D.melanogaster and C.elegans is 93.3% and between H.sapiens
and D.melanogaster, 76.8%. Thus, the increase in the number of homologues of these DNA-binding
proteins within the genomes of these species appears to correlate with their increasing cellular
diversity.
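The contribution of regulators to the worm-to-human increase is a ratio of two differences in genome percentage. A short Python sketch using the percentages quoted above and in Table 1 (the function and variable names are illustrative):

```python
def regulator_share(reg_low, reg_high, total_low, total_high):
    """Percentage of the overall increase that is explained by the regulator increase."""
    return 100.0 * (reg_high - reg_low) / (total_high - total_low)


WORM_REG, HUMAN_REG = 1.7, 4.4  # % of genome in TRANSFAC-linked families (text)
WORM_TA, HUMAN_TA = 7.9, 11.8   # % of genome encoding TA homologues (Table 1)

share = regulator_share(WORM_REG, HUMAN_REG, WORM_TA, HUMAN_TA)
print(f"regulators: +{HUMAN_REG - WORM_REG:.1f}%, all TA proteins: +{HUMAN_TA - WORM_TA:.1f}%")
print(f"share of the increase explained by regulators: {share:.1f}%")
```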
DISCUSSION
A salient feature of the protein families identified by the clustering procedure described above is
the degree of taxon specificity displayed by the TA protein homologues within the families. A prior
expectation, based on previous studies (7,8), of the outcome of this analysis may have been that
~10% of families would contain homologues present in all three eukaryotic kingdoms. Of the remaining
families, a minority would be specific to unicellular organisms with the majority being specific to
multicellular organisms, with a large proportion of the animal homologues being species-specific.
Figure 1 shows that 14% of the low granularity clusters contain homologues present in all six
eukaryotes (in agreement with the above expectation), implying that proteins closely related to
these sequences are widely dispersed throughout the eukaryotic domain. In 69% of the families
identified, the TA protein homologues within a family are unique to one of the three kingdoms. As
these data are based on sequences identified by complete genome searches, this suggests that the
sharing of TA proteins at the eukaryotic kingdom level is limited and taxon-specific, although the
number of complete eukaryotic genomes is still limited. However, the degree of species specificity
differs markedly in the fungal and animal kingdoms: 53% of metazoan-specific clusters contain
homologues present in multiple species and this contrasts greatly with the fungal-specific clusters,
of which only 15% contain homologues shared between both yeasts. Given the similarities of the
biologies of budding and fission yeast (25) in comparison with the very substantial differences in
the levels of cellular diversification between nematodes and vertebrates, the observation that
fungal TA protein families show greater species specificity than do the metazoan TA protein families
is unexpected.
The taxonomic origins of the homologues present in the clusters whose TA protein members play a role
in transcription initiation are far more uniform than the origins of the regulatory factor
homologues. Homologues from all three eukaryotic kingdoms are represented in each of the 12 human
RNAP II subunit families, suggesting that RNAP II is highly conserved across the eukaryotic domain.
In addition, 53% of families encoding basal transcription factor components and 69% encoding
transcriptional cofactors contain homologues originating from the three eukaryotic kingdoms,
implying that proteins encoding these functions show higher rates of evolution than do the subunits
of RNAP II. This increased level of divergence may be expected of the cofactors as they mediate
gene-specific transcription, but not of the basal components whose role in transcription initiation
is generic. In only 9% of the well-characterised transcriptional regulator families (i.e. those
containing TA proteins linked to TRANSFAC) do the homologues originate from the three kingdoms. The
percentage of metazoan-specific regulator families in which the homologues originate from two or
more species is 67% (compared with 53% for the entire data set), whereas in only 10% (15% overall)
of the fungal-specific families are the homologues present in both yeasts. Hence, the taxon
distribution patterns observed for the transcriptional regulator families are similar to, but
slightly more pronounced than, those observed for the entire data set.
The TA protein families present solely in animals display 3.6 times (6.7 times for the
transcriptional regulator families) the level of homologue sharing between species than do fungi,
despite their considerable differences in cellular complexity. Additionally, the proportion of each
metazoan genome encoding TA proteins increases with complexity and 70% of this increase from
C.elegans to H.sapiens can be accounted for by sequences encoding regulators. However, there is
strong evidence that evolutionary changes in developmental gene regulation have shaped large-scale
differences in animal body plans and parts, and regulatory DNA is the predominant source of genetic
diversity underlying this variation (26). The low inflation clustering operation identified 56, 77
and 170 Hox genes in C.elegans, D.melanogaster and H.sapiens, respectively. These homeobox genes are
involved in the determination of the animal body plan. After normalising for genome size, it would
appear that humans have 2.0 times more Hox genes than worms (58.0 versus 29.3) and flies 1.9 times
the number (56.2 versus 29.3), whereas the vertebrate genome encodes only 3% more than the insect
genome (Fig. 3, Family 3.1.1).
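The normalisation used for these comparisons is the number of family members per 10 000 genes. The Python sketch below reproduces the Hox figures from the gene counts given above and the genome sizes in Table 1; the function name is illustrative.

```python
GENOME_SIZE = {"C.elegans": 19099, "D.melanogaster": 13710, "H.sapiens": 29304}  # Table 1
HOX_COUNT = {"C.elegans": 56, "D.melanogaster": 77, "H.sapiens": 170}            # text


def per_10k_genes(count, genome_size):
    """Number of occurrences normalised by genome size, per 10 000 genes."""
    return 1e4 * count / genome_size


norm = {sp: per_10k_genes(HOX_COUNT[sp], GENOME_SIZE[sp]) for sp in HOX_COUNT}
for sp, value in norm.items():
    print(f"{sp:15s} {value:5.1f}")

# Fold differences relative to the worm, and human relative to fly.
print(round(norm["H.sapiens"] / norm["C.elegans"], 1))
print(round(norm["D.melanogaster"] / norm["C.elegans"], 1))
print(round(100.0 * (norm["H.sapiens"] / norm["D.melanogaster"] - 1.0)))
```

The same function applied to the 152 MYB-containing A.thaliana sequences and the A.thaliana genome size in Table 1 gives the value of 55.5 quoted for plants.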
The normalised number of MYB genes in C.elegans, D.melanogaster and H.sapiens is, respectively, 1.1,
5.1 and 2.4 and in A.thaliana this is substantially elevated to 55.5 (Fig. 3, Family 3.5.1). Hence,
TRIBE-MCL identifies 152 MYB-containing sequences from the predicted plant proteome, compared with
190 members of the MYB superfamily using the entire set of genomic sequences (8). However, both data
sets show that the Arabidopsis MYB family is highly amplified in comparison with the animal
families. MYB proteins are widely spread throughout the eukaryotic domain and control cell
proliferation and differentiation, with the additional Arabidopsis genes controlling many aspects of
plant secondary metabolism as well (27). The small variation in normalised MYB gene number across
the three metazoan species is similar to that observed with the Hox genes and supports the idea that
it is primarily the evolution of gene expression that underlies morphological diversity. An
important caveat, however, is the possibility of a higher degree of differential splicing and
post-translational modification, as well as the presence of unidentified TA-protein-coding genes
with complex gene structure, in higher animals concealing TA protein diversity.
An explanation for the very low amount of homologue sharing between budding and fission yeast in the
fungal-specific TA protein families could lie in their evolutionary distances: C.elegans is closer
to H.sapiens than S.pombe is to S.cerevisiae (28). This analysis suggests that the eukaryotic
transcriptional process, despite (or because of) its fundamental role, is highly diversified in the
six species of the crown eukaryote group examined, as exemplified by the presence or absence of TA
protein homologues in their entire genomes. Possible reasons for this high degree of specificity may
include the different needs to respond to environmental and genetic stimuli as well as some form of
'genetic immunity': it may be beneficial for a species to retain a specific repertoire of TA
proteins, possibly avoiding catastrophic changes in genetic control by laterally transferred genes
from other species. Speciation itself may be largely due to subtle changes in gene regulation via
the selective acquisition or loss of TA proteins, as well as changes in their specificity.
Metabolic enzymes are in general highly conserved in sequence and phylogenetic distribution (29),
yet a direct comparison of the phylogenetic distribution of TA proteins in eukaryotes with other
biological processes, including metabolism, is not currently feasible using the same approach. It
would be interesting to examine the extent to which other fundamental biological processes, such as
translation or DNA repair, exhibit similar patterns of taxon specificity. To achieve this, a more
elaborate classification of the roles of gene products for eukaryotes is required, such as the one
provided by the GO classification scheme (21). Once sufficient coverage of eukaryotic genomes has
been achieved, it would be interesting to perform comparative studies across species and other
biological processes.
ACKNOWLEDGEMENTS
R.M.R.C. and C.A.O. would like to thank the Medical
Research Council for supporting this work through a Special
Training Fellowship in Bioinformatics to R.M.R.C.
REFERENCES
1. Struhl,K. (1999) Fundamentally different logic of gene regulation in
eukaryotes and prokaryotes. Cell, 98, 1-4.
2. Roeder,R.G. (1996) The role of general initiation factors in transcription
by RNA polymerase II. Trends Biochem. Sci., 21, 327-335.
3. Woychik,N.A. and Hampsey,M. (2002) The RNA polymerase II
machinery: structure illuminates function. Cell, 108, 453-463.
4. Emerson,B.M. (2002) Specificity of gene regulation. Cell, 109, 267-270.
5. Ptashne,M. and Gann,A. (2002) Genes and Signals. Cold Spring Harbor
Laboratory Press, Cold Spring Harbor, NY.
6. Kyrpides,N.C. and Ouzounis,C.A. (1999) Transcription in archaea.
Proc. Natl Acad. Sci. USA, 96, 8545-8550.
7. Coulson,R.M., Enright,A.J. and Ouzounis,C.A. (2001) Transcription-associated
protein families are primarily taxon-specific. Bioinformatics, 17, 95-97.
8. Riechmann,J.L., Heard,J., Martin,G., Reuber,L., Jiang,C., Keddie,J.,
Adam,L., Pineda,O., Ratcliffe,O.J., Samaha,R.R. et al. (2000)
Arabidopsis transcription factors: genome-wide comparative analysis
among eukaryotes. Science, 290, 2105-2110.
9. Wood,V., Gwilliam,R., Rajandream,M.A., Lyne,M., Lyne,R., Stewart,A.,
Sgouros,J., Peat,N., Hayles,J., Baker,S. et al. (2002) The genome
sequence of Schizosaccharomyces pombe. Nature, 415, 871-880.
10. Goffeau,A., Barrell,B.G., Bussey,H., Davis,R.W., Dujon,B.,
Feldmann,H., Galibert,F., Hoheisel,J.D., Jacq,C., Johnston,M. et al.
(1996) Life with 6000 genes. Science, 274, 546, 563-567.
11. The Arabidopsis Genome Initiative (2000) Analysis of the genome
sequence of the flowering plant Arabidopsis thaliana. Nature, 408,
796-815.
12. The C. elegans Sequencing Consortium (1998) Genome sequence of the
nematode C. elegans: a platform for investigating biology. Science, 282,
2012-2018.
13. Adams,M.D., Celniker,S.E., Holt,R.A., Evans,C.A., Gocayne,J.D.,
Amanatides,P.G., Scherer,S.E., Li,P.W., Hoskins,R.A., Galle,R.F. et al.
(2000) The genome sequence of Drosophila melanogaster. Science, 287,
2185-2195.
14. Lander,E.S., Linton,L.M., Birren,B., Nusbaum,C., Zody,M.C.,
Baldwin,J., Devon,K., Dewar,K., Doyle,M., FitzHugh,W. et al. (2001)
Initial sequencing and analysis of the human genome. Nature, 409,
860-921.
15. Venter,J.C., Adams,M.D., Myers,E.W., Li,P.W., Mural,R.J., Sutton,G.G.,
Smith,H.O., Yandell,M., Evans,C.A., Holt,R.A. et al. (2001) The
sequence of the human genome. Science, 291, 1304-1351.
16. Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence
database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28,
45-48.
17. Etzold,T. and Argos,P. (1993) SRS - an indexing and retrieval tool for
flat file data libraries. Comput. Appl. Biosci., 9, 49-57.
18. Promponas,V.J., Enright,A.J., Tsoka,S., Kreil,D.P., Leroy,C.,
Hamodrakas,S., Sander,C. and Ouzounis,C.A. (2000) CAST: an iterative
algorithm for the complexity analysis of sequence tracts. Complexity
analysis of sequence tracts. Bioinformatics, 16, 915-922.
19. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z.,
Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST:
a new generation of protein database search programs. Nucleic Acids
Res., 25, 3389-3402.
20. Dwight,S.S., Harris,M.A., Dolinski,K., Ball,C.A., Binkley,G.,
Christie,K.R., Fisk,D.G., Issel-Tarver,L., Schroeder,M., Sherlock,G. et al.
(2002) Saccharomyces Genome Database (SGD) provides secondary
gene annotation using the Gene Ontology (GO). Nucleic Acids Res., 30,
69-72.
21. Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M.,
Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T. et al. (2000) Gene
ontology: tool for the unification of biology. The Gene Ontology
Consortium. Nature Genet., 25, 25-29.
22. Wingender,E., Dietze,P., Karas,H. and Knuppel,R. (1996) TRANSFAC:
a database on transcription factors and their DNA binding sites.
Nucleic Acids Res., 24, 238-241.
23. Enright,A.J., Van Dongen,S. and Ouzounis,C.A. (2002) An efficient
algorithm for large-scale detection of protein families. Nucleic Acids
Res., 30, 1575-1584.
24. Enright,A.J. and Ouzounis,C.A. (2000) GeneRAGE: a robust algorithm
for sequence clustering and domain detection. Bioinformatics, 16,
451-457.
25. Sipiczki,M. (2000) Where does fission yeast sit on the tree of life?
Genome Biol., 1, r1011.1011-r1011.1014.
26. Carroll,S.B. (2000) Endless forms: the evolution of gene regulation and
morphological diversity. Cell, 101, 577-580.
27. Stracke,R., Werber,M. and Weisshaar,B. (2001) The R2R3-MYB gene
family in Arabidopsis thaliana. Curr. Opin. Plant Biol., 4, 447-456.
28. Feng,D.F., Cho,G. and Doolittle,R.F. (1997) Determining divergence
times with a protein clock: update and reevaluation. Proc. Natl Acad. Sci.
USA, 94, 13028-13033.
29. Doolittle,R.F., Feng,D.F., Tsang,S., Cho,G. and Little,E. (1996)
Determining divergence times of the major kingdoms of living organisms
with a protein clock. Science, 271, 470-477.