Comparative Analysis of the Genomes of Cyanobacteria and Plants

Genome Informatics 13: 173–182 (2002)
Naoki Sato
[email protected]
Department of Molecular Biology, Saitama University, Shimo-Ohkubo 255, Saitama
338-8570, Japan
Abstract
The chloroplast genome originates from the genome of an ancestral cyanobacterial endosymbiont.
Advances in genome sequencing have made it possible to compare the genomes of cyanobacteria and
plants. I report here current results of our computational efforts to compare these genomes and to
trace the evolution of cyanobacteria, chloroplasts and plants. Cyanobacteria form a clearly defined
monophyletic clade with a reasonable level of diversity and are ideal for testing various approaches
to genome comparison. Analysis of short sequence features such as the genome signature was found
to be useful in characterizing cyanobacterial genomes. Genome contents were compared by homology
grouping of predicted protein-coding sequences, rather than by orthologue-based comparison, to
minimize the effects of multi-domain proteins and large protein families, both of which are important
in cyanobacterial genomes. Comparison of the genomes of six species of cyanobacteria suggests that
there are a number of species-specific additions of protein genes, and this information is useful in
reconstructing phylogenetic relationships. The homology groups in cyanobacteria were used as a
reference to compare plants and non-photosynthetic organisms. The results suggest that the 238 groups
common to all organisms analyzed may define a minimal set of gene groups. In addition, only 80 groups
were identified as gene groups that could not have been acquired by plants without cyanobacterial
endosymbiosis. Further study is needed to identify plant genes of cyanobacterial origin.
Keywords: chloroplast, cyanobacterial evolution, comparative genomics, genome signature, homology group, plant genomics
1 Introduction
The chloroplast genome is believed to originate from the genome of an ancestor of extant cyanobacteria
[21, 22, 25]. However, the situation is far more complicated than simple endosymbiosis. Since the
smallest genome of present-day cyanobacteria is about 1.7 Mbp (Prochlorococcus marinus MED4),
the original genome of the endosymbiont might have been of this order. In contrast, all known
chloroplast genomes are between 80 and 200 kbp in size. The chloroplast genome encodes genes for
rRNA and tRNA, as well as genes for photosystem proteins, subunits of ATP synthase, ribosome
subunits, etc. The large difference in size between chloroplast genomes and cyanobacterial genomes is
partly due to the vast loss of genes from the original endosymbiont genome early in the evolution
of the chloroplast. Some of the endosymbiont genes were not simply lost but were transferred to
the cell nucleus [1, 16]. These nuclear genes encode proteins that are targeted to the chloroplast.
Each of these proteins is synthesized with a transit peptide at the N-terminus, which enables it to be
recognized by the protein transport machinery, now called the translocon, that is located in the
chloroplast envelope membrane. The nuclear-encoded chloroplast proteins include many proteins
involved in photosynthesis as well as in gene expression within the chloroplast. In addition, the plant
nuclear genome includes genes that were transferred from the mitochondrial genome. There are cases
in which an originally mitochondrial enzyme encoded in the nuclear genome is also targeted to the
chloroplast. The T7 phage-type RNA polymerase is a typical example [12].
Comparative genomics is a powerful tool for analyzing the relationships among different organisms.
The sequence of rRNA has been used to estimate the phylogeny of various organisms including
cyanobacteria [9], but whole-genome data are expected to give a more reliable phylogeny [3, 6]. Different
ways of using whole-genome data have been reported, such as genome signature analysis that uses
dinucleotide relative abundance [4, 15], genome alignment [27] and comparison of genome content [23],
among others. There are two major approaches to comparing genome content. One approach, which
uses orthologues [23, 24, 27], is straightforward and easy to understand, but orthologues are usually
difficult to define because of the presence of protein families as well as multi-domain proteins. The other
approach, which uses homology groups [10] rather than individual genes, can tolerate the problem of gene
families. The problem of multi-domain proteins (see below) is still to be solved.
I report here an initial analysis of the relationship between cyanobacteria and chloroplasts or plants.
The traditional approach has been to enumerate enzymes or genes encoded in the plant nucleus or
chloroplasts that are more similar to their counterparts in cyanobacteria than to the counterparts in
eukaryotes or other bacteria. By this manual comparison, the core subunits and sigma factors of the
chloroplast RNA polymerase have been demonstrated to be of cyanobacterial origin. Most of the
ribosomal proteins, molecular chaperones, subunits of the translocon, and enzymes for pigment synthesis
have also been shown to originate from counterparts in the endosymbiont. However, this approach is not
powerful enough to enumerate all the possible homologues. One problem is that previous efforts of this
kind used only Synechocystis sp. PCC 6803 as a representative of cyanobacteria and Arabidopsis thaliana
as a representative of plants [1, 16]. As I show below, each cyanobacterial genome includes many
species-specific genes. We should first define the genes that are common to all cyanobacteria, and then
compare them with the plant genes. In this comparison, we should also use various
genomes of non-photosynthetic organisms as a negative control. The second problem is that there are
many paralogues in the genomes of cyanobacteria and plants. This makes it difficult to identify
orthologues between cyanobacteria and plants. In addition, there are a number of multi-domain proteins
in the cyanobacteria, especially in Anabaena sp. 7120 [19] (also called Nostoc sp. PCC 7120). Multi-domain
proteins are particularly abundant in filamentous strains of cyanobacteria, such as Anabaena
sp. 7120 and Nostoc punctiforme PCC 73102 [17]. These proteins are difficult to classify, and no
direct orthologues of these proteins can be found in other species of cyanobacteria or plants. In the
homology group method, the multi-domain proteins are hidden within large homology groups. This
problem should be solved in the future.
In the present report, I will describe results of analysis of short sequence features as well as
analysis of homology groups of cyanobacteria. The homology group method was extended to compare
cyanobacteria and plants.
2 Methods
2.1 Database
The following database sequences were used. Cyanobacteria: Synechocystis sp. PCC 6803 (Sy),
GenBank AB001339 and Cyanobase Synecho.p.aa and Synecho.p.table [13]; Anabaena sp. 7120
(An), GenBank BA000019 and Cyanobase chromo.p.aa and chromo.p.table [14]; Nostoc punctiforme
PCC 73102 (Np), JGI Web site (microbe4) [17]; Prochlorococcus marinus MED4 (Pm1), JGI Web
site (microbe2); Prochlorococcus marinus MIT9313 (Pm2), JGI Web site (microbe2b); Synechococcus
sp. WH8102 (S81), JGI Web site (2351364). The Synechocystis database was recently re-annotated,
but the data used in the present study were those from the initial release. The DNA
sequence has remained unchanged since the first release, and the updated annotation involves
no large changes in the open reading frames. The databases
(finished contigs and draft CDS translations) of JGI [29] were retrieved in January 2002. Plant:
Arabidopsis thaliana [26] all predicted protein sequences (Ath), faa file from the NCBI genome
database site (mirrored in GenomeNet [28] under /pub/db/ncbi/genbank/genomes/A_thaliana/).
Other organisms: Escherichia coli K-12 (Ec), GenBank U00096; Bacillus subtilis 168 (Bs), GenBank
AL009126; Helicobacter pylori 26695 (Hp), GenBank AE000511; Saccharomyces cerevisiae all
predicted protein sequences (Sc), *.faa and *.ptt files from the NCBI genome database site (mirrored
in GenomeNet [28] under /pub/db/ncbi/genbank/genomes/S_cerevisiae/). Other draft bacterial
sequences (Rhodopseudomonas palustris and Rhodobacter sphaeroides) were also retrieved from the
FTP site of JGI [29]. In the following text, the genus names in roman type are used to denote the
organisms listed above. Short symbols of two or three characters, such as Sy or Ath, are also
used where appropriate, for example in the tables and figures.
2.2 Basic Computational Analysis
All sequence manipulations and the analysis of short sequence features, including palindromes
(command ‘sites’) and GC skew (command ‘genlist’), were done with the SISEQ package [20], version
130pre24. The dinucleotide relative abundance or genome signature analysis [4, 15] was performed with
a separate program, ‘dinucf’. Phylogenetic analysis was performed with the Phylip package [7]. Most of
the calculations were done on a Linux workstation with an Alpha 21264 processor and on MacOS X
workstations with G4 processors.
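The genome signature calculation mentioned above can be illustrated with a short sketch. The code below is not the dinucf program. It is a minimal Python illustration that assumes the symmetrized definition of Karlin and Cardon [15], rho*_XY = f*_XY / (f*_X f*_Y), with frequencies pooled over both strands, and takes the mean absolute difference over the 16 dinucleotides as the distance [4]. The function names and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
BASES = "ACGT"


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def dinucleotide_relative_abundance(seq):
    """Return {dinucleotide: rho*} for the 16 dinucleotides.

    Mono- and dinucleotide frequencies are pooled over the given strand
    and its reverse complement (the symmetrized form); characters other
    than A, C, G and T are skipped.
    """
    seq = seq.upper()
    mono, di = Counter(), Counter()
    for strand in (seq, reverse_complement(seq)):
        mono.update(b for b in strand if b in BASES)
        di.update(
            strand[i:i + 2]
            for i in range(len(strand) - 1)
            if strand[i] in BASES and strand[i + 1] in BASES
        )
    n_mono, n_di = sum(mono.values()), sum(di.values())
    rho = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        rho[x + y] = (di[x + y] / n_di) / expected if expected else float("nan")
    return rho


def dra_distance(rho_a, rho_b):
    """Mean absolute difference over the 16 DRA values."""
    return sum(abs(rho_a[k] - rho_b[k]) for k in rho_a) / len(rho_a)


# Toy sequences (made up for illustration, not real genome data).
genome_a = "ATGCGATCGCTTAGGCTAAGCCGATCGCAATT" * 20
genome_b = "ATATTTAAGCATTTAAATGCAAATTTATATGC" * 20
rho_a = dinucleotide_relative_abundance(genome_a)
rho_b = dinucleotide_relative_abundance(genome_b)
print(round(1000 * dra_distance(rho_a, rho_b)))  # distance x1000, as in Table 2
```

Because the counts are pooled over both strands, complementary dinucleotides such as AA and TT receive identical values, so only 10 of the 16 values are distinct.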
2.3 Homology Group Analysis
Homology group analysis was based essentially on BLAST2 (version 2.2.1) scores.
All the predicted amino acid sequences of the three species of cyanobacteria (Sy, An and Np) were
compared with one another by the BLASTP program [2] using a cutoff E value of 10^-8. The results were
processed by PERL scripts to obtain homology groups. In this method, a protein joins a group if it has
similarity to at least one member of the group. Because of the presence of various multi-domain proteins,
the largest group is very large and includes various unrelated proteins. But this situation does not seriously
influence the genome comparison, as described below. A similar analysis was also performed with all
six species of cyanobacteria. Then, the proteins of the three species of cyanobacteria were
searched for homologues against all the predicted protein sequences of the other three species of
cyanobacteria (Pm1, Pm2, and S81), Ec, Bs, Sc (including the mitochondrial genome), and Ath (including
the mitochondrial and chloroplast genomes) individually. The results were compiled according to the
groups established with the three cyanobacterial sequences.
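The grouping rule described above amounts to single-linkage clustering of the BLASTP hit graph. The following Python sketch illustrates the idea with a union-find structure. It is not the PERL script used in this study: the input format (a list of query, subject, E value tuples), the protein identifiers, and the choice to accept hits with an E value at or below the cutoff are all assumptions made for illustration.

```python
from collections import defaultdict


def homology_groups(proteins, hits, evalue_cutoff=1e-8):
    """Single-linkage grouping of proteins from pairwise hits.

    Two proteins end up in the same group if any chain of hits with
    E value <= evalue_cutoff connects them.
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the group representative (path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for query, subject, evalue in hits:
        if evalue <= evalue_cutoff and query in parent and subject in parent:
            root_q, root_s = find(query), find(subject)
            if root_q != root_s:
                parent[root_s] = root_q  # merge the two groups

    groups = defaultdict(set)
    for p in proteins:
        groups[find(p)].add(p)
    # Largest groups first; ties broken by member names for stable output.
    return sorted(groups.values(), key=lambda g: (-len(g), sorted(g)))


# Hypothetical identifiers and hits, for illustration only.
proteins = ["An_p1", "Np_p1", "Sy_p1", "An_p2", "Sy_p2"]
hits = [
    ("An_p1", "Np_p1", 1e-30),  # share one domain
    ("Np_p1", "Sy_p1", 1e-12),  # share another domain; Np_p1 is multi-domain
    ("An_p2", "Sy_p2", 1e-5),   # weaker than the cutoff, ignored
]
for group in homology_groups(proteins, hits):
    print(sorted(group))
```

In this toy input, An_p1 and Sy_p1 have no direct hit but are placed in one group through the multi-domain protein Np_p1, which is the mechanism that makes the largest group very large.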
Table 1: Low and high frequency sequences in the genomes of Anabaena and selected bacteria.
Obs/Exp is the ratio of the observed to the expected number of occurrences of each sequence. The expected value was estimated from the base composition of the sequence and the GC content of the genome.
Dinucleotide relative abundance (DRA) was not considered in the estimation of expected values. An,
Anabaena sp. PCC 7120; Np, Nostoc punctiforme; Sy, Synechocystis sp. PCC 6803; Hp, Helicobacter pylori; Ec, Escherichia coli K-12; Pm1, Prochlorococcus marinus MED4. RE, restriction enzyme.
These values were calculated by the SISEQ package [20].
Genome  Size    GC%   Low-frequency RE sites (Obs/Exp)  High-frequency palindrome
        (kbp)         AvaI     AvaII    AvaIII          Octamer     Obs/Exp
                      CYCGRG   GGWCC    ATGCAT          sequence
An      6,413   41.4  0.006    0.031    0.011           GCGATCGC    122.1
Np      9,216   41.5  0.113    0.088    0.366           GCGATCGC    120.7
Sy      3,573   47.7  0.534    0.418    0.243           GCGATCGC    70.1
Hp      1,667   38.9  0.255    0.130    0.639           GCGATCGC    10.4
Ec      4,639   50.8  0.259    0.303    0.765           (none)
Pm1     1,674   30.9  0.601    1.364    0.813           GCTGCAGC    13.3

Other RE sites (Obs/Exp <0.05):
An: AhaII (NarI, AatII), ApaLI, AvaIVP, AvrII, BamHI, BanII (SacI), BglII, BsiWI (= SplI), BstBI (= AsuII), FspI, NcoI, NspI, PmlI, PstI, SacII, SalI, SphI
Np: ApaLI, AvaIVP, BglII, NcoI, NdeI, SacI
Sy: BssHII, MluI
Hp: ScaI, XhoI, AatII, KpnI, SalI, HpaI, SnaBI
Ec: AvrII, XbaI
Pm1: none
Table 2: Genome signature profile of photosynthetic prokaryotes and selected bacteria. Bold: >1.2;
Underline: <0.85. Dinucleotide relative abundances (DRA) were calculated by the dinucf program
according to the formulae of Karlin and Cardon [15]. The distance between a pair of genomes was
calculated as the mean of the absolute differences in the 16 DRA values [4]. An, Anabaena sp. PCC 7120; Np,
Nostoc punctiforme; Sy, Synechocystis sp. PCC 6803; Pm1, Prochlorococcus marinus MED4; S81,
Synechococcus sp. WH8102; Bs, Bacillus subtilis; Ec, Escherichia coli K-12; Rp, Rhodopseudomonas
palustris; Rs, Rhodobacter sphaeroides.
Dinucleotide relative abundances
Species  GC    CG    AT    TA    TT/AA  CC/GG  TC/GA  CT/AG  TG/CA  AC/GT
An       1.16  0.78  0.93  0.84  1.15   1.11   0.92   0.99   1.09   0.89
Np       1.24  0.81  0.92  0.81  1.17   1.05   0.94   1.02   1.09   0.86
Sy       1.02  0.75  1.00  0.75  1.32   1.36   0.86   0.85   1.05   0.79
Pm1      1.17  0.51  0.92  0.79  1.17   1.28   1.08   1.09   1.00   0.72
S81      1.10  0.87  1.13  0.43  1.09   0.95   1.08   1.00   1.25   0.85
Bs       1.27  1.04  1.02  0.65  1.24   0.97   1.06   0.91   1.08   0.75
Ec       1.28  1.16  1.10  0.75  1.21   0.91   0.92   0.82   1.12   0.88
Rp       1.20  1.31  1.43  0.44  1.08   0.75   1.24   0.87   1.02   0.86
Rs       1.12  1.16  1.57  0.39  0.99   0.85   1.31   0.99   0.97   0.75

Distance (x1000)
Species  An    Np    Sy    Pm1   S81   Bs    Ec    Rp
Np       30
Sy       115   128
Pm1      109   109   135
S81      123   113   199   173
Bs       115   92    143   141   109
Ec       107   95    151   202   136   82
Rp       217   203   263   252   147   158   154
Rs       229   220   285   234   153   169   202   93
Figure 1: Phylogenetic tree estimated from dinucleotide relative abundance (DRA) distances. This
tree was constructed by the neighbor program of
the Phylip package.
3 Results and Discussion
3.1 Short Sequence Features of Cyanobacterial Genomes
As an initial characterization of the cyanobacterial genomes, short sequence features
were analyzed. The frequencies of short palindromic sequences such as restriction sites are
summarized in Table 1. The Anabaena genome is virtually the only bacterial genome known to date in which
many restriction sites are unusually underrepresented. There are only 18 AvaI sites within the whole
genome, which is only 0.006 times the expected abundance. A number of other restriction
sites are also highly underrepresented. I analyzed more than 40 bacterial genomes, but only Anabaena,
Nostoc and Helicobacter showed such peculiarity. However, Nostoc has many more AvaI and AvaIII
sites than does Anabaena. In contrast, Anabaena, Nostoc and Synechocystis share an identical
high-frequency palindrome, GCGATCGC.
Again, this is the only very high-frequency palindrome found so far in bacterial genomes. This
palindrome is found in both coding and non-coding sequences, and the reason for its high frequency
is still not clear. In E. coli or other bacteria, no palindromic sequence was found that occurs at a
frequency higher than expected from random combination.
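The Obs/Exp ratio discussed above can be checked with a short calculation. The sketch below is not the SISEQ ‘sites’ command. It assumes that the expected count is the product of per-position base probabilities derived from the GC content alone (G = C = GC/2, A = T = (1 - GC)/2), multiplied by the number of possible start positions, and that sites are counted on one strand, which suffices for palindromes. The function names are hypothetical, and the genome length is the rounded value from Table 1.

```python
import re

# IUPAC codes needed for the recognition sequences in Table 1, plus a few others.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "M": "AC", "K": "GT", "N": "ACGT",
}


def site_probability(site, gc):
    """Probability that a random position matches the site, given GC content."""
    base_p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    p = 1.0
    for code in site.upper():
        p *= sum(base_p[b] for b in IUPAC[code])
    return p


def expected_count(site, genome_length, gc):
    """Expected number of occurrences in a random sequence of this composition."""
    return site_probability(site, gc) * (genome_length - len(site) + 1)


def observed_count(site, genome):
    """Count occurrences of a (possibly degenerate) site, including overlaps."""
    pattern = "(?=" + "".join("[" + IUPAC[c] + "]" for c in site.upper()) + ")"
    return len(re.findall(pattern, genome.upper()))


# AvaI (CYCGRG) in Anabaena: 18 sites observed, 6,413 kbp, 41.4% GC (Table 1).
exp_avai = expected_count("CYCGRG", 6_413_000, 0.414)
print(round(18 / exp_avai, 3))  # 0.006, in line with Table 1
```

Under these assumptions about 2,900 AvaI sites would be expected in a genome of this size and composition, so 18 observed sites correspond to the ratio of 0.006 reported in Table 1.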
Table 2 shows the results of dinucleotide relative abundance (DRA) analysis of cyanobacteria,
photosynthetic bacteria, and selected other bacteria. The DRA values (or genome signature) are largely
similar among cyanobacteria, but Synechococcus is divergent. This is clearly depicted in Fig. 1, a
phylogenetic tree drawn using the difference in DRA as the distance. Although DRA cannot be used as
a marker of phylogenetic relationship in a strict sense, some relationship between phylogeny and
DRA is confirmed.
I have also analyzed the GC skew of various cyanobacterial genomes as another sequence feature. The
cumulative GC skew - AT skew diagram has been successfully used to predict the replication origin of
some bacteria [8]. The GC skew values are generally very small throughout the
completely sequenced genomes of Anabaena and Synechocystis. Therefore, no clear-cut turning point
in the cumulative GC skew - AT skew diagram was detected in these cyanobacteria. Currently, we
have no clue for identifying the replication origin of these cyanobacterial genomes. However, there are
some regions with discontinuous GC skew values (results not shown). These regions are potential sites
of recent horizontal gene transfer.
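The idea of a turning point in a cumulative skew diagram can be made concrete with a small sketch. The code below is not the SISEQ ‘genlist’ command and covers only the GC skew part of the diagram. It assumes the common per-window definition (G - C)/(G + C) summed along the sequence, with non-overlapping windows. The function name, the window size, and the toy sequence are hypothetical.

```python
def cumulative_gc_skew(seq, window=1000):
    """Return the running sum of (G - C)/(G + C) over non-overlapping windows."""
    seq = seq.upper()
    total = 0.0
    curve = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skew = (g - c) / (g + c) if g + c else 0.0
        total += skew
        curve.append(total)
    return curve


# Toy sequence: a C-rich half followed by a G-rich half (not real genome data).
toy = "CCCG" * 250 + "GGGC" * 250
curve = cumulative_gc_skew(toy, window=500)
turning_point = min(range(len(curve)), key=curve.__getitem__)
print(curve, turning_point)
```

In the toy sequence the curve falls through the C-rich half and rises through the G-rich half, giving a sharp extremum at the junction. When the skew values are very small throughout, as reported here for Anabaena and Synechocystis, the curve stays nearly flat and no such extremum stands out.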
3.2 Homology Group Analysis of Cyanobacterial Genomes
As an initial attempt to compare the genome contents of cyanobacteria, all the predicted protein
sequences of Anabaena, Nostoc and Synechocystis were grouped according to the BLASTP score, and
the resulting groups are shown in Fig. 2A. It is clear that a large number of groups (1,343) are shared
by the three species of cyanobacteria. What is more surprising is that each genome contains
species-specific groups, and the number of species-specific groups amounted to 20–30% of the total
groups in each genome. Another point is that as many as 827 groups are shared by Anabaena and Nostoc,
and these groups are presumed to be related to the ability of these two species of cyanobacteria to
differentiate heterocysts, akinetes, and hormogonia. This three-species comparison was then extended
to include six species (Tables 3, 4, 5). Inclusion of the three additional species did not change the
essential points described above. In each species of cyanobacteria, about 17–40% of the groups are unique.
A close examination of the members of these groups suggests that some of the proteins are similar to
predicted proteins of other bacteria. This indicates that at least some of the species-specific protein
groups originate from genes imported by horizontal gene transfer [18]. Horizontal gene transfer has
been identified mostly for genes of clearly defined function. But the data of the present study suggest
that horizontal gene transfer might be more ubiquitous. However, it is not clear whether all of the
species-specific additions were generated solely by horizontal gene transfer. Since there is an almost
unlimited number of unknown bacteria in the world, horizontal gene transfer from these unidentified
species could account for the species-specific additions. It would be difficult to demonstrate this,
however, because the number of unidentified bacterial species cannot be assessed exactly, and further
work is needed to analyze the species-specific additions. Similar species-specific additions are also
documented in various pairs of closely related bacteria [11].
The relationship among the six species of cyanobacteria is more clearly understood if the homology
groups are used to construct a phylogenetic tree by the parsimony method (Fig. 3). Fortunately, the trees
obtained with two different cutoff E values of BLASTP coincided perfectly, although the actual numbers
of species-specific additions and losses are significantly different. In this analysis, Anabaena and
Nostoc form a group sister to Synechocystis, and all these species are paraphyletic to the other three
species. Curiously, the two strains of P. marinus are not monophyletic, as suggested by the analysis
using the 16S rRNA sequences [9]. The two genera Prochlorococcus and Synechococcus differ
in the composition of photosynthetic accessory pigments (divinyl chlorophyll a, chlorophyll b and
phycobiliproteins). On the other hand, the two strains of P. marinus also differ in their
habitats: MED4 is a high-light-adapted strain that lives in the shallow zone of the sea, while MIT9313 is a
low-light-adapted strain that lives in the deep ocean, more than 100 m below the surface. The genome
sizes, and therefore the gene contents, of these strains are also different. More data will be needed to
understand the phylogenetic relationship of Prochlorococcus and Synechococcus.
Figure 2: A. Schematic diagram of gene groups in Anabaena (An), Nostoc (Np) and Synechocystis
(Sy). Similarity among the 5,381 predicted protein sequences (PPS) of An, 3,173 PPS of Sy, and
7,445 PPS of Np was scored by the blastp program, and the PPSs were clustered with a cutoff value
of 10^-8. The resulting groups were classified according to the presence of homologues in these three
genomes: 834 gene groups (green area) were not present in Sy or Np, 827 groups (blue area) were shared
by An and Np but not by Sy, 67 groups (uncolored area) were shared by An and Sy but not by Np,
71 groups (uncolored area) were shared by Np and Sy but not by An, and 1,343 groups (pale green
area) were common to the three organisms. The area of each field is approximately proportional to
the number of groups shared by the various combinations of organisms.
B. Comparison of cyanobacteria, Arabidopsis and non-photosynthetic organisms. The 1,343 groups
shared by An, Np and Sy were further classified with respect to the presence of homologues in
various other organisms. Of these, 784 groups (enclosed by a blue line) were also shared with
Prochlorococcus marinus MED4 (Pm1), Prochlorococcus marinus MIT9313 (Pm2) and Synechococcus sp.
WH8102 (S8102), and 628 groups (enclosed by a green line) were also shared with the Arabidopsis
thaliana (Ath) genomes (nuclear, plastid and mitochondrial). A total of 511 groups were shared by the
six cyanobacteria and Arabidopsis, and 267 gene groups were also shared with Escherichia coli, Bacillus
subtilis, and Saccharomyces cerevisiae (nuclear and mitochondrial genomes). A fairly large number of
groups (762) were shared with at least one of these non-photosynthetic organisms. Finally, 238 groups,
shown in the central shaded area, were shared by all of the organisms analyzed.
3.3 Comparison of Homology Groups in Cyanobacteria, Plants and Other Organisms
The homology groups of the three species of cyanobacteria (Anabaena, Nostoc and Synechocystis)
were used as a reference to examine the presence of homology groups in plants and other organisms
(Fig. 2B). Here, the groups common to the three species of cyanobacteria (1,343 groups) were further
classified by the presence of homologues in the other three species of cyanobacteria (Pm1, Pm2
and S81), the plant Arabidopsis, or the non-photosynthetic microorganisms (Ec, Bs and Sc). All
six species of cyanobacteria shared 784 groups, of which 511 groups were also shared by Arabidopsis. The
238 groups that are shared by all the organisms analyzed are presumed to form the core building
blocks of all organisms. The 431 groups that are shared by the six cyanobacteria, Arabidopsis
and at least one of the three non-photosynthetic organisms should be further classified by detailed
phylogenetic analysis. It should be noted that there are only 80 groups that are shared by the six
cyanobacteria and Arabidopsis but not by any of the non-photosynthetic organisms. These groups include,
as expected, many proteins involved in photosynthesis and pigment biosynthesis, but also a number of
proteins of unknown function. Some of the photosynthesis genes are still encoded in the chloroplast
genome, but many have been transferred to the nucleus in Arabidopsis (supplementary information will be
available). A complete survey of the genes that have been transferred to the nucleus will be possible if
further genome information on algae and plants becomes available in the near future. The small number of
protein groups that are specific to photosynthetic organisms (80 groups) suggests that these are the
only proteins that could not have been gained by plants without cyanobacterial endosymbiosis. In other
words, all the other proteins, in the 431 groups also shared by non-photosynthetic organisms, could have
been present in plants as basic homology groups, or basic building blocks of eukaryotic heritage. This
does not mean that the proteins in these groups are not of cyanobacterial origin. Some of them are in
fact of endosymbiont origin, but we can hypothesize that homologues of these proteins could have been
provided by the eukaryotic host cell. A good example of this situation has been reported for glycolytic
enzymes, some of which are of cyanobacterial origin while others are not [5]. Another example
is the carbon fixation protein, ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO), which is
essential in photosynthesis. The large subunit of this enzyme is also found in some chemoautotrophs
as well as in organisms that are not normally associated with carbon fixation, such as B. subtilis. In many
cases, the genes encoding RuBisCO are thought to have been horizontally transferred. These considerations
suggest that the homology group method can shed light on the set of building blocks of an organism,
and is suitable for comparing very diverse organisms.
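The classification of reference groups by the organisms in which homologues are found can be sketched as a simple set operation. The Python code below is an illustration only: the group identifiers and the presence data are invented, the category labels are hypothetical, and the actual analysis in this study was compiled from BLASTP results rather than from a table of this form.

```python
from collections import Counter

CYANOBACTERIA = frozenset({"An", "Np", "Sy", "Pm1", "Pm2", "S81"})
PLANT = "Ath"
NON_PHOTOSYNTHETIC = frozenset({"Ec", "Bs", "Sc"})


def classify_group(present):
    """Assign a homology group to a category from the organisms it occurs in."""
    present = set(present)
    if not CYANOBACTERIA <= present:
        return "not shared by all six cyanobacteria"
    if PLANT not in present:
        return "six cyanobacteria, no Arabidopsis homologue"
    found = NON_PHOTOSYNTHETIC & present
    if found == NON_PHOTOSYNTHETIC:
        return "all organisms"
    if found:
        return "cyanobacteria + Arabidopsis + some non-photosynthetic"
    return "cyanobacteria + Arabidopsis only"


# Invented presence data: group id -> organisms with at least one homologue.
six = {"An", "Np", "Sy", "Pm1", "Pm2", "S81"}
presence = {
    "g001": six | {"Ath", "Ec", "Bs", "Sc"},
    "g002": six | {"Ath"},
    "g003": six | {"Ath", "Bs"},
    "g004": {"An", "Np", "Sy", "Ath", "Ec"},
    "g005": six | {"Ec"},
}
counts = Counter(classify_group(orgs) for orgs in presence.values())
for category, n in counts.items():
    print(n, category)
```

In terms of the counts given in the text, the first two categories sharing a non-photosynthetic homologue ("all organisms" and "some non-photosynthetic") together correspond to the 431 groups, of which 238 fall under "all organisms", and "cyanobacteria + Arabidopsis only" corresponds to the 80 groups; 431 + 80 gives the 511 groups shared by the six cyanobacteria and Arabidopsis.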
4 Conclusion and Prospects
The current analysis includes 6 cyanobacteria, a plant, two bacteria and a eukaryotic
microorganism. In the near future, genomes of 15 species of cyanobacteria will be available for
bioinformatic analysis. In this respect, cyanobacteria are the only group of organisms in which
detailed comparison of genome contents can be made. Many attempts to compare all known bacterial
genomes have been reported [3, 6, 10, 23, 24, 27], but the bacteria are highly divergent and it is not
currently possible to draw meaningful conclusions from the comparison of all bacteria. In contrast,
cyanobacteria form a well-defined monophyletic clade, yet they are more divergent than pairs of
strains of a single species (e.g., E. coli K12 and O157, or different strains of Helicobacter pylori). In
this respect, cyanobacteria are ideal for genomic comparison, and will provide a good test case for
future comparison of all bacteria or all organisms.
The present study also provides data on the comparison of cyanobacteria and plants. Attempts to
compare plants and cyanobacteria and to trace the origin of plant proteins have so far used a single
cyanobacterium (Synechocystis) and a single plant (Arabidopsis) [1]. As I described in the present
report, each cyanobacterium has its own species-specific genes. It is important to compare all available
cyanobacterial genomes with the genome of Arabidopsis, which is currently the only plant whose
genomic sequence is available with fairly reliable annotation. However, draft sequences of the genomes
of two varieties of rice have become available. The genome sequencing of a unicellular rhodophyte,
Cyanidioschyzon merolae, will be complete soon. The EST projects of a macrophytic rhodophyte,
Porphyra yezoensis, a green alga, Chlamydomonas reinhardtii, and a moss, Physcomitrella patens, as well
as of various legumes and cereals, will provide important information on expressed genes. Use of these
algal and plant sequences as well as sequences of non-photosynthetic organisms including human will shed
light on the origin and formation of plant genomes.
Development of software for the comparison of genome contents is also continuing. In the present
study, the largest cluster, which contains many unrelated proteins that are linked by various
multi-domain proteins, was not decomposed into subclusters or analyzed in detail. Another program is being
developed that can classify this complex cluster. However, comparison of a large number of genomes
is limited by computer resources such as memory size and processor speed. The number of available
genome sequences is increasing faster than computer resources are developing.
We will need a method of calculation that can run on limited hardware resources
but can accommodate the daily increasing genome data.
Table 3: Comparison of six cyanobacterial genomes. All of the predicted protein sequences of the six
cyanobacteria were clustered by the homology group method.
Species                                 Total    Total    Additions      Losses    Groups shared
                                        genes    groups   Groups  %      (groups)  by all six
Anabaena sp. 7120 (An)                  5,371    3,036    815     26.8   7         2,412
Synechocystis sp. PCC 6803 (Sy)         3,167    1,957    453     23.1   32        1,666
Nostoc punctiforme PCC 73102 (Np)       7,431    3,665    1,436   39.2   16        3,074
Prochlorococcus marinus MED4 (Pm1)      1,689    1,255    208     16.6   114       1,157
Prochlorococcus marinus MIT9313 (Pm2)   2,195    1,580    311     19.7   20        1,256
Synechococcus sp. WH8102 (S81)          2,425    1,699    396     23.3   14        1,370
Total                                   22,278   6,220                             763 groups

Table 4: Groups shared by two species
Species pair          Number of common groups
An - Sy               40
An - Np               736
Sy - Np               54
Pm1 - Pm2             32
Sy - S81              11
Pm1 - S81             25
Pm2 - S81             105
Other combinations    0-7
Table 5: Other major shared groups
Species combination   Number of common groups
An - Sy - Np          353
An - Sy - Np - S81    41
Pm1 - Pm2 - S81       79
Acknowledgments
This study was supported in part by Grants-in-Aid for Scientific Research from the Japan Society for
the Promotion of Science (JSPS) (nos. 13440234, 12874104), a Grant-in-Aid for Scientific Research
on Priority Areas (C) “Genome Biology” from the Ministry of Education, Science, Sports, Culture,
and Technology.
[Figure 3 panels: two trees relating An, Np, Sy, Pm1, Pm2 and S81, one for BLAST threshold = 1 x 10^-8 and one for BLAST threshold = 1 x 10^-20; scale bars, 500 changes.]
Figure 3: Genome phylogenetic tree constructed by the homology groups. To evaluate validity of
the method, two different threshold values of BLAST E values were used, but the resulting tree was
essentially identical in shape. These trees were constructed by the parsimony method with the PAUP
program. For the abbreviated names of species, see Table 3.
References
[1] Abdallah, F., Salamini, F., and Leister, D., A prediction of the size and evolutionary origin of
the proteome of chloroplasts of Arabidopsis, Trends Plant Sci., 5:141–142, 2000.
[2] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J.,
Gapped BLAST and PSI-BLAST: A new generation of protein database search programs, Nucleic
Acids Res., 25:3389–3402, 1997.
[3] Bansal, A.K. and Meyer, T.E., Evolutionary analysis by whole-genome comparisons, J. Bacteriol.,
184:2261–2272, 2002.
[4] Campbell, A., Mrázek, J., and Karlin, S., Genome signature comparisons among prokaryote,
plasmid, and mitochondrial DNA, Proc. Natl. Acad. Sci. USA, 96:9184–9189, 1999.
[5] Canback, B., Andersson, S.G.E., and Kurland, C.G., The global phylogeny of glycolytic enzymes,
Proc. Natl. Acad. Sci. USA, 99:6097–6102, 2002.
[6] Eisen, J.A., Assessing evolutionary relationships among microbes from whole-genome analysis,
Curr. Opinion Microbiol., 3:475–480, 2000.
[7] Felsenstein, J., Phylogenies from molecular sequences: Inference and reliability, Annu. Rev.
Genet., 22:521–565, 1988.
[8] Grigoriev, A., Analyzing genomes with cumulative skew diagrams, Nucleic Acids Res., 26:2286–2290, 1998.
[9] Honda, D., Yokota, A., and Sugiyama, J., Detection of seven major evolutionary lineages in
cyanobacteria based on the 16S rRNA gene sequence analysis with new sequences of five marine
Synechococcus strains, J. Mol. Evol., 48:723–739, 1999.
[10] House, C.H. and Fitz-Gibbon, S.T., Using homolog groups to create a whole-genomic tree of
free-living organisms: An update, J. Mol. Evol., 54:539–547, 2002.
[11] Janssen, P.J., Audit, B., and Ouzounis, C.A., Strain-specific genes of Helicobacter pylori: distribution, function and dynamics, Nucleic Acids Res., 29:4395–4404, 2001.
[12] Kabeya, Y., Hashimoto, K., and Sato, N., Identification and characterization of two phage-type
RNA polymerase cDNAs in the moss Physcomitrella patens: Implication of recent evolution of
nuclear-encoded RNA polymerase of plastids in plants, Plant Cell Physiol., 43:245–255, 2002.
[13] Kaneko, T., Sato, S., Kotani, H., Tanaka, A., Asamizu, E., Nakamura, Y., Miyajima, N., Hirosawa, M., Sugiura, M., Sasamoto, S., Kimura, T., Hosouchi, T., Matsuno, A., Muraki, A.,
Nakazaki, N., Naruo, K., Okumura, S., Shimpo, S., Takeuchi, C., Wada, T., Watanabe, A.,
Yamada, M., Yasuda, M., and Tabata, S., Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire
genome and assignment of potential protein-coding regions, DNA Res., 3:109–136, 1996.
[14] Kaneko, T., Nakamura, Y., Wolk, C. P., Kuritz, T., Sasamoto, S., Watanabe, A., Iriguchi, M.,
Ishikawa, A., Kawashima, K., Kimura, T., Kishida, Y., Kohara, M., Matsumoto, M., Matsuno,
A., Muraki, A., Nakazaki, N., Shimpo, S., Sugimoto, M., Takazawa, M., Yamada, M., Yasuda, M.,
and Tabata, S., Complete genomic sequence of the filamentous nitrogen-fixing cyanobacterium
Anabaena sp. strain PCC 7120, DNA Res., 8:205–213, 2001.
[15] Karlin, S. and Cardon, L.R., Computational DNA sequence analysis, Annu. Rev. Microbiol.,
48:619–654, 1994.
[16] Martin, W., Stoebe, B., Goremykin, V., Hansmann, S., Hasegawa, M., and Kowallik, K.V., Gene
transfer to the nucleus and the evolution of chloroplasts, Nature, 393:162–165, 1998.
[17] Meeks, J.C., Elhai, J., Thiel, T., Potts, M., Larimer, F., Lamerdin, J., Predki, P., and Atlas, R.,
An overview of the genome of Nostoc punctiforme, a multicellular, symbiotic cyanobacterium,
Photosynth. Res., 70:85–106, 2001.
[18] Ochman, H., Lawrence, J.G., and Groisman, E.A., Lateral gene transfer and the nature of bacterial innovation, Nature, 405:299–304, 2000.
[19] Ohmori, M., Ikeuchi, M., Sato, N., Wolk, P., Kaneko, T., Ogawa, T., Kanehisa, M., Goto, S.,
Kawashima, S., Okamoto, S., Yoshimura, H., Katoh, H., Fujisawa, T., Ehira, S., Kamei, A.,
Yoshihara, S., Narikawa, R., and Tabata, S., Characterization of genes encoding multi-domain
proteins in the genome of the filamentous nitrogen-fixing cyanobacterium Anabaena sp. strain
PCC 7120, DNA Res., 8:271–284, 2001.
[20] Sato, N., SISEQ: Manipulation of multiple sequence and large database files for common platforms, Bioinformatics, 16:180–181, 2000.
[21] Sato, N., Was the evolution of plastid genetic machinery discontinuous?, Trends Plant Sci., 6:151–155, 2001.
[22] Simpson, C.L. and Stern, D.B., The treasure trove of algal chloroplast genomes, surprises in
architecture and gene content, and their functional implications, Plant Physiol., 129:957–966,
2002.
[23] Snel, B., Bork, P., and Huynen, M.A., Genome phylogeny based on gene content, Nature Genet.,
21:108–110, 1999.
[24] Snel, B., Bork, P., and Huynen, M.A., Genomes in flux: The evolution of archaeal and proteobacterial gene content, Genome Res., 12:17–25, 2002.
[25] Stoebe, B. and Maier, U.-G., One, two, three: nature’s tool box for building plastids, Protoplasma, 219:123–130, 2002.
[26] The Arabidopsis Genome Initiative, Analysis of the genome sequence of the flowering plant Arabidopsis thaliana, Nature, 408:796–815, 2000.
[27] Wolf, Y.I., Rogozin, I.B., Kondrashov, A.S., and Koonin, E.V., Genome alignment, evolution of
prokaryotic genome organization, and prediction of gene function using genome context, Genome
Res., 11:356–372, 2001.
[28] GenomeNet web site. http://www.genome.ad.jp/ and ftp://ftp.genome.ad.jp/
[29] Joint Genome Institute web site. http://spider.jgi-psf.org/JGI_microbial/html/