Mass Identification of Chloroplast Proteins of Endosymbiont Origin

Genome Informatics 16(2): 56–68 (2005)
Mass Identification of Chloroplast Proteins of
Endosymbiont Origin by Phylogenetic Profiling Based
on Organism-Optimized Homologous Protein Groups
Naoki Sato 1, Masayuki Ishikawa 1, Makoto Fujiwara 1, Kintake Sonoike 2
1 Department of Life Sciences, Graduate School of Arts and Sciences, University of
Tokyo, 3-8-1 Komaba, Meguro-ku, Tokyo 153-8902, Japan
2 Department of Integrated Biosciences, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba, 277-8562, Japan
Abstract
Chloroplasts originate from an ancient cyanobacterium-like endosymbiont. Several tens of chloroplast proteins are encoded by the chloroplast genome, while hundreds or more are encoded by
the nuclear genome in plants and algae, but the exact number and identity of nuclear-encoded
chloroplast proteins are still unknown. We describe here attempts to identify a large number of
unidentified chloroplast proteins of endosymbiont origin (CPRENDOs). Our strategy consists of
whole-genome protein clustering by the homolog group method, which is optimized for organism
number, and phylogenetic profiling that extracts groups conserved in cyanobacteria and photosynthetic eukaryotes. An initial minimal set of CPRENDOs was predicted without targeting prediction
and experimentally validated.
Keywords: genomic clustering, homolog group, CPRENDO, Gclust, endosymbiosis
1 Introduction
The chloroplast is a photosynthetic organelle within plant and algal cells. In flowering plants it is also
present as the chromoplast, amyloplast, elaioplast, or leucoplast, depending on the cell type. The general term
for all these organelles related to the chloroplast is ‘plastid’. The plastid is also involved in various metabolic
pathways, such as the biosynthesis of fatty acids, isoprenoids, tetrapyrroles, amino acids, and some plant hormones.
It is also the sole site of assimilation of nitrogen and sulfur in plant cells. Plants (and algae) acquired
chloroplasts by endosymbiosis, which occurred 1.6 Ga (billion years ago) [22]. The endosymbiont was
closely related to present-day cyanobacteria [4], but it is still not clear which cyanobacterium is
most closely related to the chloroplast ancestor. The endosymbiosis theory is supported by the fact that
the genes encoded in chloroplast genomes are phylogenetically most closely related to their orthologs in
cyanobacteria. Indeed, the endosymbiosis was a major event involving massive transfer of genes from cyanobacteria to photosynthetic eukaryotes, and is a good target for comparative genomic studies. In algae and
plants, many chloroplast proteins are encoded by the nuclear genome, and many of them are thought to have been transferred from the ancient endosymbiont. Chloroplasts also use proteins of eukaryotic
origin. Therefore, the chloroplast proteome is a chimera of proteins originating from both the endosymbiont
and the eukaryotic host [1, 13, 17]. However, photosynthesis-related proteins and the enzymes involved
in chloroplast biogenesis (transcription and translation) are mostly of endosymbiont origin. Based
on this consideration, we tried to estimate the list of chloroplast proteins that were acquired by the
endosymbiotic event. This is a good challenge for comparative genomics, perhaps the best among
comparable examples [3, 5].
We present here a generally applicable method of phylogenetic profiling, which focuses on unidentified proteins that are conserved in a group of organisms sharing a common physiological
property or pathway.
After the initial presentation at GIW three years ago [18], we have made both computational
and experimental efforts [19, 20]. On the computational side, the Gclust software was revised
to implement the clique mode, as described below. In addition, the use of an intermediate file facilitated
rapid analysis with different parameters. On the experimental side, a minimal set of CPRENDOs
estimated using the old version of Gclust was analyzed. In the present communication, we present
the results of a revised prediction of CPRENDOs based on the current version of Gclust, as well as the results of experimental verification of the minimal set of CPRENDOs, and discuss the effectiveness of phylogenetic
profiling in comparative genomics.
2 Method
2.1 Major Features of the Methodology
The following points are emphasized in the present study:
(1) Use of homolog groups but NOT ortholog groups (based on bidirectional best hit)
A commonly used method for phylogenetic clustering relies on ‘ortholog groups’. Two genes (or
proteins) are defined as orthologs if they originate from an identical ancestral gene. However, in computational biology, orthologs are operationally defined by a bidirectional best-hit relationship inferred
by BLAST or SSEARCH analysis. In practice, every genome contains several paralogs or highly related
genes, such as those encoding protein families, and it is not always easy or practical to
identify the correct orthologs (as originally defined) without phylogenetic analysis. We have been
using the ‘homolog group’ method [7], in which all homologous proteins are included in each cluster. This
method allows detailed phylogenetic analysis of each homolog group to identify true orthologs.
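The operational definition of orthologs by bidirectional best hit can be illustrated with a minimal Python sketch. It is not part of Gclust; the function name, the input tuple format, and the toy protein names are assumptions made for illustration only.

```python
def bidirectional_best_hits(hits):
    """Return protein pairs that are each other's best hit.

    hits: iterable of (query, subject, evalue) tuples covering both search
    directions between two genomes (hypothetical input format).
    """
    best = {}
    for query, subject, evalue in hits:
        # Keep only the lowest E-value hit for each query.
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    pairs = {
        tuple(sorted((query, subject)))
        for query, (subject, _) in best.items()
        if subject in best and best[subject][0] == query
    }
    return sorted(pairs)


# Toy example: b2 is a paralog of b1, and a2 is a more distant relative.
toy_hits = [
    ("a1", "b1", 1e-50), ("a1", "b2", 1e-20),
    ("b1", "a1", 1e-48), ("b2", "a1", 1e-19),
    ("a2", "b1", 1e-10),
]
print(bidirectional_best_hits(toy_hits))
```

In this toy case only the pair a1/b1 is retained, while a2 and b2 are dropped. A homolog group would keep all four proteins together for later phylogenetic analysis.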
(2) Use of both E-value and homologous regions of BLASTP output for clustering
Many clustering practices use BLASTP or SSEARCH data for hierarchical clustering with a single
criterion such as E-value. Using such a simple criterion in a BLASTP-based homolog group method produces large aggregates of various proteins bridged by multidomain proteins [18, 19]. To avoid this, we
use homologous region information to infer both an overlap score and the domain structure. The overlap score for
two proteins is defined as ‘the sum of the total overlap regions in both proteins divided by the total length of both
proteins’, namely, (a1 + a2 + b1 + b1’ + b2)/(length1 + length2) in the example shown in Figure 1.
Constraints on E-value, overlap score, and domain structure are used in clustering to infer truly
homologous proteins by excluding functionally different proteins that share one or several domains.
Figure 1: An example showing calculation of the overlap score. Homologous regions are indicated.
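The overlap score defined above can be written as a short Python sketch. The coordinates and lengths are invented for illustration, and the sketch simply sums the lengths of the listed regions; how Gclust treats regions that overlap one another within a protein is not stated here, so that aspect is an assumption.

```python
def overlap_score(regions_a, regions_b, length_a, length_b):
    """Overlap score = (homologous residues in A + in B) / (length A + length B).

    regions_a, regions_b: lists of inclusive (start, end) residue coordinates
    of the homologous regions in each protein (hypothetical input format).
    """
    def covered(regions):
        return sum(end - start + 1 for start, end in regions)

    return (covered(regions_a) + covered(regions_b)) / (length_a + length_b)


# Protein A (200 aa) has two regions; protein B (300 aa) has three.
score = overlap_score(
    regions_a=[(1, 50), (101, 150)],
    regions_b=[(1, 40), (61, 100), (151, 200)],
    length_a=200,
    length_b=300,
)
print(round(score, 2))
```

Here the homologous regions cover 100 residues of A and 130 residues of B, so the score is 230/500.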
(3) Organism-optimized clustering
If homolog groups are constructed solely on the basis of sequence similarity, the clusters are not always suitable
for phylogenetic profiling. A cluster may contain many proteins of the same family, or a single protein
family may be split into several different clusters according to phylogenetic positions. This problem is
partially solved during the initial cluster formation using the 2D table (see below) and at the last
stage of clustering.
(4) Experimental validation of computational estimation
We believe that any bioinformatics inference should be experimentally validated. In many informatics studies, logical consistency is the sole criterion for evaluating a computational estimation, but
biologically meaningful results are what matter most in bioinformatics. Inference of chloroplast proteins
of endosymbiont origin (CPRENDOs) may be one of the best applications of phylogenetic profiling
that can be experimentally verified.
2.2 Preparation for Clustering
All proteins in the selected genomes were clustered by the homolog group method. To this end, one of the
authors (NS) developed a program called ‘Gclust’, which reads all-against-all BLASTP results and
outputs a list of homologous protein groups (homolog groups). The software was written in C, and runs
on any common UNIX machine if enough memory is available. The overall flow of data processing
is shown in Figure 2. A typical source of genomic protein data is a GenBank flat file. The gbk file
was processed to produce a FASTA file and an annotation file. Such data from various genomes were
assembled into a single FASTA file and an annotation table. For eukaryotic organisms, nuclear as
well as organellar (mitochondrial and chloroplast, if present) genomes were used. The two files were
processed to give another FASTA file (**.gfa) and an annotation table (**.g.table). In the **.gfa file,
all protein names were converted to numbers to save disk space during the BLASTP search. The
numbers can be converted back to the original protein names by referring to the **.g.table. Next,
an all-against-all BLASTP search (versions 2.1.2 - 2.2.12) [2] was done using the FASTA file (**.gfa) as
input. The output was piped directly into bl2ls3.pl to produce a list of homology regions and
E-values, using an E-value threshold of 1e-3. The resultant file was then used as input to the Gclust
software. The BLASTP search is the most time-consuming step, and was done as multiple jobs with split
files on several different servers. All sequence file manipulation, such as format conversion and file
splitting, was done with the SISEQ software (version 1.30) [16].
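The conversion of protein names to numbers can be sketched as follows. This is a minimal Python illustration, not the actual Gclust preprocessing: the function name, the numbering scheme, and the toy headers are assumptions, and the real **.gfa and **.g.table formats may differ.

```python
def number_fasta(fasta_text):
    """Replace FASTA headers with serial numbers.

    Returns the renumbered FASTA text and a lookup table of
    (number, original header) rows for converting numbers back to names.
    """
    out_lines, table = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            number = len(table) + 1
            table.append((number, line[1:].strip()))
            out_lines.append(f">{number}")
        elif line.strip():
            out_lines.append(line.strip())
    return "\n".join(out_lines) + "\n", table


# Toy input with hypothetical protein names.
example = ">orgA_gene1 hypothetical protein\nMKV\nLLA\n>orgB_gene7\nMTT\n"
gfa_text, g_table = number_fasta(example)
print(gfa_text)
print(g_table)
```

Short numeric identifiers keep the all-against-all BLASTP output compact, and the lookup table restores the original names afterwards.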
2.3 Organism-Optimized Clustering with Gclust Software
Gclust software [18] version 3.5.2 [23] was run in the ‘clique mode’. The BLASTP results were
processed in the following two steps (Figure 2). In the first step, the data were read, partially transformed
into an intermediate data format, and saved in a large file ‘data.out’ for further analysis with various
parameter settings. Low-homology data were removed, while data for short sequences were kept by using
length-dependent E-value thresholds (from 1e-6 for >100 aa to 1e-3 for <40 aa). All single-path relations were picked up
from the homology data. The domain composition of each protein was also estimated using the homology
regions with different subject proteins. At this stage, multi-domain proteins as well as very large
proteins (>2,000 aa, for example) were marked with a flag.
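A length-dependent E-value cutoff of this kind might look like the following Python sketch. Only the two end points (1e-6 above 100 aa, 1e-3 below 40 aa) come from the text. The log-linear interpolation between them and the use of the shorter of the two sequence lengths are assumptions, as are the function names.

```python
def evalue_threshold(length, short_len=40, long_len=100,
                     short_exp=-3.0, long_exp=-6.0):
    """Return the E-value cutoff for a sequence of the given length (aa)."""
    if length <= short_len:
        exponent = short_exp
    elif length >= long_len:
        exponent = long_exp
    else:
        # Assumed: interpolate the exponent linearly between the end points.
        frac = (length - short_len) / (long_len - short_len)
        exponent = short_exp + frac * (long_exp - short_exp)
    return 10.0 ** exponent


def keep_hit(evalue, query_len, subject_len):
    """Keep a hit if its E-value passes the cutoff for the shorter protein."""
    return evalue <= evalue_threshold(min(query_len, subject_len))


print(keep_hit(1e-4, 30, 500))   # short protein, relaxed cutoff
print(keep_hit(1e-4, 200, 500))  # long proteins, strict cutoff
```

The relaxed cutoff for short proteins reflects the fact that a short alignment cannot reach a very low E-value even when the similarity is genuine.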
In the second step, Gclust reads the data.out file and performs clustering using the -clique option,
which produces a good clustering result in a relatively short time (within one day for a dataset
containing 141 organisms). In the clique mode, the homology data were converted to a structure
called ‘match’, which held the data of binary (i.e., protein-to-protein) similarity, namely, E-value, overlap
score, and domain composition estimated as above. Normally, the clique mode uses a list of organisms
provided by the ‘org list’ file. For each protein, all match data were tabulated in 2D, using E-value
and overlap score (Figure 4A). The 2D table lists the distribution of match array data using a pre-defined
category scheme, which can be customized using a configuration file called ‘var list’. Match data were
selected one by one, starting from the initially selected best local maximum. The search scanned in a
circular or diamond manner around the initial starting point. The scanning toward lower overlap score
and higher E-value stopped if the number of members increased again, because this is a sign of another group
of homologs with a lower similarity. This operation was done on a shadow table (Figure 4B), in which
non-negative values indicate the selected area, and the increasing number indicates the path of the search. By
applying such criteria, among others, a clearly defined cluster of match data with respect to E-value
and overlap score was selected (boxed area). In addition, match data were selected to include as many
organisms as possible, but without picking up very low similarity data (the output below Figure 4A and
B). After such purification of match data, a list of homologs was made for each protein. The threshold
E-value and overlap score were also stored. Then, homolog clusters were formed by merging individual
lists. At this stage, clusters with very different threshold E-values were not merged. After repeated
merging and removing, orphan entries generated by the removal step were again incorporated into the
most adequate cluster. Clusters were again optimized for the number of organisms. Homolog groups were
sorted according to the number of entries. Finally, homolog groups were printed out to a large file as
a concatenated similarity matrix (Figure 5A). The matrix may be expressed in 1/0 (similar/dissimilar),
E-value, and/or overlap score, depending on the output options 1, r, and/or s, respectively.
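The 2D tabulation of match data for one protein can be sketched in Python as below. The category boundaries are hypothetical (Gclust reads its own scheme from a configuration file), rows are assumed to index overlap score and columns E-value, and the subsequent scan around the best local maximum is not reproduced here.

```python
import bisect
import math
from collections import Counter

# Hypothetical category boundaries on log10(E-value).
E_EDGES = [-100, -50, -20, -10, -6]
# Hypothetical number of equal-width overlap-score categories on [0, 1].
N_OVERLAP = 5


def table_2d(matches):
    """Count matches per (overlap-score row, E-value column) cell.

    matches: iterable of (evalue, overlap_score) pairs for one protein.
    """
    counts = Counter()
    for evalue, overlap in matches:
        # BLAST may report an E-value of exactly 0 for very strong hits.
        exponent = -math.inf if evalue == 0 else math.log10(evalue)
        col = bisect.bisect_left(E_EDGES, exponent)
        row = min(int(overlap * N_OVERLAP), N_OVERLAP - 1)
        counts[(row, col)] += 1
    return counts


toy_matches = [(1e-120, 0.95), (0.0, 0.90), (1e-30, 0.50), (1e-4, 0.10)]
print(table_2d(toy_matches))
```

A cell with many entries at low E-value and high overlap score corresponds to a candidate starting point for the selection described above.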
3 Results
3.1 Prediction of CPRENDOs by Phylogenetic Profiling
Using a Perl script, homologtableG3b.pl, the homology matrix was transformed into a table showing
the members of each homolog group (Figure 3 and 5B). This table was used to extract homolog groups that
are shared by various combinations of organisms (phylogenetic profiling). Note that proteins encoded
by both organellar and nuclear genomes were included in the data set of eukaryotic organisms. Therefore, we selected organisms rather than genomes in the phylogenetic profiling. For the prediction of
CPRENDOs, we used a data set, CZ16Y, containing all predicted proteins in nine species of cyanobacteria, Arabidopsis thaliana (plant) [21], Cyanidioschyzon merolae (red alga) [14], three species of photosynthetic
bacteria, two species of bacteria, and two eukaryotes. All data were taken from the GenBank
data repository, except for those of Cyanidioschyzon, which were obtained from the Cyanidioschyzon
Genome Project [24]. Cyanidioschyzon is a representative of the red lineage of photosynthetic eukaryotes, and we expected that the use of both a plant (green lineage) and a red alga would increase the accuracy
of phylogenetic profiling. The homolog groups that are shared by cyanobacteria, Arabidopsis and
Cyanidioschyzon were selected (Table 1). At this step, various constraints were tested in the selection.
Conservation in cyanobacteria was one constraint, and allowance for presence in other organisms was
another. The first constraint could be complete conservation in all cyanobacteria (nine
species), but many homologs of chloroplast proteins are not completely conserved in all cyanobacteria.
A phylogenetic analysis suggests that plastids are sister to the Anabaena-Synechocystis clade (Sato,
unpublished results). Therefore, Anabaena [12] and Synechocystis [11] could be used as representatives
of cyanobacteria. But we found that not all chloroplast proteins are conserved in both of these cyanobacteria.
We finally adopted a strategy in which any proteins conserved in a certain number of cyanobacteria
were selected, irrespective of the combination of cyanobacteria. The number of cyanobacteria was also a
variable, but five species (out of nine) gave satisfactory results (Figure 6 and Table 1).
Allowance for presence in other organisms was also tested. Table 1 compares the effects of allowance
in photosynthetic bacteria and non-photosynthetic organisms. Photosynthetic bacteria perform photosynthesis without oxygen evolution, with a single photosystem, using machinery that is distantly
related to that of cyanobacteria and plants. Therefore, the inclusion of photosynthetic bacteria
could affect phylogenetic profiling of CPRENDOs. In addition, paralogs of some photosynthesis-related proteins (ATP synthase, ribosomal proteins, and even a RuBisCO subunit) are present in
Figure 3: Flow chart of data processing for further
analysis towards phylogenetic analysis.
Figure 2: Flow chart of data processing until formation of homology matrix.
Figure 4: Selection of match data in the clique mode. A. 2D table (rows, overlap score; lines, E-value)
showing distribution of match data. A best local maximum is selected first (circle). Other local
maxima with lower similarity are indicated by dotted circles. B. A shadow table for working. Zero is
the start of search. Non-negative values show selected groups.
Figure 5: Example output of Gclust (A) and tabular summary of homologs as generated by further
processing (B). In (A), protein name (combination of genome name and gene identifier), number of
amino acid residues, similarity matrix, and annotation in the original database are listed from left to
right. The similarity matrix is a square matrix, having identical set of proteins in both vertical and
horizontal directions. Each protein belongs to a single group. The similarity detected by BLASTP
but not incorporated into the clustering is listed below the main matrix as ‘Related groups’. Each line
in related groups consists of group number (number of members in parenthesis) and protein name. In
(B), all homolog groups are listed with number of members in each genome. The annotation is taken
from the first member.
Figure 6: A Venn diagram showing homolog groups shared by the three organism categories, Arabidopsis thaliana (green plant), Cyanidioschyzon merolae (red alga) and 5 cyanobacteria. Here, ‘5 Cyanos’
indicates >=5 of 9 cyanobacteria analyzed. This result was obtained with the selection method G
shown in Table 1. In this diagram, each area is drawn proportional to the number of groups using a
tcl/tk software called TriGraph (Sato, unpublished).
Table 1: Number of homolog groups selected with different criteria. Number of homolog groups
that are conserved in at least 5 among 9 cyanobacteria, Arabidopsis (Ath), and Cyanidioschyzon
(Cme) are listed with varying additional conservation in photosynthetic bacteria (PhotoBact) and
other organisms (Others). Others include C. elegans, S. cerevisiae, E. coli and B. subtilis. Number of
homolog groups consisting of known chloroplast proteins or unknown proteins is listed. Each number
in parenthesis indicates proportion of groups. Finally, number of members in Ath and Cme belonging
to selected groups is listed.
Selection  PhotoBact  Others  # of Groups  Known cp proteins  Unknowns    # of Ath proteins  # of Cme proteins
A          0          0       84           37 (0.44)          44 (0.52)   122                103
D          0-3        0       112          51 (0.46)          55 (0.49)   185                142
E          0-3        0-1     150          66 (0.44)          72 (0.48)   293                196
F          0-3        0-2     308          148 (0.48)         97 (0.31)   706                438
G          0-3        0-3     443          218 (0.49)         127 (0.29)  1192               676
non-photosynthetic organisms. These facts may complicate the profiling. However, the results
in Table 1 show that the allowance for other organisms has little effect on the proportion of clusters containing known chloroplast proteins, which was 0.44 - 0.49 in the different selections. In contrast,
the proportion of clusters containing unknown proteins decreases with increasing allowance for other
organisms. These results suggest that the allowance for other organisms may be as wide as possible.
Conservation in cyanobacteria, Arabidopsis and Cyanidioschyzon may therefore be a reliable criterion for selecting CPRENDOs. A Venn diagram (Figure 6) shows that smaller numbers
of homolog groups are shared by cyanobacteria and Arabidopsis, or by cyanobacteria and Cyanidioschyzon. These groups could include proteins conserved in only the green or the red lineage, and may be
studied as CPRENDO-like proteins. The groups shared by Arabidopsis and Cyanidioschyzon represent
eukaryotic proteins, and are not candidates for CPRENDOs.
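The selection criteria compared in Table 1 can be expressed as a short Python sketch. This is not the script used in the study: the function name, the data layout (each homolog group as a set of organisms having at least one member), and the toy organism labels are assumptions for illustration.

```python
def select_groups(groups, cyano, photobact, others, ath="Ath", cme="Cme",
                  min_cyano=5, max_photobact=3, max_others=3):
    """Return IDs of homolog groups matching a phylogenetic profile.

    groups: dict mapping group ID -> set of organisms with a member.
    A group must be present in Arabidopsis, Cyanidioschyzon and at least
    min_cyano cyanobacteria, and in no more than the allowed numbers of
    photosynthetic bacteria and other organisms.
    """
    selected = []
    for group_id, orgs in groups.items():
        if ath not in orgs or cme not in orgs:
            continue
        if len(orgs & cyano) < min_cyano:
            continue
        if len(orgs & photobact) > max_photobact:
            continue
        if len(orgs & others) > max_others:
            continue
        selected.append(group_id)
    return selected


# Toy data with hypothetical organism labels.
cyano = {f"cy{i}" for i in range(1, 10)}
photobact = {"pb1", "pb2", "pb3"}
others = {"ot1", "ot2", "ot3", "ot4"}
five = {"cy1", "cy2", "cy3", "cy4", "cy5"}
toy_groups = {
    "g1": {"Ath", "Cme"} | five,
    "g2": {"Ath", "Cme", "ot1"} | five,
    "g3": {"Ath", "Cme", "cy1", "cy2"},
    "g4": {"Ath"} | cyano,
}

# Strict profile (no allowance, as in selection A) and wide allowance (as in G).
print(select_groups(toy_groups, cyano, photobact, others,
                    max_photobact=0, max_others=0))
print(select_groups(toy_groups, cyano, photobact, others))
```

Widening the allowance adds groups such as g2 that also occur in a non-photosynthetic organism, which mirrors the growth in group numbers from selection A to selection G.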
3.2 Minimal Set of CPRENDOs
To verify the predicted CPRENDOs experimentally, we planned to analyze the plastid localization of the
predicted CPRENDOs (see 3.3.1). The data used for the experimental study described
below were predicted two years ago [18] using an older version of Gclust (version 2.1.2) and an older
database (dataset CZ16). At that time, homolog groups were constructed simply using various different
E-values, and the homolog groups that were conserved in eight cyanobacteria, a red alga and a green
plant, but not in other non-photosynthetic organisms, were selected at each cutoff E-value. The selected
groups were combined and used as a minimal set of CPRENDOs (Table 3). In total, 51 homolog
groups were selected. Among them, 19 were clusters of known chloroplast proteins, such as Psa and
Psb proteins. The remaining 32 groups were selected as targets of the initial experimental study. These
homolog groups were generally included in selection A (Table 1) of the current CZ16Y dataset,
with minor inconsistencies.
3.3 Experimental Verification of the Minimal Set of CPRENDOs
We performed experimental verification of the minimal set of CPRENDOs to test our idea that
phylogenetic profiling is useful in predicting CPRENDOs, since we believe that any informatic prediction
should be experimentally verified. The experimental verification consists of the following four analyses:
localization of proteins, light-regulated expression, phenotypes of cyanobacterial disruptants, and plant
tag-lines.
3.3.1 Localization of Predicted CPRENDOs in A. thaliana
Localization of the predicted CPRENDOs was analyzed by using Green Fluorescent Protein (GFP)-fusion constructs. Each construct was prepared by successive PCR, and either linear DNA or plasmids
were transiently introduced into onion epidermis by particle bombardment. The localization of each GFP-fusion protein was analyzed by fluorescence microscopy on the next day. The results (Table 2) showed
that 49 out of 52 proteins were targeted to plastids. Interestingly, five proteins were also targeted to
mitochondria. Such dual targeting is common in plant organellar proteins [10]. It should be noted
that the localization predicted by TargetP [6] (not shown) was generally in good agreement with these results, although six
proteins were not correctly predicted to be targeted to chloroplasts.
3.3.2 Light-Dependent Expression of the Genes for CPRENDOs in A. thaliana
Expression of the predicted CPRENDOs was analyzed by RNA-blot analysis using 7-day-old seedlings,
and the results are also shown in Table 2. As many as 36 genes encoding predicted CPRENDOs
showed light-dependent expression, which is also expected for proteins involved in photosynthesis or
chloroplast biogenesis. Nine genes were constitutively expressed, while expression of seven genes was
below the detection limit of the method employed. Cross-examination of localization and expression
indicates abundance (31 proteins) of light-regulated chloroplast proteins.
3.3.3 Analysis of Synechocystis Disruptants
The genes for the cyanobacterial homologs of predicted CPRENDOs were disrupted in Synechocystis
sp. PCC 6803. For this purpose, a rapid method of preparation of disruption construct was developed
using repeated PCR. Among the 41 genes, 33 were disrupted completely, while five were not completely
segregated, and might represent essential genes. Three constructs were not successfully made, due to
technical problems in PCR (‘PCR problem’ in Table 3).
Table 2: Summary of localization and expression of predicted minimal set of CPRENDOs. Cp,
chloroplasts (plastids); Mt, mitochondria; Cyto, cytoplasm; nuc, nucleus. L > D, expression in the
light was higher than that in the dark; L = D, expression was comparable in the light and the dark;
No exp, no expression was detected by RNA-blot analysis.
                       Expression
Localization   Total   L > D   L = D   No Exp.
Cp             44      31      8       5
Cp & Mt        5       3       0       2
Mt             1       1       0       0
Cyto & nuc     2       1       1       0
Total          52      36      9       7
Fluorescence induction kinetics was measured as an indicator of photosynthetic performance (Table
3). Growth defects were also noted for some disruptants. In 22 disruptants, defects in growth or
fluorescence kinetics were noted. These results suggest that the selected genes are important for
normal growth in cyanobacteria.
3.3.4 Analysis of A. thaliana Mutant Lines
Mutants of the predicted CPRENDOs were analyzed using the SALK T-DNA tag-lines [25]. The
analysis is still in progress, but we obtained homozygous lines for 25 CPRENDOs. During our experiments in the past two years, reports were published on four of the CPRENDOs, namely, Tab2
(in Chlamydomonas), Psb29/Thf1, APE1 and HY2. These are not components of photosynthetic
machinery except Psb29, but are involved in its biogenesis. This demonstrates the correctness of our
strategy, and many of the remaining CPRENDOs are also likely to be important in the biogenesis
of photosynthetic machinery. However, only two of the Arabidopsis mutant lines showed visible phenotypes, such as variegation. The CPRENDO gene in one of them has been already annotated as
‘ycf65’, a hypothetical chloroplast reading frame, because it is encoded in the chloroplast genome in
some algae such as Cyanidioschyzon. A mutant of ycf65 in Synechocystis also showed growth defect.
Ycf65 protein is likely to be important in both chloroplasts and cyanobacteria.
4 Discussion
4.1 Evaluation of the Prediction Strategy of CPRENDOs
The present study shows that phylogenetic profiling is useful in predicting CPRENDOs. Essential
methodology for predicting CPRENDOs consists of (1) constructing homolog groups from total predicted proteins of both photosynthetic and non-photosynthetic organisms, and (2) selecting groups
that are conserved in photosynthetic organisms under appropriate constraints. A probable estimate of
Table 3: Summary of results of functional analysis. Phenotypes of Synechocystis disruptants, localization and light regulation of Arabidopsis genes, and the number of homozygous tag-lines are listed. For
Synechocystis mutants, mutant ID is indicated with phenotypes (segregation state, growth properties,
and fluorescence properties, in this order), if present. Localization is shown in abbreviated words (see
Table 2). Blue underline indicates light-regulated expression. Confirmed CPRENDOs are marked by
bold characters. For homozygous tag-lines, visible phenotype is marked by bold red characters. In
annotation, ‘Ycf’ stands for ‘hypothetical chloroplast ORF’.
ID during the work | Function reported | Annotation | Synechocystis mutants, Mutant ID: Phenotype (segregation, growth, fluorescence) | Arabidopsis, # of localization (UL=L>D, Bold=CPRENDO) | Ath, # of homozygous tag-lines
3
5
Hypothetical
10; 11
2 cp, 1 mt
6
Hypothetical
12; (13: PCR problem)
1 cp
7
Ycf52
6: -, slow, -
2 cp
1
8
Hypothetical
14
2 cp
3
15; (16: PCR problem)
1 cp
Not available
9
Yes
Tab2
(Chlamydomonas)
17: incomplete, light
10
Hypothetical
12
Hypothetical
(18: PCR problem)
Membrane
19: -, light sensitive,
1 cp, 1 cp,
protease
very low peak
1 (cp, mt)
13
14
Hypothetical
15
Hypothetical
sensitive, low peak
20: -, slow, high second
peak; 21: -, -, Low peak
1 cp
1 cp, 1 cp,
1 (cp), 1 nuc
1 cp
3
1
22: -, slow, low peak
1 cp
23: -, -, low second peak
1 (cp)
Ycf19
3
2 cp
2
32
Ycf19-like
4: incomplete, -, -
1 cp
Not available
19
Hypothetical
24
1 (cp)
21
Ycf60
22
Hypothetical
26: -, -, low peak
1 (cp, mt)
1
23
Ycf65
27: -, slow, -
2 cp
1, 1
Probable
16
ferredoxin (2Fe-2S)
18
27
33
Yes
34
35
39
1 cp, 1 cp
2
Hypothetical
28
1 (cp)
Psb29/Thf1/APG5
29 (sll1414): no phenotype
2g20890 (cp)
(Not tried)
Rubredoxin
30: -, -, low peak
1 cp
Not available
1 cp
1
Hypothetical
Yes
25: -, pale green
and light sensitive, -
31: incomplete,
slow, high peak
APE1
32(slr0575): -, -, low peak
5g38660 (cp)
(Not tried)
40
Hypothetical
8, -, slow, -
1 cp
1
41
Hypothetical
33: -, -, very low peak
1 cp, 1 (cp)
Not available
43
Hypothetical
34
1 cp
Not available
44
Hypothetical
1 cp
Not available
35: -, -, no decrease
after peak
Table 4: Continuation of Table 3.
Function
Synechocystis mutants
reported
Mutant ID: Phenotype
Arabidopsis
# of localization
Ath, # of
(segregation, growth,
(UL=L>D,
homozygous
fluorescence)
Bold=CPRENDO)
tag-lines
HY2 (phycobilin
36(slr0116): incomplete,
3g09150
synthesis)
-, high peak
(1 cp)
47
Ycf20
37: -, slow, -
3 (cp, mt)
49
Hypothetical
38: -, slow, -
1 cp
51
Hypothetical
38
3 cp
54
Hypothetical
40
1 cp
ID
during
Annotation
the work
46
55
59
62
Yes
Hypothetical
ATP-dependent
proteinase
Hypothetical
41: -, slow, no decrease
after peak
45
47: incomplete, light
sensitive and slow, -
(Not tried)
2
3
1 (cp)
2 cp
1 (cp)
the number of CPRENDOs is 1192 in Arabidopsis and 676 in Cyanidioschyzon. A previous study [13]
estimated the upper limit of chloroplast proteins of endosymbiont origin as about 4,500 in Arabidopsis,
and another study [1] suggested that about 650-900 plant proteins originated from the cyanobacterial endosymbiont. A more recent estimate was about 880 [15]. These estimates were obtained by calculation,
not by complete enumeration. These studies also showed that a significant proportion of proteins
of cyanobacterial origin might be located in non-chloroplast compartments, which is not the case in
our results. This could be partly due to the limitations of targeting prediction [15], but also to
inaccuracy in the prediction. In contrast, the results of the present study on the minimal set of predicted
CPRENDOs clearly indicate that almost all of them are chloroplast proteins, although no targeting prediction was used in the prediction process. A reasonable explanation of the discrepancy may
be that we used ‘conservation in 5 cyanobacteria, Arabidopsis, and Cyanidioschyzon’ as a criterion,
while previous studies used conservation in only Arabidopsis and Synechocystis, or a similarly simple
criterion, which overestimates the number of proteins conserved in plants and cyanobacteria. In addition,
these previous studies used a simple ‘one plant vs one cyanobacterium’ relationship with a single cutoff
E-value for all proteins. Our approach using phylogenetic profiling based on homolog groups gives
robust clusters, which could yield a more solid prediction.
4.2 General Usefulness of Phylogenetic Profiling
The general success of our approach to comparative genomics prompted us to extend phylogenetic profiling
to the prediction of various other proteins that are conserved in a certain group of organisms. Prediction of
pathogenicity-related proteins has been done in various bacterial groups including strains with or without
pathogenicity [8, 9]. Such analysis might not need a sophisticated strategy of genomic comparison. But
identification of proteins that are conserved in a wide range of organisms that are not closely related
phylogenetically requires solid clustering and phylogenetic profiling. Phylogenetic profiling with the
Gclust database will be a powerful tool for identifying plant-specific proteins and proteins specific to
flowering plants, if more plant genomic sequences become available.
References
[1] Abdallah, F., Salamini, F., and Leister, D., A prediction of the size and evolutionary origin of
the proteome of chloroplasts of Arabidopsis, Trends Plant Sci., 5:141–142, 2000.
[2] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J.,
Gapped BLAST and PSI-BLAST: A new generation of protein database search programs, Nucleic
Acids Res., 25:3389–3402, 1997.
[3] Bansal, A.K. and Meyer, T.E., Evolutionary analysis by whole-genome comparisons, J. Bacteriol.,
184:2261–2272, 2002.
[4] Cavalier-Smith, T., Genomic reduction and evolution of novel genetic membranes and protein-targeting machinery in eukaryote-eukaryote chimaeras (meta-algae), Phil. Trans. R. Soc. Lond.,
358B:109–134, 2003.
[5] Eisen, J.A., Assessing evolutionary relationships among microbes from whole-genome analysis,
Curr. Opinion Microbiol., 3:475–480, 2000.
[6] Emanuelsson, O., Nielsen, H., Brunak, S., and von Heijne, G., Predicting subcellular localization
of proteins based on their N-terminal amino acid sequence, J. Mol. Biol., 300:1005–1016, 2000.
[7] House, C.H. and Fitz-Gibbon, S.T., Using homolog groups to create a whole-genomic tree of
free-living organisms: An update, J. Mol. Evol., 54:539–547, 2002.
[8] Janssen, P.J., Audit, B., and Ouzounis, C.A., Strain-specific genes of Helicobacter pylori: Distribution, function and dynamics, Nucleic Acids Res., 29:4395–4404, 2001.
[9] Jin, Q., et al., Genome sequence of Shigella flexneri 2a: Insights into pathogenicity through
comparison with genomes of Escherichia coli K12 and O157, Nucleic Acids Res., 30:4432–4441,
2002.
[10] Kabeya, Y. and Sato, N., Unique translation initiation at the second AUG codon determines
mitochondrial localization of the phage-type RNA polymerases in the moss Physcomitrella patens.
Plant Physiol., 138:369–382, 2005.
[11] Kaneko, T., et al., Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment
of potential protein-coding regions, DNA Res., 3:109–136, 1996.
[12] Kaneko, T., et al., Complete genomic sequence of the filamentous nitrogen-fixing cyanobacterium
Anabaena sp. strain PCC 7120, DNA Res., 8:205–213, 2001.
[13] Martin, W., Rujan, T., Richly, E., Hansen, A., Cornelsen, S., Lins, T., Leister, D., Stoebe, B.,
Hasegawa, M., and Penny, D., Evolutionary analysis of Arabidopsis, cyanobacterial, and chloroplast genomes reveals plastid phylogeny and thousands of cyanobacterial genes in the nucleus.
Proc. Nat. Acad. Sci. USA, 99:12246–12251, 2002.
[14] Matsuzaki, M., et al., Genome sequence of the ultrasmall unicellular red alga Cyanidioschyzon
merolae 10D, Nature, 428:653–657, 2004.
[15] Richly, E. and Leister, D., An improved prediction of chloroplast proteins reveals diversities and
commonalities in the chloroplast proteomes of Arabidopsis and rice, Gene, 329:11–16, 2004.
[16] Sato, N., SISEQ: Manipulation of multiple sequence and large database files for common platforms, Bioinformatics, 16:180–181, 2000.
[17] Sato, N., Was the evolution of plastid genetic machinery discontinuous?, Trends Plant Sci., 6:151–
155, 2001.
[18] Sato, N., Comparative analysis of the genomes of cyanobacteria and plants, Genome Inform.,
13:173–182, 2002.
[19] Sato, N., Gclust: Genome-wide clustering of protein sequences for identification of photosynthesis-related genes resulting from massive horizontal gene transfer, Genome Inform., 14:585–586, 2003.
[20] Sato, N. and Ishikawa, M., Identification of novel chloroplast proteins of endosymbiotic origin by
phylogenetic profiling using homolog groups, Abstract Book of GIW2004, P139, 2004.
[21] The Arabidopsis Genome Initiative, Analysis of the genome sequence of the flowering plant Arabidopsis thaliana, Nature, 408:796–815, 2000.
[22] Wang, D. Y. C., Kumar, S., and Hedges, S. B., Divergence time estimates for the early history of
animal phyla and the origin of plants, animals and fungi. Proc. Biol. Sci., 266B:163–171, 1999.
[23] http://nsato4.c.u-tokyo.ac.jp/old/Gclust/Gclust.html/
[24] http://merolae.biol.s.u-tokyo.ac.jp/
[25] http://signal.salk.edu/tabout.html