High Coding Density on the Largest Paramecium tetraurelia

Current Biology, Vol. 14, 1397–1404, August 10, 2004, 2004 Elsevier Ltd. All rights reserved.
DOI 10.1016/j .c ub . 20 04 . 07 .0 2 9
High Coding Density on the Largest
Paramecium tetraurelia Somatic Chromosome
Marek Zagulski,1 Jacek K. Nowak,1
Anne Le Mouël,2 Mariusz Nowacki,1,2
Andrzej Migdalski,1 Robert Gromadka,1
Benjamin Noël,3 Isabelle Blanc,3 Philippe Dessen,4
Patrick Wincker,5 Anne-Marie Keller,3 Jean Cohen,3
Eric Meyer,2,* and Linda Sperling3,*
1
Institute of Biochemistry and Biophysics
DNA Sequencing Laboratory
Polish Academy of Sciences
Pawinskiego 5a
02-106 Warsaw
Poland
2
Laboratoire de Génétique Moléculaire
ENS
46 rue d’Ulm
75005 Paris
France
3
Centre de Génétique Moléculaire
CNRS
91198 Gif-sur-Yvette cedex
France
4
Institut Gustave-Roussy
39 rue Camille Desmoulins
94805 Villejuif cedex
France
5
Genoscope—Centre National de Séquençage
2 rue Gaston Crémieux CP5706
91057 Evry cedex
France
Summary
Paramecium, like other ciliates, remodels its entire
germline genome at each sexual generation to produce a somatic genome stripped of transposons and
other multicopy elements [1]. The germline chromosomes are fragmented by a DNA elimination process
that targets heterochromatin to give a reproducible
set of some 200 linear molecules 50 kb to 1 Mb in size
[2]. These chromosomes are maintained at a ploidy of
800n in the somatic macronucleus and assure all gene
expression. We isolated and sequenced the largest
megabase somatic chromosome in order to explore
its organization and gene content. The AT-rich (72%)
chromosome is compact, with very small introns (average size 25 nt), short intergenic regions (median size
202 nt), and a coding density of at least 74%, higher
than that reported for budding yeast (70%) or any other
free-living eukaryote. Similarity to known proteins
could be detected for 57% of the 460 potential protein
coding genes. Thirty-two of the proteins are shared
with vertebrates but absent from yeast, consistent
with the morphogenetic complexity [3] of Paramecium, a long-standing model for differentiated functions shared with metazoans but often absent from
*Correspondence: [email protected] (E.M.); [email protected] (L.S.)
simpler eukaryotes [4]. Extrapolation to the whole genome suggests that Paramecium has at least 30,000
genes.
Results and Discussion
The 50–60 germline chromosomes of the ⵑ100 Mb Paramecium genome are reproducibly rearranged during
macronuclear development, with elimination of 10%–
15% of the DNA and amplification to high copy number.
Numerous (ⵑ60,000) short unique copy elements (internal eliminated sequences or IES) are precisely excised,
which is necessary for gene expression since most
genes are interrupted by IESs [5]. Transposons and
other repeated sequences are removed by an imprecise
mechanism leading to fragmentation of the germline
chromosomes into smaller macronuclear chromosomes, which are healed by de novo telomere addition
[2]. Since several rounds of endoreplication precede
DNA elimination, considerable heterogeneity exists at
chromosome ends, even within a single macronucleus.
The finished megabase chromosome sequence of
984,602 nt is a single contiguous sequence close to the
expected 1 Mb size, and conceptual and experimental
restriction digestions with the rare cutter BsiWI are in
good agreement (Supplemental Figure S1 online). Telomeric repeats at somatic chromosome ends consist of
approximately 30 copies of A3C3 and A2C4 hexanucleotides branched randomly over ⵑ1 kb regions [6]. We
mapped reads with telomeric repeats from whole genome shotgun (WGS) primary sequence data to the
chromosome. (The megabase shotgun library, made by
restriction digestion and tailing, does not contain telomeres. Primary sequence data from the Paramecium
WGS project [http://www.genoscope.cns.fr/] is available in the trace archive [http://www.ncbi.nlm.nih.gov/
Traces/].) Many reads mapped to the left end of the
sequence, but few reads mapped to the right end (data
not shown). However, we found tandem repeats of the
P126 element, a 126 nt motif related to G protein WD40 repeats, at both ends of the chromosome. The P126
element was originally identified by Forney and Rodkey
[7] in the immediate subtelomeric region of several macronuclear chromosomes, so this is a good indication
that the sequence does extend to the telomeric region
at both ends and represents an entire macronuclear
chromosome.
Strikingly, the ratio of observed/expected CpG dinucleotides for the sequence is 0.40, while this ratio is
close to 1.0 for the other dinucleotides. DNA methylation
on cytosine, which has not been reported in Paramecium, might explain this severe CpG depression. It could
occur in the germline micronucleus or at a specific developmental stage, as discussed in the Supplemental
Data.
Table 1 compares other characteristics of the chromosome with those of Dictyostelium chromosome 2 [8] and
of the Plasmodium [9] and yeast [10] genomes. The
Current Biology
1398
Table 1. Megabase Chromosome Characteristics and Comparison to Other Organisms
Size (bp)
G⫹C content overall
In predicted coding regions
In predicted intergenic regions
No. of genes
Genes with introns (%)
Number of introns
Number of exons
Average number of exons per CDS
Average intron length (bp)
Average exon length (bp)
Average gene length (bp)
Percent coding
Gene density (kb/gene)
Number of tRNA genes
P. tetraurelia
D. discoideum
P. falciparum
S. cerevisiae
984,602
27.5
29.2
15
460
84
1,061
1,521
3.3
24.8
482
1,651
74.4
2.14
1b
7,520,000
22.2
28
14
2,799
68
3,587
6,398
2.3
177
711
1,626
60.5
2.6
73
22,853,764
19.4
23.7
13.6
5,268
54
7,406
12,674
2.4
178.7
979
2,283
52.6
4.3
43
12,495,682
38.3
NA
NA
5,770
5
272
NA
ⵑ1
287a
NA
1,424
70.5
2.09
275
The characteristics of the megabase DNA sequence were calculated by Artemis. The percent coding is exclusive of introns (with introns,
77.1%). Only three previously characterized genes were detected by a BLASTN homology search of the sequence against the public nucleotide
database. The only predicted tRNA gene, with a marginally significant Cove score, is uncertain. The sequence contains no rRNA genes, as
expected, since they are tandemly arranged on dedicated rDNA chromosomes. Two inverted DNA repeats, of unknown origin and significance,
are separated by ⵑ1.3 and ⵑ22 kb, respectively, for the smaller and the larger repeat; the smaller repeat is GC rich. Smaller repeat:
132845..133153 and 136154..136462, size 308 bp; larger repeat 267647..268472 and 289000..289826, size 826 bp. Data for comparison is for
D. discoideum chromosome 2 [8] and for the P. falciparum and S. cerevisiae genomes as presented in [9] and [10]. NA, not available.
a
Calculated from the 256 spliceosomal introns in Yeast Intron DataBase [40].
b
Cys, anticodon GCA, Cove score 56.41 (516706..516635).
72.5% AT content, although lower than that of Plasmodium or of Dictyostelium, is one of the highest so far
reported for a eukaryotic genome. Gene models were
annotated manually and using GlimmerM (Figure 1 and
Supplemental Table at http://paramecium.cgm.cnrsgif.fr/megabase/), resulting in a predicted coding density of 74% (77% if introns are not excluded from the
calculation). The only eukaryotic genome reported to
have higher coding density is that of the parasite Encephalitozoon cuniculi [11]. It is important to note that
few Paramecium genes have been studied to date and
no complete, annotated ciliate genome is available. The
closest complete genome, that of Plasmodium, is separated by at least 1 billion years, the time at which the
alveolates, comprising ciliates and apicomplexan parasites, branched from other eukaryotes [12]. Plasmodium, moreover, has lost genes owing to its parasitic
lifestyle. The gene models presented here must therefore be considered preliminary until other data become
available to improve gene finder training and to validate
the predicted CDS, especially those with limited sequence similarity to known proteins.
The introns are smaller in size and more frequent than
in Dictyostelium, Plasmodium, or yeast. Indeed, Paramecium introns are among the smallest known, ranging
in size from 20 to 35 nt. They contain the canonical GT
. . . AG splice site junctions of spliceosomal introns.
Most of the genes on the megabase chromosome (84%)
are interrupted by one or more introns, and their size
distribution is the same as for the 462 introns annotated
during a pilot project of random single-run sequencing
of the Paramecium macronuclear genome [13].
Figure 2 shows the size distribution of the intergenic
regions; half are smaller than 200 nt, and the intervals
are even smaller between convergent genes. It has been
shown for one locus on a different chromosome that
two convergent genes use the same 26 bp between
their respective stop codons, on opposite strands, as
3⬘ UTR (D. Kobric and R.E. Pearlman, personal communication). The few large intergenic regions have a size
distribution similar to that of the annotated megabase
genes, suggesting that these regions may contain additional CDSs or pseudogenes.
Detection of the only currently annotated pseudogene, PTMB.411c, involved comparison with a series
of paralogs found elsewhere in the genome. Alignment
revealed that this CDS is indeed an RNA N6-adenine
methylase pseudogene, presenting several deletions,
mutated splice site junctions, and in-frame stop codons.
The paucity of identified pseudogenes may reflect their
tendency to be associated with heterochromatin, as in
many species [14], and thus to be eliminated during
development of the macronucleus: the first pseudogene
that was identified in Paramecium is near a telomere
and is severely underamplified with respect to the bulk
of the macronuclear DNA [15]. It is also possible that
the Paramecium genome rapidly loses sequences not
required for function, so that pseudogenes do not remain recognizable for long.
As shown in Table 2, homologs could be detected for
260 (57%) of the predicted proteins. The evidence was
provided by examination of matches against the swissprot⫹sptrembl protein database, NCBI’s Conserved Domain Database, and the InterPro protein domain database, as detailed in Experimental Procedures. If we
remove the 30 gene models that we consider uncertain
for lack of either sequence similarity with known proteins, structural motifs, or paralogs elsewhere in the
Paramecium genome (white boxes in Figure 1), 40%
of the predicted proteins remain orphans, presumably
representing phylum- or species-specific proteins.
Half of the predicted proteins have at least one InterPro protein signature. The ten most frequent InterPro
signatures are given in Table 2 along with the percentage
Figure 1. Map of the Megabase Chromosome
The map shows the position and the strand for each predicted CDS of the megabase chromosome. Each line represents 100 kb, with tick
marks every 10 kb. Asterisks associated with the CDS number indicate a predicted signal peptide. Gene ontology molecular function terms
were mapped using the GOA project gene associations for significant BLASTP matches and the interpro2go mappings for InterPro domains,
as detailed in Experimental Procedures. The term assignments are distributed as follows: binding, 11; catalytic activity, 97; chaperone activity,
3; enzyme regulator activity, 1; molecular function unknown, 92; motor activity, 1; nucleic acid binding, 18; signal transducer activity, 3;
structural molecule activity, 15; transcription regulator activity, 2; transporter activity, 16. If several terms were found for a given gene, the
most specific one was adopted (e.g., “nucleic acid binding” rather than “binding” or “catalytic activity” for an RNA helicase).
Current Biology
1400
Figure 2. Size of Intergenic Regions
The figure shows a histogram of the size of megabase chromosome
intergenic regions (transparent bars with black borders), superimposed on a histogram of the size of the CDSs (gray bars with no
borders), between 0 and 3000 nt, with 100 nt bins. There are CDSs
larger than 3000 nt, but there are essentially no intergenic regions
larger than 3000 nt. The intergenic regions were further classed
according to the orientation of the flanking CDSs. The median distance between tandem genes is 198.5 nt; between convergent
genes, 144 nt; and between divergent genes, 309 nt. The median
CDS size is 1210 nt.
of proteins in other organisms that contain each of the
domains. As the genes on this chromosome represent
only ⵑ1.5% of the organism’s gene complement, extrapolation to the whole genome requires caution. Nonetheless, the MORN motif appears to be strikingly more
frequent in Paramecium than in the other organisms,
while K⫹ channel pore region domains, protein kinase
domains, and metallo-phosphoesterase domains are
overrepresented, especially compared to their frequency in unicellular eukaryotes. As previously discussed in the context of the random survey project [13],
Paramecium seems to devote an important part of its
coding capacity to proteins involved in signaling pathways, which may have evolved as part of the organism’s
repertoire of responses to environmental challenge. The
presence of four K⫹ channel genes on the chromosome
is compatible with the recent prediction that Paramecium contains at least 200 K⫹ channel genes—significantly
more than man [16]. The authors suggest that different
sets of K⫹ channels, which are located in ciliary membranes and involved in ciliary motility, might be expressed under different environmental conditions.
Given the size of the intergenic regions on the megabase chromosome (Figure 2), it is likely that many genes
have extremely small promoters, so we looked for clusters of functionally related genes that might be cotranscribed using the Gene Ontology (GO) term assignments
and the InterPro signatures. We did not find any evidence for clusters of genes involved in the same biological process or pathway. However, we did identify six
examples of clusters of two or three paralogous genes
(Supplemental Table S2). These paralogous genes undoubtedly originate from quite ancient duplications,
given the low percentage of amino acid identity between
the proteins (25%–56%). In the case of the cluster of
three actin genes (PTMB.200, 201c, 202), each gene has
closer homologs not only in the Paramecium genome
(which contains at least 20 actin or actin-like genes
according to preliminary analysis of the primary sequence data [not shown]), but also in other species.
Table 2. Characterization of Putative CDSs
Feature
Number
Percent
Predicted CDS
Homology to known proteins or domains
Uncertain proteins
Hypothetical proteins
Pseudogenes
CDS with a signal peptide
CDS with multiple transmembrane helices
CDS with an InterPro match
460
260
30
170
1
31
54
227
100
56.5
6.5
37
0.2
6.7
12
49
Most Frequent InterPro Domains
IPR000719
IPR001841
IPR001680
IPR002048
IPR000379
IPR003593
IPR003409
IPR001806
IPR004842
IPR001622
Protein kinase
Zn-finger, RING
G-protein beta WD-40 repeat
Calcium-binding EF-hand
Esterase/lipase/thioesterase
AAA ATPase
MORN motif
RAS GTPase superfamily
Metallo-phosphoesterase
K⫹ channel, pore region
32
7
7
6
6
5
5
4
4
4
Pt
Dd
Sc
At
Ce
Dm
Hs
6.9
1.5
1.5
1.3
1.3
1.1
1.1
0.87
0.87
0.87
1.9
0.8
1.1
0.9
NA
1.1
NA
0.86
NA
NA
1.9
0.6
1.6
0.2
0.6
1.3
0
0.6
0.3
0
4.0
1.9
0.9
0.6
0.8
1.2
0.06
0.4
0.3
0.1
2.5
0.8
0.7
0.5
0.6
1.6
0.005
0.4
0.4
0.5
1.9
1.0
1.2
0.7
0.9
0.9
0.006
0.6
0.3
0.3
2.3
1.4
1.3
1.0
0.4
0.5
0.05
0.7
0.1
0.4
The upper part of the table describes features of the 460 annotated CDSs. The CDSs with homology to known proteins were identified by
evaluation of sequence similarity to known proteins or domains, as described in Experimental Procedures. The lower part of the table presents
the occurrence of the 10 most frequent InterPro domains of the Paramecium megabase chromosome, compared to the 30 most frequent
domains of Dictyostelium chromosome 2 as given in [8] and the frequency of domains in fully sequenced eukaryotes (InterPro database). The
number of megabase genes with each domain is given in the first column, followed by the percentage of genes from each organism with the
given domain. Pt, P. tetraurelia; Dd, D. discoideum; Sc, S. cerevisiae; At, A. thaliana; Ce, C. elegans; Dm, D. melanogaster; Hs, H. sapiens.
NA, not among the 30 most frequent Dictyostelium chromosome 2 domains.
Paramecium Somatic Chromosome Organization
1401
We compared the set of predicted proteins with proteomes from complete eukaryotic genomes to screen for
interesting patterns of incidence among species. We did
not find evidence for proteins uniquely shared with Plasmodium. However, 32 proteins are shared by vertebrates
(represented by man) but not yeasts (represented by
S. cerevisiae and S. pombe). Among the latter, most are
also shared with invertebrates (represented by C. elegans
and D. melanogaster) and/or plants (represented by
A. thaliana) and a few with filamentous fungi (represented by N. crassa). After validation by homology
searches against the entire protein database, we focused
our attention on the 24 proteins presented in Table 3.
Two groups seem particularly noteworthy. First, a few
putative proteins are shared with man but not fungi or
invertebrates, including PTMB.423c, which has a very
significant match with a hypothetical human protein, and
PTMB.114, which matches an interferon-induced guanylate binding protein found only in vertebrates and
plants. A second group consists of five proteins shared
with plants but not animals. Since Plasmodium contains
a relic plastid [9], it was of interest to see whether any
of these genes might come from plastids, in support for
the theory of a single symbiogenetic origin of chloroplasts and subsequent lateral transfer of plastids from
a red alga to the corticoflagellate ancestor of chromalveolates [17], before the divergence of ciliates and apicomplexa. Four of the proteins in this group (PTMB.151c,
PTMB.356, PTMB.361c, and PTMB.376c) do have homologs in bacteria, including cyanobacteria, and one of
them, PTMB.356, is a probable lysophospholipase also
found among Plasmodium apicoplast proteins. However, in all cases, the best matches are with plants and
apicomplexa, not with cyanobacteria, so we find no phylogenetic support for a plastid origin of any of these
genes.
Other Paramecium proteins, implicated in axonemal
structures, are conserved in plants, animals, ciliates,
and flagellates but absent from fungi (the radial spoke
head protein PTMB.212c, absent from A. thaliana but
present in algae such as Chlamydomonas). Many additional proteins are absent from the reference yeasts,
including some proteins that have been studied in ciliates. Copines (PTMB.394c) are Ca2⫹-dependent phospholipid binding proteins first discovered in Paramecium [18]; Myb-related transcription factors (PTMB.149)
have been described in spirotrich ciliates [19]; and UNC119 orthologs (PTMB.296), found up until now only in
animals, have been characterized in Paramecium
(D. Gogendeau and F. Koll, personal communication).
Another protein shared with animals is an inositol-1,4,5
triphosphate receptor (PTMB.445c), a ligand-gated calcium ion channel that modulates Ca2⫹ release from intracellular stores. Although the Paramecium protein shares
only 25% amino acid identity with its vertebrate homologs, the protein contains the ryanodin/IP3 homology
domain, a Ca2⫹/Na⫹ pore domain, and the characteristic
six transmembrane helices located at the C terminus of
the molecule. It will be interesting to see whether this
protein is involved in morphogenesis in Paramecium,
since it has been shown that cortical pattern is respecified by lithium ions [20] in a manner reminiscent of the
teratogenic effects of lithium during amphibian em-
bryogenesis. All of the genes on the chromosome and
their annotations are available (gene table at http://
paramecium.cgm.cnrs-gif.fr/megabase; Generic Genome
Browser at http://paramecium.cgm.cnrs-gif.fr/cgi-bin/
gbrowse).
Examination of the megabase chromosome illustrates
the rich diversity in functions and origins of Paramecium
protein-coding genes. Given the ⵑ75 Mb size of the
macronuclear genome, we can extrapolate to a surprisingly large gene complement for a unicellular organism
(⬎30,000 genes). Since powerful direct and reverse genetic techniques are available, Paramecium could be a
system of choice for functional analyses of many genes
shared with vertebrates and/or apicomplexan parasites
but not always present in model eukaryotes such as
yeast, worm, or fly.
In conclusion, our analysis of the largest DNA molecule in the Paramecium macronucleus provides a first
glimpse of a chromosome “stripped for action” by DNA
rearrangements that remove all the germline heterochromatic sequences. As complete ciliate genomes become available, it should be possible to see whether
the unusually high coding content of this molecule is a
characteristic result of the differentiation of a somatic
nucleus or whether other forces are driving the Paramecium genome toward prokaryote-like compactness.
Experimental Procedures
Paramecium Strain and Culture
Paramecium tetraurelia strain d4-2 is an entirely homozygous wildtype laboratory stock [21]. Cells were cultured using standard methods [22].
Chromosome Isolation and Shotgun Library Construction
The megabase chromosome was isolated by clamped homogeneous electric field (CHEF) gel electrophoresis as previously described [23]. Young cells (3 divisions post autogamy) were used for
a first separation at 70 s pulse frequency. In order to eliminate small
DNA molecules, the region between 600 kb and the limit mobility
zone was cut out and subjected to electrophoresis on a second gel
using 113 s pulses, to allow optimal separation in the megabase
region. The largest chromosome was cut out of the gel and the DNA
was purified using agarase (Sigma).
Library construction, optimized for small amounts of DNA, involved partial restriction digestion with a very frequent cutter,
Tsp509I. Size-selected partial digestion products (1–3 kb) were
cloned by a tailing procedure in pCRScript (Stratagene).
Sequence Determination
Clones from the shotgun library were amplified and the PCR products were sequenced using dye-terminator chemistry (DYEnamic
ET Terminator Cycle Sequencing Kit Amersham US81050) with MegSeqR or MegSeqU primers. Bases were called using Phred [24] and
assembled using GAP4 [25]. Approximately 6200 plasmids from the
shotgun DNA library were sequenced.
Final shotgun assembly (4.5 ⫻ coverage) did not cover the whole
chromosome, and 40 gaps were found. Gaps were filled either by
primer walking or by multiplex PCR [26]. The entire sequence of the
980 kb region was determined with a statistical error rate 1/700,000.
The entire chromosome was covered by 60 overlapping PCR
products (15–25 kb). Restriction digestion patterns with several different enzymes were fully compatible with the sequence.
Annotation
Artemis [27] was used for annotation, and the entire analysis relied
on custom Perl scripts written using the Bioperl library [28]. At the
outset of the project, too few Paramecium genes were available for
training of an ab initio gene finder. We therefore manually annotated
Current Biology
1402
Table 3. Species Distribution of Conserved Proteins
Putative
Protein
Length
(aa)
Group
Best Match
Accession Number
E Value
Identity
(%)
Overlap
3e-52
28
0.70
2e-30
25
1.00
3e-98
28
0.96
hypothetical protein
T4B21.12 (putative calcium-dependent protein kinase)
hypothetical protein
hypothetical protein
T10F20.3 protein (F2H15.21 protein)
5e-39
2e-27
35
26
0.69
0.71
6e-23
6e-23
9e-63
25
34
37
0.72
0.45
0.95
hypothetical protein
similar to CG17349 gene product
hypothetical protein KIAA0590
hypothetical protein F09G8.2 in
chr. III precursor
CG1637 protein
Unc-119 protein homolog (retinal
protein 4)
similar to ATP/GTP-binding protein
inositol 1,4,5-triphosphate receptor
type 2
0.0
1e-54
1e-167
2e-29
37
50
27
28
1.00
0.62
1.00
1.00
3e-36
3e-26
27
33
0.79
1.00
3e-57
5e-34
35
25
0.57
0.40
1e-125
56
0.96
6e-32
8e-28
1e-108
2e-53
45
32
40
50
0.28
0.93
0.92
0.76
2e-35
1e-19
35
29
0.56
0.92
2e-81
35
1.00
Definition
Proteins Shared with Vertebrates (and Plants), Absent from Yeasts, Fly, and Worm
PTMB.114
835
VP
H. sapiens
GBP1_HUMAN
PTMB.212c
452
V
H. sapiens
Q9NQ10
PTMB.423c
970
V
H. sapiens
Q8TBY9
interferon-induced guanylate-binding
protein 1
DJ412I7.1 (similar to radial spokehead
protein)
hypothetical protein
Proteins Shared with Plants but Not with Animals
PTMB.151c
PTMB.164c
341
494
P
PNF
A. thaliana
A. thaliana
Q9C6B3
Q9ZSA3
PTMB.356
PTMB.361c
PTMB.376c
415
420
391
PF
PF
PNY
A. thaliana
P. falciparum
A. thaliana
O23287
PF10_0306
Q9LDU9
Proteins Shared Uniquely with Animals
PTMB.56
PTMB.153
PTMB.199
PTMB.227
1196
315
1360
339
VI
VI
VI
VI
H.
H.
H.
C.
sapiens
sapiens
sapiens
elegans
Q8NE11
Q8NHP5
O60332
YLS2_CAEEL
PTMB.378
PTMB.296
559
174
I
VI
D. melanogaster
H. sapiens
Q9VZ56
U119_HUMAN
PTMB.414
PTMB.445c
708
2910
VIF
VI
H. sapiens
H. sapiens
Q8NEM8
IP3S_HUMAN
Other Proteins Shared with Plants and Animals but Absent from Yeasts
PTMB.142c
390
VIPN
H. sapiens
HPPD_HUMAN
PTMB.149
PTMB.175c
PTMB.180c
PTMB.256c
490
247
549
263
VIPN
VIP
VIPF
VIPN
H. sapiens
H. sapiens
A. thaliana
H. sapiens
MYBB_HUMAN
ATTY_HUMAN
Q9SLI8
NUHM_HUMAN
PTMB.357c
PTMB.360c
412
218
VIPF
VIP
H. sapiens
H. sapiens
Q9Y377
GILT_HUMAN
PTMB.394c
534
VIP
H. sapiens
CNE5_HUMAN
4-hydroxyphenylpyruvate dioxygenase (EC 1.13.11.27) (4HPPD) (HPD)
(HPPDase)
Myb-related protein B (B-Myb)
tyrosine aminotransferase (EC 2.6.1.5)
F20D21.27 protein
NADH-ubiquinone oxidoreductase 24
kDa subunit, mitochondrial precursor (EC 1.6.5.3) (EC 1.6.99.3)
CGI-67 protein
gamma interferon inducible lysosomal thiol reductase precursor
copine V
Putative proteins of the chromosome shared with vertebrates (V, represented by H. sapiens), plants (P, represented by A. thaliana), and/or
invertebrates (I, represented by D. melagnogaster and C. elegans) but (with the exception of PTMB.376c) absent from yeasts (Y, represented
by S. cerevisiae and S. pombe). N indicates that the protein is shared with N. crassa, and F indicates that the protein is also shared by P.
falciparum. The columns, from left to right, give the name of the Paramecium CDS whose product is under consideration, the length of the
putative protein, the distribution among the different groups of organisms, and the characteristics of the best match: species, accession
number, definition, E-value , amino acid identity, and overlap (the fraction of the query that is covered by the subject match). We note that
PTMB.256c is a subunit of respiratory complex I, which is present in mitochondria of many species including human, Paramecium, and some
yeasts such as S. pombe and Y. lipolytica, but absent from S. cerevisiae. The complete list of genes with vertebrate homologs but absent
from yeast is: PTMB.56 WD-40 repeat protein, PTMB.62 hypothetical protein with arrestin domain, PTMB.68 K⫹ channel, PTMB.113 K⫹
channel, PTMB.114 Guanylate binding protein, PTMB.142c 4-hydroxyphenyl pyruvate dioxygenase, PTMB.149 Myb-related protein, PTMB.153
Conserved hypothetical protein, PTMB.171c Guanylyl cyclase, PTMB.175c Tyrosine aminotransferase, PTMB.180c Phosphatase regulatory
subunit, PTMB.189c Phosphatidyl inositol-4-phosphate-5 kinase, PTMB.199 Conserved WD-40 and TPR repeat-containing protein, PTMB.206c
Phosphatidyl inositol-4-phosphate-5 kinase, PTMB.212c Radial spoke protein, PTMB.214c Conserved hypothetical protein, PTMB.227 Dnase
II-like, PTMB.231c Prenylcysteine lyase, PTMB.252c hypothetical protein with LITAF membrane-association domain, PTMB.256c NADHubiquinone oxidoreductase 24 kDa subunit, PTMB.296 UNC-119, PTMB.353 MORN repeat protein, PTMB.357c Conserved protein, alpha/beta
hydrolase, PTMB.360c Gamma-interferon inducible lysosomal thiol reductase-like protein, PTMB. 361c, MORN repeat protein, PTMB.387c
MORN repeat protein, PTMB.394c Copine, PTMB.403c K⫹ channel, PTMB.414 Zn-carboxypeptidase, PTMB.422c Guanylate nucleotide binding
protein, PTMB.423c WD-40 protein, and PTMB.445c Inositol 1,4,5-phosphate receptor.
genes using multiple lines of evidence: GC content, codon bias,
and sequence similarity with known proteins and intron predictions
(profile hmms were built using the HMMER-2.2 package [29], using
introns from a pilot random sequencing project [13] and from Paramecium genes in the Invertebrate division of the public nucleotide
database). Once enough genes had been annotated manually, GlimmerM was trained and used to complete the structural annotation (see below). tRNA predictions were made using tRNAscanSE-1.23 [30].
The predicted proteins were extracted from the gene models, and
Paramecium Somatic Chromosome Organization
1403
adjustment of the gene models as well as functional annotation were
based on case-by-case examination of all the evidence provided by
(1) BLASTP matches (cutoff E ⬍ 10⫺2, BLOSUM62 matrix and default
parameters) against the swissprot⫹sptrembl database and clustalw
mutiple alignments with the five best matches, (2) RPSBLAST
matches (cutoff E ⬍ 10⫺1) with NCBI’s Conserved Domain Database
[31], (3) InterPro matches found using InterProScan v3.1 [32] with
default parameters, (4) Signal peptide (SignalP-2.0 [33]), transmembrane helix (TMHMM [34]), and coiled coils [35] predictions, and (5)
identification of close Paramecium paralogs by TBLASTN (cutoff
E ⬍ 10⫺40) against the primary WGS sequence data.
GO [36] Molecular Function terms were mapped using GOA gene
associations and interpro2go mappings [37]. The terms were further
mapping to high-level goaslim function terms, generating the classification shown in Figure 1. A gene table with the evidence used for
annotation is available at http://paramecium.cgm.cnrs-gif.fr/megabase/. The annotation can be viewed using the Generic Genome
Browser [38] at http://paramecium.cgm.cnrs-gif.fr/cgi-bin/gbrowse.
Translated CDSs were compared to complete proteomes (H. sapiens, A. thaliana, D. melanogaster, C. elegans, S. cerevisiae, and S.
pombe from http://www.ebi.ac.uk/proteome/, P. falciparum from
http://plasmodb.org, and N. crassa from http://www.broad.mit.edu/
annotation/fungi/neurospora/; December 2003) using BLASTP. Species with a match with an E-value ⬍ 10⫺3 were scored after visual
inspection of the alignments. Proteins with interesting patterns of
occurrence among these species were then used to search the
nonredundant protein database at NCBI for validation and in order
to examine homologs in all available species using the “taxonomy
report” feature.
Gene Prediction with GlimmerM
GlimmerM v3 [39] was used for ab initio gene prediction. The source
code was modified to take into account the Paramecium genetic
code. For training, a reference set of genes was constituted. The
set is comprised of 51 Paramecium tetraurelia genes from the Invertebrate division of the public nucleotide database, 66 conserved
genes from the Paramecium WGS project that were validated by
ClustalW multiple alignment with homologs from other species, and
141 genes from the manual annotation of the megabase chromosome consisting either of conserved genes validated by multiple
alignment with homologs from other species or genes validated by
nucleotide alignment with Paramecium paralogs found in the primary data from the Paramecium WGS project. Given the small size
of Paramecium introns and gene density higher than that of the
Plasmodium falciparum genome for which GlimmerM was originally
designed, we found the best predictions by changing the splice site
filter size for training from 60 nt to 24 nt. Although the accuracy of
the predictions is expected to improve as the size of the training
set increases, performance based on the initial set of 262 reference
genes is already quite useful as judged by comparison of the GlimmerM gene models with CDS from the manual annotation of the
entire megabase chromosome (Supplemental Table S3).
Supplemental Data
Supplemental Data, including tables, a figure, and a discussion of
dinucleotide frequencies, can be found at http://www.currentbiology.com/cgi/content/full/14/15/1397/DC1. A gene table is available at http://paramecium.cgm.cnrs-gif.fr/megabase/. The annotation can be viewed with the Generic Genome Browser at http://
paramecium.cgm.cnrs-gif.fr/cgi-bin/gbrowse.
Acknowledgments
We gratefully acknowledge support from the CNRS for construction
of a European network (GDRE Paramecium Genomics), the Polish
Ministry of Science (grant KBN 3P04A00625), the Ministère de l’Education Nationale, de la Recherche et de la Technologie (MENRT)
Program “Centre de Ressources Biologiques” (J.C.), the MENRT
Program “Recherche fondamentale en Microbiologie et Maladies
infectieuses et parasitaires” (E.M.), the Association pour la Recherche sur le Cancer (grant # 5733, E.M.), and the Ligue Nationale
contre le Cancer (grant # 75/01-RS/73, E.M.). M.N. was supported
by a graduate studentship from the MENRT, and M.N. and J.K.N.
received support from the CNRS in the framework of the PolishFrench Centre for Plant Biotechnology. I.B. received support
through the MENRT program “Centres de Ressources Biologiques.”
We are indebted to Janine Beisson and Christoper J. Herbert for
constructive criticism of the manuscript. We thank Lawrence Aggerbeck and the Gif-Orsay Microarray Platform for helping us index the
shotgun library.
Received: January 21, 2004
Revised: June 14, 2004
Accepted: June 14, 2004
Published: August 10, 2004
References
1. Jahn, C.L., and Klobutcher, L.A. (2002). Genome remodeling in
ciliated protozoa. Annu. Rev. Microbiol. 56, 489–520.
2. Le Mouël, A., Butler, A., Caron, F., and Meyer, E. (2003). Developmentally regulated chromosome fragmentation linked to imprecise elimination of repeated sequences in paramecia. Eukaryot.
Cell 2, 1076–1090.
3. Jerka-Dziadosz, M., and Beisson, J. (1990). Genetic approaches
to ciliate pattern formation: from self-assembly to morphogenesis. Trends Genet. 6, 41–45.
4. Görtz, H.D. (1988). Paramecium (Berlin: Springer-Verlag).
5. Gratias, A., and Bétermier, M. (2001). Developmentally regulated
excision of internal DNA sequences in Paramecium aurelia. Biochimie 83, 1009–1022.
6. Baroin, A., Prat, A., and Caron, F. (1987). Telomeric site position
heterogeneity in macronuclear DNA of Paramecium primaurelia.
Nucleic Acids Res. 15, 1717–1728.
7. Forney, J., and Rodkey, K. (1992). A repetitive DNA sequence
in Paramecium macronuclei is related to the beta subunit of G
proteins. Nucleic Acids Res. 20, 5397–5402.
8. Glöckner, G., Eichinger, L., Szafranski, K., Pachebat, J.A., Bankier, A.T., Dear, P.H., Lehmann, R., Baumgart, C., Parra, G.,
Abril, J.F., et al. (2002). Sequence and analysis of chromosome
2 of Dictyostelium discoideum. Nature 418, 79–85.
9. Gardner, M.J., Hall, N., Fung, E., White, O., Berriman, M., Hyman,
R.W., Carlton, J.M., Pain, A., Nelson, K.E., Bowman, S., et al.
(2002). Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 419, 498–511.
10. Goffeau, A., Barrell, B.G., Bussey, H., Davis, R.W., Dujon, B.,
Feldmann, H., Galibert, F., Hoheisel, J.D., Jacq, C., Johnston,
M., et al. (1996). Life with 6000 genes. Science 274, 563–567.
11. Katinka, M.D., Duprat, S., Cornillot, E., Metenier, G., Thomarat,
F., Prensier, G., Barbe, V., Peyretaillade, E., Brottier, P., Wincker,
P., et al. (2001). Genome sequence and gene compaction of
the eukaryote parasite Encephalitozoon cuniculi. Nature 414,
450–453.
12. Baldauf, S.L., Roger, A.J., Wenk-Siefert, I., and Doolittle, W.F.
(2000). A kingdom-level phylogeny of eukaryotes based on combined protein data. Science 290, 972–977.
13. Sperling, L., Dessen, P., Zagulski, M., Pearlman, R.E., Migdalski,
A., Gromadka, R., Froissard, M., Keller, A.M., and Cohen, J.
(2002). Random sequencing of Paramecium somatic DNA. Eukaryot. Cell 1, 341–352.
14. Dasilva, C., Hadji, H., Ozouf-Costaz, C., Nicaud, S., Jaillon, O.,
Weissenbach, J., and Crollius, H.R. (2002). Remarkable compartmentalization of transposable elements and pseudogenes
in the heterochromatin of the Tetraodon nigroviridis genome.
Proc. Natl. Acad. Sci. USA 99, 13636–13641.
15. Dubrana, K., and Amar, L. (2000). Programmed DNA underamplification in Paramecium primaurelia. Chromosoma 109,
460–466.
16. Haynes, W.J., Ling, K.-Y., and Kung, C. (2003). PAK paradox:
Paramecium appears to have more K(⫹)-channel genes than
human. Eukaryot. Cell 2, 737–745.
17. Cavalier-Smith, T. (2002). The phagotrophic origin of eukaryotes
and phylogenetic classification of Protozoa. Int. J. Syst. Evol.
Microbiol. 52, 297–354.
18. Creutz, C.E., Tomsig, J.L., Snyder, S.L., Gautier, M.C., Skouri,
F., Beisson, J., and Cohen, J. (1998). The copines, a novel class
Current Biology
1404
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.
29.
30.
31.
32.
33.
34.
35.
36.
37.
38.
39.
40.
of C2 domain-containing, calcium-dependent, phospholipidbinding proteins conserved from Paramecium to humans. J.
Biol. Chem. 273, 1393–1402.
Yang, T., Perasso, R., and Baroin-Tourancheau, A. (2003). Myb
genes in ciliates: a common origin with the myb protooncogene?
Protist 154, 229–238.
Beisson, J., and Ruiz, F. (1992). Lithium-induced respecification
of pattern in Paramecium. Dev. Genet. 13, 194–202.
Sonneborn, T.M. (1974). Paramecium aurelia. In Handbook of
Genetics, R. King, ed. (New York: Plenum Publishing), pp.
469–594.
Sonneborn, T.M. (1970). Methods in Paramecium research.
Methods Cell. Physiol. 4, 241–339.
Caron, F. (1992). A high degree of macronuclear chromosome
polymorphism is generated by variable DNA rearrangements in
Paramecium primaurelia during macronuclear differentiation. J.
Mol. Biol. 225, 661–678.
Ewing, B., and Green, P. (1998). Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res.
8, 186–194.
Staden, R. (1996). The Staden sequence analysis package. Mol.
Biotechnol. 5, 233–241.
Tettelin, H., Radune, D., Kasif, S., Khouri, H., and Salzberg, S.
(1999). Optimized multiplex PCR: efficiently closing a wholegenome shotgun sequencing project. Genomics 62, 500–507.
Rutherford, K., Parkhill, J., Crook, J., Horsnell, T., Rice, P., Rajandream, M.A., and Barrell, B. (2000). Artemis: sequence visualization and annotation. Bioinformatics 16, 944–945.
Stajich, J.E., Block, D., Boulez, K., Brenner, S.E., Chervitz, S.A.,
Dagdigian, C., Fuellen, G., Gilbert, J.G., Korf, I., Lapp, H., et al.
(2002). The Bioperl toolkit: Perl modules for the life sciences.
Genome Res. 12, 1611–1618.
Eddy, S. (1996). Hidden Markov models. Curr. Opin. Struct. Biol.
6, 361–365.
Lowe, T., and Eddy, S. (1997). tRNAscan-SE: a program for
improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964.
Marchler-Bauer, A., Anderson, J.B., DeWeese-Scott, C., Fedorova, N.D., Geer, L.Y., He, S., Hurwitz, D.I., Jackson, J.D., Jacobs,
A.R., Lanczycki, C.J., et al. (2003). CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res. 31,
383–387.
Zdobnov, E., and Apweiler, R. (2001). InterProScan-an integration platform for the signature-recognition methods in InterPro.
Bioinformatics 17, 847–848.
Nielsen, H., Engelbrecht, J., Brunak, S., and von Heijne, G.
(1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 10,
1–6.
Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E.L.
(2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol.
Biol. 305, 567–580.
Lupas, A., Van Dyke, M., and Stock, J. (1991). Predicting coiled
coils from protein sequences. Science 252, 1162–1164.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H.,
Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T.,
et al. (2000). Gene ontology: tool for the unification of biology.
The Gene Ontology Consortium. Nat. Genet. 25, 25–29.
Camon, E., Magrane, M., Barrell, D., Binns, D., Fleischmann,
W., Kersey, P., Mulder, N., Oinn, T., Maslen, J., Cox, A., et al.
(2003). The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome
Res. 13, 662–672.
Stein, L.D., Mungall, C., Shu, S., Caudy, M., Mangone, M., Day,
A., Nickerson, E., Stajich, J.E., Harris, T.W., Arva, A., et al. (2002).
The generic genome browser: a building block for a model organism system database. Genome Res. 12, 1599–1610.
Salzberg, S., Pertea, M., Delcher, A., Gardner, M., and Tettelin,
H. (1999). Interpolated Markov models for eukaryotic gene finding. Genomics 59, 24–31.
Lopez, P.J., and Séraphin, B. (2000). YIDB: the Yeast Intron
DataBase. Nucleic Acids Res. 28, 85–86.
Accession Numbers
The Accession Number for the megabase chromosome sequence
is CR548612.