Thermophilic Eukaryotes as a Source for Structural Genomic Targets

Craig A. Bingman, Russell L. Wrobel, Frank C. Vojtik, Ronnie O. Frederick, Karl W. Nichols, Brian G. Fox, George N. Phillips, Jr., and John L. Markley
University of Wisconsin-Madison, Department of Biochemistry, 433 Babcock Drive, Madison, Wisconsin, USA 53706-1549, http://www.uwstructuralgenomics.org
Abstract
Thermophilic organisms are believed to provide a superior source of targets for
structure determination. However, it is presently unclear whether homologs of
medically relevant proteins are reliably found in the thermophilic eubacterial and
archaeal genomes that are most frequently targeted in structural genomics efforts.
For our PSI-2 efforts, CESG has been interested in identifying novel sources of
eukaryotic protein folds that may also provide unique insight into the structures of
difficult-to-obtain, medically relevant human proteins. As part of these efforts, CESG
has undertaken a pilot study of Galdieria sulphuraria, a unicellular eukaryotic red
alga (phylum Rhodophyta) that grows well at pH values as low as 1.0 and at
temperatures up to 55°C. A complete genomic sequence and recently predicted
coding sequences provided the starting point for work with this organism. These
data were combined with detailed analysis of expressed sequence tag (EST) data
to increase the probability that a selected gene was also produced as a protein in the
alga. A cloning frequency of ~70%, based on design of 5’ and 3’ primers, was largely
governed by inaccuracies in the gene models, as has been observed by CESG
during work with Arabidopsis and human stem cell proteins. The presented results
show that the initial set of successfully cloned targets proceeded through our
research platform to a deposited structure with higher than usual stage-to-stage
efficiencies in protein expression, purification, and intermediate steps of structure
determination, with 6% of cloned targets yielding deposited PDB structures. This
compares favorably to a corresponding overall success rate of 2.5% across all PSI
centers and particularly to the PSI success rate on eukaryotic targets. Based on
results from our pilot study, a set of biomedically relevant human proteins has been
selected for structure determination in parallel with homologs from Galdieria. This
work will simultaneously advance understanding of structures of biomedically
relevant proteins and give further insight into the properties of homologous proteins
from thermophilic eukaryotes as surrogates for studies of human protein structure.
Cloning
The initial workgroup of 96 targets was independently cloned into both the Gateway
and FlexiVector systems. Two Flexi expression vectors were used, pVP-33K and
pVP-56K. Both bear a His8 tag for IMAC purification, a maltose binding
protein fusion to improve solubility of recombinant proteins, and a tobacco etch
virus (TEV) protease cleavage site before the N-terminus of the recombinant protein.
They differ in that pVP-33K has a tetracysteine motif for universal
protein quantification via FlAsH labeling, which pVP-56K lacks. Both vectors
replace the amino-terminal methionine of the target protein with alanine-isoleucine-alanine.
pVP-16K was the baseline protein production vector for CESG during PSI
Phase 1. It is a Gateway system vector, also with a His8 tag, maltose binding protein,
and TEV cleavage site. It replaces the amino-terminal methionine of the target protein
with a serine after TEV cleavage.
Overall Risk Assessment
The use of multiple expression vectors to study the same set of targets is part of
CESG’s strategy to systematically evaluate new expression vector designs
under conditions that are fully representative of the challenges presented by the
complete protein production pipeline, i.e., from target selection to structure deposition
or work stopped.
Working from cDNA libraries carries its own dangers. CESG’s work with
Arabidopsis showed that even with the best high-redundancy genomic sequences,
calculated coding sequences often disagreed with experimental results.
Accurate intron boundary predictions are very difficult. We found that one can
expect to recover perhaps 70-80% of calculated coding sequences from cDNA
pools in a single experimental attempt. These losses are not only due to
bioinformatics errors. It is difficult to recover low abundance genes by PCR from
cDNA libraries isolated from a few growth conditions, because not all messenger
RNAs are present in the materials used to produce the libraries.
With respect to these risks and difficulties, we have begun to evaluate the
properties of a thermophilic genome to supplement ongoing studies of human
proteins. This poster describes current progress and success in these efforts.
Target Selection
The initial workgroup of 96 targets was selected from the following bioinformatics
data. Based on a preliminary build of the Galdieria sulphuraria genome, Weber and
colleagues supplied 5872 calculated coding sequences (CDS). These assignments were
supported by 2789 EST reads on 5’- and 3’-primed cDNA sequences. Most
Galdieria genes contained introns, in contrast to the reported lack of introns in the
closely related Cyanidioschyzon merolae. Because of potential inaccuracies in
calculated intron boundaries, we required at least 200 nucleotides of
overlap between the EST sequences and coding sequences before moving them
forward in the target selection pipeline. A total of 1391 CDS survived the EST cut. It is
worth noting that by requiring EST support from a relatively small set of EST
sequences, we preferentially selected for more highly expressed genes.
Of these, 407 sequences met our bioinformatic criteria for valid sequence-structure
targets. These include less than 30% identity to a solved 3D structure
deposited in the PDB, relative lack of disorder, no signal peptide or predicted
trans-membrane helices, absence of substantial low complexity regions, and no
match to known sequence patterns associated with retrotransposable elements
and other parasitic DNA sequences.
From this set of 407 sequences, a size-balanced set of 96 targets was selected so
that approximately one third of the selected CDS were under 200 amino acids. These
sequences were then injected into our standard structural genomics platform for
processing via both our cell-free and cell-based protein production platforms, to
support structure determination efforts by both NMR and X-ray crystallography.
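The selection cascade described above can be sketched in Python. This is only an illustration under stated assumptions, not the pipeline CESG actually ran. The 200-nucleotide EST overlap, the 30% PDB identity limit, the 200-residue size cutoff, the one-third fraction, and the 96-target workgroup size come from the text. The `Candidate` record, its field names, and the numeric cutoffs for disorder and low complexity are hypothetical placeholders, because the poster gives no values for them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One calculated CDS with hypothetical, precomputed annotations."""
    name: str
    length_aa: int                  # protein length in amino acids
    est_overlap_nt: int             # nucleotides of EST/CDS overlap
    max_pdb_identity: float         # best percent identity to a PDB entry
    disorder_fraction: float        # predicted disordered fraction, 0 to 1
    low_complexity_fraction: float  # low-complexity fraction, 0 to 1
    has_signal_peptide: bool
    tm_helices: int                 # predicted trans-membrane helices
    mobile_element_hit: bool        # matches retrotransposon-like patterns


def passes_est_cut(c: Candidate, min_overlap_nt: int = 200) -> bool:
    # The text requires at least 200 nt of overlap between ESTs and the CDS.
    return c.est_overlap_nt >= min_overlap_nt


def passes_structure_filters(c: Candidate,
                             max_identity: float = 30.0,
                             max_disorder: float = 0.3,
                             max_low_complexity: float = 0.2) -> bool:
    # Only the 30% identity limit is from the text; the disorder and
    # low-complexity cutoffs are assumed values for illustration.
    return (c.max_pdb_identity < max_identity
            and c.disorder_fraction <= max_disorder
            and c.low_complexity_fraction <= max_low_complexity
            and not c.has_signal_peptide
            and c.tm_helices == 0
            and not c.mobile_element_hit)


def select_size_balanced(cands: List[Candidate],
                         n_total: int = 96,
                         small_cutoff_aa: int = 200,
                         small_fraction: float = 1 / 3) -> List[Candidate]:
    # Put roughly one third of the workgroup under the size cutoff.
    small = [c for c in cands if c.length_aa < small_cutoff_aa]
    large = [c for c in cands if c.length_aa >= small_cutoff_aa]
    n_small = min(round(n_total * small_fraction), len(small))
    n_large = min(n_total - n_small, len(large))
    return small[:n_small] + large[:n_large]


def select_targets(cands: List[Candidate], n_total: int = 96) -> List[Candidate]:
    # EST cut first, then the sequence-structure filters, then size balancing.
    survivors = [c for c in cands
                 if passes_est_cut(c) and passes_structure_filters(c)]
    return select_size_balanced(survivors, n_total)
```

Candidates are taken in input order within each size class. A real pipeline would presumably rank them first, for example by EST support, but the poster does not say how ties were broken.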
Cumulative TargetDB statistics represent a snapshot of all past and current PSI
targets. Accordingly, it may be slightly unfair to compare our projected final outcomes
for the pilot workgroup against a non-equilibrium pool of general PSI targets. However,
PSI has been operating long enough that the cumulative numbers should be near
steady-state. We can predict with some confidence that the final outcome from the first
Galdieria workgroup will be eight solved structures. This is about three times more than would
be expected for generic structural genomics targets.
go.80017: PDB 2I3F, 1.38 Å, APS 22-ID
go.80004: PDB 2NYI, 1.80 Å, APS 23-ID-D
It thus appears that some eukaryotic thermophiles provide targets with a much higher
than average chance of success in the context of a structural genomics pipeline.
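The "three times" comparison can be checked with a few lines of arithmetic. This is a rough sketch. It assumes the 2.5% overall PSI success rate quoted in the abstract can be applied to all 96 targets of the workgroup. The abstract states that rate relative to cloned targets and the poster does not give the exact denominator, so the result is only approximate. The function names are illustrative.

```python
def expected_structures(n_targets: int, success_rate: float) -> float:
    """Structures expected from n_targets at a given per-target success rate."""
    return n_targets * success_rate


def fold_improvement(observed: int, n_targets: int, baseline_rate: float) -> float:
    """Ratio of observed structures to the number expected at the baseline rate."""
    return observed / expected_structures(n_targets, baseline_rate)


n_targets = 96           # size of the first Galdieria workgroup
projected_solved = 8     # projected final outcome stated in the text
psi_rate = 0.025         # overall PSI success rate from the abstract (assumed denominator)

baseline = expected_structures(n_targets, psi_rate)
ratio = fold_improvement(projected_solved, n_targets, psi_rate)
print(f"expected at PSI rate: {baseline:.1f}, projected: {projected_solved}, ratio: {ratio:.1f}x")
```

Under this assumption the generic expectation is about 2.4 structures, which is consistent with the roughly threefold advantage claimed for the Galdieria workgroup.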
During the course of research using this workgroup, it became apparent that the
tetracysteine motif in pVP-33K led to adverse outcomes in protein purification by
IMAC. pVP-56K was chosen to replace it.
The outcomes reported here are cumulative results for all three different cell-based
vectors. For this workgroup, cloned directly from cDNA libraries, it was fortunate that
there were multiple, independent cloning events. There was even more variance than
is apparent in the presentation below. Forty-one of 96 dual cloning events gave
substantially different outcomes. It would have been a mistake to not carry forward the
clones that were translationally variant from the predicted coding sequences. Two
crystal structures were solved from clones with substantial variance from the predicted
CDS sequence.
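One way to sort sequenced clones into the outcome categories used in the cloning-outcomes figure (Sequence +, Silent variant, Missense variant, Substantial sequence variance) is sketched below. It is a hypothetical classifier, not CESG's actual procedure. It assumes in-frame, equal-length comparisons against the predicted CDS and uses the standard genetic code. The `max_missense` threshold that separates a missense variant from substantial variance is an arbitrary illustrative value.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds: str) -> str:
    """Translate an in-frame DNA sequence; unknown codons become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X")
                   for i in range(0, usable, 3))


def classify_clone(predicted_cds: str, clone_cds: str,
                   max_missense: int = 3) -> str:
    """Assign a clone to an outcome category (max_missense is an assumed cutoff)."""
    predicted_cds, clone_cds = predicted_cds.upper(), clone_cds.upper()
    if clone_cds == predicted_cds:
        return "Sequence +"
    if len(clone_cds) == len(predicted_cds):
        pairs = zip(translate(predicted_cds), translate(clone_cds))
        diffs = [(a, b) for a, b in pairs if a != b]
        if not diffs:
            return "Silent variant"
        # Changes that create or remove a stop codon are not treated as missense.
        if len(diffs) <= max_missense and all("*" not in d for d in diffs):
            return "Missense variant"
    # Length changes (for example, a different intron boundary) land here.
    return "Substantial sequence variance"
```

For example, `classify_clone("ATGGCTAAA", "ATGGCCAAA")` falls in the silent class because GCT and GCC both encode alanine. Clones that differ in length from the prediction, as expected when an intron boundary was mispredicted, are reported as substantial variance rather than discarded.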
go.80055: PDB 3CAZ, 3.34 Å, APS 23-ID-D
Cloning Outcomes For Two Independent cDNA Library Cloning Events
To date, most of the effort of the Protein Structure Initiative has been directed at
determining the structures of prokaryotic proteins, and a limited number of
eukaryotic proteins from the limited number of well characterized eukaryotic
genomes. Starting material for studies on eukaryotic proteins most typically
comes from centralized cDNA clone collections, such as the NIH-sponsored
Mammalian Gene Collection (MGC).
However, it is of some interest to consider life at the extremes. For eukaryotic
thermophiles, the consequences range from the obvious benefit, that proteins from
organisms living at higher temperatures are expected to be more thermally stable,
to properties that may only be deadly bioinformatics traps. There are few
published genomes from thermophilic eukaryotes. Of those available, some have
odd genomic or biochemical properties. For example, genes that code for proteins
in Cyanidioschyzon merolae are reported to be almost entirely devoid of introns,
but it is unclear if this is actually the case.
Relative Performance
Even taking into account the special problems associated with cloning from cDNA
libraries and the imprecision in models for coding sequences, the first workgroup from
Galdieria vastly outperformed the general pool of all PSI targets and selected
mesophilic eukaryotes (Homo sapiens and Mus musculus) at several stages. Most
notably, Galdieria targets purified at a rate about 50% higher than the general NIH target
pool, and went on to solved structures at roughly twice the frequency of the general
target pool. These effects are even more striking compared to Homo sapiens and Mus
musculus.
go.80034: PDB 2I3F, 2.81 Å, APS 23-ID-D
go.80048: PDB 2O57, 1.95 Å, APS 23-ID-D
[Figure: cloning outcomes for two independent cDNA library cloning events, plotted as 0-100% of targets for pVP16 and for pVP33+pVP56. Labels: Sequence +; Silent variant; Missense variant; Substantial sequence variance; Sequencing inconclusive; Sequence -; PCR; Destination clone screen.]
Thermophilicity in Eukarya
True thermophilicity (growth above 50°C) and hyperthermophilicity (growth above
80°C) are quite common in Archaea and somewhat common in Eubacteria. In contrast,
the maximum temperature tolerated by eukaryotes extends only slightly into the
range of thermophilicity, and there are no reports of growth above 60°C for any
eukaryote (Dickey and Singer, 2004). Phylogenetically, true thermophilicity occurs
sporadically in fungi, ciliates, Rhodophyta, and perhaps annelids. Since
chemoautotrophy is rare or absent altogether in eukaryotes, thermophilic eukaryotes
are found either in
sunlit hot springs, in environments rich in pre-formed organic compounds such as
compost heaps, or in symbiotic or predatory relationships with primary producers in
hydrothermal vents. Exaggeration and irreproducibility in reported temperature
optima are quite common. This includes the oft-storied but completely irreproducible
adaptation of a ciliate, probably Tetrahymena, to live at up to 80°C in culture.
Extreme environments often have extremely steep temperature gradients, and it can
be quite difficult to assess the actual limits to growth for organisms that cannot be
cultured. Similarly, Alvinellid worms have been reported to exist at temperatures up to
80°C, but when specimens are placed in high temperature aquaria, they prefer
temperatures between 40 and 50°C, and exhibit taxis away from temperatures above
55°C (Girguis and Lee, 2006). For easily culturable organisms, where precise thermal
limits can be easily determined experimentally, the two hardiest eukaryotic
thermophiles are probably Galdieria and Thermomyces, which grow at temperatures
as high as 55°C. True thermophily in eukaryotes may be limited to unicellular
organisms.
In addition to the genomic sequence for Galdieria sulphuraria completed by Andreas
Weber and colleagues at Michigan State University, sequencing efforts have been
made on several other thermophilic eukaryotic genomes.
Organism                  Temp    Consortium
Cyanidioschyzon merolae   50°C    Japan
Galdieria sulphuraria     55°C    MSU
Alvinella pompejana       ?       Scripps/JGI
Tetrahymena thermophila   43°C+   Berkeley/JGI
Thermomyces lanuginosus   55°C    Concordia
The Structures
Crystals representing the five proteins from Galdieria solved to date by CESG diffract to
1.34 Å, 1.80 Å, 1.95 Å, 2.81 Å, and 3.34 Å. While this is somewhat better than the
average for the eukaryotic protein structures solved by our Center, this distribution is not
significantly better than the PSI average. It is at present unclear whether or not
“thermophilicity” produces more crystals or somewhat better crystals. Our perception,
which will be tested in future studies, is that these proteins perform better at all stages
from expression trials to purification, handling, and structure determination.
Future Directions
Galdieria sulphuraria is a legitimate, albeit unicellular, eukaryotic organism. It shares core
biochemical and nucleic acid metabolism, including mRNA processing, with multicellular
eukaryotes. There is detectable sequence similarity between many human and
Galdieria proteins, many of which are quite important.
Future workgroups will provide head-to-head comparisons between homologous human
and Galdieria proteins. These comparative studies will span our entire structural
genomics platform and engage both NMR and X-ray crystallography. Structures
emerging from these classical structural genomics efforts will yield either a high value
structure of a human protein, or a homologous structure in the event that the human
protein is intractable.
The first four such workgroups are underway. Samples are only now reaching the
structural biology teams, but it is already apparent that Galdieria proteins are again
outperforming those from Homo sapiens in these directed comparisons.
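Organism                  Scope/Status
Cyanidioschyzon merolae   Genomic/completed
Galdieria sulphuraria     Genomic/completed
Alvinella pompejana       EST/targeted genomic/completed
Tetrahymena thermophila   Macronucleus completed
Thermomyces lanuginosus   EST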
References
Blommel, P.G., Martin, P.A., Wrobel, R.L., Steffen, E., and Fox, B.G. (2006) High
efficiency single-step production of expression plasmids from cDNA clones
using the FlexiVector cloning platform. Prot Exp Purif 47(2):562-70.
Dickey, D.A. and Singer, G.A.C. (2004) Genomic and proteomic adaptations to
growth at high temperatures. Genome Biology 5(10):117.
Eisen et al. (2006) Macronuclear genome sequence of the ciliate Tetrahymena
thermophila, a model eukaryote. PLoS Biology 4(9):e286.
Matsuzaki et al. (2004) Genome sequence of the ultrasmall unicellular red alga
Cyanidioschyzon merolae 10D. Nature 428(6983): 653-7.
Girguis, P.R. and Lee, R.W. (2006) Thermal preference and tolerance of Alvinellids.
Science 312(5771): 231.
Thao, S., Zhao, Q., Kimball, T., Steffen, E., Blommel, P.G., Riters, M., Newman, C.S.,
Fox, B.G., and Wrobel, R.L. (2004) Results from high-throughput DNA cloning of
Arabidopsis thaliana target genes using site-specific recombination. J Struct Funct
Genomics 5(4):267-76.
Weber, A.P., Horst, R.J., Barbier, G.G., Oesterhelt, D. (2007) Metabolism and
metabolomics of eukaryotes living under extreme conditions. Int Rev Cytol
265:1-34.
Weber, A.P. et al. (2004) EST-analysis of the thermo-acidophilic red microalgae
Galdieria sulphuraria reveals potential for lipid A biosynthesis and unveils the
pathway of carbon export from rhodoplasts. Plant Mol Biol 55(1):17-32.
Acknowledgements
All members of the CESG team, especially Jason McCoy and Eduard Bitto for
crystallographic support. Zhaohui Sun assisted Craig Bingman in the initial
bioinformatics work necessary for target selection.
Andreas Weber, PI Microbial Genome Sequencing: Genome Analysis of Galdieria
sulphuraria - A Unique Unicellular Thermo-Acidophilic Photosynthetic Microorganism
National Science Foundation, Emerging Frontiers Award.
The Advanced Photon Source, SER-CAT (B.C. Wang, John Chrzas, John Gonzy),
GM/CA-CAT (Janet Smith, Ward Smith, Craig Ogata).
CESG is supported by the National Institute of General Medical Sciences through
The Protein Structure Initiative NIGMS grant number U54 GM074901.