THERMOPHILIC EUKARYOTES AS A SOURCE FOR STRUCTURAL GENOMIC TARGETS

Craig A. Bingman, Russell L. Wrobel, Frank C. Vojtik, Ronnie O. Frederick, Karl W. Nichols, Brian G. Fox, George N. Phillips, Jr., and John L. Markley

University of Wisconsin-Madison, Department of Biochemistry, 433 Babcock Drive, Madison, Wisconsin, USA 53706-1549, http://www.uwstructuralgenomics.org

Abstract

Thermophilic organisms are believed to provide a superior source of targets for structure determination. However, it is presently unclear whether homologs of medically relevant proteins are reliably found in the thermophilic eubacterial and archaeal genomes that are most frequently targeted in structural genomics efforts. For our PSI-2 efforts, CESG has been interested in identifying novel sources of eukaryotic protein folds that may also provide unique insight into the structures of difficult-to-obtain, medically relevant human proteins. As part of these efforts, CESG has undertaken a pilot study of Galdieria sulphuraria, a unicellular eukaryotic red alga (phylum Rhodophyta) that grows well at pH values as low as 1.0 and at temperatures up to 55°C. A complete genomic sequence and recently predicted coding sequences provided the starting point for work with this organism. These results were combined with detailed analysis of expressed sequence tag (EST) data to increase the probability that a selected gene was also produced as a protein in the alga. A cloning frequency of ~70%, based on design of 5’ and 3’ primers, was largely governed by inaccuracies in the gene model, as has been observed by CESG during work with Arabidopsis and human stem cell proteins.
The presented results show that the initial set of successfully cloned targets proceeded through our research platform to a deposited structure with higher than usual stage-to-stage efficiencies in protein expression, purification, and intermediate steps of structure determination, with 6% of cloned targets yielding deposited PDB structures. This compares favorably to a corresponding overall success rate of 2.5% across all PSI centers, and particularly to the PSI success rate on eukaryotic targets. Based on results from our pilot study, a set of biomedically relevant human proteins has been selected for structure determination in parallel with homologs from Galdieria. This work will simultaneously advance understanding of the structures of biomedically relevant proteins and give further insight into the properties of homologous proteins from thermophilic eukaryotes as surrogates for studies of human protein structure.

Cloning

The initial workgroup of 96 targets was independently cloned into both the Gateway and FlexiVector systems. Two Flexi expression vectors were used, pVP-33K and pVP-56K. Both bear a His8 tag for IMAC purification, a maltose binding protein fusion to improve the solubility of recombinant proteins, and a tobacco etch virus (TEV) protease cleavage site before the N-terminus of the recombinant protein. They differ in that pVP-33K carries a tetracysteine motif for universal protein quantification via FlAsH labeling, which pVP-56K lacks. Both vectors replace the amino-terminal methionine of the target protein with alanine-isoleucine-alanine.

pVP-16K was the baseline protein production vector for CESG during PSI Phase 1. It is a Gateway system vector, also with a His8 tag, maltose binding protein, and a TEV cleavage site. After TEV cleavage, it replaces the amino-terminal methionine of the target protein with a serine.
The use of multiple expression vectors to study the same set of targets is part of CESG’s strategy to perform systematic evaluation of new expression vector designs under conditions that are fully representative of the challenges presented by the complete protein production pipeline, i.e., from target selection to structure deposition or work stopped.

Overall Risk Assessment

Working from cDNA libraries carries its own dangers. CESG’s work with Arabidopsis showed that even with the best high-redundancy genomic sequences, calculated coding sequences often disagreed with experimental results. Accurate intron boundary predictions are very difficult. We found that one can expect to recover perhaps 70-80% of calculated coding sequences from cDNA pools in a single experimental attempt. These losses are not due only to bioinformatics errors. It is difficult to recover low-abundance genes by PCR from cDNA libraries isolated from a few growth conditions, because not all messenger RNAs are present in the materials used to produce the libraries. In light of these risks and difficulties, we have begun to evaluate the properties of a thermophilic genome to supplement ongoing studies of human proteins. This poster describes current progress and success in these efforts.

Target Selection

The initial workgroup of 96 targets was selected from the following bioinformatics data. Based on a preliminary build of the Galdieria sulphuraria genome, Weber and colleagues supplied 5872 calculated coding sequences (CDS). These assignments were supported by 2789 EST reads on 5’- and 3’-primed cDNA sequences. Most Galdieria genes contained introns, in contrast to the reported lack of introns in the closely related Cyanidioschyzon merolae. Because of potential inaccuracies in calculated intron boundaries, we chose to demand at least 200 nucleotides of overlap between the EST sequences and coding sequences before moving them forward in the target selection data pipeline. 1391 CDS survived the EST cut.
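As an illustration of the EST cut described above, the following is a minimal sketch of an overlap test. It assumes that each coding sequence and each EST alignment can be expressed as half-open intervals on a shared coordinate system; the poster does not state how the overlap was actually measured, so the interval representation and the function names (merge_intervals, overlap_nt, passes_est_cut) are hypothetical. Only the 200-nucleotide threshold comes from the text.

```python
from typing import List, Tuple

# Hypothetical representation: half-open [start, end) intervals on a
# shared coordinate system (an assumption, not CESG's documented format).
Interval = Tuple[int, int]

MIN_OVERLAP_NT = 200  # threshold stated in the poster text


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_nt(cds_exons: List[Interval], est_hits: List[Interval]) -> int:
    """Count CDS nucleotides covered by at least one EST alignment."""
    exons = merge_intervals(cds_exons)
    ests = merge_intervals(est_hits)
    total = 0
    # Both lists are disjoint after merging, so pairwise sums do not double count.
    for ex_start, ex_end in exons:
        for est_start, est_end in ests:
            total += max(0, min(ex_end, est_end) - max(ex_start, est_start))
    return total


def passes_est_cut(cds_exons: List[Interval], est_hits: List[Interval],
                   min_overlap: int = MIN_OVERLAP_NT) -> bool:
    """Return True when EST support meets the minimum overlap."""
    return overlap_nt(cds_exons, est_hits) >= min_overlap


if __name__ == "__main__":
    exons = [(100, 250), (400, 600)]           # made-up two-exon CDS
    weak_support = [(200, 450), (430, 500)]    # made-up EST alignments
    strong_support = [(0, 1000)]
    print(passes_est_cut(exons, weak_support))
    print(passes_est_cut(exons, strong_support))
```

Merging the intervals first is a design choice that keeps redundant EST reads over the same region from inflating the apparent support.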
It is worth noting that by demanding EST support from a relatively small set of EST sequences, we preferentially selected for more highly expressed genes. Of the 1391 CDS that survived the EST cut, 407 met our bioinformatic criteria for valid sequence-structure targets. These criteria include less than 30% identity to a solved 3D structure deposited in the PDB, relative lack of disorder, no signal peptide or predicted transmembrane helices, absence of substantial low-complexity regions, and no match to known sequence patterns associated with retrotransposable elements and other parasitic DNA sequences. From this set of 407 sequences, a size-balanced set of 96 targets was selected so that approximately one third of the selected CDS were under 200 amino acids. These sequences were then injected into our standard structural genomics platform for processing via both our cell-free and cell-based protein production platforms, to support structure determination efforts by both NMR and X-ray crystallography.

Relative Performance

Cumulative TargetDB statistics represent a snapshot of all past and current PSI targets. Accordingly, it may be slightly unfair to compare our projected final outcomes for the pilot workgroup against a non-equilibrium pool of general PSI targets. However, PSI has been operating long enough that the cumulative numbers should be near steady state. We can predict with some confidence that the final outcome from the first Galdieria workgroup will be eight solved structures. This is three times more than would be expected for generic structural genomics targets. It thus appears that some eukaryotic thermophiles provide targets with a much higher than average chance of success in the context of a structural genomics pipeline.

Structure panels (target, PDB ID, resolution, beamline): go.80017, 2I3F, 1.38 Å, APS 22-ID; go.80004, 2NYI, 1.80 Å, APS 23-ID-D.

During the course of research using this workgroup, it became apparent that the tetracysteine motif in pVP-33K led to adverse outcomes in protein purification by IMAC. pVP-56K was chosen to replace it.
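The selection funnel and the projected outcome quoted in this poster can be checked with a few lines of arithmetic. This is a sketch rather than CESG's analysis: the counts and the 2.5% PSI rate are taken from the text, but applying that rate (which the abstract reports per cloned target) to all 96 selected targets is an assumption made here for illustration, and the function names are hypothetical.

```python
from typing import List, Tuple

# Counts quoted in the poster text.
SELECTION_FUNNEL: List[Tuple[str, int]] = [
    ("calculated coding sequences", 5872),
    ("CDS with >= 200 nt of EST overlap", 1391),
    ("CDS meeting bioinformatic criteria", 407),
    ("size-balanced workgroup", 96),
]
WORKGROUP_SIZE = 96          # targets in the pilot workgroup
PROJECTED_STRUCTURES = 8     # projected solved structures
PSI_OVERALL_RATE = 0.025     # 2.5% overall PSI success rate


def stage_retention(funnel: List[Tuple[str, int]]) -> List[Tuple[str, float]]:
    """Fraction retained at each step relative to the previous step."""
    return [
        (label, count / funnel[i][1])
        for i, (label, count) in enumerate(funnel[1:])
    ]


def expected_structures(n_targets: int, rate: float) -> float:
    """Structures expected if targets succeeded at the baseline rate."""
    return n_targets * rate


def fold_improvement(observed: int, n_targets: int, baseline_rate: float) -> float:
    """Ratio of observed (or projected) structures to the baseline expectation."""
    return observed / expected_structures(n_targets, baseline_rate)


if __name__ == "__main__":
    for label, fraction in stage_retention(SELECTION_FUNNEL):
        print(f"{label}: {fraction:.1%} retained")
    baseline = expected_structures(WORKGROUP_SIZE, PSI_OVERALL_RATE)
    fold = fold_improvement(PROJECTED_STRUCTURES, WORKGROUP_SIZE, PSI_OVERALL_RATE)
    print(f"baseline expectation: {baseline:.1f} structures")
    print(f"fold improvement: {fold:.1f}x")
```

Under that assumption the baseline expectation is about 2.4 structures, so eight projected structures is a little over threefold, consistent with the "three times more" stated in the text.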
The outcomes reported here are cumulative results for all three cell-based vectors. For this workgroup, cloned directly from cDNA libraries, it was fortunate that there were multiple, independent cloning events. There was even more variance than is apparent in the presentation below. Forty-one of 96 dual cloning events gave substantially different outcomes. It would have been a mistake not to carry forward the clones that were translationally variant from the predicted coding sequences: two crystal structures were solved from clones with substantial variance from the predicted CDS.

Figure: Cloning Outcomes for Two Independent cDNA Library Cloning Events. Bars: pVP16 and pVP33+pVP56. Axis: 0% to 100%. Legend: Sequence +, Silent variant, Missense variant, PCR, Substantial sequence variance, Destination clone screen, Sequencing inconclusive, Sequence -.

Structure panels (target, PDB ID, resolution, beamline): go.80055, 3CAZ, 3.34 Å, APS 23-ID-D; go.80034, 2I3F, 2.81 Å, APS 23-ID-D; go.80048, 2O57, 1.95 Å, APS 23-ID-D.

To date, most of the effort of the Protein Structure Initiative has been directed at determining the structures of prokaryotic proteins and of a limited number of eukaryotic proteins from the few well-characterized eukaryotic genomes. Starting material for studies on eukaryotic proteins most typically comes from centralized cDNA clone collections, such as the NIH-sponsored Mammalian Gene Collection (MGC). However, it is of some interest to consider life at the extremes. For eukaryotic thermophiles, the possible outcomes range from the obvious benefit, that proteins from organisms living at higher temperatures are expected to be more thermally stable, to the risk that some genomes prove to be little more than deadly bioinformatics traps. There are few published genomes from thermophilic eukaryotes. Of those available, some have odd genomic or biochemical properties. For example, genes that code for proteins in Cyanidioschyzon merolae are reported to be almost entirely devoid of introns, but it is unclear if this is actually the case.

Even taking into account the special problems associated with cloning from cDNA libraries and the imprecision in models for coding sequences, the first workgroup from Galdieria vastly outperformed the general pool of all PSI targets and selected mesophilic eukaryotes (Homo sapiens and Mus musculus) at several stages. Most notably, Galdieria targets purified at about a 50% higher rate than the general NIH target pool, and went on to solved structures at roughly twice the frequency of the general target pool. These effects are even more striking compared to Homo sapiens and Mus musculus.

Thermophilicity in Eukarya

True thermophilicity (growth above 50°C) and hyperthermophilicity (growth above 80°C) are quite common in Archaea and somewhat common in Eubacteria. In contrast, the maximum temperature tolerated by eukaryotes extends only slightly into the range of thermophilicity, and there are no reports of growth above 60°C for any eukaryote (Dickey and Singer, 2004). Phylogenetically, true thermophilicity occurs sporadically in fungi, ciliates, Rhodophyta, and perhaps annelids. Since chemoautotrophy is rare or absent altogether in eukaryotes, thermophilic eukaryotes are found either in sunlit hot springs, in environments rich in pre-formed organic compounds such as compost heaps, or in symbiotic or predatory relationships with primary producers in hydrothermal vents. Exaggeration and irreproducibility in reported temperature optima are quite common. This includes the oft-told but completely irreproducible adaptation of a ciliate, probably Tetrahymena, to live at up to 80°C in culture.
Extreme environments often have extremely steep temperature gradients, and it can be quite difficult to assess the actual limits to growth for organisms that cannot be cultured. Similarly, Alvinellid worms have been reported to exist at temperatures up to 80°C, but when specimens are placed in high-temperature aquaria, they prefer temperatures between 40 and 50°C and exhibit taxis away from temperatures above 55°C (Girguis and Lee, 2006). For easily culturable organisms, where precise thermal limits can be determined experimentally, the two hardiest eukaryotic thermophiles are probably Galdieria and Thermomyces, which grow at temperatures as high as 55°C. True thermophily in eukaryotes may be limited to unicellular organisms.

In addition to the genomic sequence for Galdieria sulphuraria completed by Andreas Weber and colleagues at Michigan State University, sequencing efforts have been made on several other thermophilic eukaryotic genomes.

Organism | Temp | Consortium | Scope/Status
Cyanidioschyzon merolae | 50°C | Japan | Genomic/completed
Galdieria sulphuraria | 55°C | MSU | Genomic/completed
Alvinella pompejana | ? | Scripps/JGI | EST/targeted genomic/completed
Tetrahymena thermophila | 43°C+ | Berkeley/JGI | Macronucleus completed
Thermomyces lanuginosus | 55°C | Concordia | EST

The Structures

Crystals representing the five proteins from Galdieria solved to date by CESG diffract to 1.34 Å, 1.80 Å, 1.95 Å, 2.81 Å, and 3.34 Å. While this is somewhat better than the average for the eukaryotic protein structures solved by our Center, this distribution is not significantly better than the PSI average. It is at present unclear whether “thermophilicity” produces more crystals or somewhat better crystals. Our perception, which will be tested in future studies, is that these proteins perform better at all stages from expression trials to purification, handling, and structure determination.

Future Directions

Galdieria sulphuraria is a legitimate, albeit unicellular, eukaryotic organism. It shares core biochemical and nucleic acid metabolism, including mRNA processing, with multicellular eukaryotes. There is detectable sequence similarity between many human and Galdieria proteins, many of which are quite important. Future workgroups will provide head-to-head comparisons between homologous human and Galdieria proteins. These comparative studies will span our entire structural genomics platform and engage both NMR and X-ray crystallography. These classical structural genomics efforts will yield either a high-value structure of a human protein or, in the event that the human protein is intractable, a structure of its homolog. The first four such workgroups are underway. Samples are only now reaching the structural biology teams, but it is already apparent that Galdieria proteins are again outperforming those from Homo sapiens in these directed comparisons.

References

Blommel, P.G., Martin, P.A., Wrobel, R.L., Steffen, E., and Fox, B.G. (2006) High efficiency single-step production of expression plasmids from cDNA clones using the FlexiVector cloning platform. Prot Exp Purif 47(2):562-70.

Dickey, D.A. and Singer, G.A.C. (2004) Genomic and proteomic adaptations to growth at high temperatures. Genome Biology 5(10):117.

Eisen et al. (2006) Macronuclear genome sequence of the ciliate Tetrahymena thermophila, a model eukaryote. PLoS Biology 4(9):e286.

Girguis, P.R. and Lee, R.W. (2006) Thermal preference and tolerance of Alvinellids. Science 312(5771):231.

Matsuzaki et al. (2004) Genome sequence of the ultrasmall unicellular red alga Cyanidioschyzon merolae 10D. Nature 428(6983):653-7.

Thao, S., Zhao, Q., Kimball, T., Steffen, E., Blommel, P.G., Riters, M., Newman, C.S., Fox, B.G., and Wrobel, R.L. (2004) Results from high-throughput DNA cloning of Arabidopsis thaliana target genes using site-specific recombination. J Struct Funct Genomics 5(4):267-76.

Weber, A.P., Horst, R.J., Barbier, G.G., Oesterhelt, D. (2007) Metabolism and metabolomics of eukaryotes living under extreme conditions. Int Rev Cytol 265:1-34.

Weber, A.P. et al. (2004) EST-analysis of the thermo-acidophilic red microalgae Galdieria sulphuraria reveals potential for lipid A biosynthesis and unveils the pathway of carbon export from rhodoplasts. Plant Mol Biol 55(1):17-32.

Acknowledgements

All members of the CESG team, especially Jason McCoy and Eduard Bitto for crystallographic support. Zhaohui Sun assisted Craig Bingman in the initial bioinformatics work necessary for target selection. Andreas Weber, PI, “Microbial Genome Sequencing: Genome Analysis of Galdieria sulphuraria - A Unique Unicellular Thermo-Acidophilic Photosynthetic Microorganism,” National Science Foundation, Emerging Frontiers Award. The Advanced Photon Source, SER-CAT (B.C. Wang, John Chrzas, John Gonzy), GM/CA-CAT (Janet Smith, Ward Smith, Craig Ogata). CESG is supported by the National Institute of General Medical Sciences through the Protein Structure Initiative, NIGMS grant number U54 GM074901.