The promise of functional genomics: completing the encyclopedia of a cell

Timothy R Hughes1, Mark D Robinson1, Nicholas Mitsakakis1 and Mark Johnston2

Genome sequencing provides complete parts lists of organisms. This presents the obvious challenge of determining how each gene contributes to the life of the organism. This task seems increasingly feasible; however, progress to date suggests that increased interaction between systematic efforts and individual investigators will be critical to completing the encyclopedia of the yeast cell.

Addresses
1 Banting and Best Department of Medical Research, University of Toronto, 112 College St., Room 307, Toronto, ON, M5G 1L6, Canada
2 Department of Genetics, Campus Box 8232, Washington University Medical School, 4566 Scott Ave., St. Louis, MO 63110, USA
e-mail: [email protected]

Current Opinion in Microbiology 2004, 7:546–554
This review comes from a themed issue on Genomics
Edited by Charles Boone and Philippe Glaser
Available online 11th September 2004
1369-5274/$ – see front matter © 2004 Elsevier Ltd. All rights reserved.
DOI 10.1016/j.mib.2004.08.015

Abbreviations
FLAG, FLG: flag-affinity tag
GO: gene ontology
REG: gene co-regulations
SGA: synthetic genetic array
SGD: Saccharomyces genome database
TAP: tandem affinity purification tag
Y2H: yeast two-hybrid
YPD: yeast proteome database

Introduction
Erwin Schroedinger ushered in the previous era of biological research by posing the simple (but profound) question: ‘What is Life?’ [1]. The answer came much more quickly than the renowned physicist imagined it would (‘Indeed, I do not expect that any detailed information on this question is likely to come from physics in the near future’): it is the result of chemical reactions and molecular interactions (both micro and macro).
These general principles of the physics and chemistry of life, revealed over the past half century, give us a basic (in many cases highly detailed) understanding of the fundamental processes that define living things: how they conquer entropy, how like begets like, and how cellular components self-assemble [2–4].

A few years ago, we entered what we believe to be a new (perhaps penultimate) era of biological research, in which we expect to be able to approach a complete understanding of the molecular mechanisms of life. To achieve this lofty goal, we need to first identify all the components of a cell. The genome sequencing projects are doing that, because they have generated substantially complete parts lists of several organisms; most of the protein coding genes are (more or less) apparent in the genome sequences, and the relatively few hidden ones will undoubtedly be uncovered in due time. Continued application of this program to many more genomes [5] promises to reveal the functional non-protein coding sequences, such as gene regulatory sequences, sequences governing chromosome replication and segregation, regulatory RNAs, and others, by their evolutionary conservation [6–9]. One does not need to be an optimist to expect that we will soon have nearly complete parts lists for several organisms.

With the parts lists in hand, we can begin to tackle the next goal of molecular and cellular biology: determination of the function of all gene products of an organism. Such a goal might have seemed absurd a few years ago, but today it seems reachable to us because the parts lists of organisms have turned out to be surprisingly short: only about 6000 genes are necessary to construct and operate a simple eukaryotic cell [10,11], and only about two to three times that number of genes are required to produce relatively complicated multicellular organisms [12].
Remarkably, only about five times that many genes seem necessary to compose a human [13,14]. Suddenly, the scale of the task does not seem overwhelming. We can imagine resolving the functions and interactions of all of the genes and proteins in a few well-studied organisms within our own lifetimes.

In this review, we take stock of current progress toward this goal with the yeast Saccharomyces cerevisiae, a bellwether of molecular biology and functional genomics. We ponder the prospect of ‘solving’ the yeast cell by examining what is in databases, what is in the literature, what is in large-scale datasets, and whether the synthesis of existing information contributes to a real understanding of the functions of the many new genes discovered by genome sequencing.

Yeast: the proving ground
It is fitting that the first organism that will be ‘solved’ in this way is the first one to be domesticated by humans [15,16]: bakers’ and brewers’ yeast (Saccharomyces cerevisiae), because it has the fewest genes among the eukaryotic ‘model organisms’ and the experimental toolbox that is available is highly sophisticated. Discovered and named by Meyen [17] and made famous by Pasteur [18], yeast became the workhorse for eukaryotic molecular and cellular biology, revealing over the past 30 years much of what we know about how cells work [19].

How close are we to completing the encyclopedia of the yeast cell? When will this signal accomplishment be realized? How will it be done? And, once we have reached this goal, what will we have learned? We will attempt to address some of these questions in the following discourse.

What do we think we know?
First, how close are we to our goal of identifying the function of all yeast genes? A superficial analysis suggests we are closer than one might have imagined.
Towards the end of 2003, about 80% of yeast genes annotated in the Yeast Proteome Database [20] (YPD; http://proteome.incyte.com/) were listed as ‘known’. The remarkably linear rate of progress in understanding gene function, apparent in Figure 1, allows us to predict with a high degree of accuracy when all 6000 genes will be ‘known’: we expect to celebrate this remarkable accomplishment only three years from now, on or around April 1, 2007!

However, even a cursory look into YPD to see what is known about some of these genes will soon cast a pall over our celebration, because it will quickly become apparent that little is known about most proteins. For example, we will see that DFG10 encodes a protein involved in ‘pseudohyphal growth’ and ‘regulation of cell shape’, but this is on the basis of only two observations. First, ‘mutant diploids are defective in cell polarity and cell elongation, but still invade the agar upon nitrogen starvation’; and second, the ‘mutant diploid is partially suppressed by the ras2-val19 mutation’. These observations come from only one study [21], despite the fact that 15 references to papers with information concerning DFG10 are listed in YPD. We are heartened when we notice that Dfg10 is similar to proteins in other organisms, but our optimism is quickly quenched when we realize that nothing is known about these proteins. YPD summarizes the results of a handful of systematic analyses that yielded data for DFG10, but these results reveal little about Dfg10 function; for example, ‘one of 177 genes co-repressed by the addition of 960 mg per liter of diammonium phosphate to stationary cultures grown in Riesling grape juice’. Clearly, we are unsatisfied with our understanding of this ‘known’ protein, and the number of similarly characterized proteins is uncomfortably large. Even proteins with extensive lists of information in YPD, such as Std1, are poorly understood.
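The arithmetic behind a straight-line extrapolation of this kind is simple. The following is a minimal sketch, not the fit used for Figure 1: it assumes a constant rate and uses only the two values printed on that figure (4679 genes ‘known’ on Oct. 14, 2003 and 6000 on April 1, 2007). The function names are hypothetical, and the eleven underlying data points are not reproduced here.

```python
from datetime import date, timedelta

# The two endpoints labeled on Figure 1 (the second one is the extrapolated point).
KNOWN_COUNT, KNOWN_DATE = 4679, date(2003, 10, 14)
TARGET_COUNT, TARGET_DATE = 6000, date(2007, 4, 1)


def implied_rate_per_day(count_a, date_a, count_b, date_b):
    """Average number of genes becoming 'known' per day between two points."""
    return (count_b - count_a) / (date_b - date_a).days


def projected_date(count_now, date_now, target_count, rate_per_day):
    """Date on which target_count is reached if the rate stays constant (assumption)."""
    days_needed = round((target_count - count_now) / rate_per_day)
    return date_now + timedelta(days=days_needed)


days_remaining = (TARGET_DATE - KNOWN_DATE).days
rate = implied_rate_per_day(KNOWN_COUNT, KNOWN_DATE, TARGET_COUNT, TARGET_DATE)
genes_per_year = rate * 365.25
completion = projected_date(KNOWN_COUNT, KNOWN_DATE, TARGET_COUNT, rate)
print(days_remaining, round(genes_per_year), completion)
```

Under this constant-rate assumption the two endpoints are 1265 days apart, which corresponds to roughly 380 genes becoming ‘known’ per year.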
It is clear that we will have to get back to the laboratory bench on April 2, 2007 if we are to have an encyclopedia of the organism that is worth reading.

Figure 1. Number of genes with ‘known’ functions according to the Yeast Proteome Database (www.proteome.com, [20]). The first eleven points were compiled on the individual dates shown. The last point is extrapolated. [Plot of number of genes ‘known’ against time; labeled points: 4679 proteins ‘known’, Oct. 14, 2003; 6000 proteins ‘known’, April 1, 2007.]

A different analysis of the situation similarly reveals that we are not quite as close to completing the encyclopedia of the yeast proteome as our initial analysis suggested. Of the 5818 genes annotated with gene ontology (GO) [22] terms in the Saccharomyces Genome Database (SGD), 40% (2317) are of unknown molecular function, and for 30% (1720) the biological process in which they are involved is unknown. In addition, many of the proteins whose biological process and molecular function are ‘known’ are poorly understood. For example, SGD lists many facts about MET18, which was discovered in 1975 on the basis of its requirement for methionine biosynthesis [23] and again in 1979 because it is required for resistance to methylmethane sulfonate [24]. Met18 has been associated with ‘RNA polymerase II transcription factor activity’, and has been localized to the cytoplasm. These are all intriguing observations, but it has been difficult to reconcile them and synthesize a model for Met18 function.
Although there are clearly many examples in which the function of a gene or protein is obvious from the data (for example, functional genomics and proteomics have identified many new RNA processing and ribosome biogenesis factors for which all of the available information is consistent), it is also clear that much more work needs to be done before we can claim to have ‘solved’ the proteome.

Thus, we are somewhat schizophrenic about the prospects of approaching a complete understanding of the yeast cell proteome. In some respects, much is known about a substantial fraction of the proteome, and our rate of progress in learning about the rest has been steady. However, little light has so far been shed on the function of a significant number of proteins, and there is a long way to go before we will be comfortable with our understanding of many of the ‘characterized’ proteins.

Our mood tends toward the optimistic, because of the resources and experimental efficiencies engendered by the genome projects. The complete yeast deletion collection [25] and efforts to create libraries of conditional alleles of essential genes [26,27] have greatly facilitated systematic genetic analysis. Nearly complete collections of green fluorescent protein (GFP), tandem affinity purification tag (TAP) and transcriptional activation domain-tagged genes [28,29,30] enable systematic examination of protein localization and protein complexes. Use of DNA microarrays to assay gene expression genome-wide was pioneered with yeast [31,32]. These and other genome-scale technologies have greatly enhanced the experimental toolbox [33–36], and made the power of yeast genetics even more awesome than it was.
Figure 2. Correspondence between the appearance of uncharacterized genes in functional genomics data sets, and whether or not those genes are subsequently characterized. (a) Comparison of different data sets and data types (SGA, TAP, FLAG, 2-hybrid, Microarray, All). (b) Impact of appearance in multiple data sets (0 to 4 data sets). [Plots; left scale: percentage of genes subsequently annotated; right scale: number of uncharacterized genes in indicated data sets.]

Functional genomics: is it worth the trouble?
The ability to analyze 6000 genes in one experiment promises to speed completion of the encyclopedia of the yeast cell. But realistically, how might these data inform us about the function of the cell? The impact of the availability of the genome sequence and the resources it spawned on ‘directed’ (i.e. hypothesis-driven) research can be estimated by asking if the appearance of information on uncharacterized genes from large-scale studies was accompanied by more complete characterization of these genes. We previously noted a clear increase in the rate of characterization of new genes following determination of the DNA sequence of the first yeast chromosome in 1993, a trend that continued through the 1990s [36]. Thus, the availability of genome sequence appears to have had a positive impact on the discovery of new gene functions.
Figure 2a illustrates whether any of the 2,248 genes that were labeled as ‘biological process unknown’ by gene ontology in 2002 appeared in various large-scale datasets reported between 2000 and 2002 (2-hybrid [30,37], synthetic gene interactions [38,39], direct identification of protein complexes [40,41], microarray expression profiling [42]), and whether any of these genes subsequently acquired another biological process annotation (the biological process annotations are primarily assigned on the basis of phenotypic data from individual studies, [43]). In this analysis, genetic interaction and protein complex data correlate with more complete annotation, whereas two-hybrid analysis and DNA microarray co-regulation data do not. Although this correlation does not demonstrate cause-and-effect, it increases our confidence that data from some types of whole genome scale systematic analysis will propel us toward our goal of ‘solving’ the proteome.

Figure 3. Functional categories represented among genes from five different functional genomics and proteomics data sets. (a) Numbers of genes in each data set in each GO category, for the different data sets and the entire GO database, and also their distribution in the GRID database. The color shows the number of genes in each of the categories, according to the scale on the right (i.e. categories with no genes in a data set are white and those with 10 or more genes are dark red). Numbers in parentheses indicate the number of subcategories in GO. For example, ‘organic acid metabolism (3395-3634) 68’ indicates that the general category ‘organic acid metabolism’ encompasses subcategory numbers 3395–3634 (i.e. 240 related subcategories) and that there are 68 known genes in one or more of these categories. The fact that this area of the graph is completely white in the SGA data indicates that the data set does not contain any genes involved in ‘organic acid metabolism’. (b) Number of Medline abstracts containing the name of each yeast gene. The genes are sorted according to the number of abstracts in which they appear. Below are indicated whether each of the genes is annotated in GO, and whether it is used as a ‘bait’ in any of the five data sets. For the REG data set (gene co-regulations), any gene for which there are other genes that correlated at r = 0.5 or better is considered as a ‘bait’, as functional associations can be drawn from any such correlation.
[Panel (a) labels as extracted. Total number of annotated genes: Entire GO Data 3681; TAP 393; SGA 121; FLG 394; Y2H 1147; REG 2749; GRID 1177. GO-BP categories marked: Alcohol metabolism (596–681) 85; Amine metabolism (691–970) 144; Coenzymes and prosthetic group metabolism (1932–2024) 76; Ion transport (4137–4201) 96; Amino acid and derivative metabolism (971–1184) 147; Organic acid metabolism (3395–3634) 68; Nucleotide metabolism (2564–3136) 25. Panel (b) axes: number of Medline abstracts against genes, with tracks for Entire GO Data, TAP, SGA, FLG, Y2H and REG.]

In most of the large-scale data sets, genes of known function are highly over-represented (Figure 3b). For example, in the dataset of Gavin et al. [41], 8.61% of the proteins (118 of 1371) are uncharacterized for GO ‘biological process’, significantly less than the 29.53% of all proteins that are uncharacterized. Again, this may be owing to experimental design (i.e. choice of genes used to query for interactions), but it is also possible that these methods are more effective on the types of genes that have already been studied.

It bears mentioning that the data on genetic interaction and protein complexes, which are the most difficult to generate, are biased towards specific functional categories.
For example, TAP and FLAG-tag protein complex data as well as the synthetic genetic array (SGA) data are highly biased against functional categories related to small molecule metabolism or transport (Figure 3a). Although this may reflect a fundamental limitation of these experiments, recent SGA data are in fact intentionally biased towards proteins involved in the cytoskeleton, the cell wall, and DNA replication [38]. These kinds of datasets would be more helpful for ‘solving’ the yeast proteome if they contained more uncharacterized genes.

Figure 4. Types of evidence provided for the initial functional assignment of yeast genes in the figures of 20 papers. Figures often contained more than one evidence type, and often more than one figure contained the same evidence type. Axes were arranged by hierarchical clustering. [Heat map; scale runs from ‘No figures of this type’ to ‘5 figures of this type’.]
Evidence types: Biochemical purification; Specialized in vitro biochemical assay; Co-sedimentation; Effect on reporter gene expression; Cross-complementation; Sequence analysis; Co-localization; Directed genetic/biochemical experiment; Supporting evidence for new reagent/assay; Protein–protein interactions; General phenotypic/physiological assay; Localization; Specialized in vivo physiological assay; Specialized in vivo biochemical assay.
Papers: AUT8: Barth et al. (2001) Gene 274:151; CYK3: Korinek et al. (2000) Curr Biol 10:947; DOC1: Hwang et al. (1997) Mol Biol Cell 8:1877; MCD4: Packeiser et al. (1999) Yeast 15:1485; THP1: Gallardo et al. (2001) Genetics 157:79; RGP1: Panek et al. (2000) J Cell Sci 113:4545; XTC1: Emili et al. (1998) Proc Natl Acad Sci U S A 95:11122; MSS11: Webber et al. (1997) Curr Genet 32:260; PPT2: Stuible et al. (1998) J Biol Chem 273:22334; MBA1: Rep et al. (1996) FEBS Lett 388:185; EPS1: Wang et al. (1999) EMBO J 18:5972; ARK1, PRK1: Cope et al. (1999) J Cell Biol 144:1203; RGS2: Versele et al. (1999) EMBO J 18:5577; LUC7: Fortes et al. (1999) Genes Dev 13:2425; APG10: Shintani et al. (1999) EMBO J 18:5234; SAD1: Lygerou et al. (1999) Mol Cell Biol 19:2008; VPS52, VPS53, VPS54: Conibear et al. (2000) Mol Biol Cell 11:305; SNU17: Gottschalk et al. (2001) Mol Cell Biol 21:3037; BMS1, TSR1: Gelperin et al. (2001) RNA 7:1268; MMF1: Oxelmark et al. (2000) Mol Cell Biol 20:7784.

Figure 5. Evaluation of the use of multiple data types to predict gene functions. Changes in precision and recall ([a] and [b], respectively) as a consequence of adding a data type and taking only the intersection of predictions made by all of the data types analyzed, taken as the average of all data combinations (e.g. the first set of bars show the average effect on precision of going from one to two data types; the first bar in each set of five indicates the average effect of the addition of TAP data to any other data set or combination of data sets). Precision equals the number of true predictions divided by the total number of predictions; recall equals the number of true predictions divided by the number of actual gene annotations in the intersection of data types. For example, progressing from one to two data sets, if the second data set added is TAP data, then the precision of predictions increases by almost 0.3 (i.e. the predictions become almost 30% more ‘correct’, purple bar at left in (a)); however, this comes at the price of eliminating almost 10% of the predictions that were correct with just one data set (purple bar at lower left in (b)), even though these genes are in the TAP data. All data are available at http://hugheslab.med.utoronto.ca/Mitsakakis/. [Bar charts of Δprecision (a) and Δrecall (b) for TAP, SGA, FLG, Y2H and REG at the steps 1->2, 2->3, 3->4 and 4->5.]

Are the results from the systematic studies sufficient on their own to define gene function? Perusal of 20 reports published between 1996 and 2001 that describe analysis of hitherto uncharacterized yeast genes reveals that agreement among at least three different experimental tests is generally required to characterize gene function, at least in the traditional sense of being able to satisfy peer reviewers and editors (Figure 4). Several studies have also shown that gene functions can be more accurately identified on the basis of agreement among multiple types of functional genomics and proteomics data. That is, ‘guilt-by-association’ predictions of function based on both 2-hybrid and co-expression are more likely to be correct than those based on only one type of result [44–46]. As Figure 5 shows, this is also true of the five datasets we analyzed above. When we attempted to re-predict the GO ‘biological process’ annotations of the genes in the datasets, precision improved (i.e. the predictions were more often correct), but recall (i.e. the number of correct predictions that are made) decreased almost as dramatically! In addition, the recall statistic only considers the genes that are present in three or more datasets; a substantial number of genes do not meet this criterion in the first place. Genes that are present in multiple datasets were indeed subsequently characterized with a higher frequency (Figure 2b), again consistent with previous analyses, but only about 10% (228/2248) were represented in three or more datasets. This experience raises the obvious question: how often do three or more sets of results from large-scale analysis agree on the same function for an uncharacterized gene?
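To make this question concrete, the bookkeeping can be sketched in a few lines of Python. This is an illustrative sketch, not the analysis code behind Figure 5: the gene identifiers (g1 to g4), the predicted categories, the ‘gold standard’ annotations and the function names consensus_predictions and precision_recall are all hypothetical, and each gene is assumed to carry a single annotation. Precision and recall follow the definitions given in the Figure 5 legend.

```python
from collections import Counter


def consensus_predictions(predictions_by_dataset, min_agree=3):
    """Keep a gene's predicted function only if >= min_agree data sets agree on it."""
    votes = {}
    for predictions in predictions_by_dataset.values():
        for gene, function in predictions.items():
            votes.setdefault(gene, Counter())[function] += 1
    consensus = {}
    for gene, counts in votes.items():
        function, n_agree = counts.most_common(1)[0]
        if n_agree >= min_agree:
            consensus[gene] = function
    return consensus


def precision_recall(predicted, annotated, eligible_genes):
    """Precision = true predictions / all predictions.
    Recall = true predictions / annotated genes among eligible_genes
    (e.g. the genes present in every data set being combined)."""
    true_predictions = sum(
        1 for gene, function in predicted.items() if annotated.get(gene) == function
    )
    n_annotated = sum(1 for gene in eligible_genes if gene in annotated)
    precision = true_predictions / len(predicted) if predicted else 0.0
    recall = true_predictions / n_annotated if n_annotated else 0.0
    return precision, recall


# Toy, made-up predictions: gene -> predicted GO 'biological process'.
toy_predictions = {
    "TAP": {"g1": "transcription", "g2": "cell cycle", "g3": "transport"},
    "SGA": {"g1": "transcription", "g2": "cell cycle", "g3": "cell cycle"},
    "Y2H": {"g1": "transcription", "g2": "DNA metabolism", "g4": "transport"},
}
# Toy 'gold standard' annotations, also made up.
toy_annotations = {
    "g1": "transcription", "g2": "cell cycle", "g3": "transport", "g4": "transport",
}

agreed = consensus_predictions(toy_predictions, min_agree=3)
in_all = set.intersection(*(set(p) for p in toy_predictions.values()))
single = precision_recall(toy_predictions["SGA"], toy_annotations, set(toy_predictions["SGA"]))
combined = precision_recall(agreed, toy_annotations, in_all)
print(agreed, single, combined)
```

In this toy case, requiring all three data sets to agree raises precision from 2/3 to 1 but lowers recall from 2/3 to 1/2, and only one of the four genes reaches three-way agreement.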
Unfortunately, this does not happen frequently; only 122 uncharacterized genes can be considered to be ‘characterized’ by these five datasets, and 90 of the 122 uncharacterized genes (74%) fall into one of five general categories (transcription, cell cycle, protein modification, organelle organization and biogenesis, and DNA metabolism) (listed at http://hugheslab.med.utoronto.ca/Mitsakakis/). We note that this traditional standard of ‘characterized’ is quite rigorous, because published conclusions are generally regarded highly until proven suspect. Certainly, if we treat the large-scale data as clues rather than facts we can tolerate some error. Numerous computational studies have shown that many more than 122 high-confidence functional predictions can be drawn from data from essentially the same datasets we analyzed, but with an associated P-value, rather than an outright statement of fact [42,47,48]. Hence, while the datasets produced by systematic analysis are instrumental for providing clues to gene function, it seems clear that they do not render obsolete the insight and efforts of the biologist at the laboratory bench.

Table 1. 197 protein-coding genes whose existence is supported by expression and/or conservation over evolution, but which are completely uncharacterized on Saccharomyces Genome Database and which are not present in any of the major yeast functional genomics data sets analyzed here.
YAL016C-B YAL037C-A YAL037W YAL063C-A YAL064C-A YAL067W-A YAR035C-A YAR068W
YBL008W-A YBL039W-A YBL071C-B YBL101W-C YBL108C-A YBL112C YBR056W-A YBR072C-A
YBR182C-A YBR196C-A YBR196C-B YBR200W-A YBR221W-A YBR296C-A YBR298C-A YCL001W-A
YCL012C YCL047C YCR024C-B YCR075W-A YCR108C YDL073W YDL159W-A YDL160C-A
YDL169C YDR003W-A YDR169C-A YDR179W-A YDR182W-A YDR194W-A YDR246W-A YDR475C
YDR524C-B YDR524W-A YEL076C-A YER038W-A YER078W-A YER085C YER138W-A YER175W-A
YER186W-A YER188C-A YFL041W-A YFR012W-A YFR032C-B YGL006W-A YGL007C-A YGL041C-B
YGL159W YGL188C-A YGL218W YGL258W-A YGL262W YGR035W-A YGR121W-A YGR127W
YGR146C-A YGR169C-A YGR174W-A YGR204C-A YGR240C-A YHL015W-A YHL048C-A YHR007C-A
YHR022C-A YHR050W-A YHR086W-A YHR175W-A YHR199C-A YHR212W-A YHR213W-A YHR213W-B
YHR214C-D YHR214C-E YIL002W-A YIL014C-A YIL046W-A YIL102C YIL134C-A YIR018C-A
YIR021W-A YJL047C-A YJL052C-A YJL077W-B YJL127C-B YJL133C-A YJL136W-A YJR005C-A
YJR039W YJR112W-A YJR151W-A YKL033W-A YKL068W-A YKL084W YKL096C-B YKL106C-A
YKL138C-A YKL183C-A YKR095W-A YLL006W-A YLL066W-B YLR012C YLR154C-G YLR154C-H
YLR154W-A YLR154W-B YLR154W-C YLR154W-E YLR154W-F YLR156C-A YLR157C-C YLR157W-A
YLR157W-C YLR159C-A YLR159W YLR161W YLR162W YLR162W-A YLR264C-A YLR285C-A
YLR307C-A YLR312C-B YLR342W-A YLR361C-A YLR406C-A YLR412C-A YLR466C-B YML003W
YML054C-A YML100W-A YMR001C-A YMR013W-A YMR030W-A YMR105W-A YMR107W YMR158C-A
YMR175W-A YMR182W-A YMR185W YMR194C-B YMR230W-A YMR242W-A YMR247W-A YMR272W-B
YMR315W-A YNL024C-A YNL034W YNL042W-B YNL050C YNL067W-B YNL097C-A YNL130C-A
YNL138W-A YNL146C-A YNL249C YNL269W YNL277W-A YNR075C-A YOL013W-B YOL015W
YOL019W-A YOL038C-A YOL086W-A YOL097W-A YOL155W-A YOL164W-A YOL166W-A YOR011W-A
YOR012W YOR020W-A YOR032W-A YOR034C-A YOR072W-B YOR161C-C YOR192C-C YOR293C-A
YOR316C-A YOR338W YOR376W-A YOR381W-A YOR394C-A YPL038W-A YPL039W YPL119C-A
YPL152W-A YPL189C-A YPR089W YPR108W-A YPR159C-A

New data, new genes, new approaches
While many of the first functional genomics efforts were not comprehensive, more recent resources and datasets do tend to encompass a majority if not all of the yeast genes [25,28,29,49]. This should in principle enable clues regarding potential function of each yeast gene to be uncovered, and we are likely to be more confident in them as individual uncharacterized genes appear in more and more datasets. Nonetheless, there are still many genes for which clues may not be forthcoming.
Table 1 lists 197 genes that 1) are completely uncharacterized in GO, 2) are not present in any of the five datasets analyzed above, or in the more recent reports on proteome-wide protein localization [28,29,49], and 3) have no sequence motifs suggestive of their function [50,51]. The majority of these are genes that were not included in the initial yeast gene count [10], but which were subsequently identified by their expression or by comparative genomics [6,7,52,53]. Hence, these may have been simply ignored in most studies. However, we must entertain the possibility that some of the uncharacterized proteins will not yield to the approaches of functional genomics, possibly because they are involved in aspects of yeast life that we are unable to probe in the laboratory.

Conclusions
Both anecdotal and systematic examination of what is known about the whole of yeast genes reveals that accumulation of facts does not lead automatically to convergence upon global understanding of gene function. However, it does appear that small-scale research (i.e. individual efforts aimed at understanding the functions of individual genes) benefits from large-scale research, and the benefit is obviously reciprocated, as gold-standard annotations form the basis for interpreting large-scale data. This seems to be the making of a positive-feedback loop. A major asset in this enterprise, which we did not address in our analysis, is widespread acceptance in the yeast community that determining all of the gene functions is an important short-term goal. Although the precise date the task will be completed may not be certain, enthusiasm for the effort is undiminished, and we expect there to be no rest until the job is done.
We anticipate that the widespread availability of mutant collections and other resources that enable researchers to perform their specialized assays comprehensively (and which in theory could allow functional genomic experimentation on yeast in the wild), will help reveal the purpose of all of the orphan genes (perhaps by April 1, 2007!).

References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as: of special interest; of outstanding interest.

1. Schroedinger E: What is life? The Physical aspect of the living cell. New York, NY: Macmillan Company; 1945.
2. Stent GS: The coming of the Golden Age; a view of the end of progress. Garden City, NY: The Natural History Press; 1969.
3. Alberts B: Molecular biology of the cell, edn 4. New York: Garland Science; 2002.
4. Lodish HF: Molecular cell biology, edn 5. New York: W.H. Freeman and Company; 2004.
5. Collins FS, Green ED, Guttmacher AE, Guyer MS: A vision for the future of genomics research. Nature 2003, 422:835-847.
6. Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen BA, Johnston M: Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 2003, 301:71-76.
7. Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES: Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 2003, 423:241-254.
8. Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D: Ultraconserved elements in the human genome. Science 2004, 304:1321-1325.
9. Dermitzakis ET, Reymond A, Scamuffa N, Ucla C, Kirkness E, Rossier C, Antonarakis SE: Evolutionary discrimination of mammalian conserved non-genic sequences (CNGs). Science 2003, 302:1033-1035.
10. Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq C, Johnston M et al.: Life with 6000 genes. Science 1996, 274:546, 546-567.
11. The C. elegans Sequencing Consortium: Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 1998, 282:2012-2018.
12. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF et al.: The genome sequence of Drosophila melanogaster. Science 2000, 287:2185-2195.
13. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W et al.: Initial sequencing and analysis of the human genome. Nature 2001, 409:860-921.
14. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA et al.: The sequence of the human genome. Science 2001, 291:1304-1351.
15. Samuel D: Investigation of ancient Egyptian baking and brewing methods by correlative microscopy. Science 1996, 273:488-490.
16. Vaughan-Martini A, Martini A: Facts, myths and legends on the prime industrial microorganism. J Ind Microbiol 1995, 14:514-522.
17. Meyen J: Jahresbericht uber die Resultate der Arbeiten im Felde der physiologischen Botanik vom der Jahre 1837. Wiegmann Archiv fur Naturgeschichte, Band 2 1838, 4:1-186.
18. Pasteur L: Nouvelles experiences pour demontrer que le germe de la levure qui fait le vin provient de l’exterieur des grains de raisin. Comptes Rendus de l’Academie des Science de Paris 1872, 75:781-796.
19. Broach JR, Pringle J, Jones EW: The molecular and cellular biology of the yeast Saccharomyces. Plainview, N.Y.: Cold Spring Harbor Laboratory Press; 1991.
20. Hodges PE, McKee AH, Davis BP, Payne WE, Garrels JI: The Yeast Proteome Database (YPD): a model for the organization and presentation of genome-wide functional data. Nucleic Acids Res 1999, 27:69-73.
21. Mosch HU, Fink GR: Dissection of filamentous growth by transposon mutagenesis in Saccharomyces cerevisiae. Genetics 1997, 145:671-684.
22. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25:25-29.
23. Masselot M, De Robichon-Szulmajster H: Methionine biosynthesis in Saccharomyces cerevisiae. I. Genetical analysis of auxotrophic mutants. Mol Gen Genet 1975, 139:121-132.
24. Prakash L, Prakash S: Three additional genes involved in pyrimidine dimer removal in Saccharomyces cerevisiae: RAD7, RAD14 and MMS19. Mol Gen Genet 1979, 176:351-359.
25. Giaever G, Chu AM, Ni L, Connelly C, Riles L, Veronneau S, Dow S, Lucau-Danila A, Anderson K, Andre B et al.: Functional profiling of the Saccharomyces cerevisiae genome. Nature 2002, 418:387-391.
26. Kanemaki M, Sanchez-Diaz A, Gambus A, Labib K: Functional proteomic identification of DNA replication proteins by induced proteolysis in vivo. Nature 2003, 423:720-724.
27. Mnaimneh S, Davierwala AP, Haynes J, Moffat J, Peng WT, Zhang W, Yang X, Pootoolal J, Chua G, Lopez A et al.: Exploration of essential gene functions via titratable promoter alleles. Cell 2004, 118:31-44.
28. Ghaemmaghami S, Huh WK, Bower K, Howson RW, Belle A, Dephoure N, O’Shea EK, Weissman JS: Global analysis of protein expression in yeast. Nature 2003, 425:737-741.
29. Huh WK, Falvo JV, Gerke LC, Carroll AS, Howson RW, Weissman JS, O’Shea EK: Global analysis of protein localization in budding yeast. Nature 2003, 425:686-691.
    Carboxy-terminal green fluorescent protein tagged strains for each yeast open reading frame were analyzed to determine subcellular localization for 75% of the yeast proteome, highlighting ‘organellar proteomics’ of the nucleolus and correspondence between localization and other functional information.
30. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P et al.: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403:623-627.
31. DeRisi JL, Iyer VR, Brown PO: Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997, 278:680-686.
32. Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD et al.: Functional discovery via a compendium of expression profiles. Cell 2000, 102:109-126.
33. Grunenfelder B, Winzeler EA: Treasures and traps in genome-wide data sets: case examples from yeast. Nat Rev Genet 2002, 3:653-661.
34. Kumar A, Snyder M: Emerging technologies in yeast genomics. Nat Rev Genet 2001, 2:302-312.
35. Castrillo JI, Oliver SG: Yeast as a touchstone in post-genomic research: strategies for integrative analysis in functional genomics. J Biochem Mol Biol 2004, 37:93-106.
36. Bader GD, Heilbut A, Andrews B, Tyers M, Hughes T, Boone C: Functional genomics and proteomics: charting a multidimensional map of the yeast cell. Trends Cell Biol 2003, 13:344-356.
37. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 2001, 98:4569-4574.
38. Tong AH, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang M et al.: Global mapping of the yeast genetic interaction network. Science 2004, 303:808-813.
    Analysis of 132 systematic genetic screens shows that ‘network connectivity’ is predictive of function. As the average screen yields 34 interactions, the genetic interaction network is estimated to be four times as dense as the physical interaction network.
39. Tong AH, Evangelista M, Parsons AB, Xu H, Bader GD, Page N, Robinson M, Raghibizadeh S, Hogue CW, Bussey H et al.: Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 2001, 294:2364-2368.
40. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K et al.: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415:180-183.
41. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM et al.: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415:141-147.
42. Wu LF, Hughes TR, Davierwala AP, Robinson MD, Stoughton R, Altschuler SJ: Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat Genet 2002, 31:255-265.
44. Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D: A combined algorithm for genome-wide prediction of protein function. Nature 1999, 402:83-86.
45. Ge H, Liu Z, Church GM, Vidal M: Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat Genet 2001, 29:482-486.
46. Kemmeren P, van Berkum NL, Vilo J, Bijma T, Donders R, Brazma A, Holstege FC: Protein interaction verification and functional annotation by integrated analysis of genome-scale data. Mol Cell 2002, 9:1133-1143.
47. Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, Zhang J, Sun S, Ling L, Zhang N et al.: Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Res 2003, 31:2443-2450.
48. Tanay A, Sharan R, Kupiec M, Shamir R: Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc Natl Acad Sci USA 2004, 101:2981-2986.
49. Kumar A, Agarwal S, Heyman JA, Matson S, Heidtman M, Piccirillo S, Umansky L, Drawid A, Jansen R, Liu Y et al.: Subcellular localization of the yeast proteome. Genes Dev 2002, 16:707-719.
50. Letunic I, Copley RR, Schmidt S, Ciccarelli FD, Doerks T, Schultz J, Ponting CP, Bork P: SMART 4.0: towards genomic data integration. Nucleic Acids Res 2004, 32(Database issue):D142-144.
43.
Dwight SS, Harris MA, Dolinski K, Ball CA, Binkley G, Christie KR, Fisk DG, Issel-Tarver L, Schroeder M, Sherlock G et al.: Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res 2002, 30:69-72. Current Opinion in Microbiology 2004, 7:546–554 51. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL et al.: The Pfam protein families database. Nucleic Acids Res 2004, 32 Database issue:D138–141. 52. Oshiro G, Wodicka LM, Washburn MP, Yates JR III, Lockhart DJ, Winzeler EA: Parallel identification of new genes in Saccharomyces cerevisiae. Genome Res 2002, 12:1210-1220. 53. Kumar A, Harrison PM, Cheung KH, Lan N, Echols N, Bertone P, Miller P, Gerstein MB, Snyder M: An integrated approach for finding overlooked genes in yeast. Nat Biotechnol 2002, 20:58-63. www.sciencedirect.com