The promise of functional genomics: completing the
encyclopedia of a cell
Timothy R Hughes1, Mark D Robinson1, Nicholas Mitsakakis1 and
Mark Johnston2
Genome sequencing provides complete parts lists of
organisms. This presents the obvious challenge of determining
how each gene contributes to the life of the organism. This task
seems increasingly feasible; however, progress to date
suggests that increased interaction between systematic efforts
and individual investigators will be critical to completing the
encyclopedia of the yeast cell.
Addresses
1 Banting and Best Department of Medical Research, University of
Toronto, 112 College St., Room 307, Toronto, ON, M5G 1L6, Canada
2 Department of Genetics, Campus Box 8232, Washington University
Medical School, 4566 Scott Ave., St. Louis, MO 63110, USA
e-mail: [email protected]
Current Opinion in Microbiology 2004, 7:546–554
This review comes from a themed issue on
Genomics
Edited by Charles Boone and Philippe Glaser
Available online 11th September 2004
1369-5274/$ – see front matter
© 2004 Elsevier Ltd. All rights reserved.
DOI 10.1016/j.mib.2004.08.015
Abbreviations
FLAG, FLG   flag-affinity tag
GO          gene ontology
REG         gene co-regulations
SGA         synthetic genetic array
SGD         Saccharomyces genome database
TAP         tandem affinity purification tag
Y2H         yeast two-hybrid
YPD         yeast proteome database
Introduction
Erwin Schroedinger ushered in the previous era of biological research by posing the simple (but profound)
question: ‘What is Life?’ [1]. The answer came much
more quickly than the renowned physicist imagined it
would (‘Indeed, I do not expect that any detailed information on
this question is likely to come from physics in the near future’): life
is the result of chemical reactions and molecular interactions (both micro and macro). These general principles of
the physics and chemistry of life, revealed over the past
half century, give us a basic (in many cases highly
detailed) understanding of the fundamental processes
that define living things: how they conquer entropy,
how like begets like, and how cellular components
self-assemble [2–4].
A few years ago, we entered what we believe to be a new
(perhaps penultimate) era of biological research, in which
we expect to be able to approach a complete understanding of the molecular mechanisms of life. To achieve
this lofty goal, we need to first identify all the components
of a cell. The genome sequencing projects are doing that,
because they have generated substantially complete parts
lists of several organisms; most of the protein coding
genes are (more or less) apparent in the genome
sequences, and the relatively few hidden ones will
undoubtedly be uncovered in due time. Continued application of this program to many more genomes [5] promises to reveal the functional non-protein coding
sequences, such as gene regulatory sequences, sequences
governing chromosome replication and segregation, regulatory RNAs, and others, by their evolutionary conservation [6–9]. One does not need to be an optimist to expect
that we will soon have nearly complete parts lists for
several organisms.
With the parts lists in hand, we can begin to tackle the
next goal of molecular and cellular biology: determination
of the function of all gene products of an organism. Such a
goal might have seemed absurd a few years ago, but today
it seems reachable to us because the parts lists of organisms have turned out to be surprisingly short: only about
6000 genes are necessary to construct and operate a
simple eukaryotic cell [10,11], and only about two to
three times that number of genes are required to produce
relatively complicated multicellular organisms [12].
Remarkably, only about five times that many genes seem
necessary to compose a human [13,14]. Suddenly, the
scale of the task does not seem overwhelming. We can
imagine resolving the functions and interactions of all of
the genes and proteins in a few well-studied organisms
within our own lifetimes.
In this review, we take stock of current progress toward
this goal with the yeast Saccharomyces cerevisiae, a bellwether of molecular biology and functional genomics. We
ponder the prospect of ‘solving’ the yeast cell by examining what is in databases, what is in the literature, what
is in large-scale datasets, and whether the synthesis of
existing information contributes to a real understanding
of the functions of the many new genes discovered by
genome sequencing.
www.sciencedirect.com
Yeast: the proving ground
It is fitting that the first organism that will be ‘solved’ in
this way is the first one to be domesticated by humans
[15,16]: bakers’ and brewers’ yeast (Saccharomyces cerevisiae). It is also a practical choice, because it has the fewest genes among the eukaryotic ‘model organisms’ and the experimental toolbox
available for it is highly sophisticated. Discovered and
named by Meyen [17] and made famous by Pasteur [18],
yeast became the workhorse for eukaryotic molecular and
cellular biology, revealing over the past 30 years much of
what we know about how cells work [19].
How close are we to completing the encyclopedia of the
yeast cell? When will this signal accomplishment be
realized? How will it be done? And, once we have reached
this goal, what will we have learned? We will attempt
to address some of these questions in the following
discourse.
What do we think we know?
First, how close are we to our goal of identifying the function of all yeast genes? A superficial analysis suggests we
are closer than one might have imagined. Towards the end
of 2003, about 80% of yeast genes annotated in the Yeast
Proteome Database [20] (YPD; http://proteome.incyte.com/)
were listed as ‘known’. The remarkably linear rate
of progress in understanding gene function, apparent
in Figure 1, allows us to predict with a high degree of
accuracy when all 6000 genes will be ‘known’: we expect
to celebrate this remarkable accomplishment only three
years from now, on or around April 1, 2007!
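The arithmetic behind this extrapolation can be sketched in a few lines. The sketch uses only the two labelled points from Figure 1 (4679 genes ‘known’ on Oct. 14, 2003, and the projected 6000 on April 1, 2007). The function names are illustrative, and the straight-line assumption is the one made in the text, not a fitted model.

```python
from datetime import date, timedelta

def implied_rate(d0, n0, d1, n1):
    """Genes newly labelled 'known' per day between two (date, count) points."""
    return (n1 - n0) / (d1 - d0).days

def projected_date(d0, n0, rate_per_day, target):
    """Date at which a straight line from (d0, n0) reaches `target` genes."""
    return d0 + timedelta(days=round((target - n0) / rate_per_day))

# The two points labelled in Figure 1.
start, known_start = date(2003, 10, 14), 4679
end, known_end = date(2007, 4, 1), 6000

rate = implied_rate(start, known_start, end, known_end)
print(f"implied rate: {rate:.2f} genes/day (~{rate * 365:.0f} per year)")
print("6000 genes reached on:", projected_date(start, known_start, rate, 6000))
```

The implied pace is roughly one gene per day, which makes plain how much the projected date depends on the rate staying constant.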
However, even a cursory look into YPD to see what is
known about some of these genes will soon cast a pall over
our celebration, because it will quickly become apparent
that little is known about most proteins. For example,
we will see that DFG10 encodes a protein involved in
‘pseudohyphal growth’ and ‘regulation of cell shape’, but
this is on the basis of only two observations. First, ‘mutant
diploids are defective in cell polarity and cell elongation,
but still invade the agar upon nitrogen starvation’; and
second, the ‘mutant diploid is partially suppressed by the
ras2-val19 mutation’. These observations come from only
one study [21], despite the fact that 15 references to
papers with information concerning DFG10 are listed in
YPD. We are heartened when we notice that Dfg10 is
similar to proteins in other organisms, but our optimism is
quickly quenched when we realize that nothing is known
about these proteins. YPD summarizes the results of a
handful of systematic analyses that yielded data for
DFG10, but these results reveal little about Dfg10 function; for example, ‘one of 177 genes co-repressed by the
addition of 960 mg per liter of diammonium phosphate to
stationary cultures grown in Riesling grape juice’. Clearly,
we are unsatisfied with our understanding of this ‘known’
protein, and the number of similarly characterized proteins is uncomfortably large. Even proteins with extensive lists of information in YPD, such as Std1, are poorly
understood. It is clear that we will have to get back to the
laboratory bench on April 2, 2007 if we are to have an
encyclopedia of the organism that is worth reading.
A different analysis of the situation similarly reveals that
we are not quite as close to completing the encyclopedia
of the yeast proteome as our initial analysis suggested. Of
the 5818 genes annotated with gene ontology (GO) [22]
terms in the Saccharomyces Genome Database (SGD),
40% (2317) are of unknown molecular function, and for 30%
(1720) the biological process they are
involved in is unknown. In addition, many of the proteins
whose biological process and molecular function are
‘known’ are poorly understood. For example, SGD lists
Figure 1
[Line graph. x-axis: time (June 1995 to June 2009); y-axis: number of genes ‘known’ (0–7000). Labelled points: 4679 proteins ‘known’, Oct. 14, 2003; 6000 proteins ‘known’, April 1, 2007.]
Number of genes with ‘known’ functions according to the Yeast Proteome Database (www.proteome.com, [20]). The first eleven points were
compiled on the individual dates shown. The last point is extrapolated.
many facts about MET18, which was discovered in 1975
on the basis of its requirement for methionine biosynthesis [23] and again in 1979 because it is required for
resistance to methylmethane sulfonate [24]. Met18 has
been associated with ‘RNA polymerase II transcription
factor activity’, and has been localized to the cytoplasm.
These are all intriguing observations, but it has been
difficult to reconcile them and synthesize a model for
Met18 function. Although there are many
examples where the function of a gene or protein is
obvious from the data (for example, functional genomics
and proteomics have identified many new RNA processing and ribosome biogenesis factors for which all of the
available information is consistent), it is also clear that
much more work needs to be done before we can claim to
have ‘solved’ the proteome.
Thus, we are somewhat schizophrenic about the prospects of approaching a complete understanding of the
yeast cell proteome. In some respects, much is known
about a substantial fraction of the proteome, and our rate
of progress in learning about the rest has been steady.
However, little light has so far been shed on the function
of a significant number of proteins, and there is a long way
to go before we will be comfortable with our understanding of many of the ‘characterized’ proteins.
Our mood tends toward the optimistic, because of the
resources and experimental efficiencies engendered by
the genome projects. The complete yeast deletion collection [25] and efforts to create libraries of conditional
alleles of essential genes [26,27] have greatly facilitated
systematic genetic analysis. Nearly complete collections
of green fluorescent protein (GFP), tandem affinity purification tag (TAP) and transcriptional activation domain-tagged genes [28,29,30] enable systematic examination
of protein localization and protein complexes. Use of
DNA microarrays to assay gene expression genome-wide
was pioneered with yeast [31,32]. These and other genome-scale technologies have greatly enhanced the experimental toolbox [33–36], and made the power of yeast
genetics even more awesome than it was.
Figure 2
[Bar charts. (a) x-axis: dataset (SGA, TAP, FLAG, 2-hybrid, Microarray, All). (b) x-axis: number of data sets (0–4). Both panels show the percentage of genes subsequently annotated (scale on left) and the number of uncharacterized genes in the indicated data sets (scale on right).]
Correspondence between the appearance of uncharacterized genes in functional genomics data sets, and whether or not those genes are
subsequently characterized. (a) Comparison of different data sets and data types. (b) Impact of appearance in multiple data sets.
Functional genomics: is it worth the trouble?
The ability to analyze 6000 genes in one experiment
promises to speed completion of the encyclopedia of the
yeast cell. But realistically, how might these data inform
us about the function of the cell? The impact of the
availability of the genome sequence and the resources it
spawned on ‘directed’ (i.e. hypothesis-driven) research
can be estimated by asking if the appearance of information on uncharacterized genes from large-scale studies
was accompanied by more complete characterization of
these genes. We previously noted a clear increase in the
rate of characterization of new genes following determination
of the DNA sequence of the first yeast chromosome in 1993, a trend that continued through the 1990s
[36]. Thus, the availability of genome sequence appears
to have had a positive impact on the discovery of new
gene functions. Figure 2a illustrates whether any of the
2,248 genes that were labeled as ‘biological process
unknown’ by gene ontology in 2002 appeared in various
large-scale datasets reported between 2000 and 2002 (2-hybrid [30,37], synthetic gene interactions [38,39],
direct identification of protein complexes [40,41], microarray expression profiling [42]), and whether any of these
genes subsequently acquired another biological process
Figure 3
[(a) Heat map of GO biological process (GO-BP) categories for the entire GO data and the TAP, SGA, FLG, Y2H and REG data sets, with their distribution in GRID; colour scale from 0 to >10 genes. Labelled categories: alcohol metabolism (596–681) 85; amine metabolism (691–970) 144; coenzymes and prosthetic group metabolism (1932–2024) 76; ion transport (4137–4201) 96; amino acid and derivative metabolism (971–1184) 147; organic acid metabolism (3395–3634) 68; nucleotide metabolism (2564–3136) 25. Total number of annotated genes: entire GO data 3681; TAP 393; SGA 121; FLG 394; Y2H 1147; REG 2749; GRID 1177. (b) Number of Medline abstracts (0–50) per gene for about 6000 genes, with tracks for the entire GO data, TAP, SGA, FLG, Y2H and REG.]
Functional categories represented among genes from five different functional genomics and proteomics data sets. (a) Numbers of genes in each
data set in each GO category, for the different data sets and the entire GO database, and also their distribution in the GRID database. The color
shows the number of genes in each of the categories, according to the scale on the right (i.e. categories with no genes in a data set are white and
those with 10 or more genes are dark red). Numbers in parentheses indicate the number of subcategories in GO. For example, ‘organic acid
metabolism (3395–3634) 68’ indicates that the general category ‘organic acid metabolism’ encompasses subcategory numbers 3395–3634
(i.e. 240 related subcategories) and that there are 68 known genes in one or more of these categories. The fact that this area of the graph is
completely white in the SGA data indicates that the data set does not contain any genes involved in ‘organic acid metabolism’. (b) Number of
Medline abstracts containing the name of each yeast gene. The genes are sorted according to the number of abstracts in which they appear.
Below are indicated whether each of the genes is annotated in GO, and whether it is used as a ‘bait’ in any of the five data sets.
For the REG data set (gene co-regulations), any gene for which there are other genes that correlated at r = 0.5 or better is considered
as a ‘bait’, as functional associations can be drawn from any such correlation.
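The ‘bait’ rule for the REG data set can be sketched as follows. This is a minimal illustration that assumes ‘r’ is the Pearson correlation between expression profiles and that ‘0.5 or better’ means r >= 0.5; the gene names, profile values and function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no co-regulation signal
    return sxy / sqrt(sxx * syy)

def reg_baits(profiles, threshold=0.5):
    """Genes with at least one other gene correlated at r >= threshold."""
    genes = list(profiles)
    baits = set()
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if pearson(profiles[g], profiles[h]) >= threshold:
                baits.update((g, h))
    return baits

# Toy expression profiles (hypothetical values across four conditions).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [1.1, 1.9, 3.2, 3.9],  # tracks geneA closely
    "geneC": [4.0, 1.0, 3.0, 2.0],  # unrelated pattern
}
print(sorted(reg_baits(profiles)))
```

The pairwise loop is quadratic in the number of genes, which is acceptable for a sketch but would normally be replaced by a vectorized correlation matrix for a genome-scale compendium.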
annotation (the biological process annotations are primarily assigned on the basis of phenotypic data from individual studies [43]). In this analysis, genetic interaction
and protein complex data correlate with more complete
annotation, whereas two-hybrid analysis and DNA microarray co-regulation data do not. Although this correlation
does not demonstrate cause-and-effect, it increases our
confidence that data from some types of whole genome
scale systematic analysis will propel us toward our goal of
‘solving’ the proteome.
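The tally behind this kind of comparison can be sketched as a set calculation: for each data set, take the uncharacterized genes that appear in it and ask what fraction later gained an annotation. The gene names and set contents below are toy placeholders, not the published data, and the function name is illustrative.

```python
def fraction_subsequently_annotated(uncharacterized, dataset_genes, later_annotated):
    """Of the uncharacterized genes that appear in a dataset, return
    (how many there are, the fraction that later gained an annotation)."""
    hits = uncharacterized & dataset_genes
    if not hits:
        return 0, 0.0
    return len(hits), len(hits & later_annotated) / len(hits)

# Toy inputs (hypothetical gene names, NOT the published data).
uncharacterized_2002 = {"g1", "g2", "g3", "g4", "g5", "g6"}
annotated_later = {"g1", "g2", "g5"}
datasets = {
    "SGA": {"g1", "g2", "known1"},
    "Y2H": {"g3", "g4", "g5", "g6"},
}

for name, genes in datasets.items():
    n, frac = fraction_subsequently_annotated(uncharacterized_2002, genes, annotated_later)
    print(f"{name}: {n} uncharacterized genes, {frac:.0%} subsequently annotated")
```

As the text notes, a higher fraction for one data type shows correlation only; the tally cannot say whether the data set prompted the later characterization.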
It bears mentioning that the data on genetic interaction
and protein complexes, which are the most difficult to
generate, are biased towards specific functional categories. For example, TAP and FLAG-tag protein complex data as well as the synthetic genetic array (SGA) data
are highly biased against functional categories related to
small molecule metabolism or transport (Figure 3a).
Although this may reflect a fundamental limitation of
these experiments, recent SGA data are in fact intentionally biased towards proteins involved in the cytoskeleton,
the cell wall, and DNA replication [38]. These kinds of
datasets would be more helpful for ‘solving’ the yeast
proteome if they contained more uncharacterized genes;
in most of the large-scale data sets, genes of known
function are highly over-represented (Figure 3b). For
example, in the dataset of Gavin et al. [41], 8.61% of
the proteins (118 of 1371) are uncharacterized for GO
‘biological process’, significantly less than the 29.53% of
all proteins that are uncharacterized. Again, this may be
owing to experimental design (i.e. choice of genes used to
query for interactions), but it is also possible that these
methods are more effective on the types of genes that
have already been studied.
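The under-representation of uncharacterized genes in the Gavin et al. data set (118 of 1371 proteins, against a background of 29.53%) can be checked with a simple tail probability. The text does not say which statistical test was used, so the binomial model below is an assumption made for illustration; it treats the 1371 proteins as independent draws from the background, which is only an approximation to sampling without replacement.

```python
from math import exp, lgamma, log

def log_binom_pmf(k, n, p):
    """Natural log of the binomial probability P(X = k) for X ~ Bin(n, p)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1 - p)

def log10_lower_tail(k, n, p):
    """log10 of P(X <= k), summed in log space to avoid underflow."""
    logs = [log_binom_pmf(i, n, p) for i in range(k + 1)]
    peak = max(logs)
    total = peak + log(sum(exp(v - peak) for v in logs))
    return total / log(10)

observed, sampled = 118, 1371  # uncharacterized proteins in Gavin et al.
background = 0.2953            # fraction uncharacterized among all proteins

print(f"observed fraction: {observed / sampled:.2%}")
print(f"log10 P(X <= {observed}): {log10_lower_tail(observed, sampled, background):.1f}")
```

Under this model the tail probability is vanishingly small. That supports the claim of under-representation, but it cannot separate the two explanations the text offers (choice of query genes versus method effectiveness).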
Are the results from the systematic studies sufficient on
their own to define gene function? Perusal of 20 reports
published between 1996 and 2001 that describe analysis
of hitherto uncharacterized yeast genes reveals that agreement among at least three different experimental tests is
generally required to characterize gene function, at least
in the traditional sense of being able to satisfy peer
reviewers and editors (Figure 4). Several studies have
also shown that gene functions can be more accurately
identified on the basis of agreement among multiple
types of functional genomics and proteomics data. That
is, ‘guilt-by-association’ predictions of function based on
both 2-hybrid and co-expression are more likely to be
Figure 4
[Heat map of evidence types against papers; colour scale from ‘No figures of this type’ through 1, 2, 3 and 4 to 5 figures of this type.]
Evidence types:
Biochemical purification
Specialized in vitro biochemical assay
Co-sedimentation
Effect on reporter gene expression
Cross-complementation
Sequence analysis
Co-localization
Directed genetic/biochemical experiment
Supporting evidence for new reagent/assay
Protein–protein interactions
General phenotypic/physiological assay
Localization
Specialized in vivo physiological assay
Specialized in vivo biochemical assay
Papers:
AUT8: Barth et al. (2001) Gene 274:151
CYK3: Korinek et al. (2000) Curr Biol 10:947
DOC1: Hwang et al. (1997) Mol Biol Cell 8:1877
MCD4: Packeiser et al. (1999) Yeast 15:1485
THP1: Gallardo et al. (2001) Genetics 157:79
RGP1: Panek et al. (2000) J Cell Sci 113:4545
XTC1: Emili et al. (1998) Proc Natl Acad Sci U S A 95:11122
MSS11: Webber et al. (1997) Curr Genet 32:260
PPT2: Stuible et al. (1998) J Biol Chem 273:22334
MBA1: Rep et al. (1996) FEBS Lett 388:185
EPS1: Wang et al. (1999) EMBO J 18:5972
ARK1,PRK1: Cope et al. (1999) J Cell Biol 144:1203
RGS2: Versele et al. (1999) EMBO J 18:5577
LUC7: Fortes et al. (1999) Genes Dev 13:2425
APG10: Shintani et al. (1999) EMBO J 18:5234
SAD1: Lygerou et al. (1999) Mol Cell Biol 19:2008
VPS52,VPS53,VPS54: Conibear et al. (2000) Mol Biol Cell 11:305
SNU17: Gottschalk et al. (2001) Mol Cell Biol 21:3037
BMS1,TSR1: Gelperin et al. (2001) RNA 7:1268
MMF1: Oxelmark et al. (2000) Mol Cell Biol 20:7784
Types of evidence provided for the initial functional assignment of yeast genes in the figures of 20 papers. Figures often contained more
than one evidence type, and often more than one figure contained the same evidence type. Axes were arranged by hierarchical clustering.
Figure 5
[Bar charts for the TAP, SGA, FLG, Y2H and REG data sets. (a) y-axis: Δprecision (−0.1 to 0.3). (b) y-axis: Δrecall (−0.25 to 0). x-axis in both panels: number of data types combined (1->2, 2->3, 3->4, 4->5).]
Evaluation of the use of multiple data types to predict gene functions. Changes in precision and recall ([a] and [b], respectively) as a consequence
of adding a data type and taking only the intersection of predictions made by all of the data types analyzed, taken as the average of all data
combinations (e.g. the first set of bars shows the average effect on precision of going from one to two data types; the first bar in each set of five
indicates the average effect of the addition of TAP data to any other data set or combination of data sets). Precision equals the number of
true predictions divided by the total number of predictions; recall equals the number of true predictions divided by the number of actual gene
annotations in the intersection of data types. For example, progressing from one to two data sets, if the second data set added is TAP data,
then the precision of predictions increases by almost 0.3 (i.e. the predictions become almost 30% more ‘correct’, purple bar at left in
(a)); however, this comes at the price of eliminating almost 10% of the predictions that were correct with just one data set (purple bar at
lower left in (b)), even though these genes are in the TAP data. All data are available at http://hugheslab.med.utoronto.ca/Mitsakakis/.
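The precision and recall definitions in this legend can be sketched directly. The code assumes that predictions and annotations are represented as (gene, GO term) pairs and that ‘the intersection of data types’ means the genes covered by every data type being combined; the toy predictions and the function name are hypothetical.

```python
def evaluate(predictions_by_type, gold):
    """Intersect predictions across data types, then score them.

    predictions_by_type: dict of data type -> set of (gene, term) predictions
    gold: set of (gene, term) pairs taken as the true annotations
    """
    sets = list(predictions_by_type.values())
    kept = set.intersection(*sets)
    # Genes covered by every data type: recall is measured only over these.
    shared_genes = set.intersection(*({g for g, _ in s} for s in sets))
    actual = {(g, t) for g, t in gold if g in shared_genes}
    true_pos = len(kept & actual)
    precision = true_pos / len(kept) if kept else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall

# Toy data (hypothetical): three genes, two data types.
gold = {("g1", "transcription"), ("g2", "cell cycle"), ("g3", "DNA metabolism")}
tap = {("g1", "transcription"), ("g2", "transcription"), ("g3", "DNA metabolism")}
y2h = {("g1", "transcription"), ("g2", "cell cycle"), ("g3", "cell cycle")}

p1, r1 = evaluate({"TAP": tap}, gold)
p2, r2 = evaluate({"TAP": tap, "Y2H": y2h}, gold)
print(f"delta precision: {p2 - p1:+.2f}, delta recall: {r2 - r1:+.2f}")
```

In this toy case, requiring agreement between the two data types removes every wrong prediction but also discards a correct one, which is the trade-off the two panels describe.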
correct than those based on only one type of result [44–46].
As Figure 5 shows, this is also true of the five datasets we
analyzed above. When we attempted to re-predict the GO
‘biological process’ annotations of the genes in the datasets, precision improved (i.e. the predictions were more
often correct), but recall (i.e. the number of correct
predictions that are made) decreased almost as dramatically! In addition, the recall statistic only considers the
genes that are present in three or more datasets — a
substantial number of genes do not meet this criterion in
the first place. Genes that are present in multiple datasets
were indeed subsequently characterized with a higher
frequency (Figure 2b), again consistent with previous
analyses, but only about 10% (228/2248) were represented
in three or more datasets.
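Counting how many data sets each uncharacterized gene appears in is a simple tally, sketched below. The gene names and data set contents are toy placeholders; only the final line uses figures from the text (228 of 2248 genes).

```python
from collections import Counter

def dataset_coverage(genes, datasets):
    """Count, for each gene of interest, how many datasets contain it."""
    counts = Counter({g: 0 for g in genes})
    for members in datasets.values():
        counts.update(g for g in members if g in genes)
    return counts

def fraction_with_support(counts, minimum=3):
    """Fraction of genes present in at least `minimum` datasets."""
    return sum(1 for c in counts.values() if c >= minimum) / len(counts)

# Toy inputs (hypothetical gene names, NOT the published data).
genes = {"u1", "u2", "u3", "u4"}
datasets = {
    "TAP": {"u1", "u2"},
    "SGA": {"u1"},
    "FLG": {"u1", "u3"},
    "Y2H": {"u2", "u3"},
    "REG": {"u2", "k9"},
}

counts = dataset_coverage(genes, datasets)
print(dict(counts), fraction_with_support(counts))
print(f"fraction reported in the text: {228 / 2248:.1%}")
```

Genes that never appear keep a count of zero, so they stay in the denominator; dropping them would overstate coverage.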
This experience raises the obvious question: how often
do three or more sets of results from large-scale analysis
agree on the same function for an uncharacterized
gene? Unfortunately, this does not happen frequently;
Table 1
197 protein-coding genes whose existence is supported by
expression and/or conservation over evolution, but which
are completely uncharacterized in the Saccharomyces Genome
Database and which are not present in any of the major yeast
functional genomics data sets analyzed here.
YAL016C-B
YAL037C-A
YAL037W
YAL063C-A
YAL064C-A
YAL067W-A
YAR035C-A
YAR068W
YBL008W-A
YBL039W-A
YBL071C-B
YBL101W-C
YBL108C-A
YBL112C
YBR056W-A
YBR072C-A
YBR182C-A
YBR196C-A
YBR196C-B
YBR200W-A
YBR221W-A
YBR296C-A
YBR298C-A
YCL001W-A
YCL012C
YCL047C
YCR024C-B
YCR075W-A
YCR108C
YDL073W
YDL159W-A
YDL160C-A
YDL169C
YDR003W-A
YDR169C-A
YDR179W-A
YDR182W-A
YDR194W-A
YDR246W-A
YDR475C
YDR524C-B
YDR524W-A
YEL076C-A
YER038W-A
YER078W-A
YER085C
YER138W-A
YER175W-A
YER186W-A
YER188C-A
YFL041W-A
YFR012W-A
YFR032C-B
YGL006W-A
YGL007C-A
YGL041C-B
YGL159W
YGL188C-A
YGL218W
YGL258W-A
YGL262W
YGR035W-A
YGR121W-A
YGR127W
YGR146C-A
YGR169C-A
YGR174W-A
YGR204C-A
YGR240C-A
YHL015W-A
YHL048C-A
YHR007C-A
YHR022C-A
YHR050W-A
YHR086W-A
YHR175W-A
YHR199C-A
YHR212W-A
YHR213W-A
YHR213W-B
YHR214C-D
YHR214C-E
YIL002W-A
YIL014C-A
YIL046W-A
YIL102C
YIL134C-A
YIR018C-A
YIR021W-A
YJL047C-A
YJL052C-A
YJL077W-B
YJL127C-B
YJL133C-A
YJL136W-A
YJR005C-A
YJR039W
YJR112W-A
YJR151W-A
YKL033W-A
YKL068W-A
YKL084W
YKL096C-B
YKL106C-A
YKL138C-A
YKL183C-A
YKR095W-A
YLL006W-A
YLL066W-B
YLR012C
YLR154C-G
YLR154C-H
YLR154W-A
YLR154W-B
YLR154W-C
YLR154W-E
YLR154W-F
YLR156C-A
YLR157C-C
YLR157W-A
YLR157W-C
YLR159C-A
YLR159W
YLR161W
YLR162W
YLR162W-A
YLR264C-A
YLR285C-A
YLR307C-A
YLR312C-B
YLR342W-A
YLR361C-A
YLR406C-A
YLR412C-A
YLR466C-B
YML003W
YML054C-A
YML100W-A
YMR001C-A
YMR013W-A
YMR030W-A
YMR105W-A
YMR107W
YMR158C-A
YMR175W-A
YMR182W-A
YMR185W
YMR194C-B
YMR230W-A
YMR242W-A
YMR247W-A
YMR272W-B
YMR315W-A
YNL024C-A
YNL034W
YNL042W-B
YNL050C
YNL067W-B
YNL097C-A
YNL130C-A
YNL138W-A
YNL146C-A
YNL249C
YNL269W
YNL277W-A
YNR075C-A
YOL013W-B
YOL015W
YOL019W-A
YOL038C-A
YOL086W-A
YOL097W-A
YOL155W-A
YOL164W-A
YOL166W-A
YOR011W-A
YOR012W
YOR020W-A
YOR032W-A
YOR034C-A
YOR072W-B
YOR161C-C
YOR192C-C
YOR293C-A
YOR316C-A
YOR338W
YOR376W-A
YOR381W-A
YOR394C-A
YPL038W-A
YPL039W
YPL119C-A
YPL152W-A
YPL189C-A
YPR089W
YPR108W-A
YPR159C-A
only 122 uncharacterized genes can be considered to be
‘characterized’ by these five datasets, and 90 of the 122
uncharacterized genes (74%) fall into one of five general
categories (transcription, cell cycle, protein modification,
organelle organization and biogenesis, and DNA metabolism) (listed at http://hugheslab.med.utoronto.ca/Mitsakakis/).
We note that this traditional standard of
‘characterized’ is quite rigorous, because published conclusions are generally regarded highly until proven suspect. Certainly, if we treat the large-scale data as clues
rather than facts we can tolerate some error. Numerous
computational studies have shown that many more than
122 high-confidence functional predictions can be drawn
from data from essentially the same datasets we analyzed,
but with an associated P-value, rather than an outright
statement of fact [42,47,48]. Hence, while the datasets
produced by systematic analysis are instrumental for
providing clues to gene function, it seems clear that they
do not render obsolete the insight and efforts of the
biologist at the laboratory bench.
New data, new genes, new approaches
While many of the first functional genomics efforts were
not comprehensive, more recent resources and datasets
do tend to encompass a majority if not all of the yeast
genes [25,28,29,49]. This should in principle enable
clues regarding potential function of each yeast gene to
be uncovered, and we are likely to be more confident in
them as individual uncharacterized genes appear in more
and more datasets. Nonetheless, there are still many
genes for which clues may not be forthcoming. Table 1
lists 197 genes that 1) are completely uncharacterized in
GO, 2) are not present in any of the five datasets analyzed
above, or in the more recent reports on proteome-wide
protein localization [28,29,49], and 3) have no sequence
motifs suggestive of their function [50,51]. The majority
of these are genes that were not included in the initial
yeast gene count [10], but which were subsequently
identified by their expression or by comparative genomics
[6,7,52,53]. Hence, these may have been simply ignored
in most studies. However, we must entertain the possibility that some of the uncharacterized proteins will not
yield to the approaches of functional genomics, possibly
because they are involved in aspects of yeast life that we
are unable to probe in the laboratory.
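The three criteria used to compile Table 1 amount to a set filter, sketched below. The ORF names, set contents and function name are hypothetical placeholders; in practice the inputs would come from GO annotations, the data sets discussed above, and a motif search.

```python
def orphan_genes(all_genes, go_characterized, datasets, with_motifs):
    """Genes meeting all three criteria listed for Table 1."""
    in_any_dataset = set().union(*datasets.values())
    return sorted(
        g for g in all_genes
        if g not in go_characterized   # 1) no GO characterization
        and g not in in_any_dataset    # 2) absent from every dataset
        and g not in with_motifs       # 3) no informative sequence motif
    )

# Toy inputs (hypothetical ORF names, NOT the genes in Table 1).
all_genes = {"ORF1", "ORF2", "ORF3", "ORF4"}
go_characterized = {"ORF1"}
datasets = {"TAP": {"ORF2"}, "Y2H": {"ORF1"}}
with_motifs = {"ORF3"}

print(orphan_genes(all_genes, go_characterized, datasets, with_motifs))
```

Because a gene must pass every criterion, the list shrinks as new data sets are added, which is one way to track whether these orphans are being picked up by later studies.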
Conclusions
Both anecdotal and systematic examination of what is
known about the full set of yeast genes reveal that accumulation of facts does not lead automatically to convergence upon a global understanding of gene function.
However, it does appear that small-scale research (i.e.
individual efforts aimed at understanding the functions of
individual genes) benefits from large-scale research, and
the benefit is obviously reciprocated, as gold-standard
annotations form the basis for interpreting large-scale
data. This seems to be the making of a positive-feedback
loop. A major asset in this enterprise, which we did not
address in our analysis, is widespread acceptance in the
yeast community that determining all of the gene functions is an important short-term goal. Although the precise date the task will be completed may not be certain,
enthusiasm for the effort is undiminished, and we expect
there to be no rest until the job is done. We anticipate that
the widespread availability of mutant collections and
other resources that enable researchers to perform their
specialized assays comprehensively (and which in theory
could allow functional genomic experimentation on yeast
in the wild) will help reveal the purpose of all of the
orphan genes (perhaps by April 1, 2007!).
References and recommended reading
Papers of particular interest, published within the annual period of
review, have been highlighted as:
• of special interest
•• of outstanding interest
1. Schroedinger E: What is life? The Physical aspect of the living cell.
New York, NY: Macmillan Company; 1945.
2. Stent GS: The coming of the Golden Age; a view of the end of
progress. Garden City, NY: The Natural History Press; 1969.
3. Alberts B: Molecular biology of the cell, edn 4. New York: Garland
Science; 2002.
4. Lodish HF: Molecular cell biology, edn 5. New York: W.H. Freeman
and Company; 2004.
5. Collins FS, Green ED, Guttmacher AE, Guyer MS: A vision for the
future of genomics research. Nature 2003, 422:835-847.
6. Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J,
Waterston R, Cohen BA, Johnston M: Finding functional features
in Saccharomyces genomes by phylogenetic footprinting.
Science 2003, 301:71-76.
7. Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES:
Sequencing and comparison of yeast species to identify genes
and regulatory elements. Nature 2003, 423:241-254.
8. Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ,
Mattick JS, Haussler D: Ultraconserved elements in the
human genome. Science 2004, 304:1321-1325.
9. Dermitzakis ET, Reymond A, Scamuffa N, Ucla C, Kirkness E,
Rossier C, Antonarakis SE: Evolutionary discrimination of
mammalian conserved non-genic sequences (CNGs).
Science 2003, 302:1033-1035.
10. Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H,
Galibert F, Hoheisel JD, Jacq C, Johnston M et al.: Life with 6000
genes. Science 1996, 274:546, 563-567.
11. Consortium TCeS: Genome sequence of the nematode C.
elegans: a platform for investigating biology. Science 1998,
282:2012-2018.
12. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD,
Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF et al.:
The genome sequence of Drosophila melanogaster.
Science 2000, 287:2185-2195.
13. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J,
Devon K, Dewar K, Doyle M, FitzHugh W et al.: Initial sequencing
and analysis of the human genome. Nature 2001, 409:860-921.
14. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG,
Smith HO, Yandell M, Evans CA, Holt RA et al.: The sequence of
the human genome. Science 2001, 291:1304-1351.
16. Vaughan-Martini A, Martini A: Facts, myths and legends on the
prime industrial microorganism. J Ind Microbiol 1995,
14:514-522.
17. Meyen J: Jahresbericht uber die Resultate der Arbeiten im
Felde der physiologischen Botanik vom der Jahre 1837.
Wiegmann Archiv fur Naturgeschichte, Band 2 1838, 4:1-186.
18. Pasteur L: Nouvelles experiences pour demontrer que le germe
de la levure qui fait le vin provient de l’exterieur des grains de
raisin. Comptes Rendus de l’Academie des Sciences de Paris
1872, 75:781-796.
19. Broach JR, Pringle J, Jones EW: The molecular and cellular biology
of the yeast Saccharomyces. Plainview, N.Y.: Cold Spring Harbor
Laboratory Press; 1991.
20. Hodges PE, McKee AH, Davis BP, Payne WE, Garrels JI:
The Yeast Proteome Database (YPD): a model for the
organization and presentation of genome-wide functional
data. Nucleic Acids Res 1999, 27:69-73.
21. Mosch HU, Fink GR: Dissection of filamentous growth by
transposon mutagenesis in Saccharomyces cerevisiae.
Genetics 1997, 145:671-684.
22. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM,
Davis AP, Dolinski K, Dwight SS, Eppig JT et al.: Gene ontology:
tool for the unification of biology. The Gene Ontology
Consortium. Nat Genet 2000, 25:25-29.
23. Masselot M, De Robichon-Szulmajster H: Methionine
biosynthesis in Saccharomyces cerevisiae. I. Genetical
analysis of auxotrophic mutants. Mol Gen Genet 1975,
139:121-132.
24. Prakash L, Prakash S: Three additional genes involved in
pyrimidine dimer removal in Saccharomyces cerevisiae:
RAD7, RAD14 and MMS19. Mol Gen Genet 1979,
176:351-359.
25. Giaever G, Chu AM, Ni L, Connelly C, Riles L, Veronneau S, Dow S,
Lucau-Danila A, Anderson K, Andre B et al.: Functional profiling
of the Saccharomyces cerevisiae genome. Nature 2002,
418:387-391.
26. Kanemaki M, Sanchez-Diaz A, Gambus A, Labib K:
Functional proteomic identification of DNA replication
proteins by induced proteolysis in vivo. Nature 2003,
423:720-724.
27. Mnaimneh S, Davierwala AP, Haynes J, Moffat J, Peng WT,
Zhang W, Yang X, Pootoolal J, Chua G, Lopez A et al.: Exploration
of essential gene functions via titratable promoter alleles.
Cell 2004, 118:31-44.
28. Ghaemmaghami S, Huh WK, Bower K, Howson RW, Belle A,
Dephoure N, O’Shea EK, Weissman JS: Global analysis of
protein expression in yeast. Nature 2003, 425:737-741.
29. Huh WK, Falvo JV, Gerke LC, Carroll AS, Howson RW,
Weissman JS, O’Shea EK: Global analysis of protein
localization in budding yeast. Nature 2003, 425:686-691.
Carboxy-terminal green fluorescent protein tagged strains for each yeast
open reading frame were analyzed to determine subcellular localization
for 75% of the yeast proteome, highlighting ‘organellar proteomics’ of the
nucleolus and correspondence between localization and other functional
information.
30. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR,
Lockshon D, Narayan V, Srinivasan M, Pochart P et al.:
A comprehensive analysis of protein-protein interactions in
Saccharomyces cerevisiae. Nature 2000, 403:623-627.
31. DeRisi JL, Iyer VR, Brown PO: Exploring the metabolic and
genetic control of gene expression on a genomic scale.
Science 1997, 278:680-686.
32. Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R,
Armour CD, Bennett HA, Coffey E, Dai H, He YD et al.: Functional
discovery via a compendium of expression profiles. Cell 2000,
102:109-126.
15. Samuel D: Investigation of ancient Egyptian baking and
brewing methods by correlative microscopy. Science 1996,
273:488-490.
33. Grunenfelder B, Winzeler EA: Treasures and traps in
genome-wide data sets: case examples from yeast.
Nat Rev Genet 2002, 3:653-661.
34. Kumar A, Snyder M: Emerging technologies in yeast genomics.
Nat Rev Genet 2001, 2:302-312.
35. Castrillo JI, Oliver SG: Yeast as a touchstone in post-genomic
research: strategies for integrative analysis in functional
genomics. J Biochem Mol Biol 2004, 37:93-106.
36. Bader GD, Heilbut A, Andrews B, Tyers M, Hughes T, Boone C:
Functional genomics and proteomics: charting a
multidimensional map of the yeast cell. Trends Cell Biol 2003,
13:344-356.
37. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y:
A comprehensive two-hybrid analysis to explore the yeast
protein interactome. Proc Natl Acad Sci USA 2001,
98:4569-4574.
38. Tong AH, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz
GF, Brost RL, Chang M et al.: Global mapping of the yeast
genetic interaction network. Science 2004, 303:808-813.
Analysis of 132 systematic genetic screens shows that ‘network connectivity’ is predictive of function. As the average screen yields 34
interactions, the genetic interaction network is estimated to be four times
as dense as the physical interaction network.
39. Tong AH, Evangelista M, Parsons AB, Xu H, Bader GD, Page N,
Robinson M, Raghibizadeh S, Hogue CW, Bussey H et al.:
Systematic genetic analysis with ordered arrays of yeast
deletion mutants. Science 2001, 294:2364-2368.
40. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A,
Taylor P, Bennett K, Boutilier K et al.: Systematic identification of
protein complexes in Saccharomyces cerevisiae by mass
spectrometry. Nature 2002, 415:180-183.
41. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A,
Schultz J, Rick JM, Michon AM, Cruciat CM et al.: Functional
organization of the yeast proteome by systematic analysis of
protein complexes. Nature 2002, 415:141-147.
42. Wu LF, Hughes TR, Davierwala AP, Robinson MD, Stoughton R,
Altschuler SJ: Large-scale prediction of Saccharomyces
cerevisiae gene function using overlapping transcriptional
clusters. Nat Genet 2002, 31:255-265.
43. Dwight SS, Harris MA, Dolinski K, Ball CA, Binkley G, Christie KR,
Fisk DG, Issel-Tarver L, Schroeder M, Sherlock G et al.:
Saccharomyces Genome Database (SGD) provides secondary
gene annotation using the Gene Ontology (GO).
Nucleic Acids Res 2002, 30:69-72.
44. Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO,
Eisenberg D: A combined algorithm for genome-wide
prediction of protein function. Nature 1999, 402:83-86.
45. Ge H, Liu Z, Church GM, Vidal M: Correlation between
transcriptome and interactome mapping data from
Saccharomyces cerevisiae. Nat Genet 2001,
29:482-486.
46. Kemmeren P, van Berkum NL, Vilo J, Bijma T, Donders R,
Brazma A, Holstege FC: Protein interaction verification and
functional annotation by integrated analysis of genome-scale
data. Mol Cell 2002, 9:1133-1143.
47. Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, Zhang J, Sun S, Ling L,
Zhang N et al.: Topological structure analysis of the
protein–protein interaction network in budding yeast.
Nucleic Acids Res 2003, 31:2443-2450.
48. Tanay A, Sharan R, Kupiec M, Shamir R: Revealing modularity
and organization in the yeast molecular network by integrated
analysis of highly heterogeneous genomewide data. Proc Natl
Acad Sci USA 2004, 101:2981-2986.
49. Kumar A, Agarwal S, Heyman JA, Matson S, Heidtman M,
Piccirillo S, Umansky L, Drawid A, Jansen R, Liu Y et al.:
Subcellular localization of the yeast proteome.
Genes Dev 2002, 16:707-719.
50. Letunic I, Copley RR, Schmidt S, Ciccarelli FD, Doerks T,
Schultz J, Ponting CP, Bork P: SMART 4.0: towards genomic
data integration. Nucleic Acids Res 2004, 32 Database
issue:D142–144.
51. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S,
Khanna A, Marshall M, Moxon S, Sonnhammer EL et al.:
The Pfam protein families database. Nucleic Acids Res 2004,
32 Database issue:D138–141.
52. Oshiro G, Wodicka LM, Washburn MP, Yates JR III, Lockhart DJ,
Winzeler EA: Parallel identification of new genes in
Saccharomyces cerevisiae. Genome Res 2002,
12:1210-1220.
53. Kumar A, Harrison PM, Cheung KH, Lan N, Echols N, Bertone P,
Miller P, Gerstein MB, Snyder M: An integrated approach for
finding overlooked genes in yeast. Nat Biotechnol 2002,
20:58-63.