insight overview

Genomes for medicine
David R. Bentley
The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK (e-mail: [email protected])
We have the human genome sequence. It is freely available, accurate and nearly complete. But is the genome
ready for medicine? The new resource is already changing genetic research strategies to find information of
medical value. Now we need high-quality annotation of all the functionally important sequences and the
variations within them that contribute to health and disease. To achieve this, we need more genome
sequences, systematic experimental analyses, and extensive information on human phenotypes. Flexible and
user-friendly access to well-annotated genomes will create an environment for innovation, and the potential
for unlimited use of sequencing in biomedical research and practice.
The DNA sequence of Homo sapiens is freely available and in the public domain in order to encourage research and development and to maximize
its benefit to society. The commitment to this
goal was made by all participants of the Human
Genome Project in the Bermuda Statement in February 1996
(refs 1, 2). The successful outcome represents a remarkable
achievement brought about by international cooperation,
scientific excellence and altruism. It has been accompanied
by striking technological progress — in DNA manipulation, automation of biological processes and computational
methods for handling very large and complex sets of data.
The key to the success of this endeavour lies in the fundamental nature of the information itself. No other project
could produce a single data set that encompasses the genetic
basis of being human. In a tidal wave of optimism, similar
principles have been extended to other genomes (ref. 2), with the
result that we now have an unrivalled foundation for biological research in the future: all the genetic information used
to make humans (refs 3–5), rodents (refs 6, 7), flies (ref. 8), worms (ref. 9), plants (ref. 10), bacteria (refs 11–13) and many more. This is a revolution in knowledge
that promises to change our way of thinking.
The foresight of sequencing a complete genome (as
opposed to cataloguing all the available messenger RNAs)
came in recognizing that a genome is more than a bundle of
genes: the organization of genes in the context of surrounding information in the rest of the DNA might be important.
It was worth characterizing everything precisely because we
did not understand it. Some thought that it was premature
to write noncoding DNA off as ‘junk’, and so it has proved.
An enormous amount of functionally important information is now being found in addition to the protein-coding
sequences. The human genome sequence contains noncoding RNA genes, regulatory sequences and structural
motifs; it maintains short-range and long-range spatial
organization of sequences; and it contains important evolutionary information. The genome sequence also provides a
record of the natural chromosomal organization of genetic
material. Given that all this information is in the sequence,
it is important not to miss anything. Only by going systematically along each chromosome from end to end could
every piece of information be captured with certainty. The
realization that this was possible for any organism gave rise
to the era of large-scale genome sequencing, which started
in earnest only ten years ago.
The essential properties of the human genome sequence
— accuracy and completeness — reflect intrinsic features of
the molecular structure of DNA (ref. 14). Physically, the genome is
a continuous thread on each chromosome, with the identity
and order of each base determined unequivocally by the
atomic structure of the double helix. The information
stored within it is digital and can therefore be decoded
unambiguously. The current version of the genome
sequence (‘build 34’, http://ncbi.nlm.nih.gov) comprises
2.84 gigabases (Gb); it is more than 99.995% accurate and
covers 99% of the euchromatin (ref. 5), and therefore constitutes a
high-quality reference for future work. A few parts of the
thread remain intractable to current techniques and are
targets for further research. They comprise the highly
repetitive sequence of the heterochromatin plus the few
remaining gaps in the euchromatin, many of which are
believed to be rich in G+C content or repetitive sequence.
Is the genome ready for medicine?
Disease is a malfunction of the human body. It is caused by
one or more internal changes, usually in combination with
external factors. The human genome sequence will have a
profound impact on our understanding of diseases with
a genetic component. Ultimately it should be possible to
examine an individual’s genetic make-up at any position in
the sequence, deduce a functional consequence, and make a
well-informed choice of medical action. This level of understanding requires a detailed knowledge of the information
in the human genome. How will this be achieved?
With a near-complete sequence in hand, we can draw
together all the available human genetic information for the
first time. About 22,000 protein-coding genes have been
identified so far (http://www.ensembl.org/Homo_sapiens),
and the sequence continues to be searched methodically for
clues to any previously undiscovered members of a particular protein family. For example, we can list every protein
kinase or transcription factor, or zoom in on a chromosomal
region of interest and pick off every gene. Complete gene
lists will enhance any experiment to search for medically
important genes. With every human gene on an expression
microarray, we can investigate the full extent of transcriptional changes that accompany tumorigenesis. We can use
DNA arrays or comparative genome hybridization to search
the entire genome of an individual for the germline variants,
such as insertions, deletions, amplifications or translocations, that might be associated with phenotypes such as
mental disorder or congenital abnormality. The same
approach can be used to determine the combination of
germline variants and somatic changes that
might lead to cancer (see review in this issue by Strausberg
et al., page 469).
We believe that the genome contains over ten million
‘common’ polymorphic sites (sites where the minor allele is
present at a frequency of at least 1% in the human population) and an
almost unlimited number of rarer variants (ref. 15). So far (as at
©2004 Nature Publishing Group
NATURE | VOL 429 | 27 MAY 2004 | www.nature.com/nature
February 2004), more than seven million variants, most of which are
single-nucleotide polymorphisms (SNPs), have been catalogued in
the public databases and mapped on the genome sequence (refs 16–18)
(http://www.ncbi.nlm.nih.gov/SNP). We can examine each gene for
variants that alter protein-coding sequence or splice sites, and test
them directly to determine their functional significance. We can
select polymorphisms for use as genetic markers, download the
flanking sequence, develop experimental assays to determine the
genotype of individual DNA samples, and search for associations
with disease.
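The 1% threshold used above to define a ‘common’ polymorphic site can be made concrete with a small calculation. The sketch below assumes a biallelic site and works from genotype counts; the function names and the example counts are hypothetical and are not drawn from any of the public databases.

```python
def minor_allele_frequency(n_ref_hom, n_het, n_alt_hom):
    """Return the minor allele frequency at a biallelic site.

    Each diploid individual carries two alleles, so the total allele
    count is twice the number of genotyped individuals.
    """
    total_alleles = 2 * (n_ref_hom + n_het + n_alt_hom)
    if total_alleles == 0:
        raise ValueError("no genotypes supplied")
    alt_alleles = 2 * n_alt_hom + n_het
    alt_freq = alt_alleles / total_alleles
    # The minor allele is whichever of the two alleles is rarer.
    return min(alt_freq, 1.0 - alt_freq)


def is_common(maf, threshold=0.01):
    """Apply the 1% cut-off for a 'common' polymorphism used in the text."""
    return maf >= threshold


# Hypothetical counts: 950 reference homozygotes, 48 heterozygotes,
# 2 alternative homozygotes.
maf = minor_allele_frequency(950, 48, 2)
print(f"MAF = {maf:.3f}, common = {is_common(maf)}")
```

In practice the estimate is only as good as the sample: a site observed in a few dozen individuals can easily fall on the wrong side of the threshold by chance.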
More than 1,400 human genes have been correlated directly with
disease (data from Online Mendelian Inheritance in Man at
http://www.ncbi.nlm.nih.gov/Entrez). In general these are single-gene disorders. Almost every study has pinpointed a causative mutation and implicated a specific protein that is altered or absent in the
disease phenotype. Such discoveries can lead to the provision of a
precise predictive test, particularly for monogenic disorders such as
cystic fibrosis or Huntington’s disease, and also stimulate targeted
research towards an effective cure by correction or replacement of the
defective protein. The discovery that a chromosome translocation
creates a new gene structure (the abl-bcr gene) in chronic myeloid
leukaemia led to the development of the drug imatinib (Gleevec),
which binds specifically to the ABL-BCR protein and can alleviate the
leukaemia in patients for whom other treatments have failed (ref. 19).
Effective cures are not necessarily guaranteed, of course: the defective
protein might be inaccessible or impossible to replace, the defect might
be lethal too early in life for administration of treatment, or there
might be unexpected complications with the therapy. Despite early
success in the treatment of severe combined immunodeficiency by
replacement of the defective adenosine deaminase gene (ref. 20), it is proving
a challenge to continue translation into clinical practice (ref. 21), illustrating
that further research is needed to relate genomic data to patients.
What about the more complex diseases, such as diabetes, heart
disease, cancer or schizophrenia? We could build up the picture step
by step, discovering all the genetic variants and environmental factors
that contribute to the disease, and then work out all the permutations
that are significant — a monumental task. However, we might not
need to do this. If we identify the key pathways involved, we might be
able to pinpoint the most effective points for intervention on the
basis of biochemical knowledge, and avoid characterizing many of
the contributing risk factors. The new targets then become the focus
for the development of new drugs, leading to effective treatments.
To find a pathway, it is necessary to identify at least one of its
components — for example, an enzyme in a metabolic pathway, a
receptor or transducer in a signalling pathway, or a polymerase in a
DNA repair pathway. Genetic approaches to this problem can benefit
enormously from the human genome sequence. We can choose a
gene, pick polymorphic markers from the sequence, test them for
association with the disease, and then search the region for causative
variants. If the disease has a familial mode of inheritance, it might be
possible to use linkage analysis. If not, as has been true in most studies
of common disease so far, the alternative is to use a population-based
association study and look for an imbalance in allele frequencies of
a marker in a group of unrelated cases compared with a matched
control group (ref. 22). For example, an association study demonstrated
the protective effect of the 32-base deletion (Δ32) in the cytokine
receptor 5 (CKR5) gene against HIV-1 infection or AIDS progression (ref. 23). In an extensive gene-based survey, two SNPs in the LTA-3 gene
showed significant association with myocardial infarction (ref. 24). The
genome is ready for us to make a start, aided by continuing efforts to
complete the gene annotation, to expand the collections of variants
in the public databases and to characterize the patterns of common
variation in human population groups (ref. 18). These developments are
expected to help us dissect out the genetic basis of complex traits (see
reviews in this issue by Carlson et al. (page 446) and Bell (page 453)),
including variable drug response (see review in this issue by Evans
and Relling, page 464).
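The ‘imbalance in allele frequencies’ sought by a population-based association study can be illustrated with a simple test on a 2 × 2 table of allele counts. The sketch below uses a chi-square test with one degree of freedom and no continuity correction; the function name and the counts are hypothetical, and a real study would also have to address case–control matching, population stratification and multiple testing.

```python
import math


def allelic_association(case_a, case_b, control_a, control_b):
    """Chi-square test (1 degree of freedom) on a 2 x 2 table of allele counts.

    Returns (chi_square, p_value). No continuity correction is applied.
    """
    n = case_a + case_b + control_a + control_b
    row_cases = case_a + case_b
    row_controls = control_a + control_b
    col_a = case_a + control_a
    col_b = case_b + control_b
    if 0 in (row_cases, row_controls, col_a, col_b):
        raise ValueError("every row and column needs at least one allele")
    chi_square = (n * (case_a * control_b - case_b * control_a) ** 2
                  / (row_cases * row_controls * col_a * col_b))
    # For one degree of freedom the upper tail of the chi-square
    # distribution equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical allele counts (two alleles per person):
# 500 cases carry 260 copies of allele A and 740 of allele B;
# 500 controls carry 200 copies of A and 800 of B.
chi_square, p_value = allelic_association(260, 740, 200, 800)
print(f"chi-square = {chi_square:.2f}, p = {p_value:.4g}")
```

A small p-value for one marker is only a starting point: the associated marker need not be the causative variant, which is why the surrounding region is then searched.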
Where we fall short at present is in our understanding of physiological function or how sets of molecules work together, and in our
ability to infer this from the accumulated information linked to the
genome. We lack knowledge of the functional sequences outside
genes, as well as a detailed understanding of when and where genes
are expressed, and in response to what signals. We are unaware of the
biochemical functions of most proteins, and lack knowledge of most
of the interactions between cellular components. If we were to
acquire these information sets we would be able to go much further
using the genome. Finding a genetic target would then allow us to
jump immediately into a completely characterized biochemical
pathway, to understand the functional processes that are disrupted
by a particular mutation, and to develop measures to discern the
influence of non-genetic factors on these processes.
A few examples illustrate the roadblocks in understanding. The
gene for Huntington’s disease has been known for a decade. The location of the protein is known, and a specific alteration is sufficient to
cause the disease. However, we do not yet know why this causes the
pathology of the disease (refs 25, 26). One of the chromosomal regions associated with inflammatory bowel disease is 5q31 (ref. 27). Intense genetic
analysis using both linkage and population association studies has
narrowed the critical interval to 250 kilobases (kb). Known sequence
variants (many of which are tightly associated with the disease) have
been examined in detail as possible risk factors (ref. 28), and yet we need
more information before we can establish the mechanism of inflammatory bowel disease from this genetic association. In diabetes,
dominant-negative mutations in the gene encoding the nuclear
receptor peroxisome-proliferator-activated receptor-γ (PPAR-γ)
are associated with severe insulin resistance, and the antidiabetic
thiazolidinedione drugs have been shown to bind and activate
the receptor (ref. 29). However, despite compelling genetic evidence of the
involvement of PPAR-γ in glucose homeostasis, the mechanism by
which the gene contributes to insulin sensitivity and glucose homeostasis is not understood. We cannot use the information we have on
PPAR-γ to develop a better understanding of the role of adipocyte
differentiation in diabetes or to find new targets for intervention.
How can we gain new information systematically, improve our
knowledge of the genome, and remove roadblocks like these?
Discovering all the functional information in the genome
The finite number of protein-coding genes hides a much greater
diversity and extent of functional information in the human genome,
most of which is still to be discovered. For example, alternative splicing allows multiple functions encoded by the same gene to be selected
in a cell-specific manner (see ref. 30, for example). Multiple promoters can confer a diversity of inducible responses and substrate specificities on the same gene (see ref. 31, for example). Annotation of this
degree of coding diversity in human genes is at a very early stage. It
should also be noted that, as yet, only two-thirds of the genes have a
canonical structure with an open reading frame. Developing the annotation at these levels requires manual curation and more experimental
data, particularly to find many of the 5′ ends and translation initiation codons of the genes. We are becoming aware of the existence of
new classes of RNA genes. In addition to the well-known ribosomal,
transfer and small nuclear RNAs, evidence is accumulating for the
existence of other RNAs, such as antisense RNAs and microRNAs (refs 32, 33),
both of which might be involved in gene regulation. MicroRNAs are
19–25-nucleotide products formed by the cleavage of precursor hairpins and might influence the translational activity or stability of
mRNAs (refs 34, 35). Antisense RNAs act mainly by disrupting translation
after hybridization to the sense mRNA (ref. 36). These discoveries reveal a
new complexity of expressed information that is encoded in the
genome, the extent of which is unknown.
The use of genomic information by each cell is governed by the
interaction of multiple proteins with regulatory sequences that act as
signal processors. As a result, a response is initiated that takes into
account all the information received from either inside or outside the
cell (refs 37, 38). When analysing genome sequence, it is much harder to recognize regulatory sequences than protein-coding sequences, because
the rules are more complex and less obvious. Yet, like the protein-coding regions, many regulatory sequences have been conserved
during evolution, allowing us to use information from other organisms to try to find these functionally important elements of the
human genome. Gene regulation is also governed by modifications
to the DNA sequence or epigenetic changes (see review in this issue by
Egger et al., page 457), and an important adjunct to the genome
sequence is to study patterns of differential methylation of genomic
DNA and how they affect gene expression (ref. 39).
Not everything is to do with genes; much of the information in the
genome sequence must be considered in the larger context of the
chromosomes. Human chromosomes undergo replication and segregation by an intricate and controlled process. In yeast and other
organisms, replication origins are sequence specific. It is possible that
human replication origins also contain functionally important DNA
sequence motifs, but we cannot recognize them yet. Important
regions such as centromeres lie in highly repetitive heterochromatin;
thus, we have not yet sequenced through a human centromere or
determined the exact sequences needed for proper chromosome segregation. It is not known whether sequence motifs are involved in
chromosome pairing or chiasma formation and crossing over during
meiosis. Finally, DNA sequences might be involved in the positioning
of chromosomes in the three-dimensional space of the nucleus;
sequences on different chromosomes might be brought near to each
other, leading to translocations and disease (ref. 40). How will we discover all
the functional information stored in the human genome sequence?
Comparing genome sequences
Comparing whole genome sequences between species will make an
important contribution to high-quality annotation, although by
definition it will only reveal features that have survived during
evolution, as opposed to features that are species specific. The first
comparison of two mammalian genomes (human and mouse)
revealed that where the two sequences can be aligned there is more
than 68% nucleotide identity (ref. 6), too much for a very precise definition of conserved motifs. However, by comparing the variable degree
of conservation in the aligned sequences, it was possible to deduce
that about 5% of the human genome seems to be under selection
to conserve features in common with the mouse (‘purifying’ selection; ref. 6). As expected, these features include most of the protein-coding
exons (1.5% of the genome) and additional untranslated regions of
the genes (1%). It is a high priority to target the other 2.5% in the
search for new functionally important sequences.
Adding genome sequences from more organisms will greatly
improve the analysis. In a comparison of 1.8 million bases of finished
sequence of human chromosome 7 and that of 11 other vertebrates
(an analysis termed ‘phylogenetic footprinting’), 228 ‘multi-species
conserved sites’ (MCSs) overlapped virtually all the protein-coding
exons, whereas another 966 were non-exonic (refs 41, 42). Extrapolating this
analysis to the whole genome would result in the detection of about
200,000 protein-coding exons (close to estimates based on other
analyses) and 1.5 million vertebrate-specific MCSs to help in
pinpointing possible regulatory elements that have been as well conserved during evolution as the protein-coding sequences. To
highlight conserved sequences that are specific to primates, a similar
approach would examine the variation between multiple primate
genome sequences as a function of the evolutionary distance between
them (‘phylogenetic shadowing’). A pilot study suggests that an analysis of about seven carefully selected primate genome sequences would
be sufficient to detect conserved signatures for a majority of protein-coding exons and many putative regulatory elements, including
examples that are exclusive to primates and the hominid lineage (ref. 43).
Not all functional sequences are likely to be conserved between
genomes. Comparisons within a genome sequence offer a complementary approach to the problem. For example, transcription-factor-binding
motifs (typically 6–10 base pairs in length; http://www.gene-regulation.com/pub/databases.html#transfac) occur in abundance
throughout the genome. Searching for single motifs alone is not very
informative. However, some well-studied promoters and other regulatory elements contain multiple motifs in short windows of genome
sequence (typically less than 400 bases), and these can be correlated
with experimental evidence (refs 38, 44) and added to the annotation.
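The idea of looking for several motifs within a short window, rather than for single motifs, can be sketched as a simple scan. The example below is a minimal illustration under strong assumptions: motifs are treated as exact strings on one strand (real binding sites are usually described by weight matrices and occur on both strands), and the motif strings, function names and toy sequence are arbitrary illustrations, not entries taken from any database.

```python
def find_motif_hits(sequence, motifs):
    """Return sorted (position, motif) pairs for exact matches on one strand."""
    hits = []
    for motif in motifs:
        start = sequence.find(motif)
        while start != -1:
            hits.append((start, motif))
            start = sequence.find(motif, start + 1)
    return sorted(hits)


def find_clusters(hits, window=400, min_distinct=3):
    """Report windows in which at least `min_distinct` different motifs occur.

    A window is anchored at each hit and extends `window` bases downstream.
    Returns a list of (window_start, window_end, motifs_found) tuples.
    """
    clusters = []
    for i, (start, _) in enumerate(hits):
        in_window = [(pos, m) for pos, m in hits[i:]
                     if pos + len(m) <= start + window]
        distinct = sorted({m for _, m in in_window})
        if len(distinct) >= min_distinct:
            end = max(pos + len(m) for pos, m in in_window)
            clusters.append((start, end, distinct))
    return clusters


# Toy sequence: three different motifs close together, then an isolated one.
motifs = ["TGACTCA", "GGGCGG", "CACGTG"]
sequence = ("A" * 50 + "TGACTCA" + "A" * 30 + "GGGCGG" + "A" * 40
            + "CACGTG" + "A" * 1000 + "TGACTCA")
hits = find_motif_hits(sequence, motifs)
for start, end, found in find_clusters(hits):
    print(f"candidate regulatory window {start}-{end}: {', '.join(found)}")
```

The isolated hit far downstream is ignored, which mirrors the point made above: a single motif is uninformative, whereas a cluster is a candidate worth checking against experimental evidence.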
Functional annotation
Experimental confirmation of protein-coding gene structures is
relatively straightforward. Ascertaining the structure and function
of each protein is more of a challenge. The three-dimensional structures of only 1,539 distinct human proteins have been determined
experimentally so far (http://www.rcsb.org/pdb), and currently the
function of only 6,000 human proteins is known with certainty.
Although the gene sequence does not generally provide the answer, it
serves as the platform for determining function. It is possible to start
with a gene sequence, disrupt it and correlate it with the resulting
phenotype; or to start with a disrupted phenotype and map the
mutation back to the sequence (‘reverse’ or ‘forward’ genetics,
respectively). Either approach relies on the nature of the disrupted
phenotype to determine the underlying biochemical function,
which is then assigned to the gene. We rely extensively on using
experimentally tractable model organisms for functional annotation. Accurate alignment of their corresponding genome sequences
is crucial for extrapolating the experimental findings from one
organism to another.
Understanding biological function also requires characterization
of the many interactions of proteins with each other and with other
cellular components. Large-scale screens for protein–protein interactions use methods involving yeast two-hybrid systems or immunoprecipitation of protein complexes with antibodies against native or
tagged proteins. These methods require considerable validation to
detect false positive and negative results, but the available experience
suggests that it is possible to detect interactions involved in key cellular functions (ref. 45).
Experimental approaches are also important in identifying and
validating sequences involved in gene regulation or chromatin function. In particular, methods are becoming available to recognize
regions of chromatin that are subject to histone modification or transcription factor binding, and to map them back to the genome
sequence (refs 38, 46). As high-quality experimental data sets of this kind
emerge, they will need to be integrated as part of the annotation of the
human genome in linked databases (see below).
Human sequence variation
Sequencing individual human genomes provides the richest source
of genetic information. Projects underway have begun by selecting
genes, exons or promoters for targeted sequencing in depth. On
average there are fewer variants in protein-coding than in other
sequences, and they tend to have lower minor allele frequencies. This
presumably reflects selection against some variants that are non-synonymous (that is, those that cause amino-acid changes). If an
excess in the rate of non-synonymous over synonymous substitution
is observed in a particular coding region, it can be taken as an indicator of positive (diversifying) selection. As we find more variation and
improve the sequence annotation, a fuller picture will emerge of the
variants that alter genome function, and hence those that contribute
to health and disease.
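The comparison of non-synonymous and synonymous substitution rates mentioned above is often summarized as a single ratio. The sketch below is a deliberately naive version: it divides raw counts of changes by the numbers of available sites and applies no correction for multiple substitutions at the same site. The function names, the tolerance and the example counts are hypothetical.

```python
def dn_ds_ratio(nonsyn_changes, nonsyn_sites, syn_changes, syn_sites):
    """Return (dN, dS, dN/dS) from raw counts for one coding region.

    dN and dS are proportions of changed sites; no correction for multiple
    substitutions at the same site is applied.
    """
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    dn = nonsyn_changes / nonsyn_sites
    ds = syn_changes / syn_sites
    if ds == 0:
        raise ValueError("dN/dS is undefined when no synonymous changes are seen")
    return dn, ds, dn / ds


def interpret(ratio, tolerance=0.1):
    """Crude reading of the ratio; real analyses use a statistical test."""
    if ratio > 1 + tolerance:
        return "excess of non-synonymous change (candidate positive selection)"
    if ratio < 1 - tolerance:
        return "deficit of non-synonymous change (consistent with purifying selection)"
    return "approximately neutral"


# Hypothetical counts for a 300-codon region (900 sites in total).
dn, ds, ratio = dn_ds_ratio(nonsyn_changes=6, nonsyn_sites=690,
                            syn_changes=9, syn_sites=210)
print(f"dN = {dn:.4f}, dS = {ds:.4f}, dN/dS = {ratio:.2f}: {interpret(ratio)}")
```

A ratio well below one, as in this invented example, is the pattern expected for most coding regions; it is the exceptions with a ratio above one that point to diversifying selection.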
To realize the benefits of genomic medicine fully, we should start
resequencing individual genomes in their entirety. With the benefit
of emerging new technologies (reviewed in refs 47, 48), it is feasible to
consider generating gigabases of data as short sequence reads (for
example, 25–50 bases each) in a single experiment, and assembling
the data accurately using the existing finished sequence as a template.
The first 200 haploid human genome sequences will give us an in-depth view of human sequence variation (with a 99% chance of
detecting variants at an allele frequency of at least 1%; see table 2 in
ref. 15). Comparison of these data sets will give us a full profile of
common germline variation along each chromosome — something
that is impossible to see with any of the current resequencing programmes. Detecting every variant (subject to the sensitivity and
accuracy of the method) would permit the precise estimation of recombination rates and correlation with variability along chromosomes,
and would give an indication of the forces of natural selection acting
on the genome. This approach could be applied to give important
baseline information in healthy tissue to compare with cancer
genome sequences, and hence the ability to monitor DNA changes on
a genome-wide basis during tumour progression. The same complete data sets could be generated for genomes that are associated
with any other disease, with the prospect of removing the bias in
ascertainment that is currently a limitation of gene-specific and
regional investigations.
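How the number of resequenced genomes relates to the chance of seeing a variant can be explored with a back-of-envelope model. The sketch below assumes that each haploid genome is an independent draw from the population and that every variant present in a sequenced genome is detected; it is not necessarily the calculation behind the table cited from ref. 15, and the function names are hypothetical.

```python
import math


def detection_probability(allele_frequency, n_haploid_genomes):
    """Chance that an allele is seen at least once in the sample."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    return 1.0 - (1.0 - allele_frequency) ** n_haploid_genomes


def haploid_genomes_needed(allele_frequency, target_probability):
    """Smallest sample size that reaches the target detection probability."""
    if not 0.0 < allele_frequency < 1.0 or not 0.0 < target_probability < 1.0:
        raise ValueError("both arguments must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target_probability)
                     / math.log(1.0 - allele_frequency))


for freq in (0.01, 0.02, 0.05):
    p = detection_probability(freq, 200)
    print(f"frequency {freq:.2f}: detection probability {p:.3f} with 200 genomes")
```

Under these simplifying assumptions an allele at exactly 1% frequency is seen in 200 haploid genomes with a probability of roughly 0.87, and the probability passes 0.99 once the frequency reaches about 2.3%. The figure quoted from ref. 15 may rest on a different model or frequency range, so the sketch is best read as an illustration of the trade-off between sample size and allele frequency.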
Box 1
Translation of genomic information to future clinical practice
As the annotation of the human genome becomes stable, a user-friendly, distilled view can be developed, as in
the figure. The diagram (a) of a chromosome 3 region (12,300–12,450 kb, numbering as in build 34,
http://www.ensembl.org) contains the PPAR-γ gene structure (dark blue) with an alternative promoter (light
blue), hypothetical noncoding functional regions (green shaded boxes), and functional variants (red). Note that
introns in the gene structure are scaled down relative to the exons. Zooming in on two sequence segments (b)
shows the translated sequence with functional variants highlighted in blue (nucleotide changes) and pink
(amino-acid changes). Amino-acid numbering includes the propeptide sequence. The variants (c, pink) can be
viewed in the monomer protein structure (grey) in a linked database. Also shown is the binding position of an
antidiabetic thiazolidinedione drug (blue), part of the other monomeric unit (green) of the dimeric
receptor, and the ligand (yellow). Using linked information from a range of sources, a summary of the known,
modelled or predicted biological consequences (such as biochemical, structural, medical or pharmacological)
could be curated (and updated regularly) for each functional variant in tabular form (d). A small subset of
this information would define the disease or drug outcome or side effect associated with each variant, would
constitute specific risk information of value in clinical assessment, and would be exported (red outlined
boxes). For maximum usefulness, therefore, the exported information would be subject to stringent filters and
would include only data for which the medical relevance was well established for each particular disease
discipline. For example, variants of uncertain significance would be excluded from the filtered risk
information, although all data would be available in the public domain. All the information in a–d would be
curated in the public domain.
The use of personal genetic information in a clinical setting would be initiated or consented to by an
individual. The individual sequence acquired could be as little as one or more individual genotypes, or as
much as a complete genome sequence. The information would be private and owned by the individual, and might be
stored electronically, protected by a high-security code requiring unique personal identifiers (such as
multiple fingerprint identification) for access only with consent of the individual (e). The information might
be taken either before consultation (as illustrated here) or afterwards, and in either case would be subject
to counselling by the practitioner and consent by the individual.
A specific investigation would be initiated by a consultation (f). The personal genetic information would then
be supplied by the individual, for interpretation with respect to an agreed set of variants and/or a specific
phenotype. The practitioner would use the available risk information concerning each variant to provide a
genetic assessment for the individual (g). The top line refers to the variant featured in d and f; the second
line is a hypothetical entry for a variant on another chromosome and does not represent a known variant. In
the case illustrated, the individual has the heterozygous genotype TC at position 3: 12,450,610. This
corresponds to having both Pro 495 and Leu 495 forms of the protein PPAR-γ. This genotype confers an increased
risk of insulin-resistant diabetes on the individual, and also resistance to the thiazolidinedione class of
antidiabetic drugs (ref. 29). Combining this with risk information for other genotypes would help to inform
subsequent clinical decisions (h).
[Figure, panels a–h: a, genomic information (chromosome 3, 12,300–12,450 kb, PPAR-γ); b, sequence segments with functional variants (Gln 314 Pro, Arg 316 His, Val 318 Met, Pro 495 Leu); c, structural context and drug binding (thiazolidinedione); d, biological consequences (biochemical, structural, medical, pharmacological) for genome base 3: 12,450,610; e, personal genetic information (owned by individual); f, medical consultation (genetic counselling, patient consent, risk information selected); g, personal genetic assessment; h, clinical decision (order further tests, provide genetic counselling, recommend exposure avoidance, prescribe or change medication).]
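The flow described in this box, from curated public annotation through filtered risk information to a personal assessment, can be sketched as a small data structure. The example below is illustrative only: the class, field and function names are hypothetical, the first table entry follows the case illustrated in the box, and the second entry is an invented placeholder that does not describe a known variant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskEntry:
    chromosome: str
    position: int
    risk_genotype: str
    disease: str
    individual_risk: str
    other_effects: str
    well_established: bool


# Curated, public risk information. The first entry follows the illustrated
# case in the box; the second is an invented placeholder of uncertain
# significance and does not describe a known variant.
RISK_TABLE = [
    RiskEntry("3", 12_450_610, "TC", "Diabetes (type 2)",
              "insulin resistance", "thiazolidinedione resistance", True),
    RiskEntry("6", 149_031_974, "GA", "unspecified", "unknown", "unknown", False),
]


def exported_risk_information(table):
    """Keep only entries whose medical relevance is well established."""
    return [entry for entry in table if entry.well_established]


def assess(personal_genotypes, table):
    """Match an individual's genotypes against the filtered risk information.

    `personal_genotypes` maps (chromosome, position) to a genotype string
    and is supplied by the individual, with consent, for this consultation.
    """
    findings = []
    for entry in exported_risk_information(table):
        genotype = personal_genotypes.get((entry.chromosome, entry.position))
        if genotype is not None and sorted(genotype) == sorted(entry.risk_genotype):
            findings.append(entry)
    return findings


personal = {("3", 12_450_610): "TC", ("6", 149_031_974): "GG"}
for finding in assess(personal, RISK_TABLE):
    print(finding.disease, "|", finding.individual_risk, "|", finding.other_effects)
```

Keeping the stringent filter as a separate step from the matching reflects the separation drawn in the box between data that are merely public and data that are fit for clinical interpretation.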
From genotype to phenotype
Observation of medical phenotypes is at the heart of accurate diagnosis. In addition to following rigorously defined clinical criteria,
measurement of biochemical phenotype data helps considerably. For
example, in a prospective cohort, 1% of children tested positive for
antibodies against transglutaminase, a sensitive indicator for coeliac
disease, although nearly all were asymptomatic with respect to gluten
sensitivity. The disease incidence in adults is also 1% (ref. 49), indicating that we should revise our view of the onset of the condition,
adopt the new phenotype data for genotype–phenotype association
studies, and try to develop early preventive measures for those at risk.
Accurate measurement of molecular variables can also be used to
reclassify the disease (‘molecular taxonomy’) with greater phenotypic accuracy, as exemplified in the use of expression microarrays
to subdivide some cancer types in retrospectively collected cases (ref. 50).
These principles should be applied rigorously in characterizing
population sample collections to underpin genotype–phenotype
studies.
To find the major genetic factors in common disease, it is best to
base the study on retrospective sample collections. This is the only
way of obtaining enough cases of a well-defined phenotype to provide sufficient statistical power for the study. By contrast, prospective
collection within the general population is appropriate for an unbiased sampling and measurement of environmental exposures before
the onset of a specific condition. The challenge is to collect sufficient
phenotype and exposure data on enough samples for statistically significant correlations. Studying environmental factors will enable us
to explain more of the variance of common disease. Some genetic
effects will be detected only in conjunction with environmental
exposures. We can use genetic analysis to identify a susceptible or
resistant subpopulation so as to strengthen the power to detect association with the environmental factors. We can learn how to provide
individualized prevention advice based on genotypes. Correlations
can be made on the basis of systematically collected, accurate data on
factors such as nutrition, environmental toxins, exercise and lifestyle,
with detailed information on individual genetic make-up. This
would provide information for identifying avoidable risk, protective
effects (either individually or on a population basis) for improved
awareness, and recommendations for the implementation of health
standards, food production, waste exposure and so on. Examples of
regional and national programmes include the Avon Longitudinal
Study of Parents and Children (http://www.alspac.bris.ac.uk), the
Framingham study (http://www.nhlbi.nih.gov/about/framingham),
the Estonia Genome Project (http://www.geenivaramu.ee), and the
deCODE programme in Iceland (http://www.decode.com), whose isolated
population arose from relatively few founders and exhibits elevated levels of association of certain rare phenotypes. Studies of large outbred populations, such as those in the United Kingdom and the United
States, are also under development (http://www.ukbiobank.ac.uk;
see commentary in this issue by Collins, page 475). There is a tremendous opportunity for these projects to stimulate public engagement
at a new level. This is essential if these projects are to survive in the
long term and if the general public are to embrace the new era of
genetics.
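The point about statistical power can be made concrete with a rough calculation. The following is a minimal sketch, not a method taken from this article: it assumes a simple comparison of allele frequencies between equal numbers of cases and controls, uses the standard normal-approximation formula for comparing two proportions, and treats each person's two alleles as independent. The function name `cases_needed` and the example frequencies and odds ratios are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def cases_needed(p_control: float, odds_ratio: float,
                 alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate number of cases (with an equal number of controls)
    needed to detect an allele-frequency difference at the given
    two-sided significance level and power."""
    # Allele frequency in cases implied by the allelic odds ratio.
    odds_case = odds_ratio * p_control / (1 - p_control)
    p_case = odds_case / (1 + odds_case)

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)

    # Standard two-proportion sample-size formula (alleles per group).
    p_bar = (p_case + p_control) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_case * (1 - p_case) + p_control * (1 - p_control))
    ) ** 2
    alleles_per_group = numerator / (p_case - p_control) ** 2

    # Each individual contributes two alleles (assumed independent).
    return ceil(alleles_per_group / 2)


if __name__ == "__main__":
    # Hypothetical variant with a 20% allele frequency in controls.
    for odds_ratio in (2.0, 1.5, 1.2):
        print(odds_ratio, cases_needed(0.2, odds_ratio))
```

Under these assumptions, the required number of cases rises steeply as the effect of a variant becomes weaker, which is why large collections of cases with a well-defined phenotype matter so much for common disease.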
Seeing is believing
The need to view, analyse and download genomic and functional
data presents a formidable challenge. We need to have access to an
enormous, complex and continually evolving body of information.
At the same time, the displays must be flexible and user-friendly,
allowing appropriate subsets of the data to be viewed clearly from any
perspective (Box 1). For some model organisms, such as worm, fly
and yeast, there is a single comprehensive database, but this seems to
be impracticable for humans, given the scale and complexity of the
data sets. Instead, there are genome ‘browsers’ at several sites, which
gather data from multiple locations and layer it on the human genome
sequence (see http://genome.ucsc.edu; http://www.ensembl.org;
http://ncbi.nlm.nih.gov). Personal data sets can be added to the display
as separate ‘tracks’ and viewed either publicly or privately anywhere
in the world by using DAS, the distributed annotation system (ref. 51).
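The idea of layering separate 'tracks' on a shared genome coordinate system can be illustrated with a small data structure. This is a sketch under stated assumptions and is not the DAS protocol or the interface of any of the browsers named above: the `Feature` class, the `features_in_region` function, the track names and all coordinates are hypothetical, chosen only to show how a public annotation track and a personal track might be viewed together for one region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    chrom: str
    start: int  # 1-based, inclusive (an assumption of this sketch)
    end: int    # 1-based, inclusive
    label: str


def features_in_region(tracks, chrom, start, end):
    """Return, for each track, the features that overlap chrom:start-end."""
    view = {}
    for name, features in tracks.items():
        view[name] = [
            f for f in features
            if f.chrom == chrom and f.start <= end and f.end >= start
        ]
    return view


# Hypothetical tracks: one shared annotation track, one personal data set.
tracks = {
    "curated_genes": [
        Feature("chr20", 1000, 5000, "geneA"),
        Feature("chr20", 9000, 12000, "geneB"),
    ],
    "my_private_variants": [
        Feature("chr20", 4200, 4200, "variant1"),
        Feature("chr13", 300, 300, "variant2"),
    ],
}

if __name__ == "__main__":
    for name, hits in features_in_region(tracks, "chr20", 4000, 9500).items():
        print(name, [f.label for f in hits])
```

Keeping each data set as its own track, keyed only by genome coordinates, is what lets data held at different sites be combined in one display without merging the underlying databases.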
How will this evolve? At present, views of genome annotation
include the supporting raw data — for example, individual expressed
sequence tags, complementary DNA sequences, and protein homology
matches can all be viewed underneath a gene structure that is computed automatically. This allows the individual user to judge the
strength of the automatic prediction before using the information in
further research. However, the automatic annotation inevitably
includes unresolved conflicts, incomplete gene structures, missing
data and errors. Two things are needed to make the genome more
accessible: a stable, reliable core annotation, and a simple, distilled
view of the genome for easy use in medical (and other) situations.
Curated gene structures are already part of the annotation of the published human chromosomes (see refs 17, 52–59 and articles in this
issue of Nature by Deloukas et al. (page 375) and Humphray et al.
(page 369)) and are being displayed at a single dedicated site
(http://vega.sanger.ac.uk). This should replace the automated reprocessing system, which would otherwise overwrite the results of the
manual curation. Stable gene structures and similar information
should become core annotation that is adopted as a single ‘gold standard’. The core annotation should expand to include noncoding RNA
genes, alternative transcripts, promoters and regulatory elements.
This will be complemented by links to the specialist data sets (such as
protein structures), which are best maintained by expert groups at
local sites.
Medical sequencing
Medicine already benefits from the human genome sequence. Using
the genome in basic research to help understand the cause of disease
and variable response to toxins will be its most important application
to medicine. The emphasis is shifting from positional cloning to
screening candidate genes for disease-related variants. If necessary, it
will soon be possible to scale up the process to cover the whole
genome (or all genes), thus obviating the need for a prior hypothesis
to select candidate genes or regions. New genes offer new avenues for
diagnosis and intervention, and possibilities for translating the output of research into tangible improvements in health care. The other
articles in this issue explore some of the current directions of
genomic medicine.
The most striking future clinical application arising from the
reference human genome sequence is the possibility of unlimited
medical sequencing. For some years we have used targeted sequencing in diagnostic applications. For example, testing for mutations in
specific genes or exons is used to assist genetic counselling in diseases
such as cystic fibrosis or breast cancer. With easy access to a well-annotated human genome and cheap, accurate technology for
whole-genome sequencing, an individual could acquire either a
specific or a complete personal genetic health profile, including risk
and resistance factors. This information could be used to improve
and guide important medical decisions, to assess the risk of possible
future exposures, and to select preventive treatments for improved
health (Box 1). It would also provide important baseline information
for retesting later in life. Consider the possibility of obtaining a full
sequence on each case in the growing number of asbestos-induced
mesotheliomas (ref. 60), finding somatic mutations in a key gene, and using
the gene product as a target for the development of new anticancer
drugs. Similarly, sequencing pathogen-resistant human genomes
would provide new clues for countering our susceptibility to infectious disease, our biggest health burden worldwide. The use of new
individual sequence information anonymously for research purposes (as in the last two examples above) should be relatively free of
ethical constraints. However, it must be acknowledged that the
increasing use of personal genetic information for individual health
applications raises substantial ethical and moral issues. Personal
information must remain the property of the individual, shared in
confidence only if proper consent has been given, and then with qualified practitioners. New levels of genetic counselling must accompany
our practical ability to use the data so that, for example, risk information is used appropriately and is limited to guiding decisions for
which there is a potentially beneficial treatment or action available.
The human genome sequence was made freely available for
everyone to use. However, putting the genome sequence in the public domain also provides a golden opportunity for all scientists to
contribute their expertise and vision to the genome. Only by
continuing to make all the new data available will we develop a full
understanding of the genome and provide a fertile environment for
future innovation.
■
doi:10.1038/nature02622
1. Bentley, D. R. Genomic sequence information should be released immediately and freely in the public
domain. Science 274, 533–534 (1996).
2. Guyer, M. Statement on the rapid release of genomic DNA sequence. Genome Res. 8, 413 (1998).
3. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human
genome. Nature 409, 860–921 (2001).
4. Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
5. Rogers, J. The finished sequence of Homo sapiens. Cold Spring Harb. Symp. Quant. Biol. 68 (in the
press).
6. Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse
genome. Nature 420, 520–562 (2002).
7. Rat Genome Project Sequencing Consortium. Genome sequence of the Brown Norway rat yields
insights into mammalian evolution. Nature 428, 493–521 (2004).
8. Adams, M. D. et al. The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 (2000).
9. The C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for
investigating biology. Science 282, 2012–2018 (1998).
10. The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant
Arabidopsis thaliana. Nature 408, 796–815 (2000).
11. Blattner, F. R. et al. The complete genome sequence of Escherichia coli K-12. Science 277,
1453–1474 (1997).
12. Fleischmann, R. D. et al. Whole-genome random sequencing and assembly of Haemophilus influenzae
Rd. Science 269, 496–512 (1995).
13. Cole, S. T. et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome
sequence. Nature 393, 537–544 (1998).
14. Watson, J. D. & Crick, F. H. Molecular structure of nucleic acids; a structure for deoxyribose nucleic
acid. Nature 171, 737–738 (1953).
15. Kruglyak, L. & Nickerson, D. A. Variation is the spice of life. Nature Genet. 27, 234–236 (2001).
16. The International SNP Map Working Group. A map of human genome sequence variation
containing 1.42 million single nucleotide polymorphisms. Nature 409, 928–933 (2001).
17. Dunham, A. et al. The DNA sequence and analysis of human chromosome 13. Nature 428,
522–528 (2004).
18. The International HapMap Consortium. The International HapMap Project. Nature 426,
789–796 (2003).
19. Druker, B. J. Imatinib alone and in combination for chronic myeloid leukemia. Semin. Hematol. 40,
50–58 (2003).
20. Aiuti, A., Ficara, F., Cattaneo, F., Bordignon, C. & Roncarolo, M. G. Gene therapy for adenosine
deaminase deficiency. Curr. Opin. Allergy Clin. Immunol. 3, 461–466 (2003).
21. Hacein-Bey-Abina, S. et al. A serious adverse event after successful gene therapy for X-linked severe
combined immunodeficiency. N. Engl. J. Med. 348, 255–256 (2003).
22. Risch, N. J. Searching for genetic determinants in the new millennium. Nature 405, 847–856 (2000).
23. Dean, M. et al. Genetic restriction of HIV-1 infection and progression to AIDS by a deletion allele of
the CKR5 structural gene. Hemophilia Growth and Development Study, Multicenter AIDS Cohort
Study, Multicenter Hemophilia Cohort Study, San Francisco City Cohort, ALIVE Study. Science 273,
1856–1862 (1996).
24. Ozaki, K. et al. Functional SNPs in the lymphotoxin-α gene that are associated with susceptibility to
myocardial infarction. Nature Genet. 32, 650–654 (2002).
25. Van Dellen, A. & Hannan, A. J. Genetic and environmental factors in the pathogenesis of
Huntington’s disease. Neurogenetics 5, 9–17 (2004).
26. Georgiou-Karistianis, N. et al. Future directions in research with presymptomatic individuals
carrying the gene for Huntington’s disease. Brain Res. Bull. 59, 331–338 (2003).
27. Rioux, J. D. et al. Genetic variation in the 5q31 cytokine gene cluster confers susceptibility to Crohn
disease. Nature Genet. 29, 223–228 (2001).
28. Peltekova, V. D. et al. Functional variants of OCTN cation transporter genes are associated with
Crohn disease. Nature Genet. 36, 471–475 (2004).
29. Barroso, I. et al. Dominant negative mutations in human PPARγ associated with severe insulin
resistance, diabetes mellitus and hypertension. Nature 402, 880–883 (1999).
30. Orr-Urtreger, A. et al. Developmental localization of the splicing alternatives of fibroblast growth
factor receptor-2 (FGFR2). Dev. Biol. 158, 475–486 (1993).
31. Gong, Q. H. et al. Thirteen UDP glucuronosyltransferase genes are encoded at the human UGT1 gene
complex locus. Pharmacogenetics 11, 357–368 (2001).
32. Kampa, D. et al. Novel RNAs identified from an in-depth analysis of the transcriptome of human
chromosomes 21 and 22. Genome Res. 14, 331–342 (2004).
33. Collins, J. E. et al. Reevaluating human gene annotation: a second-generation analysis of
chromosome 22. Genome Res. 13, 27–36 (2003).
34. Calin, G. A. et al. Human microRNA genes are frequently located at fragile sites and genomic regions
involved in cancers. Proc. Natl Acad. Sci. USA 101, 2999–3004 (2004).
35. Ambros, V. et al. A uniform system for microRNA annotation. RNA 9, 277–279 (2003).
36. Lehner, B., Williams, G., Campbell, R. D. & Sanderson, C. M. Antisense transcripts in the human
genome. Trends Genet. 18, 63–65 (2002).
37. Davidson, E. H. et al. A genomic regulatory network for development. Science 295, 1669–1678
(2002).
38. Lee, T. I. et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298,
799–804 (2002).
39. Novik, K. L. et al. Epigenomics: genome-wide study of methylation phenomena. Curr. Issues Mol.
Biol. 4, 111–128 (2002).
40. Roix, J. J., McQueen, P. G., Munson, P. J., Parada, L. A. & Misteli, T. Spatial proximity of
translocation-prone gene loci in human lymphomas. Nature Genet. 34, 287–291 (2003).
41. Thomas, J. W. et al. Comparative analyses of multi-species sequences from targeted genomic regions.
Nature 424, 788–793 (2003).
42. Margulies, E. H., Blanchette, M., Haussler, D. & Green, E. D. Identification and characterization of
multi-species conserved sequences. Genome Res. 13, 2507–2518 (2003).
43. Boffelli, D. et al. Phylogenetic shadowing of primate sequences to find functional regions of the
human genome. Science 299, 1391–1394 (2003).
44. Pilpel, Y., Sudarsanam, P. & Church, G. M. Identifying regulatory networks by combinatorial analysis
of promoter elements. Nature Genet. 29, 153–159 (2001).
45. von Mering, C. et al. Comparative assessment of large-scale data sets of protein–protein interactions.
Nature 417, 399–403 (2002).
46. Bar-Joseph, Z. et al. Computational discovery of gene modules and regulatory networks. Nature
Biotechnol. 21, 1337–1342 (2003).
47. Smith, T. Whole genome variation analysis using single molecule sequencing. Targets (in the press).
48. Shendure, J., Mitra, R. D., Varma, C. & Church, G. M. Advanced sequencing technologies: methods
and goals. Nature Rev. Genet. 5, 335–344 (2004).
49. Bingley, P. J. et al. Undiagnosed coeliac disease at age seven: population based prospective birth cohort
study. Br. Med. J. 328, 322–323 (2004).
50. Golub, T. R. Genomic approaches to the pathogenesis of hematologic malignancy. Curr. Opin.
Hematol. 8, 252–261 (2001).
51. Dowell, R. D., Jokerst, R. M., Day, A., Eddy, S. R. & Stein, L. The distributed annotation system. BMC
Bioinformatics 2, 7 (2001).
52. Dunham, I. et al. The DNA sequence of human chromosome 22. Nature 402, 489–495 (1999).
53. Deloukas, P. et al. The DNA sequence and comparative analysis of human chromosome 20. Nature
414, 865–871 (2001).
54. Hattori, M. et al. The DNA sequence of human chromosome 21. Nature 405, 311–319 (2000).
55. Hillier, L. W. et al. The DNA sequence of human chromosome 7. Nature 424, 157–164 (2003).
56. Heilig, R. et al. The DNA sequence and analysis of human chromosome 14. Nature 421,
601–607 (2003).
57. Mungall, A. J. et al. The DNA sequence and analysis of human chromosome 6. Nature 425,
805–811 (2003).
58. Skaletsky, H. et al. The male-specific region of the human Y chromosome is a mosaic of discrete
sequence classes. Nature 423, 825–837 (2003).
59. Grimwood, J. et al. The DNA sequence and biology of human chromosome 19. Nature 428,
529–535 (2004).
60. Treasure, T., Waller, D., Swift, S. & Peto, J. Radical surgery for mesothelioma. Br. Med. J. 328,
237–238 (2004).
Acknowledgements I thank I. Barroso, A. Coffey, T. Cox, S. Grant, T. Hubbard, S. Hunt,
G. Leschziner, E. Margulies, K. Rice, J. Rogers, M. Ross, C. Shaw-Smith, R. Steward, M.
Stratton, C. Tyler-Smith and others for assistance, discussion and critical reading of the
manuscript. The author is supported financially by the Wellcome Trust.
Competing interests statement The author declares that he has no competing financial
interests.