insight overview

Genomes for medicine

David R. Bentley

The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK (e-mail: [email protected])

We have the human genome sequence. It is freely available, accurate and nearly complete. But is the genome ready for medicine? The new resource is already changing genetic research strategies to find information of medical value. Now we need high-quality annotation of all the functionally important sequences and the variations within them that contribute to health and disease. To achieve this, we need more genome sequences, systematic experimental analyses, and extensive information on human phenotypes. Flexible and user-friendly access to well-annotated genomes will create an environment for innovation, and the potential for unlimited use of sequencing in biomedical research and practice.

The DNA sequence of Homo sapiens is freely available and in the public domain in order to encourage research and development and to maximize its benefit to society. The commitment to this goal was made by all participants of the Human Genome Project in the Bermuda Statement in February 1996 (refs 1, 2). The successful outcome represents a remarkable achievement brought about by international cooperation, scientific excellence and altruism. It has been accompanied by striking technological progress in DNA manipulation, automation of biological processes and computational methods for handling very large and complex sets of data. The key to the success of this endeavour lies in the fundamental nature of the information itself. No other project could produce a single data set that encompasses the genetic basis of being human.
In a tidal wave of optimism, similar principles have been extended to other genomes2, with the result that we now have an unrivalled foundation for biological research in the future — all the genetic information used to make humans3–5, rodents6,7, flies8, worms9, plants10, bacteria11–13 and many more. This is a revolution in knowledge that promises to change our way of thinking. The foresight of sequencing a complete genome (as opposed to cataloguing all the available messenger RNAs) came in recognizing that a genome is more than a bundle of genes: the organization of genes in the context of surrounding information in the rest of the DNA might be important. It was worth characterizing everything precisely because we did not understand it. Some thought that it was premature to write noncoding DNA off as ‘junk’, and so it has proved. An enormous amount of functionally important information is now being found in addition to the protein-coding sequences. The human genome sequence contains noncoding RNA genes, regulatory sequences and structural motifs; it maintains short-range and long-range spatial organization of sequences; and it contains important evolutionary information. The genome sequence also provides a record of the natural chromosomal organization of genetic material. Given that all this information is in the sequence, it is important not to miss anything. Only by going systematically along each chromosome from end to end could every piece of information be captured with certainty. The realization that this was possible for any organism gave rise to the era of large-scale genome sequencing, which started in earnest only ten years ago. The essential properties of the human genome sequence — accuracy and completeness — reflect intrinsic features of the molecular structure of DNA14. Physically, the genome is a continuous thread on each chromosome, with the identity and order of each base determined unequivocally by the atomic structure of the double helix. 
The information stored within it is digital and can therefore be decoded unambiguously. The current version of the genome sequence (‘build 34’, http://ncbi.nlm.nih.gov) comprises 2.84 gigabases (Gb); it is more than 99.995% accurate and covers 99% of the euchromatin5, and therefore constitutes a high-quality reference for future work. A few parts of the thread remain intractable to current techniques and are targets for further research. They comprise the highly repetitive sequence of the heterochromatin plus the few remaining gaps in the euchromatin, many of which are believed to be rich in G+C content or repetitive sequence. Is the genome ready for medicine? Disease is a malfunction of the human body. It is caused by one or more internal changes, usually in combination with external factors. The human genome sequence will have a profound impact on our understanding of diseases with a genetic component. Ultimately it should be possible to examine an individual’s genetic make-up at any position in the sequence, deduce a functional consequence, and make a well-informed choice of medical action. This level of understanding requires a detailed knowledge of the information in the human genome. How will this be achieved? With a near-complete sequence in hand, we can draw together all the available human genetic information for the first time. About 22,000 protein-coding genes have been identified so far (http://www.ensembl.org/Homo_sapiens), and the sequence continues to be searched methodically for clues to any previously undiscovered members of a particular protein family. For example, we can list every protein kinase or transcription factor, or zoom in on a chromosomal region of interest and pick off every gene. Complete gene lists will enhance any experiment to search for medically important genes. With every human gene on an expression microarray, we can investigate the full extent of transcriptional changes that accompany tumorigenesis. 
We can use DNA arrays or comparative genome hybridization to search the entire genome of an individual for germline variants, such as insertions, deletions, amplifications or translocations, that might be associated with phenotypes such as mental disorder or congenital abnormality. The same approach can be used to determine the combination of germline variants and somatic changes that might lead to cancer (see review in this issue by Strausberg et al., page 469).

We believe that the genome contains over ten million ‘common’ polymorphic sites (sites where the minor allele is present at a frequency of at least 1% in the human population) and an almost unlimited number of rarer variants15. So far (as at February 2004), more than seven million variants, most of which are single-nucleotide polymorphisms (SNPs), have been catalogued in the public databases and mapped on the genome sequence16–18 (http://www.ncbi.nlm.nih.gov/SNP). We can examine each gene for variants that alter protein-coding sequence or splice sites, and test them directly to determine their functional significance. We can select polymorphisms for use as genetic markers, download the flanking sequence, develop experimental assays to determine the genotype of individual DNA samples, and search for associations with disease.

©2004 Nature Publishing Group. NATURE | VOL 429 | 27 MAY 2004 | www.nature.com/nature

More than 1,400 human genes have been correlated directly with disease (data from Online Mendelian Inheritance in Man at http://www.ncbi.nlm.nih.gov/Entrez). In general these are single-gene disorders. Almost every study has pinpointed a causative mutation and implicated a specific protein that is altered or absent in the disease phenotype.
Such discoveries can lead to the provision of a precise predictive test, particularly for monogenic disorders such as cystic fibrosis or Huntington’s disease, and also stimulate targeted research towards an effective cure by correction or replacement of the defective protein. The discovery that a chromosome translocation creates a new gene structure (the abl-bcr gene) in chronic myeloid leukaemia led to the development of the drug imatinib (Gleevec), which binds specifically to the ABL-BCR protein and can alleviate the leukaemia in patients for whom other treatments have failed19. Effective cures are not necessarily guaranteed, of course: the defective protein might be inaccessible or impossible to replace, the defect might be lethal too early in life for administration of treatment, or there might be unexpected complication with the therapy. Despite early success in the treatment of severe combined immunodeficiency by replacement of the defective adenosine deaminase gene20, it is proving a challenge to continue translation into clinical practice21, illustrating that further research is needed to relate genomic data to patients. What about the more complex diseases, such as diabetes, heart disease, cancer or schizophrenia? We could build up the picture step by step, discovering all the genetic variants and environmental factors that contribute to the disease, and then work out all the permutations that are significant — a monumental task. However, we might not need to do this. If we identify the key pathways involved, we might be able to pinpoint the most effective points for intervention on the basis of biochemical knowledge, and avoid characterizing many of the contributing risk factors. The new targets then become the focus for the development of new drugs, leading to effective treatments. 
To find a pathway, it is necessary to identify at least one of its components: for example, an enzyme in a metabolic pathway, a receptor or transducer in a signalling pathway, or a polymerase in a DNA repair pathway. Genetic approaches to this problem can benefit enormously from the human genome sequence. We can choose a gene, pick polymorphic markers from the sequence, test them for association with the disease, and then search the region for causative variants. If the disease has a familial mode of inheritance, it might be possible to use linkage analysis. If not, as has been true in most studies of common disease so far, the alternative is to use a population-based association study and look for an imbalance in allele frequencies of a marker in a group of unrelated cases compared with a matched control group22. For example, an association study demonstrated the protective effect of the 32-base deletion (Δ32) in the chemokine receptor 5 (CKR5) gene against HIV-1 infection or AIDS progression23. In an extensive gene-based survey, two SNPs in the LTA-3 gene showed significant association with myocardial infarction24.

The genome is ready for us to make a start, aided by continuing efforts to complete the gene annotation, to expand the collections of variants in the public databases and to characterize the patterns of common variation in human population groups18. These developments are expected to help us dissect out the genetic basis of complex traits (see reviews in this issue by Carlson et al. (page 446) and Bell (page 453)), including variable drug response (see review in this issue by Evans and Relling, page 464).

Where we fall short at present is in our understanding of physiological function or how sets of molecules work together, and in our ability to infer this from the accumulated information linked to the genome.
We lack knowledge of the functional sequences outside genes, as well as a detailed understanding of when and where genes are expressed, and in response to what signals. We are unaware of the biochemical functions of most proteins, and lack knowledge of most of the interactions between cellular components. If we were to acquire these information sets we would be able to go much further using the genome. Finding a genetic target would then allow us to jump immediately into a completely characterized biochemical pathway, to understand the functional processes that are disrupted by a particular mutation, and to develop measures to discern the influence of non-genetic factors on these processes.

A few examples illustrate the roadblocks in understanding. The gene for Huntington’s disease has been known for a decade. The location of the protein is known, and a specific alteration is sufficient to cause the disease. However, we do not yet know why this causes the pathology of the disease25,26. One of the chromosomal regions associated with inflammatory bowel disease is 5q31 (ref. 27). Intense genetic analysis using both linkage and population association studies has narrowed the critical interval to 250 kilobases (kb). Known sequence variants (many of which are tightly associated with the disease) have been examined in detail as possible risk factors28, and yet we need more information before we can establish the mechanism of inflammatory bowel disease from this genetic association. In diabetes, dominant-negative mutations in the gene encoding the nuclear receptor peroxisome-proliferator-activated receptor-γ (PPAR-γ) are associated with severe insulin resistance, and the antidiabetic thiazolidinedione drugs have been shown to bind and activate the receptor29. However, despite compelling genetic evidence of the involvement of PPAR-γ in glucose homeostasis, the mechanism by which the gene contributes to insulin sensitivity and glucose homeostasis is not understood.
We cannot use the information we have on PPAR-γ to develop a better understanding of the role of adipocyte differentiation in diabetes or to find new targets for intervention. How can we gain new information systematically, improve our knowledge of the genome, and remove roadblocks like this?

Discovering all the functional information in the genome

The finite number of protein-coding genes hides a much greater diversity and extent of functional information in the human genome, most of which is still to be discovered. For example, alternative splicing allows multiple functions encoded by the same gene to be selected in a cell-specific manner (see ref. 30, for example). Multiple promoters can confer a diversity of inducible responses and substrate specificities on the same gene (see ref. 31, for example). Annotation of this degree of coding diversity in human genes is at a very early stage. It should also be noted that, as yet, only two-thirds of the genes have a canonical structure with an open reading frame. Developing the annotation at these levels requires manual curation and more experimental data, particularly to find many of the 5′ ends and translation initiation codons of the genes.

We are becoming aware of the existence of new classes of RNA genes. In addition to the well-known ribosomal, transfer and small nuclear RNAs, evidence is accumulating for the existence of other RNAs, such as antisense RNAs and microRNAs32,33, both of which might be involved in gene regulation. MicroRNAs are 19–25-nucleotide products formed by the cleavage of precursor hairpins and might influence the translational activity or stability of mRNAs34,35. Antisense RNAs act mainly by disrupting translation after hybridization to the sense mRNA36. These discoveries reveal a new complexity of expressed information that is encoded in the genome, the extent of which is unknown.
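As a rough illustration of the hybridization described above, the sketch below checks where an antisense RNA could pair with a sense mRNA by searching for its reverse complement. It is a minimal sketch that assumes perfect Watson-Crick pairing only (real duplexes tolerate mismatches and G-U wobble pairs); the function names and both sequences are invented for illustration and do not correspond to any known transcript.

```python
from typing import List

# Watson-Crick pairing for RNA; wobble pairs are deliberately ignored (assumption).
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence written 5' to 3'."""
    return rna.upper().translate(COMPLEMENT)[::-1]


def antisense_target_sites(mrna: str, antisense: str) -> List[int]:
    """Return 0-based start positions in mrna where antisense pairs perfectly."""
    target = reverse_complement(antisense)
    mrna = mrna.upper()
    sites = []
    start = mrna.find(target)
    while start != -1:
        sites.append(start)
        start = mrna.find(target, start + 1)  # allow overlapping matches
    return sites


# Invented sequences, for illustration only.
mrna = "GGAUGCCAGUUUCGCUCCGUGGAGGCUUAA"
antisense = "CGGAGCGAAACUGG"

print(antisense_target_sites(mrna, antisense))
```

A search of this kind finds only exact complementarity, so it would miss many biologically relevant interactions; it is meant to show the principle, not to serve as a prediction method.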
The use of genomic information by each cell is governed by the interaction of multiple proteins with regulatory sequences that act as signal processors. As a result, a response is initiated that takes into account all the information received from either inside or outside the cell37,38. When analysing genome sequence, it is much harder to recognize regulatory sequences than protein-coding sequences, because the rules are more complex and less obvious. Yet, like the protein-coding regions, many regulatory sequences have been conserved during evolution, allowing us to use information from other organisms to try to find these functionally important elements of the human genome. Gene regulation is also governed by modifications to the DNA sequence or epigenetic changes (see review in this issue by Egger et al., page 457), and an important adjunct to the genome sequence is to study patterns of differential methylation of genomic DNA and how they affect gene expression39.

Not everything is to do with genes; much of the information in the genome sequence must be considered in the larger context of the chromosomes. Human chromosomes undergo replication and segregation by an intricate and controlled process. In yeast and other organisms, replication origins are sequence specific. It is possible that human replication origins also contain functionally important DNA sequence motifs, but we cannot recognize them yet. Important regions such as centromeres lie in highly repetitive heterochromatin; thus, we have not yet sequenced through a human centromere or determined the exact sequences needed for proper chromosome segregation. It is not known whether sequence motifs are involved in chromosome pairing or chiasma formation and crossing over during meiosis.
Finally, DNA sequences might be involved in the positioning of chromosomes in the three-dimensional space of the nucleus; sequences on different chromosomes might be brought near to each other, leading to translocations and disease40. How will we discover all the functional information stored in the human genome sequence? Comparing genome sequences Comparing whole genome sequences between species will make an important contribution to high-quality annotation, although by definition it will only reveal features that have survived during evolution, as opposed to features that are species specific. The first comparison of two mammalian genomes (human and mouse) revealed that where the two sequences can be aligned there is more than 68% nucleotide identity6 — too much for a very precise definition of conserved motifs. However, by comparing the variable degree of conservation in the aligned sequences, it was possible to deduce that about 5% of the human genome seems to be under selection to conserve features in common with the mouse (‘purifying’ selection)6. As expected, these features include most of the protein-coding exons (1.5% of the genome) and additional untranslated regions of the genes (1%). It is a high priority to target the other 2.5% in the search for new functionally important sequences. Adding genome sequences from more organisms will greatly improve the analysis. In a comparison of 1.8 million bases of finished sequence of human chromosome 7 and that of 11 other vertebrates (an analysis termed ‘phylogenetic footprinting’), 228 ‘multi-species conserved sites’ (MCSs) overlapped virtually all the protein-coding exons, whereas another 966 were non-exonic41,42. 
Extrapolating this analysis to the whole genome would result in the detection of about 200,000 protein-coding exons (close to estimates based on other analyses) and 1.5 million vertebrate-specific MCSs to help in pinpointing possible regulatory elements that have been as well conserved during evolution as the protein-coding sequences. To highlight conserved sequences that are specific to primates, a similar approach would examine the variation between multiple primate genome sequences as a function of the evolutionary distance between them (‘phylogenetic shadowing’). A pilot study suggests that an analysis of about seven carefully selected primate genome sequences would be sufficient to detect conserved signatures for a majority of protein-coding exons and many putative regulatory elements, including examples that are exclusive to primates and the hominid lineage43.

Not all functional sequences are likely to be conserved between genomes. Comparisons within a genome sequence offer a complementary approach to the problem. For example, transcription-factor-binding motifs (typically 6–10 base pairs in length; http://www.generegulation.com/pub/databases.html#transfac) occur in abundance throughout the genome. Searching for single motifs alone is not very informative. However, some well-studied promoters and other regulatory elements contain multiple motifs in short windows of genome sequence (typically less than 400 bases), and these can be correlated with experimental evidence38,44 and added to the annotation.

Functional annotation

Experimental confirmation of protein-coding gene structures is relatively straightforward. Ascertaining the structure and function of each protein is more of a challenge. The three-dimensional structures of only 1,539 distinct human proteins have been determined experimentally so far (http://www.rcsb.org/pdb), and currently the function of only 6,000 human proteins is known with certainty.
Although the gene sequence does not generally provide the answer, it serves as the platform for determining function. It is possible to start with a gene sequence, disrupt it and correlate it with the resulting phenotype; or to start with a disrupted phenotype and map the mutation back to the sequence (‘reverse’ and ‘forward’ genetics, respectively). Either approach relies on the nature of the disrupted phenotype to determine the underlying biochemical function, which is then assigned to the gene. We rely extensively on using experimentally tractable model organisms for functional annotation. Accurate alignment of their corresponding genome sequences is crucial for extrapolating the experimental findings from one organism to another.

Understanding biological function also requires characterization of the many interactions of proteins with each other and with other cellular components. Large-scale screens for protein–protein interactions use methods involving yeast two-hybrid systems or immunoprecipitation of protein complexes with antibodies against native or tagged proteins. These methods require considerable validation to detect false-positive and false-negative results, but the available experience suggests that it is possible to detect interactions involved in key cellular functions45.

Experimental approaches are also important in identifying and validating sequences involved in gene regulation or chromatin function. In particular, methods are becoming available to recognize regions of chromatin that are subject to histone modification or transcription factor binding, and to map them back to the genome sequence38,46. As high-quality experimental data sets of this kind emerge, they will need to be integrated as part of the annotation of the human genome in linked databases (see below).

Human sequence variation

Sequencing individual human genomes provides the richest source of genetic information.
Projects underway have begun by selecting genes, exons or promoters for targeted sequencing in depth. On average there are fewer variants in protein-coding than in other sequences, and they tend to have lower minor allele frequencies. This presumably reflects selection against some variants that are non-synonymous (that is, those that cause amino-acid changes). If an excess in the rate of non-synonymous over synonymous substitution is observed in a particular coding region, it can be taken as an indicator of positive (diversifying) selection. As we find more variation and improve the sequence annotation, a fuller picture will emerge of the variants that alter genome function, and hence those that contribute to health and disease.

To realize the benefits of genomic medicine fully, we should start resequencing individual genomes in their entirety. With the benefit of emerging new technologies (reviewed in refs 47, 48), it is feasible to consider generating gigabases of data as short sequence reads (for example, 25–50 bases each) in a single experiment, and assembling the data accurately using the existing finished sequence as a template. The first 200 haploid human genome sequences will give us an in-depth view of human sequence variation (with a 99% chance of detecting variants at an allele frequency of at least 1%; see table 2 in ref. 15). Comparison of these data sets will give us a full profile of common germline variation along each chromosome, something that is impossible to see with any of the current resequencing programmes. Detecting every variant (subject to the sensitivity and accuracy of the method) would permit the precise estimation of recombination rates and correlation with variability along chromosomes, and would give an indication of the forces of natural selection acting on the genome.
This approach could be applied to give important baseline information in healthy tissue to compare with cancer genome sequences, and hence the ability to monitor DNA changes on a genome-wide basis during tumour progression. The same complete data sets could be generated for genomes that are associated with any other disease, with the prospect of removing the bias in ascertainment that is currently a limitation of gene-specific and regional investigations.

Box 1 Translation of genomic information to future clinical practice

[Figure, panels a–h: (a) genomic information for the PPAR-γ region of chromosome 3, drawing on the Human Genome Project, human resequencing, functional annotation, other genomes and HapMap; (b) two sequence segments with translation; (c) structural context and drug binding (thiazolidinedione); (d) biological consequences (biochemical, structural, medical, pharmacological) of the variant at genome base 3: 12,450,610, genotype T/C, Pro/Leu; (e) personal genetic information (PGI), owned by the individual; (f) medical consultation, with genetic counselling, patient consent and risk information; (g) personal genetic assessment; (h) clinical decision: order further tests, provide genetic counselling, recommend exposure avoidance, prescribe or change medication. Functional variants shown: Gln 314 Pro, Arg 316 His, Val 318 Met, Pro 495 Leu.]

As the annotation of the human genome becomes stable, a user-friendly, distilled view can be developed, as in the figure. The diagram (a) of a chromosome 3 region (12,300–12,450 kb, numbering as in build 34, http://www.ensembl.org) contains the PPAR-γ gene structure (dark blue) with an alternative promoter (light blue), hypothetical noncoding functional regions (green shaded boxes), and functional variants (red). Note that introns in the gene structure are scaled down relative to the exons. Zooming in on two sequence segments (b) shows the translated sequence with functional variants highlighted in blue (nucleotide changes) and pink (amino-acid changes). Amino-acid numbering includes the propeptide sequence. The variants (c, pink) can be viewed in the monomer protein structure (grey) in a linked database. Also shown is the binding position of an antidiabetic thiazolidinedione drug (blue), part of the other monomeric unit (green) of the dimeric receptor, and the ligand (yellow). Using linked information from a range of sources, a summary of the known, modelled or predicted biological consequences (such as biochemical, structural, medical or pharmacological) could be curated (and updated regularly) for each functional variant in tabular form (d). A small subset of this information would define the disease or drug outcome or side effect associated with each variant, would constitute specific risk information of value in clinical assessment, and would be exported (red outlined boxes). For maximum usefulness, therefore, the exported information would be subject to stringent filters and would include only data for which the medical relevance was well established for each particular disease discipline. For example, variants of uncertain significance would be excluded from the filtered risk information, although all data would be available in the public domain. All the information in a–d would be curated in the public domain.

The use of personal genetic information in a clinical setting would be initiated or consented to by an individual. The individual sequence acquired could be as little as one or more individual genotypes, or as much as a complete genome sequence. The information would be private and owned by the individual, and might be stored electronically, protected by a high-security code requiring unique personal identifiers (such as multiple fingerprint identification) for access only with consent of the individual (e). The information might be taken either before consultation (as illustrated here) or afterwards, and in either case would be subject to counselling by the practitioner and consent by the individual.

A specific investigation would be initiated by a consultation (f). The personal genetic information would then be supplied by the individual, for interpretation with respect to an agreed set of variants and/or a specific phenotype. The practitioner would use the available risk information concerning each variant to provide a genetic assessment for the individual (g). The top line refers to the variant featured in d and f; the second line is a hypothetical entry for a variant on another chromosome and does not represent a known variant. In the case illustrated, the individual has the heterozygous genotype TC at position 3: 12,450,610. This corresponds to having both Pro 495 and Leu 495 forms of the protein PPAR-γ. This genotype confers an increased risk of insulin-resistant diabetes on the individual, and also resistance to the thiazolidinedione class of antidiabetic drugs29. Combining this with risk information for other genotypes would help to inform subsequent clinical decisions (h).

From genotype to phenotype

Observation of medical phenotypes is at the heart of accurate diagnosis. In addition to following rigorously defined clinical criteria, measurement of biochemical phenotype data helps considerably. For example, in a prospective cohort, 1% of children tested positive for antibodies against transglutaminase, a sensitive indicator for coeliac disease, although nearly all were asymptomatic with respect to gluten sensitivity. The disease incidence in adults is also 1% (ref. 49), indicating that we should revise our view of the onset of the condition, adopt the new phenotype data for genotype–phenotype association studies, and try to develop early preventive measures for those at risk. Accurate measurement of molecular variables can also be used to reclassify the disease (‘molecular taxonomy’) with greater phenotypic accuracy, as exemplified in the use of expression microarrays to subdivide some cancer types in retrospectively collected cases50. These principles should be applied rigorously in characterizing population sample collections to underpin genotype–phenotype studies.

To find the major genetic factors in common disease, it is best to base the study on retrospective sample collections. This is the only way of obtaining enough cases of a well-defined phenotype to provide sufficient statistical power for the study. By contrast, prospective collection within the general population is appropriate for an unbiased sampling and measurement of environmental exposures before the onset of a specific condition.
The challenge is to collect sufficient phenotype and exposure data on enough samples for statistically significant correlations.

Studying environmental factors will enable us to explain more of the variance of common disease. Some genetic effects will be detected only in conjunction with environmental exposures. We can use genetic analysis to identify a susceptible or resistant subpopulation so as to strengthen the power to detect association with the environmental factors. We can learn how to provide individualized prevention advice based on genotypes. Correlations can be made on the basis of systematically collected, accurate data on factors such as nutrition, environmental toxins, exercise and lifestyle, with detailed information on individual genetic make-up. This would provide information for identifying avoidable risk, protective effects (either individually or on a population basis) for improved awareness, and recommendations for the implementation of health standards, food production, waste exposure and so on.

Examples of regional and national programmes include the Avon Longitudinal Study of Parents and Children (http://www.alspac.bris.ac.uk), the Framingham study (http://www.nhlbi.nih.gov/about/framingham), the Estonia Genome Project (http://www.geenivaramu.ee), and the deCODE programme in Iceland (http://www.decode.com), which studies an isolated population that arose from relatively few founders and exhibits elevated levels of association of certain rare phenotypes. Studies of large outbred populations, such as those in the United Kingdom and the United States, are also under development (http://www.ukbiobank.ac.uk; see commentary in this issue by Collins, page 475). There is a tremendous opportunity for these projects to stimulate public engagement at a new level. This is essential if these projects are to survive in the long term and if the general public are to embrace the new era of genetics.
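
The point made earlier in this section, that identifying a susceptible subpopulation by genotype can strengthen the power to detect an environmental association, can be illustrated with a small calculation. The sketch below is only a toy example: the counts, the stratum names and the helper functions (`Table`, `odds_ratio`, `chi_square`, `pool`) are all assumptions introduced here for illustration and do not come from any study cited in this article. It compares a standard 2x2 case-control test for one exposure in the pooled sample with the same test carried out within each genotype stratum.

```python
"""Toy illustration: genotype stratification can sharpen an exposure signal.

All counts below are invented for illustration; they are not data from the
article or from any real cohort.
"""

from typing import Dict, NamedTuple


class Table(NamedTuple):
    """2x2 case-control table for one environmental exposure."""
    exposed_cases: int
    unexposed_cases: int
    exposed_controls: int
    unexposed_controls: int


def odds_ratio(t: Table) -> float:
    """Odds of exposure among cases divided by odds among controls."""
    return (t.exposed_cases * t.unexposed_controls) / (
        t.unexposed_cases * t.exposed_controls
    )


def chi_square(t: Table) -> float:
    """Pearson chi-square statistic (1 d.f., no continuity correction)."""
    a, b, c, d = t
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def pool(tables: Dict[str, Table]) -> Table:
    """Add the strata cell by cell, as if genotype had not been measured."""
    return Table(*(sum(cells) for cells in zip(*tables.values())))


# Hypothetical strata defined by genotype at a single susceptibility variant.
STRATA: Dict[str, Table] = {
    "susceptible": Table(60, 40, 30, 70),
    "resistant": Table(120, 280, 120, 280),
}

if __name__ == "__main__":
    for name, table in {**STRATA, "pooled": pool(STRATA)}.items():
        print(f"{name:12s} OR={odds_ratio(table):.2f} chi2={chi_square(table):.2f}")
```

With these invented counts, the pooled sample of 1,000 individuals gives an odds ratio of about 1.31 (chi-square about 4.1), whereas the 200 individuals in the susceptible stratum alone give an odds ratio of 3.5 (chi-square about 18.2), and the resistant stratum shows no association. A real analysis would need a formal test of interaction and correction for multiple testing; the sketch shows only the direction of the effect.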
Seeing is believing

The need to view, analyse and download genomic and functional data presents a formidable challenge. We need to have access to an enormous, complex and continually evolving body of information. At the same time, the displays must be flexible and user-friendly, allowing appropriate subsets of the data to be viewed clearly from any perspective (Box 1). For some model organisms, such as worm, fly and yeast, there is a single comprehensive database, but this seems to be impracticable for humans, given the scale and complexity of the data sets. Instead, there are genome ‘browsers’ at several sites, which gather data from multiple locations and layer it on the human genome sequence (see http://genome.ucsc.edu; http://www.ensembl.org; http://ncbi.nlm.nih.gov). Personal data sets can be added to the display as separate ‘tracks’ and viewed either publicly or privately anywhere in the world by using DAS, the distributed annotation system (ref. 51).

How will this evolve? At present, views of genome annotation include the supporting raw data: for example, individual expressed sequence tags, complementary DNA sequences, and protein homology matches can all be viewed underneath a gene structure that is computed automatically. This allows the individual user to judge the strength of the automatic prediction before using the information in further research. However, the automatic annotation inevitably includes unresolved conflicts, incomplete gene structures, missing data and errors.

Two things are needed to make the genome more accessible: a stable, reliable core annotation, and a simple, distilled view of the genome for easy use in medical (and other) situations. Curated gene structures are already part of the annotation of the published human chromosomes (see refs 17, 52–59 and articles in this issue of Nature by Deloukas et al. (page 375) and Humphray et al. (page 369)) and are being displayed at a single dedicated site (http://vega.sanger.ac.uk).
This should replace the automated reprocessing system, which would otherwise overwrite the results of the manual curation. Stable gene structures and similar information should become core annotation that is adopted as a single ‘gold standard’. The core annotation should expand to include noncoding RNA genes, alternative transcripts, promoters and regulatory elements. This will be complemented by links to the specialist data sets (such as protein structures), which are best maintained by expert groups at local sites.

Medical sequencing

Medicine already benefits from the human genome sequence. Using the genome in basic research to help understand the cause of disease and variable response to toxins will be its most important application to medicine. The emphasis is shifting from positional cloning to screening candidate genes for disease-related variants. If necessary, it will soon be possible to scale up the process to cover the whole genome (or all genes), thus obviating the need for a prior hypothesis to select candidate genes or regions. New genes offer new avenues for diagnosis and intervention, and possibilities for translating the output of research into tangible improvements in health care. The other articles in this issue explore some of the current directions of genomic medicine.

The most striking future clinical application arising from the reference human genome sequence is the possibility of unlimited medical sequencing. For some years we have used targeted sequencing in diagnostic applications. For example, testing for mutations in specific genes or exons is used to assist genetic counselling in diseases such as cystic fibrosis or breast cancer. With easy access to a well-annotated human genome and cheap, accurate technology for whole-genome sequencing, an individual could acquire either a specific or a complete personal genetic health profile, including risk and resistance factors.
This information could be used to improve and guide important medical decisions, to assess the risk of possible future exposures, and to select preventive treatments for improved health (Box 1). It would also provide important baseline information for retesting later in life. Consider the possibility of obtaining a full sequence on each case in the growing number of asbestos-induced mesotheliomas (ref. 60), finding somatic mutations in a key gene, and using the gene product as a target for the development of new anticancer drugs. Similarly, sequencing pathogen-resistant human genomes would provide new clues for countering our susceptibility to infectious disease, our biggest health burden worldwide.

The use of new individual sequence information anonymously for research purposes (as in the last two examples above) should be relatively free of ethical constraints. However, it must be acknowledged that the increasing use of personal genetic information for individual health applications raises substantial ethical and moral issues. Personal information must remain the property of the individual, shared in confidence only if proper consent has been given, and then with qualified practitioners. New levels of genetic counselling must accompany our practical ability to use the data so that, for example, risk information is used appropriately and is limited to guiding decisions for which there is a potentially beneficial treatment or action available.

The human genome sequence was made freely available for everyone to use. However, putting the genome sequence in the public domain also provides a golden opportunity for all scientists to contribute their expertise and vision to the genome.
Only by continuing to make all the new data available will we develop a full understanding of the genome and provide a fertile environment for future innovation.

doi:10.1038/nature02622

1. Bentley, D. R. Genomic sequence information should be released immediately and freely in the public domain. Science 274, 533–534 (1996).
2. Guyer, M. Statement on the rapid release of genomic DNA sequence. Genome Res. 8, 413 (1998).
3. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
4. Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
5. Rogers, J. The finished sequence of Homo sapiens. Cold Spring Harb. Symp. Quant. Biol. 68 (in the press).
6. Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
7. Rat Genome Project Sequencing Consortium. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428, 493–521 (2004).
8. Adams, M. D. et al. The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 (2000).
9. The C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012–2018 (1998).
10. The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
11. Blattner, F. R. et al. The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1474 (1997).
12. Fleischmann, R. D. et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269, 496–512 (1995).
13. Cole, S. T. et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393, 537–544 (1998).
14. Watson, J. D. & Crick, F. H. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 171, 737–738 (1953).
15. Kruglyak, L. & Nickerson, D. A. Variation is the spice of life. Nature Genet. 27, 234–236 (2001).
16. The International SNP Map Working Group. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928–933 (2001).
17. Dunham, A. et al. The DNA sequence and analysis of human chromosome 13. Nature 428, 522–528 (2004).
18. The International HapMap Consortium. The International HapMap Project. Nature 426, 789–796 (2003).
19. Druker, B. J. Imatinib alone and in combination for chronic myeloid leukemia. Semin. Hematol. 40, 50–58 (2003).
20. Aiuti, A., Ficara, F., Cattaneo, F., Bordignon, C. & Roncarolo, M. G. Gene therapy for adenosine deaminase deficiency. Curr. Opin. Allergy Clin. Immunol. 3, 461–466 (2003).
21. Hacein-Bey-Abina, S. et al. A serious adverse event after successful gene therapy for X-linked severe combined immunodeficiency. N. Engl. J. Med. 348, 255–256 (2003).
22. Risch, N. J. Searching for genetic determinants in the new millennium. Nature 405, 847–856 (2000).
23. Dean, M. et al. Genetic restriction of HIV-1 infection and progression to AIDS by a deletion allele of the CKR5 structural gene. Hemophilia Growth and Development Study, Multicenter AIDS Cohort Study, Multicenter Hemophilia Cohort Study, San Francisco City Cohort, ALIVE Study. Science 273, 1856–1862 (1996).
24. Ozaki, K. et al. Functional SNPs in the lymphotoxin-α gene that are associated with susceptibility to myocardial infarction. Nature Genet. 32, 650–654 (2002).
25. Van Dellen, A. & Hannan, A. J. Genetic and environmental factors in the pathogenesis of Huntington’s disease. Neurogenetics 5, 9–17 (2004).
26. Georgiou-Karistianis, N. et al. Future directions in research with presymptomatic individuals carrying the gene for Huntington’s disease. Brain Res. Bull. 59, 331–338 (2003).
27. Rioux, J. D. et al. Genetic variation in the 5q31 cytokine gene cluster confers susceptibility to Crohn disease. Nature Genet. 29, 223–228 (2001).
28. Peltekova, V. D. et al. Functional variants of OCTN cation transporter genes are associated with Crohn disease. Nature Genet. 36, 471–475 (2004).
29. Barroso, I. et al. Dominant negative mutations in human PPARγ associated with severe insulin resistance, diabetes mellitus and hypertension. Nature 402, 880–883 (1999).
30. Orr-Urtreger, A. et al. Developmental localization of the splicing alternatives of fibroblast growth factor receptor-2 (FGFR2). Dev. Biol. 158, 475–486 (1993).
31. Gong, Q. H. et al. Thirteen UDP glucuronosyltransferase genes are encoded at the human UGT1 gene complex locus. Pharmacogenetics 11, 357–368 (2001).
32. Kampa, D. et al. Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Res. 14, 331–342 (2004).
33. Collins, J. E. et al. Reevaluating human gene annotation: a second-generation analysis of chromosome 22. Genome Res. 13, 27–36 (2003).
34. Calin, G. A. et al. Human microRNA genes are frequently located at fragile sites and genomic regions involved in cancers. Proc. Natl Acad. Sci. USA 101, 2999–3004 (2004).
35. Ambros, V. et al. A uniform system for microRNA annotation. RNA 9, 277–279 (2003).
36. Lehner, B., Williams, G., Campbell, R. D. & Sanderson, C. M. Antisense transcripts in the human genome. Trends Genet. 18, 63–65 (2002).
37. Davidson, E. H. et al. A genomic regulatory network for development. Science 295, 1669–1678 (2002).
38. Lee, T. I. et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298, 799–804 (2002).
39. Novik, K. L. et al. Epigenomics: genome-wide study of methylation phenomena. Curr. Issues Mol. Biol. 4, 111–128 (2002).
40. Roix, J. J., McQueen, P. G., Munson, P. J., Parada, L. A. & Misteli, T. Spatial proximity of translocation-prone gene loci in human lymphomas. Nature Genet. 34, 287–291 (2003).
41. Thomas, J. W. et al. Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424, 788–793 (2003).
42. Margulies, E. H., Blanchette, M., Haussler, D. & Green, E. D. Identification and characterization of multi-species conserved sequences. Genome Res. 13, 2507–2518 (2003).
43. Boffelli, D. et al. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299, 1391–1394 (2003).
44. Pilpel, Y., Sudarsanam, P. & Church, G. M. Identifying regulatory networks by combinatorial analysis of promoter elements. Nature Genet. 29, 153–159 (2001).
45. von Mering, C. et al. Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417, 399–403 (2002).
46. Bar-Joseph, Z. et al. Computational discovery of gene modules and regulatory networks. Nature Biotechnol. 21, 1337–1342 (2003).
47. Smith, T. Whole genome variation analysis using single molecule sequencing. Targets (in the press).
48. Shendure, J., Mitra, R. D., Varma, C. & Church, G. M. Advanced sequencing technologies: methods and goals. Nature Rev. Genet. 5, 335–344 (2004).
49. Bingley, P. J. et al. Undiagnosed coeliac disease at age seven: population based prospective birth cohort study. Br. Med. J. 328, 322–323 (2004).
50. Golub, T. R. Genomic approaches to the pathogenesis of hematologic malignancy. Curr. Opin. Hematol. 8, 252–261 (2001).
51. Dowell, R. D., Jokerst, R. M., Day, A., Eddy, S. R. & Stein, L. The distributed annotation system. BMC Bioinformatics 2, 7 (2001).
52. Dunham, I. et al. The DNA sequence of human chromosome 22. Nature 402, 489–495 (1999).
53. Deloukas, P. et al. The DNA sequence and comparative analysis of human chromosome 20. Nature 414, 865–871 (2001).
54. Hattori, M. et al. The DNA sequence of human chromosome 21. Nature 405, 311–319 (2000).
55. Hillier, L. W. et al. The DNA sequence of human chromosome 7. Nature 424, 157–164 (2003).
56. Heilig, R. et al. The DNA sequence and analysis of human chromosome 14. Nature 421, 601–607 (2003).
57. Mungall, A. J. et al. The DNA sequence and analysis of human chromosome 6. Nature 425, 805–811 (2003).
58. Skaletsky, H. et al. The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature 423, 825–837 (2003).
59. Grimwood, J. et al. The DNA sequence and biology of human chromosome 19. Nature 428, 529–535 (2004).
60. Treasure, T., Waller, D., Swift, S. & Peto, J. Radical surgery for mesothelioma. Br. Med. J. 328, 237–238 (2004).

Acknowledgements

I thank I. Barroso, A. Coffey, T. Cox, S. Grant, T. Hubbard, S. Hunt, G. Leschziner, E. Margulies, K. Rice, J. Rogers, M. Ross, C. Shaw-Smith, R. Steward, M. Stratton, C. Tyler-Smith and others for assistance, discussion and critical reading of the manuscript. The author is supported financially by the Wellcome Trust.

Competing interests statement

The author declares that he has no competing financial interests.