Decoding the human genome sequence

© 2000 Oxford University Press
Human Molecular Genetics, 2000, Vol. 9, No. 16 Review 2353–2358
Decoding the human genome sequence
David R. Bentley+
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
Received 11 August 2000; Accepted 17 August 2000
The year 2000 is marked by the production of the sequence of the human genome. A ‘working draft’ of high
quality sequence covering 90% of the genome has been determined and a quarter is in finished form, including the first two completed chromosomes. All sequence data from the project is made freely available to the
community via the Internet, for further analysis and exploitation. The challenge which lies ahead is to decipher the information. Knowledge of the human genome sequence will enable us to understand how the
genetic information determines the development, structure and function of the human body. We will be able
to explore how variations within our DNA sequence cause disease, how they affect our interaction with our
environment and ultimately to develop new and effective ways to improve human health.
THE IMPORTANCE OF SEQUENCING WHOLE
GENOMES
The ability to undertake DNA sequencing on a large scale
started a revolution in biology, as it became possible to determine the complete DNA sequence of any organism and therefore obtain a full description of genes and other important
biological information stored in the genome. For the first time,
it was possible to define the complete set of proteins required
for a particular life form, to make full comparisons of protein
sets between different species, to discover the basis of their
similarities and differences and to explore the evolutionary
relationship between them. For example, the 15 Mb genome
sequence of baker’s yeast Saccharomyces cerevisiae,
completed in 1996, contains >6000 genes(1,2). These encode
all the functions required for a eukaryotic cell. The 99 Mb
genome sequence of the free-living soil nematode worm
Caenorhabditis elegans, completed 2 years later, contains
19 000 genes (3). This constitutes a complete gene set for a
multicellular organism, encoding proteins involved in basic
eukaryotic cell functions and also many others involved, for
example, in development, cell–cell interactions, motility,
feeding and reproduction. A recent cross comparison of the
complete gene sets (4) has revealed that 23% of the proteins
encoded by yeast genes have apparent homologues in the
nematode worm, reflecting functions common to both organisms. A closer examination of matches between individual
protein domains illustrates another level of homology, with
41% of yeast proteins matching at least one protein domain in
a nematode protein and 36% being homologous to a
representative mammalian protein set (4). It is evident that
diversity of function has arisen in evolution as a result of
existing domains being used in new combinations, as well as
from the generation of entirely new proteins.
These model genome projects used a strategy by which a
physical map was constructed of overlapping bacterial or yeast
artificial chromosome (YAC) clones representing the entire
genomic DNA of the organism (5–7). A minimum overlapping
+
Tel: +44 1223 834244; Fax: +44 1223 494919; Email: [email protected]
set of clones (the tile path) was then chosen for sequencing.
Each clone was then subjected to complete sequencing,
comprising an initial random shotgun phase followed by a
phase of directed finishing, resulting in a final reference
sequence with an accuracy of >99.99% (8). Completion of
these model genomes thus provided many important technological and biological insights that have since been applied to
the human genome.
The extraordinary progress of the Human Genome Project was
made possible by the remarkable international collaboration
which developed from the early 1990s. Fostered by the close cooperation of multiple national funding agencies, 20 centres in six
countries have worked together (see Appendix), co-ordinating
their efforts by sharing technology, resources and information
which were freely available throughout the project. Sequencing
in many laboratories was also co-ordinated substantially using the
genome map itself. This helped to minimize unnecessary duplication and to allow both large and small partners to contribute effectively. The development of new automated sequencing
technology, coupled with the adoption of streamlined approaches
to generating genome sequence information on a large scale, led
to a dramatic acceleration of the programme during the last
2 years, driven by a growing awareness of the enormous value of
gaining a first view of the human genome. The result is the emergence of the ‘working draft’. Although incomplete, it provides
extensive coverage of most of the human genes. The working
draft is therefore both an advanced intermediate on the pathway
to the production of the finished reference sequence and a product
of immediate value in accelerating studies of individual genes,
associations with disease and the study of human sequence variation. The working draft is a dynamic product, subject to continual
refinement as new data are added. A brief summary of the
strategy, milestones and status of the international programme is
outlined below, based on current information that is publicly
available. A central list of websites providing information is available at (http://www.ncbi.nlm.nih.gov/genome/central; http://
2354 Human Molecular Genetics, 2000, Vol. 9, No. 16 Review
www.ensembl.org/genome/central ). A full analysis is due for
publication at the end of 2000.
STRATEGY
An early milestone in tackling the human genome was the
construction of a genome-wide genetic linkage map of polymorphic microsatellites, which provided a first framework of
2000 unique landmarks ordered at high odds (>1000:1) across
the genome (9). The use of overlapping YACs (10) and, later,
radiation hybrid (RH) mapping (11,12) resulted in construction
of maps of much higher densities of landmarks for each of the
human chromosomes (13,14–20). The RH mapping work
included another milestone, the human gene map (21), which
was constructed using 30 181 gene-based markers derived
from expressed sequence tags (ESTs) (22) selected from nonredundant clusters, or Unigenes (http://www.ncbi.nlm.nih.gov/
Unigene ).
To provide the starting material for sequencing, a complete
physical map of overlapping clones was constructed using
libraries of large insert (150–250 kb) P1-derived artificial
chromosomes (PACs) (23) or BACs (24). Clones were assembled into contigs on the basis of overlaps identified by pairwise
comparison of restriction fingerprints (25,26) using data generated either by fingerprinting whole genomic libraries or
chromosome-specific clone collections. Many overlaps were
also identified directly by screening clones with markers from
the landmark maps as part of the initial contig assembly
process. More recently, large-scale sequencing of BAC clone
ends has provided an additional resource to enable new clones
to be added to the map directly by matching the BAC end
sequences to draft or finished sequences. Additional mapping
information on the order, orientation and gap size between
contigs has also been established using clones as probes in fluorescence in situ hybridization (FISH) analysis of interphase nuclei
or extended DNA fibres. On 15 June 2000, the BAC clone map
contained >330 000 clones and comprised 1743 contigs representing an estimated 97–98% of the genome, from which 32 221
clones formed the tile path (or map of accessioned clones)
(details are listed at http://genome.wustl.edu/gsc/ ).
To generate the working draft DNA sequence, each BAC or
PAC clone in the tile path was subjected to an initial shotgun
sequencing phase in which sufficient random reads were
assembled into a consensus to provide at least 3-fold coverage
of each sequenced base at a quality score [Q value, calculated
by the base-calling program PHRED (27,28)] of 20 or higher
(29). This provided ∼95% coverage of the insert. At present,
each clone is being taken to a full depth shotgun (typically 8to 10-fold), and this is being followed by a directed ‘finishing’
phase in which all gaps are closed, ambiguities are resolved,
and the entire finished sequence is checked to ensure an accuracy of >99.99% before submission to the public databases
(see Appendix, nos 3, 4 and 12).
The availability of draft sequence for every clone in the tile
path allows re-calculation of the entire map. Overlaps between
all clones can now be based on matching sequences, which
provide much more information than the earlier overlap analysis
based on fingerprint or landmark data. A set of fragments of
accessioned sequence has been merged together for each chromosome and the order and orientation of non-overlapping fragments have been determined where possible using additional
information (see http://genome.cse.ucsc.edu/goldenPath). This
will continue to be refined as more clone sequences are
completed. This dataset enables re-assessment of the order of
all landmarks relative to the previous maps, thus providing
independent validation of long-range order in the working
draft.
The status of the project can be assessed based on the amount
of sequence that is available and comparing this figure to the
estimated size of the genome. Of course, any estimate is necessarily an approximation, as the true size of the genome is
unknown; this will only be determined when all remaining
gaps in the genome sequence are measured [for example using
the approaches taken when finishing chromosomes 21 and 22
(see below)]. In addition, analysis of the existing sequence is
being refined continually to confirm new overlaps, or to define
genomic duplications and to make other changes which affect
estimates of sequence coverage. The current estimate is that
>90% of the euchromatic human genome sequence is publicly
available and 23% is finished [as of 31 August 2000: see current
progress at (http://www.ncbi.nlm.nih.gov/genome/seq )].
The identification of genes and other important biological
features in the sequence requires further analysis. This is
carried out on each BAC or PAC as it is released, whether in
draft or in finished form. A number of data analysis centres are
involved in providing up-to-date annotation of the rapidly
evolving draft sequence, generally using fast, automatic
analyses. These sites provide consistent annotation and interfaces to the data. The Ensembl database, based at the Sanger
Centre and the European Bioinformatics Institute, annotates
known genes by aligning mRNA sequence to genomic DNA
using the EST_GENOME program. Further genes are detected
by predicting candidate gene features, which are then
confirmed using homology searches against EST and protein
sequence databases (see http://www.ensembl.org ). Analysis of
a pre-release of the working draft (24 May 2000, containing
2.2 Gb of sequence data) at Ensembl resulted in detection of 31
000 predicted gene features; extrapolating from this figure to
the amount of sequence that is now available, this figure would
increase to ∼45 000 genes (see below for discussion of gene
number estimates). The National Centre for Biotechnology
Information provides a Map Viewer, which displays the
available human genome sequence data and includes the location of genes, polymorphisms and other features (http://
www.ncbi.nlm.nih.gov ).
WHOLE GENOME SHOTGUNS
The recent advances in sequencing technology and throughput
have led researchers to reconsider the possibility of shotgun
sequencing large and complex genomes directly, thus
bypassing the mapping stage (30,31). The potential advantages
of speed and simplicity of the strategy are countered by the difficulties faced in the scale of computing analysis required for
assembly of the DNA fragments, plus the many false joins
which would inevitably arise from assembly of an entire
complex genome sequence containing a range of repetitive
sequences (32). Assembly of 3.2 million reads (12.8 × coverage)
of the 120 Mb euchromatic genome of the fruit fly Drosophila
melanogaster resulted in an assembly covering 115 Mb of the
120 Mb euchromatic region, with 1630 gaps (33). Closure of
the gaps and correction of the remaining misassemblies are
Human Molecular Genetics, 2000, Vol. 9, No. 16 Review 2355
being done using the scaffold of sequenced BAC clones which
was also available and had been aligned to the genome
sequence (33). The finishing process is thus modelled on the
earlier experience of the nematode genome project.
The experience gained in the D.melanogaster whole genome
shotgun assembly was used as a basis for a human whole
genome assembly. There are two important differences
between the two analyses, however, which would be expected
to have a marked effect on the efficiency of the sequence
assembly process. First, the human genome contains a diverse
collection of repetitive sequences comprising ∼50% of the
total. The experience gained in Drosophila is therefore not
necessarily representative of the human (or mouse) genomes.
Second, the increased size of the human genome (25 times that
of the fruit fly) would result in a greatly increased scale of computational analysis. The current availability of a complete working
draft of the human genome, assembled on a clone-by-clone
basis, however, obviates the need for a whole genome shotgun
assembly to be done in isolation. Most of the data from a whole
genome shotgun can be aligned with the mapped human
sequence that is currently available. The remaining unmatched
sequences would be the result of either failures in the matching
process, or fragments of human DNA that were represented in
the small insert libraries used for the whole genome shotgun, but
not represented in the mapped BACs used so far to generate the
working draft. Testing of these hypotheses will require access
to merged sequence from both projects to enable the appropriate analysis to be carried out.
FINISHING HUMAN CHROMOSOMES
In December 1999, the sequence of the euchromatic part of
chromosome 22q (an acrocentric chromosome) was finished
(34) and 33.5 Mb was represented in map 10 contigs, of which
the largest was 25 Mb. The nine gaps in the clone map (plus
one small gap in the sequence, bridged by a clone) covered a
total of just over 1 Mb, and the total coverage obtained was
therefore 97%. The gaps were not closed after exhaustive
screening of BAC and PAC libraries; work is continuing using
other cloning systems. At least three of the gaps were deemed
unstable in BACs: fibre FISH analysis of several BAC clones
showed that each one contained sequence on both sides of a
gap but was missing a central portion, which may have been
lost during cloning or in subsequent bacterial culture. Chromosome 21, another acrocentric chromosome, was completed
5 months later (35), 33.8 Mb of 21q was represented in just
three contigs and the remaining three gaps in the clone map
(plus seven small sequence gaps) accounted for only 0.1 Mb.
The total coverage was therefore 99.7% of the euchromatin. In
addition, 0.28 Mb of the heterochromatic sequence of 21p was
also determined.
Both these chromosomes were finished to an accuracy of
>99.99%, and annotation of the sequence revealed a marked
contrast in gene content. Chromosomes 22 and 21 were found
to contain 545 and 225 genes, respectively, plus 134 and
59 pseudogenes, respectively. Differences in base composition
and repeat sequence content between the two chromosomes
reflected this: the gene-rich chromosome 22 had a higher GC
content, more SINEs and slightly fewer LINEs than the genepoor chromosome 21.
HOW MANY GENES IN THE GENOME?
It is important to realize that we will not know the number of
genes in the human genome when the genome sequence is
finished. Finding all the genes in the genome is dependent on
the resolution of methods to analyse the sequence and to identify genes based on characteristic features such as recognizable
motifs, homology to other genes, matches to mRNA sequences
and ab initio predictions. In many cases, hypothetical gene
structures also require experimental validation. With the
sequence of the human genome still in draft form, and our
knowledge of the repertoire of gene features still limited, estimates of the total number of human genes are necessarily
approximate.
An early estimate of 60 000–70 000 human genes was based
on the range of different sequences found in a population of
ESTs sequenced at random from tissues. A similar figure
(80 000) was derived from measuring the total number of CpG
islands in the genome, and extrapolating on the basis that 56%
of known genes were associated with CpG islands (36). The
total gene number based on CpG islands may be an overestimate if some CpG islands are not associated with genes, or if
there are cases where more than one CpG island is associated
with the same gene.
The availability of much larger EST datasets led to greater
divergence in gene number estimates, with the most recent
examples ranging from 120 000 (37) to 35 000 (38). Two problems confound gene number estimates that are based on the
complexity of EST collections. First, sampling of ESTs is
biased by the tissue sources selected and the depth to which
each one is sampled. Therefore, genes which are expressed
either at low abundance (i.e. lower than the depth of sampling
of the library sequencing), or only in inaccessible tissues or
developmental stages, will be under-represented or absent
from current datasets. This will lead to an underestimate of the
complexity of the EST dataset. Second, EST datasets are
composed of incomplete sequences of mRNA and independent
ESTs which are derived from the same mRNA may not overlap
on the basis of the available sequence data, thus leading to an
overestimate of complexity of the EST collection. Ewing and
Green. (38) took both these factors into account in their analysis, by using only 3′ EST sequences and by normalizing their
results using the annotated sequence of chromosome 22.
Two other analyses have yielded similar estimates. First, the
identification of evolutionarily conserved regions (‘ecores’)
between a random set of genomic sequences of the puffer fish
Tetraodon nigroviridis and the human genome resulted in estimates of 28 000–34 000 human genes. This study was also
normalized with respect to the annotated genes on chromosome 22 (39). Second, a direct extrapolation from the annotated genes on chromosome 22 alone, or chromosomes 21 and
22 combined (assuming that 35 Mb sequence is 1.1% of the
genome) leads to estimates of 43 000 or 34 000 genes, respectively, excluding pseudogenes. These extrapolations assume
that either chromosome 22, or the combination of chromosomes 21 and 22, respectively, are representative of the rest of
the human genome.
These last estimates are significantly lower than previously
assumed views of the number of genes in the human genome.
One of the most interesting insights to emerge from comparing
complete gene sets in model organisms to date is that gene
2356 Human Molecular Genetics, 2000, Vol. 9, No. 16 Review
number is not necessarily a good measure of complexity and
this trend may be borne out in the human genome. For
example, the fruit fly would be considered more anatomically
complex, with 10 times more cells, than the nematode (which
has 1000 cells) and the fruit fly also undergoes a more complex
developmental process than the nematode worm. Yet the fruit
fly genome contains only 13 000 genes, compared with the
19 000 genes found in the nematode genome. Instead, it
appears that biological complexity is achieved in other ways.
Relatively few gene products may determine relatively
complex biochemical and developmental processes. Greater
functional complexity and diversity may result from a relatively small number of novel regulatory proteins working in a
variety of combinations, for example, and from the existence
of multiple protein isoforms.
In practice, the estimates of gene number are most likely to
stabilize when the genome sequences of multiple vertebrate
genomes are produced and aligned to the human sequence
(notably the mouse sequence, for example, is projected to be
available within 2 years). This will enable the annotation of
genes to be done accurately based on conserved sequences.
The location of putative exons can be correlated with functional motifs, such as exon–intron boundaries, and matches to
known mRNA and ESTs, and gene structures can be accepted
or rejected based on the occurrence of an open reading frame.
GENETIC RECOMBINATION IN THE HUMAN
GENOME
The earliest maps of the human genome have relied on the
measurement of genetic recombination occurring between two
loci in the genome. Genetic linkage maps have provided a view
of the order and distance between landmarks (polymorphic
loci) that is derived by scoring the number of recombinations
observed between two loci in a given number of meioses. For
mapping applications, it has generally been assumed that the
physical distance (in base pairs) is proportional to the genetic
distance (in centiMorgans) between the loci. However, it is
well-known that certain regions of the genome exhibit a higher
frequency of recombination than others. The frequency of
genetic recombination can now be compared with the physical
distance between any two markers, given an accurate measurement of the physical distance (or contiguous sequence)
between them. This has been demonstrated on chromosome
22, by plotting the genetic distance between microsatellite
markers in the genetic map relative to their physical separation
along the finished chromosome sequence. The results showed
that the frequency of recombination along the chromosome
varied (within the limits of resolution of the study) between
1.1 cM/Mb (over most of the chromosome: 28.2 Mb) to
6.2 cM/Mb (averaged over four regions totalling 5.2 Mb) (34).
The analysis illustrates how to identify potential recombination
hotspots. It may require a much higher resolution of genetic
analysis, with many more polymorphic markers (see below),
but it is clearly possible to identify minimum regions of
genomic sequence which may play a critical role in enhanced
recombination and to use knowledge of the sequence to design
experimental tests of this hypothesis.
FINDING CONSERVED REGULATORY SEQUENCE
ELEMENTS
A wealth of information can be obtained from comparisons
between genome sequences, to identify the features which
have been conserved during evolution and are therefore likely
to be important for survival. Comparisons made between
distantly related species (in different phyla) yield information
chiefly on gene homologues, as discussed above. Comparison
of more closely related species can reveal similarities in gene
order (synteny), as seen for example in the extensive synteny
over large genomic distances between man and mouse. Great
similarity between human and murine sequences is also
evident at the sequence level. A recent survey of 77 orthologous mouse and human gene pairs, for example, revealed an
average of 82% identity in coding regions (40), in close agreement to an earlier study comparing cDNA sequences (41). The
gene pair study also illustrated the frequent occurrence of an
extensive degree of homology in non-coding regions. Fifty-six
per cent of 3′-untranslated regions (UTRs) and 50% of
5′-UTRs had >60% sequence identity. Notably, this level of
sequence identity was also observed in the 100 bases immediately upstream of the promoters in 70% of the gene pairs and,
over larger blocks, generally conservation was found more
often in the upstream sequences (36%) than in introns (23%)
(40).
The systematic identification of conserved sequences in noncoding regions offers potentially an excellent strategy for the
identification of transcriptional regulatory elements. The
strategy is most applicable to genes that have conserved functions and expression patterns. For example, the stem cell
leukaemia (SCL) gene encodes a basic helix–loop–helix transcription factor with a key role in the early stages of haematopoiesis and vasculogenesis (42). The SCL gene has a specific
pattern of expression that is highly conserved in mammals,
birds, frogs and teleost fish, indicating that the sequence
elements responsible for this level of transcriptional control
might also be conserved. This was confirmed following a
comparison of the human, mouse and chicken SCL gene
sequences, which revealed a series of highly conserved noncoding segments, some of which corresponded precisely with
previously characterized SCL enhancers. The analysis also
identified a peak of mouse–human–chicken homology, located
∼10 kb downstream of the gene, of no known significance.
However, the sequence was found to act as a neural enhancer,
directing an appropriate SCL-like pattern of expression of GFP
or lacz reporter genes in transgenic frogs and mice, respectively (43).
CATALOGUING HUMAN SEQUENCE VARIATION
The first human genome sequence provides a reference for
characterizing all genes and other features and is applicable to
the study of the entire human race. However, there is significant variation between individuals at the level of the DNA
sequence. On average, comparison of the sequence of any two
genomes results in the detection of a sequence variant every
1000 bases. Most (∼90%) sequence variants are single base
substitutions (44,45), known as single nucleotide polymorphisms (SNPs), and the remainder are insertions or deletions of
one or more bases. Sequence variants that occur within genes
Human Molecular Genetics, 2000, Vol. 9, No. 16 Review 2357
or in regulatory elements may affect the function of expression
of a gene. These ‘functional variants’ are responsible for the
genetic component of phenotypic variables, including disease
susceptibility and drug response. A major goal in human
genetics is to identify these functional variants and then to
determine how a specific allele contributes to a particular
phenotypic characteristic. The association of a phenotype with
a particular functional variant allele may be tested directly, by
genotyping the variant in an appropriately selected population,
or indirectly using a nearby SNP which is in linkage disequilibrium (LD) with the functional variant. The indirect approach,
which is likely to predominate this kind of study in the near
future, depends on the extent of LD in the human genome. This
is unknown and there is likely to be considerable local variation. Estimates based on simulations vary from 3 to 100 kb
(46,47). Work is in progress in multiple laboratories to obtain
a true measure of LD in different regions.
The availability of the working draft has been critical for the
development of a first human map of SNPs of sufficient
density to allow the ascertainment of LD on a genome-wide
basis and to underpin association studies. In addition to locusspecific targeted approaches for SNP finding, which are based
on re-sequencing DNA fragments after amplification from
different individuals, three genome-wide approaches have
been taken. Matching EST sequences to each other (48), and
now to the working draft, allows detection of sequence differences and hence candidate SNPs. Similarly, candidate SNPs
have been detected by comparing the draft or finished
sequence of overlapping BAC or PAC clones (45; E. Dawson
et al., unpublished data), and thus identifying dense clusters of
SNPs which are separated by regions of typically 0.1–0.5 Mb.
More recently, The SNP Consortium (TSC; a consortium of
industrial organizations, the Wellcome Trust and genome
research laboratories; see http://www.snp.cshl.org ), have
undertaken large-scale genomic sequencing approaches to
identify SNPs across the entire genome. Sequences obtained
from the DNA of 24 individuals representing a range of ethnic
groups were aligned to each other, or to the working draft. Pilot
studies carried out either on the whole genome or on chromosome 22 demonstrated the feasibility of the project and,
further, that the availability of the genome sequence would
more than double the number of candidate SNPs produced
(49,50). Nearly all the SNPs can also be mapped directly by
alignment to the draft sequence and the TSC programme is on
course to produce a map at a density of at least one SNP per
5 kb (double the original target) and to finish well ahead of
schedule. The integration of SNPs from all sources using the
working draft will thus pave the way for a new era in human
genetic studies.
CONCLUSION
Sequencing whole genomes has brought many new insights in
our understanding of the biology and genetics of the organism.
For the first time we are able to compare complete gene sets
and to begin to unravel the molecular basis for biological
complexity. The sequence of one organism can be used for
annotation of another. With the benefit of experience and data
from the model organism studies, the emergence of the
working draft of the human genome provides an immediate
platform to study the biology of Homo sapiens. Already there
are great advances being made in resolving the human gene
set, establishing studies of their expression in different
environments using microarrays and in studying their association with human disease. The value of the working draft is
already illustrated from discovery and rapid characterization of
many genes involved in disease, notably including genes
involved in cancer [the breast cancer gene BRCA2 (51)],
immune defects [e.g. X-linked lymphoproliferative disease
(52)], deafness [Pendred syndrome (53), DFNA5 (54),
DFNA13 (55)] to name but a few. The working draft has also
made it possible to construct a comprehensive SNP map of the
genome. At the same time, we should not lose sight of the task
of finishing the reference sequence. Only then will the full
potential of sequence analysis be realized. Armed with a
genome sequence that is >99.99% accurate and has no gaps,
researchers will be able to explore the functional significance
of every last base in the human genome and all the variants
thereof, assured of the reliability of their basic reference.
Determining the genome sequence has been the main purpose
of the Human Genome Project until now; realizing its full
potential is the challenge that lies ahead.
ACKNOWLEDGEMENTS
The author is most grateful for helpful discussions with Alex
Bateman, Richard Durbin, Tim Hubbard, Jim Mullikin, Jane
Rogers, John Sulston and other members of the Sanger Centre.
APPENDIX
Centres involved in the international human sequencing
consortium (source: http://www.sanger.ac.uk/HGP ; http://
www.nhgri.nih.gov )
1. Baylor College of Medicine, Houston, TX, USA
2. Beijing Human Genome Center, Institute of Genetics,
Chinese Academy of Sciences, Beijing, China
3. DNA Database of Japan
4. European Bioinformatics Institute, Cambridge, UK
5. Gesellschaft für Biotechnologische Forschung mbH,
Braunschweig, Germany
6. Genoscope, Evry, France
7. Genome Therapeutics Corporation, Waltham, MA, USA
8. Institute for Molecular Biotechnology, Jena, Germany
9. Joint Genome Institute, US Department of Energy, Walnut
Creek, CA, USA
10. Keio University, Tokyo, Japan
11. Max Planck Institute for Molecular Genetics, Berlin,
Germany
12. National Center for Biotechnology Information, NIH,
Bethesda, MD, USA
13. RIKEN Genomic Sciences Center, Saitama, Japan
14. Sanger Centre, Hinxton, UK
15. Stanford DNA Sequencing and Technology Development
Center, Palo Alto, CA, USA
16. Stanford Human Genome Center, Palo Alto, CA, USA
17. University of Washington Genome Center, Seattle, WA,
USA
18. University of Washington Multimegabase Sequencing
Center, Seattle, WA,USA
2358 Human Molecular Genetics, 2000, Vol. 9, No. 16 Review
19. Whitehead Institute for Biomedical Research, MIT,
Cambridge, MA, USA
20. Washington University Genome Sequencing Center, St
Louis, MO, USA
REFERENCES
1. Goffeau, A. et al. (1996) Life with 6000 genes. Science, 274, 563–567.
2. Goffeau, A. et al. (1997) The Yeast Genome Directory. Nature, 387, 1–105.
3. The C.elegans Sequencing Consortium (1998) Genome sequence of the
nematode C.elegans: a platform for investigating biology. Science, 282,
2012–2018.
4. Rubin, G.M. et al. (2000) Comparative genomics of the eukaryotes.
Science, 87, 2204–2215.
5. Coulson, A. et al. (1986) Towards a physical map of the genome of the nematode Caenorhabditis elegans. Proc. Natl Acad. Sci. USA, 83, 7821–7825.
6. Olson, M.V. et al. (1986) Random-clone strategy for genomic restriction
mapping in yeast. Proc. Natl Acad. Sci. USA, 83, 7826–7830.
7. Coulson, A. et al. (1988) Genome linking with yeast artificial chromosomes. Nature, 335, 184–186.
8. Wilson, R. et al. (1994) 2.2 Mb of contiguous nucleotide sequence from
chromosome III of C.elegans. Nature, 368, 32–38.
9. Dib, C. et al. (1996) A comprehensive genetic map of the human genome
based on 5264 microsatellites. Nature, 380, 152–154.
10. Burke, D.T., Carle, G.F. and Olson, M.V (1987) Cloning of large segments of exogenous DNA into yeast by means of artificial chromosome
vectors. Science, 236, 806–812.
11. Cox, D.R. et al. (1990) Radiation hybrid mapping: a somatic cell genetic
method for constructing high-resolution maps of mammalian chromosomes. Science, 250, 245–250.
12. Walter, M.A. et al. (1994) A method for constructing radiation hybrid
maps of whole genomes. Nature Genet., 7, 22–28.
13. Foote, S. et al. (1992) The human Y chromosome: overlapping DNA
clones spanning the euchromatic region. Science, 258, 60–66.
14. Chumakov, I. et al. (1992) Continuum of overlapping clones spanning the
entire human chromosome 21q. Nature, 359, 380–387.
15. Gemmill, R.M. et al. (1995) A second-generation YAC contig map of
human chromosome 3. Nature, 377, 299–319.
16. Krauter, K. et al. (1995) A second-generation YAC contig map of human
chromosome 12. Nature, 377, 321–333.
17. Doggett, N.A. et al. (1995) An integrated physical map of human chromosome 16. Nature, 377, 335–365.
18. Collins, J.E. et al. (1995) A high-density YAC contig map of human chromosome 22. Nature, 377, 367–379.
19. Bouffard, G.G. et al. (1997) A physical map of human chromosome 7: an
integrated YAC contig map with average STS spacing of 79 kb. Genome
Res., 7, 673–692.
20. Hudson, T.J. et al. (1995) An STS-based map of the human genome.
Science, 270, 1945–1954.
21. Deloukas, P. et al. (1998) A physical map of 30 000 human genes.
Science, 282, 744–746.
22. Hillier, L.D. et al. (1996) Generation and analysis of 280 000 human
expressed sequence tags. Genome Res., 6, 807–828.
23. Ioannou, P.A. et al. (1994) A new bacteriophage P1-derived vector for the
propagation of large human DNA fragments. Nature Genet., 6, 84–89.
24. Shizuya, H. et al. (1992) Cloning and stable maintenance of 300-kilobasepair fragments of human DNA in Escherichia coli using an F-factor-based
vector. Proc. Natl Acad. Sci. USA, 89, 8794–8797.
25. Gregory, S.G., Howell, G.R. and Bentley, D.R. (1997) Genome mapping
by fluorescent fingerprinting. Genome Res., 7, 1162–1168.
26. Marra, M.A. et al. (1997) High throughput fingerprint analysis of largeinsert clones. Genome Res., 7, 1072–1084.
27. Ewing, B. et al. (1998) Base-calling of automated sequencer traces using
phred. I. Accuracy assessment. Genome Res., 8, 175–185.
28. Ewing, B. and Green, P. (1998) Base-calling of automated sequencer
traces using phred. II. Error probabilities. Genome Res., 8, 186–194.
29. Bouck, J. et al. (1998) Analysis of the quality and utility of random shotgun sequencing at low redundancies. Genome Res., 8, 1074–1084.
30. Weber, J.L. and Myers, E.W. (1997) Human whole-genome shotgun
sequencing. Genome Res., 7, 401–409.
31. Venter, J.C. et al. (1998) Shotgun sequencing of the human genome.
Science, 280¸ 1540–1542.
32. Green, P. (1997) Against a whole-genome shotgun. Genome Res., 7, 410–417.
33. Myers, E.W. et al. (2000) A whole-genome assembly of Drosophila.
Science, 287, 2196–2204.
34. Dunham, I. et al. (1999) The DNA sequence of human chromosome.
Nature, 402, 489–495.
35. Hattori, M. et al. (2000) The DNA sequence of human chromosome 21.
The chromosome 21 mapping and sequencing consortium. Nature, 405,
311–319.
36. Antequera, F. and Bird, A. (1994) Predicting the total number of human
genes. Nature Genet., 8, 114.
37. Liang, F. et al. (2000) Gene index analysis of the human genome estimates
approximately 120 000 genes. Nature Genet., 25, 239–240.
38. Ewing, B. and Green, P. (2000) Analysis of expressed sequence tags indicates 35 000 human genes. Nature Genet., 25, 232–234.
39. Roest Crollius, H. et al. (2000) Estimate of human gene number provided
by genome-wide analysis using Tetraodon nigriviridus DNA sequence.
Nature Genet., 25, 235–238.
40. Jareborg, N., Birney, E.and Durbin, R. (1999) Comparative analysis of
noncoding regions of 77 orthologous mouse and human gene pairs.
Genome Res., 9, 815–824.
41. Makalowski, W. et al. (1996) Comparitive analysis of 1196 orthologous
mouse and human full-length in mRNA and protein sequences. Genome
Res., 6, 846–857.
42. Begley, C.G. and Green, E.W. (1999) The SCL gene: from case report to
critical hematopoietic regulator. Blood, 93, 2760–2770.
43. Gottgens, B. et al. (2000) Analysis of vertebrate SCL loci identifies conserved enhancers. Nature Biotechnol., 18, 181–186.
44. Wang, D.G. et al. (1998) Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome.
Science, 280, 1077–1082.
45. Taillon-Miller, P. et al. (1998) Overlapping genomic sequences: a treasure
trove of single-nucleotide polymorphisms. Genome Res., 8, 748–754.
46. Kruglyak, L. (1999) Prospects for whole-genome linkage disequilibrium
mapping of common disease genes. Nature Genet., 22, 139–144.
47. Collins, A., Lonjou, C. and Morton, N.E. (1999) Genetic epidemiology of single-nucleotide polymorphisms. Proc. Natl Acad. Sci. USA, 96, 15173–15177.
48. Picoult-Newberg, L. et al. (1999) Mining SNPs from EST databases.
Genome Res., 9, 167–174.
49. Altschuler, D. et al. (2000) A human SNP map generated by reduced representation shotgun sequencing. Nature, in press.
50. Mullikin, J.C. et al. (2000) A SNP map of human chromosome 22. Nature,
in press.
51. Wooster, R. et al. (1995) Identification of the breast cancer susceptibility
gene BRCA2. Nature, 378, 789–792.
52. Coffey, A.J. et al. (1998) Host response to EBV infection in X-linked lymphoproliferative disease results from mutations in an SH2-domain encoding gene. Nature Genet., 20, 129–135.
53. Everett, L.A. et al. (1997) Pendred syndrome is caused by mutations in a
putative sulphate transporter gene (PDS). Nature Genet., 17, 411–422.
54. Van Laer, L. et al. (1998) Nonsyndromic hearing impairment is associated
with a mutation in DFNA5. Nature Genet., 20, 194–197.
55. McGuirt, W.T. et al. (1999) Mutations in COL11A2 cause non-syndromic
hearing loss (DFNA13). Nature Genet., 23, 413–419.