Shaping Biomedicine as an Information Science
Timothy Lenoir
Sometime in the mid-1960s biology became an information science. While François Jacob and Jacques Monod's work on the genetic code is usually credited with propelling biology into the Information Age, in this essay I explore the transformation of biology by what have become essential tools to the practicing biochemist and molecular biologist: namely, the contributions of information technology. About the same time as Jacob and Monod's work, developments in computer architectures and in algorithms for generating models of chemical structures and simulations of chemical interactions allowed computational experiments to interact with and draw together theory and laboratory experiment in completely novel ways. The new computational science linked with visualization has had a major impact in the fields of biochemistry, molecular dynamics, and molecular pharmacology (Friedhoff & Benzon, 1989; Panel on Information Technology and the Conduct of Research, 1989; McCormick, DeFanti, & Brown, 1987, p. A-1; Hall, 1995). By "computational science" I mean the use of computers in science disciplines like these as distinct from computer science (McCormick, DeFanti, & Brown, p. 11). The sciences of visualization are defined by McCormick, DeFanti, and Brown as follows:

Images and signals may be captured from cameras or sensors, transformed by image processing, and presented pictorially on hard or soft copy output. Abstractions of these visual representations can be transformed by computer vision to create symbolic representations in the form of symbols and structures. Using computer graphics, symbols or structures can be synthesized into visual representations. (p. A-1)

A New Biology for the Information Age
Computational approaches have substantially transformed and extended the domain of theorizing in these areas in ways unavailable to older, non-computer-based forms of theorizing. But other information technologies have also proved crucial to bringing about this change. In the 1970s through the 1990s, armed with such new tools of molecular biology as cloning, restriction enzymes, protein sequencing, and gene product amplification, biologists were awash in a sea of new data. They deposited these data in large and growing electronic databases of genetic maps, atomic coordinates for chemical and protein structures, and protein sequences. These developments in technique and instrumentation launched biology onto the path of becoming a data-bound science, a "science" in which all the data of a domain, such as a genome, are available before the laws of the domain are understood. Biologists have coped with this data explosion by turning to information science: applying artificial intelligence and expert systems and developing search tools to identify structures and patterns in their data.

The aim of this paper is to explore early developments in the introduction of computer modeling tools from artificial intelligence (AI) and expert systems into biochemistry in the 1960s and 1970s, and the introduction of informatics techniques for searching databases and extracting biological function and structure in the emerging field of genomics during the 1980s and 1990s. I have two purposes in this line of inquiry. First, I want to suggest that by introducing tools of information science biologists have sought to make biology a unified theoretical science with predictive powers analogous to other theoretical disciplines. But I want also to suggest that along with this highly heterogeneous and hybrid form of computer-based experimentation
and theorizing has come a different conception of
theorizing itself: one based on models of information processing and best captured by the phrase "knowledge
engineering” developed within the AI community. My
second concern is to contribute to recent discussions on
the transformation of biology into an information science. Lily Kay, Evelyn Fox Keller, Donna Haraway, and
Richard Doyle have explored the role of metaphor, disciplinary politics, economics, and culture in shaping the
context in which the language of “DNA code,” “genetic
information,” “text,” and “transcription” have been inserted into biological discourse, often in the face of resistance from some of the principal actors themselves
(Doyle, 1997). I am more interested than these authors
in software and the computational regimes that it enables. Elaborating on the theme of "tools to theory," recently espoused in science and technology studies, I am
interested in exploring the role of the computational medium itself in shaping biology as an information science.
But a further crucial stimulation to the takeoff of bioinformatics, of course, was provided by hardware and
networking developments underwritten by the NIH and
NSF (Hughes, 1999).
Computers and Biochemistry:
Molecular Modeling
The National Institutes of Health have been active at
every stage in making biology an information science.
NIH support was crucial to the explosive take-off of computational chemistry and the promotion of computer-based visualization technologies in the mid-1960s. The
agency sponsored a conference at UCLA in 1966 on “Image Processing in Biological Science.” The NIH’s Bruce
Waxman, co-chair of the meeting, set out the NIH agenda
for computer visualization by sharply criticizing the notion of mere “image processing” as the direction that
should be pursued in computer-enhanced vision research.
The goal of computer-assisted “vision,” he asserted, was
not to replicate relatively low-order motor and perceptual capabilities even at rapid speeds. “I have wondered
whether the notion of image processing is itself restrictive; it may connote the reduction of and analysis of ‘natural’ observations but exclude from consideration two- or
three-dimensional data which are abstractions of phenomena rather than the phenomena themselves” (Ramsey,
1968, pp. xiii-xiv). Waxman suggested "pattern recognition" as the subject that they should really pursue, and in particular where the object was what he termed
“non-natural.” In general, Waxman asserted, by its capacity to quantize massive data sets automatically, the
computer, linked with pattern-recognition methods of
imaging the non-natural, would permit the development
of stochastically based biological theory.
Waxman’s comments point to one of the important
and explicit goals of the NIH and other funding agencies: to mathematize biology. That biology should follow in the footsteps of physics had been the centerpiece
of a reductionist program since at least the middle of
the nineteenth century. But the development of molecular biology in the 1950s and 1960s encouraged the notion that a fully quantitative theoretical biology was on
the horizon. The computer was to be the motor for this
change. Analogies were drawn between highly mathematized and experimentally based Big Physics and the
anticipated “Big Biology.” As Lee B. Lusted, the chairman of the Advisory Committee to the National Research Council on Electronic Computers in Biology and
Medicine argued, because of the high cost of computer
facilities for conducting biological research, computer
facilities would be to biology what SLAC (the Stanford
Linear Accelerator) and the Brookhaven National Laboratory were to physics (Ledley, 1965, pp. ix-x). Robert
Ledley, then affiliated with the Division of Medical Sciences, National Research Council, and author of the
volume, expressed the committee's interest in fostering
computing and insisted that biology was on the threshold of a new era. New emphasis on quantitative work
and experiment was changing the conception of the biologist: the view of the biologist as an individual scientist, personally carrying through each step of his investigation and his data-reduction processes, was rapidly
broadened to include the biologist as a part of an intricate organizational chart that partitions scientific, technical, and administrative responsibilities. In the new
organization of biological work, modeled on the physicists’ work at large national labs, the talents and knowledge of the biologist “must be augmented by those of
the engineer, the mathematical analyst, and the professional computer programmer” (Ledley, 1965, p. xi).
At the UCLA meeting Bruce Waxman held up as a
model the work on three-dimensional representations
of protein molecules carried out by Cyrus Levinthal.
Levinthal worked with the facilities of MIT’s Project on
Mathematics and Computation (MAC), one of the first
centers in the development of graphics. Levinthal’s
project was an experiment in computer time-sharing
linking biologists, engineers, and mathematicians in the
construction of Big Biology. Levinthal's work at MIT
illustrates the role of computer visualization as a condition for theory development in molecular biology and
biochemistry.
Since the work of Linus Pauling and Robert Corey
on the α-helical structure of most protein molecules in
1953, models have played a substantial role in biochemistry. Watson and Crick‘s construction of the double helix
model for DNA depended crucially upon the construction of a physical model. Subsequently, work in the field
of protein biology has demonstrated that the functional
properties of a molecule depend not only on the interlinkage of its chemical constituents but also on the way
in which the molecule is configured and folded in three
dimensions. Much of biochemistry has focused on understanding the relationship between biological function and molecular conformational structure.
A milestone in the making of physical models (in
three dimensions) of molecules was John Kendrew’s construction of myoglobin. The power of models in investigations of biomolecular structure was evident from
work such as this, but such tools had limitations as well.
Kendrew’s model, for instance, was the first successful
attempt to build a physical model into a Fourier map of
a molecule’s electron densities derived from X-ray crystallographic sources. As a code for electron density, clips
of different colors were put at the proper vertical positions on a forest of steel rods. A brass wire model of the
alpha helices and beta sheets that make up the molecule
was then built in among the rods. Mechanical interference made it difficult to adjust the structure, and the
model was hard to see because of the large number of
supporting rods. The model incorporated both too little
and too much: too little, in that the basic shape of the
molecule was not represented; too much, in that the forest of rods made it difficult to see the three-dimensional
folding of the molecule (even though bond connectivity was represented). Perhaps the greatest drawback
was the model’s size: It filled a large room. The answer
to these problems was computer representation. For an
early stereogram of myoglobin constructed by computer on the basis of Kendrew’s data, see Watson (1969).
It was obvious that such three-dimensional representations would only become really useful when it was possible to manipulate them at will. Proponents of computer graphics argued that this flexibility is exactly what
computer representations of molecular structure would
allow. Cyrus Levinthal first illustrated these methods in
1965.
Levinthal reasoned that since protein chains are
formed by linking molecules of a single class, amino
acids, it should be relatively easy to specify the linkage
process in a form mathematically suitable for a digital
computer (Levinthal, 1966). Initially the computer
model considers the molecule as a set of rigid groups of
constant geometry linked by single bonds around which
rotation is possible. Program input consists of a set of
coordinates consistent with the molecular stereochemistry as given in data from X-ray crystallographic studies. Several constraints delimit stable configurations
among numerous possibilities resulting from combinations of linkages among the twenty different amino
acids. These include bond angles, bond lengths, van der
Waals radii for each species of atom, and the planar
configuration of the peptide bond.
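To make the representation just described concrete, here is a small illustrative sketch in modern Python. It is not Levinthal's code (CHEMGRAF ran on 1960s time-sharing hardware); the names VDW_RADII and acceptable and the sample torsion angles are hypothetical, chosen only to show how a conformation reduces to a set of rotatable angles while bond geometry and van der Waals contacts act as fixed constraints.

```python
import math

# Illustrative sketch only, not CHEMGRAF: the chain is treated as rigid peptide
# units of constant geometry, so a conformation is just the list of rotatable
# torsion angles; bond lengths, bond angles, and the planar peptide bond are
# fixed constraints rather than free variables.

VDW_RADII = {"C": 1.7, "N": 1.55, "O": 1.52}   # angstroms, typical textbook values

def acceptable(atoms, scale=0.9):
    """Crude stereochemical filter: reject a conformation if any two non-bonded
    atoms approach more closely than the (scaled) sum of their van der Waals radii."""
    for i in range(len(atoms)):
        for j in range(i + 2, len(atoms)):      # skip i+1, the directly bonded neighbor
            elem_i, pos_i = atoms[i]
            elem_j, pos_j = atoms[j]
            if math.dist(pos_i, pos_j) < scale * (VDW_RADII[elem_i] + VDW_RADII[elem_j]):
                return False
    return True

# A trial conformation: one (phi, psi) pair per residue, here an idealized helix.
conformation = {"phi": [-57.0] * 10, "psi": [-47.0] * 10}
```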
Molecular biologists, particularly the biophysicists
among them, were motivated to build a unified theory,
and the process of writing a computer program that could
simulate protein structure would assist in this goal by
providing a framework of mental and physical discipline
from which would emerge a fully mathematized theoretical biology. In such non-mathematized disciplines as
biology, the language of the computer program would
serve as the language of science (Oettinger, 1966, p.
161). But there was a hitch: In an ideal world dominated by a powerful central theory, one would like, for
example, to use the inputs of xyz coordinates of the atoms, types of bond, and so forth, to calculate the pairwise
interaction of atoms in the amino acid chain, predict
the conformation of the protein molecule, and check
this against its corresponding X-ray crystallographic image. As described, however, the parameters used as input in the computer program do not provide much limitation on the number of molecular conformations. Other
sorts of input are needed to filter among the myriad
possible structures. Perhaps the most important of these
is energy minimization. In explaining how the thousands
of atoms in a large protein molecule interact with one
another to produce a stable conformation, one hypothesizes that, like water running downhill, the molecular
string will fold to reach a lowest energy level. To carry
out this sort of minimization would entail calculating
the interactions of all pairs of active structures in the
chain, minimizing the energy corresponding to these
interactions over all possible configurations, and then
displaying the resulting molecular picture. Unfortunately,
this objective could not be achieved, as Levinthal noted,
because a formula describing such interactions could not,
given the state of molecular biological theory in 1965,
even be stated, let alone be manipulated with a finite
amount of labor. In Levinthal’s words:
The principal problem, therefore, is precisely how to
provide correct values for the variable angles. . . . I should
emphasize the magnitude of the problem that remains
even after one has gone as far as possible in using chemical constraints to reduce the number of variables from
several thousand to a few hundred. . . . I therefore decided to develop programs that would make use of a man-computer combination to do a kind of model-building
that neither a man nor a computer could accomplish
alone. This approach implies that one must be able to
obtain information from the computer and introduce
changes in the way the program is running in a span of
time that is appropriate to human operation. This in turn
suggests that the output of the computer must be presented not in numbers but in visual form. (Levinthal,
1966, pp. 48-49)
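Read in today's terms, the strategy described above amounts to an iterative minimization over the rotatable angles, with a human operator allowed to interrupt and redirect the search. The following toy Python loop is an interpretive sketch of that idea, not the historical program; the energy function, step size, and should_stop callback are placeholders.

```python
import random

def minimize(angles, energy, step=5.0, iterations=1000,
             should_stop=lambda i, e: False):
    """Greedy descent over torsion angles (in degrees): perturb one angle at a
    time and keep the change only if the total energy drops ("water running
    downhill"). `should_stop` stands in for the operator who could halt or
    redirect the run at any point."""
    current = list(angles)
    best_energy = energy(current)
    for i in range(iterations):
        if should_stop(i, best_energy):
            break
        k = random.randrange(len(current))
        trial = list(current)
        trial[k] += random.uniform(-step, step)
        trial_energy = energy(trial)
        if trial_energy < best_energy:
            current, best_energy = trial, trial_energy
    return current, best_energy
```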
In Levinthal’s view, visualization generated in real-time
interaction between human and machine can assist
theory construction. The computer becomes in effect
both a microscope for examining molecules and a
laboratory for quantitative experiment. Levinthal’s program, CHEMGRAF, could be programmed with sufficient structural information as input from physical and
chemical theory to produce a trial molecular configuration as graphical output. A subsystem called SOLVE then
packed the trial structure by determining the local minimum energy configuration due to non-bonded interactive forces. A subroutine of this program, called ENERGY, calculated the torque vector caused by the atomic
pair interactions on rotatable bond angles. An additional
procedure for determining the conformation of the
model structure was “cubing.” This procedure searched
for nearest neighbors of an atom in the center of a
3 x 3 x 3 cube and reported whether any atoms were in
the twenty-six adjacent cubes. The program checked for
atom pairs in the same or adjacent cubes and for atoms
within a specified distance. It maintained a list of pairs
that were, for instance, in contact violation, while another routine calculated energy contribution of the pair
to the molecule. The cubing program rejected as early
as possible all those atom pairs where the interatomic
distance was too great to be of more than negligible
contribution, and it enabled more efficient use of computer time.

Levinthal emphasized that interactivity was a crucial component of CHEMGRAF. Built into his system
was the requirement of observing the result of the calculations interactively so that one could halt the minimization process at any step, either to terminate it completely or to alter the conformation and then resume it
(Katz & Levinthal, 1972). Levinthal noted that often,
as the analytical procedures were grinding on, a molecule would be trapped in an unfavorable conformation
or in a local minimum and the problem would be obscure until the conformation could be viewed three-dimensionally. CHEMGRAF enabled the investigator
to assist in generating the local minimization of energy
for a subsection of the molecule through three different
types of user-guided empirical manipulation of structure: "close," "glide," and "revolve." These manipulations in effect introduced external "pseudo-energy" terms
into the computation that pulled the structure in various ways (Levinthal, Barry, Ward, & Zwick, 1968). Atoms could be rotated out of the way by direct command
and a new starting conformation chosen from which to
continue the minimization procedure. By pulling individual atoms to specific locations indicated by experimental data from X-ray diffraction studies, a fit between
X-ray crystallographic data and the computer model of
a specific protein, such as myoglobin, could ultimately
be achieved. With the model of the target molecule in hand, one could then proceed to investigate the various energy terms involved in holding
the protein molecule together. Thus, the goal of this interactive effort involving human and machine was eventually to generate a theoretical formulation for the lowest energy state of a protein molecule, to predict its
structure, and to have that prediction confirmed by
X-ray crystallographic images (Hall, 1995).
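The "cubing" procedure is recognizable today as a cell-list neighbor search. The Python sketch below is a modern, hypothetical reconstruction of that idea rather than the CHEMGRAF routine itself; the 5-angstrom cutoff and the function names are assumptions. Atoms are hashed into cubes whose edge equals the interaction cutoff, so each atom is tested only against atoms in its own cube and the twenty-six adjacent ones, and distant pairs are discarded before any energy term is evaluated.

```python
from collections import defaultdict
from itertools import product
import math

def build_cubes(atoms, edge):
    """Hash each atom index into the integer-indexed cube that contains it."""
    cubes = defaultdict(list)
    for idx, (x, y, z) in enumerate(atoms):
        cubes[(int(x // edge), int(y // edge), int(z // edge))].append(idx)
    return cubes

def neighbor_pairs(atoms, cutoff=5.0):
    """Yield atom-index pairs closer than `cutoff`, checking only the same cube
    and the 26 adjacent cubes so that distant pairs are rejected early."""
    cubes = build_cubes(atoms, cutoff)
    for (cx, cy, cz), members in cubes.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            others = cubes.get((cx + dx, cy + dy, cz + dz), [])
            for i in members:
                for j in others:
                    if i < j and math.dist(atoms[i], atoms[j]) < cutoff:
                        yield i, j
```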
The enormous number of redundant trial calculations involved in Levinthal’s work hints at the desirability of combining an expert system with a visualization system. E. J. Corey and W. Todd Wipke, working
nearby at Harvard, took this next step. (Space limitations prevent me from discussing their work here.) In
developing their work, Wipke and Corey drew upon a
prototype expert system at Stanford called DENDRAL,
the result of a collaboration at Stanford among computer scientist Edward Feigenbaum, biologist Joshua
Lederberg, and organic chemist Carl Djerassi, working
on another of the NIH initiatives to bring computers
directly into the laboratory. The Stanford project, called
DENDRAL, was an early effort in the field of what
Feigenbaum and his mentors Herbert Simon and Marvin
Minsky termed “knowledge engineering.” In effect, it
attempted to put the human inside the machine.
DENDRAL: The AI Approach at Stanford
DENDRAL aimed at emulating an organic chemist operating in the harsh environment of Mars (Lederberg,
n.d.; Lederberg, Sutherland, Buchanan, & Feigenbaum,
1969). The ultimate goal was to create an automated
laboratory as part of the Viking mission planned to land
a mobile instrument pod on Mars in 1975. Given the
mass spectrum of an unknown compound, the specific
goal was to determine the structure of the compound.
To accomplish this, DENDRAL would analyze the data,
generate a list of plausible candidate structures, predict
the mass spectra of those structures from the theory of
mass spectrometry, and select as a hypothesis the structure whose spectrum most closely matched the data.
A key part of this program was the representation
of chemical structure in terms of topological graph
theory. Chemical graphs were the visual “language” to
augment the theoretical and practical knowledge of the
chemist with the calculating power of the computer. This
part of the effort was contributed by Lederberg, the winner of the 1958 Nobel Prize in Physiology or Medicine,
for his work on genetic exchange in bacteria, who had
been interested in the introduction of information concepts into biology for most of his professional life. Self-described as a man with a Leibnizian dream of a universal calculus for the alphabet of human thought,
Lederberg's interest in mass spectrometry and topological mapping of molecules was in part driven by the dream
of mathematizing biology, starting with organic chemistry. The structures of organic molecules are bewilderingly complex, and the “theory” of organic chemistry
does not have an elegant axiomatic structure analogous,
say, to Newtonian mechanics, even though it is sprinkled
with lots of theory derived from quantum mechanics
and thermodynamics. Lederberg felt that a first step toward such a quantitative, predictive theory would be a
rational systematization of organic chemistry. Trampling
upon a purist‘s notion of theory, Lederberg thought that
computers were the royal road to mathematization in
chemistry:
Could not the computer be of great assistance in the
elaboration of novel and valid theories? I can dream of
machines that would not only execute experiments in
physical and chemical biology but also help design them,
subject to the managerial control and ultimate wisdom
of their human programmer. (Lederberg, 1969)
Mass spectrometry, the area upon which Feigenbaum
and Lederberg concentrated with Carl Djerassi, was a
particularly appropriate challenge. It differed in at least
one crucial aspect from the molecular modeling of proteins I have considered above. Whereas in those areas a
well-understood theory, such as the quantum mechanical theory of the atomic bond, was the basis for developing the computer program to examine effects in large
calculations, there was no theory of mass spectrometry
that could be transferred to the program from a textbook (Lederberg, Sutherland, Buchanan, & Feigenbaum,
1969). The field has bits of theory to draw upon, but it
has developed mainly by following rules of thumb, which
are united in the form of the chemist-expert. The field
thrives on tacit knowledge. The following excerpt from
a memo by Feigenbaum written after his first meetings
with Lederberg on the DENDRAL project provides a
vivid sense of the objective and the problems faced:
The main assumption we are operating under is that the
required information is buried in chemists' brains if only
we can extract it. Therefore, the initiative for the interaction must come from the system not the chemist, while
allowing the chemist the flexibility to supply additional
information and to modify the question sequence or content of the system. . . . What we want to design then is a
question asking system [that] will gather rules about the
feasibility of the chemical molecules and their subgraphs
being displayed. ("Second Cut," n.d.)
In short, Feigenbaum sought to emulate a gifted chemist with the computer. That chemist was Carl Djerassi,
nicknamed “El Supremo” by his graduate and postdoctoral students. Djerassi’s astonishing achievements
as a mass spectrometrist relied on his abilities to feel his
way through the process without the aid of a complete
theory, relying rather on experience, tacit knowledge,
hunches, and rules of thumb. In interviews Feigenbaum
elicited this kind of information from Djerassi, in a process that heightened awareness of the structure of the
field for both participants. The process of involving a
computer in chemical research in this way organized a
variety of kinds of information, which constituted a crucial step toward theory.
A Paradigm Shift in Biology
Thus far I have been considering efforts to predict structure from physical principles as the first path through
which computer science and computer-based information technology began to reshape biology. The Holy Grail
of biology has always been the construction of a mathematized theoretical biology, and for most molecular
biologists the journey there has been directed by the
notion that the information for the three-dimensional
folding and structure of proteins is uniquely contained
in the linear sequence of their amino acids (Anfinsen,
1973). As we have seen, the molecular dynamics approach assumed that if all the forces between atoms in a
molecule, including bond energies and electrostatic attraction and repulsion, are known, then it is possible to
calculate the three-dimensional arrangement of atoms
that requires the least energy. Christian B. Anfinsen
(1973) discussed the work for which he was awarded
the Nobel Prize in chemistry in 1972:
This hypothesis (the "thermodynamic hypothesis") states that the three-dimensional structure of a native protein in its normal physiological milieu . . . is the one in which the Gibbs free energy of the whole system is lowest; that is, that the native conformation is determined by the totality of interatomic interactions and hence by the amino acid sequence, in a given environment. (p. 223)
Because this method requires intensive computer calculations, shortcuts have been developed that combine
computer-intensive molecular dynamics computations,
artificial intelligence, and interactive computer graphics
in deriving protein structure directly from chemical
structure.
While theoretically elegant, the determination of
protein structure from chemical and dynamical principles
has been hobbled by difficulties. In the abstract, analysis of physical data generated from protein crystals, such
as X-ray and nuclear magnetic resonance data, should
offer rigorous ways to connect primary amino acid sequences to three-dimensional structure. But the problems of acquiring good crystals and the difficulty of getting NMR data of sufficient resolution are impediments
to this approach. Moreover, while quantum mechanics
provides a solution to the protein-folding problem in
theory, the computational task of predicting structure
from first principles for large protein molecules containing many thousands of atoms has proved impractical.
Furthermore, unless it is possible to grow large, well-ordered crystals of a given protein, X-ray structure
determination is not an option. The development of
methods of structure determination by high-resolution
two-dimensional NMR has alleviated this situation
somewhat, but this technique is also costly and time-consuming, requiring large amounts of protein of high
solubility, and is severely limited by protein size. These
difficulties have contributed to the slow rate of progress
in registering atomic coordinates of macromolecules.
The difficulty of pursuing this approach alone is suggested by the relatively slow growth
of databanks of atomic coordinates for proteins. The
Protein Data Bank (PDB) was established in 1971 as a
computer-based archival resource for macromolecular structures. The purpose of the PDB was to collect,
standardize, and distribute atomic coordinates and other
data from crystallographic studies. In 1977 the PDB
listed atomic coordinates for forty-seven macromolecules
(Bernstein et al., 1977). In 1987 that number began to
increase rapidly at a rate of about 10 percent per year
because of the development of area detectors and widespread use of synchrotron radiation; by April 1990
atomic coordinate entries existed for 535 macromolecules. Commenting on the state of the art in 1990,
Holbrook and colleagues (1993) noted that crystal determination could require one or more man-years. Currently (1999), the PDB’s Biological Macromolecule
Crystallization Database (BMCD) contains entries for
2,526 biological macromolecules for which diffraction
quality crystals have been obtained. These include proteins, protein:protein complexes, nucleic acid, nucleic
acid:nucleic acid complexes, protein:nucleic acid complexes, and viruses. (The Biological Macromolecule Crystallization Database and the NASA Archive for Protein Crystal Growth Data, version 2.00, are located on the Web at http://www.bmcd.nist.gov:8080/bmcd/bmcd.html.)
While structure determination was moving at a
snail’s pace, beginning in the 1970s, another stream of
work contributed to the transformation of biology into
an information science. The development of restriction
enzymes, recombinant DNA techniques, gene cloning
techniques, and polymerase chain reactions (PCRs) resulted in a flood of data on DNA, RNA, and protein
sequences. Indeed, more than 140,000 genes were cloned
and sequenced in the twenty years from 1974 to 1994,
of which more than 20 percent were human genes
(Brutlag, 1994, p. 159). By the early 1990s, well before
the beginning of the Human Genome Initiative, the NIH
GenBank database (release 70) contained more than
74,000 sequences, while the Swiss Protein database
(Swiss-Prot) included nearly 23,000 sequences. Protein
databases were doubling in size every twelve months,
and some were predicting that by the year 2000 ten
million base pairs a day would be sequenced as a result
of the technological impact of the Human Genome
Initiative. Such an explosion of data encouraged the development of a second approach to determining the function and structure of protein sequences: namely, prediction from sequence data alone. This “bioinformatics”
approach identifies the function and structure of unknown proteins by applying search algorithms to existing protein libraries in order to determine sequence similarity, percentages of matching residues, and the statistical
significance of each database sequence.
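As a rough illustration of what such a search involves, the Python sketch below ranks the entries of a small in-memory library by percent identity to a query over ungapped alignments. It is a deliberately naive stand-in for the search tools of the period (and for later programs such as FASTA and BLAST, which use far more sophisticated scoring and statistics); the library contents are invented for the example.

```python
def percent_identity(query, target):
    """Best ungapped overlap: slide the query along the target and return the
    highest fraction of matching residues, scored over the query length."""
    best = 0.0
    for offset in range(len(target) - len(query) + 1):
        matches = sum(q == t for q, t in zip(query, target[offset:offset + len(query)]))
        best = max(best, matches / len(query))
    return best

def search_library(query, library):
    """Rank every library entry by percent identity to the query."""
    hits = [(name, percent_identity(query, seq)) for name, seq in library.items()]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Toy example with invented sequences.
library = {"protA": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
           "protB": "MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDI"}
print(search_library("KQRQISFVK", library))
```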
A key project illustrating the ways in which computer science and molecular biology began to merge in
the formation of bioinformatics was the MOLGEN
project at Stanford and events related to the formation
and subsequent development of BIONET. MOLGEN
was a continuation of the projects in artificial intelligence and knowledge engineering begun at Stanford with
DENDRAL. MOLGEN was started in 1975 as a project
in the Heuristic Programming Project with Edward
Feigenbaum as principal investigator directing the thesis projects of Mark Stefik and Peter Friedland (Feigenbaum & Martin, 1977). The aim of MOLGEN was to
model the experimental design activity of scientists in
molecular genetics (Friedland, 1979). Before an experimentalist sets out to achieve some goal, he produces a
working outline of the experiment, guiding each step of
the process. The central idea of MOLGEN was based
on the observation that scientists rarely plan from scratch
in designing a new experiment. Instead, they find a skeletal plan, an overall design that has worked for a related
or more abstract problem, and then adapt it to the particular experimental context. Like DENDRAL, this approach is heavily dependent upon large amounts of
domain-specific knowledge in the field of molecular biology and even more upon good heuristics for choosing
among alternative implementations.
MOLGEN’s designers chose molecular biology as
appropriate for the application of artificial intelligence
because the techniques and instrumentation generated
in the 1970s seemed ripe for automation. The advent of
rapid DNA cloning and sequencing methods had had
an explosive effect on the amount of data that could be
most readily represented and analyzed by a computer.
Moreover, it appeared that very soon progress in analyzing information in DNA sequences would be limited by
the lack of an appropriate combination of search and
statistical tools. MOLGEN was intended to apply rules
to detect profitable directions for analysis and to reject
unpromising ones (Feigenbaum et al., 1980).
Peter Friedland was responsible for constructing the
knowledge-base component of MOLGEN. Though not
himself a molecular biologist, he made a major contribution to the field by assembling the rules and techniques of molecular biology into an interactive, computerized system of analytical programs. Friedland
worked with Stanford molecular biologists Douglas
Brutlag, Laurence Kedes, John Sninsky, and Rosalind
Grymes, who provided expert knowledge on enzymatic
methods, nucleic acid structures, detection methods, and
pointers to key references in all areas of molecular biology. Along with providing an effective encyclopedia
of information about technique selection in planning a
laboratory experiment, the knowledge base contained a
number of tools for automated sequence analysis. Brutlag, Kedes, Sninsky, and Grymes were interested in having a battery of automated tools for sequence analysis,
and they contracted with Friedland and Stefik, both gifted computer program designers, to build these tools
in exchange for contributing their expert knowledge to
the project (Douglas Brutlag, personal communication;
Peter Friedland, personal communication). (In 1987,
after his work on MOLGEN and at IntelliGenetics [discussed below], Friedland went on to become chief scientist at the NASA-Ames Laboratory for Artificial Intelligence.)
This collaboration of computer scientists and molecular biologists helped move biology along the road to
becoming an information science. Among the programs
Friedland and Stefik created for MOLGEN was SEQ,
an interactive self-documenting program for nucleic acid
sequence analysis, which had thirteen different procedures with over twenty-five different subprocedures,
many of which could be invoked simultaneously to provide various analytical methods for any sequence of interest. SEQ brought together in a single program methods for primary sequence analysis described in the
literature by L. J. Korn and colleagues, R. Staden, and
numerous others (Korn, Queen, & Wegman, 1977; Staden, 1977; Staden, 1978; Staden, 1979). SEQ also
performed homology searches on DNA sequences and
specified the degree of homology, and conducted dyad
symmetry (inverted repeats) searches (Friedland, Brutlag,
Clayton, & Kedes, 1982). Another feature of SEQ was
its ability to prepare restriction maps with the names
and locations of the restriction sites marked on the nucleotide sequence. In addition it had a facility for calculating the length of DNA fragments from restriction digests of any known sequence. Another program in the
MOLGEN suite was GA1 (later called MAP). Constructed by Stefik, GA1 was an artificial intelligence program that allowed the generation of restriction enzyme
maps of DNA structures from segmentation data (Stefik, 1977). It would construct and evaluate all logical
alternative models that fit the data and rank them in
relative order of fit. A further program in MOLGEN
was SAFE, which aided in enzyme selection for gene
excision. SAFE took amino acid sequence data and predicted the restriction enzymes guaranteed not to cut
within the gene itself.
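To give a concrete flavor of the kinds of analysis attributed to SEQ and its companions, here is a small hypothetical Python sketch of two of them: locating the recognition sites of a restriction enzyme along a sequence and finding short inverted repeats (dyad symmetries). It is a modern illustration under simplifying assumptions (exact matches only, no ambiguity codes), not a reconstruction of the MOLGEN programs.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def restriction_sites(seq, recognition="GAATTC"):
    """Return 0-based positions of the recognition sequence
    (GAATTC, the EcoRI site, is used here only as an example)."""
    return [i for i in range(len(seq) - len(recognition) + 1)
            if seq[i:i + len(recognition)] == recognition]

def inverted_repeats(seq, arm=4):
    """Find dyad symmetries: positions where an `arm`-long segment is
    immediately followed by its own reverse complement."""
    hits = []
    for i in range(len(seq) - 2 * arm + 1):
        left, right = seq[i:i + arm], seq[i + arm:i + 2 * arm]
        if right == reverse_complement(left):
            hits.append(i)
    return hits

example = "TTGAATTCAAGGGCCCTT"            # invented sequence
print(restriction_sites(example))         # [2]: the EcoRI site GAATTC
print(inverted_repeats(example))          # [1, 9]: dyads TGAATTCA and AGGGCCCT
```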
In its first phase of development (1977-1980)
MOLGEN consisted of the programs described above
and a knowledge base containing information on about
three hundred laboratory methods and thirty strategies
for using them. It also contained the best currently available data on about forty common phages, plasmids,
genes, and other known nucleic acid structures. The second phase of development beginning in 1980 scaled up
the analytical tools and the knowledge base. Perhaps the
most significant aspect of the second phase was making MOLGEN available to the scientific community at large on the Stanford University Medical Experimental national computer resource, SUMEX-AIM. SUMEX-AIM, supported by the Biotechnology Resources Program at NIH since 1974, had been home to DENDRAL
and several other programs. The new experimental resource on SUMEX, comprising the MOLGEN programs
and access to all major genetic databases, was called
GENET. In February 1980 GENET was made available to a carefully limited community of users (Rindfleisch, Friedland, & Clayton, 1981).
MOLGEN and GENET were immediate successes
with the molecular biology community. In their first few
months of operation in 1980 more than two hundred
labs (with several users in each of those labs) accessed
the system. By 1 November 1982 more than three hundred labs from a hundred institutions were accessing the system around the clock (Douglas Brutlag,
system from a hundred institutions (Douglas Brutlag,
personal communication; NIH Special Study Section,
1983; Lewin, 1984). Traffic on the site was so heavy
that restrictions had to be implemented and plans for
expansion considered. In addition to the academic users
a number of biotech firms, such as Monsanto, Genentech, Cetus, and Chiron, used the system heavily. Feigenbaum, principal investigator in charge of the SUMEX
resource, and Thomas Rindfleisch, facility manager, decided to exclude commercial users in order to ensure
that the academic community had unrestricted access
to the SUMEX computer and to answer the NIH's concern that commercial users would gain unfair access to the resource (Maxam to GENET community, 1982).
To provide commercial users with their own unrestricted access to the GENET and MOLGEN programs,
Brutlag, Feigenbaum, Friedland, and Kedes formed a
company, IntelliGenetics, which would offer the suite
of MOLGEN software for sale or rental to the emerging
biotechnology industry, With 125 research labs doing
recombinant DNA research in the United States alone
and a number of new genetic engineering firms starting
up, opportunities looked outstanding. No one was currently supplying software in this rapidly growing genetic
engineering marketplace. With their exclusive licensing
arrangement with Stanford for the MOLGEN software,
IntelliGenetics was poised to lead a huge growth area.
The business plan expressed well the excellent position
of the company:
A major key to the success of IntelliGenetics will be the
fact that the recombinant DNA research revolution is so
recent. While every potential customer is well capitalized, few have the manpower they say they need; this
year several firms are hiring 50 molecular geneticist
Ph.D.s, and one company speaks of 1000 within five
years. These firms require computerized assistance, for the storage and analysis of very large amounts of DNA sequence information which is growing at an exponential rate, and will continue to do so for the foreseeable
future (10 years). Access to this information and the ability to perform rapid and efficient pattern recognition
among these sequences is currently being demanded by
most of the firms involved in recombinant DNA research.
The programs offered by IntelliGenetics will enable
the researchers to perform tasks that are: 1) virtually impossible to perform with hand calculations, and 2) extremely time-consuming and prone to human error. In
other words, IntelliGenetics offers researcher productivity improvement to an industry with expanding demand for more
researchers which is experiencing a severe supply shortage
[emphasis in original]. (“Business plan for IntelliGenetics,”
1981; Friedland to Reimers, 1984)

(Details of the software licensing arrangement and the revenues generated are discussed in a letter to Niels Reimers, at the Stanford Office of Technology Licensing, on the occasion of renegotiating the terms.)
The resource that IntelliGenetics eventually offered to
commercial users was BIONET. Like GENET, BIONET
combined all databases of DNA sequences with programs
to aid in their analysis in one computer site.
Prior to the startup of BIONET and contemporaneous with GENET, other resources for DNA sequences
were developed. Several researchers were making their
databases available. Under the auspices of the National
Biomedical Research Foundation, Margaret Dayhoff had
created a database of DNA sequences and some software for sequence analysis that was marketed commercially. Walter Goad, a physicist at Los Alamos National
Laboratory, collected DNA sequences from the published
literature and made them freely available to researchers.
But by the late 1970s the number of bases sequenced
was already approaching three million and expected to
double soon. Some form of easy communication between
labs for effective data handling was considered a major
priority in the biological community. While experiments were going on with GENET, a number of nationally prominent molecular biologists had been pressing to start an NIH-sponsored central repository for
DNA sequences. In 1979, at Rockefeller University,
Joshua Lederberg organized an early meeting with such
an agenda. The proposed NIH initiative was originally
supposed to be coordinated with a similar effort at the
European Molecular Biology Laboratory (EMBL) in
Heidelberg, but the Europeans became dissatisfied with
the lack of progress on the American end and decided
to go ahead with their own databank. EMBL announced
the availability of its Nucleotide Sequence Data Library
in April 1982, several months before the American
project was funded. Finally, in August 1982, the NIH
awarded a contract for $3 million over five years to
the Boston-based firm of Bolt, Beranek, and Newman
(BB&N) to set up the national database known as
GenBank in collaboration with Los Alamos National
Laboratory. IntelliGenetics submitted an unsuccessful
bid for that contract.
The discussions leading up to GenBank included
consideration of funding a more ambitious databank,
known as “Project 2,” which was to provide a national
center for the computer analysis of DNA sequences.
Budget cuts forced the NIH to abandon that scheme
(Lewin, 1984). However, officials there returned to it
the following year, thanks to the persistence of IntelliGenetics representatives. Although GenBank launched
a formal national DNA sequence collection effort, the
need for computational facilities voiced by molecular
biologists was still left unanswered. In September 1983,
after a review process that took over a year, the NIH
division of research resources awarded IntelliGenetics a
$5.6 million five-year contract to establish BIONET
(Lewin, 1984). The contract, the largest award of its
kind by the NIH to a for-profit organization (p. 1380),
started on 1 March 1984 and ended on 27 February
1989.
BIONET first became available to the research community in November 1984. The fee for use was $400
per year per laboratory and remained at that level
throughout its first five years. BIONET’s use grew impressively. Initially the IntelliGenetics team set the target for user subscriptions at 250 labs. However, in March
1985, the annual report for the first year’s activities of
BIONET listed 350 labs with 1,132 users. By
August 1985 that number had increased dramatically to
450 labs and 1,500 users (Minutes of the meeting,
1985). In April 1986, for example, BIONET had 464
laboratories comprising 1,589 users. By October 1986
the numbers were 495 labs and 1,716 users (BIONET
users status, 1986). By 1989, 900 laboratories in the
United States, Canada, Europe, and Japan (comprising
about 2,800 researchers) subscribed to BIONET, and
20 to 40 new laboratories joined each month (Huberman, 1989).
BIONET was intended to establish a national computer resource for molecular biology satisfying three
goals, which it fulfilled to varying degrees. A first goal
was to provide a way for academic biologists to obtain
access to computational tools to facilitate research relating to nucleic acids and possibly proteins. In addition
to giving researchers ready access to national databases
on DNA and protein sequences, BIONET would provide a library of sophisticated software for sequence
searching, matching, and manipulation. A second goal
was to provide a mechanism to facilitate research into
improving such tools. The BIONET contract provided
research and development support of further software,
both in-house research by IntelliGenetics scientists and
through collaborative ventures with outside researchers.
A third goal of BIONET was to enhance scientific productivity through electronic communications.
The stimulation of collaborative work through electronic communication was perhaps the most impressive
achievement of BIONET. BIONET was much more
than the Stanford GENET plus the MOLGEN-IntelliGenetics suite of software. Whereas GENET with
its pair of ports could accommodate only two users at
any one time, BIONET had twenty-two ports providing an estimated annual thirty thousand connect hours
(Friedland, 1984; Smith, Brutlag, Friedland, & Kedes,
1986). All subscribers to BIONET were provided with
e-mail accounts. For most molecular biologists this was
something entirely new, since most university labs were
just beginning to be connected with regular e-mail service. At least twenty different bulletin boards on numerous topics were supported by BIONET. In an effort
to change the culture of molecular biologists by accustoming them to the use of electronic communications
and more collaborative work, BIONET users were required to join one of the bulletin board groups.
BIONET subscribers had access to the latest versions of the most important databases for molecular
biology. Large databases available at BIONET were GenBank, the National Institutes of Health DNA sequence
library; EMBL, the European Molecular Biology
Laboratory nucleotide sequence library; NBRF-PIR, the
National Biomedical Research Foundation's protein sequence database, which is part of the Protein Identification Resource [PIR] supported by NIH's Division of
Research Resources; SWISS-PROT, a protein sequence
database founded by Amos Bairoch of the University of
Geneva and subsequently managed and distributed by
the European Molecular Biology Laboratory; VectorBank, IntelliGenetics’ database of cloning vector restriction maps and sequences; Restriction Enzyme Library, a
complete list of restriction enzymes and cutting sites
provided by Richard Roberts at Cold Spring Harbor;
and Keybank, IntelliGenetics’ collection of predefined
patterns or “keys” for database searching. Several smaller
databases were also available, including a directory of
molecular biology databases, a collection of literature
references to sequence analysis papers, and a complete
set of detailed protocols for use in a molecular biological laboratory (especially for Escherichia coli and yeast
work) (IntelliGenetics, 1987, p. 23).
Perhaps the most important contribution made by
BIONET to establishing molecular biology as an information science did not materialize until the period of
the second contract for GenBank. As described above,
BB&N was awarded the first five-year contract to manage GenBank. The contract was up for renewal in 1987,
and on the basis of its track record in managing
BIONET, IntelliGenetics submitted a proposal to manage GenBank. GenBank users had become dissatisfied
with the serious delay in sequence data publication.
GenBank was two years behind in disseminating sequence data it had received (Douglas Brutlag, personal
communication, 19 June 1999). At a meeting in Los
Alamos in 1986, Walter Goad noted that GenBank had
twelve million base pairs. Other sequence collections
available to researchers contained fourteen to fifteen million base pairs, so that GenBank was at least 14 to 20
percent out of date (Boswell, 1987). Concerned that
researchers would turn to other, more up-to-date data
sources, the NIH listed encouraging use as one of the
issues they wanted IntelliGenetics to address in their
proposal to manage GenBank (Duke, 1987).
IntelliGenetics proposed to solve this problem by
automating the submission of gene and protein sequences. The standard method up to that time required
an employee at GenBank to search the published scientific literature laboriously for sequence data, rekey these
into a GenBank standard electronic format, and check
them for accuracy. IntelliGenetics would automate the
submission procedure with an online submission program, XGENPUB (later called “AUTHORIN”).
In fact, IntelliGenetics was already progressing toward automating all levels of sequence entry and (as
much as possible) analysis. As early as 1986 IntelliGenetics included SEQIN in PC/GENE, its commercial software package designed for microcomputers.
SEQIN was designed for entering and editing nucleic
acid sequences, and it already had the functionality
needed to deposit sequences with GenBank or EMBL
electronically (“PC/Gene,” 1986). Transferring this program to the mainframe was a straightforward move. Indeed the online entry of original sequence data was already a feature of BIONET, since large numbers of
researchers were using the IntelliGenetics GEL program
on the BIONET computer. GEL was a program that
accepted and analyzed data produced by all the popular
sequencing methods. It provided comprehensive recordkeeping and analysis for an entire sequencing project
from start to finish. The final product of the GEL program was a sequence file suitable for analysis by other
programs, such as SEQ.XGENPUB, extended to this
capability by allowing the scientist to annotate asequence
according to the standard GenBank format and mail the
sequence and its annotation to GenBank electronically.
The interface was a forms-oriented display editor that
would automatically insert the sequence in the appropriate place in the form by copying the sequence from a
designated file on the BIONET computer. When completed, it could be forwarded to the GenBank computer
at Los Alamos, the National Institutes of Health DNA sequence library; EMBL, the nucleotide sequence database from the European Molecular Biology Laboratory;
and NBRF-PIR, the National Biomedical Research
Foundation’s protein sequence database (Brutlag & Kristofferson, 1988).
Creating a new culture requires both carrot and stick.
Making the online programs available and easy to use
was one thing. Getting all molecular biologists to use
them was another. In order to doubly encourage molecular biologists to comply with the new procedure of
submitting their data online, the major molecular biology journals agreed to require evidence that data had
been so submitted before they would consider a manuscript for review. Nucleic Acids Research was the first journal to enforce this transition to electronic data submission (Brutlag & Kristofferson, 1988). With these new
policies and networks in place, BIONET was able to
reduce the time from submission to publication and dis-
tribution of new sequence data from two years to twenty-four hours. As noted above, just a few years earlier, at
the beginning of BIONET, there were only ten million
base pairs published, and these had been the result of
several years’ effort. The new electronic submission of
data generated ten million base pairs a month (Douglas
Brutlag, personal communication, 19 June 1999; "Nomination for Smithsonian-ComputerWorld Award," n.d.).
Walter Gilbert may have angered some of his colleagues
at the 1987 Los Alamos Workshop on Automation in
Decoding the Human Genome when he stated that “Sequencing the human genome is not science, it is production” (Boswell, 1987). But he surely had his finger
on the pulse of the new biology.
The Matrix of Biology
The explosion of data on all levels of the biological continuum made possible by the new biotechnologies and
represented powerfully by organizations such as BIONET was a source of both exhilaration and anxiety. Of
primary concern to many biologists was how best to organize this massive outpouring of data in a way that
would lead to deeper theoretical insight, perhaps even a
unified theoretical perspective for biology. The National
Institutes of Health were among those most concerned
about these issues, and they organized a series of workshops to consider the new perspectives emerging from
recent developments. The meetings culminated in a report from a committee chaired by Harold Morowitz
titled Models for Biomedical Research: A New Perspective
(1985). The committee foresaw the emergence of a new
theoretical biology “different from theoretical physics,
which consists of a small number of postulates and the
procedures and apparatus for deriving predictions from
those postulates.” The new biology was far more than
just a collection of experimental observations. Rather it
was a vast array of information gaining coherence
through organization into a conceptual matrix (Morowitz, 1985, p. 21). A point in the history of biology had
been reached where new generalizations and higher-order biological laws were being approached but obscured by the simple mass of data and volume of literature. To move toward this new theoretical biology, the
committee proposed a multidimensional matrix of biological knowledge:
That is the complete data base of published biological
experiments structured by the laws, empirical generalizations, and physical foundations of biology and connected by all the interspecific transfers of information.
The matrix includes but is more than the computerized
data base of biological literature, since the search methods and key words used in gaining access to that base are
themselves related to the generalizations and ideas about
the structure of biological knowledge. (Morowitz, 1985,
p. 65)
New disciplinary requirements were imposed on the biologist who wanted to interpret and use the matrix of
biological knowledge:
The development of the matrix and the extraction of
biological generalizations from it are going to require a
new kind of scientist, a person familiar enough with the
subject being studied to read the literature critically, yet
expert enough in information science to be innovative
in developing methods of classification and search. This
implies the development of a new kind of theory geared
explicitly to biology with its particular theory structure.
It will be tied to the use of computers, which will be
required to deal with the vast amount and complexity of
the information, but it will be designed to search for general laws and structures that will make general biology
much more easily accessible to the biomedical scientist.
(Morowitz, 1985, p. 67)
Similar concerns about managing the explosion of
new information motivated the Board of Regents of the
National Library of Medicine. In its Long Range Plan
of 1987 the NLM drew directly on the notion of the
matrix of biological knowledge and elaborated upon it
explicitly in terms of fashioning the new biology as
an information science (Board of Regents, 1987). The
Long Range Plan contained a series of recommendations
that were the outcome of studies done by five different
panels, including a panel that considered issues connected with building factual databases, such as sequence
databases.
In the view of the panel the field of molecular biology was opening the door to an era of unprecedented
understanding and control of life processes, including
“automated methods now available to analyze and modify biologically important macromolecules” (Board of
Regents, 1987, p. 26). The report characterized biomedical databases as representing the universal hierarchy of
biological nature: cells, chromosomes, genes, proteins.
Factual databases were being developed at all levels of
the hierarchy, from cells to base-pair sequences. Because
of the complexity of biological systems, basic research
in the life sciences was increasingly dependent on automated tools to store and manipulate the large bodies of
data describing the structure and function of important
macromolecules. The NIH Long Range Plan stated, however, that the critical questions being asked could often only be answered by relating one biological level
to another, but methods for automatically suggesting
links across levels were nonexistent (Board of Regents,
1987, pp. 26-27).
A singular and immediate window of opportunity exists
for the Library in the area of molecular biology information. Because of new automated laboratory methods, genetic and biochemical data are accumulating far faster than
they can be assimilated into the scientific literature. The
problems of scientific research in biotechnology are increasingly problems of information science. By applying
its expertise in computer technologies to the work of understanding the structure and function of living cells on a
molecular level, NLM can assist and hasten the Nation’s
entry into a remarkable new age of knowledge in the biological sciences. (Board of Regents, 1987, p. 29)
To support and promote the entry into the new age of
biological knowledge, the NIH recommended building
a National Center for Biotechnology Information to
serve as a repository and distribution center for this growing body of knowledge and as a laboratory for developing new information analysis and communications tools
essential to the advance of the field. The proposal recommended $12.75 million per year for 1988-1990, with
an additional $10 million per year for work in medical
informatics (Board of Regents, 1987, pp. 46-47). The
program would emphasize collaboration between computer and information scientists and biomedical researchers. In addition the NLM would support research in
the areas of molecular biology database representation,
retrieval-linkages, and modeling systems, while examining interfaces based on algorithms, graphics, and expert
systems. The recommendation also called for the construction of online data delivery through linked regional
centers and distributed database subsets.
Brave New Theory
Two different styles of work have characterized the field
of molecular biology. The biophysical approach has sought
to predict the function of a molecule from its structure.
The biochemical approach, on the other hand, has been
concerned with predicting phenotype from biochemical
function. If there has been a unifying framework for the
field, at least from its early days up through the 1980s, it
was provided by the “central dogma” emerging from the
work of James Watson, Francis Crick, Monod, and Jacob
in the late 1960s, schematized as follows:
DNA → RNA → Protein → Function
In this paper I have singled out molecular biologists
whose Holy Grail has always been to construct a mathematized, predictive biological theory. In terms of the
“central dogma” the measure of success in the enterprise
of making biology predictive would be-and has been
since the days of Claude Bernard-rational medicine. If
one had a complete grasp of all the levels from DNA to
behavioral function, including the processes of translation at each level, then one could target specific proteins
or biochemical processes that may be malfunctioning
and design drugs specifically to repair these disorders.
For those molecular biologists with high theory ambitions, the preferred path toward achieving this goal has
been based on the notion that the function of a molecule is determined by its three-dimensional folding and
that the structure of proteins is uniquely contained in
the linear sequence of their amino acids (Anfinsen,
1973). But determination of protein structure and function is only part of the problem confronting a theoretical biology. A fully fledged theoretical biology would
want to be able to determine the biochemical function
of the protein structure as well as its expected behavioral
contribution within the organism. Thus biochemists
have resisted the road of high theory and have pursued a
solidly experimental approach aimed at eliciting common models of biochemical function across a range of
mid-level biological structures from proteins and enzymes through cells. Their approach has been to identify a gene by some direct experimental procedure-determined by some property of its product or otherwise related to its phenotype-to clone it, to sequence
it, to make its product, and to continue to work experimentally so as to seek an understanding of its function.
This model, as Walter Gilbert has observed, was suited
to “small science,” experimental science conducted in a
single lab (Gilbert, 1991, p. 99).
The emergence of organizations like the Brookhaven
Protein Data Bank in 1971, GenBank in 1982, and
BIONET in 1984, and the massive amount of sequencing data that began to become available in university
and company databases, and more recently publicly
through the Human Genome Initiative, has complicated
this picture immensely through an unprecedented influx
of new data. In the process a paradigm shift has occurred
in both the intellectual and institutional structures of
biology. According to some of the central players in this
transformation, at the core is biology’s switch from having been an observational science, limited primarily by
the ability to make observations, to being a data-bound
science limited by its practitioners’ ability to understand
large amounts of information derived from observations.
To understand the data, the tools of information science have not only become necessary handmaidens to
theory; they have also fundamentally changed the picture of biological theory itself. A new picture of theory
radically different from even the biophysicists’ model of
theory has come into view. In terms of discipline biology has become an information science. Institutionally, it is becoming “Big Science.” Gilbert characterizes
the situation sharply:
To use this flood of knowledge, which will pour across
the computer networks of the world, biologists not only
must become computer-literate, but also change their
approach to the problem of understanding life.
The next tenfold increase in the amount of information in the databases will divide the world into haves and
have-nots, unless each of us connects to that information and learns how to sift through it for the parts we
need. (Gilbert, 1991)
The new data-bound biology implied in Gilbert’s scenario is genomics. The theoretical component of genomics might be termed computational biology, while its
instrumental and experimental component might be
considered bioinformatics. The fundamental dogma of
this new biology, as characterized by Douglas Brutlag,
reformulates the central dogma of Jacob-Monod in terms
of “information flow” (Brutlag, 1994):
Genetic information → Molecular structure → Biochemical function → Biologic behavior
Walter Gilbert describes the newly forming genomic
view of biology:
The new paradigm now emerging is that all the “genes”
will be known (in the sense of being resident in databases available electronically), and that the starting point
of a biological investigation will be theoretical. An individual scientist will begin with a theoretical conjecture,
only then turning to experiment to follow or test that
hypothesis. The actual biology will continue to be done
as “small science”-depending on individual insight and
inspiration to produce new knowledge-but the reagents
that the scientist uses will include a knowledge of the
primary sequence of the organism, together with a list of
all previous deductions from that sequence. (Gilbert,
1991, p. 99)
Genomics, computational biology, and bioinformatics
restructure the playing field of biology, bringing a substantially modified toolkit to the repertoire of molecular biology skills developed in the 1970s. Along with
the biochemistry components, new skills are now required, including machine learning, robotics, databases,
statistics and probability, artificial intelligence, information theory, algorithms, and graph theory (Douglas
Brutlag, personal communication).
Proclamations of the sort made by Gilbert and other
promoters of genomics may seem like hyperbole. But
the Human Genome Initiative and the information technology that enables it have fundamentally changed molecular biology, and indeed, may suggest similar changes
in store for other domains of science. The online DNA
and protein databases that I have described have not just
been repositories of information for insertion into the
routine work of molecular biology, and the software programs discussed in connection with IntelliGenetics and
GenBank are more than retrieval aids for transporting
that information back to the lab. As a set of final reflections, I want to look in more detail at some ways this
software has been used to address the problems of molecular biology in order to gain a sense of the changes
taking place.
Biology in Silico
To appreciate the relationship between genomics and
earlier work in molecular biology, it is useful to compare approaches to the determination of structure and
function. Rather than an approach deriving structure
and function from first principles of the dynamics of
protein folding, the bioinformatics approach involves
comparing new sequences with preexisting ones and discovering structure and function by homology to known
structures. This approach examines the kinds of amino
acid sequences or patterns of amino acids found in each
of the known protein structures. The sequences of proteins whose structures have already been determined
and are already on file in the PDB are examined to infer
rules or patterns applicable to novel protein sequences
to predict their structure. For instance, certain amino
acids, such as leucine and alanine, are very common in
α-helical regions of proteins, whereas other amino acids, such as proline, are rarely if ever found in α-helices.
Using patterns of amino acids or rules based on these
patterns, the genome scientist can attempt to predict
where helical regions will occur in proteins whose structure is unknown and for which a complete sequence
exists. Clearly the lineage in this approach is work on
automated learning first begun in DENDRAL and carried forward in other AI projects related to molecular
biology such as MOLGEN.
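To make the flavor of such rule-based prediction concrete, the following Python sketch scans a sequence with a sliding window and flags windows whose average helix propensity exceeds a threshold. The propensity table, window size, and threshold are illustrative assumptions, not the parameters of Chou-Fasman or of any program discussed here.

```python
# A sliding-window helix predictor in the spirit of the rule-based methods
# described above. The propensity values, window size, and threshold are
# illustrative assumptions, not published Chou-Fasman parameters.

HELIX_PROPENSITY = {
    # > 1.0 favors helix formation, < 1.0 disfavors it (illustrative values).
    "A": 1.4, "L": 1.2, "E": 1.5, "M": 1.4, "Q": 1.1, "K": 1.2,
    "R": 1.0, "H": 1.0, "I": 1.1, "F": 1.1, "W": 1.1, "V": 1.1,
    "D": 1.0, "T": 0.8, "S": 0.8, "C": 0.7, "Y": 0.7, "N": 0.7,
    "G": 0.6, "P": 0.6,
}

def predict_helical_regions(sequence, window=6, threshold=1.05):
    """Return (start, end, mean_propensity) for windows likely to be helical."""
    regions = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        mean = sum(HELIX_PROPENSITY.get(aa, 1.0) for aa in segment) / window
        if mean >= threshold:
            regions.append((start, start + window, round(mean, 2)))
    return regions

if __name__ == "__main__":
    # Hypothetical sequence fragment; positions are zero-based.
    print(predict_helical_regions("MKALEEALKKAPGLEEAAKKL"))
```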
The great challenge in the study of protein structure has been to predict the fold of a protein segment
from its amino acid sequence. Before the advent of sequencing technology it was generally assumed that
each unique protein sequence would produce a threedimensional structure radically different from every other
protein. But the new technology revealed that protein
sequences are highly redundant: Only a small percentage of the total sequence is crucial to the structure and
function of the protein. Moreover, while similar protein
sequences generally indicate similarly folded conformations and functions, the converse does not hold. In some
proteins, such as the nucleotide-binding proteins, the
structural features encoding a common function are conserved, while primary sequence similarity is almost nonexistent (Rossman, Moras, & Olsen, 1974; Creighton,
1983; Birktoft & Banaszak, 1984). Methods that detect similarities solely at the primary sequence level
turned out to have difficulty addressing functional associations in such sequences. A number of features often
only implicit in the protein’s linear or primary sequence
of twenty possible amino acids turned out to be important in determining structure and function.
Such findings implied the need for more sophisticated techniques of searching than simply finding identical matches between sequences in order to elicit information about similarities between higher-ordered
structures such as folds. One solution adopted early on
by programs such as SEQ was to assume that if two DNA
segments are evolutionarily related, their sequences will
probably be related in structure and function. The related descendants are identifiable as homologues. For
instance, there are more than 650 globin sequences (as
in myoglobin or hemoglobin) in the protein sequence
databases, all of them very similar in structure. These
sequences are assumed to be related by evolutionary descent rather than having been created de novo. Many
programs for searching sequence databases have been
written, including an important early method written
in 1970 by S. B. Needleman and C. D. Wunsch and
incorporated into SEQ for aligning sequences based on
homologies (Needleman & Wunsch, 1970). The method
of homology depends upon assumptions related to the
genetic events that could have occurred in the divergent
(or convergent) evolution of proteins; namely, that homologous proteins are the result of gene duplication and
subsequent mutations. If one assumes that after the duplication, point mutations occur at a constant or variable rate, but randomly along the genes of the two proteins, then after a relatively short period of time the
protein pairs will have nearly identical sequences. Later
there will be gaps in the shared sets of base-pairs between the two proteins. Needleman and Wunsch determined the degree of homology between protein pairs by
counting the number of non-identical pairs (amino acid
replacements) in the homologous comparison and using this number as a measure of evolutionary distance
between the amino acid sequences. A second approach
was to count the minimum number of mutations represented by the non-identical pairs.
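A minimal sketch of the Needleman-Wunsch idea may help: a dynamic-programming table is filled so that the optimal global alignment score of two sequences appears in its final cell. The match, mismatch, and gap values below are placeholder scores, not the weighting scheme of the 1970 paper or of SEQ, and the fragments compared are invented.

```python
# A minimal Needleman-Wunsch global alignment scorer. The scoring values
# (match +1, mismatch -1, gap -1) are placeholders for illustration only.

def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the dynamic-programming table and return the optimal global score."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i * gap          # aligning a prefix of a against gaps
    for j in range(cols):
        table[0][j] = j * gap          # aligning a prefix of b against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(diagonal, table[i - 1][j] + gap, table[i][j - 1] + gap)
    return table[-1][-1]

if __name__ == "__main__":
    # Invented globin-like fragments; a traceback step would recover the alignment itself.
    print(needleman_wunsch_score("HGSAQVKGHG", "HGSAQVKAHG"))
```

Substituting a table of residue-pair scores for the fixed match and mismatch values would give the evolutionary weighting schemes discussed below.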
Another example of a key tool used in determining
structure-function relationship is a search for sequences
that correspond to small conserved regions of proteins,
modular structures known as motifs. Since insertions
and deletions (gaps) within a motif are not easily handled
from a mathematical point of view, a more technical
term, “alignment block,” has been introduced that refers to conserved parts of multiple alignments containing no insertions or deletions (Bork & Gibson, 1996).
Several different kinds of motifs are related to secondary and tertiary structure. Protein scientists distinguish among four hierarchical levels of structure. Primary structure is the specific linear sequence of the
twenty possible amino acids making up the building
blocks of the protein. Secondary structure consists of
patterns of repeating polypeptide structure within an
α-helix, β-sheet, and reverse turns. Supersecondary structure refers to a few common motifs of interconnected
elements of secondary structure. Segments of α-helix
and β-strand often combine in specific structural motifs. One example is the α-helix-turn-helix motif found
in DNA-binding proteins. This motif is twenty-two amino acids in length and enables the protein to bind to DNA.
Another motif at the supersecondary level is known as
the Rossmann fold, in which three α-helices alternate
with three parallel β-strands. This has turned out to be a
general fold for binding mono- or dinucleotides and is
the most common fold observed in globular proteins
(Richardson & Richardson, 1989).
A higher order of modular structure is found at the
tertiary level. Tertiary structure is the overall spatial arrangement of the polypeptide chain into a globular mass
of hydrophobic side chains forming the central core, from
which water is excluded, and more polar side chains favoring the solvent-exposed surface. Within tertiary structures are certain domains on the order of a hundred
amino acids, which are themselves structural motifs.
Domain motifs have been shown to be encoded by exons, individual DNA sequences that are directly translated into peptide sequences. Assuming that all contemporary proteins have been derived from a small number
of original ones, Walter Gilbert and colleagues have argued that the total number of exons from which all existing protein domains have been derived is somewhere
between one thousand and seven thousand (Dorit,
Schoenback, & Gilbert, 1990).
Motifs are powerful tools for searching databases of
known structure and function to determine the structure and function of an unknown gene or protein. The
motif can serve as a kind of probe for searching the
database or some new sequence, testing for the presence
of that motif. The PROSITE database, for example, has
more than a thousand of these motifs (Bairoch, 1991).
With such a library of motifs one can take a new sequence and use each one of the motifs to get clues about
its structure. Suppose, for example, the sequence of a
gene or protein has been determined. Then the most
common way to investigate its biologic function is simply to compare its sequence with all known DNA or
protein sequences in the databases and note any strong
similarities. The particular gene or protein that has just
been determined will of course not be found in the
databases, but a homologue from another organism or a
gene or protein having a related function may be found.
The evolutionary similarity implies a common ancestor
and hence a common function. Searching with motif
probes refines the determination of the fold regions of
the protein. These methods become more and more successful as the databases grow larger and as the sensitivity
of the search procedure increases. Bork, Ouzounis, and
Sander (1994) state that the likelihood of identifying
homologues is currently higher than 80 percent for bacteria, 70 percent for yeast, and about 60 percent for animal sequence series (Bork & Gibson, 1996).
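The following sketch suggests how such a motif probe might be applied: a simplified PROSITE-style pattern is translated into a regular expression and scanned against candidate sequences. The pattern syntax handled, the pattern itself, and the sequences are hypothetical illustrations rather than actual database entries.

```python
import re

# A minimal sketch of motif probing: a simplified PROSITE-style pattern is
# translated into a regular expression and used to scan candidate sequences.
# The pattern and the sequences below are hypothetical, not PROSITE entries.

def prosite_to_regex(pattern):
    """Convert a simplified PROSITE-style pattern, e.g. 'C-x(2)-[DE]-x-H', to a regex."""
    parts = []
    for element in pattern.split("-"):
        element = element.replace("x", ".")               # x matches any residue
        element = re.sub(r"\((\d+)\)", r"{\1}", element)  # x(2) -> .{2}
        parts.append(element)
    return re.compile("".join(parts))

if __name__ == "__main__":
    probe = prosite_to_regex("C-x(2)-[DE]-x-H")
    for name, seq in {"protA": "MKCLLDAHGT", "protB": "MKALLEAGGT"}.items():
        hit = probe.search(seq)
        print(name, "motif at position", hit.start() if hit else "-- no match")
```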
The all-or-nothing character of consensus sequences-a sequence either matches or it does not-led researchers to modify this technique to introduce degrees of similarity among aligned sequences as a way of detecting
similarities between proteins, even distantly related ones.
Knowing the function of a protein in some genome, such
as E. coli, for instance, might suggest the same function
of a closely related protein in an animal or human genome (Patthy, 1996). Moreover, as noted above, different amino acids can fit the same pattern, such as the
helix-turn-helix, so that a representation of sequence
pattern in which alternative amino acids are acceptable,
as well as regions in which a variable number of amino
acids may occur, are desirable ways of extending the
power of straightforward consensus sequence comparison. One such technique is to use weights or frequencies to specify greater tolerance in some positions than
in others. An illustration of the success of this approach
is provided by the DNA-binding proteins mentioned
above, which contain a helix-turn-helix motif twenty-two amino acids in length (Brennan & Mathews, 1989). Comparison of the linear amino acid sequences of these
proteins revealed no consensus sequence that could distinguish them from any other protein. A weight matrix
is constructed by determining the frequency with which
each amino acid appears at each position, and then converting these numbers to a measure of the probability of
occurrence of each acid. This weight matrix can be applied to measure the likelihood that any given sequence
twenty-two amino acids long is related to the helix-turn-helix family. A further modification of the weight matrix is the profile, which allows one to estimate the probability that any amino acid will appear in a specific
position (Gribskov et al., 1987; Gribskov et al., 1988).
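A toy version of the weight-matrix idea follows, assuming a tiny invented alignment rather than the twenty-two-residue helix-turn-helix data: column-wise amino acid frequencies are converted to probabilities (with pseudocounts, an added assumption), and a candidate segment is scored by the product of its per-position probabilities.

```python
# A toy position weight matrix: per-column amino acid probabilities are
# estimated from a small invented alignment (with pseudocounts) and a
# candidate segment is scored by the product of its per-position probabilities.

from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_weight_matrix(aligned, pseudocount=1.0):
    """Return one probability dictionary per alignment column."""
    matrix = []
    for column in zip(*aligned):                       # iterate column-wise
        counts = Counter(column)
        total = len(aligned) + pseudocount * len(ALPHABET)
        matrix.append({aa: (counts.get(aa, 0) + pseudocount) / total for aa in ALPHABET})
    return matrix

def segment_probability(segment, matrix):
    """Likelihood that the segment was drawn from the motif model."""
    p = 1.0
    for aa, column in zip(segment, matrix):
        p *= column.get(aa, 0.0)
    return p

if __name__ == "__main__":
    alignment = ["QTELA", "QSELA", "RTELG", "QTDLA"]   # invented aligned segments
    wm = build_weight_matrix(alignment)
    print(segment_probability("QTELA", wm) > segment_probability("MKCWH", wm))  # True
```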
In addition to consensus sequences, weight matrices, and profiles, a further class of strategies for determining structure-function relations are various sequence
alignment methods. In order to detect homologies between distantly related proteins, one method is to assign
a measure of similarity to each pair of amino acids, and
then add up these pairwise scores for the entire alignment (Schwartz & Dayhoff, 1979). Related proteins will
not have identical amino acids aligned, but they do have
chemically similar or replaceable amino acids in similar
positions. In a scoring method developed by R. M.
Schwartz and M. O. Dayhoff, for example, amino acid
pairs that are identical or chemically similar were given
positive scores, and pairs of amino acids that are not
related were assigned negative similarity scores.
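The scoring scheme can be sketched as follows, with stand-in values rather than the actual Dayhoff (PAM) scores: identical residues receive the largest score, chemically similar pairs a smaller positive score, and unrelated pairs a penalty, and the pairwise scores are summed over an ungapped alignment of equal length.

```python
# Alignment scoring with pairwise amino acid similarities, in the spirit of
# the Schwartz-Dayhoff approach. The scores below are illustrative stand-ins,
# not values drawn from the PAM (Dayhoff) matrices.

SIMILAR_PAIRS = {
    frozenset("IL"): 2, frozenset("IV"): 2, frozenset("LV"): 2,  # hydrophobic
    frozenset("DE"): 2, frozenset("KR"): 2, frozenset("ST"): 1,  # charge / polarity
    frozenset("FY"): 2, frozenset("NQ"): 1,
}

def pair_score(a, b, identity=3, unrelated=-1):
    """Positive for identical or chemically similar residues, negative otherwise."""
    if a == b:
        return identity
    return SIMILAR_PAIRS.get(frozenset((a, b)), unrelated)

def alignment_score(seq1, seq2):
    """Sum pairwise scores over an ungapped alignment of equal length."""
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))

if __name__ == "__main__":
    # Invented aligned fragments: conservative substitutions still score positively.
    print(alignment_score("VKDLIR", "IKELVR"))
```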
A dramatic illustration of how sequence alignment
tools can be brought to bear on determining function
and structure is provided by the case of cystic fibrosis.
Cystic fibrosis is caused by aberrant regulation of chloride transport across epithelial cells in the pulmonary
tree, the intestine, the exocrine pancreas, and apocrine
sweat glands. This disorder was identified as being caused
by defects in the cystic fibrosis transmembrane conductance regulator protein (CFTR). After the CFTR gene
was isolated in 1989, its protein product was identified as producing a chloride channel, which depends for
its activity on the phosphorylation of particular residues within the regulatory region of the protein. Using
computer-based sequence alignment tools of the sort described above, it was established that consensus sequences for nucleotide-binding folds that bind ATP are
present near the regulatory region and that 70 percent
of cystic fibrosis mutations are accounted for by a three
base-pair deletion that removes a phenylalanine residue
within the first nucleotide-binding position. A significant
portion of the remainder of cystic fibrosis mutations affect a second nucleotide-binding domain near the regulatory region (Hyde et al., 1990; Kerem et al., 1989;
Kerem et al., 1990; Riordan et al., 1989).
In working out the folds and binding domains for
the CFTR protein, S. C. Hyde, P. Emsley, M. J. Hartshorn, and colleagues (1990) used sequence alignment
methods similar to those available in early models of the
IntelliGenetics software suite. They used the Chou-Fasman algorithm (1973) for identifying consensus sequences and the Quanta modeling package produced
by Polygen Corporation (Waltham, Massachusetts) for
modeling the protein and its binding sites (Hyde et al.,
1990). In 1992 IntelliGenetics introduced BLAZE, an
even more rapid search program running on a massively
parallel computer. As an example of how computational
genomics can be used to solve structure-function problems in molecular biology, Brutlag repeated the CFTR
case using BLAZE (Brutlag, 1994). A sequence similarity search compared the CFTR protein to more than
twenty-six thousand proteins in a protein database of
more than nine million residues, resulting in a list of
twenty-seven top similar proteins, all of which strongly
suggested the CFTR protein is a membrane protein involved in secretion. Another feature of the comparison
result was that significant homologies were shown with
ATP-binding transport proteins, further strengthening
the identification of CFTR as a membrane protein. The
search algorithm identified two consensus sequence
motifs in the protein sequence of the cystic fibrosis gene
product that corresponded to the two sites on the protein involved in binding nucleotides. The search also
turned up distant homologies between the CFTR protein and proteins in E. coli and yeast. The entire search
took three hours. Such examples offer convincing evidence that tools of computational molecular biology can
lead to the understanding of protein function.
The methods for analyzing sequence data discussed
above were just the beginnings of an explosion of database mining tools for genomics that is continuing to
take place.3 In the process biology is becoming even more
aptly characterized as an information science (Hughes
et al., 1999; IntelliGenetics & MasPar Computer Corporation, 1992). Advances in the field have led to largescale automation of sequencing in genome centers employing robots. The success this large-scale sequencing
of genes has enjoyed has in turn spawned a similar approach to applying automation to sequencing proteins,
a new area complementary to genomics called proteomics. Similar in concept to genomics, which seeks to
identify all genes, proteomics aims to develop techniques
that can rapidly identify the type, amount, and activities of the thousands of proteins in a cell. Indeed, new
biotechnology companies have started marketing technologies and services for mining protein information en
masse. Oxford Glycosciences (OGS) in Abingdon, England, has automated the laborious technique of two-dimensional gel electrophoresis.4
In the OGS process
an electric current applied to a sample on a polymer gel
separates the proteins, first by their unique electric charge
characteristics and then by size. A dye attaches to each
separated protein arrayed across the gel. Then a digital
imaging device automatically detects protein levels by
how much the dye fluoresces. Each of the five thousand
to six thousand proteins that may be assayed in a sample
in the course of a few days is channeled through a mass
spectrometer that determines its amino acid sequence.
The identity of a protein can be determined by comparing the amino acid sequence with information contained
in numerous gene and protein databases. One imaged
array of proteins can be contrasted with another to find
proteins specific to a disease.
In order to keep pace with this flood of data emerging from automated sequencing, genome researchers have
in turn looked increasingly to artificial intelligence,
machine learning, and even robotics in developing automated methods for discovering patterns and protein
motifs from sequence data. The power of these methods
is their ability both to represent structural features rather
than strictly evolutionary steps and to discover motifs
from sequences automatically. The methods developed
in the field of machine learning have been used to extract conserved residues, discover pairs of correlated residues, and find higher-order relationships between residues as well. Techniques from the field of machine learning have included perceptrons, discriminant analysis,
neural networks, Bayesian networks, hidden Markov
models, minimal length encoding, and context-free
grammars (Hunter, 1993). Important methods for evaluating and validating novel protein motifs have also derived from the machine learning area.

3. See, for instance, the National Institute of General Medical Sciences (NIGMS) "Protein Structure Initiative Meeting Summary," 24 April 1998, at http://www.nih.gov/nigms/news/reports/protein-structure.html.
4. See the discussion of this technology at the site of Oxford Glycosciences: http://www.ogs.com/proteome/home.html.
An example of this effort to scale up and automate
the discovery of structure and function is EMOTIF (for
“electronic-motif”), a program for discovering conserved
sequence motifs from families of aligned protein sequences developed by the Brutlag Bioinformatics Group
at Stanford (Nevill-Manning et al., 1998).5 Protein sequence motifs are usually generated manually with a
single “best” motif optimized at one level of specificity
and sensitivity. Brutlag’s aim was to automate this procedure. An automated method requires knowledge about
sequence conservation. For EMOTIF, this knowledge is
encoded as a particular allowed set of amino acid substitution groups. Given an aligned set of protein sequences,
EMOTIF works by generating a set of motifs with a
wide range of specificities and sensitivities. EMOTIF
can also generate motifs that describe possible subfamilies of a protein superfamily. The EMOTIF program
works by generating a new database, called IDENTIFY,
of fifty thousand motifs from the combined seven
thousand protein alignments in two widely used public databases, the PRINTS and BLOCKS databases.
By changing the set of substitution groups, the algorithm can be adapted for generating entirely new sets of
motifs.
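A rough sketch of the underlying idea, not of EMOTIF itself: given an aligned block with no gaps and a set of allowed substitution groups, each column is rendered as a single residue, a bracketed group, or a wildcard, yielding one motif at one level of specificity. The substitution groups and the aligned block below are illustrative assumptions; EMOTIF enumerates many motifs across a range of specificities and sensitivities rather than emitting a single pattern.

```python
# Deriving a single motif from an aligned block using allowed substitution
# groups, loosely in the spirit of EMOTIF. The groups and the block are
# illustrative; EMOTIF itself enumerates many motifs over a range of
# specificities and sensitivities.

SUBSTITUTION_GROUPS = ["ILMV", "DE", "KR", "ST", "FYW", "NQ", "AG"]  # assumed groups

def column_pattern(residues):
    """Describe one alignment column as a residue, a bracketed group, or a wildcard."""
    unique = set(residues)
    if len(unique) == 1:
        return unique.pop()
    for group in SUBSTITUTION_GROUPS:
        if unique <= set(group):
            return "[" + group + "]"
    return "."                         # too variable: match any residue

def derive_motif(aligned_block):
    """Build a regular-expression-like motif, one pattern element per column."""
    return "".join(column_pattern(col) for col in zip(*aligned_block))

if __name__ == "__main__":
    block = ["GKSTLE", "GRSTIE", "GKSSVD"]   # toy gap-free aligned block
    print(derive_motif(block))               # -> G[KR]S[ST][ILMV][DE]
```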
Highly specific motifs are well suited for searching
entire proteomes. IDENTIFY assigns biological functions to proteins based on links between each motif and
the BLOCKS or PRINTS databases that describe the
family of proteins from which it was derived. Because
these protein families typically have several members, a
match to a motif may provide an association with several other members of the family. In addition, when a
match to a motif is obtained, that motif may be used to
search sequence databases, such as SWISS-PROT and
GenPept, for other proteins that share this motif. In their
paper introducing these new programs C. G. Nevill-Manning, T. D. Wu, and Brutlag showed that EMOTIF
and IDENTIFY successfully assigned functions automatically to 25 to 30 percent of the proteins in several
bacterial genomes and automatically assigned functions
to 172 proteins of previously unknown function in the
yeast genome.

5. EMOTIF can be viewed at http://motif.stanford.edu/emotif.
Many molecular biologists who welcomed the Human Genome Initiative with open arms undoubtedly
believed that when the genome was sequenced everyone
would return to the lab to conduct their experiments in
a business-as-usual fashion, empowered with a richer set
of fundamental data. The developments in automation,
the resulting explosion of data, and the introduction of
tools of information science to master this data have
changed the playing field forever: There may be no “lab”
to return to. In its place is a workstation hooked to a
massively parallel computer, producing simulations by
drawing on the data streams of the major databanks,
and carrying out “experiments” in silico rather than in
vitro. The result of biology’s metamorphosis into an information science just may be the relocation of the lab
to the industrial park and the dustbin of history.
References
Anfinsen, C. B. (1973). Principles that govern the folding of protein chains.
Science, 181(4096), 223-230.
Bairoch, A. (1991). PROSITE: A dictionary of sites and patterns in proteins. Nucleic Acids Research, 19, 2241.
Bernstein, F. C., Koetzle, T. F., et al. (1977). The Protein Data Bank: A computer based archival file for macromolecular structure. Journal of
Molecular Biology, 112, 535-542.
BIONET users status. (1986, April 3 and October 9). From BIONET managers’ meetings. (Stanford University Special Collections, Brutlag Papers, Fldr BIONET). Stanford, CA: Stanford University.
Birktoft, J. J., & Banaszak, L. J. (1984). Structure-function relationships
among nicotinamide-adenine dinucleotide dependent oxidoreductases. In M. T. W. Hearn (Ed.), Peptide and Protein Reviews (Vol. 4,
pp. 1-47). New York: Marcel Dekker.
Board of Regents. (1987). NLM long range plan (report of the Board of
Regents). Bethesda, MD: National Library of Medicine.
Bork, P., & Gibson, T. J. (1996). Applying motif and profile searches. In
R. F. Doolittle (Ed.), Computer Methods for Macromolecular Sequence
Analysis (Vol. 266 in Methods in Enzymology, pp. 162-184, especially p. 163). San Diego: Academic Press.
Bork, P., Ouzounis, C., & Sander, C. (1994). From genome sequences
to protein function. Current Opinions in Structural Biology, 4(3), 393-403.
Boswell, S. (January 9,1987). Los Alamos workshop-Exploring the role
of robotics and automation in decoding the human genome. (IntelliGenetics trip report. In Stanford Special Collections, Brutlag Papers,
Fldr BIONET). Stanford, CA: Stanford University.
Brennan, R. G., & Mathews, B. W. (1989). The helix-turn-helix binding
motif. Journal of Biological Chemistry, 264, 1903.
Brutlag, D. L. (1994). Understanding the human genome. In P. Leder, D. A.
Clayton, & E. Rubenstein (Eds.), Scientific American: Introduction to
Molecular Medicine (pp. 153-168). New York: Scientific American, Inc.
Brutlag, D. L., & Kristofferson, D. (1988). BIONET: An NIH computer resource for molecular biology. In R. R. Colwell (Ed.), Biomolecular
data: A resource in transition (pp. 287-294). Oxford: Oxford University Press.
Business plan for IntelliGenetics. (1981, May 8). (Stanford Special Collections, Brutlag Papers, Fldr IntelliGenetics, p. 2). Stanford, CA: Stanford
University.
Creighton, T. E. (1983). Proteins: Structure and molecular properties. New
York: W. H. Freeman.
Dorit, R. L., Schoenback, L., & Gilbert, W. (1990). How big is the universe
of exons? Science, 250, 1377.
Doyle, R. (1997). On beyond living: Rhetorical transformations of the life
sciences. Stanford, CA: Stanford University Press.
Duke, B. H., Contracting Officer, NIH, to IntelliGenetics, Inc. (1987, June
3). Request for revised proposal in response to request for proposals
RFP no. NIH-GM-87-04 titled "Nucleic Acid Sequence Data Bank."
Letter with attachment. (Stanford Special Collections, Brutlag Papers,
Fldr BIONET). Stanford, CA: Stanford University.
Feigenbaum, E. A., Buchanan, B., et al. (April 1980). A proposal for continuation of the MOLGEN project: A computer science application to
molecular biology (Computer Science Department, Stanford University, Heuristic Programming Project, Technical Report No. HPP-805, Section 1), p. 1. Stanford, CA: Stanford University.
Feigenbaum, E. A., & Martin, N. (1977). Proposal: MOLGEN-a computer
science application to molecular genetics (Heuristic Programming
Project, Stanford University, Technical Report No: HPP-78-18,1977).
Stanford, CA: Stanford University.
Friedhoff, R. M., & Benzon, W. (1989). The second computer revolution:
Visualization. New York: W. H. Freeman.
Friedland, P. (1979). Knowledge-based experiment design in molecular
genetics. Unpublished doctoral dissertation, Stanford University.
Friedland, P. (1984, April 27). BIONET organizational plans. Company confidential memo. (Stanford University Special Collections, Brutlag Papers, Fldr BIONET, p. 1). Stanford, CA: Stanford University.
Friedland, P., Brutlag, D. L., Clayton, J., & Kedes, L. H. (1982). SEQ: A
nucleotide sequence analysis and recombinant system. Nucleic Acids Research, 10, 279-294.
Friedland, P., to Reimers, N. Subject: Software licensing agreement: April
2,1984. (Stanford University Special Collections, Fldr IntelliGenetics).
Stanford, CA: Stanford University.
Gilbert, W. (1991). Towards a paradigm shift in biology. Nature, 349, 99.
Gribskov, M., Homyak, M., et al. (1988). Profile scanning for three-dimensional structural patterns in protein sequences. Computer Applications in the Biosciences, 4, 61.
Gribskov, M., Mclachlan, A. D., et al. (1987). Profile analysis: Detection of
distantly related proteins. Proceedings of the National Academy of
Sciences, 84, 4355.
Hall, S. S. (1995). Protein images update natural history. Science, 267(3
February), 620-624.
Holbrook, S. R., Muskal, S. M., et al. (1993). Predicting protein structural
features with artificial neural networks. In L. Hunter (Ed.), Artificial
intelligence and molecular biology (pp. 161-194). Menlo Park, CA:
AAAI (American Association for Artificial Intelligence) Press.
Huberman, J. (1989). BIONET: Computer power for the rest of us. (Stanford
Special Collections, Brutlag Papers,Fldr BIONET). Stanford, CA: Stanford University.
Hughes, T. P., et al. (Eds.). (1999). Funding a revolution: Government support for computing research. Washington, DC: National Academy Press.
Hunter, L. (Ed.). (1993). Artificial intelligence and molecular biology. Menlo
Park, CA: AAAI Press.
Hyde, S. C., Emsley, P., et al. (1990). Structural model of ATP-binding
proteins associated with cystic fibrosis, multidrug resistance and
bacterial transport. Nature, 346,362-365.
IntelliGenetics. (1987). Introduction to BIONET: A computer resource for
molecular biology. User manual for BIONET subscribers, Release 2.3.
Mountain View, CA: IntelliGenetics.
Katz, L., & Levinthal, C. (1972). Interactive computer graphics and the representation of complex biological structures. Annual Reviews in Biophysics and Bioengineering, 1,465-504.
Kerem, B. S., Rommens, J. M., et al. (1989). Identification of the cystic
fibrosis gene: Genetic analysis. Science, 245, 1073-1080.
Kerem, B. S., Zielenski, J., et al. (1990). Identification of mutations in regions corresponding to the two putative nucleotide (ATP)-binding
folds of the cystic fibrosis gene. Proceedings of the National Academy of Sciences, 87,8447-8451.
Korn, L. J., Queen, C. L., & Wegman, M. N. (1977). Computer analysis of
nucleic acid regulatory sequences. Proceedings of the National Academy of Sciences, 74, 4516-4520.
Lederberg, J. (n.d.). How DENDRAL was conceived and born (Stanford
Technical Reports 048087-54, Knowledge Systems Laboratory Report No. KSL 87-54). Stanford, CA: Stanford University.
Lederberg, J. (1969). Topology of molecules. In National Research Council Committee on Support of Research in the Mathematical Sciences
(Ed.), The mathematical sciences: A collection of essays (pp. 37-51). Cambridge, MA: MIT Press.
Lederberg, J., Sutherland, G. L., Buchanan, B. G., & Feigenbaum, E. (1969
November). A heuristic program for solving a scientific inference problem: Summary of motivation and implementation (Stanford Technical
Reports 026104, Stanford Artificial Intelligence Project Memo AIM-104). Stanford, CA: Stanford University.
Ledley, R. S. (1965). Use of computers in biology and medicine. New York:
McGraw-Hill.
Levinthal, C. (1966). Molecular model-building by computer. Scientific
American, 214(6), 42-52.
Levinthal, C., Barry, C. D., Ward, S. A., & Zwick, M. (1968). Computer
graphics in macromolecular chemistry. In D. Secrest & J. Nievergelt
(Eds.), Emerging concepts in computer graphics (pp. 231-253). New
York: W. A. Benjamin.
Lewin, R. (1984). National networks for molecular biologists. Science, 223,
1379-1380.
Maxam, A. M., to GENET Community. (1982, August 23). Subject: Closing
of GENET. (Stanford University Special Collections, Peter Friedland
Papers, Fldr GENET). Stanford, CA: Stanford University.
McCormick, B. H., DeFanti, T. A., & Brown, M. D. (1987). Visualization in
Scientific Computing. NSF Report. Computer Graphics, 21(6), special issue.
Minutes of the meeting of the National Advisory Committee for BIONET.
(1985, March 23). (Final version prepared 1985, August 1). (Stanford
University Special Collections, Brutlag Papers, Fldr BIONET, p. 4).
Stanford, CA: Stanford University.
Morowitz, H. (1985). Models for biomedical research: A new perspective.
Washington, DC: National Academy of Sciences Press.
Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to
the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 443.
Nevill-Manning, C. G., Wu, T. D., et al. (1998). Highly specific protein
sequence motifs for genomic analysis. Proceedings of the National
Academy of Sciences, USA, 95(11), 5865-5871.
NIH Special Study Section. (March 17-19,1983). BIONET, national computer resource for molecular biology. (Stanford University Special
Collections, Brutlag Papers, p. 2). Stanford, CA: Stanford University.
Nomination for Smithsonian-Computerworld Award. (n.d.). (Stanford Special Collections, Brutlag Papers, Fldr Smithsonian Computerworld
Award). Stanford, CA: Stanford University.
Oettinger, A. G. (1966). The uses of computers in science. Scientific American, 215(3), 161-172.
Panel on Information Technology and the Conduct of Research, National
Academy of Sciences. Information technology and the conduct of research: The user's view. (1989). Washington, DC: National Academy
of Sciences.
Patthy, L. (1996). Consensus approaches in detection of distant homologies. In R. F. Doolittle (Ed.), Computer methods for macromolecular
sequence analysis (Vol. 266 in Methods in Enzymology, pp. 184-198). San Diego: Academic Press.
PC/Gene: Microcomputer software for protein chemists and molecular biologists, user manual. (1986). Mountain View, CA: IntelliGenetics,
pp. 99-120.
Ramsey, D. M. (Ed.). (1968). Image processing in biological science. Berkeley/Los Angeles: University of California Press.
Richardson, J. S., & Richardson, D. C. (1989). Principles and patterns of
protein conformation. In G. D. Fasman (Ed.), Prediction of protein
structure and the principles of protein conformation. New York: Plenum Press.
Rindfleisch, T., Friedland, P., & Clayton, J. (1981). The GENET guest service on SUMEX (SUMEX-AIM Report). (Stanford University Special
Collections, Friedland Papers, Fldr GENET). Stanford, CA: Stanford
University.
Riordan, J. R., Rommens, J. M., et al. (1989). Identification of the cystic
fibrosis gene: Cloning and characterization of complementary RNA.
Science, 245,1066-1073.
Rossman, M. G., Moras, D., & Olsen, K. W. (1974). Chemical and biological evolution of a nucleotide-binding protein. Nature, 250, 194-199.
Schwartz, R. M., & Dayhoff, M. O. (1979). Matrices for detecting distant
relationships. Atlas of Protein Structure, 5(Supplement 3), 353.
Second cut at interaction language and procedure. (n.d.). In Chemistry
project. (Edward Feigenbaum Papers, Stanford Special Collections
SC-340, Box 13). Stanford, CA: Stanford University.
Smith, D. H., Brutlag, D., Friedland, P., & Kedes, L. H. (1986). BIONET:
National computer resource for molecular biology. Nucleic Acids
Research, 14, 17-20.
Staden, R. (1977). Sequence data handling by computer. Nucleic Acids
Research, 4,4037-4051.
Staden, R. (1978). Further procedures for sequence analysis by computer.
Nucleic Acids Research, 5, 1013-1015.
Staden, R. (1979). A strategy of DNA sequencing employing computer programs. Nucleic Acids Research, 6, 2602-2610.
Stefik, M. (1977). Inferring DNA structures from segmentation data. Artificial Intelligence, 11, 85-114.
Watson, H. C. (1969). The stereochemistry of the protein myoglobin.
Progress in Stereochemistry, 4,299-333.