by William J. Cromie
The application of artificial intelligence to problems of chemical
analysis and synthesis is having dramatic impact on many operations
that were once tedious and prohibitively time-consuming.
Chemists would be much more productive if they could see what they
do. Virtually all of their work involves the interaction of molecules whose
structure determines their activity, but
these structures often are unknown and
always are invisible. For example, the
ability of a drug to react with a key enzyme
and destroy a disease-causing bacterium
depends on the interlocking of the two
molecules. If chemists knew the details of
the molecules' three-dimensional shapes
down to the last atom, they would be on
their way to the design of more effective
drugs. The same holds true for making
better pesticides, herbicides, synthetic
fibers, plastics, and other products.
The ability to discern structure and
watch molecules interact is not enough.
Chemists also face the task of planning
the best way to synthesize a compound.
Thousands of starting materials can be
combined in millions of ways. Chemistry
would be more of a science and less of an
art if chemists could call up all the possible
routes to a target molecule, whether or
not these routes had ever been used before.
For the past dozen years, researchers
have been developing computer systems
designed to open the eyes of academic and
industrial chemists and to help them achieve
these goals. Some of the systems speed
tedious analyses to determine the structures
of molecules. Others depict invisible molecular
shapes as stereo color images that
can be rotated, translated, and tilted much
as spaceships are maneuvered in a video
game. Other programs are being tested
and used for the design of new compounds,
the syntheses of complex molecules, and
the prediction of reaction products in a
chemical factory or in a human body.
As chemists struggle with the problems
of adapting such systems for routine use,
computer scientists work on easier-to-use
(they call them friendlier) and more intelligent systems. The goal of the computer-chemistry
field is to invent a data-in,
answer-out black box that would, for
example, allow chemists to feed in a molecular formula and without further effort
obtain its structure or a plan by which it
could be synthesized. No such black box
exists, and it may not exist for a long
time, but various versions of what will
eventually become part of the inner workings of such a system already are easing
some of chemistry's trial-and-error burden.
Machine intelligence
The computer began its role in chemistry as an automatic librarian. At first it
performed searches of scientific literature
and lists of compounds. Today, data banks
hold information on molecular structure,
test results, and chemical reactions. The
American Chemical Society's Chemical
Abstracts Service, for example, offers
access to a data base containing information on the structure of more than five
million compounds. Using a nationwide
system known as Telnet, researchers can
telephone the Chemical Information System, developed by the National Institutes
of Health and the Environmental Protection
Agency, and obtain all the information
available on 150,000 frequently used compounds regulated by the federal government. With the newer Reaction Access
System, a chemist can get data about reactions in which a compound has taken part.
Applications. The T-shaped thyroid (left) fits spaces in its carrier protein, albumin,
as a pimiento fits an olive. A view down the axis (bottom left) of the double-helical DNA
molecule and another view of the molecule rotated, displaying its major and minor grooves.
Library systems simply store information
and release it on request. Computers possess far greater capabilities, and as early
as the 1960s computer scientists began
applying the techniques of artificial intelligence to problems in chemistry. The data
base for such systems consists of a large
body of knowledge in a specific area obtained from the literature and from questioning experts. Such knowledge-based
artificial intelligence systems contain both
facts and heuristic knowledge, the empirical
rules for problem-solving gained from
practical experience. (See "Programmed
to Think," Mosaic, Volume 11, Number 5.)
The first effort in this area originated
with the work of Edward A. Feigenbaum,
chairman of the computer science department at Stanford University, and Joshua
Lederberg, then chairman of Stanford's
genetics department. Chemistry attracted
Feigenbaum because much factual knowledge already existed in numerical form,
easily readable by computers. Lederberg,
a chemist, carried a wealth of heuristic
rules in his head. The two later recruited
Stanford organic chemist Carl Djerassi to
help get the knowledge base into the computer. Their first project, dubbed Dendral,
for Dendritic Algorithm, constituted the
first fully automatic approach to structural
analysis. The program uses data containing structural clues from which it generates
all relevant possible structures.
Originally, Dendral used mass spectrometry data. A spectrometer separates a
molecule into ionized fragments and produces a spectrum in which the mass of each
ion is plotted against its relative abundance.
The chemical composition of a molecule
can be ascertained directly from the spectrum, but the structure is not evident. In
other words, a chemist can determine which
atoms and atomic groups make up the
molecule but cannot determine their spatial
arrangement. Structural features determine
the pattern of fragmentation, however,
and spectra do contain structural information. One way to obtain this information
is to reassemble the fragments, or substructures, in all possible ways and then
determine which of the assemblies reproduces
the original spectrum. This would be a herculean task for a human but not for a large
computer.
MOSAIC January/February 1983
Dendral takes a molecule's atomic
groups, whose structures are known, and
reassembles them in every possible way.
It uses its heuristic rules to predict a spectrum for each assembly, and it compares
predicted and real spectra. Ideally, only
one predicted spectrum will match the
real one. But many predicted structures
are so similar that this technique may
generate hundreds, even thousands, of
plausible candidates.
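The generate-and-test loop described above can be illustrated with a minimal sketch. It is not Dendral's actual code: the group masses, the restriction to linear chains, and the one-bond fragmentation rule in predict_spectrum are all simplifying assumptions made for illustration.

```python
from itertools import permutations

# Hypothetical toy data: each atomic group is a label with an integer mass.
GROUP_MASS = {"CH3": 15, "CH2": 14, "CO": 28, "OH": 17}

def predict_spectrum(chain):
    """Toy stand-in for heuristic fragmentation rules: break the chain at
    each bond in turn and record the masses of the two pieces."""
    masses = [GROUP_MASS[g] for g in chain]
    peaks = set()
    for cut in range(1, len(chain)):
        peaks.add(sum(masses[:cut]))
        peaks.add(sum(masses[cut:]))
    return frozenset(peaks)

def candidates(groups, observed):
    """Assemble the groups in every linear order and keep the assemblies
    whose predicted spectrum equals the observed one."""
    seen = set()
    matches = []
    for chain in permutations(groups):
        # A chain and its mirror image are the same molecule; test it once.
        key = min(chain, chain[::-1])
        if key in seen:
            continue
        seen.add(key)
        if predict_spectrum(key) == observed:
            matches.append(key)
    return matches

observed = predict_spectrum(("CH3", "CH2", "CO", "OH"))
survivors = candidates(["CH3", "CH2", "CO", "OH"], observed)
```

With four distinct groups only one of the twelve distinct chains survives. With repeated or similar groups several chains can predict the same peaks, which is the ambiguity the text describes.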
Making it manageable
To pare down the list, the Dendral team
is adding to the program the nuclear magnetic resonance spectrum of the compounds
in question. Spectra are predicted for each
candidate structure and matched against
the actual ones. This constrains the list of
structural possibilities to those that reproduce both the known mass spectrum and the
known nuclear magnetic resonance spectrum. In
one experimental analysis, the computer
generated a list of one-and-a-quarter million
candidate structures for a compound based
on knowledge of organic chemistry principles alone. When the mass spectrometry
and nuclear magnetic resonance spectra
were added, the list of plausible structures
was reduced to one.
At the outset, Dendral's capability was
limited to a restricted number of molecular
families. It could not handle cyclic molecules, those containing rings or closed
chains of atoms as their principal structural backbone. This restriction was a
significant gap; most important biological
compounds are cyclic.
"Without computer assistance, a chemist
would sit down with pencil and paper and
try to determine all of the possible structures a cyclic molecule could have," explains Robert S. Engelmore, a computer
scientist with Teknowledge, Inc., of Palo
Alto, California. "The chemist would then
publish a paper stating that the unknown
structure must be one of the five or ten
that he had laboriously worked out. But
no one could be sure he had thought of
all the possibilities."
In 1975, Stanford chemist Dennis H.
Smith and his colleagues created a program
that generated an exhaustive list of all
possible structures for a cyclic compound.
He used it in a review of papers in the
Journal of the American Chemical Society
that described manually determined structures. The computer program did a better
job of discovering all the possible structures than any human could.
Cromie is executive director of the Council
for the Advancement of Science Writing.
To use this program, known as Congen,
for Constrained Generation, a chemist feeds
the system all the structural information
that can be established unambiguously
from available data. The computer searches
this list for known substructural features
and puts all that it finds on what is termed
a "goodlist." It places known substructural
features that are missing on a "badlist." The
goodlist might indicate that the unknown
molecule is a ketone, but the badlist might
specify that it is not a methyl ketone. The
Congen program takes the goodlist of constraints and generates all molecular structures possessing those features. The program then predicts a mass spectrum for
each candidate and compares each one to
the spectrum of the compound being analyzed. The result is a short list of structures
ranked in order of plausibility. An experienced chemist can usually handle this list
manually to find the most likely structure.
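The goodlist and badlist act as filters, and the spectrum comparison acts as a ranking. The sketch below shows that logic under stated assumptions: it filters a ready-made candidate list rather than generating structures as Congen does, the candidate names and peak values are invented, and the overlap score (Jaccard similarity of peak sets) is an illustrative choice, not the program's documented metric.

```python
def satisfies(features, goodlist, badlist):
    """True when every goodlist feature is present and no badlist feature is."""
    return goodlist <= features and not (badlist & features)

def spectrum_score(predicted, observed):
    """Jaccard overlap of two peak sets (an assumed, illustrative metric)."""
    union = predicted | observed
    return len(predicted & observed) / len(union) if union else 0.0

def shortlist(candidates, goodlist, badlist, observed):
    """Drop candidates that violate the constraints, then rank the rest by
    how well their predicted peaks agree with the observed spectrum."""
    kept = [c for c in candidates
            if satisfies(c["features"], goodlist, badlist)]
    return sorted(kept,
                  key=lambda c: spectrum_score(c["peaks"], observed),
                  reverse=True)

# Made-up candidates; feature labels follow the ketone example in the text.
CANDIDATES = [
    {"name": "cand_a", "features": {"ketone", "methyl ketone"}, "peaks": {43, 57, 72}},
    {"name": "cand_b", "features": {"ketone", "ring"}, "peaks": {42, 55, 98}},
    {"name": "cand_c", "features": {"ketone"}, "peaks": {29, 57, 86}},
    {"name": "cand_d", "features": {"alcohol"}, "peaks": {31, 56, 74}},
]

ranked = shortlist(CANDIDATES, {"ketone"}, {"methyl ketone"}, {29, 57, 86})
names = [c["name"] for c in ranked]
```

Here cand_a is rejected by the badlist, cand_d by the goodlist, and the two survivors are ordered by spectral agreement.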
More than 20 academic and industrial
research laboratories in the United States,
Europe, and Australia are using Congen
in such areas as analysis of marine products,
antibiotics and other drug derivatives,
narcotic analogues, and brain endorphins.
In the United States, the program can be
purchased commercially or accessed through
terminal and telephone connections to the
Stanford University Experimental Computer for Artificial Intelligence in Medicine.
Paring further
"Dendral and Congen show that computer programs equal the performance of
experts in certain problems of molecular-structure elucidation," says Feigenbaum.
"They can solve highly complex problems, such as analysis of estrogenic steroids
and other biologically active compounds.
The programs may not know as much as
an expert does, but they succeed because
of their thorough application of the rules
they do know."
Congen has a major drawback: It cannot
handle structural features that share bonds
or atoms (adjacent rings having a common
side, for example). When a chemist using
Congen pinpoints two substructures in a
molecule being analyzed, he must be sure
they are separate and not overlapping.
"This asks a lot of him," notes Stanford
chemist James G. Nourse. To eliminate this
disadvantage, Nourse helped to develop a
program called Genoa, for Generation of
Overlapping Atoms. Available commercially and through Stanford's experimental
medical system, this program generates
all possible structures whether or not the
constraints involve information on overlapping substructures in the molecule.
"Genoa has not yet replaced Congen in
academic and industrial laboratories,"
Nourse says, "but it is more efficient,
easier to use, and capable of solving problems that Congen cannot do easily, if at all."
Genoa generates not one nor a million
candidate structures, but hundreds. If the
program ever is to be used in a comprehensive black-box system, it must be combined with a unit that can further pare
the list. Nourse's group is developing a
program that uses mass spectrometry data
and carbon-13 nuclear magnetic resonance
spectra to do the paring. Genoa, combined
with the new program, then, will result
in a kind of primitive black box; the user
feeds it available structural data, and it
delivers a list of candidate structures short
enough to be handled easily by a competent
chemist. As other types of constraint programs become available, such as those that
handle proton nuclear magnetic resonance
spectra, they can be incorporated.
Learning from nature
As Congen and then Genoa made it easier
to generate plausible structures, the speed
of programs used to predict the spectra
and to test the predictions they produced
proved inadequate. Computer scientists
had developed specific rules for predicting the ways structures fragment, drawing
on conversations with specialists or from
examination of the scientific literature. "To
continue doing this on a one-by-one basis
for each class of molecule would take
until the twenty-first century," Engelmore
says. So Feigenbaum and Stanford computer scientist Bruce G. Buchanan tried
to design a program that would examine
many examples of spectra and come up
with general rules for molecular fragmentation. They wanted to determine if a machine
could derive its information directly from
nature instead of from experts.
The first part of the program Feigenbaum
and Buchanan developed, named Metadendral, collects and summarizes data on
the fragmentation of known structures.
The second part generates rules to explain
these data. The third part tests and modifies
the rules, disregarding those that overlap
and those that produce negative energy
balances. The resulting rules describe the
bonds that break and the substructures
that form in mass spectrometry.
According to Buchanan, the machine-generated
rules "are as good as those
generated by human experts." He and his
co-workers proved this by using Metadendral to recreate fragmentation rules formulated in the traditional way for two
classes of compounds: aliphatic amines and
estrogenic steroids. In more stringent
tests, the program examined spectra from
three classes of steroids for which no
theory of fragmentation existed. The new
rules derived by Metadendral effectively
described the behavior of these compounds
in a mass spectrometer. "This is the first
case in chemistry of a theory successfully
generated by machine," Engelmore notes.
The machine cannot generalize across
significant structural differences, however;
a separate theory must be generated for
each class of organic compounds. To meet
this challenge, research on Metadendral
has been replaced by work on a project so
new that it does not yet possess its own
acronym: an effort to develop rules for
the interpretation and prediction of nuclear
magnetic resonance spectra. Initially geared
only to carbon-13 spectra, the system has
had more success with prediction than
with interpretation.
Floppy molecules
Stanford researchers are also wrestling
with the problem of floppy molecules.
These molecules are the many biologically
important, flexible structures that can
change shape as conditions change. Proteins, for instance, contain chains of atoms
that flutter or rotate about their bonds.
The entire molecule also vibrates (breathing, as chemists call it), and it can change
its shape by continuous alteration of such
properties as atomic distances and bond
angles. Any black box that really was one
would have to deal with such changes.
Nourse handles the problem by representing floppy molecules as a matrix of
distances between atoms. The distances
must be consistent with what is known
about a molecule, and changes in the distances are limited by energy balances and
other constraints in the program. On this
basis, Nourse expects to generate lists of
conformational isomers, all the shapes a
floppy structure can assume.
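A distance-matrix representation of this kind can be sketched in a few lines. The example below is only an illustration of the idea, not Nourse's program: the three-atom coordinates are invented, and simple lower and upper distance bounds stand in for the energy balances and other constraints mentioned above.

```python
import math

def distance_matrix(coords):
    """Pairwise Euclidean distances between atoms given as (x, y, z) tuples."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def is_allowed(coords, bounds):
    """A shape is allowed when every constrained atom pair (i, j) lies
    within its (lower, upper) distance bounds."""
    d = distance_matrix(coords)
    return all(lo <= d[i][j] <= hi for (i, j), (lo, hi) in bounds.items())

# Hypothetical constraints for a three-atom chain: two bonded pairs near
# 1.5 units apart, and end atoms that may neither collide nor overstretch.
BOUNDS = {(0, 1): (1.4, 1.6), (1, 2): (1.4, 1.6), (0, 2): (2.0, 3.1)}

# Made-up candidate shapes of the same chain.
CONFORMATIONS = {
    "extended": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)],
    "bent": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)],
    "collapsed": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.5, 0.0, 0.0)],
}

allowed = sorted(name for name, xyz in CONFORMATIONS.items()
                 if is_allowed(xyz, BOUNDS))
```

Because the matrix holds only interatomic distances, it is unchanged by rotating or translating the whole molecule, which is one plausible reason to prefer it over raw coordinates when enumerating shapes.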
The problem is central to some of chemistry's
most important processes. "The
binding of important molecules, such as
catalysts, is exclusively sensitive to shape,"
points out Richard J. Feldmann, a computer
specialist at the National Institutes
of Health. "The effect of a molecule in
one conformation may be nil, while in
another it may catalyze or inhibit a vital
reaction. Shapes required for specific
biological activity have been selected over
millions of years."
Feldmann makes movies depicting the
motions of molecules. "Fluttering or rotation occurs in picoseconds," he says.
"Breathing occupies milliseconds, and
folding takes tens of seconds. It requires
40 hours on a large computer to make a
movie showing 30 picoseconds in the life
of a very small protein. Dynamic modeling
of all the energetic states of a molecule
is necessary to completely understand its
functioning, but such modeling requires a
supercomputer."
Protein crystallography
Until massive computing power becomes
widely available, chemists must work
with snapshots of molecules, structural
images of a shape responsible for a specific
activity. Such a snapshot can be produced
by crystallizing a protein, bombarding it
with X rays, and determining its three-dimensional structure from the diffraction pattern.
The diffraction data are translated into
a series of two-dimensional electron density
maps representing closely spaced
contours of the molecule. The final step
involves construction of a wire or ball-and-stick model of the protein, a difficult
and frustrating chore requiring a month
or more.
In 1976, Feigenbaum and Engelmore set
out to develop a system for determining
protein structure from electron density
maps using artificial intelligence techniques. Beginning with the knowledge-base route they had used for Dendral and
other programs, Feigenbaum interviewed
people who had built models from maps.
"But we quickly found that we could not
capture their expertise in a computer
program," Engelmore recalls. "No one
used any step-by-step procedure that could
be translated into rules. Each person
developed his own method, which involved
a great deal of staring at the maps and a
visual kind of reasoning."
Engelmore and his colleagues eventually
created a program called Crysalis, based
on what he calls an "opportunistic, or jigsaw puzzle, approach." "You begin with the
amino acid sequence of the protein, which
can be determined by chemical analysis,"
he explains. "This is a global constraint,
like the picture on the puzzle box. Then
you look for parts that have a high probability of being correct, such as edges or
corner pieces in the jigsaw puzzle. These
would be places in the data where you
find some meaning."
A heme group, for example, appears on
the electron maps as a flat, lacelike structure around an extremely dense spot, the
iron atom at its center. The program builds
up a structural hypothesis piece by piece
as more and more information is gleaned
from the maps, the amino acid sequence,
and the emerging substructures. The program also uses information about the structure of related proteins. "You keep adding
bits of structure and connecting them,"
Engelmore remarks. "As in the case of the
puzzle, it becomes easier as you get more
pieces into place."
In the first test of its intelligence, Crysalis
found almost all the structural elements
of a small protein of known structure.
Later, Allan Terry of the University of
California at Irvine used it to determine the
structure of cytochrome c2, a 112-amino-acid protein. Engelmore describes the program as "a demonstration prototype." But,
he says, "it is easy to imagine an X-ray
diffractometer hooked to a system that
makes electron density maps and feeds
them to a program like Crysalis, which
works out the structure and displays it as
a stereo diagram on a computer screen.
All the elements of such a black box exist
at present; it's only a matter of time, though probably a long time, before they
are assembled into a single system."
Knowledge and graphics
Meanwhile, computers are becoming
friendlier to chemists: They will accept
statements and instructions in basic English
rather than high-level programming languages,
and chemists can communicate
with them using their favorite mode of
notation, a diagram of points and lines
representing atoms and the bonds between
them. In a system developed at the University of California at San Francisco, a
chemist draws a diagram of a structure
with a light pen on an electrostatic table,
and a stereo color model of the molecule
appears on a screen. Two such models
can be displayed, rotated, and moved
around each other to show how the molecules interact. Sitting at a console equipped
with two joysticks, a chemist can manipulate the molecules like a player in a neighborhood electronic game arcade.
Robert Langridge, director of the computer graphics laboratory at the University
of California at San Francisco's school of
pharmacy, employed this system to model
the docking and binding of thyroxine to
the protein, prealbumin, that transports
the thyroid hormone to its target organs.
Prealbumin, shown in one color, contains
pockets for holding the iodine atoms in
thyroxine, depicted in another color. As
the late Eugene C. Jorgensen described
it, "the binding site of the transport protein is shaped like an olive with the core
plugged out, and the hormone fits into
that hole like a pimiento." This was the
first detailed molecular model for the
interaction of a hormone with a biologically
relevant protein.
Jorgensen searched this model for regions
that could be modified to produce an analogue of thyroid hormone that could be
employed to speed development in premature infants. Such infants, whose lungs
may not have developed fully before birth,
experience respiratory distress syndrome,
a principal cause of their illness and death.
Thyroxine bound to prealbumin is too large
a package to cross the placenta, so Jorgensen wanted to make an analogue that
could be administered prenatally. With
the help of a computer program, he designed a growth-hormone carrier package
that crosses the placentas of pregnant rabbits and speeds lung maturation in their
fetuses. David Ballard, a pediatrician at
the University of California at San Francisco, hopes to use such thyroxine analogues to treat infants who do not respond
to steroid hormones, a standard treatment
for the problem.
Langridge is using the graphics system
to study the binding of proteins and DNA.
"The prealbumin-thyroxine interaction is
a good model for proteins in the nucleus
that bind to DNA," he explains. "It has
enabled us to determine the structure of a
protein that binds to a known sequence
of DNA in a bacterial virus and switches
that particular gene on and off."
Herbert W. Boyer, a pioneer in recombinant
DNA technology, uses computer
graphics to investigate the interaction
between DNA and restriction enzymes
used in gene splicing experiments. At
Stanford, researchers combine graphics
and artificial intelligence in a program
called Molgen, for Molecular Genetics,
designed to analyze DNA structures and to
plan cloning and other experiments.
Designing drugs
A well-worn metaphor compares a drug
binding a receptor to a key fitting a lock.
A researcher can practice biological locksmithing by manipulating atomic groups
on a graphic display of a molecule until
the molecule slides smoothly into a substrate. In the living world, the keys might
not reach the lock (no transport mechanism), be inserted in the wrong lock (nonspecific binding), be bent or altered (metabolized), or be lost (excreted). "Nevertheless,
the lock-and-key analogy is a useful
starting point," comments Peter Gund,
senior research fellow at the Merck Sharp &
Dohme Research Laboratories in Rahway,
New Jersey. "Computer-assisted modeling
helps scientists understand the complex
relationships between drugs and receptors.
For example, when the structure of a receptor is unknown, it may be inferred
from the shape of the key: drugs that
fit it."
Computers also help drug designers
deal with floppy keys and locks. "Dynamic
reshaping of both may be required for
effecting biological activity," Gund notes,
"and such reshaping can be more easily
handled with computer graphics. In this
sense, a better analogy would be a combination lock."
Most drugs possess electrostatic charges
that match complementary charges on the
receptor. Gund compares this to a key
with magnets that engages a magnetic
tumbler in a lock. As a drug approaches
its receptor, the electrostatic fields become
perturbed, which may change the reactivity
of the drug.
Computer systems that have begun to
deal effectively with these problems exist
at a number of university and commercial
laboratories. Merck's Rahway laboratory
has developed a system that Feldmann
refers to as "a shining example of what
should happen in rational drug design."
Chemists regard Merck's Molecular Modeling System as friendly because its use requires a minimal knowledge of computers.
When a chemist draws a rough diagram
of a molecule with a light pen, the Merck
system automatically calculates the correct
atomic distances, bond angles, and energy
functions to produce a stereochemically
correct model on the display screen. These
can be compared to related crystal structures
compiled from X-ray data, either
from the Cambridge Crystallographic Data
File, available through the Chemical Information System, or from Merck crystallographers. Merck also maintains two data
bases of its own. One holds 150,000 structures produced by the company; the other
contains test results and biological-activity
data on these structures. A program known
as Compare superimposes modeled structures to find three-dimensional atomic
patterns associated with the drugs' biological activity.
"It's always easier to do computer experiments than to do laboratory experiments,"
Gund observes. In precomputer days, drug
development involved empirical testing,
or screening, of thousands of substances
to detect one showing a new type of desirable biological activity. This process was
followed by systematic chemical modification of the best compound found in order
to optimize its properties. "With computer
systems," says Gund, "we can more readily
identify structures with the desired activity,
then combine and modify them in ways
that lead to the discovery of new compounds or enhancement of the properties
of known ones."
The company recently employed its
system to create a potential diabetes medication. The compound is an analogue of
somatostatin, a hormone that inhibits the
release of glucagon. Glucagon raises blood-sugar levels by increasing production of
glucose. An analogue is needed because
somatostatin degrades too quickly in the
body to be effective. A team at Merck's
West Point, Pennsylvania, laboratory, led
by Daniel F. Veber, approached this problem by first isolating the active portion of
the somatostatin molecule. Veber's group
then employed Compare to model possible
structures that could hold the active portion together and prevent its metabolic
degradation.
"There are so many possible structures
that I'm not sure that we would have found
the correct one without the help of the
computer," Veber observes. Months of
computer analysis, laboratory synthesis,
testing in rats, refining of the structure,
and retesting finally yielded an analogue
with greater potency than somatostatin.
It also showed increased duration of action
and decreased metabolic degradation. The
compound is now being tested as a treatment for juvenile-onset diabetes.
Tribble
Commercial molecular modeling systems
are not limited to design of new drugs.
Chemists at E.I. du Pont de Nemours,
Incorporated, use a group of 300 programs
for work on pesticides, herbicides, plastics,
films, and fibers, as well as on drugs. The
programs are brought together under an
executive system dubbed Tribble, for the
friendly furry creatures in the Star Trek
television series. David A. Pensak, chief
architect of the system, gave it the name
because "it is friendly to chemists. Tribble
has more than 100 happy in-house users
and [has] thousands of compounds running through it at one time."
The system converts a light-pen drawing
of a real or hypothetical molecule into a
rough three-dimensional representation
of the structure. It also determines the
molecular shape having the least strain
energy, the conformation most likely to
be realizable. House molecules are filed
along with structures from outside data
banks. (These include the Cambridge
Crystal File of more than 28,000 X-ray
crystal structures, the Protein Data Bank
library of enzyme and other protein structures compiled by Brookhaven National
Laboratories, and the Protein Sequence
File, which contains all published protein
and nucleic-acid sequences.)
These structures can be fed to a variety
of programs that compute molecular properties such as area, volume, and charge
distribution. Other programs perform
quantum mechanics calculations, pattern
recognition, molecular similarity searches,
and superimposition of structures. Output
programs display and manipulate molecular
structures that can be viewed from various
angles, rotated, or translated. There is also
a component that permits researchers to
create analogues by modifying segments
of molecular structure on a screen without
the need to redraw the entire molecule.
DuPont chemists use Tribble, which
displays molecules comprising as many as
five thousand atoms, to determine precisely
why herbicides they designed in the traditional
way are active. A polymer-products
group employs the system to design a
molecular film to separate water from
ethanol and make alcohol fuels more economical. Another team works on nonaddicting analogues of morphine.
"DuPont has made a large commitment
to the design and manufacture of biochemicals," Pensak notes, "and computers will
play a prominent role in this effort. We
view the machines as internal consultants
that make chemistry easier and more effective. But we always remind ourselves that
computing is not a substitute for thinking,
and computers are no substitute for an
experienced, intuitive chemist. The best
system produces only a model of reality;
that model has to be synthesized and tested
to have any utility."
Making it with computers
Once chemists roll up their sleeves to
turn a computer model into a new compound, how do they begin? Starting materials for synthesizing even relatively simple
organic compounds may number in the
thousands. Once a start is made, there
can be thousands of second steps.
Virus model. A computer representation of an adenovirus.
Chemists with experience in making
related compounds prefer certain starters
and reactions, but these may not be the
most efficient ways to synthesize a new
molecule. W. Todd Wipke, a chemist at
the University of California at Santa Cruz,
says synthesis design exceeds chess in
complexity: "There are more functional
groups than chess pieces, more kinds of
reactions than chess moves, and it is harder
to recognize a good synthesis than to recognize a checkmate." Computers determine
all possible combinations in structural
analyses, so it is natural for chemists to
seek their help for synthesis.
Researchers first tried machines for this
purpose in the 1960s. Seeking to systematize procedures for planning organic
syntheses, Wipke and Harvard University
chemist E. J. Corey realized the potential
of applying computers to the task. (See
"Choosing Chemical Routes," Mosaic,
Volume 5, Number 4.) By 1969, they had
developed a program that later became
known as Lhasa, for Logic and Heuristics
Applied to Synthetic Analysis. The program considers all routes to a
target molecule, rejects unworkable ones, and presents
the most promising paths to chemists for
their evaluation.
In the Lhasa system, a user's light-pen
diagram of the target appears on one screen;
a second screen displays a list of strategies
for taking the molecule apart chemically.
The strategies use transforms, the reverse
of chemical reactions that lead to synthesis.
The program works backward to starting
reactions with easily available chemicals.
The chemist chooses a strategy that he
believes will lead to the greatest simplification of the target. Lhasa
then implements the strategy using transforms stored in its
library. Often, more than one transform
will simplify a molecule, so Lhasa presents
a collection of reactions that lead to substructures one chemical step
away from the desired compound. The user rejects or
accepts these offspring and then repeats
the sequence of steps, treating each substructure as a new target. The
result is a synthesis tree showing various routes from
a molecule that may never before have
been made to familiar reactions using off-the-shelf chemicals.
"An analysis of reaction paths to synthesize a molecule can be done in
20 minutes with a computer," Corey says. "Without a computer it could
take days or weeks.
Some chemists might not be able to do it
at all." Corey's program also counters the
so-called eureka syndrome, the temptation of chemists working with
pencil and paper to select the first workable route
that is found instead of searching for
other, possibly more efficient, routes.
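The backward search described above can be pictured with a small sketch. This is only an illustration under loud assumptions, not Lhasa's actual code: the molecule names, the transform table, and the accept callback (standing in for the chemist who accepts or rejects each offspring) are all invented here. The point it shows is that every route down to available chemicals is enumerated, rather than stopping at the first one that works.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List, Optional

# Hypothetical toy data, invented for illustration only.
AVAILABLE = {"A", "B", "C", "D"}  # off-the-shelf starting materials

# Each transform maps a target to precursors one chemical step earlier.
TRANSFORMS = {
    "Target": [("disconnect-1", ["X", "A"]), ("disconnect-2", ["Y"])],
    "X": [("disconnect-3", ["B", "C"])],
    "Y": [("disconnect-4", ["D", "Z"])],
    # "Z" has no transform and is not available, so routes through it fail.
}


@dataclass
class Node:
    molecule: str
    transform: Optional[str] = None  # None marks a starting material
    precursors: List["Node"] = field(default_factory=list)


def plan(target, accept=lambda target, name, precursors: True, depth=5):
    """Return every synthesis tree that ends in available chemicals."""
    if target in AVAILABLE:
        return [Node(target)]
    if depth == 0:
        return []
    trees = []
    for name, precursors in TRANSFORMS.get(target, []):
        if not accept(target, name, precursors):  # the user rejects this offspring
            continue
        # Treat each precursor as a new target.
        options = [plan(p, accept, depth - 1) for p in precursors]
        # product() yields nothing if any precursor has no route.
        for combo in product(*options):
            trees.append(Node(target, name, list(combo)))
    return trees


def steps(node):
    """List the forward reactions of one route, starting materials first."""
    if node.transform is None:
        return []
    out = []
    for child in node.precursors:
        out.extend(steps(child))
    reactants = " + ".join(child.molecule for child in node.precursors)
    out.append(f"{reactants} to {node.molecule} via {node.transform}")
    return out


for route in plan("Target"):
    print(steps(route))
```

With this toy table only one route survives, because the path through Y needs the unobtainable Z. Passing a stricter accept function prunes the tree in the way a chemist's rejections would.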
Although Lhasa has been under development since 1969, chemists do not yet
employ it routinely for synthesis planning. Several universities use a teaching
version, however, to demonstrate the
principles of synthetic organic chemistry.
"The program is available in 12 or 14
places, including drug and research-oriented chemical companies," Corey says.
Before it becomes useful on an everyday
basis, he points out, these companies must
build or have access to libraries of transforms that will do the kinds of syntheses
they wish to do.
MOSAIC January/February 1983
Working forward
Another synthesis program entering the
market works forward from known starters
rather than backward from the goal product. Chemists Timothy D. Salatin and
William L. Jorgensen of Purdue University
created Cameo, the Computer-Assisted
Mechanistic Evaluation of Organic Synthesis, which predicts the results of
chemical reactions given starting materials and
certain other conditions.
Starting materials are listed on a display screen, and Cameo applies to
them a menu of reagents and mechanisms common to a particular
class of reactions. The system assigns a
number to each product resulting from a
reaction, and it displays a tree showing
each one as a branch growing from a
trunk, the starting materials. By calling
up each number, a chemist can display
the structure of the product and submit it
to further reactions. Products of these reactions also are numbered and placed on the
tree, so the user can see a sequence from
beginning to end.
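The numbered product tree lends itself to a short sketch. What follows is an assumed toy model, not Cameo's implementation: the RULES table, the structure labels, and the ProductTree class are hypothetical stand-ins for the mechanistic modules the article describes.

```python
# Hypothetical reaction rules: (structure, reagent) -> predicted products.
RULES = {
    ("S", "base"): ["P1", "P2"],
    ("P1", "acid"): ["P3"],
}


class ProductTree:
    """Numbered tree of products growing from the starting material."""

    def __init__(self, start):
        # number -> (structure, parent number, reagent used)
        self.nodes = {0: (start, None, None)}

    def react(self, number, reagent):
        """Submit a numbered structure to a reagent; return the new numbers."""
        structure = self.nodes[number][0]
        new_numbers = []
        for predicted in RULES.get((structure, reagent), []):
            n = len(self.nodes)
            self.nodes[n] = (predicted, number, reagent)
            new_numbers.append(n)
        return new_numbers

    def sequence(self, number):
        """Trace one product back to the trunk, returned beginning to end."""
        chain = []
        while number is not None:
            structure, parent, _ = self.nodes[number]
            chain.append(structure)
            number = parent
        return chain[::-1]


tree = ProductTree("S")
first = tree.react(0, "base")          # products numbered 1 and 2
second = tree.react(first[0], "acid")  # product numbered 3
print(tree.sequence(second[0]))        # ['S', 'P1', 'P3']
```

Keeping the parent number with each product is what lets the user call up any number and see the whole sequence that led to it.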
Cameo employs a separate module
for each of the many molecular classes
that Jorgensen referred to as "the guts of
synthetic organic chemistry." In evaluating
proposed reactions for feasibility, "a chemist who might have wasted a day in the
lab on an ill-conceived reaction now can
find potential problems in five minutes,"
he says. "Cameo allows its users to predict
what will be in the pot at the end of a
synthesis: whether or not the reactions
produce low yields or undesirable byproducts."
The program also produces original
reactions and products. Jorgensen and his
co-workers tested it by asking it to predict
the results of reactions that yield well-known products. The system not only did
this, but it also predicted unexpected
products whose existence was later confirmed. These results are encouraging, but
whether Cameo is up to the task of routine
synthesis planning or will become a vital
cog in a black box system remains to be
demonstrated.
A third synthesis program, developed
by Wipke and called Secs, for Simulation
and Evaluation of Chemical Synthesis, has
been used by industry since 1973. It helps
chemists design and select syntheses of
biologically important molecules. Like
Lhasa, it works backward from the target
compound. "Secs works in two worlds,"
Wipke explains, "a world of reactions
and a world of strategy concerned with
the architectural principles of building
molecules."
The chemist works in the strategy world
to create a plan that contains structural
changes required to break up a target
molecule into chemically reproducible
units. This plan, for example, might call
for breaking certain bonds or leaving specific substructures intact. The program
then searches the reaction world and selects
only those reactions that satisfy the plan.
"Strategies can be developed without
[the chemist] knowing the content of the
transform library, and new transforms can
be added without altering strategies,"
Wipke points out. "In Lhasa, no such
separation exists; new reactions require a
reordering of steps in a strategy sequence,
which is difficult to do."
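The separation Wipke describes, a plan written without knowledge of the transform library and a library that can grow without touching the plan, can be sketched as two independent data structures joined only by a filter. Everything named here (Plan, Transform, the bond and substructure labels, and the rule that a transform must break at least one planned bond) is an assumption made for illustration, not the program's actual design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transform:
    name: str
    breaks: frozenset    # bonds this transform disconnects
    disturbs: frozenset  # substructures it would alter


@dataclass(frozen=True)
class Plan:
    break_bonds: frozenset  # bonds the chemist wants broken
    keep_intact: frozenset  # substructures that must be left alone


def satisfies(transform, plan):
    """Assumed rule: hit a planned bond and spare every protected substructure."""
    hits_goal = not transform.breaks.isdisjoint(plan.break_bonds)
    spares_core = transform.disturbs.isdisjoint(plan.keep_intact)
    return hits_goal and spares_core


def select(plan, library):
    """Search the reaction world for transforms that satisfy the plan."""
    return [t.name for t in library if satisfies(t, plan)]


# Hypothetical library and plan.
library = [
    Transform("T1", frozenset({"b1"}), frozenset()),
    Transform("T2", frozenset({"b1"}), frozenset({"ring"})),
    Transform("T3", frozenset({"b7"}), frozenset()),
]
plan = Plan(break_bonds=frozenset({"b1", "b2"}), keep_intact=frozenset({"ring"}))

print(select(plan, library))  # ['T1']

# A new transform is added; the plan itself is untouched.
library.append(Transform("T4", frozenset({"b2"}), frozenset()))
print(select(plan, library))  # ['T1', 'T4']
```

Because the plan refers only to bonds and substructures of the target, appending T4 changes the search results without any edit to the strategy.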
Wipke's system is used by a consortium
of seven of the largest pharmaceutical
companies in West Germany and Switzerland. They share a large library of
reactions that is closed to other organizations.
Drug and chemical corporations in Japan
and Sweden use Secs, as does a government laboratory in Australia. In the
United States, Secs is available through
Stanford, the University of Pennsylvania
medical school, ADP Network Computers,
Merck and Company, and several other
organizations.
To illustrate Secs in use, Wipke tells
the story of a Swedish company that manually planned the synthesis of "a relatively
simple molecule," but could not get the
process to work properly. They gave the
problem to Secs, which generated what
he calls "an obvious solution but one that
had not been considered. The Swedish
chemists laughed at first, then after some
consultation decided that it would work."
The company now produces kilogram
quantities of the chemical by the Secs-generated route.
Computerized metabolites
Wipke used parts of Secs to construct
a program called Xeno for predicting the
biological activity of foreign compounds,
or xenobiotics, such as pesticides, drugs,
and other chemicals not normally found
in the body. "In many cases," he says,
"toxicity, carcinogenicity, or another
activity is due not directly to a foreign
substance but to one or more of its metabolites. This program identifies known
and unknown metabolites. It determines
their mechanism of formation, activity,
and ultimate fate, tests models of metabolism,
and aids researchers in planning
experiments."
The program exposes a foreign compound to every biotransformation known
to exist in the species under study. It generates separate lists of metabolites for mice,
rats, and humans. Because Xeno does not
take into account factors such as excretion and membranes that block transport,
it generates more metabolites than an
organism would. "However, overprediction is less of a problem than missing
metabolites," Wipke says. The program
does not predict quantities, but it does
provide information on the importance of
each biotransform and on the likelihood
of its occurrence.
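A minimal sketch of this exhaustive, per-species generation appears below. It is a toy under stated assumptions, not Xeno itself: the rule names, the likelihood numbers, and the string notation for metabolites are invented, and no excretion or transport barrier is modeled, which is exactly why the sketch, like the program, overpredicts.

```python
# Hypothetical biotransformations per species, each with an invented
# likelihood score. None of these values come from the article.
BIOTRANSFORMS = {
    "mouse": [("hydroxylation", 0.8), ("conjugation", 0.5)],
    "rat": [("hydroxylation", 0.8)],
    "human": [("hydroxylation", 0.6), ("dealkylation", 0.3)],
}


def metabolites(compound, species, generations=2):
    """Apply every known rule to the compound and to each new metabolite.

    Returns a dict mapping each metabolite to the likelihood score of the
    last biotransformation that formed it. Nothing is ever removed, so the
    list is larger than a real organism would produce.
    """
    found = {}
    frontier = [compound]
    for _ in range(generations):
        next_frontier = []
        for parent in frontier:
            for rule, likelihood in BIOTRANSFORMS[species]:
                metabolite = f"{rule}({parent})"
                if metabolite not in found:
                    found[metabolite] = likelihood
                    next_frontier.append(metabolite)
        frontier = next_frontier
    return found


# Separate lists for each species, as the program is described as producing.
by_species = {s: metabolites("X", s) for s in BIOTRANSFORMS}
for species, listing in by_species.items():
    print(species, len(listing))
```

The score attached to each metabolite is a placeholder for the importance and likelihood information the program is said to report; quantities are deliberately absent.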
Predictions by Xeno have been tested
successfully against metabolites known
from the literature. "It also has found
compounds unknown to researchers,"
Wipke relates. In 1978, he ran a demonstration on the TCDD molecule, then
considered the toxic compound in Agent
Orange. "Everyone scoffed at the list of
metabolites generated by the program
because, they said, 'TCDD is not metabolized,'" Wipke recalls. But, he reports, in
1981 chemists found that TCDD is indeed
metabolized, and some of the metabolites
predicted by Xeno have been identified
experimentally. Wipke solicits other problems for and evaluations of Xeno as part
of his effort to coax the program from what
he calls a "basic research stage" into the
world of everyday chemistry.
While designing and perfecting the inner
cogs and gears of future black boxes,
researchers also want to organize all the
knowledge of organic chemistry into a
structured form that fits neatly into these
boxes. Jorgensen's students have begun
organizing information from the literature
into algorithms for computer use. "Our
eventual goal," he states, "is to systematize chemists' understanding into rules
for teaching and doing chemistry."
Wipke's effort involves writing programs
"to look at molecules the way a human
does." The first step, as he sees it, is to
formalize the principles of chemistry so
that they can be tested, somewhat like
proving theorems in geometry. Included
would be both the principles explicitly
stated in textbooks and those that exist
implicitly in the heads of chemists. The
next step is to develop a system for machine
representation of these principles. The
final step is to build into a computer system the capacity to use the principles to
reason and solve problems, to do what a
skilled chemist does intuitively.