PERSPECTIVES
Dror G. Feitelson
The Hebrew University of Jerusalem
Millet Treinin
Hebrew University Hadassah Medical School
The Blueprint for Life?
A living cell is the product of complex interactions between DNA and the cellular environment. Can researchers recreate the environment based on the information contained in the DNA alone?
Bioinformatics, genomics, proteomics: these fields of study have
exploded into the public eye in recent years. The dramatic
mechanization of DNA sequencing at an industrial scale, and
the advent of microarrays that let scientists simultaneously
quantify the expression patterns of thousands of genes [1], are
already providing new methods for classifying tumors and new insights
into fighting viral infections. Such achievements are based on better observations and understanding of cellular processes. This progress has also
opened the door to more basic questions involving the interplay of cellular components and DNA.
Current genome projects, especially the Human Genome Project, have
sparked great interest in the information encoded in DNA. DNA is often
referred to as “the blueprint for life,” implying that it contains all the information needed to create life. But this approach ignores the complex interactions between DNA and its cellular environment—interactions that
regulate and control the spatial and temporal patterns of gene expression.
Moreover, the particulars of many cellular structures seem not to be
encoded in DNA, and they are never created from scratch; rather, each cell
inherits templates for these structures from its parent cell. Thus, it is not
clear that all life processes are directly or indirectly encoded in DNA, casting a measure of doubt on the belief that we can understand them solely by
studying DNA sequences.
DNA ENCODING AND COMPUTER PROGRAMMING
One of the greatest scientific discoveries of the twentieth century is the
structure of DNA and how it encodes proteins. Simply put, DNA contains
the information needed to create proteins, namely the amino acid sequences
of these proteins. This information is transcribed into messenger RNA, and
a set of mediators—ribosomes and tRNAs, which DNA also encodes—translates the mRNAs into proteins. Once created, the proteins interact with each
other in the cellular environment, leading to the biochemical processes of life.
The workings of computer programs are analogous to this process.
Developers use a high-level language (HLL) to code programs that specify a
sequence of instructions. A compiler translates the instructions from the
abstract HLL to an executable form, typically in two steps: parsing and
code generation. Finally, the program’s interaction with the computer hardware during execution brings it to life.
Both processes follow a simple linear progression: from information to
creation to animation and functioning in a particular environment. Figure
1 shows a simple model for this progression. Information starts the process,
and information encodes all that is needed to proceed. The translation is
merely a mechanical process that deterministically substitutes one symbol
for another.
Computer
0018-9162/02/$17.00 © 2002 IEEE
[Figure 1(a): DNA → transcription → mRNA → translation → protein → cellular environment → life processes.]
DNA is a special case because the program it
encodes—if we choose to call it a program—
involves reproducing its environment, including
itself. In computing terms, the program’s function
is to build new hardware and run on it as well. To
do so, the “DNA program” must acquire and
assimilate raw materials to fashion new structures.
But what about tools? While some may claim
that the program includes the instructions for creating the required tools, setting the process in
motion requires an initial toolset. Such tools convey information about the system’s functioning as
a whole—information not necessarily available in
the program. Moreover, switching the available
tools can modify the outcome, thus diverging from
the construction of a true replica.
DNA: INFORMATION STORAGE MEDIUM
The DNA molecule is an amazing example of
nature presaging technology, but it is not the only
one. Birds and insects could fly before humans developed aircraft, for example, and scientists often refer
to the way bats use ultrasonic sound reflections as
presaging electronic radar and sonar systems. But
DNA is different. It is unlike any mechanism that
operates on objects in the physical world, and it is as
close as we can imagine to pure information storage. Moreover, it uses a digital encoding.
When researchers deciphered DNA’s structure
and function in the 1950s, they found it remarkably
similar to the abstract tapes postulated for storage
in Turing machines some 20 years earlier, and to the
magnetic storage tapes used in early electronic computers. DNA, like a tape, stores data in a linear
sequence. The data comes in discrete units, like letters of an alphabet. The DNA alphabet is the four
types of nucleotides, commonly denoted by A, T,
C, and G. These are grouped into codons: three-letter words that represent amino acids. Electronic
computers use a binary basic alphabet—zero and
one—and “words” of eight or more basic symbols
to encode letters in human languages and numbers.
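The parallel between the two alphabets can be made concrete with a little arithmetic. The following is only an illustrative sketch: the function names and the particular two-bit pattern assigned to each base are assumptions made here, and any one-to-one assignment would serve equally well.

```python
from itertools import product

BASES = "ACGT"  # the four-letter DNA alphabet

# Assumed mapping, chosen only for illustration: two bits per base.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}


def to_bits(dna: str) -> str:
    """Encode a DNA string as a binary string, two bits per base."""
    return "".join(BASE_TO_BITS[base] for base in dna)


def all_codons() -> list[str]:
    """Enumerate every possible three-letter word over the DNA alphabet."""
    return ["".join(triple) for triple in product(BASES, repeat=3)]


print(to_bits("ATG"))     # one codon occupies six bits: 001110
print(len(all_codons()))  # 4 ** 3 = 64 possible codons
```

Four symbols taken three at a time give 64 possible codons, more than enough for the 20 amino acids, which is why several codons can stand for the same amino acid.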
The instructions for protein creation are the most
important information encoded in DNA. Like
DNA, proteins are essentially a linear sequence of
building blocks from a restricted repertoire. For
proteins, the building blocks are the 20 amino
acids. But, unlike DNA, proteins do not store information. Rather, they assume dual roles: as the infrastructure of living cells and as the agents that “make
things happen.”
[Figure 1(b): program in HLL → parsing → compiler internal representation → code generation → executable (machine instructions) → computer hardware → process running.]
The process by which DNA information becomes
a new protein is rather involved. First, the DNA is
transcribed into mRNA, essentially a copy that can
be sent to the production plant. This step is complicated by the fact that a gene need not be stored
in a contiguous stretch of DNA, and the noncoding parts (the introns) must be spliced out, leaving
only the coding parts (the exons). The production
plant, embodied by ribosomes and tRNA, translates the mRNA codons into amino acids and facilitates their linking into a protein chain. Existing
proteins that bind to the DNA and the RNA catalyze these steps. The ribosomes, which provide the
process infrastructure, are also composed of proteins and RNA. How all this came to be is one of the
major questions facing scientists today.
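The three steps just described (transcription, splicing, and translation) can be sketched as a small pipeline. This is a deliberately simplified model, not a description of the cellular machinery: the gene sequence and exon coordinates are hypothetical, the codon table lists only the handful of entries from the standard genetic code that the example needs, and transcription is reduced to copying the coding strand with T replaced by U.

```python
# Partial codon table (standard genetic code, mRNA letters).
# Only the codons used in the example are listed.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "AAA": "Lys",
    "UAA": None,  # stop codon
}


def transcribe(dna: str) -> str:
    """Copy the coding strand into pre-mRNA (T becomes U)."""
    return dna.replace("T", "U")


def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Keep only the exon ranges (start, end) and drop the introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna: str) -> list[str]:
    """Read the mRNA three letters at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid is None:
            break
        protein.append(amino_acid)
    return protein


# Hypothetical gene: exon, intron (GTCCAG), exon ending in a stop codon.
gene = "ATGTTT" + "GTCCAG" + "GGCAAATAA"
exons = [(0, 6), (12, 21)]
protein = translate(splice(transcribe(gene), exons))
print(protein)  # ['Met', 'Phe', 'Gly', 'Lys']
```

Note that the sketch has to be told where the exons are. In the cell, that knowledge resides in the splicing machinery, which is itself made of existing proteins and RNA.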
Actually, DNA information only partly specifies
how proteins should be created: It specifies the
sequence of amino acids, but not the 3D structure
they should assume. Once a protein is created, it
may immediately fold into the correct configuration. But it seems that in most cases, a sequence of
amino acids has more than one stable configuration, and the configuration that takes effect depends
on other proteins—the chaperones—that exist in
the translation environment. If the appropriate
chaperones are absent, the string of amino acids will
not assume the protein’s correct functional form.
Figure 1. Comparison of the simple linear progression in (a) DNA protein encoding and (b) computer program execution. Both processes progress from information to creation to animation and function.
THE CELL AS A STATE MACHINE
In nature, DNA does not exist on its own. Even
viruses, the nearest thing to pure DNA, need a cellular environment in which to function. When
viruses are placed in a cellular environment, proteins in the cell operate on the virus DNA and
incorporate it into the cell’s reproductive cycle.
In a sense, this process resembles recent successful mammalian cloning processes: Researchers put
DNA from one animal into an embryonic cellular
environment from another animal of the same
species, and the environment took up the DNA and
entered the normal differentiation and growth
process [2]. The researchers could do this without
fully understanding the environment or the process.
That proteins bind to DNA to activate transcription indicates that DNA contains more than
explicit protein sequence data. The most important binding sites are promoters, which activate
transcription, and splicing sites, which allow a single mRNA sequence to be constructed from
several disjoint pieces of DNA. Additional sites
cause transcription to terminate.
July 2002
[Figure 2 diagram: the cytoplasm (protein mix) holds STATE = {set of proteins}. Protein P1 binds to the promoter of protein P2, and protein P2 is transcribed from the DNA. DNA rule: if P1 ∈ STATE then STATE = STATE ∪ {P2}.]
Figure 2. Simplistic view of the cell as a state machine. The expressed proteins are the state, and their interaction with the DNA causes new proteins to be expressed, thus changing the cell’s state.
We can consider these sites collectively as control features that augment the data. Control sites
link the DNA to its environment by allowing the
environment to influence which proteins it will fabricate under what conditions. Cells generally do not
produce most of the proteins encoded in their
DNA. Only a small subset is fabricated—the subset for which proteins that already exist in the cytoplasm “turn on” the promoters. This distinguishes
the genotype—the full list of genes—from the phenotype—the genes whose effect we see.
In computer science terms, a gene is therefore similar to a state machine’s transition rule. The composition of a cell’s cytoplasm—the mix of proteins
in the cell—determines its current state. To activate
transcription, some of these proteins might interact
with the gene’s promoter, which eventually will lead
to the production of a new protein, whose addition
changes the cell’s state. Figure 2 shows a simplistic
view of DNA interactions with the cell’s state.
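The rule in Figure 2 translates almost directly into code. The sketch below is a minimal rendering of that rule under our own naming: the Gene type and the step function are illustrative assumptions, and the protein names are the placeholders used in the figure.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """A transition rule: if every activator is present, add the product."""
    activators: frozenset[str]  # proteins that must bind the promoter
    product: str                # protein produced when the gene is on


def step(state: frozenset[str], genome: list[Gene]) -> frozenset[str]:
    """Fire all currently enabled rules at once and return the new state."""
    produced = {g.product for g in genome if g.activators <= state}
    return state | produced


# The single rule shown in Figure 2: if P1 is in STATE, add P2.
genome = [Gene(frozenset({"P1"}), "P2")]

print(sorted(step(frozenset({"P1"}), genome)))  # ['P1', 'P2']
print(sorted(step(frozenset({"P3"}), genome)))  # ['P3'], rule not enabled
```

Because step fires every enabled rule in one pass, it is slightly closer to the concurrent behavior of a real cell than a strictly sequential state machine would be.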
A genome is a collection of many transition rules.
From any given state, transitions to many other
states are possible according to the currently
enabled rules—meaning that the gene’s promoters
can become active. In a real cell, many of these transitions can take place concurrently, unlike the
sequential nature of the state-machine model. The
use of embryonic cells for cloning exemplifies the
notion of state: For cloning to work, scientists must
inject the desired DNA into a cell in the machine’s
“initial state.”
Only embryonic cells have the correct mix of proteins to start the differentiation process. Naturally,
the cellular state machine is very complex, with two
fields of study—gene expression profiling and proteomics—devoted to deciphering it. Microarrays,
which enable simultaneous recording and quantifying of the expression levels of thousands of
genes—essentially, recording the cell’s state—are
an important tool in this quest. Because a cell’s state
correlates with various normal and pathological
conditions, if we can identify the state, we can diagnose the respective conditions.
This simple description ignores many of the real
complexities known to biologists. Transitions can
be simultaneous, for example, and feedback from
proteins to gene-expression control occurs at all
stages of protein production, not only at initial
transcription [1]. Thus, existing proteins affect
mRNA splicing, its transport to ribosomes, and the
folding, structuring, and stability of the newly created proteins.
All of these stages are loci for exercising control.
The complexity of these processes, best exemplified by transcription, implies that it is hard to assign
direct control to the information encoded in the
DNA. For example, multiprotein complexes—
whose formation depends in a nonlinear manner
on delicate quantitative balances that integrate
inputs from many different signaling pathways—
might regulate transcription [3]. Nevertheless, we can
regard transition rules as emergent properties of
the relationship between the DNA binding sites and
the cell’s complex states.
While the full state machine is very complex, it is
actually composed of many largely independent
smaller state machines. Skin cells occupy a different part of the state space than muscle or nerve
cells, for example, and different types of cells retain
their identity across cell divisions.
The different state machines are coupled after all,
however, as can be seen in the differentiation
process, in which a single cell divides repeatedly to
create a whole organism. Each daughter cell
assumes a different role—moving to a different area
of the state space—based on stimuli from its neighbors. This is a unidirectional process: Once the cells
differentiate, they go their separate ways.
With this view in mind, we can observe that a living cell involves the interaction of two types of
information: the transition rules largely encoded in
the DNA, and the state encoded in the cytoplasm.
Neither alone can suffice, and neither one is more
important than the other.
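A toy experiment illustrates why neither kind of information suffices alone. In the sketch below, one shared set of rules is run from different initial cytoplasmic states. All protein names are hypothetical, and the model ignores protein degradation and quantitative effects.

```python
def run(rules: dict[str, set[str]], cytoplasm: set[str]) -> set[str]:
    """Apply the rules repeatedly until no new protein appears.

    `rules` maps each product to the proteins needed to turn its gene on.
    """
    state = set(cytoplasm)
    while True:
        new = {p for p, needs in rules.items() if needs <= state} - state
        if not new:
            return state
        state |= new


# Hypothetical genome shared by every cell of the organism.
RULES = {
    "actin_like": {"muscle_factor"},
    "contractile": {"actin_like"},
    "keratin_like": {"skin_factor"},
}

muscle = run(RULES, {"muscle_factor"})  # same rules, one starting state
skin = run(RULES, {"skin_factor"})      # same rules, another starting state
inert = run(RULES, set())               # rules with no state: nothing happens
no_dna = run({}, {"muscle_factor"})     # state with no rules: nothing changes
```

The same rules settle into different end states depending on where they start, while rules without a starting state, or a starting state without rules, go nowhere.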
NON-DNA INFORMATION
The cytoplasm holds information about the cell’s
current state. But can the cytoplasm affect the
developmental trajectory independently of the
DNA? Can a cell inherit changes in the cytoplasm
without corresponding changes in the DNA?
It would seem that the cytoplasm’s role is only to
interpret the DNA. To perform this interpretation,
the cytoplasm provides enzymes and structures for
use in translation and transcription. This is much
like compiling a program to an executable. But in
computer science, we know that dishonest compilation is possible, resulting in an executable that
does not reflect the source code exactly, as the
“Trojan Horses and the Evolution of Compilation”
sidebar explains.
Trojan Horses and the Evolution of Compilation
Dishonest translation, in which a product does not precisely
mirror the original encoding, also exists in computers. In software, a Trojan horse—named after the wooden artifact from
Greek mythology that contained more than could be seen on
the surface—refers to an executable that includes code that is
not a translation of the original program but was added later,
usually maliciously.
A computer virus is a good example of malicious code. A virus
is a piece of code that unobtrusively attaches to a program and,
under certain circumstances, changes its function. Thus, one
way to detect a viral infection is to create digital fingerprints
that recognize changes in the executable. An executable that no
longer fits its fingerprint is likely infected.
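A fingerprint check of this kind can be sketched in a few lines. The choice of SHA-256 is our assumption (the technique only requires some cryptographic digest), and the byte strings are stand-ins rather than real executables.

```python
import hashlib


def fingerprint(executable: bytes) -> str:
    """Return a SHA-256 digest that serves as the executable's fingerprint."""
    return hashlib.sha256(executable).hexdigest()


def looks_infected(executable: bytes, recorded: str) -> bool:
    """An executable that no longer matches its recorded fingerprint is suspect."""
    return fingerprint(executable) != recorded


clean = b"legitimate code"          # stand-in bytes, not a real binary
recorded = fingerprint(clean)       # taken while the program is trusted
infected = clean + b" + payload"    # a virus attaches itself later

print(looks_infected(clean, recorded))     # False
print(looks_infected(infected, recorded))  # True
```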
But the virus can avoid detection if the compiler inserts it at
the outset, as Figure A shows. In this case, the original fingerprint includes the virus in addition to the legitimate code, making it impossible to detect infected software. If we suspect the
compiler, we might inspect its source code, and even recompile
the compiler before we use it [1]. Surely we can then trust the code
it produces, can’t we?
The answer is no [2]. Assume we have the source code of an
honest compiler C. We can write another compiler, D, which
recognizes C’s source code. When D compiles C, it creates a dishonest version C*. This version can also recognize C’s source
code. When it compiles any other software, it simply adds the
virus. But when C* is used to compile C, it adds the same routine that D added: a routine that recognizes C’s source code,
adds viruses to other software, and propagates itself when compiling C. We can now dispose of D, leaving the system with C*
installed as the compiler, and C’s source code. Compiling this
source code with C* will create a new corrupt C*, despite the
fact that the source code only had instructions for the honest
C. Thus, even if we inspect and recompile both the application
and the compiler sources, we still cannot be sure that the resulting executable is not infected.
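The self-propagating compiler can be simulated with a toy model. In the sketch below a binary is represented simply as the set of routines it contains, and the routine names, source strings, and helper functions are all hypothetical. It illustrates the argument above and is not Thompson's actual construction.

```python
COMPILER_SOURCE = "compiler: parse, generate"  # honest source code of C
APP_SOURCE = "app: login"                      # some other program


def declared(source: str) -> set[str]:
    """The routines a source text actually asks for."""
    body = source.split(":", 1)[1]
    return {name.strip() for name in body.split(",")}


def run_compiler(binary: frozenset[str], source: str) -> frozenset[str]:
    """Execute a compiler binary on a source text, yielding a new binary."""
    output = declared(source)
    if "trojan" in binary:              # the hidden routine, if present
        if source == COMPILER_SOURCE:
            output.add("trojan")        # propagate itself into new compilers
        else:
            output.add("virus")         # infect everything else
    return frozenset(output)


# D is the one-off dishonest compiler. Compiling C's honest source gives C*.
D = frozenset({"parse", "generate", "trojan"})
c_star = run_compiler(D, COMPILER_SOURCE)

# D is now discarded. Recompiling the honest source with C* reproduces C*.
rebuilt = run_compiler(c_star, COMPILER_SOURCE)
app = run_compiler(c_star, APP_SOURCE)

print("trojan" in declared(COMPILER_SOURCE))  # False: the source is clean
print(rebuilt == c_star)                      # True: the binary is not
print(sorted(app))                            # ['login', 'virus']
```

In this toy model D and C* happen to contain the same routines. The point is only that the "trojan" routine survives recompilation even though it appears nowhere in the source.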
Similar situations can occur even without malicious intent.
Sometimes, in large software projects involving language design,
after several generations of language development, designers
can no longer recompile the entire system from scratch.
Assume a new language L, with versions denoted L1, L2, and so on. We initially write a compiler in
language S for compiling language L1, and compile it, resulting
in compiler C_L1. Next we use language L1 to write a compiler
for the next version, L2, and then use C_L1 to compile this second compiler, creating C_L2. We can now use L2, which presumably has more features, to write an even more advanced
compiler for L3. This continues as we write compilers in more
advanced versions of the language.
In principle, we can repeat the entire iterative process starting from the original compiler in language S. But this depends
on our cataloging all the intermediate versions of the source
code in the order they were produced. If one of the intermediate versions is misplaced, we are left with a working system that
we can continue to develop, but that we cannot recreate from
scratch. Part of the required knowledge is in the operational system itself; it no longer exists in the source code.
References
1. J.M. Boyle, R.D. Resler, and V.L. Winter, “Do You Trust Your Compiler?” Computer, May 1999, pp. 65-73.
2. K. Thompson, “Reflections on Trusting Trust,” Comm. ACM, Aug. 1984, pp. 761-763.
[Figure A diagram: a program passed through an honest compiler yields an executable; the same program passed through a dishonest compiler yields an executable that contains a Trojan horse.]
Figure A. Honest and dishonest compilers. A dishonest compiler
can add malicious code (a Trojan horse) during compilation.
In living cells, this process is called epigenetic
inheritance because inherited traits do not pass
through the conventional genetic mechanism.
Rather, they pass as functional biochemicals, typically proteins, that exist in the cytoplasm and
mediate various effects. As cells split, their cytoplasm passes to the next generation together with
these proteins [4]. Remarkably, this occurs even in
unicellular organisms such as bacteria.
Although epigenetic inheritance is uncommon,
we can cite several examples of its occurrence, many
of which depend on modifying the DNA structure
without changing the code itself, for example,
methylation and chromatin remodeling [4-6]. Such
modifications can affect expression patterns,
so it is difficult to claim independence from
the DNA.
Examples totally unrelated to the DNA
also exist. Prions, infamous for their role in
Mad Cow disease, are perhaps the best-known
example. These proteins have two different
stable conformations, each of which
has different functions. One of these functions
is to control the conformations of the
prion itself. Thus, once a protein adopts a
specific conformation, it causes other copies
of itself to do likewise. As the proteins propagate in the cytoplasm from each cell to its
daughter cells, this preference also propagates [7, 8].
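The self-perpetuating character of such a conformation can be illustrated with a crude simulation. The sketch below is a cartoon under strong assumptions of our own: conversion is instantaneous and complete, each cell doubles its protein count before dividing, and the labels "normal" and "prion" are merely names for the two conformations. DNA is not modeled at all.

```python
def convert(cell: list[str]) -> list[str]:
    """If any copy is in the prion form, all copies adopt it."""
    if "prion" in cell:
        return ["prion"] * len(cell)
    return list(cell)


def grow(cell: list[str]) -> list[str]:
    """Newly made copies start in the normal form, then conversion acts."""
    return convert(cell + ["normal"] * len(cell))


def divide(cell: list[str]) -> tuple[list[str], list[str]]:
    """Split the cytoplasm between two daughter cells."""
    half = len(cell) // 2
    return cell[:half], cell[half:]


seeded = ["normal", "normal", "normal", "prion"]
daughter_a, daughter_b = divide(grow(seeded))
granddaughters = divide(grow(daughter_a))

unseeded = ["normal"] * 4
plain_daughters = divide(grow(unseeded))
```

Every descendant of the seeded cell ends up with only the prion form, while every descendant of the unseeded cell keeps the normal form, although the two lineages are given identical instructions.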
Another class of DNA-less inheritance is the
inheritance of structures. Perhaps the best examples
come from Tetrahymena and Paramecium, two
protozoa with cilia protruding from the surface of
their cells. These cilia, organized in complex patterns, move in a rhythmic motion during feeding.
An extensive analysis showed inheritance of variations in these patterns that is independent of DNA
sequence [9]. Moreover, experimental manipulation
of these structures also led to inheritable changes [10].
While known examples of inheriting structural
modifications are rare, the actual inheritance of structures is basic. Life requires many cellular organelles,
but they are not created directly from DNA-borne
specifications. These organelles typically replicate
independent of the encompassing cell cycle—albeit
at about the same rate. For example, membranes crucial for cellular functions are composed of lipids, not
proteins, and the cells obtain them through food
intake. Proteins modify the lipids as needed and
assimilate them into existing membranes. As the
membranes grow, the cell grows and eventually splits.
This process is not directly encoded in DNA.
Similar processes occur in other organelles. For
example, cells do not create mitochondria from
scratch—rather, existing mitochondria grow and
split [11]. The endoplasmic reticulum is built by
extending existing structures, not by recreating
them from scratch. The same process applies to specific parts of the endoplasmic reticulum, such as
the Golgi apparatus [12].
In all of these examples, the raw materials are
indeed proteins, but they do not naturally come
together to create the required complexes.
Organelles are composed of many different protein
types, which must connect to each other and be
embedded in lipid membranes. Thus, to assimilate
new proteins into the correct locations of the structure, the organelles must exist to begin with. By
assimilating proteins, the organelles grow and eventually split.
We propose the following five-tier model of information required by cellular life:
• classical DNA protein encoding;
• DNA control regions, by which existing proteins control the transcription of others;
• indirect control by interactions among proteins, such as protein trafficking, assembly, and
the creation of protein complexes;
• propagation of cytoplasmic content from a cell
to its daughter cells, consequently affecting
their behavior; and
• structural information, embodied in cellular
organelles.
The progression is from information encoded
directly in the DNA to information not so encoded,
which might even be completely absent.
THE CHICKEN AND EGG
A major difficulty with the notion that the DNA
contains all the required information is that this
information seems useless without the surrounding cellular machinery. While the DNA contains
basic instructions on how to prepare many components of the machinery—namely, proteins—it is
unlikely to contain full instructions on how to
assemble them into supermolecular structures to
create a functional cell.
The DNA will also be missing some nonprotein
components altogether. Moreover, the same DNA
can result in many different cell types, so additional
information is obviously needed to create one specific type. Thus, DNA is only meaningful in a cellular context in which it can express itself and in
which there is an iterative, cyclic relationship
between the DNA and the context.
Thus, we are left with the chicken-and-egg problem: How did this process start? This question is
actually quite complicated because the feedback
cycle is very tight. Proteins replicate, maintain, and
protect the DNA. We can’t create proteins from
DNA if we don’t already have proteins to assist in
the transcription process as polymerases and to
assist the new proteins in folding correctly into their
intended 3D configurations as chaperones.
The answer to this question is that the cycle we see
today results from a very long spiral that represents
the evolutionary process. The cycle can’t be split in
any given place because each element depends on
those around it. The quest to understand the origins of life is the quest to reconstruct the spiral,
which involves understanding all the processes that
drive it and are not encoded in the DNA.
[Figure 3 diagram: (a) phylogenesis (evolution of species): DNA passes from generation 1 through generation 5, with a mutation between successive generations; (b) ontogenesis (differentiation of an individual): cell 1 through cell 5, with DNA and a differentiation step between successive cells; (c) homeostasis (regulation to maintain state): DNA and protein interact repeatedly while the cytoplasmic mix stays the same.]
Figure 3. Three scales in the spiral of life. (a) During phylogenesis, organisms reproduce and mutate; (b) during ontogenesis, cells divide and differentiate; and (c) during homeostasis, transcription keeps the cell in a stable state.
As Figure 3 shows, one source of life’s complexity is the existence of spirals at three levels:
• the evolutionary scale, at which organisms
reproduce and mutate, leading to a divergence
of species, or phylogenesis;
• the multicellular scale, at which an organism’s
cells divide and differentiate, creating the different body parts, or ontogenesis; and
• the single-cell scale, at which transcription is
used to maintain a stable state, or homeostasis: genes are only transcribed to make up for
degraded proteins.
In computer science terms, we can talk about
three levels of nested parallel loops: multiple interacting organisms, multiple cells within each organism, and multiple proteins expressed in each cell.
This is massive parallelism with very little synchrony. The cytoplasmic composition must embody
the force driving all three levels of the life cycle.
Fully understanding this process requires knowing
numerous details.
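Taken literally, the three levels of nesting look like the loop structure below. This is a skeleton only: the function name and the nested-list representation are assumptions made for illustration, and the loops are written sequentially even though in nature every level runs concurrently.

```python
def life_cycle(population: list[list[list[str]]]) -> int:
    """Walk the three nested levels and count protein-level events.

    population -> organisms -> cells -> expressed proteins.
    """
    events = 0
    for organism in population:      # evolutionary scale
        for cell in organism:        # multicellular scale
            for _protein in cell:    # single-cell scale
                events += 1          # stand-in for one expression event
    return events


# Two hypothetical organisms: one with two cells, one with a single cell.
population = [[["P1", "P2"], ["P1"]], [["P3"]]]
print(life_cycle(population))  # 4
```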
Much work remains in biology before we will
understand these and related issues. Not all the
information we need is digitally
encoded. Consider a small robot charged with
reconstructing a functional cellular environment.
The task includes the structure of various organelles,
the quantities of different proteins, their localization
in the cell, and the constructs they create. How much
information will the robot need to perform its task?
Merely understanding the information encoded in
the DNA might be the easy part.
References
1. J.P. Fitch and B. Sokhansanj, “Genomic Engineering:
Moving Beyond DNA Sequence to Function,” Proc.
IEEE 88, Dec. 2000, pp. 1949-1971.
2. I. Wilmut et al., “Viable Offspring Derived from Fetal
and Adult Mammalian Cells,” Nature, Feb. 1997,
pp. 810-813.
3. G.L.G. Miklos and G.M. Rubin, “The Role of the
Genome Project in Determining Gene Function:
Insights from Model Organisms,” Cell, Aug. 1996,
pp. 521-529.
4. J. Maynard Smith, “Models of a Dual Inheritance
System,” J. Theoretical Biology, vol. 143, 1990, pp.
41-53.
5. E. Jablonka, M. Lachmann, and M.J. Lamb, “Evidence, Mechanisms, and Models for the Inheritance
of Acquired Characters,” J. Theoretical Biology, vol.
158, 1992, pp. 245-268.
6. A. Razin and H. Cedar, “DNA Methylation and
Genomic Imprinting,” Cell, vol. 77, 1994, pp. 473-476.
7. Y.O. Chernoff, “Mutation Processes at the Protein
Level: Is Lamarck Back?” Mutation Research, vol.
488, no. 1, 2001, pp. 39-64.
8. T.R. Serio and S.L. Lindquist, “Protein-Only Inheritance in Yeast: Something to Get [PSI+]-ched about,”
Trends Cell Biology, Mar. 2000, pp. 98-105.
9. D.L. Nanney, “Cortical Patterns in Cellular Morphogenesis,” Science, May 1968, pp. 469-502.
10. J. Beisson and T.M. Sonneborn, “Cytoplasmatic
Inheritance of the Organization of the Cell Cortex in
Paramecium Aurelia,” Proc. Nat’l Academy of Science, vol. 53, 1965, pp. 275-282.
11. M.P. Yaffe, “The Machinery of Mitochondrial Inheritance and Behavior,” Science, Mar. 1999, pp. 1493-1497.
12. M.G. Roth, “Inheriting the Golgi,” Cell, Dec. 1999,
pp. 559-562.
Dror G. Feitelson is a lecturer in computer science
in the School of Computer Science and Engineering
at the Hebrew University, Jerusalem. His research
interests focus on system software for parallel processing, including job scheduling, workload modeling, and parallel I/O. Feitelson received a PhD in
computer science from Hebrew University. He is a
member of the IEEE Computer Society and the
ACM. Contact him at [email protected].
Millet Treinin is a lecturer in the Department of
Physiology at Hebrew University Hadassah Medical School, Jerusalem. Her research interests
include molecular mechanisms of nerve cell activity. Treinin received a PhD in biology from Hebrew
University. Contact her at [email protected].