DNA Computing: A Research Snapshot

Lila Kari
A research snapshot
• Adleman’s 20-variable 3-SAT experiment
• DNA Benenson automata
• DNA memory
• Towards a programmable DNA computer
• DNA nanoscale shapes
• DNA nanomachines
• Impact on theoretical computer science
(1) Adleman’s 20-variable 3-SAT
[Braich et al., Science, 2002]
• The first experiment that demonstrated that
DNA Computing devices can exceed the
computational power of an unaided human
• The answer to the problem was found after an
exhaustive search of more than 1 million
possible solution candidates
Input to 3-SAT and solution
Algorithm for 3SAT
• Input: A Boolean formula in 3CNF
• Step 1: Generate the set of all possible truth
value assignments
• Step 2: Remove the set of all truth value
assignments that make the first clause false
• Step 3: Repeat Step 2 for all clauses of the
input formula
• Output: The remaining (if any) truth value
assignments
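The generate-and-filter algorithm above can be sketched in software as follows. This is a minimal illustration, not the laboratory protocol: the function name, the signed-integer clause format, and the small 3-variable formula are all assumptions made for the example.

```python
from itertools import product


def solve_3sat_by_filtering(num_vars, clauses):
    """Return every truth assignment that survives clause-by-clause filtering.

    A clause is a tuple of non-zero ints: +k stands for x_k, -k for NOT x_k
    (an encoding assumed for this sketch).
    """
    # Step 1: generate all 2**num_vars candidate assignments.
    candidates = list(product([False, True], repeat=num_vars))
    # Steps 2-3: for each clause in turn, drop the candidates that falsify it.
    for clause in clauses:
        candidates = [
            a for a in candidates
            if any(a[abs(lit) - 1] == (lit > 0) for lit in clause)
        ]
    # Output: whatever assignments remain (possibly none).
    return candidates


# Hypothetical formula: each clause rules out exactly one assignment,
# leaving only x1 = x2 = x3 = True.
example = [
    (1, 2, 3), (-1, 2, 3), (1, -2, 3), (1, 2, -3),
    (-1, -2, 3), (-1, 2, -3), (1, -2, -3),
]
survivors = solve_3sat_by_filtering(3, example)
```

For 20 variables the candidate list has 2^20 entries, which is the exhaustive search the DNA experiment performs in parallel.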
Encoding the input
• Every variable xk, k =1,..., 20, was associated
with two distinct 15-mer DNA single strands
called ‘value sequences’, one representing
true, and one representing false
Library of candidates
• Each of the possible 2^20 truth assignments
was represented by a 300-mer ‘library strand’
consisting of the ordered catenation of one
15-mer value sequence for each variable, i.e.,
W1 W2 ..... W20, where Wi is Xi^T or Xi^F
• To obtain these library strands, the 40
individual 15-mer sequences were assembled
using a mix-and-match combinatorial
technique
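As a rough sketch of the encoding, the snippet below builds one 300-mer library strand from 40 value sequences. The 15-mers here are random placeholders generated for illustration; the sequences used in the experiment were designed, and they are not reproduced in this text. All function names are assumptions.

```python
import random

BASES = "ACGT"


def make_value_sequences(num_vars, length=15, seed=0):
    """Placeholder value sequences: one 15-mer per (variable, truth value)."""
    rng = random.Random(seed)
    return {
        (k, value): "".join(rng.choice(BASES) for _ in range(length))
        for k in range(1, num_vars + 1)
        for value in (True, False)
    }


def library_strand(assignment, value_seqs):
    """Catenate W1 W2 ... Wn, where Wk is the value sequence chosen for x_k."""
    return "".join(
        value_seqs[(k, value)] for k, value in enumerate(assignment, start=1)
    )


value_seqs = make_value_sequences(20)
strand = library_strand([True] * 20, value_seqs)
library_size = 2 ** 20  # number of distinct library strands
```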
3SAT wetware
• A glass ‘library module’ filled with a gel
containing the library
• One glass ‘clause module’ for each of the 24
clauses of the formula
• Each clause module was filled with gel
containing probes, i.e., 15-mer strands
Watson-Crick complementary to the value
sequences of the truth assignments that make
that particular clause true
Bioalgorithm for 3SAT
• The strands are moved between modules by gel
electrophoresis
• The library passes through the first clause module,
wherein library strands containing any of the 3 value
sequences that satisfy the first clause are immobilized,
while library strands that do not satisfy it go into a
buffer reservoir
• The captured strands are released by raising the
temperature, and used as input to the 2nd clause
module, etc.
• At the end, only the strands representing the truth
assignment satisfying all 24 clauses remain
Output to 3SAT
• The output was PCR amplified with primer pairs
corresponding to all 4 possible true-false
combinations of assignments for the first and last
variable, x1 and x20
• None except the primer pair (X1F, WK(X20F))
showed any bands, indicating two truth values of
the satisfying assignment, x1 = F and x20 =F
• The process was repeated for all variable pairs
(x1, xk), k = 2,..., 19
(2) DNA Benenson Automata
[Benenson et al., Nature 2001]
Construct a simple two-state automaton
over a two-letter alphabet, using
double-stranded DNA molecules and
restriction enzymes
Main engine of Benenson automata
• FokI enzyme: an unusual restriction enzyme that
recognizes a sequence and cuts nonspecifically a short
distance away
• Recognition site
5’-GGATG-3’
3’-CCTAC-5’
• Cleaves 9 bp away on the top strand and
13 bp away on the bottom strand
Encoding the input
• Encoding of the symbol a
• Encoding of the symbol b
• Encoding of the terminator t
Example of encoding the input
• The input strand ab is encoded as a DNA
strand that contains the site for FokI, followed
by the catenation of the encodings for abt
Encoding state/symbol pairs
The pair S0a is encoded as 5’-GGCT-3’
(the 4-mer suffix of a)
The pair S1a is encoded as 5’-CTGG-3’
(the 4-mer prefix of a)
Meaning: If the 4-mer suffix of the encoded
symbol is detected then the symbol is interpreted
as being read in state S0
If the 4-mer prefix is detected, then the symbol is
interpreted as being read in state S1
Output detection molecules
• S0-D is a 161-mer DNA double strand with an
overhang 3’-AGCG-5’ which ‘detects’ the last
state of the computation as being S0
• S1-D is a 251-mer DNA double strand with
overhang 3’-ACAG-5’ which detects the last
state of the computation as being S1
8 possible transition molecules
Each transition molecule has a 4-mer overhang,
for example T1 has 3’-CCGA-5’, that can
selectively bind to the DNA encoding the
current state/symbol pair, in this case S0a
Example computation on input ab
Computation on input ab
• FokI enzyme cuts the input encoding abt exposing
the sticky end 5’-GGCT-3’, i.e., S0a
• The transition molecule T1: S0a -> S0 detects
this state/symbol by binding and forming a
double-stranded molecule (using ligase)
Note: The transition molecule T1, incorporated in
the current molecule, contains a FokI restriction
site. Moreover, the 3bp spacer after the site
ensures that the next cleaving will expose a suffix
of the next symbol, which will be correctly
interpreted as S0b
Computation on input ab, contd.
• The overhang is now 5’-CAGC-3’ , i.e., S0b
• The sticky-end fits the transition rule
T4: S0b -> S1
The combination of the current strand with T4
and ligase leads to another double strand
A last use of FokI exposes the overhang
5’-TGTC-3’, which is a prefix of the terminator,
interpreted as S1t
Outcome of computation on input ab
• The overhang is complementary to the sticky end 3’-ACAG-5’ of the detector molecule S1-D,
corresponding to the last state of the
computation being S1.
• The state S1 is not final, and thus the outcome
of the computation is that the input ab is not
accepted by the automaton
• Note that any two-state two-symbol
automaton can be built using this method
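The run on input ab can be traced with a short simulation in which a state is represented by which 4-mer window of the next symbol is exposed (suffix for S0, prefix for S1). This is a sketch under stated assumptions: the 6-mers for a and t are inferred from the prefixes, suffixes and detector overhangs quoted above, the 6-mer for b is assumed (only its suffix CAGC appears in the text), and the two transitions out of S1 are hypothetical; only S0a -> S0 (T1) and S0b -> S1 (T4) come from the slides.

```python
# Symbol encodings. "a" and "t" follow from the overhangs quoted in the text;
# the full 6-mer for "b" is an ASSUMPTION consistent with its suffix CAGC.
SYMBOLS = {"a": "CTGGCT", "b": "CGCAGC", "t": "TGTCGC"}

# S0 reads the 4-mer suffix of a symbol, S1 reads the 4-mer prefix.
OFFSET = {"S0": 2, "S1": 0}

# T1 and T4 are from the text; the two S1 entries are HYPOTHETICAL.
TRANSITIONS = {
    ("S0", "a"): "S0",  # T1
    ("S0", "b"): "S1",  # T4
    ("S1", "a"): "S1",  # hypothetical
    ("S1", "b"): "S0",  # hypothetical
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def overhang(state, symbol):
    """5'->3' sticky end exposed when `symbol` is read in `state`."""
    start = OFFSET[state]
    return SYMBOLS[symbol][start:start + 4]


def run(word, transitions=TRANSITIONS, start="S0"):
    """Return (final state, overhangs seen per symbol, terminator overhang)."""
    state, trace = start, []
    for symbol in word:
        trace.append(overhang(state, symbol))
        state = transitions[(state, symbol)]
    return state, trace, overhang(state, "t")


final_state, trace, last_overhang = run("ab")
# Overhang of the matching detector molecule, written 3'->5'.
detector = last_overhang.translate(COMPLEMENT)
```

With these encodings the trace reproduces the overhangs GGCT and CAGC from the walkthrough, and the final overhang pairs with the S1-D detector.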
Application of Benenson automata
• Medical diagnosis and treatment: smart drugs
[Benenson et al., Nature 2004]
• Automaton to identify and analyze the mRNA
of disease-related genes associated with lung
and prostate cancer, and produce a single-stranded DNA molecule modelled after an
anti-cancer drug
(3) DNA Memory
• Information-encoding density
• [Reif et al., DNA7, 2002] DNA has the
potential of storing information on the order of
10^12 times more compactly than conventional
storage technologies
• [Baum, Science, 1995]: content-addressable
DNA memory vastly larger than the brain
Nested Primer Molecular Memory
[ Yamamoto et al., 2008]
• NPMM = pool of strands wherein each strand
codes both data information and address
information
[CLi, BLj, ALk, DATA, ARq, BRr, CRs]
Here i, j, k, q, r, s are between 0 and 15 and
each component, e.g., CL0, represents a
20-mer DNA sequence
How to retrieve data
• Use nested PCR consisting of 3 steps
• First PCR uses primer pair (CLi, WK(CRs)), where
WK(s) is the Watson-Crick complement of s
• This results in amplification of all molecules
starting with CLi and ending in CRs
• Second PCR uses primer pair (BLj, WK(BRr))
• Third PCR uses primer pair (ALk, WK(ARq))
• Sequencing will result in retrieval of the DATA
Advantages of NPMM memory
• Enormous address space: 16.8 million
addresses
• High specificity
• Proper selection of DNA sequences avoids
mutation during PCR
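The address arithmetic and the three nested selection steps can be sketched as follows. Strands are modelled abstractly as tuples of indices rather than DNA sequences, and the pool contents are hypothetical; the sketch only illustrates how each PCR round narrows the pool from the outermost address blocks inward.

```python
# Six address blocks (CL, BL, AL, AR, BR, CR), each with 16 variants.
ADDRESS_SPACE = 16 ** 6


def nested_pcr(pool, address):
    """Select DATA by three successive primer pairs, outermost first.

    Each strand is modelled as a tuple (i, j, k, data, q, r, s),
    mirroring [CLi, BLj, ALk, DATA, ARq, BRr, CRs].
    """
    i, j, k, q, r, s = address
    step1 = [m for m in pool if m[0] == i and m[6] == s]   # (CLi, WK(CRs))
    step2 = [m for m in step1 if m[1] == j and m[5] == r]  # (BLj, WK(BRr))
    step3 = [m for m in step2 if m[2] == k and m[4] == q]  # (ALk, WK(ARq))
    return [m[3] for m in step3]


# Hypothetical pool of three memory strands.
pool = [
    (0, 1, 2, "DATA-A", 3, 4, 5),
    (0, 1, 7, "DATA-B", 3, 4, 5),
    (9, 1, 2, "DATA-C", 3, 4, 5),
]
retrieved = nested_pcr(pool, (0, 1, 2, 3, 4, 5))
```

The product 16^6 = 16,777,216 is the "16.8 million addresses" figure quoted above.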
Organic DNA memory
• [Wong, Wong, Foote, Comm. ACM, 2003]
• [Yachie et al., Biotechnology Progress, 2007]
• Memory technology using living organisms
• First paper proposes a candidate for a living
host for DNA memory sequences that
tolerates the addition of artificial gene
sequences and survives extreme
environmental conditions
Organic memory
• Use Escherichia coli, and Deinococcus
radiodurans (can survive extreme conditions
including cold, dehydration, vacuum, acid and
radiation)
• Information encoding stage: an encoding
scheme was chosen that assigned 3-mer
sequences to various symbols. For example:
AAA = “0”, AAC = “1”, AGG = “A”
Information encoding
• Each of the encoding 3-mers contained only 3
of the 4 DNA nucleotides
• Using this encoding, any English text could be
codified as a DNA sequence
• The text chosen for this experiment was
“And the oceans are wide”
• Several additional sequences were chosen to
act as sentinels and tag the beginning and end
of messages
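A minimal sketch of triplet encoding and decoding is given below. The codebook contains only the three assignments quoted in the slides (AAA, AAC, AGG); the full table from the paper is not reproduced here, so this example cannot encode arbitrary text. The function names are assumptions.

```python
# Only the three symbol assignments quoted in the text; the rest of the
# encoding table is not given here and is deliberately left out.
CODEBOOK = {"0": "AAA", "1": "AAC", "A": "AGG"}
DECODEBOOK = {triplet: symbol for symbol, triplet in CODEBOOK.items()}


def encode(text):
    """Map each symbol to its 3-mer and catenate the results."""
    return "".join(CODEBOOK[ch] for ch in text)


def decode(dna):
    """Read the strand three nucleotides at a time and map back to symbols."""
    return "".join(DECODEBOOK[dna[i:i + 3]] for i in range(0, len(dna), 3))


encoded = encode("A10")
decoded = decode(encoded)
```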
Choosing sentinel sequences
• Identify a set of twenty-five 20-mer sequences
that do not exist in either genome, yet satisfy
all the genomic constraints and restrictions
• All sequences contained multiple stop codons
TAA, TGA, TAG as subsequences to prevent
the memory strands from being misinterpreted
and translated into artificial proteins that
could kill the bacteria
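The sentinel criteria listed above can be approximated by a simple filter. This sketch checks only length, the presence of several stop codons, and absence from the host genomes; the further "genomic constraints and restrictions" are not specified in the text and are not modelled. The threshold of two stop codons and the toy genome string are assumptions.

```python
STOP_CODONS = ("TAA", "TGA", "TAG")


def count_stop_codons(seq):
    """Count stop-codon occurrences at every position (all reading frames)."""
    return sum(1 for i in range(len(seq) - 2) if seq[i:i + 3] in STOP_CODONS)


def is_candidate_sentinel(seq, genomes, length=20, min_stops=2):
    """Check length, stop-codon content, and absence from every genome.

    min_stops=2 is an assumed reading of "multiple stop codons".
    """
    return (
        len(seq) == length
        and count_stop_codons(seq) >= min_stops
        and all(seq not in genome for genome in genomes)
    )


# Hypothetical candidate and a toy stand-in for a host genome.
candidate = "TAATGATAGCCGGCCGGCCA"
toy_genome = "ACGT" * 50
ok = is_candidate_sentinel(candidate, [toy_genome])
```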
Inserting the message
• A 46bp DNA sequence was created, consisting
of two different 20bp sentinels, connected by
a 6bp recognition site of an enzyme
• The embedded DNA was then inserted into
cloning vectors, and transferred into E. coli,
allowing the vector to multiply
• The vector and encoded DNA were then
incorporated into the genome of Deinococcus
for permanent storage and retrieval
Organic DNA memory
Advantages of organic memory
• Message can be retrieved using prior
knowledge of sequences at both borders, by
PCR, read-out and decoding
• 1 ml of liquid can contain up to 10^9 bacteria
• Potential disadvantages are random mutations
but these are unlikely given the natural
cellular mechanisms for detecting and
correcting errors.
(4) Towards a programmable DNA
computer
• [Sakamoto et al., Science, 2000]
• [Hagiya et al., DNA3, 1997]
• A self-acting DNA molecule containing, on the
same strand, the input, the program, and the
working memory
• Whiplash PCR
Whiplash PCR
• The 5’ end of the DNA single strand contains
state transitions A -> B, encoded as DNA rule
blocks
WK(B) – WK(A) – stopper sequence
• The 3’ end of the strand contains the
encoding of “current state”, say A
Whiplash PCR transition A -> B
• Step (i): Cooling the solution will lead the 3’ end
of the DNA strand, A, to attach to its
corresponding rule block, namely WK(A)
• Step (ii): PCR is used to extend the now-attached
end A by the encoding of the new state B, and
the process is stopped by the stopper sequence
• Step (iii): By raising the temperature, the new
current state B is detached, and the new
transition cycle can begin
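The three-step cycle can be mimicked symbolically, with states as labels rather than sequences. This is a simplified sketch: the rule table, function name and cycle limit are assumptions, and the thermodynamics of annealing and polymerase extension are not modelled.

```python
def whiplash_pcr(rules, start, max_cycles=10):
    """Simulate whiplash PCR on symbolic states.

    `rules` maps a state to its successor, standing in for the rule blocks
    WK(B) - WK(A) - stopper on the 5' end of the strand.
    Returns the list of states written at the 3' end; the last is current.
    """
    tail = [start]
    for _ in range(max_cycles):
        current = tail[-1]
        if current not in rules:
            break  # no rule block WK(current) for the 3' end to anneal to
        # Step (i): cooling lets the 3' end anneal to WK(current).
        # Step (ii): extension copies the new state and halts at the stopper.
        tail.append(rules[current])
        # Step (iii): heating detaches the 3' end, exposing the new state.
    return tail


history = whiplash_pcr({"A": "B", "B": "C"}, "A")
```

A cyclic rule set never halts on its own, which is why the sketch caps the number of thermal cycles.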
Whiplash PCR
(5) DNA nanoscale shapes
[Rothemund, Nature, 2006]
• ‘Scaffolded DNA origami’ for fabrication of
arbitrary 2D shapes about 100 nm in diameter
• Technique: DNA strands form complex
structures by their design, which makes it
possible for some single DNA strands to
participate in two double helices – they wind
along one helix, then switch to another
DNA origami design process
• (1) Build an approximate geometric model of
the desired shape; the shape is approximated
by cylinders that are models of DNA double
helices
• (2) Fill the shape by folding a single long
‘scaffold strand’ back and forth in a raster
pattern such that at each moment the scaffold
strand represents either the main strand or
the complement strand of a double helix
DNA origami design process
• (3) Use a computer program to generate a set
of ‘staple strands’ that provide Watson-Crick
complements of the scaffold
• The staple strands are designed to bind to
portions of the scaffold strand, thus holding it
together in the desired shape
• The staple strands are fine-tuned to minimize
strain and optimize binding specificity and
binding energy
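The complementarity between staples and scaffold can be illustrated with a deliberately naive sketch. It cuts the scaffold into consecutive windows and returns the Watson-Crick complement of each; real staple strands bridge several helices through crossovers, which this example does not attempt to model. The window length of 32 and the toy scaffold are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def wk(strand):
    """Watson-Crick complement, read 5'->3' (reverse complement)."""
    return strand.translate(COMPLEMENT)[::-1]


def naive_staples(scaffold, staple_len=32):
    """One staple per consecutive scaffold window (no crossovers modelled)."""
    return [
        wk(scaffold[i:i + staple_len])
        for i in range(0, len(scaffold), staple_len)
    ]


# Toy 64-nt scaffold; the M13mp18 sequence is not reproduced here.
toy_scaffold = "ACGGTCAT" * 8
staples = naive_staples(toy_scaffold)
```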
Testing DNA origami
• Scaffold = circular genomic DNA, 7,249 nt
long, from the virus M13mp18
• Use 250 short staple strands and mix with the
scaffold, in 10-fold excess to it
• The strands annealed in less than two hours
and AFM (Atomic Force Microscopy) imaging
showed that the desired shape was realized
• Results: Assembly of squares, triangles, five-pointed stars, smiley faces
DNA origami
(6) DNA nanomachines
• Dynamic DNA structures with potential uses in
nanofabrication, engineering and computation
• DNA-based nanodevices can convert static
DNA structures into machines that can move
or change conformation
• Examples: tweezers, walkers that can be
moved along a track, autonomous molecular
motors
Molecular tweezers
[Yurke et al., Nature, 2000]
• Two partially double-stranded DNA arms
connected by a short single DNA strand acting
as a flexible hinge
• The resulting structure has the shape of a
pair of open tweezers
• A ‘set strand’ is designed in such a way as to
be complementary to both single-stranded
‘tails’ at the end of the arms
Molecular tweezers
• Adding the ‘set strand’ results in its annealing to
both tails of the arms, thus bringing the arms of
the tweezers together in a ‘closed’ configuration
• A short region of the set strand remains
single-stranded and is used as a toehold that allows a
new ‘reset strand’ to strip the set strand from
the arms by itself hybridizing with the set strand;
the tweezers are thereby returned to the ‘open’
configuration
Molecular tweezers
Molecular walker
[Shin, Pierce, JACS, 2004]
• DNA device with two distinguishable feet that
walks directionally on a linear DNA track with
single strands periodically protruding from it
and acting as anchors
• The walker is double-stranded and has two
single-stranded extensions acting as ‘legs’
• Specific attachments bind the legs to the
single-stranded anchors placed periodically
along the double-stranded track
Molecular walker step
• A step requires the sequential addition of two
strands: the first lifts the back foot from the
track by strand displacement, a process by
which an invading DNA single strand can
displace one of the constituent strands of a
double strand by replacing it with itself,
provided the new structure is more stable
• The second strand places the released foot
ahead of the stationary foot
Molecular walker
• Molecular walker step
Other molecular walkers
• [Sherman, Seeman, Nano Letters, 2004] – walking
devices based on the movement pattern of inchworms:
the front foot steps forward and the back foot
catches up
• [Sekiguchi et al., DNA13, 2008] Autonomous
three-legged walker (no need for fuel strands)
that can walk autonomously in 2D or 3D on a
designed route. It uses an enzyme as a source of
power and a track of DNA equipped with many
DNA anchors arranged in a specific pattern
(7) DNA Computing: Impact on
Theoretical Computer Science
• The genetic code
• Splicing systems
• Optimal encodings for DNA Computing
• Sticker systems
• Watson-Crick automata
• Combinatorics on DNA words
• Cellular computing
• DNA computation by self-assembly
1953: Watson and Crick discover DNA
structure
The RNA Tie Club
• 1954 “Solve the riddle of the RNA structure and to understand
how it builds proteins” (clockwise from upper left: Francis
Crick, L. Orgel, James Watson, Al. Rich)
• There are 20 amino acids that build up proteins
The Diamond Code
• G. Gamow: double-stranded DNA acts as a template for
protein synthesis: various combinations of bases could form
distinctively shaped cavities into which the side chains of
amino acids might fit
Comma-Free Codes
(the prettiest wrong idea in 20th-century science)
• The RNA piglet model
The prettiest wrong idea in all of 20th
century science
• Suckling-pig model of protein synthesis
• Construct a code in which, when two sense
codons (triplets) are catenated, the triplets that
overlap both codons are nonsense codons
• If CGU and AAG are sense codons, then GUA
and UAA must be nonsense because they
appear in CGUAAG
Comma-free codes (Crick 1957)
• How many words can a comma-free code
include?
• For n=4 and k=3 the size of a maximal comma-free code is the magic number 20
• For an alphabet of n letters grouped into k-letter words, if k is prime, a
comma-free code contains at most (n^k – n)/k words
• For n=4 and k=3 this bound equals 20, and there are 408 maximal comma-free codes of that size
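The comma-free condition and the counting bound can be checked directly. The sketch below tests whether any codeword appears out of frame inside a catenation of two codewords, and evaluates (n^k – n)/k for n=4, k=3. Function names are assumptions; the example codons CGU and AAG are the ones used on the slide.

```python
from itertools import product


def is_comma_free(code):
    """True if no codeword occurs out of frame in any catenation xy."""
    code = set(code)
    k = len(next(iter(code)))
    for x, y in product(code, repeat=2):
        xy = x + y
        # Windows starting at 1..k-1 straddle the boundary between x and y.
        if any(xy[i:i + k] in code for i in range(1, k)):
            return False
    return True


def comma_free_bound(n, k):
    """Upper bound (n**k - n) / k on the code size, for prime k."""
    return (n ** k - n) // k


pair_ok = is_comma_free({"CGU", "AAG"})
repeat_ok = is_comma_free({"AAA"})
bound = comma_free_bound(4, 3)
```

A codon such as AAA can never belong to a comma-free code, because AAAAAA contains AAA out of frame; removing the 4 such codons from the 64 triplets and dividing by 3 gives the bound of 20.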
Reality Intrudes
• News from the lab bench:
[Nirenberg,Matthaei ’61] synthesize RNA,
namely poly-U, coding for phenylalanine
• By 1965 the genetic code was solved
• The code resembled none of the theoretical
notions
• The “extra” codons are merely redundant
The Genetic Code
Splicing Systems (Head 1987)
5’ CCCCCTCGACCCCC 3’
3’GGGGGAGCTGGGGG5’ +
5’AAAAAGCGCAAAAA 3’
3’ TTTTTCGCGTTTTT 5’ +
Enzyme 1
5’TCGA3’
3’AGCT5’
+
Enzyme 2
5’GCGC3’
3’CGCG5’
Splicing Systems
5’ CCCCCT CGACCCCC 3’
3’GGGGGAGC TGGGGG5’ +
5’AAAAAG CGCAAAAA 3’
3’ TTTTTCGC GTTTTT 5’
DNA strands with compatible sticky ends
recombine to produce two new strands
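On the top strands alone, the recombination shown above can be written as a string operation. In this sketch a splicing rule is a 4-tuple (u1, u2, u3, u4): the first word is cut between u1 and u2, the second between u3 and u4, and the pieces are cross-joined. The tuple representation and the function name are assumptions; the cut positions are read off the two enzyme sites TCGA and GCGC in the diagram.

```python
def splice(x, z, rule):
    """Apply a splicing rule (u1, u2, u3, u4) to words x and z.

    x = x1 u1 u2 x2 and z = z1 u3 u4 z2 yield
    x1 u1 u4 z2 and z1 u3 u2 x2. Returns None if a site is missing.
    """
    u1, u2, u3, u4 = rule
    i = x.find(u1 + u2)
    j = z.find(u3 + u4)
    if i < 0 or j < 0:
        return None
    cut_x = i + len(u1)  # cut x just after u1
    cut_z = j + len(u3)  # cut z just after u3
    return x[:cut_x] + z[cut_z:], z[:cut_z] + x[cut_x:]


# Top strands and cut sites from the diagram: T|CGA and G|CGC.
result = splice("CCCCCTCGACCCCC", "AAAAAGCGCAAAAA", ("T", "CGA", "G", "CGC"))
```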
Splicing operation
Splicing system sample results
Theorem (Paun’95; Freund,Kari,Paun’99)
Every type-0 language can be generated by a splicing
system with finitely many axioms and finitely many
rules.
Theorem (Freund,Kari,Paun ’99)
For every given alphabet T there exists a splicing
system, with finitely many axioms and finitely many
rules, that is universal for the class of systems with
terminal alphabet T.
Encoding Information for
DNA Computing
• DNA strands should form desired bonds
• DNA strands should be free of undesirable
intra-molecular bonds
• DNA strands should be free of undesirable
inter-molecular bonds
Intramolecular Bonds
[Figure: a single DNA strand folded back on itself, with complementary regions paired]
Intra- and inter-molecular bonds
DNA-complementarity model
(Kari,Kitto,Thierrin’02)
[Figure: panels (a)-(d) showing intra- and inter-molecular bonds between DNA strands]
Bond-free languages
Bonds between DNA strands
Sample Results
(Hussini/Kari/Konstantinidis/Losseva/Sosik ‘03)
Sticker Systems
(Freund,Paun,Rozenberg,Salomaa’98,
Kari,Paun,Rozenberg,Salomaa,Yu’98, Hoogeboom,van Vugt’00,
Kuske,Weigel’04,
Paun,Rozenberg ‘98)
Given a complementarity relation, define an
alphabet of double-stranded columns
Sticking operation
Complex Sticker Systems
• Sakakibara,Kobayashi ‘01: Sticker systems based on hairpins
• Alhazov,Cavaliere ’05: Observable sticker systems
Watson-Crick Automata
(Freund,Paun,Rozenberg,Salomaa’99; Paun,Rozenberg’98;
Martin-Vide,Paun,Rozenberg,Salomaa’98; Czeizler,Czeizler’06;
Paun,Paun’99; Czeizler,Czeizler,Kari,Salomaa’08)
Combinatorics on DNA Words
• IDEA: Consider the word w and its WK-complement, WK(w), as equivalent
• The word ACTG CAGT CAGT can be considered
repetitive (periodic) because it can be written
as ACTG WK(ACTG)^2
• Generalize classical notions such as power of a
word, border, primitive word, palindrome,
conjugacy, commutativity
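The periodicity example can be verified mechanically. The sketch below treats a word as a generalized power of a root if every block equals the root or its Watson-Crick complement; the function names are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def wk(word):
    """Watson-Crick complement: complement each letter, then reverse."""
    return word.translate(COMPLEMENT)[::-1]


def is_wk_power(word, root):
    """True if word is a catenation of blocks, each root or WK(root)."""
    n = len(root)
    if n == 0 or len(word) % n != 0:
        return False
    allowed = (root, wk(root))
    return all(word[i:i + n] in allowed for i in range(0, len(word), n))


periodic = is_wk_power("ACTGCAGTCAGT", "ACTG")
```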
Identity => Antimorphic involution f
Pseudo-palindrome (de Luca,De Luca’06,
Kari,Mahalingam’09) u = f(u)
Pseudo-commutativity(Kari,Mahalingam’08)
u v = f(v) u
Pseudo-bordered word (Kari,Mahalingam’07)
w = v x = y f(v)
Pseudoknot-bordered word (Kari,Seki’09)
w = u v x = y f(u) f(v)
Pseudo-conjugacy of u, v (Kari,Mahalingam’08)
u x = f(x) v
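Two of these definitions are easy to test when f is taken to be the Watson-Crick complement, which is one example of an antimorphic involution. In this sketch the choice f = WK and the requirement that the border v be a non-empty proper prefix are assumptions made for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def f(word):
    """Antimorphic involution used here: the Watson-Crick complement."""
    return word.translate(COMPLEMENT)[::-1]


def is_pseudo_palindrome(u):
    """u = f(u)."""
    return u == f(u)


def is_pseudo_bordered(w):
    """w = v x = y f(v) for some non-empty proper prefix v of w."""
    return any(w.endswith(f(w[:n])) for n in range(1, len(w)))


palindrome = is_pseudo_palindrome("ACGT")
bordered = is_pseudo_bordered("ACAGT")  # v = AC, f(v) = GT
```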
Fine and Wilf Theorem
Extended Fine and Wilf Theorem
Extended Fine and Wilf Theorem
Lyndon-Schutzenberger Equation
Extended Lyndon-Schutzenberger
Extended Lyndon-Schutzenberger
DNA Computing: A research snapshot
• Adleman’s 20-variable 3-SAT experiment
• DNA Benenson automata
• DNA memory
• Towards a programmable DNA computer
• DNA nanoscale shapes
• DNA nanomachines
• Impact on theoretical computer science
Our Challenge
• Discover a new, broader notion of
computation
• Understand the world around us in terms of
information processing
• “Biology and Computer Science –
life and computation – are related.
I am confident that at their interface great
discoveries await those who seek them.”
(Adleman’98)