Series of Selected Papers from Chun-Tsung Scholars, Peking University (2003)
Genetic Algorithm in DNA Computing: A
Solution to the Maximal Clique Problem
Department of Physics, entering class of 2000: Li Yuan, Fang Chen
Advisor: Ouyang Qi
Abstract
The genetic algorithm is one possible way to overcome the limits of the brute-force
method in DNA computing. Using the idea of Darwinian evolution, we introduce a
genetic DNA computing algorithm to solve the maximal clique problem. All the
operations in the algorithm are accessible with today’s molecular biotechnology. Our
computer simulations show that with this algorithm it is possible to obtain a solution
from a very small initial data pool, without enumerating all candidate solutions. For
randomly generated problems, the genetic algorithm gives the correct solution within
a few cycles with high probability. Although a DNA computer is currently slow
compared with silicon computers, our simulations indicate that the time requirement
of this genetic algorithm (measured in evolution cycles) is approximately a linear
function of the number of vertices in the network. This may make DNA computers
more powerful in attacking some hard computational problems.
Keywords: DNA computer, genetic algorithm, NP-complete problem.
Introduction
In recent years, biomolecular computing has been widely documented in the
literature (see Ruben & Landweber, 2000[1] and references therein). Since Adleman’s
solution to the Hamiltonian path problem[2], DNA and RNA solutions to several other
famous NP-complete problems, such as the maximal clique problem[3], the 3-SAT
problem[4] and the knight problem[5], have been given. Important progress has also
been made in the technologies for computing with biomolecules[6-11]. The power
of parallel, high-density computation by molecules in solution allows DNA computers
to solve hard computational problems such as NP-complete problems in polynomially
increasing time, whereas a conventional Turing machine requires exponentially
increasing time[12]. However, all current DNA computing strategies are based on
enumerating all candidate solutions and then using selection processes to eliminate
unwanted DNA. This approach requires the size of the initial data pool to increase
exponentially with the number of variables in the calculation, so the capacity of
the DNA computer is limited. For example, to compute a DNA solution of a maximal
clique problem, the number of molecules in the solution must be at least $2^N$, where N
is the number of nodes. At μM-scale DNA concentration, a 35-node problem can be
solved in a typical test tube, while a 75-node problem requires a swimming pool!
Obviously, this is not feasible in practice. To break the barrier of this
brute-force method, we need to develop new algorithms for DNA computers.
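The test-tube versus swimming-pool comparison can be checked with a rough calculation. The sketch below is only illustrative: it assumes a DNA concentration of 1 μM and a single copy of each candidate strand, and the function name `pool_volume_liters` is hypothetical.

```python
AVOGADRO = 6.02214076e23  # molecules per mole

def pool_volume_liters(n_vertices, conc_molar=1e-6, copies=1):
    """Volume needed to hold `copies` of each of the 2**N candidate strands.

    conc_molar is an assumed concentration (1 micromolar by default).
    """
    molecules = copies * 2 ** n_vertices
    return molecules / (AVOGADRO * conc_molar)

for n in (35, 75):
    print(n, f"{pool_volume_liters(n):.2e} L")
```

Under these assumptions a 35-node library fits in a tiny fraction of a milliliter, while a 75-node library needs tens of thousands of liters. In practice many copies of each strand are required, which raises both figures.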
One strategy for overcoming the volume problem is to apply the idea of
Darwinian evolution. Most organisms evolve by means of two primary processes:
natural selection and sexual reproduction. The first determines which members of a
population survive to reproduce, and the second ensures mixing and recombination
among the genes of their offspring. It has been demonstrated that genetic algorithms
based on these basic processes can solve complex problems[13]. The algorithm
exploits the higher-payoff, or target, regions of the solution space, because successive
generations of reproduction and crossover produce increasing numbers of strings in
those regions. It favors the fittest strings as parents, so above-average strings
have more offspring in the next generation. We note that the parallelism of
molecular computation makes it extremely convenient to apply the genetic algorithm
(GA)[14] in DNA computing. In this paper, we present simulation results on using the
genetic algorithm to solve the maximal clique (MC) problem. Our results show that it
is possible to obtain a solution from a very small initial data pool, without enumerating
all candidate solutions. For randomly generated problems, the genetic algorithm gives
the correct solution within a few cycles with high probability. Although a DNA
computer is currently slow compared with silicon computers, our simulations indicate
that the time requirement of this genetic algorithm (measured in evolution cycles) is
approximately a linear function of the number of vertices in the network. This may
make DNA computers more powerful in attacking some hard computational problems.
Methods
Mathematically, a clique is defined as a set of vertices in which every vertex is
connected to every other vertex by an edge. The maximal clique problem asks: given
a network containing N vertices and M edges, how many vertices are in the largest
clique? Fig. 1a defines a problem with five vertices and five edges. The
vertices (2,3,4) form the largest clique, so the size of the largest clique in this network
is three. The maximal clique problem has been proven to be NP-complete.
Besides conventional algorithms on electronic and DNA computers, some
other interesting approaches have been attempted[15].
Fig. 1: The original graph describing a maximal clique problem of five vertices and five edges (a)
and its complementary graph (b).
The data structure of the computation is the same as in previous work[3]. For a
graph of N vertices, we use an N-digit binary number to represent each possible
clique. In the N-digit binary number, a bit set to 1 means the corresponding vertex
is in the clique, and a bit set to 0 means the corresponding vertex is not. For
example, the 5-digit binary number (01110) represents the largest clique (2,3,4) in Fig. 1a.
Solving the maximal clique problem then amounts to finding, among the N-digit binary
numbers that represent a clique (we call this requirement condition A), one that
contains the most 1s. For each graph of N vertices, we can build a
complementary graph, which contains all the edges missing from the original graph, as
shown in Fig. 1b. Any two vertices in a clique of the original graph are therefore not
connected in the complementary graph. Note that if the complementary
graph can be divided into several connected components, we can divide the problem into
several sub-problems and find each sub-problem’s maximal clique; the union of
these cliques is then the maximal clique of the whole problem. From this point of view, we
only need to solve problems whose complementary graph is connected.
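The binary encoding and the complementary graph can be illustrated with a short sketch. The edge list of Fig. 1a is not printed in the text; the one used here is inferred from the elimination patterns quoted for Fig. 1 later in this section and should be treated as an assumption, as should the helper names.

```python
from itertools import combinations

N = 5
# Assumed edges of Fig. 1a (1-based vertex labels), inferred from the text.
EDGES = {(1, 5), (2, 3), (2, 4), (2, 5), (3, 4)}

def complement_edges(n, edges):
    """All vertex pairs that are NOT joined by an edge in the original graph."""
    present = {tuple(sorted(e)) for e in edges}
    return [p for p in combinations(range(1, n + 1), 2) if p not in present]

def is_clique(bits, comp_edges):
    """bits[i-1] == '1' means vertex i is selected.

    The selection is a clique if it never contains both ends of a
    complementary edge.
    """
    return not any(bits[a - 1] == "1" and bits[b - 1] == "1"
                   for a, b in comp_edges)

COMP = complement_edges(N, EDGES)
print(COMP)
print(is_clique("01110", COMP), is_clique("11000", COMP))
```

Testing a candidate against the complementary edges, rather than the original ones, mirrors the way unwanted strands are removed: one check per missing edge.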
With this data structure, we design the following genetic algorithm to solve the
maximal clique problem in a DNA computer, taking special care that
every process in the algorithm is readily accessible with current
biotechnology.
We start with a small data pool containing many N-digit binary numbers. In the
initial data pool, every digit of every binary number is set to 0, meaning that no vertex
is in the clique. We then let the data pool evolve. In each cycle of evolution, we first
randomly replace some digits in the data pool by 1s with a certain probability, regardless
of whether the original digit is 0 or 1. We call this operation mutation. We then eliminate
from the data pool all numbers that do not represent a clique, that is, that violate
condition A. For example, for the problem shown in Fig. 1a and Fig. 1b, numbers of the
form (11xxx), (1xx1x), (1x1xx), (xx1x1) and (xxx11) are removed from the data pool,
where x stands for either 0 or 1. If the data pool is smaller after the elimination, we
enlarge it by making copies of the remaining binary numbers until the pool reaches its
initial size. Through these operations of mutation and elimination, the cliques in the
data pool grow as the number of evolution cycles increases.
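One evolution cycle as described above can be sketched in a few lines. This is a minimal illustration and not the authors' simulation code: the pool size of 200, the uniform random choice of which survivors to copy, and the fallback used when no number survives are all assumptions. Numbers are stored as integers, with vertex k mapped to bit k-1.

```python
import random

def mutate(word, n, rate, rng):
    """Set each of the n bits to 1 with probability `rate` (bits are never cleared)."""
    for i in range(n):
        if rng.random() < rate:
            word |= 1 << i
    return word

def is_clique(word, comp_edges):
    """A word survives if it never holds both ends of a complementary edge."""
    return not any((word >> a) & 1 and (word >> b) & 1 for a, b in comp_edges)

def evolve_cycle(pool, comp_edges, n, rate, rng):
    """One cycle: mutation, elimination, then copying back to the original size."""
    mutated = [mutate(w, n, rate, rng) for w in pool]
    survivors = [w for w in mutated if is_clique(w, comp_edges)]
    if not survivors:  # assumption: keep the previous pool if everything is eliminated
        return list(pool)
    copies = [rng.choice(survivors) for _ in range(len(pool) - len(survivors))]
    return survivors + copies

# Complementary edges of the Fig. 1 example, 0-based (vertex k -> bit k-1).
COMP = [(0, 1), (0, 2), (0, 3), (2, 4), (3, 4)]
rng = random.Random(0)
pool = [0] * 200  # every word starts as 00000
best_seen = 0
for _ in range(30):
    pool = evolve_cycle(pool, COMP, 5, 2 / 5, rng)
    best_seen = max(best_seen, max(bin(w).count("1") for w in pool))
print(best_seen)
```

Because mutation only ever sets bits, a number can leave a dead-end clique only by being eliminated and replaced by a copy of a survivor, which is why the copying step matters.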
To find out whether progress is being made in each cycle, after the elimination
operation we find the maximum number of 1s in a string. This number
represents the size of the current biggest clique. If this size tends to increase as the
number of cycles increases, we know that the maximal clique has not yet been found, so the
computation should go on. If it stops growing for many cycles, we regard this as a sign
that the evolution has reached its end; the computation then stops, and the size of
the last clique obtained is taken as the size of the maximal clique.
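The stopping rule can be sketched as a small helper that watches the largest clique size reported after each cycle. The function name and the use of a running best are assumptions; the default of 30 stalled cycles is the value quoted in the Results section.

```python
def stopping_cycle(best_sizes, patience=30):
    """Return (cycle, size) at which the run stops, or None if it never stalls.

    best_sizes: iterable giving the largest clique size found after each cycle.
    patience:   number of consecutive cycles without growth that ends the run.
    """
    best, stalled = -1, 0
    for cycle, size in enumerate(best_sizes, start=1):
        if size > best:
            best, stalled = size, 0   # progress: reset the stall counter
        else:
            stalled += 1
            if stalled >= patience:
                return cycle, best
    return None

print(stopping_cycle([1, 2, 2, 3, 3, 3, 3], patience=3))
```

A larger `patience` lowers the risk of stopping at a non-maximal clique, at the cost of extra cycles after the answer has already been found.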
All of the computing processes mentioned above can be readily mapped onto
corresponding biological operations using current biotechnologies. In the following,
we discuss the biotechnologies needed to carry out the
algorithm.
First, we need to encode the binary numbers into DNA strands, one strand
representing one binary number. We found the parallel overlap assembly
technique for building the data pool to be very helpful for our algorithm[16]. To encode
an N-digit binary number, we need N+1 DNA sequences representing the positions
and N sequences representing the values of the digits. The sequences representing the
positions are all of the same length, but the sequences representing different
values are of different lengths; for example, 1s are shorter than 0s. Note
that the sequences must be designed to be sufficiently distinct from one another[17].
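Since the two values are encoded by sequences of different lengths, the length of a strand reveals how many 1s it carries. The sketch below illustrates this; the specific lengths (20 bp per position sequence, 10 bp for value 0, 0 bp for value 1) are assumed for illustration only and are not taken from the text.

```python
POSITION_BP = 20               # assumed length of every position sequence
VALUE_BP = {"0": 10, "1": 0}   # assumed lengths; value 1 is shorter than value 0

def strand_length(bits):
    """Total length in base pairs of the strand encoding the binary number `bits`."""
    n = len(bits)
    return (n + 1) * POSITION_BP + sum(VALUE_BP[b] for b in bits)

def ones_from_length(length, n):
    """Invert strand_length: recover the number of 1s from a measured length."""
    base = (n + 1) * POSITION_BP + n * VALUE_BP["1"]
    zeros = (length - base) // (VALUE_BP["0"] - VALUE_BP["1"])
    return n - zeros

print(strand_length("01110"), strand_length("00000"))
print(ones_from_length(strand_length("01110"), 5))
```

With this choice the strands carrying the largest cliques are the shortest ones, so they can be identified by length.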
At least two techniques are available for carrying out the elimination operation.
In the first, we make the length of the value sequence for 1 zero, so that
wherever a 1 occurs in a binary number, the two position sequences next to that digit
are directly linked. If we design the position sequences so that a
restriction enzyme can cut at this junction, we can destroy the strands with value 1 at that
specific position. To break all the strands with 1 at both position a and position b,
we only need to divide the data pool into two parts, pool A and pool B, and cut the
strands with 1 at position a in pool A and those with 1 at position b in pool B. The
union of pool A and pool B will then contain no DNA strands with 1 at both positions,
but can still contain strands with 1 at only one of the two positions (Fig. 2a) [3]. In
the second technique, we make the length of the value sequence for 0 zero, so that
wherever a 0 occurs in a binary number, the two position sequences next to it are
joined. To eliminate the unqualified DNA strands, we only have to pick out the
qualified ones, using a polyacrylamide gel containing bound probes to capture the
strands with 0 at either of the two positions (Fig. 2b) [4].
Fig. 2: Different methods of eliminating strands from the data pool. (a) Using restriction enzymes;
(b) picking out qualified strands with probes in electrophoresis.
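The split-pool restriction step of Fig. 2a can be mimicked in software. This is a sketch under simplifying assumptions: the pool is shuffled and split into two equal halves, cutting is perfectly efficient, and strands are plain bit strings with 0-based positions. The function name `eliminate_pair` is hypothetical.

```python
import random

def eliminate_pair(pool, a, b, rng):
    """Remove every strand with 1 at both positions a and b (0-based indices).

    The pool is split into two tubes; strands with 1 at `a` are cut in tube A,
    strands with 1 at `b` are cut in tube B, and the tubes are recombined.
    """
    shuffled = list(pool)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    tube_a, tube_b = shuffled[:half], shuffled[half:]
    kept_a = [s for s in tube_a if s[a] != "1"]
    kept_b = [s for s in tube_b if s[b] != "1"]
    return kept_a + kept_b

rng = random.Random(1)
pool = ["11000", "10000", "01000", "00000"] * 50
after = eliminate_pair(pool, 0, 1, rng)
print(len(pool), len(after))
```

The sketch also shows the cost of this method: strands with a 1 at only one of the two positions are lost from one of the tubes, so the pool shrinks and has to be enlarged again afterwards.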
The mutation operation is somewhat more difficult to carry out. In our algorithm
the value sequences representing 0s and 1s are of different lengths, so we need to make
DNA sequences mutate to a different length. If the length of the value-0 sequence is
zero, we can use a restriction enzyme to cut each strand in two at the junction, add
DNA subsequences representing the corresponding value 1 to the
data pool, and then let the strands recombine. The value-1 sequence then has a chance
of being inserted at the junction between the two pieces. In our simulations the mutation rate
is no more than 5%, so this technique may be sufficient. In addition,
bubble PCR followed by ligation is another candidate for carrying out mutation. Cloning or
PCR can be used to carry out the enlarging operation. Because the value sequences of 1s
and 0s are of different lengths, electrophoresis with a marker gives us
the sizes of the cliques[3].
Results
To evaluate the algorithm, we randomly generated problems of
different sizes in order to study the time requirements of our algorithm for problems
of different difficulty levels. The problems are generated with a connecting rate of
approximately 80%; that is, when generating a problem, each pair of vertices is
connected by an edge with a probability of 80%. In our simulation, the size of the data pool
is $10^6$. The probability with which we set digits to 1 (the mutation rate) is 2/N,
chosen so that the cliques grow at an average rate of about two vertices per cycle. The
computation stops if the size of the biggest clique remains unchanged for 30
cycles. We also use a conventional recursive algorithm (RA) to solve the problems
exhaustively, in order to check the results of the genetic algorithm and to compare
the time requirements of the two algorithms. Some results of our computation are
shown in Table 1. As the results show, the genetic algorithm solves the
problems correctly, and the number of cycles needed to reach the correct answer is almost a linear
function of the number of nodes (see Fig. 3(a)). If the connecting rate of the problems is
fixed, the number of edges in the complementary graph is of order $N^2$. In each
cycle, each edge in the complementary graph requires one elimination operation; if
the number of cycles required increases linearly with N, the total time requirement
of our genetic algorithm is of order $N^3$. In contrast, the total time requirement
of the recursive algorithm is exponential in N, as shown in Fig. 3(b). More
importantly, the genetic algorithm needs only a very small data pool
compared with the number of all possible solutions. The number of all possible solutions
to a 40-vertex problem is $2^{40}$, or about $10^{12}$, and to a 100-vertex problem $2^{100}$, or about $10^{30}$,
while the size of our data pool is only about $10^6$; yet we still obtain the correct answer
in most cases (see Table 1). We thus believe that this DNA computing
algorithm overcomes the volume barrier of the brute-force method used in the DNA
computing literature[2-5].
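The scaling argument of this paragraph can be written out explicitly. In the sketch below, $p$ denotes the connecting rate, $E_{\text{comp}}$ the number of edges in the complementary graph, $C(N)$ the number of cycles, and $c$ an unspecified proportionality constant; these symbols are introduced here for illustration only.

```latex
\begin{align*}
E_{\text{comp}} &= (1-p)\,\frac{N(N-1)}{2} = O(N^2), \\
C(N) &\approx cN, \\
T_{\text{GA}} &\approx C(N)\,E_{\text{comp}} = O(N^3), \\
2^{40} &\approx 1.1\times10^{12}, \qquad 2^{100} \approx 1.3\times10^{30}.
\end{align*}
```

For example, with $p = 0.8$ and $N = 40$ the first line gives $0.2 \times 780 = 156$ complementary edges, which matches the 624 edges listed for the 40-vertex problem in Table 1.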
| Vertices (N) | Edges | Size of MC | Time (arbitrary unit, by RA) | Prob. to get the correct answer (by GA) | Average number of cycles used before reaching an answer |
|---|---|---|---|---|---|
| 40 | 624 | 14 | 1 | 100% (6 of 6) | 14 |
| 50 | 980 | 14 | 5.7 | 100% (6 of 6) | 15 |
| 60 | 1420 | 16 | 34.2 | 100% (6 of 6) | 25 |
| 70 | 1932 | 18 | 108 | 100% (6 of 6) | 38 |
| 80 | 2530 | 19 | 703 | 100% (6 of 6) | 41 |
| 90 | 3200 | 19 | 1130 | 100% (6 of 6) | 44 |
| 100 | 4000 | 20 | 7440 | 100% (6 of 6) | 57 |
| 110 | 4800 | 21 | 15800 | 100% (6 of 6) | 81 |
| 130 | 6700 | 21 | 80800 | 100% (6 of 6) | 85 |
| 160 | 10000 | 22 | 281000 | 83% (5 of 6) | 113 |
Table 1
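Test problems of the kind summarized in Table 1 could be generated and checked along the following lines. The recursive routine is a generic exhaustive search used here as a stand-in; the text does not specify the recursive algorithm (RA) actually used, and all function names are assumptions.

```python
import random
from itertools import combinations

def random_graph(n, rate, rng):
    """Adjacency sets of a graph where each pair is joined with probability `rate`."""
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < rate:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def max_clique_size(adj, candidates=None, size=0):
    """Exhaustive recursion: try extending the current clique by each candidate."""
    if candidates is None:
        candidates = set(adj)
    best = size
    for v in sorted(candidates):
        # Only higher-numbered neighbours of v may still be added, so every
        # clique is visited exactly once.
        rest = {u for u in candidates if u > v and u in adj[v]}
        best = max(best, max_clique_size(adj, rest, size + 1))
    return best

graph = random_graph(12, 0.8, random.Random(0))
print(max_clique_size(graph))
```

The running time of this search grows exponentially with the number of vertices, so it is practical only as a reference check on small and medium-sized problems.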
Fig. 3: Comparison of the time requirements of the genetic algorithm and the recursive algorithm
in solving MC problems of different sizes. The time requirement of the genetic algorithm
increases linearly (a), while that of the recursive algorithm increases exponentially (b).
The difficulty of a problem is related not only to the number of vertices in
the graph, but also to the connecting rate. Table 1 shows that at a
connecting rate of 80%, the size of the maximal clique increases very slowly as the
number of vertices increases. This implies that for randomly generated problems,
a connecting rate of 80% is still not high enough to give large cliques. Therefore, we
also studied MC problems with a fixed number of vertices (40 and 70) but an
increasing number of edges. The main results of the computation are given in Table 2. We
observe that the size of the maximal clique increases rapidly as the connecting rate
approaches 100%. As the number of edges increases, the recursive algorithm requires
exponentially increasing time, as shown in Fig. 4(c) and 4(d). To our surprise, the number
of cycles needed for the genetic algorithm to solve the problem does not increase
much, as shown in Fig. 4(a) and 4(b). The number of edges in the complementary graph
decreases as the number of edges in the original graph increases, which reduces the
number of elimination operations in each cycle, so the increase in the total
number of operations is even smaller.
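| Vertices (N) | Edges | Size of MC | Time (arbitrary unit, by RA) | Cycles needed (by GA) |
|---|---|---|---|---|
| 40 (full edges: 780) | 700 | 19 | 38 | 15 |
| 40 (full edges: 780) | 710 | 20 | 80 | 15 |
| 40 (full edges: 780) | 720 | 21 | 173 | 16 |
| 40 (full edges: 780) | 730 | 22 | 440 | 16 |
| 40 (full edges: 780) | 740 | 25 | 1570 | 16 |
| 40 (full edges: 780) | 750 | 26 | 4250 | 17 |
| 40 (full edges: 780) | 760 | 27 | 15700 | 17 |
| 70 (full edges: 2415) | 1950 | 18 | 268 | 38 |
| 70 (full edges: 2415) | 2000 | 20 | 661 | 39 |
| 70 (full edges: 2415) | 2050 | 20 | 1430 | 40 |
| 70 (full edges: 2415) | 2100 | 23 | 11300 | 42 |
| 70 (full edges: 2415) | 2150 | 24 | 26800 | 44 |
| 70 (full edges: 2415) | 2200 | 27 | 283000 | 47 |
Table 2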
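The remark that fewer complementary edges mean fewer elimination operations can be checked directly against the Table 2 figures. The sketch below assumes, as stated in the Results, one elimination per complementary edge per cycle; the function name `elimination_ops` is hypothetical.

```python
# (edges, GA cycles) pairs transcribed from Table 2, keyed by number of vertices.
TABLE_2 = {
    40: [(700, 15), (710, 15), (720, 16), (730, 16), (740, 16), (750, 17), (760, 17)],
    70: [(1950, 38), (2000, 39), (2050, 40), (2100, 42), (2150, 44), (2200, 47)],
}

def elimination_ops(n, edges, cycles):
    """Eliminations per run: one per complementary edge in every cycle."""
    full = n * (n - 1) // 2      # number of edges in the complete graph
    return (full - edges) * cycles

for n, rows in TABLE_2.items():
    print(n, [elimination_ops(n, e, c) for e, c in rows])
```

For both graph sizes the estimated number of eliminations falls as edges are added, even though the number of cycles rises slightly.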
Fig. 4: Comparison of the time requirements of the genetic algorithm and the recursive algorithm
in solving MC problems with different numbers of edges. The time requirement of the genetic
algorithm is almost unchanged (a) (b), while that of the recursive algorithm increases
exponentially (c) (d).
Discussion
Let us look at some of the reasons behind the genetic algorithm’s power in solving
this problem. Because the maximal cliques contain the most vertices, they are more
likely to be reached in the end than any smaller clique. Although there are many more
smaller cliques than maximal cliques, most of them are on the right track to the
maximal cliques. To make this point clearer, see Fig. 5, which shows the process
of solving an 8-vertex problem. Fig. 5(a) describes the problem, whose connecting rate
is 0.8. Fig. 5(b) shows all the cliques, each as a point, divided into five levels, and all
the mutations that give larger cliques, each as a line. (For reasons of space, the
exact clique that each point represents is not shown.) From bottom to top, the size of
the cliques in each level runs from 0 to 4. The cliques grow along the lines. The
mutations that fail to give a clique are eliminated by selection and so are not
shown in the graph. Notice that in level K, each point has K lines connecting it to
points in level K-1, which means there are K ways to get a clique of size K from
cliques of size K-1. The faded lines and points represent those mutations and cliques
that cannot lead to, or end up as, a maximal clique. As mentioned above, in
randomly generated problems they are only a small part of all mutations and cliques;
the rest are on the right track. As long as the fraction of cliques on the right track in
each level is much bigger than 1/D, where D is the size of the data pool, the genetic
algorithm will almost surely give the correct solution after some cycles. Since the data
pool of DNA molecules is very large, the capacity of the algorithm can be satisfactory.
Fig. 5: (a) A problem of eight vertices. (b) A network describing the relations between all the
cliques. Each point represents a different clique of (a). The points are arranged so that the
sizes of the cliques run from 0 to 4 from bottom to top. Mutation can make cliques grow larger
along the lines. The faded points and lines mean that the corresponding cliques and mutations
cannot lead to the maximal cliques, which are at the top level.
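The level structure described for Fig. 5(b) can be reproduced for a small graph by enumerating all cliques and counting how many in each level lie inside a largest clique. The eight-vertex graph of Fig. 5(a) is not given in the text, so the sketch uses the five-vertex example of Fig. 1 instead, with an edge list inferred from the Methods section (an assumption).

```python
from itertools import combinations

EDGES = {(1, 5), (2, 3), (2, 4), (2, 5), (3, 4)}   # assumed Fig. 1a edges
VERTICES = range(1, 6)

def is_clique(vs):
    """True if every pair of the given vertices is joined by an edge."""
    return all(tuple(sorted(p)) in EDGES for p in combinations(vs, 2))

# Group all cliques by size (the "levels"), including the empty clique.
levels = {}
for k in range(len(VERTICES) + 1):
    found = [frozenset(c) for c in combinations(VERTICES, k) if is_clique(c)]
    if found:
        levels[k] = found

top = max(levels)
largest = levels[top]
# A clique is "on the right track" if it is contained in some largest clique.
on_track = {k: sum(any(c <= m for m in largest) for c in cs) / len(cs)
            for k, cs in levels.items()}
print({k: len(cs) for k, cs in levels.items()})
print(on_track)
```

For this small graph the on-track fraction stays at 0.6 or above in every level, far larger than 1/D for any realistic pool size D.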
Acknowledgements
We owe many thanks to Ms. Minping Qian of the School of Mathematics for her
inspiring discussions and helpful comments on our work. We thank Mr. Luping
Xu for detailed information on the biotechnologies related to the algorithm.
Other members of the laboratory of nonlinear science and biotechnology also lent many
helping hands. This work is supported by the Chun-Tsung Foundation and the “863”
program of the National Science and Technology Department of China.
References:
1. Ruben, A. J. and Landweber, L. F., The past, present and future of molecular computing.
   Nature Reviews Molecular Cell Biology, 2000, 1: 69-72.
2. Adleman, L., Molecular computation of solutions to combinatorial problems. Science, 1994,
   266: 1021-1024.
3. Ouyang, Q., Kaplan, P. D., Liu, S. and Libchaber, A., DNA solution of the maximal clique
   problem. Science, 1997, 278: 446-449.
4. Braich, R. S., Chelyapov, N., Johnson, C., Rothemund, P. W. K. and Adleman, L., Solution
   of a 20-variable 3-SAT problem on a DNA computer. Science, 2002, 296: 499-502.
5. Faulhammer, D., Cukras, A. R., Lipton, R. J. and Landweber, L. F., Molecular computation:
   RNA solutions to chess problems. Proc. Natl. Acad. Sci. U.S.A., 2000, 97: 1385-1389.
6. Benenson, Y., Paz-Elizur, T., Adar, R., Keinan, E., Livneh, Z. and Shapiro, E.,
   Programmable and autonomous computing machine made of biomolecules. Nature, 2001,
   414: 430-434.
7. Liu, Q., Wang, L., Frutos, A. G., Condon, A. E., Corn, R. M. and Smith, L. M., DNA
   computing on surfaces. Nature, 2000, 403: 175-179.
8. Ogihara, M. and Ray, A., DNA computing on a chip. Nature, 2000, 403: 143-144.
9. Sakamoto, K., Gouzu, H., Komiya, K., Kiga, D., Yokoyama, S., Yokomori, T. and Hagiya,
   M., Molecular computation by DNA hairpin formation. Science, 2000, 288: 1223-1226.
10. Wang, L., Hall, J. G., Lu, M., Liu, Q. and Smith, L. M., A DNA computing readout operation
    based on structure-specific cleavage. Nat. Biotechnol., 2001, 19: 1053-1059.
11. Zimmermann, Karl-Heinz, On applying molecular computation to binary linear codes. IEEE
    Trans. Inform. Theory, 2002, 48: 505-510.
12. Impagliazzo, R., Paturi, R. and Zane, F., Which problems have strongly exponential
    complexity? J. Comput. Syst. Sci., 2001, 63: 512-530, doi: 10.1006/jcss.2001.1774.
13. Holland, J. H., Genetic algorithms. Scientific American, 1992, July, 66-72.
14. Foster, J. A., Evolutionary computation. Nat. Rev. Genet., 2001, 2: 428-436.
15. Chiu, D. T., Pezzoli, E., Wu, H., Stroock, A. D. and Whitesides, G. M., Using
    three-dimensional microfluidic networks for solving computationally hard problems. Proc.
    Natl. Acad. Sci. U.S.A., 2001, 98: 2961-2966.
16. Kaplan, P. D., Ouyang, Q., Thaler, D. S. and Libchaber, A., Parallel overlap assembly for the
    construction of computational DNA libraries. J. Theor. Biol., 1997, 188: 333-341, doi:
    10.1006/jtbi.1997.0475.
17. Arita, M. and Kobayashi, S., The power of sequence design in DNA computing. ICCIMA
    2001: Fourth International Conference on Computational Intelligence and Multimedia
    Applications, Proceedings, 163-167.
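About the authors:
Li Yuan, male, was born in Hunan in April 1982. In September 2000, having won a second
prize in the National Physics Competition, he was admitted by recommendation from the
Affiliated High School of South China Normal University in Guangdong to the physics
program of the Department of Physics, Peking University.
Fang Chen, male, was born in Beijing in July 1982. In September 2000 he was admitted
through the entrance examination from Beijing No. 35 High School to the physics program
of the Department of Physics, Peking University.
During their studies both have been self-motivated learners and active thinkers with good
grades. Since the second semester of their third year they have been pursuing a double
degree at the School of Mathematical Sciences. They entered the national undergraduate
mathematical modeling contest as a team and won a first prize for Beijing. In daily life
they are sincere with others, eager to help classmates, and on good terms with those
around them.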
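Reflections and message:
In this research project we tempered our will and improved our abilities. We experienced
the pain of repeated failures when first trying various methods, and enjoyed the delight of
finally obtaining the expected results. We taught ourselves much practical knowledge about
computers and programming in the course of the work, and deepened our understanding of
the relevant mathematical and physical concepts through our discussions of the basic ideas
of the algorithm. More importantly, through this research we came to recognize the
importance of cooperating with others in real work, which is the biggest difference from
individual study. Without the discussions with our teachers and classmates, and the
inspiration and help they gave us, completing this paper would have been impossible.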
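About the advisor:
Ouyang Qi, male, professor and doctoral supervisor. He graduated from the Department of
Chemistry and Chemical Engineering of Tsinghua University in 1982 and stayed on to work
there the same year. In 1983 he went to study at the University of Bordeaux I in France,
and he received a French doctoral degree in 1989. In 1989 he worked at the University of
Texas at Austin in the United States, and in 1996 he moved to the NEC Research Institute
in the United States. In June 1998 he returned to China to work in the Department of
Physics of Peking University on nonlinear science and the development of biochip
technology, and in the same year he was named one of the first “Cheung Kong Scholar”
distinguished professors.
Since 1983 Dr. Ouyang Qi has been engaged in basic theoretical and experimental research
in nonlinear science and has achieved a series of major results in this field. He is
recognized by international peers as one of the leading experimentalists in the field of
pattern dynamics. Since 1985 he has published more than 50 papers in renowned
international scientific journals, including 3 in Nature (UK), 2 in Science (US), and 7 in
Physical Review Letters (PRL, US).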