Series of Selected Papers from Chun-Tsung Scholars,Peking University(2003) Genetic Algorithm in DNA Computing: A Solution to the Maximal Clique Problem Department of Physics,00 grade, Li Yuan, Fang Chen Tutor: Ouyang Qi Abstract Genetic algorithm is one of the possible ways to break the limit of brute-force method in DNA computing. Using the idea of Darwinian evolution, we introduce a genetic DNA computing algorithm to solve the maximal clique problem. All the operations in the algorithm are accessible with today’s molecular biotechnology. Our computer simulations show that with this new computing algorithm, it is possible to get a solution from a very small initial data pool, avoiding enumerating all candidate solutions. For randomly generated problems, genetic algorithm can give correct solution within a few cycles at high probability. Although the current speed of a DNA computer is slow compared with silicon computers, our simulation indicates that the time requirement of this genetic algorithm is approximately a linear function of the number of vertices in the network. This may make DNA computers more powerful attacking some hard computational problems. Keywords: DNA computer, genetic algorithm, NP-complete problem. 23 Series of Selected Papers from Chun-Tsung Scholars,Peking University(2003) Introduction In recent years, biomolecular computing has been documented in various literatures (see Ruben & Landweber, 2000[1] and references therein). Since Adleman’s solution to the Hamiltonian path problem[2], DNA and RNA solutions of some other famous NP-complete problems, such as the maximal clique problem[3], the 3-SAT problem[4] and the knight problem[5], have been given. Some important progresses have been made in the technologies in computing with biomolecules[6-11]. The power of parallel, high-density computation by molecules in solution allows DNA computers to solve hard computational problems such as NP-complete problems in polynomial increasing time, while a conventional Turing machine require exponentially increasing time[12]. However, all the current DNA computing strategies are based on enumerating all candidate solutions, then, using selection processes to eliminate unwanted DNA. This algorithm requires that size of the initial data pool increases exponentially with the number of variables in the calculation, so that the capacity of the DNA computer is limited. For example, to calculate a DNA solution of a maximal clique problem, the number of molecules in the solution must be at least 2N, while N is the number of nodes. With M scale DNA concentration, a 35-node problem can be solution in a typical test tube, while a 75-node problem requires a swimming pool! Obviously, it is inaccessible in practice. In order to break the barrier of this brute-force method, we need to develop new algorithms for DNA computer. One of the strategies to overcome the volume size problem is to apply the idea of Darwinian evolution. Most organisms evolve by means of two primary processes: natural selection and sexual reproduction. The first determines which members of a population survive to reproduce, and the second ensures mixing and recombination among the genes of their offspring. It was demonstrated that the generic algorithm based on these basic processes could solve complex problems[13]. The algorithm exploits the higher-payoff, or target regions of the solution space, because successive generations of reproduction and crossover produce increasing numbers of strings in those regions. It favors the fittest strings as parents, and so above-average strings will have more offspring in the next generation. We notice that the parallelism of molecular computation makes it extremely convenient to apply genetic algorithm (GA) [14] in DNA computing. In this paper, we present our simulation results of using the genetic algorithm to solve the maximal clique (MC) problem. Our results show that it is possible to get a solution from a very small initial data pool, avoiding enumerating all candidate solutions. For randomly generated problems, genetic algorithm can give correct solution within a few cycles at high probability. Although the current speed of a DNA computer is slow compared with silicon computers, our simulation indicates that the time requirement of this genetic algorithm is approximately a linear function of the number of vertices in a network. This may make DNA computers more powerful attacking some hard computational problems. 24 Series of Selected Papers from Chun-Tsung Scholars,Peking University(2003) Methods Mathematically, a clique is defined as a set of vertices in which every vertex is connected to every other vertex by an edge. The maximal clique problem asks: Given a network containing N vertices and M edges, how many vertices are in the largest clique? In Fig.1a, a problem of five vertices and five edges is defined. Obviously, the vertices (2,3,4) make the largest clique, so the size of the largest clique in this network is three. The maximal clique problem has been proven an NP-complete problem. Besides the conventional algorithm on electronic computer and DNA computer, some interesting attempts have been made[15]. Fig. 1: The original graph describing a maximal clique problem of five vertices and five edges (a) and its complementary graph (b). The data structure of the computation is the same of the previous work[3]. For a graph of N vertices, we use an N-digit binary number to represent each possible clique. In the N-digit binary number, a bit set to 1 represents the corresponding vertex in the clique, a bit set to 0 represents the corresponding vertex out of the clique. For example, 5-digit binary number (01110) represents the largest clique (2,3,4) in Fig.1a. To solve the maximal clique problem is to find an N-digit binary number that contains most 1s (called condition A). For each graph of N vertices, we can build a complementary graph, which contains all the missing edges in the original graph, as shown in Fig.1b. Obviously, any two vertices in a clique in the original graph are not connected in the complementary graph. It should be noticed that if the complementary graph can be divided into several connected graphs, we can divide the problem into several sub-problems, find each sub-problem’s maximal clique, and then the union of these cliques makes the whole problem’s maximal clique. From this point of view, we only need to solve problems whose complementary graph is a connected graph. With this data structure, we design the following genetic algorithm to solve the maximal clique problem in a DNA computer, and pay special attention to make sure that all the processes in our algorithm are readily accessible in current biotechnologies. 25 Series of Selected Papers from Chun-Tsung Scholars,Peking University(2003) We start with a small data pool containing many N-digit binary numbers. In the initial data pool, each digit of each binary number is set to 0, meaning that no vertex is in the clique. Then we let the data pool evolve. In each cycle of evolution, we first randomly replace some digits in the data pool by 1s with certain probability, no matter whether the original digit is 0 or 1. We call this operation mutation. Then we eliminate from the data pool all numbers that do not satisfy condition A. For example, for the problem shown in Fig.1a and Fig.1b, such numbers as (11xxx), (1xx1x), (1x1xx), (xx1x1) and (xxx11) will be removed from the data pool, while x representing either 0 or 1. If the size of the data pool is smaller after the elimination, we enlarge it by making copies of the binary numbers until the size of the pool reaches that of the initial pool. Through these operations of mutation and elimination, the cliques in the data pool will become bigger as the increase of the number of evolution cycle. In order to find out if there is some progress in each cycle, after the operation of elimination, we will find the maximum number of 1s in a string. This number represents the current biggest clique’s size. If this size tends to increase with the increase of number of cycles, we know that the maximal clique is not found, so the computation should go on. If it stops growing for many cycles, we regard it as a sign that the evolution has reached its end, so the computation should stop and the size of the clique we get at last is considered as the size of the maximal clique. All above-mentioned calculating processes can be readily mapped into corresponding biological operations using current biotechnologies. In the following, we discuss the related biotechnologies that are needed in order to carry out the algorithm. First, we need to encode the binary numbers into DNA strands, one strand representing one binary number. We noticed that the parallel overlap assembly technique of building the data pool to be very helpful to our algorithm[16]. To encode an N-digit binary number, we need N+1 DNA sequences representing the positions and N sequences representing the values of each digit. The sequences representing the positions are all of the same length, but the sequences representing the different values are of different length, for example, 1s are shorter than 0s. It should be noticed that the sequences must be of some different characters[17]. There are at least two techniques available to help carry out the operation of elimination. If we make the length of the value sequences of 1s be zero, then wherever in a binary number 1 occurs, the two position sequences next to that digit will be directly linked. If we design the position sequences in such a way that a restriction enzyme can break the linking point, we will eliminate value 1 in that specific position. If we want to break all the strands with 1 at position a and position b, we only need to divide the data pool into two parts, pool A and pool B, breaking the 1 at position a in pool A and 1 at position b in pool B. The union of pool A and pool B will not contain DNA strands with 1 at both positions but still can contain strands with 1 at only one of the two positions (Fig. 2a) [3]. For the other technology, we make the length of the value sequences of 0s be zero, then wherever in a binary number 0 occurs, the two position sequences next to it will join. In order to eliminate the unqualified DNA strands, we only have to pick up the qualified ones by using 26 Series of Selected Papers from Chun-Tsung Scholars,Peking University(2003) polyacrylamide gel containing bound probes to capture the strands with 0 at any of the two positions (Fig. 2b) [4]. Fig. 2: Different method of eliminating the data pool. (a) use restricted enzymes; (b) pick up qualified strands with probes in electrophoresis. The operation of mutation would be a little difficult to carry out. In our algorithm the value sequences representing 0s and 1s are of different length, so we need to make DNA sequences mutate to a different length. If the length of value 0’s sequence is zero, we can use the restriction enzyme to cut at the joint position the strands each into two, add some DNA subsequence representing the corresponding value 1 into data pool, then let the strands recombine. The sequence of value 1 has a chance to be inserted into the joint of the two pieces. As our simulation indicates, the mutation rate is no more than 5%, so this technique may be enough. In addition, the method of bubble PCR and ligation can also be a candidate for carrying out mutation. Cloning or PCR can help carry out the operation of enlarging. Because the value sequences of 1s and 0s are of different length, an electrophoresis with marker will give us the information of the sizes of the cliques[3]. Results In order to evaluate the algorithm, we randomly generated some problems of different sizes, wishing to study the time requirements of our algorithm for problems of different difficulty levels. The problems are generated with a connecting rate of approximate 80%, that is, when generating the problem, each pair of vertices has a probability of 80% to be connected by an edge. In our simulation, the size of data pool is 106. The probability at which we put 1s into digits (the mutation rate) is 2/N in the 27 Series of Selected Papers from Chun-Tsung Scholars,Peking University(2003) sense that the cliques will grow averagely at the rate of two vertices per cycle. The computation will stop if the size of the biggest clique remains unchanged for 30 cycles. We also use a conventional recursive algorithm (RA) to solve the problems in an exhaustive way in order to check the result of the genetic algorithm and to compare the time requirement of the two algorithms. Some results of our computation are shown in Table 1. As we can see from the results, the genetic algorithm can solve the problem correctly; the cycle needed to get the correct answers is almost a linear function of the number of nodes (See Fig. 3(a)). If the connecting rate of problems is fixed, then the number of edges in the complementary graph is order of N2. In each cycle, each edge in the complementary graph brings in one operation of elimination, if the number of cycles required increases linearly to N, then the total time requirement of our genetic algorithm is order of N3. On the other hand, the total time requirement of the recursive algorithm is exponential to N, as shown in Fig. 3(b). More importantly, using the genetic algorithm, we only have to use a very small data pool, compared to the number of all possible solutions. The number of all possible solutions to a 40-vertices problem is 240, or 1012, to a 100-vertices problem is 2100, or 1030, while the size of our data pool is only about 106, yet we can still give correct answer in most of the occasions (see table 1). We thus believe that this DNA computing algorithm overcomes the volume size barrier of brute-force method used in DNA computing literature[2-5]. Vertices (N) Edges Size of MC Time (Arbitrary unit, by RA) Prob. to get the correct answer (by GA) Average number of cycles used before reaching an answer 40 624 14 1 100% (6 of 6) 14 50 980 14 5.7 100% (6 of 6) 15 60 1420 16 34.2 100% (6 of 6) 25 70 1932 18 108 100% (6 of 6) 38 80 2530 19 703 100% (6 of 6) 41 90 3200 19 1130 100% (6 of 6) 44 100 4000 20 7440 100% (6 of 6) 57 110 4800 21 15800 100% (6 of 6) 81 130 6700 21 80800 100% (6 of 6) 85 160 10000 22 281000 83% (5 of 6) 113 Table 1 28 Series of Selected Papers from Chun-Tsung Scholars,Peking University(2003) Fig. 3: Comparison of related time requirement between genetic algorithm and recursive algorithm in solving MC problems with different sizes. The time requirement of the genetic algorithm increases linearly (a), while that of recursive algorithm increases exponentially (b). The difficulty level of a problem is related not only to the number of vertices in the graph, but also to the connecting rate. From Table 1 we can see that at the connecting rate of 80%, the size of maximal clique increases very slowly with the increase of the number of vertices. This implies that for randomly generated problems, a connecting rate of 80% is still not high enough to give large cliques. Therefore, we also studied MC problems with fixed number of vertices (40 and 70) but with increasing number of edges. The main results of the computation are in Table 2. We observe that the size of maximal clique increases rapidly as the connecting rate approaches 100%. As the number of edges increases, the recursive algorithm requires exponentially increasing time, as shown in Fig. 4(c), 4(d). To our surprise, the number of cycles needed for the genetic algorithm to solve the problem does not increase much, as shown in Fig 4(a), 4(b). The number of edges in complementary graph decreases as the increase of number of edges in the original graph, which reduces the number of operations of elimination in each cycle, thus the increase of the total number of operations would be even smaller. Vertices (N) Edges Size MC 40 (Full Edges: 780) 700 of Time (Arbitrary unit, by RA) Cycles needed (by GA) 19 38 15 710 20 80 15 720 21 173 16 730 22 440 16 740 25 1570 16 750 26 4250 17 29 Series of Selected Papers from Chun-Tsung Scholars,Peking University(2003) 70 (Full Edges: 2415) 760 27 15700 17 1950 18 268 38 2000 20 661 39 2050 20 1430 40 2100 23 11300 42 2150 24 26800 44 2200 27 283000 47 Table 2 Fig. 4: Comparison of related time requirement between genetic algorithm and recursive algorithm in solving MC problems with different number of edges. The time requirement of the genetic algorithm is almost unchanged (a) (b), while that of recursive algorithm increases exponentially (c) (d). Discussion Let us see into some reasons behind genetic algorithm’s powerfulness in solving this problem. Because the maximal cliques have most vertices, they are easier to be 30 Series of Selected Papers from Chun-Tsung Scholars,Peking University(2003) finally arrived at than any other smaller cliques are. Although there are much more smaller cliques than the maximal cliques, most of them are on the right track to the maximal cliques. To make this point more clear, see Fig. 5, which shows the process of solving an 8-vertex problem. Figure (a) describes the problem, whose linking rate is 0.8. Figure (b) shows all the cliques, each as a point, divided into five levels, and all the mutations that give larger cliques, each as a line. (For the reason of space, the exact clique that each point represents is not shown.) From bottom to top, the size of the cliques in each level is from 0 to 4. They grow bigger along the lines. The mutations that fail to give a clique are eliminated by selection and so are not contained in the graph. Notice that in level K, each point has K lines connecting to the points in level K-1, which means there are K ways to get a clique of size K from cliques of size K-1. The faded lines and points represent those mutations and cliques that surely cannot lead to or end up as a maximal clique. As mentioned above, in randomly generated problems, they are only a small part of all mutations and cliques, the rest are on the right track. As long as percentage of cliques on the right track in each level is much bigger than 1/D, where D is the size of the data pool, the genetic algorithm can almost surely give the correct solution after some cycles. Since the data pool of DNA molecules is very large, the capacity of the algorithm can be satisfying. Fig. 5 (a) 31 Series of Selected Papers from Chun-Tsung Scholars,Peking University(2003) Fig. 5(b) Fig. 5: (a), a problem of eight vertices, (b), A network describing the relation between all the cliques. Each vertex represents a different clique in (a). They are arranged in such a way that the sizes of the cliques are from 0 to 4 from bottom to top. Mutation can make cliques grow larger along the lines. The faded vertices and lines mean that the corresponding cliques and mutations cannot lead to the maximal cliques, which are at the top level. Acknowledgements We owe many thanks to Ms. Minping Qian from the school of mathematics for her inspiring discussion and helpful comment to our work. We acknowledge Mr. Luping Xu’s helpful detailed information on the biotechnologies related to the algorithm. Other members of the lab of nonlinear science and biotechnology also lent many helping hands to our work. This work is supported by the Chun-Tsung Foundation, “863” program of National Science and Technology Department of China. References: 1. 2. 3. Ruben, A. J. and Landweber, L. F., The past, present and future of molecular computing. Nature Reviews Molecular Cell Biology, 2000, 1: 69-72. Adleman, L., Molecular computation of solutions to combinatorial problems. Science, 1994, 266: 1021-1024. Ouyang, Q., Kaplan, P. D., Liu S. and Libchaber, A., DNA solution of the maximal clique 32 Series of Selected Papers from Chun-Tsung Scholars,Peking University(2003) 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. problem. Science, 1997, 278: 446-449. Braich, R. S., Chelyapov, N., Johnson, C., Rothermund, P. W. K. and Adleman, L., Solution of a 20-variable 3-SAT problem on a DNA computer. Science, 2002, 296: 499-502. Faulhammer, D., Cukras, A. R., Lipton, R. J. and Landweber, L. F., Molecular computation: RNA solutions to chess problems. Proc. Natl. Acad. Sci., 2000, U.S.A. 97: 1385-1389. Benenson, Y., Paz-Elizur, T., Adar, R., Keinan, E., Livneh, Z. and Shapiro, E., Programmable and autonomous computing machine made of biomolecules. Nature, 2001, 414: 430-434. Liu, Q., Wang, L., Frutos, A. G., Condon, A. E., Corn, R. M. and Smith, L. M., DNA computing on surfaces. Nature, 2000, 403: 175-179. Ogihara, M. and Ray, A., DNA computing on a chip. Nature, 2000, 403: 143-144. Sakamoto, K., Gouzu, H., Komiya, K., Kiga, D., Yokoyama, S., Yokomori, T. and Hagiya, M., Molecular computation by DNA hairpin formation. Science, 2000, 288: 1223-1226. Wang, L., Hall, J. G., Lu, M., Liu, Q. and Smith, L. M., A DNA computing readout operation based on structure-specific cleavage. Nat. Biotechnol., 2001, 19: 1053-1059. Zimmermann, Karl-Heinz, On applying molecular computation to binary linear codes. IEEE Trans. Inform. Theory, 2002, 48: 505-510. Impagliazzo, R., Paturi, R. and Zane, F., Which problems have strongly exponential complexity? J. Comput. Syst. Sci., 2001, 23: 512-530, doi: 10.1006/jcss.2001.1774. Holland, J. H., Generic algorithm, Scientific American, 1992, July, 66-72. Foster. J. A., Evolutionary computation. Nat. Rev. Genet., 2001, 2: 428-436. Chiu, D. T., Pezzoli, E., Wu, H., Stroock, A. D. and Whitesides, G. M., Using three-dimensional microfluidic networks for solving computationally hard problems. Proc. Natl. Acad. Sci. U.S.A., 2001, 98: 2961-2966. Kaplan, P. D., Ouyang, Q., Thaler, D. S. and Libchaber, A., Parallel overlap assembly for the construction of computational DNA libraries. J. Theor. Biol., 1997, 188: 333-341, doi: 10.1006/jtbi.1997.0475. Arita, M. and Kobayashi, S., The power of sequence design in DNA computing. ICCIMA 2001: Fourth International Conference on Computational Intelligence and Multimedia Applications, Proceedings, 163-167. 作者简介: 李源,男,1982 年 4 月生于湖南,2000 年 9 月因获全国物理竞赛二等奖从 广东华南师大附中保送进北大物理系物理学专业。 方辰,男,1982 年 7 月生于北京,2000 年 9 月从北京三十五中考入北大物 理系物理学专业。 两人在校期间主动学习,积极思考,成绩优良,从大三下学期开始攻读数学 科学学院的双学位,曾组队参加全国大学生数学建模大赛并获北京市一等奖;在 日常生活中待人真诚,热心帮助同学,跟周围人关系融洽。 33 Series of Selected Papers from Chun-Tsung Scholars,Peking University(2003) 感悟与寄语: 在这次科研活动中,我们磨练了意志,提高了能力:既体验到了最初尝试各 种方法屡遭失败的痛苦,又享受到了终于得到预期结果时的喜悦;既在实际的工 作中自学了很多关于计算机和编程的实用知识,又在互相之间对算法的基本思想 的讨论中加深了对相应的数学、物理概念的理解。更重要的,我们通过这次的科 研,认识到了实际工作中和其他人合作的重要性——这是和个人学习之间的最大 不同,如果没有和老师、同学之间的讨论以及他们给我们的启发和帮助,完成这 篇论文是根本不可能的。 指导教师简介: 欧阳颀,男,教授、博士生导师。1982 年毕业于清华大学化学化工系,同年 留校工作。1983 年赴法国波尔多第一大学留学,1989 年获法国博士学位。1989 年在美国德州大学奥斯丁分校工作, 1996 年转入美国 NEC 研究所。1998 年 6 月回国到北京大学物理系从事非线性科学与生物芯片技术开发工作,同年被评为 首届“长江学者”特聘教授。 欧阳颀博士 1983 年以来一直从事非线性科学的基础理论与实验研究,并在 该领域取得了一系列重大成果,被国际同行公认为斑图动力学领域的实验科学带 头人之一。1985 年至今他已经在国际著名科学杂志上发表论文 50 余篇。其中包 括英国《自然》杂志 3 篇,美国《科学》杂志 2 篇,美国《物理通讯快报》 (PRL) 7 篇。 34
© Copyright 2026 Paperzz