Algorithms for the Multiple Degenerate Primer Selection Problem

Nikoletta DiGirolamo¹
UConn BioGrid, REU Summer 2008
Department of Computer Science & Engineering, University of Connecticut, Storrs, CT 06269
¹ Department of Biology, Hunter College, 425 E 25th St, New York, NY 10065

Abstract. The multiplex polymerase chain reaction (MP-PCR) is a quick and inexpensive technique in molecular biology for amplifying multiple DNA loci in a single polymerase chain reaction (PCR). One criterion for obtaining highly specific reaction products is to keep the concentration of the amplification primers low. The primer minimization problem for MP-PCR has been formulated as the Multiple Degenerate Primer Selection Problem (MDPSP). MDPSP is related to the earlier Degenerate Primer Design (DPD) problem, which has been proven NP-complete. This paper formulates a new, so far unexplored variant, the Multiple Degenerate Primer Selection Problem with Errors (MDPSPE), and introduces the algorithm DPS-HDE for solving it. On randomly generated datasets, allowing errors in DPS-HDE greatly reduced both the number of primers in the output and the execution time. Furthermore, we introduce a new exact algorithm, DPS-HDR, for solving the earlier MDPSP and compare its performance on randomly generated datasets with that of DPS-HD, thus far the most efficient algorithm for solving MDPSP, introduced by Balla et al. [2]. DPS-HDR ran slower than DPS-HD but produced output of the same quality.

Key words: Degenerate Primer Design, Primers for MP-PCR, Multiple Degenerate Primer Selection

1 Introduction

The Polymerase Chain Reaction (PCR) is one of the most extensively used tools in molecular biology. The reaction consists of a three-phase cycle: denaturation of the double-stranded DNA, hybridization of the forward and reverse primers, and extension of the primers by the enzyme DNA polymerase.
The primers are short synthetic oligonucleotides, 15 to 30 bases long, that are perfect or nearly perfect complements (some mismatches can be tolerated) of the 3’ ends of the denatured DNA strands. After several cycles of the reaction, the targeted DNA segment is amplified exponentially in the PCR product [5].

A special variant of PCR is the Multiplex Polymerase Chain Reaction (MP-PCR) [3], in which degenerate primers amplify several DNA sequences simultaneously. We call a PCR primer degenerate if more than one nucleotide is allowed at one or more of its positions [9]. The degeneracy of a primer is the number of distinct non-degenerate primers it represents. For example, let Σ = {A, T, C, G} be the alphabet of the four DNA bases: adenine (A), thymine (T), cytosine (C), and guanine (G). If the degenerate primer is P = MGHG, then from Figure 1, P = {A|C}{G}{T|A|C}{G} and its degeneracy is d(P) = 2 × 1 × 3 × 1 = 6.

Fig. 1. Single-letter IUPAC nomenclature of mixed bases [4]

Fortunately, synthesizing degenerate oligonucleotides is as easy and cost-efficient as synthesizing regular primers. The problem with using degenerate primers in MP-PCR reactions, however, is the increased chance of spurious amplification products such as primer dimers [11] and of mispriming events. Thus, minimizing the concentration of degenerate primers plays an important role in achieving highly specific products. One method of keeping the concentration of ineffective primers low is to impose a bound on the degeneracy. The first tentative steps toward finding a pair of degenerate primers with bounded degeneracy that covers most of the input sequences have led to a maximization problem that seeks a set of degenerate primers, each within a given degeneracy threshold and with maximum coverage, that collectively covers all the input sequences.
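As an illustration of the degeneracy d(P) defined above, the following Python sketch computes the degeneracy of a primer written in IUPAC codes and lists the non-degenerate primers it represents. The table follows the standard IUPAC nucleotide codes; the names IUPAC, degeneracy, and expand are chosen here for illustration only and do not come from any implementation described in this paper.

```python
from itertools import product
from math import prod

# Standard IUPAC nucleotide codes and the bases each code allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer):
    """Number of distinct non-degenerate primers represented by `primer`."""
    return prod(len(IUPAC[symbol]) for symbol in primer)


def expand(primer):
    """List every non-degenerate primer represented by `primer`."""
    return ["".join(bases) for bases in product(*(IUPAC[s] for s in primer))]


if __name__ == "__main__":
    print(degeneracy("MGHG"))  # 2 * 1 * 3 * 1 = 6
    print(expand("MGHG"))
```

For P = MGHG the sketch reproduces the value d(P) = 6 from the example, and the expansion contains the six primers AGAG, AGCG, AGTG, CGAG, CGCG, and CGTG.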
The problem is formulated as follows:

Definition 1: The Multiple Degenerate Primer Selection Problem (MDPSP): given n DNA strings of length m, find a set P of degenerate primers that collectively covers all the input sequences, such that each u ∈ P has length L and degeneracy at most D [2].

This paper introduces a new variant of the above problem that allows a certain number of mismatches between the primers and the DNA fragments:

Definition 2: The Multiple Degenerate Primer Selection Problem with Errors (MDPSPE): given a set of n DNA sequences, each of length m, and integers D, L, and E, find a set P of degenerate primers that matches all the input sequences with up to E errors (mismatches), such that each u ∈ P has length L and degeneracy at most D.

The following section gives a short history of the daunting problem of primer design. In Section 3, we introduce our algorithms, DPS-HDE and DPS-HDR. Section 4 discusses the performance of DPS-HDR in comparison to DPS-HD, as well as the performance of DPS-HDE, on randomly generated data sets. Section 5 concludes the paper.

2 The History of Primer Design

The scientific community has embraced the challenge of primer design, producing extensive research. Most of the pioneering works [10], [7] focus on optimizing the quality of a single non-degenerate primer pair based on biochemical factors such as primer and target length, melting temperatures, base composition, and GC-content. From one of these earlier works evolved the primer minimization problem, also known as the Primer Selection Problem (PSP) [12], in which the focus shifts towards minimizing the number of non-degenerate primers that amplify a given set of sequences.
The shortcomings of non-degenerate primer design, however, are that only genes with known sequence information are amplified and that the primers work effectively only on a small set of sequences. With degenerate primers, by contrast, unknown genes from related families can very often be identified. Moreover, degenerate primers work well for large genomic data and help amplify targets for which only amino acid sequences are available. For instance, the CODEHOP algorithm [13] was developed to isolate gene homologs from various genomes. The program produces a pool of primers with a 3’ degenerate core and a 5’ non-degenerate clamp region for each block of aligned amino acid sequences. Like CODEHOP, DePiCt [16] generates primers for multiply aligned protein sequences; it uses hierarchical clustering to group amino acid sequences based on their BlockSimilarity [16] and produces degenerate primers with low degeneracy and high coverage.

The coverage is the number of input sequences that are covered by the primers. To increase the coverage, the degeneracy of the primers has to be increased. However, highly degenerate primers increase the presence of non-specific primers and lower the concentration of the effective ones, leading to undesired amplification products. Motivated by this paradox, Linhart and Shamir [8] introduced the Degenerate Primer Design (DPD) problem, which offers a trade-off between degeneracy and coverage. DPD looks for a highly degenerate primer pair with degeneracy at most d that amplifies as many input sequences as possible. The authors introduce several variants of the problem that optimize either the coverage, Maximum Coverage DPD (MC-DPD), or the degeneracy, Minimum Degeneracy DPD (MD-DPD) [9, 8]. They propose efficient approximation algorithms for the MC-DPD problem and, based on these approximations, introduce HYDEN.
HYDEN was implemented as part of the DEFOG program, a novel approach to Deciphering Families of Genes [6]. DEFOG was successfully tested on the human olfactory receptor (OR) gene superfamily; it almost tripled the size of the initial collection of OR genes.

Maximum Coverage DPD paved the way for the recently introduced Multiple Degenerate Primer Selection Problem (MDPSP) [2]. Most of the algorithms designed for MDPSP attempt to maximize the coverage of the degenerate primers at each selection step. Souvenir et al. [15] introduce two variants of the problem, called Primer Threshold Multiple Degenerate Primer Design (PT-MDPD) and Total Threshold MDPD (TT-MDPD). Both versions were proven to be NP-hard. The authors implement an iterative beam search algorithm, MIPS, for solving the two cases. At each step of the iteration, the algorithm chooses the primers with the least degeneracy. The primers kept for extension are stored in a beam whose size b is a constant input parameter. Increasing the beam size can improve the solution quality, but it also increases the running time, O(b n^3 m p).

Following a similar iterative search, Balla and Rajasekaran [1] introduce the Degenerate Primer Search (DPS) algorithm. To overcome the dependence on beam size, the algorithm selects primer candidates from a collection of best primers, b, based on their coverage-efficiency, e. Using e as a scoring function, the authors improved both the execution time, O(b |Σ| log_|Σ|(d) n^2 m p), and the solution quality in comparison to MIPS.

Now we turn our attention to DPS-HD (Degenerate Primer Selector by Hamming Distance) [2], the most recent algorithm for solving MDPSP. The algorithm achieves a better running time than MIPS and DPS by introducing a greedy approach that circumvents the dependency on b.
Namely, both MIPS and DPS require O(b n^2 m) time for one iteration step, that is, for creating a collection of best primers [1] or producing the next generation of primers [14]. DPS-HD eliminates b from its time complexity by calculating the Hamming distances (HD) between the candidate primer and all the l-mers in the set and choosing the best candidate to merge with the primer based on the lowest HD value. The Hamming distance is the number of positions at which the nucleotides of two DNA strings of equal length do not match (mismatches). For instance, consider two l-mers u = ACGT and v = ACTT; then HD(u, v) = 1. Calculating HD for one generation of primers takes O(n m^2) time, and the overall time for expanding and choosing the best candidate is O(|Σ| log_|Σ|(d) n m^2 p), where d is the degeneracy threshold, n is the number of sequences, m is the sequence length, and p is the cardinality of the final primer set.

Our approach to solving MDPSP is quite similar to DPS-HD. We also use the Hamming distance to choose the best l-mers v to merge the primer candidates with. However, while DPS-HD merges a selected primer u with all the l-mers at minimum Hamming distance, our algorithms choose only one random v at minimum HD to introduce degeneracy.

3 The New Algorithms

In this section we introduce two new algorithms, DPS-HDR for MDPSP and DPS-HDE for its variant MDPSPE. Both design a set of degenerate primers with maximum coverage that covers all the input DNA sequences. The algorithms work as follows.

3.1 DPS-HDR

Let us consider one cycle of DPS-HDR. First, it selects a random sequence k from N = {1, 2, ..., n}, where every sequence has the common length m. Then DPS-HDR generates the set of l-mers N_kj of sequence N_k, each of length l, for 1 ≤ j ≤ m − l + 1.
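The two building blocks described so far, the Hamming distance between two l-mers and the enumeration of the m − l + 1 l-mers of a sequence, can be sketched in Python as follows. The function names hamming_distance and lmers are illustrative assumptions and are not taken from the Perl implementation used in the experiments.

```python
def hamming_distance(u, v):
    """Number of positions at which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("l-mers must have equal length")
    return sum(a != b for a, b in zip(u, v))


def lmers(sequence, l):
    """All substrings of length l, one for each start j = 0 .. m - l."""
    return [sequence[j:j + l] for j in range(len(sequence) - l + 1)]


if __name__ == "__main__":
    print(hamming_distance("ACGT", "ACTT"))  # the example from the text: 1
    print(lmers("ACGTAC", 4))                # m = 6, l = 4 gives 3 l-mers
```

For the example in the text, hamming_distance("ACGT", "ACTT") evaluates to 1, and a sequence of length m = 6 yields m − l + 1 = 3 l-mers for l = 4.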
To merge N_k1, call it u, with the most similar l-mer N_ij, 1 ≤ i ≤ n, 1 ≤ j ≤ m − l + 1, the algorithm calculates the Hamming distances between u and all the l-mers of the sequences still alive (not yet covered) and determines the minimum HD. When HD(u, l-mer) = 0, the coverage of u is increased; otherwise, DPS-HDR retains the list of locations at minimum Hamming distance in B. To expand the current primer, in our case u, a random l-mer v is selected from B. After merging u and v into a degenerate primer, which we call u′, we repeat these steps. This continues until the degeneracy of u′ reaches the threshold or all the sequences in N are exhausted, |N| = 0. One cycle is over when the above steps have been repeated for all the l-mers in N_k. In the final step the algorithm maximizes the coverage by retaining the best u from N_k, the one that covered the most sequences, and adds it to S, the set of best primers. Finally, DPS-HDR enters a new cycle.

3.2 DPS-HDE: An Algorithm for Solving MDPSPE

The earlier mentioned algorithms for MDPSP, such as MIPS, DPS, and DPS-HD, design primers that perfectly complement the target sequences (HD = 0). Our algorithm DPS-HDE (Degenerate Primer Selector by Hamming Distance with Errors), however, allows a certain number of errors between the primers and the input sequences. In other words:

Definition 3: E is an input constant giving the number of mismatches allowed, HD ≤ E, between u and the input sequences to be covered.

For this variant of multiple degenerate primer design, we formulated the problem MDPSPE in Definition 2. Algorithm DPS-HDE greatly resembles the previously described DPS-HDR. It implements the same ranking based on minimum Hamming distances for choosing the best l-mer to merge the candidate primer with.
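The expansion of a single candidate primer described in Section 3.1, together with the error tolerance E of Definition 3, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Perl implementation used in the experiments: the function names, the representation of a degenerate primer as a list of base sets, and the rule of stopping before a merge would push the degeneracy above d_max are all assumptions made for this example. Setting max_errors = 0 corresponds to the exact matching of DPS-HDR.

```python
import random


def mismatches(primer, lmer):
    """Hamming distance between a degenerate primer and a plain l-mer:
    the number of positions whose base is not allowed by the primer."""
    return sum(base not in allowed for allowed, base in zip(primer, lmer))


def degeneracy(primer):
    """Product of the number of bases allowed at each position."""
    d = 1
    for allowed in primer:
        d *= len(allowed)
    return d


def merge(primer, lmer):
    """Add each base of the l-mer to the allowed set at its position."""
    return [allowed | {base} for allowed, base in zip(primer, lmer)]


def expand_primer(seed, sequences, d_max, max_errors=0, rng=random):
    """Grow one candidate primer from the l-mer `seed`.

    Returns (primer, covered), where `covered` holds the indices of the
    sequences matched with at most `max_errors` mismatches.
    Assumes every sequence is at least as long as the seed.
    """
    l = len(seed)
    primer = [frozenset(base) for base in seed]
    covered = set()
    while True:
        best_hd, candidates = None, []  # plays the role of the list B
        for i, seq in enumerate(sequences):
            if i in covered:
                continue
            scored = [(mismatches(primer, seq[j:j + l]), seq[j:j + l])
                      for j in range(len(seq) - l + 1)]
            if min(hd for hd, _ in scored) <= max_errors:
                covered.add(i)  # sequence i is covered by the primer
                continue
            for hd, lmer in scored:
                if best_hd is None or hd < best_hd:
                    best_hd, candidates = hd, [lmer]
                elif hd == best_hd:
                    candidates.append(lmer)
        if not candidates:  # every sequence is covered
            break
        merged = merge(primer, rng.choice(candidates))
        if degeneracy(merged) > d_max:  # assumed stopping rule
            break
        primer = merged
    return primer, covered
```

With the two sequences ACGT and ACTT, the seed ACGT, and d_max = 4, the sketch merges the two l-mers into AC{G|T}T, a primer of degeneracy 2 that covers both sequences. With max_errors = 1 the non-degenerate seed alone already covers both, which is the effect on coverage that DPS-HDE aims for.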
The one difference is in line 14 of Algorithm DPS-HDR, where the Hamming distances between u and all the l-mers of the sequences yet to be covered are calculated; there we make a small adjustment that accounts for E:

     repeat
14:      Calculate hamming distance (HD) between u and all the l-mers of other sequences alive, and add to coverage where HD ≤ E;
15:      Select a random l-mer, v, with minimum hamming distance and expand u with v;
     until (degeneracy ≥ threshold, dmax) or (all the sequences covered)

By allowing errors, one would expect a reduction in the cardinality of the final primer set and a shorter running time, since the coverage of the candidate primers will most likely increase.

Algorithm DPS-HDR
 1: Let S be the list of selected primers; initially, S := null;
 2: Let N be a set of sequences of length m; N := {1, 2, ..., n};
 3: while (|N| > 0) do
 4: {
 5:     BestPrimer := null;
 6:     BestCoverage := ();
 7:     Select a random sequence k from N;
 8:     Generate the l-mers, u, of k;
 9:     for all (u in k) do
10:     {
11:         CurrentPrimer := u;
12:         CurrentCoverage := ();
13:         repeat
14:             Calculate hamming distance (HD) between u and all the l-mers of other sequences alive, and add to coverage where HD = 0;
15:             Select a random l-mer, v, with minimum hamming distance and expand u with v;
16:         until (degeneracy ≥ threshold, dmax) or (all the sequences covered)
17:         if (CurrentCoverage > BestCoverage) then
18:         {
19:             BestPrimer := CurrentPrimer;
20:             BestCoverage := CurrentCoverage;
21:         end if
22:         }
23:     end for
24:     }
25:     S gets BestPrimer; remove the sequences in BestCoverage from N;
26: end while
27: }

4 Experimental Results

In this section we present our results for the performance of both DPS-HDR and DPS-HDE. We compared the algorithms DPS-HD and DPS-HDR both in quality, that is, the number of primers generated, and in execution time. The experiment for DPS-HD was carried out on a PowerEdge 2600 Linux server with 4 GB of RAM and dual 2.8 GHz Intel Xeon CPUs.
DPS-HDR and DPS-HDE were run on an Intel(R) Core(TM)2 CPU 6700 at 2.66 GHz with 2 GB of RAM. The authors of [2] implemented DPS-HD in Java, while DPS-HDR was implemented in Perl. To compare the algorithms, we followed the method used in [2]. The experiment was run for data sizes n = 20, 40, 60, 80, 100, 120. For each n we generated random DNA sequences with the common length of m = 300 nucleotides. The program searched for primers of l = 15 bases with degeneracy thresholds d = 10000 and d = 100000. Note that DPS-HDR was run only once for each data size due to the shortage of resources and time.

Overall, DPS-HDR generated the same number of primers as DPS-HD for small data sets, so the quality remained the same, but its running time was observably longer than that of DPS-HD. This behavior could be partially due to the difference between the processors, but the main reason for the slow execution is most likely the difference between the implementation languages. The results for randomly generated datasets can be viewed in Table 1 and Table 2. The performance of DPS-HD compared to MIPS and DPS can be viewed in [2].

DPS-HDE was also implemented in Perl and run under the same experimental conditions as DPS-HDR. We ran the program for degeneracy thresholds d = 10000 and d = 100000 and error sizes E = 1, E = 2, and E = 3. The results can be viewed in Table 3 and Table 4. Introducing errors greatly decreased the cardinality of the final degenerate primer set and also reduced the running time. For example, for n = 120 and d = 10000, the number of primers fell from 15 with DPS-HDR (Table 1) to 11, 5, and 2 for E = 1, 2, and 3 (Table 3), and the running time fell from 16474.85 s to 9780.00 s, 3100.85 s, and 1318.22 s.

5 Conclusion

In this paper we introduced two new algorithms that design primers for MP-PCR reactions. One of the algorithms, DPS-HDR, generates degenerate primers for MDPSP. Compared with the recently introduced DPS-HD, it had longer running times for the same quality of output.
Our work also extended MDPSP into a new variant, namely MDPSPE. This version allows errors (mismatches) between the primers and the targeted DNA sequences in order to increase the coverage and decrease the cardinality of the degenerate primer set in the final solution. For solving this new variant the paper introduced the algorithm DPS-HDE.

We believe that after optimizing the efficiency of the Perl programs for DPS-HDR and DPS-HDE, we will be able to improve the current running times. Furthermore, in order to achieve more accurate results, we plan to rerun the experiment for larger data sizes and not only for randomly generated but also for biological data. Most importantly, we will try to initiate a collaboration with molecular biologists in order to validate the performance of these algorithms in wet-lab experiments.

Acknowledgments. This research was a part of the BioGrid Initiatives at the University of Connecticut and was supported by the REU-NSF grant CCF0755373. I would like to thank my mentor, Professor Sanguthevar Rajasekaran, for introducing me to the problem and letting me participate in his research, and Dolly Sharma for her everlasting help and guidance throughout the project.

References

1. Sudha Balla and Sanguthevar Rajasekaran. An efficient algorithm for minimum degeneracy primer selection, March 2007.
2. Sudha Balla, Sanguthevar Rajasekaran, and Ion Mandoiu. Greedy heuristics for degenerate primer selection.
3. J S Chamberlain, R A Gibbs, J E Ranier, P N Nguyen, and C T Caskey. Deletion screening of the Duchenne muscular dystrophy locus via multiplex DNA amplification, December 1988.
4. A Cornish-Bowden. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984, May 1985.
5. Henry A. Erlich, David Gelfand, and John J. Sninsky. Recent advances in the polymerase chain reaction, June 1991.
6. Tania Fuchs, Barbora Malecova, Chaim Linhart, Roded Sharan, Miriam Khen, Ralf Herwig, Dmitry Shmulevich, Rani Elkon, Matthias Steinfath, John K O’Brien, Uwe Radelof, Hans Lehrach, Doron Lancet, and Ron Shamir. DEFOG: a practical scheme for deciphering families of genes, September 2002.
7. Robert Giegerich, Folker Meyer, and Chris Schleiermacher. GeneFisher - software support for the detection of postulated genes. In David J. States, Pankaj Agarwal, Terry Gaasterland, Lawrence Hunter, and Randall Smith, editors, ISMB, pages 68–77. AAAI, 1996.
8. Chaim Linhart and Ron Shamir. The degenerate primer design problem, 2002.
9. Chaim Linhart and Ron Shamir. The degenerate primer design problem: Theory and applications. Journal of Computational Biology, 12(4):431–456, 2005.
10. T Lowe, J Sharefkin, S Q Yang, and C W Dieffenbach. A computer program for selection of oligonucleotide primers for polymerase chain reactions, April 1990.
11. P Markoulatos, N Siafakas, and M Moncany. Multiplex polymerase chain reaction: a practical approach, 2002.
12. William R. Pearson, Gabriel Robins, Dallas E. Wrege, and Tongtong Zhang. On the primer selection problem in polymerase chain reaction experiments. Discrete Applied Mathematics, 71(1-3):231–246, 1996.
13. T M Rose, E R Schultz, J G Henikoff, S Pietrokovski, C M McCallum, and S Henikoff. Consensus-degenerate hybrid oligonucleotide primers for amplification of distantly related sequences, April 1998.
14. Richard Souvenir, Jeremy Buhler, Gary Stormo, and Weixiong Zhang. An iterative method for selecting degenerate multiplex PCR primers, 2007.
15. Richard Souvenir, Jeremy Buhler, Gary D. Stormo, and Weixiong Zhang. Selecting degenerate multiplex PCR primers. In Gary Benson and Roderic D. M. Page, editors, WABI, volume 2812 of Lecture Notes in Computer Science, pages 512–526. Springer, 2003.
16. Xintao Wei, David N Kuhn, and Giri Narasimhan. Degenerate primer design via clustering, 2003.

Appendix

Table 1. Performance of DPS-HDR on random datasets: l = 15, d = 10000.

# of seq (n)   # of primers (p)   time in sec (t)
     20               4                 533.10
     40               6                1971.51
     60               9                4561.48
     80              11                7675.19
    100              13               11769.60
    120              15               16474.85

Table 2. Performance of DPS-HDR on random datasets: l = 15, d = 100000.

# of seq (n)   # of primers (p)   time in sec (t)
     20               3                 590.27
     40               4                2460.49
     60               6                5607.98
     80               7                9729.31
    100               8               13511.15
    120              10               19613.65

Table 3. Performance of DPS-HDE on random datasets: l = 15, d = 10000.

                    Error = 1                   Error = 2                   Error = 3
# of seq (n)   # of primers (p)  time (s)   # of primers (p)  time (s)   # of primers (p)  time (s)
     20               3            461.60          2            278.74          1             35.42
     40               6           1664.21          3            742.17          1            369.59
     60               7           3236.05          4           1277.97          2            651.43
     80               8           5012.94          4           1772.90          2            857.57
    100              10           7286.97          4           2446.25          2           1071.56
    120              11           9780.00          5           3100.85          2           1318.22

Table 4. Performance of DPS-HDE on random datasets: l = 15, d = 100000.

                    Error = 1                   Error = 2                   Error = 3
# of seq (n)   # of primers (p)  time (s)   # of primers (p)  time (s)   # of primers (p)  time (s)
     20               2            484.93          1             21.20          1              1.24
     40               3           1402.28          1             19.70          1              4.47
     60               4           2571.24          1            116.91          1              3.44
     80               4           3691.13          2           1627.81          1              4.48
    100               4           4779.16          2           2086.23          1              5.64
    120               4           6164.56          2           2492.41          1              6.51