Algorithms for the Multiple Degenerate Primer Selection Problem

Nikoletta DiGirolamo¹
UConn BioGrid, REU Summer 2008
Department of Computer Science & Engineering,
University of Connecticut, Storrs, CT 06269
¹ Department of Biology, Hunter College, 425 E 25th St, New York, NY 10065
Abstract. The multiplex polymerase chain reaction (MP-PCR) is a
quick and inexpensive technique in molecular biology for amplifying multiple DNA loci in a single Polymerase Chain Reaction (PCR).
One of the criteria for achieving highly specific reaction products is to
keep the concentration of the amplification primers low. In the literature,
the problem of primer minimization for MP-PCR reactions has been formulated as the Multiple Degenerate Primer Selection
Problem (MDPSP). MDPSP is related to the earlier Degenerate Primer
Design (DPD) problem, which has been proven NP-complete. This paper
formulates a new, so far unexplored variant, the Multiple Degenerate
Primer Selection Problem with Errors (MDPSPE), and introduces algorithm DPS-HDE for solving this new version. On randomly generated
datasets, allowing errors in DPS-HDE greatly reduced both the number of primers
in the output and the execution time. Furthermore, we introduce
a new exact algorithm, DPS-HDR, for solving the earlier MDPSP and
compare its performance on randomly generated datasets
with DPS-HD, thus far the most efficient algorithm for solving MDPSP,
introduced by Balla et al. in [2]. DPS-HDR ran slower than DPS-HD but produced output of the same quality.
Key words: Degenerate Primer Design, Primers for MP-PCR, Multiple
Degenerate Primer Selection
1 Introduction
The Polymerase Chain Reaction (PCR) is one of the most extensively used
tools in the field of molecular biology. The reaction consists of a three-phase
cycle: denaturation of the double-stranded DNA, followed by hybridization of
the forward and reverse primers, and extension of the primers by the enzyme
DNA polymerase. The primers are short synthetic oligonucleotides, with lengths
varying from 15 to 30 bases, that are perfect or close to perfect complements
(some mismatches can be tolerated) to the 3’ ends of the denatured DNA double strand.
After several cycles of the reaction, the targeted DNA segment is amplified
exponentially in the PCR product [5].
A special variant of PCR is the Multiplex Polymerase Chain Reaction (MP-PCR) [3], in which degenerate primers amplify several DNA sequences simultaneously.
We call a PCR primer degenerate if more than one nucleotide is
allowed at one or more positions of the primer [9]. The degeneracy of a primer is
the number of distinct non-degenerate primers it represents. For example,
let Σ = {A, T, C, G}, where Σ stands for the four DNA bases: adenine (A),
thymine (T), cytosine (C), guanine (G). If the degenerate primer is P = MGHG,
then from Figure 1, P = {A|C}{G}{T|A|C}{G} and its degeneracy is d(P) =
2 × 1 × 3 × 1 = 6.
Fig. 1. Single letter IUPAC nomenclature of mixed bases [4]
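The degeneracy calculation in the example above can be sketched in a few lines of Python. This is only an illustration: the names IUPAC, degeneracy, and expand are hypothetical and do not come from the paper's Perl implementation; the code table follows the standard IUPAC single-letter nomenclature for mixed bases.

```python
from itertools import product
from math import prod

# Standard IUPAC single-letter codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def degeneracy(primer: str) -> int:
    """Number of distinct non-degenerate primers that `primer` represents."""
    return prod(len(IUPAC[symbol]) for symbol in primer)

def expand(primer: str) -> list[str]:
    """Enumerate every non-degenerate primer represented by `primer`."""
    return ["".join(bases) for bases in product(*(IUPAC[s] for s in primer))]

print(degeneracy("MGHG"))  # 2 * 1 * 3 * 1 = 6
print(expand("MGHG"))      # six 4-mers, starting with AGAG
```

Enumerating with expand is only practical for small degeneracy values; for thresholds such as d = 10000 the product form in degeneracy is the one to use.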
Fortunately, synthesizing degenerate oligonucleotides is as easy and cost-efficient as synthesizing regular primers. The problem with using degenerate primers
in MP-PCR reactions, however, is the increased chance of observing spurious amplification
products such as primer dimers [11] and mispriming events. Thus, minimizing
the concentration of degenerate primers plays an important role in achieving
highly specific products. One method of keeping the concentration of ineffective
primers low is to impose a bound on the degeneracy. The first tentative
steps toward finding a pair of degenerate primers with bounded degeneracy that
would cover most of the input sequences have led to the maximization problem
that seeks a set of degenerate primers, each within a given degeneracy threshold and
with maximum coverage, that collectively covers all the input sequences. The problem
is formulated as follows:
Definition 1: The Multiple Degenerate Primer Selection Problem (MDPSP):
Given n DNA strings of length m, find a set P of degenerate primers that
collectively covers all the input sequences such that each u ∈ P has length L
and degeneracy at most D [2].
This paper introduces a new variant of the above problem that allows a certain
number of mismatches between the primers and the DNA fragments:
Definition 2: The Multiple Degenerate Primer Selection Problem with Errors (MDPSPE): Given a set of n DNA sequences, each of length m, and integers
D, L, and E, find a set P of degenerate primers that matches all the
input sequences with up to E errors (mismatches) such that each u ∈ P has length L
and degeneracy at most D.
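To make the constraints of Definition 2 concrete, the following Python sketch checks whether a candidate primer set satisfies them. It is a minimal illustration under stated assumptions: primers are written as IUPAC strings, and the helper names (mismatches, covers, is_feasible) are hypothetical rather than taken from the paper. Setting E = 0 gives the corresponding check for Definition 1.

```python
from math import prod

# Standard IUPAC single-letter codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def mismatches(primer: str, lmer: str) -> int:
    """Count positions where the base of `lmer` is not allowed by `primer`."""
    return sum(base not in IUPAC[sym] for sym, base in zip(primer, lmer))

def covers(primer: str, seq: str, E: int) -> bool:
    """True if some window of `seq` matches `primer` with at most E mismatches."""
    L = len(primer)
    return any(mismatches(primer, seq[j:j + L]) <= E
               for j in range(len(seq) - L + 1))

def is_feasible(P: list[str], seqs: list[str], L: int, D: int, E: int) -> bool:
    """Check length, degeneracy, and coverage constraints for primer set P."""
    primers_ok = all(
        len(u) == L and prod(len(IUPAC[s]) for s in u) <= D for u in P
    )
    coverage_ok = all(any(covers(u, s, E) for u in P) for s in seqs)
    return primers_ok and coverage_ok

seqs = ["ACGTAC", "TTCGTA"]
print(is_feasible(["MCGT"], seqs, L=4, D=2, E=0))  # False: TTCGTA needs one mismatch
print(is_feasible(["MCGT"], seqs, L=4, D=2, E=1))  # True
```

The example shows the effect that motivates the new variant: a single primer that fails to cover both toy sequences exactly does cover them once one mismatch is tolerated.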
In the following section, the paper provides a short history of the daunting
problem of primer design. In Section 3, we introduce our algorithms, DPS-HDE
and DPS-HDR. Section 4 discusses the performance of DPS-HDR in comparison to DPS-HD, as well as the performance of DPS-HDE, on
randomly generated data sets. Section 5 concludes the paper.
2 The History of Primer Design
The scientific community has embraced the challenge of primer design, producing extensive research. Most of the pioneering works [10, 7] focus on optimizing
the quality of a single non-degenerate primer pair based on biochemical factors
such as primer and target length, melting temperatures, base composition, and
GC-content. From one of these earlier works evolved the primer minimization
problem, also known as the Primer Selection Problem (PSP) [12], in which the
focus shifts towards minimizing the number of non-degenerate primers that would
amplify a given set of sequences.
The shortcomings of non-degenerate primer design, however, are that only
genes with known sequence information are amplified and that the primers work
effectively only on small sets of sequences.
With degenerate primers, by contrast, unknown genes from
related families can very often be identified. Moreover, degenerate primers work well for large genomic
data and help amplify targets for which only amino acid sequences are available.
For instance, the algorithm CODEHOP [13] was developed to isolate gene homologs from various genomes. The program produces a pool of primers with a
3’ degenerate core and a 5’ non-degenerate clamp region for each block of aligned
amino acid sequences. Like CODEHOP, DePiCt [16] generates primers
for multiply-aligned protein sequences; it uses hierarchical clustering to group
amino acid sequences together based on their BlockSimilarity [16] and produces
degenerate primers with low degeneracy and high coverage.
The coverage is the number of input sequences that are covered by the
primers. To increase the coverage the degeneracy of the primers has to be increased. However, highly degenerate primers increase the presence of non-specific
primers, and lessen the concentration of the effective ones leading to undesired
amplification products.
Motivated by this paradox, Linhart and Shamir [8] introduce the Degenerate Primer Design (DPD) problem, which offers a trade-off between degeneracy
and coverage. DPD looks for a degenerate primer pair with degeneracy at most d
that would amplify as many input sequences as possible. The authors
introduce several variants of the problem, optimizing either the coverage, Maximum Coverage DPD (MC-DPD), or the degeneracy, Minimum Degeneracy DPD (MD-DPD) [9, 8]. The paper proposes efficient approximation algorithms for solving
the MC-DPD problem and, based on these approximations, introduces HYDEN.
The algorithm was implemented as a part of the DEFOG program, a novel approach to Decipher Families of Genes [6]. DEFOG was successfully tested on the
human Olfactory Receptor (OR) gene superfamily; it almost tripled the size of
the initial collection of OR genes.
The Maximum Coverage DPD paved the way to the recently introduced Multiple Degenerate Primer Selection Problem (MDPSP) [2]. Most of the algorithms
designed for MDPSP attempt to maximize the coverage of degenerate primers
at each selection step. Souvenir et al. [15] introduce two variants of the problem called Primer Threshold Multiple Degenerate Primer Design (PT-MDPD)
and Total Threshold MDPD (TT-MDPD). Both versions were proven to
be NP-hard. The authors implement an iterative beam search algorithm, MIPS,
for solving the two cases. At each step of the iteration, the algorithm chooses the
primers with the least degeneracy. The primers for extension are stored in a beam
whose size b is a constant input parameter. Increasing the beam size can improve the
solution quality; however, it also increases the running time, O(b n^3 m p).
Following a similar iterative search, Balla and Rajasekaran [1] introduce the Degenerate Primer Search (DPS) algorithm. To overcome the dependence on beam size, the algorithm selects primer candidates from a collection
of best primers, b, based on their coverage-efficiency, e. Using e as a
scoring function, the authors were able to improve both the execution time,
O(b |Σ| log_|Σ|(d) n^2 m p), and the solution quality in comparison to MIPS.
Now we turn our attention to DPS-HD (the acronym stands for Degenerate
Primer Selector by Hamming Distance) [2], the most recent algorithm for solving
MDPSP. The algorithm achieves a better running time than MIPS
and DPS by introducing a greedy approach that circumvents the dependency on
b. Namely, both MIPS and DPS require O(b n^2 m) time for one
iteration step, that is, for creating a collection of best primers [1] or producing the
next generation of primers [14]. DPS-HD eliminates the variable b from its time
complexity by calculating the Hamming distances (HD) between the candidate
primer and all the l-mers in the set, and chooses the best candidate for merging
with the primer based on the lowest HD value. The Hamming distance is equal to
the number of positions at which the nucleotides of two DNA strings of equal length do not
match (mismatches). For instance, consider two l-mers u = ACGT and
v = ACTT; then HD(u, v) = 1. The time it takes to calculate HD for one generation
of primers is O(n m^2), and the overall time for expanding and choosing the best
candidate is O(|Σ| log_|Σ|(d) n m^2 p), where d is the degeneracy threshold, n is
the number of sequences, m is the sequence length, and p is the cardinality of
the final primer set.
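The Hamming distance and the search for the closest l-mers can be sketched as follows. This is an illustrative Python sketch, not the implementation from [2] or from this paper; the function names hamming, lmers, and closest_lmers are hypothetical.

```python
def hamming(u: str, v: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(u, v))

def lmers(seq: str, l: int) -> list[str]:
    """All m - l + 1 substrings of length l, in order of their offset."""
    return [seq[j:j + l] for j in range(len(seq) - l + 1)]

def closest_lmers(u: str, seqs: list[str]) -> tuple[int, list[tuple[int, int]]]:
    """Minimum HD from u to any l-mer, with the (sequence, offset) locations."""
    best, where = len(u) + 1, []
    for i, seq in enumerate(seqs):
        for j, v in enumerate(lmers(seq, len(u))):
            d = hamming(u, v)
            if d < best:
                best, where = d, [(i, j)]   # strictly better: restart the list
            elif d == best:
                where.append((i, j))        # tie: remember this location too
    return best, where

print(hamming("ACGT", "ACTT"))                     # 1
print(closest_lmers("ACGT", ["ACTTA", "GGACG"]))   # (1, [(0, 0)])
```

The list of locations returned by closest_lmers corresponds to the set of merge candidates at minimum Hamming distance.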
Our approach to solving the MDPSP is quite similar to DPS-HD. We also
use the Hamming distance to choose the best l-mers v
to merge the primer candidates with. However, while DPS-HD merges a selected
primer u with all the l-mers at minimum Hamming distance, our algorithms
choose only one random v at minimum HD to introduce degeneracy.
3 The New Algorithms
In this section we introduce two new algorithms, DPS-HDR for solving MDPSP and DPS-HDE for solving its variant MDPSPE. Both design a set of degenerate primers with maximum coverage
that covers all the input DNA sequences. The algorithms work as follows.
3.1 DPS-HDR
Let us consider one cycle of DPS-HDR. First, it selects a random sequence k
from N = {1, 2, ..., n}, where all sequences have a common length m. Then DPS-HDR generates the set of l-mers of N_k, denoted N_kj,
1 ≤ j ≤ (m − l + 1). To merge the first l-mer N_k1, call it u, with the most similar l-mer
among N_ij, 1 ≤ i ≤ n, 1 ≤ j ≤ (m − l + 1), the algorithm calculates the Hamming
distances between u and all the l-mers of the sequences still alive (not yet covered) and determines the
minimum HD. When HD(u, l-mer) = 0, the coverage of u is increased; otherwise,
DPS-HDR retains the list of locations at minimum Hamming distance in B. To
expand the current primer, in our case u, a random l-mer v is
selected from B. After merging u and v into a degenerate primer, call it u',
we repeat these steps with u'. This continues until the degeneracy of u' reaches the
threshold or all the sequences in N are exhausted, |N| = 0. One cycle is over
when the above steps have been repeated for all the l-mers in N_k. In the final
step, the algorithm maximizes the coverage by retaining the best u from N_k, the
one that covered the most sequences, and adds it to S, the set of best primers.
Finally, DPS-HDR enters a new cycle.
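The merge step described above can be illustrated with a small Python sketch. It assumes that merging means taking the position-wise union of allowed bases, which is the usual way degeneracy is introduced but is not spelled out in the text; the representation (a list of base sets) and the names to_primer, merge, degeneracy, and distance are hypothetical.

```python
from math import prod

def to_primer(lmer: str) -> list[set[str]]:
    """Represent a plain l-mer as a primer with one allowed base per position."""
    return [{base} for base in lmer]

def merge(primer: list[set[str]], v: str) -> list[set[str]]:
    """Position-wise union of a (possibly degenerate) primer with an l-mer v."""
    return [pos | {base} for pos, base in zip(primer, v)]

def degeneracy(primer: list[set[str]]) -> int:
    """Product of the number of allowed bases at each position."""
    return prod(len(pos) for pos in primer)

def distance(primer: list[set[str]], v: str) -> int:
    """Mismatches between a degenerate primer and an l-mer v."""
    return sum(base not in pos for pos, base in zip(primer, v))

u = to_primer("ACGT")
u1 = merge(u, "ACTT")    # {A}{C}{G|T}{T}
u2 = merge(u1, "CCTT")   # {A|C}{C}{G|T}{T}
print(degeneracy(u1), degeneracy(u2))   # 2 4
print(distance(u2, "CCGT"))             # 0
```

The last line shows why coverage can grow faster than the number of merges: after two merges the primer also matches CCGT, an l-mer that was never merged explicitly.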
3.2 DPS-HDE: An algorithm for solving MDPSPE
The earlier mentioned algorithms for MDPSP, such as MIPS, DPS, and
DPS-HD, design primers that perfectly complement the target sequences (HD =
0). Our algorithm DPS-HDE (acronym for Degenerate Primer Selector by Hamming Distance with Errors), however, allows a certain number of errors (mismatches) between
the primers and the input sequences. In other words:
Definition 3: E is an input constant corresponding to the number of mismatches allowed between u and the input sequences to be covered; that is, a sequence is covered when HD ≤ E. For
this variant of the multiple degenerate primer design, we formulated the problem
MDPSPE in Definition 2.
Algorithm DPS-HDE greatly resembles the previously described DPS-HDR.
It implements the same ranking based on minimum Hamming distances for choosing the best l-mer to merge the candidate primer with. However, in line 14 of Algorithm DPS-HDR,
where the Hamming distances between u and all the l-mers in
sequences yet to be covered are calculated, we have to make a small adjustment that accounts
for E:
    repeat
14:     Calculate hamming distance (HD) between u and all the l-mers of other
        sequences alive, and add to coverage where HD ≤ E;
15:     Select a random l-mer, v, with minimum hamming distance and expand
        u with v;
    until ( degeneracy reaches the threshold, dmax ) or ( all the sequences covered )
By allowing errors, one would expect a reduction in the cardinality of the final primer
set and a shorter running time, since the candidate primers’ coverage will most likely
increase.
Algorithm DPS-HDR
1: Let S be the list of selected primers; initially, S := null;
2: Let N be a set of sequences of length m; N := {1,2,...,n};
3: while ( |N| > 0 ) do
4: {
5:     BestPrimer := null;
6:     BestCoverage := ();
7:     Select a random sequence k from N;
8:     Generate the l-mers, u, of k;
9:     for all (u in k) do
10:    {
11:        CurrentPrimer := u;
12:        CurrentCoverage := ();
13:        repeat
14:            Calculate hamming distance (HD) between u and all the l-mers of
               other sequences alive, and add to coverage where HD = 0;
15:            Select a random l-mer, v, with minimum hamming distance and expand u with v;
16:        until ( degeneracy reaches the threshold, dmax ) or ( all the sequences covered )
17:        if (CurrentCoverage > BestCoverage) then
18:        {
19:            BestPrimer := CurrentPrimer;
20:            BestCoverage := CurrentCoverage;
21:        }
22:        end if
23:    }
24:    end for
25:    S gets BestPrimer; remove the sequences in BestCoverage from N;
26: }
27: end while
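The listing above can be turned into a short runnable sketch. The Python code below is an illustration under several assumptions, not the Perl program used in the experiments: primers are lists of base sets, merging is a position-wise union, a merge that would push the degeneracy above dmax is rejected, all sequences are at least l bases long, and the names (dps_hdr, grow, best_hits, distance) and the seed parameter are hypothetical. With E = 0 it follows DPS-HDR; a positive E applies the DPS-HDE change to line 14.

```python
import random
from math import prod

def distance(primer, lmer):
    """Mismatches between a degenerate primer (list of base sets) and an l-mer."""
    return sum(base not in pos for pos, base in zip(primer, lmer))

def best_hits(primer, seq, l):
    """Minimum distance to any l-mer of seq, and the l-mers achieving it."""
    windows = [seq[j:j + l] for j in range(len(seq) - l + 1)]
    d = min(distance(primer, w) for w in windows)
    return d, [w for w in windows if distance(primer, w) == d]

def grow(u, alive, seqs, l, dmax, E, rng):
    """Lines 11-16: expand l-mer u greedily; return (primer, covered ids)."""
    primer = [{b} for b in u]
    while True:
        covered, pool, dmin = set(), [], None
        for i in alive:
            d, hits = best_hits(primer, seqs[i], l)
            if d <= E:
                covered.add(i)            # line 14: add to coverage
            elif dmin is None or d < dmin:
                dmin, pool = d, hits      # new minimum: restart candidate list B
            elif d == dmin:
                pool += hits
        if len(covered) == len(alive):
            return primer, covered        # all alive sequences are covered
        v = rng.choice(pool)              # line 15: random v at minimum HD
        merged = [pos | {b} for pos, b in zip(primer, v)]
        if prod(len(pos) for pos in merged) > dmax:
            return primer, covered        # assumption: never exceed dmax
        primer = merged

def dps_hdr(seqs, l, dmax, E=0, seed=0):
    """Greedy randomized primer selection; returns the list S of primers."""
    rng = random.Random(seed)
    alive, S = set(range(len(seqs))), []
    while alive:                                        # line 3
        k = rng.choice(sorted(alive))                   # line 7
        best, best_cov = None, set()
        for j in range(len(seqs[k]) - l + 1):           # line 9
            primer, cov = grow(seqs[k][j:j + l], alive, seqs, l, dmax, E, rng)
            if len(cov) > len(best_cov):                # line 17
                best, best_cov = primer, cov
        S.append(best)                                  # line 25
        alive -= best_cov
    return S

primers = dps_hdr(["ACGTACGT", "ACGTTCGT", "TTTTGGGG"], l=4, dmax=4)
print(len(primers), primers)
```

Every l-mer taken from sequence k covers k itself, so each pass of the outer loop removes at least one sequence and the procedure terminates. Because v is drawn at random, different seeds may return primer sets of different sizes.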
4 Experimental Results
In this section we present our results for the performance of both DPS-HDR
and DPS-HDE.
We compared algorithms DPS-HD and DPS-HDR both in quality (the
number of primers generated) and in execution time. The experiment
for DPS-HD was carried out on a PowerEdge 2600 Linux server with 4 GB of
RAM and dual 2.8 GHz Intel Xeon CPUs. DPS-HDR and DPS-HDE were run
on an Intel(R) Core(TM)2 CPU 6700 at 2.66 GHz with 2 GB of RAM. The authors of [2]
implemented DPS-HD in Java, while DPS-HDR was implemented in
Perl. To compare the algorithms, the method used in [2] was followed. The
experiment was run for data sizes n = 20, 40, 60, 80, 100, 120. For each n
we generated random DNA sequences with a common length of m = 300
nucleotides. The program searched for primers of length l = 15 bases with
degeneracy thresholds d = 10000 and d = 100000. Note that DPS-HDR was run
only once for each of the above data sizes due to a shortage of resources
and time.
Overall, DPS-HDR generated the same number of primers as
DPS-HD for small sets of data, so the quality remained the same, but the running time of the algorithm was observably longer than that of DPS-HD. This
behavior could be partially due to the discrepancy between the processors, but
the main reason for the slow execution is most likely the difference between the
implementation languages.
The results for randomly generated datasets can be viewed in Table 1 and
Table 2. The performance of DPS-HD compared to MIPS and DPS can be viewed
in [2].
DPS-HDE was also implemented in Perl and run
under the same experimental conditions as DPS-HDR. We ran the program for
degeneracy thresholds d = 10000 and d = 100000 and error bounds E = 1, E = 2, and
E = 3. The results can be viewed in Table 3 and Table 4. Introducing errors
greatly decreased the cardinality of the final degenerate primer set
and also reduced the running time: for n = 120 and d = 10000, DPS-HDR produced
15 primers in 16474.85 s (Table 1), whereas DPS-HDE produced 11, 5, and 2 primers
in 9780.00 s, 3100.85 s, and 1318.22 s for E = 1, 2, and 3, respectively (Table 3).
5 Conclusion
In this paper we introduced two new algorithms that design primers for MP-PCR
reactions. One of the algorithms, DPS-HDR, generates degenerate primers for
MDPSP. Compared to the recently introduced DPS-HD,
it showed longer running times for the same quality of output. Our work also
extended the problem of MDPSP into a new variant, namely MDPSPE. This
version allows errors (mismatches) between the primers and the targeted
DNA sequences in order to increase the coverage and decrease the cardinality of
the degenerate primer set in the final solution. For solving this new variant the
paper introduced algorithm DPS-HDE.
We believe that after optimizing the efficiency of the Perl programs for DPS-HDR and DPS-HDE, we will be able to improve the current running times.
Furthermore, in order to achieve more accurate results, we plan to rerun the
experiment for larger data sizes and not only for randomly generated but also
for biological data. Most importantly, we will try to initiate a collaboration with
molecular biologists in order to validate the performance of these algorithms in
wet-lab experiments.
Acknowledgments. This research was a part of the BioGrid Initiatives at
the University of Connecticut and was supported by the REU-NSF grant CCF-0755373. I would like to thank my mentor, Professor Sanguthevar Rajasekaran,
for introducing me to the problem and letting me participate in his research,
and Dolly Sharma for her everlasting help and guidance throughout the project.
References
1. Sudha Balla and Sanguthevar Rajasekaran. An efficient algorithm for minimum degeneracy primer selection, March 2007.
2. Sudha Balla, Sanguthevar Rajasekaran, and Ion Mandoiu. Greedy heuristics for degenerate primer selection.
3. J S Chamberlain, R A Gibbs, J E Ranier, P N Nguyen, and C T Caskey. Deletion screening of the Duchenne muscular dystrophy locus via multiplex DNA amplification, December 1988.
4. A Cornish-Bowden. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984, May 1985.
5. Henry A. Erlich, David Gelfand, and John J. Sninsky. Recent advances in the polymerase chain reaction, June 1991.
6. Tania Fuchs, Barbora Malecova, Chaim Linhart, Roded Sharan, Miriam Khen, Ralf Herwig, Dmitry Shmulevich, Rani Elkon, Matthias Steinfath, John K O’Brien, Uwe Radelof, Hans Lehrach, Doron Lancet, and Ron Shamir. DEFOG: a practical scheme for deciphering families of genes, September 2002.
7. Robert Giegerich, Folker Meyer, and Chris Schleiermacher. Genefisher-software support for the detection of postulated genes. In David J. States, Pankaj Agarwal, Terry Gaasterland, Lawrence Hunter, and Randall Smith, editors, ISMB, pages 68–77. AAAI, 1996.
8. Chaim Linhart and Ron Shamir. The degenerate primer design problem, 2002.
9. Chaim Linhart and Ron Shamir. The degenerate primer design problem: Theory and applications. Journal of Computational Biology, 12(4):431–456, 2005.
10. T Lowe, J Sharefkin, S Q Yang, and C W Dieffenbach. A computer program for selection of oligonucleotide primers for polymerase chain reactions, April 1990.
11. P Markoulatos, N Siafakas, and M Moncany. Multiplex polymerase chain reaction: a practical approach, 2002.
12. William R. Pearson, Gabriel Robins, Dallas E. Wrege, and Tongtong Zhang. On the primer selection problem in polymerase chain reaction experiments. Discrete Applied Mathematics, 71(1-3):231–246, 1996.
13. T M Rose, E R Schultz, J G Henikoff, S Pietrokovski, C M McCallum, and S Henikoff. Consensus-degenerate hybrid oligonucleotide primers for amplification of distantly related sequences, April 1998.
14. Richard Souvenir, Jeremy Buhler, Gary Stormo, and Weixiong Zhang. An iterative method for selecting degenerate multiplex PCR primers, 2007.
15. Richard Souvenir, Jeremy Buhler, Gary D. Stormo, and Weixiong Zhang. Selecting degenerate multiplex PCR primers. In Gary Benson and Roderic D. M. Page, editors, WABI, volume 2812 of Lecture Notes in Computer Science, pages 512–526. Springer, 2003.
16. Xintao Wei, David N Kuhn, and Giri Narasimhan. Degenerate primer design via clustering, 2003.
Appendix
Table 1. Performance on random datasets: l = 15, d = 10000.
DPS-HDR
# of seq (n)    # of primers (p)    time in sec (t)
20              4                   533.10
40              6                   1971.51
60              9                   4561.48
80              11                  7675.19
100             13                  11769.60
120             15                  16474.85
Table 2. Performance on random datasets: l = 15, d = 100000.
DPS-HDR
# of seq (n)    # of primers (p)    time in sec (t)
20              3                   590.27
40              4                   2460.49
60              6                   5607.98
80              7                   9729.31
100             8                   13511.15
120             10                  19613.65
Table 3. Performance on random datasets: l = 15, d = 10000.
DPS-HDE
            Error = 1                   Error = 2                   Error = 3
# of seq    # of primers  time in sec   # of primers  time in sec   # of primers  time in sec
(n)         (p)           (t)           (p)           (t)           (p)           (t)
20          3             461.60        2             278.74        1             35.42
40          6             1664.21       3             742.17        1             369.59
60          7             3236.05       4             1277.97       2             651.43
80          8             5012.94       4             1772.90       2             857.57
100         10            7286.97       4             2446.25       2             1071.56
120         11            9780.00       5             3100.85       2             1318.22
Table 4. Performance on random datasets: l = 15, d = 100000.
DPS-HDE
            Error = 1                   Error = 2                   Error = 3
# of seq    # of primers  time in sec   # of primers  time in sec   # of primers  time in sec
(n)         (p)           (t)           (p)           (t)           (p)           (t)
20          2             484.93        1             21.20         1             1.24
40          3             1402.28       1             19.70         1             4.47
60          4             2571.24       1             116.91        1             3.44
80          4             3691.13       2             1627.81       1             4.48
100         4             4779.16       2             2086.23       1             5.64
120         4             6164.56       2             2492.41       1             6.51