Combinatorial Optimization Problems in Computational Biology

Highly Scalable Algorithms for
Robust String Barcoding
Bhaskar DasGupta*
Kishori M. Konwar
Ion Mandoiu
Alex Shvartsman
Computer Science & Engineering Department
University of Connecticut
Storrs, CT
*Department of Computer Science
University of Illinois at Chicago
Chicago, IL
Motivation
There are many critical situations when one needs to
rapidly identify an unknown genomic sequence from
among a given set of known sequences:
• Rapid identification of pathogens in epidemic outbreaks
• Monitoring of microbial communities, e.g., in environmental studies
• Fast database search

Possible Approaches
• Sequencing based: sequence the unknown DNA, then use
similarity search programs such as BLAST to identify it
among the known pathogen sequences in a database
  - Sequencing is prohibitively expensive and time consuming
• Hybridization based: identify the unknown sequence
by testing for the presence of certain subsequences
  - Subsequence tests can be performed quickly and at low cost
using a variety of hybridization-based methods
Sequence Fingerprints
• For each sequence, find a subsequence that appears
in that sequence and only in that sequence

           GTTGC  GTTC  CAT
CAGTTGC      1     0     0
CAGTTC       0     1     0
CATGGA       0     0     1

• Sequence barcodes: 0/1 vectors
• When using fingerprints, barcode length = #sequences
String Barcoding
• (Borneman et al.’01, Rash & Gusfield’02): Unique
occurrence of tested subsequences not needed, as
long as 0/1 barcodes are unique

           TG  CAGT
CAGTTGC     1    1
CAGTTC      0    1
CATGGA      1    0

• When using non-unique subsequences, barcode length
can be much smaller than #sequences
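
The barcode idea can be illustrated with a minimal Python sketch. The function names `barcodes` and `is_valid` are hypothetical helpers introduced here for illustration; the sketch assumes exact substring matching and ignores degenerate bases.

```python
def barcodes(seqs, distinguishers):
    """Map each sequence to its 0/1 barcode: bit k is 1 iff
    distinguishers[k] occurs in the sequence as a substring."""
    return {s: tuple(int(d in s) for d in distinguishers) for s in seqs}


def is_valid(seqs, distinguishers):
    """True iff all sequences receive pairwise different barcodes."""
    codes = barcodes(seqs, distinguishers)
    return len(set(codes.values())) == len(seqs)


seqs = ["CAGTTGC", "CAGTTC", "CATGGA"]
print(barcodes(seqs, ["TG", "CAGT"]))   # two tests give three distinct barcodes
print(is_valid(seqs, ["TG"]))           # a single test cannot separate three sequences
```

With the two non-unique distinguishers from the slide, the three sequences receive the barcodes (1,1), (0,1) and (1,0).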
Overview
• Problem Formulation and Previous Work
• Greedy Setcover Algorithm
• Experimental Results
• Conclusions
Problem Definition
Given:
Genomic sequences $g_1, \dots, g_n$
Find:
Minimum number of distinguisher strings $t_1, \dots, t_k$
Such that:
For every $g_i \ne g_j$ there exists a distinguisher $t_l$
which is a substring of $g_i$ or $g_j$ but not of both
- At least $\lceil \log_2 n \rceil$ distinguishers are needed, since $k$ tests yield at most $2^k$ distinct barcodes
- $n$ distinguishers are always sufficient
Computational Complexity
• [Berman et al.’04] Cannot be approximated within a
factor of $(1-\epsilon)\ln n$ for any $\epsilon > 0$ unless $NP \subseteq DTIME(n^{\log\log n})$
Rash & Gusfield Integer Program
• A non-redundant set of candidate distinguishers is
generated using a suffix tree
• One variable $v_i$ for each candidate distinguisher $x_i$
  - $v_i = 1$ if $x_i$ is selected
  - $v_i = 0$ if $x_i$ is not selected
Integer Program Example
Sequences:
1. CAGTGC
2. CAGTTC
3. CCATGGA

Matches of two of the candidates:
             TG  ATGGA
1. CAGTGC     1    0
2. CAGTTC     0    0
3. CCATGGA    1    1

Minimize
 VTG + VATGGA + VCAGT + VTTC + VGTGC     #objective function
Such that
 VTG + VTTC + VGTGC >= 1                 #constraint to cover pair 1,2
 VATGGA + VCAGT + VGTGC >= 1             #constraint to cover pair 1,3
 VTG + VATGGA + VCAGT + VTTC >= 1        #constraint to cover pair 2,3
Binaries
 VTG VATGGA VCAGT VTTC VGTGC             #all variables are 0/1
End
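
For an instance this small, the integer program can be cross-checked by exhaustive search. The following Python sketch is not the Rash & Gusfield method (it uses no suffix tree and no ILP solver); `separated_pairs` and `min_distinguishers` are hypothetical helper names, and the candidate list is simply copied from the example.

```python
from itertools import combinations

sequences = ["CAGTGC", "CAGTTC", "CCATGGA"]
candidates = ["TG", "ATGGA", "CAGT", "TTC", "GTGC"]


def separated_pairs(cand, seqs):
    """Index pairs (i, j) such that cand occurs in exactly one of the two sequences."""
    hits = [cand in s for s in seqs]
    return {(i, j) for i, j in combinations(range(len(seqs)), 2) if hits[i] != hits[j]}


def min_distinguishers(cands, seqs):
    """Smallest subset of cands that separates every pair (brute force)."""
    all_pairs = set(combinations(range(len(seqs)), 2))
    cover = {c: separated_pairs(c, seqs) for c in cands}
    for k in range(1, len(cands) + 1):
        for subset in combinations(cands, k):
            if set().union(*(cover[c] for c in subset)) == all_pairs:
                return list(subset)
    return None  # the candidates cannot separate all pairs


best = min_distinguishers(candidates, sequences)
print(best)
```

Each set `cover[c]` corresponds to the constraints in which the variable for `c` appears, so the search returns a minimum-size feasible solution of the program above. Exhaustive search is only feasible for a handful of candidates.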
Limitations of Integer Program Method
• Works only for moderately sized datasets
  - 50-150 sequences
  - Average sequence length ~1000 nucleotides
  - Up to 4 hours needed to come within 20% of optimum
Information Content Heuristic
• [Berman et al. 2004]
  - Keep track of the partition defined by the distinguishers
selected so far
  - In every step, choose the candidate that reduces partition
entropy by the largest amount
• Theorem: the Information Content Heuristic always
finds a number of distinguishers within a factor of
$1 + \ln n$ of the optimum
Limitations of ICH
• Real genomic sequences contain degenerate
nucleotides (e.g., N for any of {A,T,C,G}) due to
sequencing errors and known single nucleotide
polymorphisms
• Distinguisher-to-sequence matches:
  - Perfect matches
  - Perfect mismatches
  - Uncertain matches

  Sequence ATCNAT:
  Distinguisher  Match
  ATC            1
  CCC            0
  CCA            ?

• Information content cannot be defined in the presence
of uncertain matches
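
The three match types can be made concrete with a small Python sketch. It rests on assumptions not stated on the slide: `N` is the only degenerate symbol handled, the distinguisher itself contains no degenerate bases, and the function name `match_status` is hypothetical.

```python
def match_status(sequence, candidate):
    """Return '1' (perfect match), '0' (perfect mismatch) or '?' (uncertain)."""
    uncertain = False
    for start in range(len(sequence) - len(candidate) + 1):
        window = sequence[start:start + len(candidate)]
        # A definite mismatch needs two different bases at a non-N position.
        if any(w != c and w != "N" for w, c in zip(window, candidate)):
            continue
        if "N" not in window:
            return "1"       # exact occurrence that involves no degenerate base
        uncertain = True     # agrees everywhere except at N positions
    return "?" if uncertain else "0"


for cand in ["ATC", "CCC", "CCA"]:
    print(cand, match_status("ATCNAT", cand))
```

On the sequence ATCNAT this gives 1 for ATC, 0 for CCC and ? for CCA, as in the table above.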
Other Heuristics
• (Cazalis et al 2004): greedy setcover, simulated
annealing, and genetic algorithms for
distinguisher selection
• To achieve practical running time, only a small
random subset (2000 candidates) of all
candidate distinguishers is considered
  - No data is provided on the loss of solution quality due
to this restriction
Setcover Greedy Heuristic
• Phase I: Candidate Generation
  - Generate a representative set of candidate
distinguishers from the source sequences
• Phase II: Greedy Distinguisher Selection
  - In every step, choose the candidate that distinguishes the
largest number of not yet distinguished pairs
Candidate Generation
• A set of candidate distinguishers guaranteed to
contain an optimum solution is generated from the
sequences
• We do not generate certain redundant candidates
  - A candidate is redundant if there is another candidate
that appears in exactly the same set of sequences
  - For every sequence we generate only one of the
substrings that appear exclusively in that sequence
Efficient Candidate Generation
• Our implementation uses simple array data structures
  - We generate candidates in increasing order of length
  - Exact match positions for candidates of length $l-1$ are used to
generate the exact matches for candidates of length $l$
• Candidates that do not satisfy the given individual
biochemical constraints, such as minimum/maximum
length, GC content, or melting temperature, are discarded
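
The length-by-length generation can be sketched in Python as follows. This is an illustrative toy, not the authors' array-based implementation: it uses dictionaries, handles only exact matches, and applies only length and GC-content filters. The names `gc_content` and `generate_candidates` and all parameter defaults are assumptions made for the example.

```python
def gc_content(s):
    """Fraction of G and C bases in s."""
    return sum(ch in "GC" for ch in s) / len(s)


def generate_candidates(seqs, min_len=2, max_len=6, gc_range=(0.0, 1.0)):
    """Return {candidate: frozenset of indices of the sequences containing it},
    keeping one representative per distinct occurrence set."""
    # occ[sub] = list of (sequence index, start) for every occurrence of sub
    occ = {"": [(i, p) for i, s in enumerate(seqs) for p in range(len(s))]}
    chosen = {}  # occurrence set -> first (shortest) candidate found with it
    for length in range(1, max_len + 1):
        longer = {}
        for sub, places in occ.items():
            for i, p in places:
                end = p + length - 1  # character that extends the shorter match
                if end < len(seqs[i]):
                    longer.setdefault(sub + seqs[i][end], []).append((i, p))
        occ = longer
        if length < min_len:
            continue
        for cand, places in occ.items():
            where = frozenset(i for i, _ in places)
            if len(where) == len(seqs):
                continue  # occurs in every sequence, so it separates no pair
            if gc_range[0] <= gc_content(cand) <= gc_range[1]:
                chosen.setdefault(where, cand)
    return {cand: where for where, cand in chosen.items()}


print(generate_candidates(["CAGTTGC", "CAGTTC", "CATGGA"]))
```

Keying `chosen` by occurrence set drops redundant candidates, including all but one of the substrings exclusive to a single sequence.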
Distinguisher Selection as Set Cover
• Set Cover Problem: given a universal set U and a
family of subsets, find a minimum number of subsets
covering U
• Distinguisher selection is a special case of set cover:
  - Elements to be covered are the pairs of sequences
  - Each candidate distinguisher defines the set of pairs that it
separates
• By a classical result, the greedy algorithm has an
approximation factor of $1 + \ln|U|$
  - With $n$ sequences, $|U| = n(n-1)/2 < n^2$, so setcover greedy has an
approximation factor of about $2\ln n$ for distinguisher selection
Distinguisher Selection
• Start with an empty set D of distinguishers
• While there are pairs of sequences not yet
distinguished, do:
  - Compute for each remaining candidate $c$ its coverage gain
$\Delta(c, D)$, the number of not yet distinguished pairs of
sequences that are distinguished by $c$
  - Add the candidate with maximum coverage gain to $D$
• Return $D$
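
The selection loop above can be sketched directly in Python. This is a deliberately naive version that recounts pairs explicitly in every step; the function name `greedy_select` is hypothetical and only exact substring matches are considered.

```python
from itertools import combinations


def greedy_select(seqs, candidates):
    """Greedy setcover: repeatedly add the candidate that separates the most
    not yet distinguished pairs of sequences."""
    hits = {c: [c in s for s in seqs] for c in candidates}
    undistinguished = set(combinations(range(len(seqs)), 2))

    def gain(c):
        # Coverage gain: open pairs with exactly one sequence matched by c.
        return sum(hits[c][i] != hits[c][j] for i, j in undistinguished)

    selected = []
    while undistinguished:
        best = max(candidates, key=gain)
        if gain(best) == 0:
            raise ValueError("candidates cannot distinguish all sequences")
        selected.append(best)
        undistinguished = {(i, j) for i, j in undistinguished
                           if hits[best][i] == hits[best][j]}
    return selected


seqs = ["CAGTTGC", "CAGTTC", "CATGGA"]
chosen = greedy_select(seqs, ["GTTGC", "GTTC", "CAT", "TG", "CAGT"])
print(chosen)
```

Ties are broken by candidate order here; any other tie-breaking rule is equally consistent with the description.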
Computation of $\Delta(c, D)$: Example
Sequences: CATCAGA, TTCAGT, TAT, AATCAG, AATAG
• Step 1: D = { }, c = TCAG
  - c matches CATCAGA, TTCAGT, and AATCAG (3 of the 5 sequences)
  - $\Delta(c, D) = 3 \times (5 - 3) = 6$
  - TCAG is selected: D = {TCAG}, which partitions the sequences into
{CATCAGA, TTCAGT, AATCAG} and {TAT, AATAG}
• Step 2: D = {TCAG}, c = AAT
  - c matches AATAG (in the subset of size 2) and AATCAG (in the subset of size 3)
  - $\Delta(c, D) = 1 \times (2 - 1) + 1 \times (3 - 1) = 3$
  - AAT is selected: D = {TCAG, AAT}
Computation of $\Delta(c, D)$
$$\Delta(c, D) = \sum_{i=1}^{k} |S_i \cap M_c| \cdot |S_i \setminus M_c|$$
• $S_1, S_2, \dots, S_k$ are the subsets in the partition defined by $D$
• $M_c$ is the set of matches of candidate $c$
• Using simple data structures, the computation can be done in
linear time (in the number of sequences)
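
A minimal Python sketch of this formula, assuming the partition is stored as an array `part_of` (subset index of each sequence) plus an array `part_sizes`; both names, and the function name `coverage_gain`, are illustrative choices rather than the authors' data structures.

```python
def coverage_gain(part_of, part_sizes, matches):
    """Sum over partition subsets S_i of |S_i intersect M_c| * |S_i minus M_c|.

    part_of[s]  : index of the partition subset containing sequence s
    part_sizes  : size of every subset S_i
    matches     : indices of the sequences matched by candidate c (M_c)
    """
    matched = {}  # subset index -> number of matched sequences in that subset
    for s in matches:
        matched[part_of[s]] = matched.get(part_of[s], 0) + 1
    return sum(m * (part_sizes[i] - m) for i, m in matched.items())


# Sequences 0..4 = CATCAGA, TTCAGT, TAT, AATCAG, AATAG
gain_tcag = coverage_gain([0, 0, 0, 0, 0], [5], [0, 1, 3])     # D = { }
gain_aat = coverage_gain([0, 0, 1, 0, 1], [3, 2], [3, 4])      # D = {TCAG}
print(gain_tcag, gain_aat)
```

Only subsets that contain at least one match contribute, so the work is proportional to the number of matched sequences, which is at most linear in the number of sequences.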
Lazy Update of Gains
• Coverage gains are monotonically non-increasing
during the algorithm
• Recompute the coverage gain for a candidate only if its
last saved gain is higher than the gain of the current
best candidate
• In practice this speeds up the selection algorithm
by a factor of ~2
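
One common way to realize lazy updates is a priority queue keyed by the last saved gain; the slides do not say which data structure the authors use, so the heap below is an assumption, as is the function name `lazy_greedy`.

```python
import heapq
from itertools import combinations


def lazy_greedy(seqs, candidates):
    """Greedy selection that refreshes a gain only when its saved value is on top."""
    hits = {c: [c in s for s in seqs] for c in candidates}
    undistinguished = set(combinations(range(len(seqs)), 2))

    def gain(c):
        return sum(hits[c][i] != hits[c][j] for i, j in undistinguished)

    # Max-heap via negated gains; saved gains are upper bounds on current gains
    # because gains never increase as more distinguishers are selected.
    heap = [(-gain(c), c) for c in candidates]
    heapq.heapify(heap)
    selected = []
    while undistinguished and heap:
        _, c = heapq.heappop(heap)
        current = gain(c)            # refresh only the most promising candidate
        if current == 0:
            continue                 # can never become useful again
        if not heap or current >= -heap[0][0]:
            selected.append(c)       # no saved gain beats it, so it is the maximum
            undistinguished = {(i, j) for i, j in undistinguished
                               if hits[c][i] == hits[c][j]}
        else:
            heapq.heappush(heap, (-current, c))
    if undistinguished:
        raise ValueError("candidates cannot distinguish all sequences")
    return selected


seqs = ["CAGTTGC", "CAGTTC", "CATGGA"]
lazy_choice = lazy_greedy(seqs, ["GTTGC", "GTTC", "CAT", "TG", "CAGT"])
print(lazy_choice)
```

The selected candidate always has maximum current gain, so the result is a valid greedy solution; only tie-breaking may differ from a version that recomputes every gain in every step.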
Algorithm Extensions
• Degenerate bases
  - A pair of sequences is separated by candidate $c$ if
$c$ has at least one perfect match with one of the sequences, and
$c$ has perfect mismatches at all positions of the other sequence
  - Gain computation is done in $O(n^2)$ time using a simple
coverage matrix data structure
• Redundancy $r$
  - A pair of sequences is counted in the gain function until $r$
distinguishers separate it
• Distinguisher cross-hybridization
  - Minimum edit distance, or maximum common substring
weight, bound for every pair of selected distinguishers
  - Candidates incompatible with a selected distinguisher are
removed from the candidate list
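
The first two extensions can be combined in a small Python sketch. It assumes each candidate comes with a per-sequence status in {'1', '0', '?'} (perfect match, perfect mismatch, uncertain) and that `cover[i][j]` counts the selected distinguishers separating sequences i < j; the names `robust_gain` and `record_selection` are hypothetical.

```python
def separates(status, i, j):
    """True iff one sequence is a perfect match and the other a perfect mismatch."""
    return {status[i], status[j]} == {"0", "1"}


def robust_gain(status, cover, r):
    """Number of pairs separated by the candidate that still have fewer than r
    separating distinguishers; O(n^2) for n sequences."""
    n = len(status)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if cover[i][j] < r and separates(status, i, j))


def record_selection(status, cover):
    """Update the coverage matrix after the candidate has been selected."""
    n = len(status)
    for i in range(n):
        for j in range(i + 1, n):
            if separates(status, i, j):
                cover[i][j] += 1


status = ["1", "0", "?", "1"]            # sequence 2 is an uncertain match
cover = [[0] * 4 for _ in range(4)]
gain_before = robust_gain(status, cover, r=2)
record_selection(status, cover)
gain_r1_after = robust_gain(status, cover, r=1)
gain_r2_after = robust_gain(status, cover, r=2)
print(gain_before, gain_r1_after, gain_r2_after)
```

Pairs involving an uncertain match are never counted as separated, which is what makes the resulting barcodes robust to degenerate bases.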
Testcases
• Randomly generated instances
  - Equal probabilities assigned to each of the four
nucleotides
• Microbial genomes extracted from NCBI
databases
  - Sequence lengths between 490 Kbases and 4.75
Mbases
  - Small number of degenerate bases
Selection time, L=10k, r=1
• basic: $O(n^2)$ computation of gains using a
matrix data structure
• partition: $O(n)$ computation of gains
using a partition-based data structure
Candidate Sampling, n=1000, L=10k, r=1
Comparison to ICH, L=10k, r=1
Algo \ n      10    20    50   100   200   500  1000
⌈log2 n⌉       4     5     6     7     8     9    10
ICH          4.0   5.0   7.0   8.0  10.0  12.2  14.1
SGA          4.0   5.0   7.0   8.0  10.0  12.3  14.1
Varying Redundancy, L=10k
[Figure: #Distinguishers (axis 0 to 80) versus Redundancy (axis 0 to 20), one curve for each of n = 10, 20, 50, 100, 200, 500, 1000]
NCBI testcase
• 20 NCBI microbial genomic sequences
• Distinguisher melting temperature range of 55-60 °C
• GC content range of 40-60%
• Max common subsequence weight bound of 5
  - weight(A) = weight(T) = 1, weight(C) = weight(G) = 2
NCBI testcase, r=1
Organism                                                        Mb
Nanoarchaeum equitans Kin4-M                                    0.49
Mycobacterium tuberculosis CDC1551                              4.40
Brucella suis 1330 chromosome 1                                 2.11
Leifsonia xyli subsp. xyli str. CTCB07                          2.58
Mannheimia succiniciproducens MBEL55E                           2.31
Geobacter sulfurreducens PCA                                    3.81
Rickettsia typhi str. Wilmington                                1.11
Picrophilus torridus DSM 9790                                   1.55
Mesoplasma florum L1                                            0.79
Methylococcus capsulatus str. Bath                              3.30
Propionibacterium acnes KPA171202                               2.56
Mycoplasma mobile 163K                                          0.78
Mycoplasma hyopneumoniae 232                                    0.89
Bacillus licheniformis DSM 13                                   4.22
Legionella pneumophila subsp. pneumophila str. Philadelphia 1   3.40
Onion yellows phytoplasma OY-M DNA                              0.86
Staphylococcus aureus subsp. aureus strain MRSA252              2.90
Staphylococcus aureus strain MSSA476                            2.80
Burkholderia pseudomallei strain K96243 chromosome 1            4.07
Bartonella henselae strain Houston-1                            1.93

[Table column: Barcode, a 9-bit 0/1 vector per organism, one bit per selected distinguisher]

Distinguisher     1     2     3     4     5     6     7     8     9
GC (%)         60.0  45.5  60.0  50.0  57.1  50.0  52.6  42.9  40.0
Tm (°C)        55.6  59.6  55.4  59.3  56.9  58.6  55.1  55.4  56.3
Results on 29 Microbial Sequences (76 Mb)
[Table: columns Redun, lmin, lmax, MinEdit, Select Time, #Distinguishers; rows for redundancy 1, 5, 10, and 20]
Conclusions
• We provided highly scalable algorithms for the robust
string barcoding problem, capable of handling whole
genomic sequences of up to bacterial size
• Distinguisher selection based on whole genomic sequences
results in a number of distinguishers nearly matching
the information-theoretic lower bounds for the problem
• The software can be used online at
http://dna.engr.uconn.edu/~software/DNA-BAR/