Large Scale Combinatorial
Optimization Problems in
Bioinformatics
Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park
Sample Preparation for
Peptide Identification
Enzymatic Digest
and
Fractionation
2
Single Stage MS
[Figure: MS spectrum; x-axis m/z]
3
Tandem Mass Spectrometry
(MS/MS)
[Figure: MS spectrum with precursor selection; x-axis m/z]
4
Tandem Mass Spectrometry
(MS/MS)
[Figure: precursor selection + collision induced dissociation (CID), yielding the MS/MS spectrum; x-axes m/z]
5
Novel Splice Isoform
• Human Jurkat leukemia cell-line
• Lipid-raft extraction protocol, targeting T cells
• von Haller, et al. MCP 2003.
• LIME1 gene:
• LCK interacting transmembrane adaptor 1
• LCK gene:
• Leukocyte-specific protein tyrosine kinase
• Proto-oncogene
• Chromosomal aberration involving LCK in leukemias.
• Multiple significant peptide identifications
6
Novel Splice Isoform
7
Novel Splice Isoform
8
Polymerase Chain Reaction
9
PCR
10
Applications of k-mer sets
• Peptide Identification
• Set of all human amino-acid 30-mer
peptide sequences...
• ...that occur at least twice in dbEST
• PCR Primer Design:
• Unique 20-mers
• What does it mean to be unique?
11
k-mer superstrings
• Completeness
• All of the required k-mers are represented
• Correctness
• No additional k-mers are represented
• Minimize the total representation length
• Correlates with running time
12
Shortest superstring
problem
• General strings (arbitrary length)
• Completeness only!
• Classical NP-hard problem
• Garey and Johnson
• Approximate within 2.5*OPT
• Max-SNP hard
• One of the first algorithmic approaches to
genome assembly
13
de Bruijn Sequences
de Bruijn sequences represent all words of
length k from some alphabet A.
A = {0,1}, k = 3: s = 0001110100
A = {0,1}, k = 4: s = 0000111101011001000
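The two example sequences can be checked mechanically. The following is a minimal sketch, assuming the linear (non-circular) reading of a de Bruijn sequence used on this slide; the function name `is_de_bruijn` is illustrative only.

```python
from itertools import product

def is_de_bruijn(s, alphabet, k):
    """True if every length-k word over `alphabet` occurs exactly once in s.

    The sequence is read linearly (no wrap-around), so a valid s has
    length len(alphabet)**k + k - 1.
    """
    words = [s[i:i + k] for i in range(len(s) - k + 1)]
    expected = {"".join(p) for p in product(alphabet, repeat=k)}
    return len(words) == len(expected) and set(words) == expected

print(is_de_bruijn("0001110100", "01", 3))
print(is_de_bruijn("0000111101011001000", "01", 4))
```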
14
de Bruijn Graph: A = {0,1}, k = 4
[Figure: de Bruijn graph on the eight 3-bit nodes 000, 001, 010, 011, 100, 101, 110, 111; each node has two outgoing edges, labeled 0 and 1]
15
de Bruijn Sequences &
Graphs
de Bruijn graphs (k,A):
• Edges represent length k words from A
• Each node has
• in degree |A|
• out degree |A|
Eulerian tour constructs de Bruijn
sequence.
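The Eulerian tour construction can be sketched as follows. This is an illustrative implementation using Hierholzer's algorithm, assuming k >= 2 and nodes that are the (k-1)-mers; the function name is hypothetical and the tour it finds is only one of many valid ones.

```python
from itertools import product

def de_bruijn_sequence(alphabet, k):
    """Spell a linear de Bruijn sequence from an Eulerian tour (k >= 2)."""
    # Nodes are (k-1)-mers. Node u has one outgoing edge per symbol a,
    # leading to (u + a)[1:]; the edge itself is the k-mer u + a.
    unused = {"".join(p): list(alphabet) for p in product(alphabet, repeat=k - 1)}
    start = alphabet[0] * (k - 1)
    stack, tour = [start], []
    while stack:
        u = stack[-1]
        if unused[u]:
            a = unused[u].pop()
            stack.append((u + a)[1:])   # follow an unused edge
        else:
            tour.append(stack.pop())    # dead end: node is finished
    tour.reverse()
    # Each step along the tour contributes exactly one new symbol.
    return tour[0] + "".join(v[-1] for v in tour[1:])

s = de_bruijn_sequence("01", 4)
print(s, len(s))
```

Every node has in-degree and out-degree |A|, so the graph is balanced and connected, which is why the tour uses all |A|^k edges.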
16
SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
17
Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
18
Sequence Databases &
CSBH-graphs
• Original sequences correspond to paths
ACDEFGI, ACDEFACG, DEFGEFGI
19
Sequence Databases &
CSBH-graphs
• All k-mers represented by an edge have
the same count
[Figure: CSBH-graph with edges labeled by k-mer counts 1, 2, 2, 1, 2]
20
Correct, Complete Enumeration
• Set of paths that use each edge
at least once
ACDEFGEFGI, DEFACG
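The completeness and correctness of this enumeration can be checked directly. The slides do not state k for the example, so the sketch below assumes k = 4; the helper names are illustrative.

```python
def kmer_set(seqs, k):
    """All distinct k-mers occurring in any of the sequences."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}

def check_representation(database, representation, k):
    """Return (complete, correct, total representation length)."""
    required = kmer_set(database, k)
    shown = kmer_set(representation, k)
    complete = required <= shown   # every required k-mer is represented
    correct = shown <= required    # no additional k-mer is represented
    return complete, correct, sum(len(s) for s in representation)

database = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
print(check_representation(database, ["ACDEFGEFGI", "DEFACG"], k=4))
```

Under the k = 4 assumption the two paths represent the same k-mers in 16 characters, against 23 characters for the original three sequences.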
21
Related work
• Chinese Postman Problem
• Edmonds and Johnson, ‘73
• Undirected graph, weighted edges
• Shortest closed tour that uses every edge at least once
• Solvable in polynomial time
• Construct minimum weighted matching
between nodes of odd-degree
• Add matching to graph and find Eulerian
tour
• Minimize weight of extra edges used
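The recipe above can be sketched for very small graphs. This illustration replaces the polynomial-time minimum weight matching with a brute-force search over pairings of the odd-degree nodes, so it is only suitable for toy instances; it assumes a connected undirected graph and returns the tour cost only.

```python
def chinese_postman_cost(nodes, edges):
    """Cost of the shortest closed tour using every edge at least once.

    edges: list of (u, v, weight) for a connected undirected graph.
    """
    inf = float("inf")
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    degree = {u: 0 for u in nodes}
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
        degree[u] += 1
        degree[v] += 1
    for mid in nodes:  # Floyd-Warshall all-pairs shortest paths
        for u in nodes:
            for v in nodes:
                if dist[u][mid] + dist[mid][v] < dist[u][v]:
                    dist[u][v] = dist[u][mid] + dist[mid][v]
    odd = [u for u in nodes if degree[u] % 2 == 1]

    def best_matching(rest):
        # Brute force: pair the first odd node with each possible partner.
        if not rest:
            return 0
        first, others = rest[0], rest[1:]
        return min(dist[first][p] + best_matching([q for q in others if q != p])
                   for p in others)

    # Extra edges along the matched shortest paths make all degrees even.
    return sum(w for _, _, w in edges) + best_matching(odd)

square = [("a", "b", 1), ("b", "c", 1), ("c", "d", 1), ("d", "a", 1), ("a", "c", 1)]
print(chinese_postman_cost("abcd", square))
```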
22
Correct, Complete,
Enumeration
• Chinese postman problem, except:
• Directed graph
• Add edges from nodes with surplus in-degree
to nodes with surplus out-degree
• Fixed cost teleportation option
• Can always “start” a new sequence
• Find optimal set of additional edges
• Transportation problem / min cost flow
instance
23
Patching the CSBH-graph
• Use artificial edges to fix unbalanced
nodes
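A minimal sketch of the patching idea, under simplifying assumptions: only "teleport" edges are added (through one hypothetical artificial node), real edges are never reused, and the graph is connected with at least one unbalanced node. The min cost flow formulation would additionally weigh reusing real edges against teleporting. The example assumes k = 4.

```python
from collections import defaultdict

def minimal_path_cover(kmers):
    """Fewest sequences that use every k-mer edge exactly once."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)          # #in - #out for each (k-1)-mer node
    for kmer in sorted(set(kmers)):
        u, v = kmer[:-1], kmer[1:]
        out_edges[u].append(v)
        balance[u] -= 1
        balance[v] += 1
    source = ""                         # artificial node, never a real (k-1)-mer
    for node, b in list(balance.items()):
        for _ in range(abs(b)):
            if b < 0:
                out_edges[source].append(node)   # a sequence starts here
            else:
                out_edges[node].append(source)   # a sequence ends here
    # The patched graph is balanced: find an Eulerian tour (Hierholzer).
    stack, tour = [source], []
    while stack:
        u = stack[-1]
        if out_edges[u]:
            stack.append(out_edges[u].pop())
        else:
            tour.append(stack.pop())
    tour.reverse()
    # Cut the tour at the artificial node and spell each piece.
    sequences, current = [], []
    for node in tour[1:]:
        if node == source:
            sequences.append(current[0] + "".join(n[-1] for n in current[1:]))
            current = []
        else:
            current.append(node)
    return sequences

database = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
kmers = [s[i:i + 4] for s in database for i in range(len(s) - 3)]
print(minimal_path_cover(kmers))
```

The decomposition is not unique, but for this example any valid answer has two sequences with total length 16.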
24
C3 Enumeration
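[Figure: transportation instance between two columns of unbalanced nodes, each labeled with #in-#out (1, 3, 2, 1, 3 and -2, -1, -4, -1, -2); edge label "Cost: k"]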
25
C3 Enumeration
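[Figure: transportation instance with nodes labeled by #in-#out (1, 3, 2, 1, 3 and -2, -1, -4, -1, -2, plus two nodes with 0); edge labels "Cost: 0", "Cost: 0", "Cost: k"]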
26
C2 Enumeration
[Figure: nodes labeled by #in-#out (1, 3, 2, 1, 3 and -2, -1, -4, -1, -2), connected by "Shortcut paths" with labels 4, 10, 7]
27
C2 Superstring
[Figure: transportation instance with nodes labeled by #in-#out (1, 3, 2, 1, 3 and -2, -1, -4, -1, -2, plus two nodes with 0); edge labels "Cost: 0" (three times) and "Cap: 1"]
28
Large scale instances
• Millions of nodes and edges
• CSBH-graph instance
• Min cost flow instance
• CSBH-graph instance takes days to
construct
• 2 days on 250 CPUs
• Algorithms must be linear in problem size
• Out-of-core Eulerian path algorithm?
29
Grid computing
• Heterogeneous machines
• Varying disk/memory/MHz/cores capabilities
• Centralized scheduler
• Jobs started asynchronously
• Other jobs may preempt current job
• Input files may need to be staged
• 250 simultaneous requests for a 3Gb file?
• How to guarantee integrity of input files?
• Problem decomposition may be non-trivial
• Job sizes need to fit the least capable machine
• Sometimes need to “game” the scheduler
• Need to ensure the integrity of job output
30
Uniqueness Oracles
• Oracle for uniqueness of 20-mers in
the Human genome (n=3Gb)
• Count occurrences in the genome: 0,1,2+
• Construct 20-mer superstring for 20-mers
with count 1
• Construct 20-mer superstring for 20-mers
with count > 1
• Easy for exact sequence match: O(n)
• Fast automata, indexing, hash tables.
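For exact matching, the 0 / 1 / 2+ oracle can be sketched with a hash table in a single pass. This is an in-memory toy that ignores reverse complements and the memory limits of a 3Gb genome; the function name and the `cap` parameter are illustrative.

```python
from collections import Counter

def kmer_counts(genome, k, cap=2):
    """Count k-mer occurrences, saturating at `cap` (2 means "2 or more")."""
    counts = Counter()
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        if counts[kmer] < cap:      # reading a missing key returns 0
            counts[kmer] += 1
    return counts

counts = kmer_counts("ACGTACGA", k=3)
unique = sorted(kmer for kmer, c in counts.items() if c == 1)
repeated = sorted(kmer for kmer, c in counts.items() if c > 1)
print(unique, repeated)
```

The `unique` and `repeated` lists are the two k-mer sets for which the slide proposes building separate superstrings.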
31
Inexact sequence match
• Inexact sequence matching O(n*m*k)
• Errors/Mismatches (k): 1,2,3
• # distinct 20-mers (m): O(n)
• Achieve expected linear time using a
hybrid approach:
• Exact search for short chunks of queries
• Expensive alignment only where chunks match
• Large chunks ⇒ Fast, but miss occurrences
• Small chunks ⇒ Slow, find all matches
32
Inexact sequence match
Baeza-Yates and Perleberg:
• Correct and O(n) for small k
• Split each query into k+1 chunks: with at most k errors, at least 1 chunk is observed with no error.
• Form of locality sensitive hashing
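A minimal sketch of the chunk filter for a single query, assuming mismatches only (Hamming distance, no insertions or deletions) and using `str.find` as a stand-in for the fast exact matcher. The function names are illustrative.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def find_with_mismatches(genome, query, k):
    """Positions where query matches genome with at most k mismatches."""
    m = len(query)
    size = m // (k + 1)
    assert size >= 1, "query too short to split into k+1 chunks"
    hits = set()
    for c in range(k + 1):
        start = c * size
        end = m if c == k else start + size     # last chunk takes the remainder
        chunk = query[start:end]
        pos = genome.find(chunk)                # cheap exact search
        while pos != -1:
            origin = pos - start                # implied start of the full query
            if 0 <= origin <= len(genome) - m:
                # Expensive check only where a chunk matched exactly.
                if hamming(genome[origin:origin + m], query) <= k:
                    hits.add(origin)
            pos = genome.find(chunk, pos + 1)
    return sorted(hits)

print(find_with_mismatches("GGACGTTCGTGG", "ACGTACGT", k=1))
```

Because k mismatches can spoil at most k of the k+1 chunks, every true occurrence is reached through at least one exact chunk hit, so the filter has no false negatives.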
33
Locality Sensitive Hashing
• For each query:
• store a (set of) hash(es) in hash-table
• At each position in the genome:
• look-up a (set of) hash(es) in hash-table
• if any are in the table, do more expensive check
• Need to weigh
• sensitivity (false negatives) against
• specificity (false positives)
• Our application requires no false negatives
34
Random Projection
• Choose T templates of l random “care”
positions
[Figure: template t1 aligned against the genome g]
t1: 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0
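A template can be applied as a mask that keeps only the care positions, and the projected strings can serve as hash-table keys. The sketch below uses the slide's t1 and t2 written as strings of 0/1; the function names and the toy query are hypothetical, and a real implementation would hash the projection rather than store it as a string.

```python
from collections import defaultdict

def project(word, template):
    """Keep only the characters at the template's care positions (marked 1)."""
    return "".join(ch for ch, care in zip(word, template) if care == "1")

def build_index(queries, templates):
    """Store one key per (template, query) pair in a hash table."""
    index = defaultdict(set)
    for q in queries:
        for t, template in enumerate(templates):
            index[(t, project(q, template))].add(q)
    return index

def candidates(genome, m, templates, index):
    """Yield (position, query) pairs that agree on all cares of some template."""
    for i in range(len(genome) - m + 1):
        window = genome[i:i + m]
        for t, template in enumerate(templates):
            for q in index.get((t, project(window, template)), ()):
                yield i, q   # still needs the more expensive check

t1 = "01101010001010100"
t2 = "10100101000001001"
query = "ABCDEFGHIJKLMNOPQ"
index = build_index([query], [t1, t2])
genome = "xx" + "ZBCDEFGHIJKLMNOPQ" + "yy"   # one error at a don't-care of t1
print(set(candidates(genome, 17, [t1, t2], index)))
```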
35
Random Projection
• Choose T templates of l random “care”
positions
[Figure: templates t1 and t2 aligned against the genome g]
t1: 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0
t2: 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1
36
Gapped seed-set design
problem
• Given:
• mer-size: m ( = 20 )
• # errors: k ( = 1,2,3)
• # cares: l ( = 10,12,14 )
• Find the smallest set of templates with
no false negatives.
• Minimize running time.
39
Alternative Formulation #1
(for k = 2)
• Cover the edges of K_m with
copies of K_{m-l}
• How many triangles to cover K_6?
• Some instances of (m,2,m-3) cover
each edge exactly once:
• Steiner triple systems
40
Alternative Formulation #2
• Set cover instance:
• Ground set: all possible placements of the
k errors (alignments)
• Covering sets: all possible placements of
the l care positions
• For (m=20,k=2,l=10),
• 190 elements, 184,756 sets!
• Greedy approximation algorithm works
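The set cover view can be sketched with the standard greedy heuristic. This is a naive illustration that rescans every candidate template in each round, so the example runs on a small hypothetical instance (m=8, k=2, l=4) rather than (20,2,10); the instance sizes quoted on the slide are checked with `math.comb`.

```python
from itertools import combinations
from math import comb

def greedy_templates(m, k, l):
    """Greedy set cover: pick care-position sets until every error
    placement avoids all cares of at least one chosen template."""
    assert m - l >= k, "k errors must fit inside the don't-care positions"
    positions = set(range(m))
    uncovered = set(combinations(range(m), k))      # ground set
    all_templates = list(combinations(range(m), l))  # covering sets
    chosen = []
    while uncovered:
        def gain(cares):
            free = positions - set(cares)
            return sum(1 for errs in uncovered if free.issuperset(errs))
        best = max(all_templates, key=gain)
        free = positions - set(best)
        uncovered = {e for e in uncovered if not free.issuperset(e)}
        chosen.append(best)
    return chosen

print(comb(20, 2), comb(20, 10))   # elements and sets for (m=20, k=2, l=10)
print(greedy_templates(8, 2, 4))
```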
41
Alternative Formulation #3
[Figure: bipartite graph between templates and the m positions; each template is joined to its l care positions]
• Remove any k position nodes: ≥ 1 template still has degree l.
42
Alternative formulation #3
• Template t has care at position i: x_{ti}
• Alignment a is not covered by template t: y_{ta}
• Alignment a is not covered by any template: z_a
43
Alternative Formulation #3
• Polynomial size in terms of number of
templates
• Select T in advance and test whether T is
sufficient.
• Greedily add T templates.
• Apply iteratively to achieve feasible solution
• Extremely weak LP relaxation
• Lots of symmetry!
• Hard to solve useful instances
44
Solution for (20,2,10)
.................... Positions
**********           t1
          ********** t2
*****     *****      t3
     *****     ***** t4
*****          ***** t5
     **********      t6
• Need at least 4 templates, 6 is optimal
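The "no false negatives" property of a template set can be verified by brute force over all error placements. The sketch below assumes the six templates are the six pairs of four 5-position blocks, which is one reading of the figure; the function name is illustrative.

```python
from itertools import combinations

def uncovered_alignments(templates, m, k):
    """Error placements that hit a care position of every template.

    Each such placement is a potential false negative.
    """
    return [errs for errs in combinations(range(m), k)
            if all(any(e in t for e in errs) for t in templates)]

# Assumed layout: four blocks of five positions, one template per pair of blocks.
blocks = [set(range(b, b + 5)) for b in (0, 5, 10, 15)]
six = [a | b for a, b in combinations(blocks, 2)]

print(len(uncovered_alignments(six, m=20, k=2)))
```

Two errors touch at most two of the four blocks, so the pair of untouched blocks is always an intact template. Dropping any one of the six leaves some placements uncovered, although that alone does not prove that 6 is optimal.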
45
Remember the application!
• We are checking some templates
twice!
• We compute the hash(es) at each position
in the genome
• Any template that is a shift of another
will be computed at some nearby
genomic position!
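The effect of shifts can be illustrated by enumerating the translations of a template inside the 20-position window. The sketch assumes the block-pair templates for (20,2,10), with A, B, C, D denoting the four 5-position blocks; all names are hypothetical.

```python
from itertools import combinations

def translations(template, m):
    """All shifts of a set of care positions that stay inside m positions."""
    lo, hi = min(template), max(template)
    return {frozenset(p + s for p in template) for s in range(-lo, m - hi)}

A, B, C, D = (frozenset(range(b, b + 5)) for b in (0, 5, 10, 15))
base = [A | B, A | C, A | D]                 # three templates kept
reachable = set().union(*(translations(t, 20) for t in base))
six = {x | y for x, y in combinations([A, B, C, D], 2)}

print(six <= reachable)
```

Since the hashes are computed at every genomic position anyway, the three shifted templates come for free, which is why three base templates suffice for this family.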
46
Solution for (20,2,10)
.................... Positions
**********           t1
          ********** t2
*****     *****      t3
     *****     ***** t4
*****          ***** t5
     **********      t6
• Need at most 3 templates...can we do better?
47
Alternative formulation #3
with template shift
48
Solution for (20,2,10) w/ shift
.................... Positions
**** ** **** t1
**** * ***** t2
• Optimal is 2 templates...
49
Alternative Formulation #3
with shift
• Polynomial size in terms of number of
templates
• Select T in advance and test whether T is
sufficient.
• Greedily add T templates.
• Apply iteratively to achieve feasible solution
• Greedy not known to be good!
• Extremely weak LP relaxation
• Much less symmetry
• Hard to solve useful instances
50
Conclusions
• Lots of interesting optimization problems in
bioinformatics
• Usually very large scale
• Need good empirical algorithms/solvers
• Modeling tradeoffs abound
• Speed/Memory/Optimality/Correctness
• Many variants of LSH in different domains
• Resource constrained allocation problems
• ...due to limitations in biotechnologies
• As the technologies scale up, so do the issues
51