Faster, More Sensitive Peptide ID by Sequence DB

Large Scale Combinatorial
Optimization Problems in
Bioinformatics
Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park
Sample Preparation for
Peptide Identification
Enzymatic Digest
and
Fractionation
Single Stage MS
[Figure: MS spectrum; x-axis: m/z]
Tandem Mass Spectrometry
(MS/MS)
[Figure: precursor selection on the MS spectrum; x-axis: m/z]
Tandem Mass Spectrometry
(MS/MS)
Precursor selection + collision induced dissociation (CID)
[Figure: MS spectrum and the resulting MS/MS spectrum; x-axis: m/z]
Novel Splice Isoform
• Human Jurkat leukemia cell-line
• Lipid-raft extraction protocol, targeting T cells
• von Haller, et al. MCP 2003.
• LIME1 gene:
• LCK interacting transmembrane adaptor 1
• LCK gene:
• Leukocyte-specific protein tyrosine kinase
• Proto-oncogene
• Chromosomal aberration involving LCK in leukemias.
• Multiple significant peptide identifications
Polymerase Chain Reaction
PCR
Applications of k-mer sets
• Peptide Identification
• Set of all human amino-acid 30-mer
peptide sequences...
• ...that occur at least twice in dbEST
• PCR Primer Design:
• Unique 20-mers
• What does it mean to be unique?
k-mer superstrings
• Completeness
• All of the required k-mers are represented
• Correctness
• No additional k-mers are represented
• Minimize the total representation length
• Correlates with running time
Shortest superstring
problem
• General strings (arbitrary length)
• Completeness only!
• Classical NP-hard problem
• Garey and Johnson
• Approximate within 2.5*OPT
• Max-SNP hard
• One of the first algorithmic approaches to
genome assembly
de Bruijn Sequences
de Bruijn sequences represent all words of
length k from some alphabet A.
A = {0,1}, k = 3: s = 0001110100
A = {0,1}, k = 4: s = 0000111101011001000
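The two example sequences can be checked mechanically. The sketch below is an illustration only (the function name `covers_all_words` is hypothetical); it treats the sequences as linear, not cyclic, so a sequence of length |A|^k + k - 1 is the shortest possible.

```python
from itertools import product

def covers_all_words(s: str, alphabet: str, k: int) -> bool:
    """Return True if every length-k word over `alphabet` occurs in s."""
    seen = {s[i:i + k] for i in range(len(s) - k + 1)}
    return all("".join(w) in seen for w in product(alphabet, repeat=k))

# The two sequences from the slide.
print(covers_all_words("0001110100", "01", 3))           # True
print(covers_all_words("0000111101011001000", "01", 4))  # True
# Lengths match |A|^k + k - 1: 8 + 2 = 10 and 16 + 3 = 19.
print(len("0001110100"), len("0000111101011001000"))     # 10 19
```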
de Bruijn Graph: A = {0,1}, k = 4
[Figure: de Bruijn graph with nodes 000, 001, 010, 011, 100, 101, 110, 111; edges labeled 0 or 1]
de Bruijn Sequences &
Graphs
de Bruijn graphs (k,A):
• Edges represent length k words from A
• Each node has
• in degree |A|
• out degree |A|
Eulerian tour constructs de Bruijn
sequence.
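As a minimal sketch of the Eulerian-tour construction, the code below builds the de Bruijn graph implicitly and walks it with Hierholzer's algorithm. The function name and the choice of Hierholzer's algorithm are assumptions made for illustration; the slides do not say which tour algorithm was used. It assumes k >= 2.

```python
from itertools import product

def de_bruijn_sequence(alphabet: str, k: int) -> str:
    """Linear de Bruijn sequence from an Eulerian tour of the de Bruijn graph.

    Nodes are (k-1)-letter words; following an edge appends one letter,
    so each edge spells one k-letter word. Assumes k >= 2.
    """
    start = alphabet[0] * (k - 1)
    # Unused outgoing edge labels for every node (out-degree |A|).
    unused = {"".join(p): list(alphabet)
              for p in product(alphabet, repeat=k - 1)}
    stack, tour = [start], []
    # Hierholzer's algorithm: walk until stuck, then back up and splice.
    while stack:
        node = stack[-1]
        if unused[node]:
            letter = unused[node].pop()
            stack.append((node + letter)[1:])
        else:
            tour.append(stack.pop())
    tour.reverse()  # node sequence of the Eulerian tour
    # First node, then the last letter of every later node.
    return tour[0] + "".join(node[-1] for node in tour[1:])

s = de_bruijn_sequence("01", 4)
print(s, len(s))  # length is 2**4 + 4 - 1 = 19
```

The tour visits each of the |A|^k edges once, so the running time is linear in the size of the graph.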
SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
Sequence Databases &
CSBH-graphs
• Original sequences correspond to paths
ACDEFGI, ACDEFACG, DEFGEFGI
Sequence Databases &
CSBH-graphs
• All k-mers represented by an edge have
the same count
[Figure: CSBH-graph with edge counts 1, 2, 2, 1, 2]
Correct, Complete Enumeration
• Set of paths that use each edge
at least once
ACDEFGEFGI, DEFACG
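Completeness and correctness of this enumeration can be checked directly by comparing k-mer sets. The sketch below assumes k = 4, which the slide does not state, and the helper names are hypothetical.

```python
def kmer_set(strings, k):
    """All distinct k-mers occurring in any of the strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}

def check_enumeration(database, enumeration, k):
    """Return (complete, correct) for `enumeration` against `database`."""
    required = kmer_set(database, k)
    represented = kmer_set(enumeration, k)
    complete = required <= represented   # every required k-mer appears
    correct = represented <= required    # no additional k-mer appears
    return complete, correct

database = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
enumeration = ["ACDEFGEFGI", "DEFACG"]
K = 4  # assumption: the slide does not give k

print(check_enumeration(database, enumeration, K))                # (True, True)
print(sum(map(len, database)), "->", sum(map(len, enumeration)))  # 23 -> 16
```

Under this assumption the two paths represent the same ten 4-mers as the three original sequences, in 16 characters instead of 23.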
Related work
• Chinese Postman Problem
• Edmonds and Johnson, ‘73
• Undirected graph, weighted edges
• Shortest path that uses all the edges
• Solvable in polynomial time
• Construct minimum weighted matching
between nodes of odd-degree
• Add matching to graph and find Eulerian
path
• Minimize weight of extra edges used
Correct, Complete,
Enumeration
• Chinese postman problem, except:
• Directed graph
• Add edges from nodes with surplus in-degree
to nodes with surplus out-degree
• Fixed cost teleportation option
• Can always “start” a new sequence
• Find optimal set of additional edges
• Transportation problem / min cost flow
instance
Patching the CSBH-graph
• Use artificial edges to fix unbalanced
nodes
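A simplified sketch of patching follows. It covers only the case where every real edge is used exactly once and every artificial edge is a fixed-cost restart, so any pairing of unbalanced nodes is equally good; it does not solve the min cost flow instance that decides which real edges to reuse. The function names, k = 4, and the use of Hierholzer's algorithm are assumptions for illustration.

```python
from collections import defaultdict

def spell(nodes):
    """Spell the sequence for a path given as a list of (k-1)-mer nodes."""
    return nodes[0] + "".join(n[-1] for n in nodes[1:])

def patch_and_enumerate(kmers):
    """Cover each k-mer edge exactly once by paths, using artificial edges."""
    out_edges = defaultdict(list)   # node -> [(next_node, is_artificial)]
    balance = defaultdict(int)      # node -> #in - #out
    for kmer in sorted(kmers):
        u, v = kmer[:-1], kmer[1:]
        out_edges[u].append((v, False))
        balance[u] -= 1
        balance[v] += 1
    # Artificial edges run from surplus-in nodes to surplus-out nodes.
    ends = [n for n in sorted(balance) for _ in range(max(balance[n], 0))]
    starts = [n for n in sorted(balance) for _ in range(max(-balance[n], 0))]
    for u, v in zip(ends, starts):
        out_edges[u].append((v, True))

    sequences = []
    for root in sorted(out_edges):
        if not out_edges[root]:
            continue  # this component was already consumed
        # Hierholzer's algorithm on the now-balanced component.
        stack, circuit = [(root, False)], []
        while stack:
            node = stack[-1][0]
            if out_edges[node]:
                stack.append(out_edges[node].pop())
            else:
                circuit.append(stack.pop())
        circuit.reverse()
        edges = circuit[1:]  # (node reached, arrived via artificial edge?)
        cuts = [i for i, (_, art) in enumerate(edges) if art]
        if not cuts:
            sequences.append(spell([root] + [n for n, _ in edges]))
            continue
        # Rotate so the circuit starts right after an artificial edge,
        # then cut it at every artificial edge.
        j = cuts[0]
        path = [edges[j][0]]
        for node, art in edges[j + 1:] + edges[:j + 1]:
            if art:
                sequences.append(spell(path))
                path = [node]
            else:
                path.append(node)
    return sequences

kmers = {s[i:i + 4] for s in ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
         for i in range(len(s) - 3)}
print(sorted(patch_and_enumerate(kmers)))  # ['ACDEFGEFGI', 'DEFACG']
```

Each artificial edge corresponds to starting a new sequence, so the number of output sequences equals the total surplus.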
C3 Enumeration
[Figure: unbalanced nodes labeled by #in-#out; positive imbalances 1, 3, 2, 1, 3 and negative imbalances -2, -1, -4, -1, -2; edge label: Cost: k]
C3 Enumeration
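[Figure: unbalanced nodes labeled by #in-#out; positive imbalances 1, 3, 2, 1, 3 and negative imbalances -2, -1, -4, -1, -2, plus two nodes with imbalance 0; edge labels: Cost: 0, Cost: 0, Cost: k]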
C2 Enumeration
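[Figure: unbalanced nodes labeled by #in-#out; positive imbalances 1, 3, 2, 1, 3 and negative imbalances -2, -1, -4, -1, -2, joined by “shortcut paths”; further labels shown: 4, 10, 7]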
C2 Superstring
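[Figure: unbalanced nodes labeled by #in-#out; positive imbalances 1, 3, 2, 1, 3 and negative imbalances -2, -1, -4, -1, -2, plus two nodes with imbalance 0; edge labels: Cost: 0 (three times), Cap: 1]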
Large scale instances
• Millions of nodes and edges
• cSBH-graph instance
• Min cost flow instance
• cSBH-graph instance takes days to
construct
• 2 days on 250 CPUs
• Algorithms must be linear in problem size
• Out-of-core Eulerian path algorithm?
Grid computing
• Heterogeneous machines
• Varying disk/memory/MHz/cores capabilities
• Centralized scheduler
• Jobs started asynchronously
• Other jobs may preempt current job
• Input files may need to be staged
• 250 simultaneous requests for a 3Gb file?
• How to guarantee integrity of input files?
• Problem decomposition may be non-trivial
• Job sizes need to fit the least capable machine
• Sometimes need to “game” the scheduler
• Need to ensure the integrity of job output
Uniqueness Oracles
• Oracle for uniqueness of 20-mers in
the Human genome (n=3Gb)
• Count occurrences in the genome: 0,1,2+
• Construct 20-mer superstring for 20-mers
with count 1
• Construct 20-mer superstring for 20-mers
with count > 1
• Easy for exact sequence match: O(n)
• Fast automata, indexing, hash tables.
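For the exact-match case, a hash table is enough to sketch the oracle. The code below is a toy illustration, not the large-scale implementation: the function names are hypothetical, counts saturate at 2 to mirror the 0, 1, 2+ classes, and reverse complements are ignored.

```python
def count_classes(genome: str, k: int = 20):
    """Map each k-mer in the genome to 1 or 2, where 2 means '2 or more'."""
    counts = {}
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        counts[kmer] = min(counts.get(kmer, 0) + 1, 2)  # saturate at 2+
    return counts

def oracle(counts, kmer):
    """Return 0, 1, or 2 (meaning 2+) for a query k-mer."""
    return counts.get(kmer, 0)

# Toy genome with k = 4 instead of 20.
counts = count_classes("ACGTACGTTT", k=4)
print(oracle(counts, "ACGT"), oracle(counts, "CGTA"), oracle(counts, "AAAA"))
# 2 1 0
unique = sorted(kmer for kmer, c in counts.items() if c == 1)
print(unique)  # ['CGTA', 'CGTT', 'GTAC', 'GTTT', 'TACG']
```

One pass over the genome gives the O(n) behavior noted above; at n = 3Gb the table itself is the practical bottleneck, which is where the k-mer superstring representation comes in.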
Inexact sequence match
• Inexact sequence matching O(n*m*k)
• Errors/Mismatches (k): 1,2,3
• # distinct 20-mers (m): O(n)
• Achieve expected linear time using a
hybrid approach:
• Exact search for short chunks of queries
• Expensive alignment only where chunks match
• Large chunks ⇒ Fast, but miss occurrences
• Small chunks ⇒ Slow, find all matches
Inexact sequence match
Baeza-Yates and Perleberg:
• Correct and O(n) for small k
• Split each query into k+1 chunks: with at most k errors, at least 1 chunk is observed with no error.
• Form of locality sensitive hashing
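The chunk filter can be sketched as follows, assuming mismatch-only errors (Hamming distance) and a single query. The function name is hypothetical and `str.find` stands in for the fast exact search; a real implementation would index all queries at once.

```python
def find_with_mismatches(text, query, k):
    """All positions where query matches text with at most k mismatches."""
    m = len(query)
    chunk = m // (k + 1)  # k+1 disjoint chunks; at least one is error-free
    assert chunk > 0, "query too short for k+1 chunks"
    hits = set()
    for c in range(k + 1):
        off = c * chunk
        piece = query[off:off + chunk]
        pos = text.find(piece)          # exact search for the chunk
        while pos != -1:
            start = pos - off           # implied start of the full query
            if 0 <= start <= len(text) - m:
                window = text[start:start + m]
                # Expensive check only where a chunk matched exactly.
                if sum(a != b for a, b in zip(window, query)) <= k:
                    hits.add(start)
            pos = text.find(piece, pos + 1)
    return sorted(hits)

print(find_with_mismatches("GATTACAGATTTCAGATCACA", "GATTACA", 1))
# [0, 7, 14]
```

Because the k+1 chunks are disjoint, k mismatches cannot touch all of them, so the filter has no false negatives.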
Locality Sensitive Hashing
• For each query:
• store a (set of) hash(es) in hash-table
• At each position in the genome:
• look-up a (set of) hash(es) in hash-table
• if any are in the table, do more expensive check
• Need to weigh
• sensitivity (false negatives) against
• specificity (false positives)
• Our application requires no false negatives
Random Projection
• Choose T templates of l random “care”
positions
[Figure: template t1 aligned to the genome g]
t1: 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0
Random Projection
• Choose T templates of l random
positions
[Figure: templates t1 and t2 aligned to the genome g]
t1: 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0
t2: 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1
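A minimal sketch of random projection as a hash-table filter is shown below. All names are hypothetical, and templates are stored as tuples of care positions instead of 0/1 rows. With random templates an occurrence with errors can be missed, which is why the template set has to be designed when no false negatives are allowed.

```python
import random

def make_templates(m, l, T, seed=0):
    """Choose T templates, each a sorted tuple of l random care positions."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(m), l))) for _ in range(T)]

def project(word, template):
    """Hash key: the letters of `word` at the template's care positions."""
    return "".join(word[i] for i in template)

def build_table(queries, templates):
    """Store one key per (template, query) pair."""
    table = {}
    for q in queries:
        for t, tpl in enumerate(templates):
            table.setdefault((t, project(q, tpl)), []).append(q)
    return table

def scan(genome, m, templates, table):
    """Yield (position, query) candidates; each still needs the full check."""
    for pos in range(len(genome) - m + 1):
        window = genome[pos:pos + m]
        seen = set()
        for t, tpl in enumerate(templates):
            for q in table.get((t, project(window, tpl)), ()):
                if q not in seen:
                    seen.add(q)
                    yield pos, q

templates = make_templates(m=10, l=4, T=3)
query = "ACGTACGTAC"
table = build_table([query], templates)
candidates = list(scan("TTTT" + query + "GGGG", 10, templates, table))
print((4, query) in candidates)  # True: an exact occurrence is always a candidate
```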
Gapped seed-set design
problem
• Given:
• mer-size: m ( = 20 )
• # errors: k ( = 1,2,3)
• # cares: l ( = 10,12,14 )
• Find the smallest set of templates with
no false negatives.
• Minimize running time.
Alternative Formulation #1
(for k = 2)
• Cover the edges of K_m with
copies of K_{m-l}
• How many triangles to cover K_6?
• Some instances of (m,2,m-3) cover
each edge exactly once:
• Steiner triple systems
Alternative Formulation #2
• Set cover instance:
• Ground set: all possible placements of the
k errors (alignments)
• Covering sets: all possible placements of
the l care positions
• For (m=20,k=2,l=10),
• 190 elements, 184,756 sets!
• Greedy approximation algorithm works
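The set cover instance and the greedy algorithm can be sketched as below. This is an illustration under stated assumptions: the function name is hypothetical, a template covers an alignment when none of the k error positions is a care position, and covered sets are stored as integer bitmasks. The usage runs a small instance; (20, 2, 10) works the same way but enumerates all 184,756 templates.

```python
from itertools import combinations
from math import comb

def greedy_templates(m, k, l):
    """Greedy set cover: add templates until every error placement is covered."""
    alignments = list(combinations(range(m), k))      # ground set
    index = {a: i for i, a in enumerate(alignments)}
    # For each template, a bitmask of the alignments it covers: those whose
    # k error positions all avoid the template's care positions.
    masks = {}
    for cares in combinations(range(m), l):
        free = [p for p in range(m) if p not in cares]
        mask = 0
        for a in combinations(free, k):
            mask |= 1 << index[a]
        masks[cares] = mask
    uncovered = (1 << len(alignments)) - 1
    chosen = []
    while uncovered:
        # Pick the template covering the most still-uncovered alignments.
        best = max(masks, key=lambda c: bin(masks[c] & uncovered).count("1"))
        if masks[best] & uncovered == 0:
            raise ValueError("no template set can cover every alignment")
        chosen.append(best)
        uncovered &= ~masks[best]
    return chosen

print(comb(20, 2), comb(20, 10))  # 190 184756
small = greedy_templates(m=8, k=2, l=4)
print(len(small), small)
```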
Alternative Formulation #3
[Figure: bipartite graph between templates and positions (m); each template has degree l]
Remove any k position nodes: ≥ 1 template still has degree l.
Alternative formulation #3
• Template t has care at position i: x_{ti}
• Alignment a is not covered by template t: y_{ta}
• Alignment a is not covered by any template: z_a
Alternative Formulation #3
• Polynomial size in terms of number of
templates
• Select T in advance and test whether T is
sufficient.
• Greedily add T templates.
• Apply iteratively to achieve feasible solution
• Extremely weak LP relaxation
• Lots of symmetry!
• Hard to solve useful instances
Solution for (20,2,10)
.................... Positions
**********           t1
          ********** t2
*****     *****      t3
     *****     ***** t4
*****          ***** t5
     **********      t6
• Need at least 4 templates, 6 is optimal
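A brute-force check confirms that a six-template solution has no false negatives. The sketch assumes the six templates are the pairs of 5-position blocks of the 20-mer, which is how the extracted diagram reads; the function name is hypothetical.

```python
from itertools import combinations

def has_no_false_negatives(templates, m, k):
    """True if, for every placement of k errors, some template avoids them all."""
    return all(
        any(errors.isdisjoint(t) for t in templates)
        for errors in map(set, combinations(range(m), k))
    )

# Assumption: four blocks of 5 positions; each template is a pair of blocks.
blocks = [set(range(b, b + 5)) for b in range(0, 20, 5)]
templates = [x | y for x, y in combinations(blocks, 2)]

print(len(templates), has_no_false_negatives(templates, 20, 2))  # 6 True
print(has_no_false_negatives(templates[:5], 20, 2))              # False
```

Two errors touch at most two of the four blocks, so the pair of untouched blocks is always one of the six templates; dropping any one pair leaves some error placement uncovered.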
Remember the application!
• We are checking some templates
twice!
• We compute the hash(es) at each position
in the genome
• Any template that is a shift of another
will be computed at some nearby
genomic position!
Solution for (20,2,10)
.................... Positions
**********           t1
          ********** t2
*****     *****      t3
     *****     ***** t4
*****          ***** t5
     **********      t6
• Need at most 3 templates...can we do better?
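Counting templates up to shift makes the bound of 3 concrete. The sketch again assumes the six templates are pairs of 5-position blocks; `canonical` is a hypothetical helper that slides a template so its first care position is 0.

```python
from itertools import combinations

def canonical(template):
    """Shift a template so that its first care position is 0."""
    first = min(template)
    return tuple(sorted(p - first for p in template))

# Assumption: four blocks of 5 positions; each template is a pair of blocks.
blocks = [range(b, b + 5) for b in range(0, 20, 5)]
templates = [tuple(x) + tuple(y) for x, y in combinations(blocks, 2)]

distinct = {canonical(t) for t in templates}
print(len(templates), len(distinct))  # 6 3
```

The three shapes are one run of 10 cares, two runs of 5 separated by 5 positions, and two runs of 5 separated by 10 positions; hashes for the shifted copies are already computed at nearby genomic positions.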
Alternative formulation #3
with template shift
Solution for (20,2,10) w/ shift
.................... Positions
**** ** ****  t1
**** * *****  t2
• Optimal is 2 templates...
Alternative Formulation #3
with shift
• Polynomial size in terms of number of
templates
• Select T in advance and test whether T is
sufficient.
• Greedily add T templates.
• Apply iteratively to achieve feasible solution
• Greedy not known to be good!
• Extremely weak LP relaxation
• Much less symmetry
• Hard to solve useful instances
Conclusions
• Lots of interesting optimization problems in
bioinformatics
• Usually very large scale
• Need good empirical algorithms/solvers
• Modeling tradeoffs abound
• Speed/Memory/Optimality/Correctness
• Many variants of LSH in different domains
• Resource constrained allocation problems
• ...due to limitations in biotechnologies
• As the technologies scale up, so do the issues