8 Physical Mapping - Algorithms in Bioinformatics

Bioinformatics I, WS’09-10, D. Huson, January 13, 2010
8 Physical Mapping
This exposition is based on the following sources, which are all recommended reading:
1. J Setubal and J Meidanis, Introduction to Computational Molecular Biology, PWS Publishing,
1997, chapter 5.
2. J. Blazewicz, P. Formanowicz, M. Kasprzak, M. Jaroszewski and W.T. Markiewicz, Construction
of DNA restriction maps based on a simplified experiment, Bioinformatics, 17(5):398-404 (2001)
8.1 Introduction
A basic step in the study of a genome is to create a physical map. This “map” consists of a list of
locations of markers along the chromosomes of the genome. Such a marker is usually a known DNA
sequence that occurs somewhere in the genome. Physical maps are used to navigate genomes and also
to determine the location of fragments of sequence.
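For example, assume we are given a physical map A, B, C, D, E of a genome G. If S is a fragment
of G that contains the markers B and C, then we can “place” S in G:
[Figure: genome G with markers A, B, C, D, E in this order; fragment S, carrying markers B and C, is aligned with the B to C region of G.]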
More formally we define:
Definition 8.1.1 (Physical map) Let S be a DNA sequence. A physical map consists of a set M
of markers and a function p : M → N that assigns a position in S to each marker in M .
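Definition 8.1.1 can be sketched in a few lines of Python. This is only an illustration: the marker positions are invented, and the helper name `place` is hypothetical.

```python
# Physical map: marker -> position in the sequence (positions are made up).
physical_map = {"A": 1200, "B": 4800, "C": 7100, "D": 9900, "E": 15300}

def place(fragment_markers, pmap):
    """Return the (leftmost, rightmost) map positions of a fragment's markers."""
    positions = [pmap[m] for m in fragment_markers]
    return min(positions), max(positions)

# A fragment S that carries markers B and C lies over this part of the map.
span_of_s = place({"B", "C"}, physical_map)
```

With these assumed positions, `span_of_s` is `(4800, 7100)`.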
How are markers and physical maps created?
Generally one uses fingerprinting techniques such as restriction enzymes and hybridization experiments
to determine overlaps of fragments or to order non-overlapping fragments.
8.2 Restriction sites
A restriction enzyme recognizes a specific short sequence of nucleotides, called a restriction site, and
cuts a DNA molecule at every position that displays this sequence.
For example, EcoRI recognizes the sequence:
5’ ... G A A T T C ... 3’
3’ ... C T T A A G ... 5’
in a DNA molecule and cuts it after the G:
5’ ... G           A A T T C ... 3’
3’ ... C T T A A           G ... 5’
Restriction enzymes (also called restriction endonucleases) were discovered by Nathans, Arber and
Smith, for which they received the Nobel Prize in Medicine in 1978.
EcoRI recognizes a DNA word of length 6, so in a random sequence we expect to see one EcoRI cut
site about every 4^6 = 4096 positions, that is, one EcoRI site appears approximately every 4kb.
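The same estimate can be computed for any recognition sequence. The sketch below assumes a uniform random sequence with independent bases, and treats the wildcard N as matching any base; the function name `expected_spacing` is chosen here for illustration.

```python
def expected_spacing(site):
    """Expected distance between occurrences of `site` in a uniform random sequence."""
    spacing = 1
    for base in site:
        # A fixed base matches with probability 1/4, the wildcard N always matches.
        spacing *= 1 if base == "N" else 4
    return spacing

ecori = expected_spacing("GAATTC")    # 4**6 = 4096
hinfi = expected_spacing("GANTC")     # 4**4 = 256
```

Real genomes are not uniform random sequences, so these values are only rough guides.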
Application of a restriction enzyme to a target DNA molecule is called digestion.
The following table lists some of the most commonly used restriction enzymes:
Enzyme    Source                        Recognition sequence   Cut
EcoRI     Escherichia coli              5'GAATTC               5'---G  AATTC---3'
                                        3'CTTAAG               3'---CTTAA  G---5'
BamHI     Bacillus amyloliquefaciens    5'GGATCC               5'---G  GATCC---3'
                                        3'CCTAGG               3'---CCTAG  G---5'
HindIII   Haemophilus influenzae        5'AAGCTT               5'---A  AGCTT---3'
                                        3'TTCGAA               3'---TTCGA  A---5'
MstII     Microcoleus species           5'CCTNAGG              5'---CC  TNAGG---3'
                                        3'GGANTCC              3'---GGANT  CC---5'
TaqI      Thermus aquaticus             5'TCGA                 5'---T  CGA---3'
                                        3'AGCT                 3'---AGC  T---5'
NotI      Nocardia otitidis             5'GCGGCCGC             5'---GC  GGCCGC---3'
                                        3'CGCCGGCG             3'---CGCCGG  CG---5'
HinfI     Haemophilus influenzae        5'GANTC                5'---G  ANTC---3'
                                        3'CTNAG                3'---CTNA  G---5'
AluI*     Arthrobacter luteus           5'AGCT                 5'---AG  CT---3'
                                        3'TCGA                 3'---TC  GA---5'
* = blunt ends
Digesting a DNA sequence with a restriction enzyme chops it into fragments, which can be separated
by length on an electrophoresis gel.
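A complete digest can be simulated on a string. The sketch below is a simplification: it assumes a linear molecule, scans only the given strand, and takes the cut offset within the site as a parameter (1 for EcoRI, which cuts after the G). The function name `digest` and the test sequence are made up.

```python
def digest(seq, site, cut_offset):
    """Return the fragment lengths of a complete digest of `seq`, left to right."""
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)       # position of the cut on this strand
        start = seq.find(site, start + 1)     # look for the next occurrence
    bounds = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]

# Two EcoRI sites in a short, invented sequence of length 18.
fragments = digest("AAGAATTCCCCGAATTCT", "GAATTC", 1)
```

Here `fragments` is `[3, 9, 6]`; a gel would report these lengths without their order.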
To build a restriction map, different biochemical techniques are used to derive information about the
map and then combinatorial methods are used to reconstruct the map from that data.
8.3 Restriction site mapping
The restriction mapping approach involves first digesting the given target sequence with one or more
restriction enzyme(s) and then solving a variant of the following problem:
Restriction Mapping problem: For a set X of points on the line, let ∆X = {|x1 −x2 | : x1 , x2 ∈ X}
denote the multiset of all pairwise distances between points in X. In the restriction mapping problem,
a subset E ⊆ ∆X (of experimentally obtained fragment lengths) is given and the task is to reconstruct
X from E.
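The multiset ∆X is easy to compute from a known point set, which is useful for checking candidate solutions. A minimal sketch, with the function name `delta` and the point set chosen only for illustration:

```python
from itertools import combinations

def delta(points):
    """Sorted multiset of all pairwise distances between the given points."""
    return sorted(abs(a - b) for a, b in combinations(points, 2))

distances = delta([0, 2, 7, 10])
```

For these four points, `distances` is `[2, 3, 5, 7, 8, 10]`. The hard direction is the inverse one, from distances back to points.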
Example:
We will discuss three different formulations of the problem of building a physical map using restriction
enzymes, namely:
1. the double digest problem,
2. the partial digest problem, and
3. the simplified partial digest problem.
8.4 The double digest problem
The idea here is to use two different restriction enzymes A and B that recognize different sites.
We make multiple copies of the target DNA sequence S and then perform three separate reactions:
1. a complete digest of S using A,
2. a complete digest of S using B, and
3. a complete digest of S using both A and B.
In all three cases we then run the resulting DNA fragments in a gel to obtain the multisets of all
resulting fragment lengths, denoted by ∆A, ∆B and ∆AB, respectively.
For example, consider the following digestions:
[Figure: a 16kb target sequence on an axis labeled 0kb, 5kb, 10kb, 15kb, drawn three times: cut by A alone, by B alone, and by A and B together.]
We obtain the following multisets of fragment lengths (ordered non-decreasingly):
1. ∆A = {2kb, 4kb, 5kb, 5kb},
2. ∆B = {1kb, 6kb, 9kb}, and
3. ∆AB = {1kb, 1kb, 2kb, 3kb, 4kb, 5kb}.
Double Digest Problem (DDP): Given ∆A, ∆B and ∆AB for a target sequence S, infer the
positions of all A and B restriction sites in S.
• The DDP is NP-hard.
• All algorithms have problems with more than 10 restriction sites for each enzyme.
• A solution may not be unique and the number of solutions grows exponentially.
• A positive feature of the DDP problem is that the experiments are easy to conduct.
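For very small instances the DDP can be solved by exhaustive search over all fragment orders. The sketch below is such a brute-force search, not one of the algorithms referred to above; it assumes that A and B sites never coincide, and all function names are chosen here for illustration.

```python
from itertools import accumulate, permutations

def sites_of(fragments):
    """Cut positions implied by fragment lengths read from left to right."""
    return tuple(accumulate(fragments))[:-1]

def fragments_of(sites, total):
    """Sorted fragment lengths produced by cutting [0, total] at `sites`."""
    bounds = [0, *sorted(sites), total]
    return sorted(right - left for left, right in zip(bounds, bounds[1:]))

def solve_ddp(d_a, d_b, d_ab):
    """Return all (A sites, B sites) pairs consistent with the three digests."""
    total = sum(d_a)
    target = sorted(d_ab)
    solutions = set()
    for order_a in set(permutations(d_a)):
        a_sites = sites_of(order_a)
        for order_b in set(permutations(d_b)):
            b_sites = sites_of(order_b)
            if fragments_of(set(a_sites) | set(b_sites), total) == target:
                solutions.add((a_sites, b_sites))
    return sorted(solutions)

solutions = solve_ddp([2, 4, 5, 5], [1, 6, 9], [1, 1, 2, 3, 4, 5])
```

For the multisets of the example, `solutions` contains A sites at 2, 7, 12 with B sites at 6, 15, and also the mirror image with A sites at 4, 9, 14 and B sites at 1, 10. The running time grows factorially with the number of fragments.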
Lemma 8.4.1 The decision problem of the DDP is NP-complete.
Proof: To prove that the decision problem of the DDP is NP-complete, we need to show that (1) any
proposed solution can be checked in polynomial time and (2) there exists an NP-complete problem that
can be reduced to DDP.
For (1): A proposed solution is easily checked in polynomial time: simply determine
all fragment lengths, sort them and compare them with the given multi-sets.
For (2): Suppose we are given a set of integers X = {x_1, ..., x_l}. The Set Partitioning Problem (SPP)
is to determine whether we can partition X into two subsets X_1 and X_2 such that
\[ \sum_{x \in X_1} x = \sum_{x \in X_2} x. \]
This problem is known to be NP-complete.
We will now show that SPP can be solved as a special case of the DDP, thereby proving
that the decision problem of the DDP is NP-complete.
Let X be the input to the SPP and assume that the sum of all elements of X is even. Then set
\[ \Delta A = X, \qquad \Delta B = \left\{ \tfrac{K}{2}, \tfrac{K}{2} \right\} \text{ with } K = \sum_{x \in X} x, \qquad \text{and} \qquad \Delta AB = \Delta A. \]
(So we assume that the enzyme B produces only two fragments, both of the same length.)
Then there exists a disjoint partition X = X_1 ∪ X_2 with
\[ \sum_{x \in X_1} x = \sum_{x \in X_2} x \]
if and only if there exists a solution for the constructed instance of the DDP.
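The construction can be written out as follows. In the constructed instance the single B site lies at K/2, and ∆AB = ∆A means that it adds no new fragment, so it must coincide with an A site; the instance is therefore solvable exactly when some ordering of the A fragments has a prefix that sums to K/2. The sketch checks this by brute force and is meant only to illustrate the reduction; the function names are hypothetical.

```python
from itertools import accumulate, permutations

def spp_to_ddp(xs):
    """Build the DDP instance (dA, dB, dAB) for a set partitioning input."""
    k = sum(xs)
    if k % 2:
        raise ValueError("the sum of all elements must be even")
    return list(xs), [k // 2, k // 2], list(xs)

def special_ddp_solvable(d_a, d_b, d_ab):
    """Decide a DDP instance of the special shape produced by spp_to_ddp."""
    if sorted(d_ab) != sorted(d_a) or len(d_b) != 2 or d_b[0] != d_b[1]:
        raise ValueError("not an instance produced by the reduction")
    half = d_b[0]
    # Some left-to-right order of the A fragments must reach exactly K/2.
    return any(half in accumulate(order) for order in permutations(d_a))

yes_instance = special_ddp_solvable(*spp_to_ddp([3, 1, 1, 2, 2, 1]))
no_instance = special_ddp_solvable(*spp_to_ddp([1, 1, 4]))
```

The first input can be split into {3, 2} and {1, 1, 2, 1}, so `yes_instance` is `True`; for {1, 1, 4} no prefix sums to 3, so `no_instance` is `False`.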
8.5 The partial digest problem
In this approach, we only apply one type of restriction enzyme to our target sequence S. However,
we perform many different experiments on copies of S, varying the reaction time.
This produces fragments of many different lengths that are not seen when performing a complete
digestion. The goal is to produce at least one fragment for every pair of restriction sites in the target
sequence.
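For example:
[Figure: a 16kb target sequence with restriction sites at 5kb, 10kb and 12kb; the axis is labeled 0kb, 5kb, 10kb, 12kb, 16kb.]
The multiset of all fragment lengths is αA = {2, 4, 5, 5, 6, 7, 10, 11, 12, 16}, including the length of the
sequence.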
Partial Digest Problem (PDP): Given αA, the multi-set of all possible fragment lengths obtained
from a target sequence S using the partial digest protocol, infer the positions of all restriction sites of
A in S.
• No polynomial time algorithm is known for PDP. In fact, the complexity of PDP is an open
problem.
• S. Skiena [1] devised a simple backtracking algorithm that performs well in practice, but may
require exponential time.
• This approach is not a popular mapping method, as it is difficult to reliably produce all pairwise
distances between restriction sites.
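The backtracking idea can be sketched as follows. This is a common textbook formulation, written here under the assumption of exact, complete data; it is not claimed to match the published algorithm in every detail. The largest unexplained distance must connect a new point to one of the two ends, so each step tries at most two positions.

```python
from collections import Counter

def partial_digest(lengths):
    """Return all point sets {0, ..., L} whose pairwise distances equal `lengths`."""
    dist = Counter(lengths)
    width = max(dist)            # the longest fragment is the whole sequence
    dist[width] -= 1
    points = [0, width]
    solutions = []

    def try_point(y):
        # Distances from y to every point placed so far must still be available.
        need = Counter(abs(y - x) for x in points)
        if all(dist[d] >= c for d, c in need.items()):
            dist.subtract(need)
            points.append(y)
            place()
            points.pop()
            dist.update(need)    # undo: Counter.update adds the counts back

    def place():
        open_lengths = [d for d, c in dist.items() if c > 0]
        if not open_lengths:
            solutions.append(sorted(points))
            return
        y = max(open_lengths)    # largest unexplained distance touches one end
        try_point(y)             # point at distance y from the left end
        if y != width - y:
            try_point(width - y) # or at distance y from the right end

    place()
    return solutions

maps = partial_digest([2, 4, 5, 5, 6, 7, 10, 11, 12, 16])
```

For the example multiset, `maps` contains the point set 0, 5, 10, 12, 16 and its mirror image 0, 4, 6, 11, 16.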
8.6 The simplified partial digest problem
We now discuss the “simplified partial digest problem” [2].
Given a target sequence S and a single restriction enzyme A, two different experiments are performed
on two sets of copies of S:
1. In the short experiment, the time span is chosen so that each copy of the target sequence is cut
precisely once by the restriction enzyme.
2. In the long experiment, a complete digest of S by A is performed.
[1] Skiena, SS and Sundaram, G (1994) A partial digest approach to restriction site mapping. Adv Appl Math 12:412-427
[2] Blazewicz J et al. Construction of DNA restriction maps based on a simplified experiment. Bioinformatics. 2001
May;17(5):398-404.
Let Γ = {γ_1, ..., γ_{2N}} be the multi-set of all fragment lengths obtained by the short experiment, and
let Λ = {λ_1, ..., λ_{N+1}} be the multi-set of all fragment lengths obtained by the long experiment,
where N is the number of restriction sites in S.
Example:
Suppose we are given these (unknown) restriction sites (in kb) on a target sequence of length 16kb:
2, 8, 9 and 13.
The short experiment yields the fragment pairs (2, 14), (8, 8), (9, 7) and (13, 3).
Hence, Γ = {2kb, 14kb, 8kb, 8kb, 9kb, 7kb, 13kb, 3kb}.
The long experiment yields Λ = {2kb, 6kb, 1kb, 4kb, 3kb}.
8.6.1 An algorithm for SPDP
In the following we assume that Γ = ⟨γ_1, ..., γ_{2N}⟩ is sorted in non-decreasing order.
For each pair of fragment lengths γ_i and γ_{2N−i+1}, we have γ_i + γ_{2N−i+1} = L, where L is the length of
S.
Each such pair {γ_i, γ_{2N−i+1}} of complementary lengths corresponds to precisely one restriction site in
the target sequence S, which is either at position γ_i or at position γ_{2N−i+1}.
Let P_i = ⟨γ_i, γ_{2N−i+1}⟩ and P_{2N−i+1} = ⟨γ_{2N−i+1}, γ_i⟩ denote the two possible orderings of the pair
{γ_i, γ_{2N−i+1}}. Of any such ordered pair P = ⟨a, b⟩ we call the first component a the prefix of P.
By Q = {q_1, q_2, ..., q_N} we denote a set of complementary fragment permutations, such that q_i = P_i
or q_i = P_{2N−i+1}.
We obtain a set X of putative restriction site positions from Q as follows: For each complementary
pair {γ_i, γ_{2N−i+1}}, we choose one of the two possible orderings P_i and P_{2N−i+1}, and then add the
corresponding prefix to X.
Any such ordered choice X = ⟨x_1, ..., x_N⟩ of putative restriction sites gives rise to a multi-set of
integers R = {r_1, ..., r_{N+1}}, with
\[ r_i := \begin{cases} x_i & \text{if } i = 1, \\ x_i - x_{i-1} & \text{if } i = 2, \dots, N, \\ L - x_N & \text{if } i = N + 1. \end{cases} \]
We can now formulate: Simplified Partial Digest Problem (SPDP): Given multi-sets Γ and Λ
of fragment lengths, determine a choice of orderings of all complementary fragment lengths in Γ such
that the arising set R equals Λ.
Example: From Γ = {2kb, 14kb, 8kb, 8kb, 9kb, 7kb, 13kb, 3kb}
we obtain the pairs
P_1 = ⟨2, 14⟩, P_8 = ⟨14, 2⟩, P_2 = ⟨3, 13⟩, P_7 = ⟨13, 3⟩, P_3 = ⟨7, 9⟩, P_6 = ⟨9, 7⟩, P_4 = ⟨8, 8⟩, P_5 = ⟨8, 8⟩.
Because of the long experiment
Λ = {2kb, 6kb, 1kb, 4kb, 3kb}
we obtain Q = {P_1, P_7, P_6, P_4} and X = {2, 8, 9, 13}, from which we get R = {2, 6, 1, 4, 3}, our
restriction site map:
[Figure: target sequence of length 16 with restriction sites at 2, 8, 9 and 13.]
We will now formulate an algorithm that generates all possible solutions of the problem.
Given the ordered list of all ordered pairs Π = ⟨P_1, ..., P_{2N}⟩, the algorithm generates all possible
choices of ordered pairs. More precisely, when called with variable i, it considers both alternatives P_i
and P_{2N−i+1}.
During a call, the current list of restriction sites X and the list R = ⟨r_1, ..., r_k, r_{k+1}⟩ of all fragment
lengths are passed as parameters.
When processing a new corresponding pair of fragment lengths, the last element of the list R is replaced
by two new fragment lengths that arise because the last fragment is split by the new restriction site.
Algorithm 8.6.1 (SPDP(X = ⟨x_1, ..., x_k⟩, R = ⟨r_1, ..., r_k, r_{k+1}⟩, i))
if k = N and R = Λ then
    print X   // output valid set of restriction sites
else if i ≤ 2N then
    Consider P_i = ⟨a, b⟩
    if b ∉ X then   // haven’t used complement yet
        Set p = a − (L − r_{k+1}) and q = L − a   // new fragment lengths
        if p ∈ Λ then   // new lengths ok
            Set R′ = ⟨r_1, ..., r_k, p, q⟩   // replace old length by new
            Set X′ = ⟨x_1, ..., x_k, a⟩   // add a to set of restriction sites
            Call SPDP(X′, R′, i + 1)   // continue using a in this tree’s lineage
    Call SPDP(X, R, i + 1)   // consider other alternative
end
Let P_1 = ⟨a, b⟩ be the first ordered pair. Call the algorithm thus: SPDP(X = ⟨a⟩, R = ⟨a, b⟩, i = 2).
The worst case running time complexity of this algorithm is exponential. However, it seems to work
quite well in practice.
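A runnable sketch of this search in Python is given below. It follows the structure of Algorithm 8.6.1 but deviates in two labeled ways, both assumptions of this sketch rather than part of the text: it tracks which complementary pair has already been used by pair index instead of testing b ∉ X (so that repeated lengths are handled), and it tests p ∈ Λ with multiplicities. As in the text, the search starts with the prefix of P_1.

```python
from collections import Counter

def spdp(gamma, lam):
    """Enumerate restriction site lists consistent with the short and long digest."""
    g = sorted(gamma)
    m = len(g)                          # m = 2N
    n = m // 2
    if m < 2 or m % 2:
        raise ValueError("gamma must contain 2N lengths with N >= 1")
    total = g[0] + g[-1]                # L, the length of the target sequence
    if any(g[i] + g[m - 1 - i] != total for i in range(n)):
        raise ValueError("short-digest lengths are not complementary")
    target = sorted(lam)
    left = Counter(lam)                 # long-digest lengths not yet explained
    left[g[0]] -= 1
    sites = [g[0]]                      # prefix of P_1 is the first site
    used = {0}                          # indices of pairs already decided
    found = set()

    def extend(i):
        if len(sites) == n:
            bounds = [0, *sites, total]
            if sorted(b - a for a, b in zip(bounds, bounds[1:])) == target:
                found.add(tuple(sites))
            return
        if i == m:
            return
        pair = min(i, m - 1 - i)        # g[i] and g[m-1-i] belong together
        p = g[i] - sites[-1]            # fragment between last site and g[i]
        if pair not in used and p > 0 and left[p] > 0:
            used.add(pair); left[p] -= 1; sites.append(g[i])
            extend(i + 1)               # continue with g[i] as a site
            sites.pop(); left[p] += 1; used.remove(pair)
        extend(i + 1)                   # consider the other alternative

    extend(1)
    return sorted(found)

maps = spdp([2, 14, 8, 8, 9, 7, 13, 3], [2, 6, 1, 4, 3])
```

For the example data, `maps` is `[(2, 8, 9, 13)]`. Because the first site is fixed at γ_1, the mirror image of a map is not reported separately.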
This algorithm is designed for ideal data. In practice there are two problems:
1. Fragment length determination by gels leads to imprecise measurements, with errors of about 2 to 7%
at best, in good experiments. This can be addressed by using interval arithmetic in the above algorithm.
2. Fragments may be missing. The SPDP does not suffer much from this problem because
both digests are easy to perform. Moreover, the short experiment must give rise to complementary
values and any failure to do so can be detected. The long experiment should give rise to precisely
N + 1 fragments.
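One way to picture the interval arithmetic is to replace every measured length by an interval and to replace the test p ∈ Λ by an overlap test. The sketch below is only an assumption about how this could look; the 5% relative error and all names are chosen for illustration.

```python
def interval(measured, rel_err=0.05):
    """Interval of true lengths compatible with a measurement."""
    return measured * (1 - rel_err), measured * (1 + rel_err)

def subtract(a, b):
    """Interval of all differences x - y with x in a and y in b."""
    return a[0] - b[1], a[1] - b[0]

def overlaps(a, b):
    """True if the two intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Candidate fragment between a site measured at 2.0 and a site measured at 8.1.
p = subtract(interval(8.1), interval(2.0))
fits_6kb = overlaps(p, interval(6.0))
fits_4kb = overlaps(p, interval(4.0))
```

Here the candidate fragment is compatible with a measured 6kb fragment but not with a 4kb one. Note that intervals widen with every subtraction, so the pruning becomes weaker than in the exact case.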
8.7 Hybridization mapping
Hybridization mapping makes use of the fact that we can easily test for the presence of small genomic
sequences (called probes) in a given clone by a hybridization experiment.
There are two sorts of probes, namely unique probes (such as so-called STSs, sequence tagged sites)
and non-unique probes (such as restriction sites). The two give rise to different algorithmic problems.
We will concentrate on unique probes.
Given a set of unique probes, two protocols are commonly used, STS content mapping and radiation
hybrid mapping.
FISH (fluorescence in situ hybridization) is a cytogenetic technique used to detect and localize the
presence or absence of specific DNA sequences on chromosomes. FISH uses fluorescent probes that
bind to only those parts of the chromosome with which they show a high degree of sequence similarity. Fluorescence microscopy can be used to find out where the fluorescent probe bound to the
chromosomes.
Source: http://en.wikipedia.org
8.7.1 STS content mapping
An STS is defined as a short (200-500 bp) DNA sequence that occurs exactly once in a given genome.
In an STS hybridization experiment we use a collection of clones (a clone library) and we measure
which STS binds to which clone. The fingerprint of a clone is the set of probes that hybridize to it.
Note that we do not know the exact locations of the probes in a clone. By comparing the fingerprints
of different clones we hope to deduce the order of the probes in the clones and also the order of the
clones along the genome.
Example: If clone A has fingerprint {x, y, z} and clone B has fingerprint {x, z, w}, then we may
conclude the two clones come from an overlapping region of the genome.
Let P = {p1 , . . . , pm } be a set of unique probes (e.g. STSs). Let S = {S1 , . . . , Sn } be a set of clones
sampled from a common genomic region. Let P (Si ) denote the set of probes that are contained in
(hybridize to) clone Si .
STS Content Mapping Problem: Find a permutation π of the probe set P such that for every
clone S_i we have
P(S_i) = {p_{π(q)}, ..., p_{π(r)}},
for some 1 ≤ q ≤ r ≤ m.
In order to solve this problem we will translate it into a problem formulated in terms of incidence
matrices:
Definition 8.7.1 (Incidence matrix) For a set of clones S = {S_1, ..., S_n} and a set P =
{p_1, ..., p_m} of unique probes, the n × m binary incidence matrix M is defined as
\[ M_{ij} = \begin{cases} 1 & \text{if probe } j \text{ hybridizes to clone } i, \\ 0 & \text{otherwise.} \end{cases} \]
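Building the incidence matrix from clone fingerprints is straightforward. The sketch below uses the two clones from the earlier example, with fingerprints {x, y, z} and {x, z, w}; the function name and the choice of probe order are assumptions made for illustration.

```python
def incidence_matrix(fingerprints, probes):
    """Rows follow the sorted clone names, columns follow the given probe order."""
    return [
        [1 if probe in fingerprints[clone] else 0 for probe in probes]
        for clone in sorted(fingerprints)
    ]

fingerprints = {"A": {"x", "y", "z"}, "B": {"x", "z", "w"}}
matrix = incidence_matrix(fingerprints, ["w", "x", "y", "z"])
shared = fingerprints["A"] & fingerprints["B"]   # probes suggesting an overlap
```

With the column order w, x, y, z the rows are `[0, 1, 1, 1]` for clone A and `[1, 1, 0, 1]` for clone B, and `shared` is `{"x", "z"}`.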
For example, assume we are given the following incidence matrix:
         probe
clone    A  B  C  D  E  F  G
1        0  1  0  0  1  0  0
2        0  1  0  0  0  1  0
3        1  0  1  0  0  1  1
4        1  0  1  0  0  0  0
5        1  0  1  0  0  1  0
6        0  0  0  1  0  0  1
The probes A, . . . , G can be permuted as follows:
         probe
clone    E  B  F  C  A  G  D
1        1  1  0  0  0  0  0
2        0  1  1  0  0  0  0
3        0  0  1  1  1  1  0
4        0  0  0  1  1  0  0
5        0  0  1  1  1  0  0
6        0  0  0  0  0  1  1
This implies the following layout:
[Figure: the probes in the order E, B, F, C, A, G, D along the genome, with clones 1 to 6 drawn as overlapping intervals covering the probes they contain.]
Now the ones in every row are consecutive, and we say that the matrix has the Consecutive
Ones property. The solution(s) can be computed in linear time and represented in a data structure
called a PQ-tree, a tree that represents all possible solutions (Booth and Lueker, 1976).
Not only have we thus ordered all clones, but we have also determined an ordering of the probes.
Using the incidence matrix, we can reformulate the STS content mapping problem:
STS content mapping problem: Given an n × m incidence (hybridization) matrix, determine
whether it has the Consecutive Ones property.
8.7.2 Consecutive Ones Problem
The Consecutive Ones Problem (C1P) can be formulated in a number of different, equivalent ways,
e.g.:
1. Given a binary matrix M , can we find a permutation of the columns of M such that in every
row, all 1’s occur in a consecutive interval, called a block?
2. Given a set of objects P and a system S of subsets of P, can we define an ordering ≤ of P such
that for every set C ∈ S there exist two elements q, r ∈ C with C = {x ∈ P | q ≤ x ≤ r}?
3. Given a graph G = (V, E) with vertex set V and edge set E, can we assign an interval I_v to each
node v ∈ V such that {v, w} ∈ E ⇔ I_v ∩ I_w ≠ ∅?
Example:
1. Matrix:
   1  2  3  4  5  6  7
   1  0  0  1  0  0  0
   0  0  1  0  1  0  1
   0  1  0  1  0  1  1
2. Set system: {1, 4}, {3, 5, 7}, {2, 4, 6, 7}
3. Graph: [Figure: graph on the vertices 1 to 7.]
A solution: 3, 5, 7, 2, 6, 4, 1
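For a handful of elements, the set-system formulation can be checked by trying all orderings. The sketch below is a brute-force illustration only, not the linear-time PQ-tree method; the function names are made up, and all sets are assumed to be non-empty.

```python
from itertools import permutations

def is_consecutive(subset, order):
    """True if the elements of `subset` occupy adjacent positions in `order`."""
    positions = sorted(order.index(x) for x in subset)
    return positions[-1] - positions[0] + 1 == len(positions)

def c1p_orderings(elements, sets):
    """Yield every ordering in which each set forms one block."""
    for order in permutations(elements):
        if all(is_consecutive(s, order) for s in sets):
            yield order

sets = [{1, 4}, {3, 5, 7}, {2, 4, 6, 7}]
valid = set(c1p_orderings(range(1, 8), sets))
```

The ordering 3, 5, 7, 2, 6, 4, 1 from the example is contained in `valid`, together with its reversal and further orderings that keep each set contiguous.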
8.7.3 STS content mapping - errors
We have seen that determining whether a matrix has the C1P can be done in polynomial time.
However, if errors are present, then the corresponding matrix might not have a C1P solution.
Unfortunately, the hybridization experiments are very error-prone, usually suffering from:
• false positives: reporting that a clone contains a specific probe, when in fact it does not,
• false negatives: reporting that a clone does not contain a specific probe, when in fact it does,
and
• chimeras: these are false clones built from different pieces of DNA that come from unrelated
and distant parts of the genome and thus falsely bring together distant probes.
The following matrix depicts a correctly ordered probe set with a false negative in clone 3, a false
positive in clone 1, and a possible chimeric clone 6:
         probe
clone    E  B  F  C  A  G  D
1        1  1  0  0  1  0  0
2        0  1  1  0  0  0  0
3        0  0  1  0  1  1  0
4        0  0  0  1  1  0  0
5        0  0  1  1  1  0  0
6        1  0  0  0  0  1  1
8.7.4 Optimal Consecutive Ones Problem
Thus in practice (since errors are very likely to occur), we must consider the Optimal Consecutive
Ones Problem:
1. Given a binary matrix M , find a permutation that minimizes the number of blocks of consecutive
ones.
2. Given a set of objects P and a system S of subsets of P, define an ordering ≤ of P that minimizes
the total number of maximal subsets C′ of sets C ∈ S of the form C′ = {x ∈ P | q ≤ x ≤ r},
with q ≤ r.
3. Given a graph G = (V, E) with vertex set V and edge set E, assign an interval I_v to each node
v ∈ V such that {v, w} ∈ E implies I_v ∩ I_w ≠ ∅ and the number of pairs v, w ∈ V with {v, w} ∉ E
and I_v ∩ I_w ≠ ∅ is minimal.
It turns out that this problem is equivalent to solving the Traveling Salesman Problem (TSP), which
we know is NP-hard!
However, efficient approximation algorithms exist that can be applied.
Before we discuss this further, we describe a second experimental method, radiation hybrid mapping,
since it produces similar data.
8.7.5 Radiation hybrid mapping
In radiation hybrid mapping, a target (e.g. human) chromosome is irradiated and broken into a small
number of fragments. These non-overlapping fragments are fused into a host cell (e.g. a hamster cell),
and replication then produces a cell line. Each cell line thus contains a pool of 5 to 10 large, disconnected,
non-overlapping fragments of target DNA.
This is repeated several times, each time with a different random irradiation result.
Finally, it is determined which cell lines hybridize to which probes. This is very similar to STS content
mapping, except that we do not know how many fragments a cell line contains or to which
fragment a given probe actually hybridizes.
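[Figure: radiation hybrid experiment showing the probes F, D, E, B, A, C, G on the target chromosome and the fragment pools of cell lines 1 to 4.]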
The following matrix shows the data from the above depicted radiation hybrid experiment:
What is a sensible objective function to help find the correct permutation of probes?
We can assume that probes that lie close to each other in the target genome are more likely to be
contained in the same fragment (within a pool). Thus, again we aim to minimize the total number of
blocks of consecutive ones.
8.7.6 TSP solution for Optimal Consecutive Ones
We can reduce the problem of finding the probe permutation with the minimum number of blocks of
consecutive ones to the Traveling Salesman Problem as follows:
1. Define a weighted graph G = (V, E) with V = {s, p1 , . . . , pk } where pi is a node for each probe
i and s is a special node.
2. E contains an edge from s to each pi and an edge for each pair of probes.
3. The weight of the edges from s to the pi is the number of ones in the corresponding column of
the matrix.
4. The weight of any other edge (pi , pj ) is the Hamming distance between the columns corresponding
to pi and pj .
For the sake of exposition we look at a submatrix of the above example:
c/p  E  B  C  A
1    1  1  0  1
2    0  1  1  1
3    0  1  1  1
4    1  1  0  0
This yields the following Hamming distances H between the columns:
H    E  B  C  A
E    0
B    2  0
C    4  2  0
A    3  1  1  0
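This translates into the following graph G:
[Figure: complete graph G on the nodes S, E, B, C, A with edge weights S-E 2, S-B 4, S-C 2, S-A 3, E-B 2, E-C 4, E-A 3, B-C 2, B-A 1, C-A 1.]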
An optimal tour is S, C, A, B, E, S:
[Figure: the graph G with the tour S, C, A, B, E, S highlighted.]
The correspondingly permuted matrix is:
c/p  C  A  B  E
1    0  1  1  1
2    1  1  1  0
3    1  1  1  0
4    0  0  1  1
This tour indeed gives the correct ordering and the blocks of ones happen to be gap-free.
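The construction can be checked on the example submatrix by brute force. The sketch below computes the tour weight for a probe order (the tour starts and ends at the special node S), counts the blocks of consecutive ones, and searches all orders; the function names are chosen for illustration, and exhaustive search is only feasible for a few probes.

```python
from itertools import permutations

# Columns of the example submatrix, one entry per clone 1..4.
columns = {"E": [1, 0, 0, 1], "B": [1, 1, 1, 1], "C": [0, 1, 1, 0], "A": [1, 1, 1, 0]}

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def tour_weight(order):
    """Weight of the tour S, order[0], ..., order[-1], S."""
    weight = sum(columns[order[0]]) + sum(columns[order[-1]])   # edges to S
    weight += sum(hamming(columns[p], columns[q]) for p, q in zip(order, order[1:]))
    return weight

def blocks(order):
    """Total number of maximal runs of ones over all rows."""
    count = 0
    for r in range(4):
        row = [columns[p][r] for p in order]
        count += sum(1 for i, x in enumerate(row) if x == 1 and (i == 0 or row[i - 1] == 0))
    return count

best = min(permutations(columns), key=tour_weight)
```

For the order C, A, B, E the tour weight is 8 and there are 4 blocks, one per row, and no order has a smaller weight. In this example the weight equals twice the number of blocks for every order, since each block is entered once and left once along the tour.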
Theorem 8.7.2 A TSP tour of weight w corresponds to a probe permutation with exactly w/2 blocks
of consecutive ones.
8.8 Summary
There are different protocols to determine a map, each suitable in different situations. Each protocol
has an associated algorithmic problem. Most of them are already NP-hard in their exact formulation.
Errors need to be taken into account.
Physical mapping comes in two flavors:
1. Restriction mapping. Restriction enzymes are used to digest the target into smaller pieces.
Using a specific protocol, certain sets of distances between restriction sites are constructed. The
computational challenge is to determine the location of the restriction sites from the distances.
Sometimes restriction mapping is also used to determine whether two clones overlap.
2. Hybridization mapping. The goal here is to determine the order of overlapping clones. The
hybridization signature of short (possibly unique) sequences is determined. This approach is
often used to determine a minimal tiling path of clones in a sequencing project.
