Nov. 2004, Vol.19, No.1, pp.{
J. Comput. Sci. & Technol.
Integer Programming Models for Computational Biology Problems
Giuseppe Lancia
D.I.M.I, Universita di Udine, Viale delle Scienze 206, 33100 Udine, Italy
E-mail: [email protected]
Received May 26, 2003; revised August 21, 2003.
Abstract Recent years have seen an impressive increase in the use of Integer Programming models for the solution of optimization problems originating in Molecular Biology. In this survey, we describe some of the most successful Integer Programming approaches, while giving a broad overview of application areas in modern Computational Molecular Biology.
Keywords integer programming, computational biology, sequence alignment, genome rearrangements, protein structures
1 Introduction
Computational (Molecular) Biology is a relatively young science, which has experienced tremendous growth in the last decade. A sign of this growth is that a steadily increasing number of people with different scientific backgrounds have been approaching this field in recent years. It is now common to see papers on Computational Biology problems written not only by computer scientists and biologists, but also by statisticians, physicists and mathematicians, pure and applied. In particular, the apparently endless capability of Computational Biology to provide exciting combinatorial problems has become very appealing to the Mathematical Programming/Operations Research community. In fact, many of the problems considered can be cast as optimization problems, to be attacked by standard optimization techniques.
The typical approach to a computational biology problem is as follows. First, a (rather difficult) modeling analysis is performed, in order to map the biological process under investigation into one or more mathematical objects (such as graphs, sets, arrays, ...). After this transformation, the original question concerning the biological data becomes a mathematical question about the objects of the chosen representation. Sometimes this question can be simply cast as "does there exist at least one object with such and such mathematical properties?". This is a feasibility, or existence, problem (which can be solved, e.g., by using powerful arguments from combinatorics, such as the pigeonhole principle, or from probabilistic analysis, or, more likely, by describing an algorithm which builds a solution, if one exists, and proving its correctness).
However, many other times (most times, we would say), each object, representing a tentative solution, has an associated numerical weight, intended to measure its quality. The weight is proportional, loosely speaking, to the probability that the solution is the "true" answer to the original biological problem. The higher the weight, the more reliable a solution is. Since the problem was mapped into the domain of discrete mathematics, the weight too is represented by some mathematical function f(x). In order to find the best answer to the original biological problem, one then has to solve the following optimization problem: find a solution x which maximizes f(x) over all possible solutions.
Optimization problems have been a subject of study for decades in the mathematical community. In particular, the whole field of Operations Research is devoted to the subject, and many very successful techniques have been developed over the past years for difficult (NP-hard) optimization problems. For such difficult problems, rigorous methods have been devised that are capable of finding, in practice (i.e., within a "small" time), the optimal solution for data derived from real-life applications.
The most successful approach to the exact solution of hard optimization problems is probably Integer Linear Programming (for brevity, Integer Programming hereafter), which has been applied profitably to hundreds of problems from real domains[1-3]. The Integer Programming approach consists in formulating a problem as the maximization (minimization) of a linear function of some integer variables and then solving it via branch and bound, where the upper (lower) bound comes from the Linear Programming (LP) relaxation. The LP relaxation, in which the same function is optimized but the variables are not restricted to be integer, is polynomially solvable. A formulation is as successful as its LP bound is strong. That is, if the value of the objective function over the relaxation is close to the value over the integers, then the bound, and hence the pruning of the search space, is effective. It is often the case that, in order to obtain better bounds, the formulation is reinforced by the use of additional constraints, called cuts, and the resulting approach is known as Branch-and-Cut. Cuts are constraints that do not eliminate any feasible integer solution, but make the space of fractional solutions smaller, this way improving the value of the LP bound.
Besides its use to solve a problem exactly, the
Integer Programming approach can also be applied
to compute provably good feasible solutions. In
fact, at any point of the search process, by using
the LP bound, we have an estimate of the maximum error between the value of the current solution
and the optimum. The search can then be stopped
as soon as the maximum error becomes acceptably
small. Furthermore, approximation-guarantee algorithms based on Integer Programming
formulations are also possible. These are polynomial algorithms which exploit the LP relaxation of
a problem to find a feasible solution within a guaranteed maximum error from the optimum. We will
give examples of results of all these types in the
following sections.
The use of Integer Programming approaches for Computational Biology problems was introduced, starting about seven years ago, by several people with a Mathematical Programming background (among them A. Caprara, B. Carr, D. Gusfield, R. Karp, J. Kececioglu, G. Lancia, H. P. Lenhof, P. Mutzel, R. Ravi, K. Reinert and M. Vingron) and has today become a common tool for approaching a new problem.
In this survey we will review some of the main results that were obtained with the use of Integer Programming approaches. The problems, coming from different areas, touch almost every aspect of modern Computational Biology.
The material is organized in sections. In each section, we describe one or more problems of the same nature, and the Integer Programming approaches that were proposed in the literature for their solution. Many of these approaches are based on Integer Programming formulations with an exponential number of inequalities (or variables), that are solved by Branch-and-Cut (respectively, Branch-and-Price). Section 2 is devoted to a brief description of the general principles of the theory of Integer Programming.
The remainder of the paper is then organized as follows. In Section 3 we outline some Integer Programming approaches to Alignment Problems. We review several problems related to the alignment of genomic objects, such as DNA or RNA sequences and secondary or tertiary structures. We distinguish three types of alignment:
(i) Sequence vs Sequence Alignment: This subsection is devoted to the problem of aligning genomic sequences. First, we discuss the classical multiple alignment problem, studied by Reinert et al.[4] and Kececioglu et al.[5], who developed a Branch-and-Cut algorithm for its solution. Then, we describe a heuristic approach for the same problem, by Fischetti et al.[6]. The approach generalizes an approximation algorithm by Gusfield[7] and uses a reduction to the Minimum Routing Cost Tree problem, solved by Branch-and-Price. Closely related to the alignment problem are the so-called "center" problems, in which one tries to determine a sequence as close as possible to, or as far as possible from, a set of input sequences. For this problem, we describe an approximation algorithm based on LP relaxation and Randomized Rounding, by Ben-Dor et al.[8]. We also discuss some works that, developing similar ideas, obtain PTAS for the consensus and the farthest-sequence problems (Li et al.[9], Ma[10], Lanctot et al.[11]).
(ii) Sequence vs Structure Alignment: In this subsection we describe the problem of aligning two RNA sequences, when for one of them the secondary structure is already known. For this problem, we describe a Branch-and-Cut approach by Lenhof et al.[12] and Kececioglu et al.[5].
(iii) Structure vs Structure Alignment: In this subsection we consider the problem of comparing two protein tertiary structures. For this problem, we review a Branch-and-Cut approach, proposed by Lancia et al.[13].
Section 4 is devoted to Genome Rearrangement problems. In this section we describe a Branch-and-Price approach, by Caprara et al.[14,15], for computing the evolutionary distance between two genomes evolved from a common ancestor by inversions of long DNA regions.
In Section 5 we study diversities (polymorphisms) at the genomic level. The SNP (Single Nucleotide Polymorphism) Haplotyping problem consists in determining the values of a set of polymorphic sites in a genome. First, we discuss the single individual haplotyping problem, for which some Integer Programming techniques were employed by Lancia et al.[16] and Lippert et al.[17]. Then, we review two versions of the problem relative to haplotyping entire populations, studied primarily by Gusfield[18-20]. The approach employs a reduction to a graph problem, which is then solved by Integer Programming.
In Section 6 we mention some approaches that
were left out for space reasons[21,22]. Finally, we
show how the Integer Programming approach can
also be applied indirectly, i.e., when a very well
known optimization problem (solved by Integer
Programming techniques) is used to solve a problem in Computational Biology. This is the case,
e.g., for the TSP problem, used to solve the Radiation Hybrid Mapping problem (Agarwala et al.[23] ).
2 Integer Linear Programming Basics
An Integer Linear Programming (IP) problem is defined by an $m \times n$ real constraint matrix $A = (a_{ij})$, a real $1 \times n$ cost vector $c$ and a real $m \times 1$ right-hand-side vector $b$. The problem consists in finding the values for $n$ variables $x = (x_1, \ldots, x_n)$ which optimize (e.g., minimize)
$$\min \sum_{j=1}^{n} c_j x_j \tag{1}$$
subject to
$$\sum_{j=1}^{n} a_{ij} x_j \ge b_i \quad \forall i = 1, \ldots, m \tag{2}$$
and with each $x_j$ being a non-negative integer. This last constraint is written as
$$x \ge 0, \quad x \in \mathbb{Z}^n. \tag{3}$$
It is easy to show that the IP problem is NP-hard. However, its Linear Programming relaxation, obtained by replacing the integrality constraints (3) with
$$x \ge 0 \tag{4}$$
is polynomially solvable[24,25]. Let $z_{IP}$ be the optimum value of (1), (2) and (3) and $z_{LP}$ be the optimum value of (1), (2) and (4). The value $z_{IP} - z_{LP}$ is called the integrality gap of the relaxation. The success of an IP model relies on the integrality gap being as small as possible.
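
As a small worked illustration (not taken from the survey), consider the vertex cover problem on a triangle, written in the form (1)-(3) with one variable per vertex and one constraint per edge:

```latex
\begin{aligned}
\min\ & x_1 + x_2 + x_3 \\
\text{s.t.}\ & x_1 + x_2 \ge 1, \quad x_1 + x_3 \ge 1, \quad x_2 + x_3 \ge 1, \\
& x \ge 0, \quad x \in \mathbb{Z}^3 .
\end{aligned}
```

Any integer solution must pick at least two vertices, so $z_{IP} = 2$. Summing the three constraints gives $2(x_1 + x_2 + x_3) \ge 3$, and the fractional point $x = (1/2, 1/2, 1/2)$ attains this bound, so $z_{LP} = 3/2$. The integrality gap of this formulation on this instance is therefore $z_{IP} - z_{LP} = 1/2$.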
To formulate a combinatorial optimization problem as an IP problem, one has to define a set of variables, linear constraints and a linear objective function, such that the optimal solution to the problem of minimizing (1) subject to (2) and (3) identifies the optimal solution of the original problem. Many times the natural choice for the integer variables requires them to take only the values 0 and 1 (binary variables). These values are used to represent the decision of whether or not to take a certain element as part of the optimal solution.
Sometimes the tightest IP formulations require a very large (exponential with respect to the size of the original problem) number of constraints. A fundamental paper by Grötschel, Lovász and Schrijver[26] states that even exponential-size LP relaxations can be solved in polynomial time, provided that separation can be done in polynomial time. Separation consists in solving the LP with only a subset of the constraints (2), and then checking if any of the constraints that were left out is violated by the optimal solution. If this is the case, the separation procedure finds one such constraint, which is then added to the current LP, to be solved again. By using the ellipsoid method as the LP solver, Grötschel, Lovász and Schrijver showed that the above procedure converges in polynomial time.
The Branch-and-Bound approach for IP formulations in which inequalities are added through separation is called Branch-and-Cut. In fact, each added inequality is called a "cut" since it cuts off an LP solution which is not feasible for all constraints.
An alternative way to obtain a tight IP formulation is sometimes through the use of a very large (exponential with respect to the size of the original problem) number of variables[27]. Variables relate to constraints through duality. The dual of a Linear Program (called the primal) defined by (1), (2) and (4) is the following LP in the variables $y = (y_1, \ldots, y_m)$:
$$\max \sum_{i=1}^{m} b_i y_i \tag{5}$$
subject to
$$\sum_{i=1}^{m} a_{ij} y_i \le c_j \quad \forall j = 1, \ldots, n \tag{6}$$
$$y \ge 0. \tag{7}$$
Note how the roles of variables and constraints are reversed in going from the primal to the dual.
In order to solve an LP formulation with an exponential number of variables in polynomial time, it is sufficient to design a polynomial time pricing procedure. Pricing consists in solving the LP with only a subset of the variables, and then checking if any of the variables that were left out can be used to improve the current optimal solution. If this is the case, the pricing procedure finds one such variable, which is then added to the current LP, to be solved again. Using duality, it can be shown that finding a variable to add to the current LP is the same as finding a violated inequality for the dual of the current LP (i.e., the pricing problem for the primal is the separation problem for the dual).
The Branch-and-Bound approach for IP formulations in which variables are added through pricing is called Branch-and-Price.
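
To make the duality argument concrete, here is the standard reduced-cost test in the notation of (1)-(7). The symbols $y^*$ and $\bar{c}_j$ are introduced here for illustration and do not appear in the original text. Let $y^*$ be an optimal dual solution of the LP restricted to the current subset of variables. A left-out variable $x_j$ can improve the current solution only if its reduced cost is negative:

```latex
\bar{c}_j \;=\; c_j - \sum_{i=1}^{m} a_{ij}\, y^*_i \;<\; 0
\quad\Longleftrightarrow\quad
\sum_{i=1}^{m} a_{ij}\, y^*_i \;>\; c_j ,
```

that is, exactly when $y^*$ violates the dual constraint (6) indexed by $j$. Pricing out a variable of the primal and separating an inequality of the dual are thus the same computation.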
3 Alignment Problems
Oftentimes, in computational biology, one must compare objects which consist of a set of elements arranged in a linearly ordered structure. A typical example is the genomic sequence, which is equivalent to a string over an alphabet (the 4-letter nucleotide alphabet {A, T, C, G}, or the 20-letter amino acid alphabet). Another example is the protein, when this is regarded as a linear chain of residues (and hence with a first and a last residue).
Aligning two (or more) such objects consists in determining subsets of corresponding elements in each. The correspondence must be order-preserving, i.e., if the i-th element of object 1 corresponds to the k-th element of object 2, no element following i in object 1 can correspond to an element preceding k in object 2.
The situation can be described graphically as follows. Given two objects, where the first has $n$ elements, numbered $1, \ldots, n$, and the second has $m$ elements, numbered $1, \ldots, m$, we consider the complete bipartite graph
$$W_{nm} := ([n], [m], L) \tag{8}$$
where $L = [n] \times [m]$ and $[k] := \{1, \ldots, k\}$ $\forall k \in \mathbb{Z}^+$. We hereafter call a pair $(i, j)$ with $i \in [n]$ and $j \in [m]$ a line (since it aligns $i$ with $j$), and we denote it by $[i, j]$. $L$ is the set of all lines. Two lines $[i, j]$ and $[i', j']$, with $[i, j] \ne [i', j']$, are said to cross if either $i' \ge i$ and $j' \le j$, or $i' \le i$ and $j' \ge j$ (by definition, a line does not cross itself). Graphically, crossing lines correspond to lines that intersect in a single point. A matching is a subset of lines no two of which share an endpoint. An alignment is identified by a noncrossing matching, i.e., a matching for which no two lines cross, in $W_{nm}$. Fig.1(a) shows (in bold) an alignment of two objects.
Fig.1. (a) A noncrossing matching (alignment). (b) The
directed grid.
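
The crossing test and the resulting definition of an alignment can be sketched in a few lines of Python. This is a minimal illustration, assuming lines are represented as tuples `(i, j)` and reading the crossing condition as "$i' \ge i$ and $j' \le j$, or $i' \le i$ and $j' \ge j$". The function names `cross` and `is_alignment` are hypothetical.

```python
from itertools import combinations


def cross(l, h):
    """Return True if the two distinct lines l = (i, j) and h = (i2, j2) cross."""
    if l == h:
        return False  # by definition, a line does not cross itself
    (i, j), (i2, j2) = l, h
    return (i2 >= i and j2 <= j) or (i2 <= i and j2 >= j)


def is_alignment(lines):
    """Return True if `lines` is a noncrossing matching.

    Two distinct lines sharing an endpoint satisfy the crossing condition,
    so checking all pairs for crossings also enforces the matching property.
    """
    return not any(cross(l, h) for l, h in combinations(set(lines), 2))
```

For instance, `is_alignment([(1, 1), (2, 3), (4, 4)])` holds, while the pair `(1, 3)`, `(2, 2)` is rejected because the two lines intersect.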
In the following Subsections 3.1, 3.2 and 3.3, we will describe Integer Programming approaches for three alignment problems, i.e., Sequences vs. Sequences (Reinert et al.[4], Kececioglu et al.[5]), Sequences vs. Structures (Lenhof et al.[12], Kececioglu et al.[5]), and Structures vs. Structures (Lancia et al.[13], Caprara and Lancia[31]), respectively. All Integer Programming formulations for these problems contain binary variables $x_l$ to select which lines $l \in L$ define the alignment, and constraints to ensure that the selected lines are noncrossing. Therefore, it is appropriate to discuss this part of the models here, since the results will apply to all the following formulations. We will consider here the case of two objects only. This case can be generalized to more objects, as mentioned in Subsection 3.1.
Let $X_{nm}$ be the set of incidence vectors of all noncrossing matchings in $W_{nm}$. A noncrossing matching in $W_{nm}$ corresponds to a stable set in a new graph, $G_L$ (the Line Conflict Graph), defined as follows. Each line $l \in L$ is a vertex of $G_L$, and two vertices $l$ and $h$ are connected by an edge if the lines $l$ and $h$ cross.
The graph $G_L$ has been studied in Lancia et al.[13], where the following theorem is proved:
Theorem 1.[13] The graph $G_L$ is perfect.
Since a stable set can intersect a clique in at most 1 node, the following family of clique inequalities can be used as constraints for the alignment problem:
$$\sum_{l \in Q} x_l \le 1 \quad \forall Q \in \mathrm{clique}(L) \tag{9}$$
where clique(L) denotes the set of all cliques in $G_L$. Given weights $x^*_l$ for each line $l$, the separation problem for the clique inequalities (9) consists in finding a clique $Q$ in $G_L$ with $x^*(Q) > 1$. From Theorem 1, we know that this problem can be solved in polynomial time by finding the maximum $x^*$-weight clique in $G_L$. Instead of resorting to slow and complex algorithms for perfect graphs, the separation problem can be solved directly by Dynamic Programming, as described in Lenhof et al.[12] and, independently and with a different construction, in Lancia et al.[13]. Both algorithms have complexity $O(nm)$. We describe here the construction of [12].
Consider $L$ as the vertex set of a directed grid, in which vertex $[1, n]$ is put in the lower left corner and vertex $[m, 1]$ in the upper right corner (see Fig.1(b)).
At each internal node $l = [i, j]$, there are two incoming arcs, one directed from the node below, i.e. $[i, j+1]$, and one directed from the node on the left, i.e. $[i-1, j]$. Each such arc has an associated length equal to $x^*_l$.
Theorem 2.[12,13] The nodes on a directed path $P$ from $[1, n]$ to $[m, 1]$ in the grid correspond to a clique $Q$, of maximal cardinality, in $G_L$ and vice versa.
Then the most violated clique inequality can be found by taking the longest $[1, n]$-$[m, 1]$ path in the grid. There is a violated clique inequality if and only if the length of the path (plus $x^*_{1n}$) is greater than 1.
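
The longest-path computation on the grid can be sketched as a simple dynamic program. The sketch below is an illustration under stated assumptions, not the implementation of [12]: it indexes lines as `(i, j)` with `i` in `1..n` and `j` in `1..m`, following the definition of $W_{nm}$ in (8), so the path runs from corner `(1, m)` to corner `(n, 1)`. It also accumulates the weight of every node on the path (start node included), which equals the arc lengths plus the weight of the start node. The function name and the dictionary representation of $x^*$ are hypothetical.

```python
def most_violated_clique(x, n, m):
    """Return (weight, path) of a maximum x-weight monotone path in the grid.

    x maps lines (i, j) to their LP values; missing lines count as 0.
    The path goes from (1, m) to (n, 1); each step either increases i by one
    or decreases j by one, so any two nodes on the path are crossing lines.
    Runs in O(n * m) time.
    """
    best, pred = {}, {}
    for i in range(1, n + 1):
        for j in range(m, 0, -1):
            candidates = []
            if i > 1:
                candidates.append((best[(i - 1, j)], (i - 1, j)))  # from the left
            if j < m:
                candidates.append((best[(i, j + 1)], (i, j + 1)))  # from below
            weight, previous = max(candidates) if candidates else (0.0, None)
            best[(i, j)] = weight + x.get((i, j), 0.0)
            pred[(i, j)] = previous
    # Walk the predecessor pointers back from the end corner.
    path, node = [], (n, 1)
    while node is not None:
        path.append(node)
        node = pred[node]
    path.reverse()
    return best[(n, 1)], path
```

A clique inequality (9) is violated by $x^*$ exactly when the returned weight exceeds 1, and the returned path lists the lines of the corresponding clique.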
3.1 Sequence vs Sequence Alignments
Comparing genomic sequences drawn from individuals of the same or different species is one of the fundamental problems of molecular biology. In fact, such comparisons are needed to identify highly conserved (and therefore presumably functionally relevant) DNA regions, spot fatal mutations, suggest evolutionary relationships, and help in correcting sequencing errors.
Aligning a set of sequences (i.e., computing a Multiple Sequence Alignment) consists in arranging them in a matrix having each sequence in a row. This is obtained by possibly inserting gaps (represented by the '-' character) in each sequence so that they all have the same length. The following is a simple example of an alignment of the sequences ATTCCGAC, TTCCCTG and ATCCTC.
    A T T C C G A C
    - T T C C C T G
    A - T C C - T C
By looking at a column of the alignment, we can reconstruct the events that have happened at the corresponding position in the given sequences. For instance, the letter G in sequence 1 has been mutated into a C in sequence 2, and deleted in sequence 3.
The multiple sequence alignment problem has been formalized as an optimization problem. The most popular objective function for multiple alignment generalizes ideas from optimally aligning two sequences. This problem, called pairwise alignment, is formulated as follows: Given symmetric costs $\gamma(a, b)$ for replacing a letter $a$ with a letter $b$ and costs $\gamma(a, -)$ for deleting or inserting a letter $a$, find a minimum-cost set of symbol operations that turn a sequence $S'$ into a sequence $S''$. For genomic sequences, the costs $\gamma(\cdot, \cdot)$ are usually specified by some widely used substitution matrices (like PAM and BLOSUM), which score the likelihood of specific letter mutations, deletions and insertions. The pairwise alignment problem can be solved by Dynamic Programming in time and space $O(l^2)$, where $l$ is the maximum length of the two sequences (Needleman and Wunsch[32]). The value of an optimal solution is called the edit distance of $S'$ and $S''$ and is denoted by $d(S', S'')$.
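
The quadratic dynamic program can be sketched as follows. This is a minimal illustration of the standard recurrence, assuming a cost function `gamma(a, b)` in which the gap is passed as the character `'-'`. The unit-cost function `unit_cost` is an assumption made for the example and is not one of the PAM or BLOSUM matrices mentioned above.

```python
def unit_cost(a, b):
    """Illustrative cost: 0 for identical symbols, 1 for any mismatch or gap."""
    return 0 if a == b else 1


def edit_distance(s1, s2, gamma=unit_cost, gap="-"):
    """Minimum total cost of turning s1 into s2, in O(len(s1) * len(s2)) time and space."""
    n, m = len(s1), len(s2)
    # d[i][j] = cost of optimally aligning the prefixes s1[:i] and s2[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + gamma(s1[i - 1], gap)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + gamma(s2[j - 1], gap)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + gamma(s1[i - 1], s2[j - 1]),  # match or substitute
                d[i - 1][j] + gamma(s1[i - 1], gap),            # delete from s1
                d[i][j - 1] + gamma(s2[j - 1], gap),            # insert into s1
            )
    return d[n][m]
```

With the unit costs assumed here the function returns the classical Levenshtein distance; plugging in a different `gamma` changes the scoring without changing the recurrence.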
Formally, an alignment $A$ of two or more sequences is a bi-dimensional array having the (gapped) sequences as rows. The value $d_A(S', S'')$ of an alignment of two sequences $S'$ and $S''$ is obtained by adding up the costs for the pairs of characters in corresponding positions, and it is easy to see that $d(S', S'') = \min_A d_A(S', S'')$. This objective is generalized, for $k$ sequences $\{S_1, \ldots, S_k\}$, by the Sum-of-Pairs (SP) score, in which the cost of an alignment is obtained by adding up the costs of the symbols matched up at the same positions, over all the pairs of sequences:
$$SP(A) := \sum_{1 \le i < j \le k} d_A(S_i, S_j) = \sum_{1 \le i < j \le k} \sum_{l=1}^{|A|} \gamma(A[i][l], A[j][l]) \tag{10}$$
where $|A|$ denotes the length of the alignment, i.e., the number of its columns.
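
Formula (10) translates directly into code. The sketch below assumes an alignment given as a list of equal-length gapped strings and, as an illustrative choice not taken from the survey, a unit cost in which two gaps facing each other cost 0. The names `sp_score` and `unit_cost` are hypothetical.

```python
from itertools import combinations


def unit_cost(a, b):
    """Illustrative cost: 0 for identical symbols (including two gaps), 1 otherwise."""
    return 0 if a == b else 1


def sp_score(alignment, gamma=unit_cost):
    """Sum-of-Pairs score (10): column-wise costs summed over all pairs of rows."""
    rows = list(alignment)
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows of an alignment must have the same length")
    return sum(
        gamma(a[l], b[l])
        for a, b in combinations(rows, 2)
        for l in range(length)
    )


# One gapped arrangement of ATTCCGAC, TTCCCTG and ATCCTC.
example = ["ATTCCGAC", "-TTCCCTG", "A-TCC-TC"]
score = sp_score(example)
```

The score depends entirely on the chosen `gamma`; with a substitution matrix in place of the unit cost, the same function evaluates the SP objective used in the optimization models.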
Finding the optimal SP alignment was shown to be NP-hard by Wang and Jiang[33]. A straightforward generalization of the Dynamic Programming from 2 to $k$ sequences, of length $l$, leads to an exponential-time algorithm of complexity $O(2^k l^k)$, which is in practice useless for all but tiny problems[34].
3.1.1 Trace and Integer Programming Approach
As an alternative approach to the ineffective Dynamic Programming algorithm, Kececioglu et al.[5] (see also Reinert et al.[4] for a preliminary version) proposed an approach based on Integer Programming, to optimize an objective function called the Maximum Weight Trace (MWT). The MWT generalizes the SP objective as well as many other alignment problems in the literature. The problem was introduced in 1993 by Kececioglu[35], who described a combinatorial Branch-and-Bound algorithm for its solution.
The trace is a graph theoretic generalization of the noncrossing matching to the case of the alignment of more than two objects. Suppose we want to align $k$ objects, of $n_1, n_2, \ldots, n_k$ elements. We define a complete $k$-partite graph $W_{n_1, \ldots, n_k} = (V, L)$, where $V = \bigcup_{i=1}^{k} V_i$, the set $V_i = \{v_{i1}, \ldots, v_{in_i}\}$ denotes the nodes of level $i$ (corresponding to the elements of the $i$-th object), and $L = \bigcup_{1 \le i < j \le k} V_i \times V_j$. Each edge $[i, j] \in L$ is still called a line. Given an alignment $A$ of the objects (i.e., an arrangement in an array, with gaps), we say that the alignment realizes a line $[i, j]$ if $i$ and $j$ are put in the same column of $A$. A trace $T$ is a set of lines such that there exists an alignment $A$ that realizes all the lines in $T$. The Maximum Weight Trace (MWT) Problem is the following: given weights $w_{[i,j]}$ for each line $[i, j]$, find a trace $T$ of maximum total weight.
Let $A$ now denote the set of directed arcs pointing from each element to the elements that follow it in an object (i.e., arcs of type $(v_{is}, v_{it})$ for $i = 1, \ldots, k$, $1 \le s < t \le n_i$). Then a trace $T$ is characterized as follows:
Proposition 3.[14] $T$ is a trace if and only if there is no directed mixed cycle in the mixed graph $(V, T \cup A)$.
A directed mixed cycle is a cycle containing both arcs and edges, in which the arcs (elements of $A$) are traversed according to their orientation, while the undirected edges (lines of $T$) can be traversed in either direction.
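
Proposition 3 suggests a simple feasibility check for a candidate set of lines. The sketch below is one possible way to test for directed mixed cycles, not the procedure used in [5]. It relies on the observation that lines may be traversed in either direction, so the endpoints of each line can be contracted into one node, after which a mixed cycle shows up as an ordinary directed cycle of arcs. Nodes are represented as pairs `(i, s)` with 0-based object index `i` and position `s`; this representation and the name `is_trace` are assumptions of the example.

```python
from collections import defaultdict


def is_trace(lengths, lines):
    """Return True if `lines` admits no directed mixed cycle.

    lengths[i] is the number of elements of object i. Each line is a pair of
    nodes (i, s), (j, t) taken from two different objects.
    """
    nodes = [(i, s) for i, n_i in enumerate(lengths) for s in range(n_i)]
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # Contract every line: its two endpoints must end up in the same column.
    for u, v in lines:
        parent[find(u)] = find(v)

    # Arcs between consecutive elements imply all arcs (v_is, v_it) with s < t.
    succ = defaultdict(set)
    indeg = {find(v): 0 for v in nodes}
    for i, n_i in enumerate(lengths):
        for s in range(n_i - 1):
            a, b = find((i, s)), find((i, s + 1))
            if a == b:
                return False  # two elements of one object forced into one column
            if b not in succ[a]:
                succ[a].add(b)
                indeg[b] += 1

    # Kahn's algorithm: the contracted graph must be acyclic.
    ready = [c for c, d in indeg.items() if d == 0]
    visited = 0
    while ready:
        c = ready.pop()
        visited += 1
        for b in succ[c]:
            indeg[b] -= 1
            if indeg[b] == 0:
                ready.append(b)
    return visited == len(indeg)
```

For two objects the check reduces to the noncrossing-matching condition: two crossing lines, or two lines sharing an endpoint, are rejected.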
Using binary variables $x_l$ for the lines in $L$, the MWT problem can be modeled as follows[5]:
$$\max \sum_{l \in L} w_l x_l \tag{11}$$
subject to
$$\sum_{l \in C \cap L} x_l \le |C \cap L| - 1 \quad \forall \text{ directed mixed cycles } C \text{ in } (V, L \cup A) \tag{12}$$
$$x_l \in \{0, 1\} \quad \forall l \in L. \tag{13}$$
The formulation (11)-(13) can be strengthened by using the clique inequalities (9), relative to each subgraph $W_{n_i, n_j} := (V_i, V_j, V_i \times V_j)$ of $W_{n_1, \ldots, n_k}$. The mixed-cycle inequalities (12) can be separated in polynomial time, via the computation of at most $O(\sum_{i=1}^{k} n_i)$ shortest paths in $(V, L \cup A)$[5]. A Branch-and-Cut algorithm based on the mixed-cycle inequalities and the clique inequalities (as well as other cuts, separated heuristically) allowed Kececioglu et al.[5] to compute, for the first time, an optimal alignment for a set of 15 proteins of about 300 amino acids each from the SWISSPROT database, whereas the previous limit, achieved by a combinatorial Branch-and-Bound algorithm for this problem, was 6 sequences of size 200. We remark that the optimal alignment of 6 sequences of size 200 is already out of reach for Dynamic Programming-based algorithms. We must also point out, however, that the length of the sequences is still a major limitation to the applicability of the Integer Programming solution, and the method is not suitable for sequences longer than a few hundred letters.
3.1.2 An IP-Based Heuristic for SP Alignment
Due to the complexity of the alignment problem, many heuristic algorithms have been developed over time, some of which provide an
approximation-guarantee on the quality of the solution found. In this section we describe one such
heuristic procedure, which uses ideas from Network
Design and Integer Programming techniques. This
heuristic is well suited for long sequences.
A popular approach for many heuristics is the so-called "progressive" alignment, in which the solution is incrementally built by considering the sequences one at a time. One such algorithm is due to Gusfield[7], who suggested using a star (i.e., a tree in which at most one node, the center, is not a leaf) to align the sequences as follows. Let $S = \{S_1, \ldots, S_k\}$ be the set of input sequences. Fix a sequence $S_c$ as the center, and compute $k-1$ pairwise optimal alignments $A_i$, one for each $S_i$ aligned with $S_c$. These alignments can then be merged into a single alignment $A(c)$, by putting in the same column all the letters that are aligned to the same letter of $S_c$.
Note that each $A_i$ optimally aligns $S_i$ to $S_c$ (thereby realizing their edit distance), and it is true also in $A(c)$ that each sequence is aligned optimally with $S_c$, i.e.
$$d_{A(c)}(S_i, S_c) = d_{A_i}(S_i, S_c) = d(S_i, S_c) \quad \forall i. \tag{14}$$
Furthermore, for all pairs $i, j$, the triangle inequality $d_{A(c)}(S_i, S_j) \le d(S_i, S_c) + d(S_j, S_c)$ holds, since the distance in an alignment is a metric over the sequences (assuming the cost function $\gamma$ is a metric over the alphabet $\Sigma$). Let $c^*$ be such that $SP(A(c^*)) = \min_r SP(A(r))$. Let OPT be the optimal SP value, and HEUR $:= SP(A(c^*))$. By comparing $\sum_{1 \le i < j \le k} d(S_i, S_j)$ (a lower bound to OPT) to $\sum_{1 \le i < j \le k} (d(S_i, S_{c^*}) + d(S_j, S_{c^*}))$ (an upper bound to HEUR), Gusfield proved the following approximation result:
Theorem 4.[7] HEUR $\le \left(2 - \frac{2}{k}\right)$ OPT.
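
The comparison behind Theorem 4 can be spelled out in a few lines. The derivation below is a sketch of the standard averaging argument; the shorthand $D$ is introduced here for illustration and is not notation from the survey. Write $D := \sum_{1 \le i < j \le k} d(S_i, S_j)$, so that $D \le \mathrm{OPT}$ because every alignment satisfies $d_A(S_i, S_j) \ge d(S_i, S_j)$.

```latex
\begin{aligned}
SP(A(c)) &\le \sum_{1 \le i < j \le k} \bigl( d(S_i, S_c) + d(S_j, S_c) \bigr)
          = (k-1) \sum_{i=1}^{k} d(S_i, S_c)
          && \text{for every center } c, \\
\mathrm{HEUR} = \min_{c} SP(A(c))
         &\le \frac{1}{k} \sum_{c=1}^{k} (k-1) \sum_{i=1}^{k} d(S_i, S_c)
          = \frac{k-1}{k} \cdot 2D
          = \Bigl( 2 - \frac{2}{k} \Bigr) D
          \le \Bigl( 2 - \frac{2}{k} \Bigr) \mathrm{OPT}.
\end{aligned}
```

The first line uses the triangle inequality together with the fact that each index appears in exactly $k-1$ pairs; the second bounds the minimum over centers by the average, in which every unordered pair of sequences is counted twice.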
Gusfield's algorithm can be generalized so as to use any tree, instead of just a star, and still maintain the 2-approximation guarantee. The approach is based on a connection to the following Network Design problem. Given a complete weighted graph and a spanning tree $T$, the routing cost of $T$, denoted by $r(T)$, is defined as the sum of the weights of the unique paths, between all pairs of nodes, identified by $T$. Computing the minimum routing cost tree is NP-hard, but a Polynomial Time Approximation Scheme (PTAS) is possible[36].
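
Evaluating the routing cost of a given tree is straightforward, even though finding the best tree is hard. The sketch below computes $r(T)$ by a traversal from every node, counting each unordered pair of nodes once. The function name, the edge-list representation and the `weight` callback are assumptions of the example.

```python
def routing_cost(nodes, tree_edges, weight):
    """Sum, over all unordered node pairs, of the weight of their unique tree path.

    nodes is a list of hashable node labels, tree_edges a list of pairs (u, v)
    forming a spanning tree, and weight(u, v) the weight of edge uv.
    """
    adj = {v: [] for v in nodes}
    for u, v in tree_edges:
        adj[u].append(v)
        adj[v].append(u)

    def path_weights_from(src):
        # Depth-first traversal; in a tree each node is reached by one path only.
        dist = {src: 0}
        stack = [src]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + weight(u, v)
                    stack.append(v)
        return dist

    total = 0
    for idx, u in enumerate(nodes):
        dist = path_weights_from(u)
        total += sum(dist[v] for v in nodes[idx + 1:])  # count each pair once
    return total
```

With sequences as nodes and edit distances as weights, this is the quantity that bounds the SP value of a tree-based alignment. The traversal from every node costs $O(k^2)$ time overall for $k$ sequences.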
To relate routing cost and alignments, consider the set of sequences $S$ as the vertex set of a complete weighted graph $G$, where the weight of an edge $S_i S_j$ is $d(S_i, S_j)$. The following procedure shows that, for any spanning tree $T$, one can build an alignment $A(T)$ which is optimal for $k-1$ of the $\binom{k}{2}$ pairs of sequences: (i) pick an edge $S_i S_j$ of the tree and align recursively the sequences on both sides of the cut defined by the edge, obtaining two alignments, say $A_i$ and $A_j$; (ii) align optimally $S_i$ with $S_j$; (iii) merge $A_i$ and $A_j$ into a complete solution, by aligning the columns of $A_i$ and $A_j$ in the same way as the letters of $S_i$ are optimally aligned to the letters of $S_j$.
Theorem 5. For any spanning tree $T = (S, E)$ of $G$, there exists an alignment $A(T)$ such that $d_{A(T)}(S_i, S_j) = d(S_i, S_j)$ for all pairs of sequences $S_i S_j \in E$.
Furthermore, the alignment $A(T)$ can be easily computed, as outlined above. By triangle inequality, it follows
Corollary 6. For any spanning tree $T = (S, E)$ of $G$, there exists an alignment $A(T)$ such that $SP(A(T)) \le r(T)$.
From Corollary 6, it follows that a good tree to use for aligning the sequences is a tree $T^*$ such that $r(T^*) = \min_T r(T)$. This tree does not guarantee an optimal alignment, but guarantees the best possible upper bound to the alignment value, for an alignment obtained from a tree (such as Gusfield's alignment). In [6], Fischetti et al. describe a Branch-and-Price algorithm for finding the minimum routing cost tree in a graph, which is effective for up to $k = 30$ nodes (a large enough value for alignment problems). The algorithm is based on a formulation with binary variables $x_P$ for each path $P$ in the graph (an exponential number of variables), and constraints that force the variables set to 1 to be the set of paths induced by a tree. The pricing problem is solved by computing $O(k^2)$ nonnegative shortest paths. Fischetti et al.[6] report on computational experiments of their algorithm applied to alignment problems, showing that using the best routing cost tree instead of the best star produces alignments that are within 6% of optimal on average.
3.1.3 Consensus Sequences
Given a set $S = \{S_1, \ldots, S_k\}$ of sequences, an important problem consists in determining their consensus, i.e., a new sequence which represents all the sequences in the set. Let $C$ be the consensus. To avoid a consensus bias (e.g., due to overrepresented sequences in the set $S$), an appropriate objective is to minimize the maximum distance of $C$ from any sequence in $S$. Optimizing this objective is NP-hard[37].
Assume all input sequences have length $n$. For the consensus problem, instead of using the edit distance, the Hamming distance is preferable (some biological reasons for this can be found in [11]). This distance is defined, for sequences $A$ and $B$ of the same length, as $\sum_{i=1}^{n} \gamma(A[i], B[i])$, where $X[i]$ denotes the $i$-th symbol of a sequence $X$.
The Integer Programming formulation for the consensus problem has a binary variable $x_{i\sigma}$, for every symbol $\sigma \in \Sigma$ and every position $1 \le i \le n$, to indicate whether $C[i] = \sigma$ ($x_{i\sigma} = 1$) or not ($x_{i\sigma} = 0$). The model reads:
$$\min r \tag{15}$$
subject to
$$\sum_{\sigma \in \Sigma} x_{i\sigma} = 1 \quad \forall i = 1, \ldots, n \tag{16}$$
$$\sum_{i=1}^{n} \sum_{\sigma \in \Sigma} \gamma(\sigma, S_j[i])\, x_{i\sigma} \le r \quad \forall j = 1, \ldots, k \tag{17}$$
$$x_{i\sigma} \in \{0, 1\} \quad \forall i = 1, \ldots, n,\ \forall \sigma \in \Sigma. \tag{18}$$
Let OPT denote the value of the optimum solution to the program above. An approximation to OPT can be obtained by Randomized Rounding[38]. Consider the Linear Programming relaxation of (15)-(18), obtained by replacing (18) with $x_{i\sigma} \ge 0$. Let $r^*$ be the LP value and $x^*_{i\sigma}$ be the LP optimal solution, possibly fractional. An integer solution $\bar{x}$ can be obtained from $x^*$ by randomized rounding: independently, at each position $i$, choose a letter $\sigma \in \Sigma$ (i.e., set $\bar{x}_{i\sigma} = 1$ and $\bar{x}_{i\tau} = 0$ for $\tau \ne \sigma$) with probability of $\bar{x}_{i\sigma} = 1$ given by $x^*_{i\sigma}$. Let HEUR be the value of the solution $\bar{x}$, and $\Gamma = \max_{a, b \in \Sigma} \gamma(a, b)$. The first approximation result for the consensus was given by Ben-Dor et al.[8], and states that, for any constant $\epsilon > 0$,
$$\Pr\left( \mathrm{HEUR} > \mathrm{OPT} + \Gamma \sqrt{3\, \mathrm{OPT} \log \frac{k}{\epsilon}} \right) < \epsilon. \tag{19}$$
This probabilistic algorithm can be derandomized using the standard method of conditional probabilities[39].
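
The rounding step itself is easy to sketch. The example below assumes that a fractional LP solution is already available as a list with one dictionary per position, mapping each symbol to its LP value. The values used in the usage lines are made up for illustration and were not computed by an LP solver. The unit cost, the function names and the data layout are all assumptions of the sketch.

```python
import random


def unit_cost(a, b):
    """Illustrative cost gamma: 0 for equal symbols, 1 otherwise."""
    return 0 if a == b else 1


def hamming(a, b, gamma=unit_cost):
    """Position-wise cost between two sequences of the same length."""
    return sum(gamma(p, q) for p, q in zip(a, b))


def round_consensus(x_frac, rng):
    """Pick one symbol per position, independently, with the LP values as probabilities.

    x_frac[i] maps each symbol to its fractional value at position i; by
    constraint (16) the values at each position sum to 1.
    """
    consensus = []
    for probs in x_frac:
        symbols = list(probs)
        weights = [probs[s] for s in symbols]
        consensus.append(rng.choices(symbols, weights=weights, k=1)[0])
    return "".join(consensus)


def radius(consensus, seqs, gamma=unit_cost):
    """Objective value r of a candidate consensus: its largest distance to any input."""
    return max(hamming(consensus, s, gamma) for s in seqs)


# Usage with hypothetical fractional values (not an actual LP optimum).
seqs = ["ACGT", "ACCT", "TCGT"]
x_frac = [{"A": 0.5, "T": 0.5}, {"C": 1.0}, {"G": 0.5, "C": 0.5}, {"T": 1.0}]
candidate = round_consensus(x_frac, random.Random(0))
heur = radius(candidate, seqs)
```

Positions where the LP solution is already integral are copied unchanged, so randomness only enters where the relaxation is undecided. In practice one would repeat the rounding several times and keep the candidate with the smallest radius.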
The IP model (15)-(18) was utilized in a string of papers on consensus as well as other, related, problems[9-11], such as finding the farthest sequence (i.e., a sequence "as dissimilar as possible" from each sequence of a given set $S$).
The main results obtained on these problems
are PTAS based on the above IP formulation and
Randomized Rounding, coupled with random sampling. In [9, 10], Li et al. and Ma respectively, describe PTAS for nding the consensus, or a consensus substring. The main idea behind the approach
is the following. Given a subset of r sequences from
S , line them up in an r n array, and consider the
positions where they all agree. Intuitively, there
is a high likelihood that the consensus should also
agree with them at these positions. Hence, all one
needs to do is to optimize on the positions (symbols) where they do not agree, which is done by LP
relaxation and randomized rounding.
A PTAS for the farthest sequence problem, by
Lanctot et al., is described in [11]. Also this result is obtained by LP relaxation and Randomized
Rounding of an IP similar to (15){(18).
3.2 Sequence vs Structure Alignments
In order to build a protein, the DNA from a gene is first copied onto the messenger RNA, which will serve as a template for the protein synthesis. This process is called transcription. In RNA, the nucleotide Thymine (T) is replaced by the equivalent nucleotide Uracil (U). RNA molecules are single-stranded and the exposed bases tend to form hydrogen bonds within the same molecule, according to Watson-Crick pairing rules (A ↔ U, C ↔ G). These bonds lead to structure formation, also called secondary structure of the RNA sequence.
Determining the secondary structure from the nucleotide sequence is not an easy problem. For instance, complementary bases may not hybridize if they are not sufficiently far apart in the molecule. It is also possible that not all the feasible Watson-Crick pairings can be realized at the same time, and Nature chooses the most favorable ones according to an energy minimization which is not yet well understood.
All potential pairings can be represented by the edge set of a graph $G_S = (V_S, E_S)$. The vertex set $V_S = \{v_1, \ldots, v_n\}$ represents the nucleotides of $S$, with $v_i$ corresponding to the $i$-th nucleotide. Each edge $e \in E_S$ connects two nucleotides that may possibly hybridize to each other in the secondary structure. For instance, the graph of Fig.2 describes the possible pairings for the sequence $S$ = UCGUGCGGUAACUUCCACGA, assuming a pairing can be achieved only if the bases are at least 8 positions apart from each other. Only a subset $E'_S \subseteq E_S$ of pairings is realized by the actual secondary structure. This subset is drawn in bold in Fig.2. The graph $G'_S = (V_S, E'_S)$ describes the secondary structure. Note that each node in $G'_S$ has degree $\le 1$.
Fig.2. Graph of RNA secondary structure.
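The construction of the potential-pairing graph can be sketched as follows. The code is a minimal illustration, not the procedure of any cited paper: "at least 8 positions apart" is read here as $j - i \ge 8$, and the function names `potential_pairings` and `is_secondary_structure` are assumptions of this example.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

def potential_pairings(seq, min_gap=8):
    """Edge set E_S: pairs (i, j), 1-based with i < j, of complementary
    bases that are at least min_gap positions apart (j - i >= min_gap)."""
    n = len(seq)
    return [(i + 1, j + 1)
            for i in range(n)
            for j in range(i + min_gap, n)
            if (seq[i], seq[j]) in WATSON_CRICK]

def is_secondary_structure(edges):
    """Check the degree condition: every nucleotide is in at most one pair."""
    used = set()
    for i, j in edges:
        if i in used or j in used:
            return False
        used.update((i, j))
    return True
```

For the sequence of Fig.2, `potential_pairings("UCGUGCGGUAACUUCCACGA")` lists the candidate edges; which subset is actually realized is not determined by this sketch.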
Two RNA sequences that look quite different as strings of nucleotides may have similar secondary structure. A generic sequence alignment algorithm, such as those described in the previous section, would not spot the structure similarity, and hence a different model for comparison is needed for this case.
One possible model was proposed in Lenhof et al.[12] (see also Kececioglu et al.[5]). In this model, a sequence $U$ of unknown secondary structure is compared to a sequence $K$ of known secondary structure. This is done by comparing the graph $G_U = (V_U, E_U)$ to $G'_K = (V_K, E'_K)$.
Assume $V_U = [n]$ and $V_K = [m]$. An alignment of $G_U$ and $G'_K$ is defined by a noncrossing matching of $W_{nm}$ (which, recalling (8), is the complete bipartite graph $(V_U, V_K, L)$). Given an edge $e = ij \in E_U$ and two noncrossing lines $l = [i, u] \in L$, $h = [j, v] \in L$, we say that $l$ and $h$ generate $e$ if $uv \in E'_K$. Hence, two lines generate a possible pairing (edge of $E_U$) if they align its endpoints to elements that hybridize to each other in the secondary structure of $K$.
For each line $l = [i, j] \in L$ let $w_l$ be the profit of aligning the two endpoints of the line (e.g., a similarity score between the symbols $U[i]$ and $K[j]$). Furthermore, for each edge $e \in E_U$, let $w_e$ be a nonnegative weight associated to the edge, measuring the "strength" of the corresponding bond. The objective of the RNA Sequence/Structure Alignment (RSA) problem is to determine an alignment which has maximum combined value: the value is given by the sum of the weights of the alignment lines and the weights of the edges in $E_U$ that these lines generate.
Let $\prec$ be an arbitrary total order defined over the set of all lines $L$. Let $\mathcal{G}$ be the set of pairs of lines $(p, q)$ with $p \prec q$, such that each pair $(p, q) \in \mathcal{G}$ generates an edge in $E_U$. For each $(p, q) \in \mathcal{G}$, define $w_{pq} := w_e$, where $p = [i, u]$, $q = [j, v]$ and $e = ij \in E_U$.
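The Integer Programming model in [12] for this alignment problem has variables $x_l$ for lines $l \in L$ and $y_{pq}$ for pairs of lines $(p, q) \in \mathcal{G}$:
$$\max \sum_{l \in L} w_l x_l + \sum_{(p,q) \in \mathcal{G}} w_{pq} y_{pq} \tag{20}$$
subject to
$$x_l + x_h \le 1 \qquad \forall\, l, h \in L \mid l \text{ and } h \text{ cross} \tag{21}$$
$$y_{pq} \le x_p \qquad \forall (p, q) \in \mathcal{G} \tag{22}$$
$$y_{pq} \le x_q \qquad \forall (p, q) \in \mathcal{G} \tag{23}$$
$$x_l,\ y_{pq} \in \{0, 1\} \qquad \forall l \in L,\ \forall (p, q) \in \mathcal{G}. \tag{24}$$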
The model can be readily tightened as follows. First, the weaker constraints (21) can be replaced by the clique inequalities (9). Secondly, for a line $l \in L$, let $\mathcal{G}_l = \{(p, q) \in \mathcal{G} \mid p = l \lor q = l\}$. Each element $(p, q)$ of $\mathcal{G}_l$ identifies a pairing $e \in E_U$ such that $l$ is one of the generating lines of $e$. Note that if an edge $e \in E_U$ can be generated by $l$ and $a \in L$ and, alternatively, by $l$ and $b \ne a \in L$, then $a$ and $b$ must cross (in particular, they share an endpoint). Hence, of all the generating pairs in $\mathcal{G}_l$, at most one can be selected in a feasible solution. The natural generalization for constraints (22) and (23) is therefore
$$\sum_{(a,b) \in \mathcal{G}_l} y_{ab} \le x_l \qquad \forall l \in L. \tag{25}$$
A Branch-and-Cut procedure based on the above inequalities is described in [5, 12]. The method was validated by aligning RNA sequences of known structure to RNA sequences of known structure, but without using the structure information for one of the two sequences. In all cases tested, the algorithm retrieved the correct alignment, as computed by hand by the biologists. For comparison, a "standard" sequence alignment algorithm (which ignores structure in both sequences) was used, and failed to retrieve the correct alignments.
In a second experiment, the method was used to align RNA sequences of unknown secondary structure, of up to 1,400 nucleotides each, to RNA sequences of known secondary structure. The optimal solutions found were compared with alternative solutions found by "standard" alignment, and their merits were discussed. The results showed that structure information is essential in retrieving biologically relevant alignments.
3.3 Structure vs Structure Alignments
A protein is a chain of molecules known as amino acids, or residues, which folds into a peculiar 3-D shape (called its tertiary structure) under several molecular forces (entropic, hydrophobic, thermodynamic). A protein's fold is perhaps the most important of all of a protein's features, since it determines how the protein functions and interacts with other molecules (in fact, most biological mechanisms at the protein level, e.g., binding, are based on shape-complementarity).
The comparison of protein structures is a problem of paramount importance in structural genomics, with applications to protein clustering, function determination and assessment of protein fold predictions. An increasing number of approaches for structure comparison have been proposed over the past years (see Lemmen and Lengauer[40] for a survey), and several protein structure classification servers (the most important of which is the Protein Data Bank, PDB[41]) have been designed and are extensively used in practice.
Loosely speaking, the structure comparison problem is the following: Given two 3-D protein structures, determine how similar they are. The comparison of two structures requires "aligning" them in some way. Since, by their nature, three-dimensional computational problems are inherently more complex than the similar one-dimensional ones, there is a dramatic difference between the complexity of sequence alignment and structure alignment. Not surprisingly, various simplified versions of the structure comparison problem were shown to be NP-hard[42].
A popular structure comparison measure is based on the comparison of the proteins' contact maps. A contact map is a 2-D representation of a 3-D protein structure. When a protein folds, two residues that were not adjacent in the protein's linear sequence may end up close to each other in the 3-D space. The contact map of a protein with $n$ residues is defined as an $n \times n$ 0-1 matrix, whose 1-elements correspond to pairs of residues that are within a "small" distance (typically around 5 Å) in the protein's fold, but are not adjacent in the linear sequence. We say that such a pair of residues are in contact. A contact map can also be considered the adjacency matrix of a graph $G$. Each residue is a node of $G$, and there is an edge between two nodes if the corresponding residues are in contact.
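A contact map can be computed directly from this definition. The sketch below assumes, for illustration only, that each residue is represented by a single 3-D point (for example one atom per residue) and that the threshold is 5 Å; the function name `contact_map` is hypothetical.

```python
import math

def contact_map(coords, threshold=5.0):
    """Return the n x n 0-1 contact matrix for residues given in chain order.

    coords: list of (x, y, z) tuples, one representative point per residue
    (an assumption of this sketch). Residues adjacent in the sequence are
    never marked as being in contact.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 2, n):  # j = i + 1 is adjacent in the sequence
            if math.dist(coords[i], coords[j]) <= threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap
```

The resulting symmetric matrix is the adjacency matrix of the graph $G$ described above.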
The Contact Map Overlap (CMO) problem tries
to evaluate the similarity in the 3-D folds of two
proteins by determining the similarity of their contact maps (the rationale being that a high contact
map similarity is a good indicator of high 3-D similarity). This measure was introduced in [43], and
its optimization was proved NP-hard in [42], thus
justifying the use of sophisticated heuristics or enumerative methods.
Given two folded proteins, the CMO problem calls for determining an alignment between the residues of the first protein and of the second protein. The goal is to find the alignment which highlights the largest set of common contacts. The value of an alignment is given by the number of pairs of residues in contact in the first protein which are aligned with pairs of residues that are also in contact in the second protein. Given the graph representation for contact maps, we can phrase the CMO problem in graph-theoretic language. The input consists of two undirected graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, with $V_1 = [n]$ and $V_2 = [m]$. For each edge $ij$, with $i < j$, we distinguish a left endpoint ($i$) and a right endpoint ($j$), and therefore we denote the edge by the ordered pair $(i, j)$. A solution of the problem is an alignment of $V_1$ and $V_2$, i.e., a noncrossing matching $L' \subseteq L$ in $W_{nm}$. Two edges $e = (i, j) \in E_1$ and $f = (i', j') \in E_2$ contribute a sharing to the value of an alignment $L'$ if $l = [i, i'] \in L'$ and $h = [j, j'] \in L'$. In this case, we say that $l$ and $h$ generate the sharing $(e, f)$. The CMO problem consists in finding an alignment which maximizes the number of sharings. Fig.3 shows two contact maps and an alignment with 5 sharings.
Fig.3. An alignment of value 5.
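The value of a given alignment can be evaluated as in the following sketch, which only scores a candidate alignment and does not optimize it. Lines are represented as pairs $(i, i')$ and contacts as pairs $(i, j)$ with $i < j$; the function names are assumptions of this example.

```python
def is_noncrossing(lines):
    """True if the lines [i, i'] form a noncrossing matching: after sorting
    by the first endpoint, both endpoints are strictly increasing."""
    ordered = sorted(lines)
    return all(a[0] < b[0] and a[1] < b[1]
               for a, b in zip(ordered, ordered[1:]))

def count_sharings(lines, edges1, edges2):
    """Number of contacts (i, j) of E1 whose endpoints are aligned to the
    endpoints of a contact of E2."""
    if not is_noncrossing(lines):
        raise ValueError("the lines do not form a noncrossing matching")
    image = dict(lines)  # residue of protein 1 -> aligned residue of protein 2
    e2 = {tuple(sorted(e)) for e in edges2}
    return sum(1 for i, j in edges1
               if i in image and j in image
               and tuple(sorted((image[i], image[j]))) in e2)
```

Each contact of `edges1` is assumed to appear once; duplicated entries would be counted repeatedly.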
A successful approach for the CMO solution relies on formulating the problem as an Integer Program and solving it by Branch-and-Cut. The formulation is based on binary variables $x_l$ for each line $l$, and $y_{pq}$ for pairs of lines $p, q$ which generate a sharing.
The model, proposed by Lancia et al.[13], is very similar to the formulation for the RSA problem, described in Section 3.2. We denote with $\mathcal{G}$ the set of all pairs of lines that generate a sharing. For each pair $(l, t) \in \mathcal{G}$, where $l = [i, i']$ and $t = [j, j']$, the edge $(i, j) \in E_1$ and the edge $(i', j') \in E_2$. For a line $l \in L$, let $\mathcal{G}_l^- = \{(p, l) \in \mathcal{G}\}$ and $\mathcal{G}_l^+ = \{(l, p) \in \mathcal{G}\}$. The model for CMO is
$$\max \sum_{(l,t) \in \mathcal{G}} y_{lt} \tag{26}$$
subject to the clique inequalities (9) and the constraints
$$\sum_{(p,l) \in \mathcal{G}_l^-} y_{pl} \le x_l \qquad \forall l \in L \tag{27}$$
$$\sum_{(l,p) \in \mathcal{G}_l^+} y_{lp} \le x_l \qquad \forall l \in L \tag{28}$$
$$x_l,\ y_{pq} \in \{0, 1\} \qquad \forall l \in L,\ \forall (p, q) \in \mathcal{G}. \tag{29}$$
A Branch-and-Cut procedure based on the above constraints, as well as other, weaker, cuts, was used in [13] to optimally align, for the first time, real proteins from the PDB. The algorithm was run on 597 protein pairs, with sizes ranging from 64 to 72 residues. Within the time limit of 1 hour per instance, 55 problems were solved optimally and for 421 problems (70 percent) the gap between the best solution found and the upper bound was $\le 5$, thereby providing a certificate of near-optimality.
In a second experiment, the CMO measure was validated from a biological point of view. A set of 33 proteins, known to belong to four different families, were pairwise compared. The proteins were then re-clustered according to their computed CMO value. In the new clustering, no two proteins were put in the same cluster which were not in the same family before (a 0% false positives rate) and only a few pairs of proteins that belonged to the same family were put in different clusters (a 1.3% false negatives rate).
4 Genome Rearrangements
Thanks to the large amount of genomic data that has become available in the past years, it is now possible to compare the genomes of different species, in order to find their differences and similarities. An important application of this problem is, e.g., the comparison of mouse and human genomes, since new drugs are typically tested on mice before humans.
In principle, a genome can be thought of as a (very long) sequence and hence one may want to compare two genomes via a sequence alignment algorithm. However, sequence alignment is not the right model for large-scale genome comparisons, and, at the genome level, differences should be measured in terms of rearrangements of long DNA regions which occurred during evolution.
Among the main rearrangements known are inversions, transpositions, and translocations. Each of these events affects a long fragment of DNA on a chromosome. When an inversion or a transposition occurs, a DNA fragment is detached from its original position and then is reinserted, on the same chromosome. In an inversion, it is reinserted at the same place, but with the opposite orientation to the one it originally had. In a transposition, it keeps the original orientation but ends up in a different position. A translocation causes a pair of fragments to be exchanged between the ends of two chromosomes. Fig.4 illustrates these events, where each string represents a chromosome.
ATTGTTataggttagAATTG  ->  ATTGTTgattggataAATTG  (Inversion)
ATTgtttataGGCTAGATCCGCCAGA  ->  ATTGGCTAGATCCGCgtttataCAGA  (Transposition)
CTGGATgcaggcat + TCATTGAaata  ->  CTGGATaata + TCATTGAgcaggcat  (Translocation)
Fig.4. Evolutionary events.
Since evolutionary events affect long DNA regions (several thousand bases) the basic unit for comparison is not the nucleotide, but rather the gene.
The general Genome Comparison Problem can be informally stated as: Given two genomes (represented by two sets of ordered lists of genes, one list per chromosome) find a sequence of evolutionary events that, applied to the first genome, turns it into the second. Under a common parsimony principle, the solution sought is the one requiring the minimum possible number of events.
In the past decade, people have concentrated on evolution by means of some specific event alone, and have shown that even these special cases can already be very hard to solve. Since inversions are considered the predominant of all types of rearrangements, they have received the most attention. For historical reasons, they have become known as reversals in the computer science community.
In the remainder of this section, we will outline a successful Integer Programming approach for computing the evolutionary distance between two genomes evolved by reversals only, a problem known as Sorting by Reversals. The approach is due to Caprara et al.[14,15].
Two genomes are compared by looking at their common genes. After numbering each of $n$ common genes with a unique label in $\{1, \ldots, n\}$, each genome is a permutation of the elements $\{1, \ldots, n\}$. Let $\pi = (\pi_1 \ldots \pi_n)$ and $\sigma = (\sigma_1 \ldots \sigma_n)$ be two genomes. By possibly relabeling the genes, we can always assume that $\sigma = \iota := (1\ 2\ \ldots\ n)$, the identity permutation. Hence, the problem becomes turning $\pi$ into $\iota$, i.e., sorting $\pi$.
A reversal is a permutation $\rho_{ij}$, with $1 \le i < j \le n$, defined as
$$\rho_{ij} = (1\ \ldots\ i-1\ \ \underbrace{j\ \ j-1\ \ldots\ i+1\ \ i}_{\text{reversed}}\ \ j+1\ \ldots\ n).$$
By applying $\rho_{ij}$ to $\pi$, one obtains $(\pi_1 \ldots \pi_{i-1}, \pi_j\ \pi_{j-1} \ldots \pi_i, \pi_{j+1} \ldots \pi_n)$, i.e., the order of the elements $\pi_i, \ldots, \pi_j$ has been reversed. Let $\mathcal{G} = \{\rho_{ij},\ 1 \le i < j \le n\}$. Each permutation $\pi$ can be expressed as a product $\iota \rho^1 \rho^2 \cdots \rho^D$ with $\rho^i \in \mathcal{G}$ for $i = 1, \ldots, D$. The minimum value $D$ such that $\iota \rho^1 \rho^2 \cdots \rho^D = \pi$ is called the reversal distance of $\pi$, and denoted by $d(\pi)$. Sorting by Reversals (SBR) is the problem of finding $d(\pi)$ and a sequence $\rho^1 \rho^2 \cdots \rho^{d(\pi)}$ satisfying the previous equation.
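The definitions above can be made concrete with a small sketch. The breadth-first search below computes $d(\pi)$ exactly but visits up to $n!$ permutations, so it is meant only for tiny examples and is unrelated to the Integer Programming approach discussed in this section; the function names are assumptions of this example.

```python
from collections import deque

def apply_reversal(perm, i, j):
    """Reverse the elements in positions i..j (1-based, inclusive, i < j)."""
    p = list(perm)
    p[i - 1:j] = reversed(p[i - 1:j])
    return tuple(p)

def reversal_distance(perm):
    """Exact reversal distance by breadth-first search (tiny n only)."""
    n = len(perm)
    start, target = tuple(perm), tuple(range(1, n + 1))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == target:
            return dist[cur]
        # try every reversal rho_ij with 1 <= i < j <= n
        for i in range(1, n):
            for j in range(i + 1, n + 1):
                nxt = apply_reversal(cur, i, j)
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
```

Since every reversal is its own inverse, sorting $\pi$ into $\iota$ and building $\pi$ from $\iota$ require the same number of reversals.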
A major step towards the practical solution of the problem was made by Bafna and Pevzner[44], who investigated the relations between SBR and the breakpoints of $\pi$. A breakpoint is given by a pair of adjacent elements in $\pi$ that are not adjacent in $\iota$, that is, there is a breakpoint at position $i$ if $|\pi_i - \pi_{i-1}| > 1$. Let $b(\pi)$ denote the number of breakpoints. A trivial bound is $d(\pi) \ge \lceil b(\pi)/2 \rceil$, since a reversal can remove at most two breakpoints, and $\iota$ has no breakpoints. However, Bafna and Pevzner showed how to obtain a much better bound from the breakpoint graph $G(\pi)$. $G(\pi)$ has a node for each element of $\pi$ and edges of two colors, say red and blue. There is a red edge between $\pi_i$ and $\pi_{i-1}$ for each position $i$ at which there is a breakpoint, and a blue edge between each $h$ and $k$ such that $|h - k| = 1$ and $h, k$ are not adjacent in $\pi$. Each node has the same number of red and blue edges incident on it (0, 1 or 2), and $G(\pi)$ can be decomposed into a set of edge-disjoint color-alternating cycles. Note that these cycles are not necessarily simple, i.e., they can repeat nodes. Let $c(\pi)$ be the maximum number of edge-disjoint alternating cycles in $G(\pi)$. Bafna and Pevzner[44] proved the following
Theorem 7.[44] For every permutation $\pi$, $d(\pi) \ge b(\pi) - c(\pi)$.
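The breakpoint count and the trivial bound $d(\pi) \ge \lceil b(\pi)/2 \rceil$ can be sketched as follows. The code assumes the common convention of framing the permutation with $\pi_0 = 0$ and $\pi_{n+1} = n + 1$, which is not stated explicitly in the text above; the function names are hypothetical.

```python
def breakpoints(perm):
    """b(pi): number of positions i with |pi_i - pi_{i-1}| > 1, where the
    permutation is framed by 0 and n + 1 (an assumption of this sketch)."""
    ext = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) > 1)

def trivial_lower_bound(perm):
    """ceil(b(pi) / 2): each reversal removes at most two breakpoints."""
    return (breakpoints(perm) + 1) // 2
```

For example, the permutation (2 3 1) has three breakpoints under this convention, so at least two reversals are needed to sort it.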
The lower bound $b(\pi) - c(\pi)$ turns out to be very tight, as observed first experimentally by various authors and then proved to be almost always the case by Caprara[45], who showed in [46] that determining $c(\pi)$ is essentially the same problem as determining $d(\pi)$, and that both problems are NP-hard.
Since computing $c(\pi)$ is hard, the lower bound $b(\pi) - c(\pi)$ should not be used directly in a Branch-and-Bound procedure for SBR. However, for any upper bound $c'(\pi)$ to $c(\pi)$, also the value $b(\pi) - c'(\pi)$ is a lower bound to $d(\pi)$, which may be considerably easier to compute. Given an effective Integer Linear Programming (ILP) formulation to find $c(\pi)$, a good upper bound to $c(\pi)$ can be obtained by LP relaxation. Let $\mathcal{C}$ denote the set of all the alternating cycles of $G(\pi) = (V, E)$, and for each $C \in \mathcal{C}$ define a binary variable $x_C$. The following is the ILP formulation of the maximum cycle decomposition:
$$c(\pi) := \max \sum_{C \in \mathcal{C}} x_C \tag{30}$$
subject to
$$\sum_{C \ni e} x_C \le 1 \qquad \forall e \in E \tag{31}$$
$$x_C \in \{0, 1\} \qquad \forall C \in \mathcal{C}. \tag{32}$$
Caprara et al. showed in [14] how to solve the exponentially large LP relaxation of (30)-(32) in polynomial time by column-generation techniques. Consider the dual of the LP relaxation of (30)-(32), which reads
$$c'(\pi) := \min \sum_{e \in E} y_e \tag{33}$$
subject to
$$\sum_{e \in C} y_e \ge 1 \qquad \forall C \in \mathcal{C} \tag{34}$$
$$y_e \ge 0 \qquad \forall e \in E. \tag{35}$$
The separation problem for (33)-(35), which is the same as the pricing problem for (30)-(32), is the following: given a weight $y^*_e$ for each $e \in E$, find an alternating cycle $C \in \mathcal{C}$ such that
$$\sum_{e \in C} y^*_e < 1, \tag{36}$$
or prove that none exists. Call an alternating cycle satisfying (36) a violated cycle. To solve (33)-(35) in polynomial time, one must be able to identify a violated cycle in polynomial time, i.e., an alternating cycle of $G(\pi)$ having weight $< 1$, where each edge $e \in E$ is given the weight $y^*_e$. This problem can be solved via a reduction to the solution of some non-bipartite min-cost perfect matching problems in a suitable graph. The reduction is similar to one used by Grötschel and Pulleyblank for finding minimum-weight odd and even paths in an undirected graph[47].
Although polynomial, the solution of the weighted matching problem on general graphs is computationally expensive to find. In [15] Caprara et al. describe a weaker bound, obtained by enlarging the set $\mathcal{C}$ in (30)-(32) to include also the pseudo-alternating cycles, i.e., cycles that alternate red and blue edges but may possibly use an edge twice. With this new set of variables, the pricing becomes much faster, since only bipartite matching problems must be solved.
Although weaker than the bound based on alternating cycles, the bound based on pseudo-alternating cycles turns out to be still very tight. Based on this bound, Caprara et al. developed a branch-and-price algorithm[15] which can routinely solve, in a matter of seconds, instances with $n = 200$ elements, a large enough size for all real-life instances available so far. Note that the previous limit for an optimization algorithm for SBR was $n = 30$. The algorithm in [15] was used to find, for the first time, the optimal solution to a man vs mouse instance of SBR, and to compare the mitochondrial genomes of 20 species, finding the optimal solution in all cases.
5 Single Nucleotide Polymorphisms
A DNA region whose content varies in a statistically significant way within a population is called a genetic polymorphism. The smallest possible region consists of a single nucleotide, and hence is called a Single Nucleotide Polymorphism, or SNP (pronounced "snip"). A SNP is a nucleotide site, in the middle of a DNA region which is otherwise identical for everybody, at which we observe a variability in a population. Furthermore, the variability is restricted to only two values (called alleles) out of the four possible (A, T, C, G). It is believed that SNPs are the predominant form of human genetic variation, and their importance cannot be overestimated for medical, drug-design, diagnostic and forensic applications.
Chrom. c, paternal: ataggtccCtatttccaggcgcCgtatacttcgacgggActata  ->  Haplotype 1: C C A
Chrom. c, maternal: ataggtccGtatttccaggcgcCgtatacttcgacgggTctata  ->  Haplotype 2: G C T
Fig.5. A chromosome and the two haplotypes.
Since DNA of diploid organisms is organized in pairs of chromosomes, for each SNP one can either be homozygous (same allele on both chromosomes) or heterozygous (different alleles). The values of a set of SNPs on a particular chromosome copy are called a haplotype. Haplotyping consists in determining a pair of haplotypes, one for each copy of a given chromosome. In Fig.5 we give a simplistic example of a chromosome with three SNP sites. This individual is heterozygous at SNPs 1 and 3 and homozygous at SNP 2. The haplotypes are CCA and GCT.
In recent years, several optimization problems have been defined for SNP data. Some of these were tackled by Integer Programming techniques, which we recall in this section.
We consider the haplotyping problems relative to a single individual and to a population. In the first case, the input is inconsistent haplotype data, coming from sequencing, with unavoidable sequencing errors. In the latter case, the input is ambiguous genotype data, which specifies only the multiplicity of each allele for each individual (i.e., it is known if individual $i$ is homozygous or heterozygous at SNP $j$, for each $i$ and $j$).
5.1 Haplotyping a Single Individual
The process of reading the content of a DNA molecule is called sequencing. Because of technical limitations, only short DNA fragments (a couple of thousand bases) can be sequenced. Furthermore, even with today's best possible instruments, sequencing errors are unavoidable. These consist of bases which have been misread or skipped altogether. Another source of problems can be the presence of contaminants, i.e., DNA coming from another organism which was wrongly mixed with the DNA to be sequenced. In this framework, the Single Individual Haplotyping Problem can be informally stated as: "given inconsistent haplotype data coming from fragment sequencing, remove the errors so as to retrieve a consistent pair of haplotypes."
The mathematical framework for this problem is as follows. Denote the two values that each SNP can take by 0 and 1. A haplotype, i.e., a chromosome content projected on a set of SNPs, is then a binary vector. Let $\mathcal{S} = \{1, \ldots, n\}$ be a set of SNPs and $\mathcal{F} = \{1, \ldots, m\}$ be a set of fragments. Each SNP is covered by some of the fragments, on which it can take the values 0 or 1. A fragment $f$ can be represented by a vector in $\{0, 1, -\}^n$. For a pair $(f, s) \in \mathcal{F} \times \mathcal{S}$, $f[s]$ denotes the $s$-th component of $f$. When $f[s] = -$, it means that the fragment $f$ does not cover the SNP $s$.
A gapless fragment is one covering a set of consecutive SNPs (i.e., between any two components in $\{0, 1\}$, there are only components in $\{0, 1\}$), otherwise, the fragment has gaps. There can be gaps for two reasons: (i) thresholding of low-quality reads (i.e., when the sequencer cannot call a SNP 0 or 1 with enough confidence, the component is marked with a "-"); (ii) mate-pairing in shotgun sequencing (a mate pair is a pair of fragments coming from the same chromosome copy; this pair is equivalent to a single fragment with one gap in the middle).
Fig.6. (a) A set of fragments $\mathcal{F}$. (b) The fragment conflict graph $G_{\mathcal{F}}$.
Two fragments $f$ and $g$ are said to be in conflict if there exists a SNP $k$ such that $f[k] = 0$ and $g[k] = 1$ or $f[k] = 1$ and $g[k] = 0$. The fragment conflict graph is the graph $G_{\mathcal{F}} = (\mathcal{F}, E_{\mathcal{F}})$ which has an edge for each pair of fragments in conflict (see Fig.6). The data are consistent with the existence of two haplotypes (one for each chromosome copy) if and only if $G_{\mathcal{F}}$ is a bipartite graph. In the presence of contaminants, or of some "bad" fragments (fragments with many sequencing errors), to correct the data one must face the problem of removing the fewest fragments so that $G_{\mathcal{F}}$ becomes bipartite. This problem is called the Minimum Fragment Removal (MFR) problem. When the fragments can contain gaps, the problem is NP-hard[16]. For this problem, the following IP formulation can be adopted so as to minimize the number of fragments (nodes) removed from $G_{\mathcal{F}}$ to make $G_{\mathcal{F}}$ bipartite[17]. Introduce a 0-1 variable $x_f$ for each fragment. The variables for which $x_f = 1$ in an optimal solution define the nodes to remove. Let $\mathcal{C}$ be the set of all odd cycles in $G_{\mathcal{F}}$. Since a graph is bipartite if and only if there are no odd cycles, one must remove at least one node from each cycle in $\mathcal{C}$. Therefore, we obtain the following Integer Programming (IP) formulation:
$$\min \sum_{f \in \mathcal{F}} x_f \tag{37}$$
s.t.
$$\sum_{f \in C} x_f \ge 1 \qquad \forall C \in \mathcal{C} \tag{38}$$
$$x_f \in \{0, 1\} \qquad \forall f \in \mathcal{F}. \tag{39}$$
The LP relaxation of (37)-(39) can be solved in polynomial time, by solving the following separation problem: Given fractional weights $x^*$ for the fragments, find an odd cycle $C$ for which $\sum_{f \in C} x^*_f < 1$. This can be obtained as follows. First, build a new graph $Q = (W, F)$ made, loosely speaking, of two copies of $G_{\mathcal{F}}$. For each node $i$ of $G_{\mathcal{F}}$ there are two nodes $i', i'' \in W$, and for each edge $ij$ of $G_{\mathcal{F}}$, there are two edges $i'j'', i''j' \in F$. Edges in $F$ are weighted, with weights $w(i'j'') = w(i''j') := x^*_i/2 + x^*_j/2$. By a parity argument, a path $P$ in $Q$ of weight $w(P)$, from $i'$ to $i''$, corresponds to an odd cycle $C$ through $i$ in $G_{\mathcal{F}}$, with $\sum_{j \in C} x^*_j = w(P)$. Hence, the shortest path corresponds to the "smallest" odd cycle, and if the shortest path has weight $\ge 1$ then there are no violated inequalities (38). Otherwise, the shortest path identifies a violated inequality.
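The separation step can be sketched with Dijkstra's algorithm on the doubled graph. This is an illustrative implementation under stated assumptions, not the code of [17]: nodes are assumed to be mutually comparable labels (e.g., integers), `x` maps each node to its fractional weight, and the function name `lightest_odd_cycle` is hypothetical. The returned node list is an odd closed walk, which is an odd cycle whenever no node repeats.

```python
import heapq

def lightest_odd_cycle(nodes, edges, x):
    """Return (weight, walk) for a minimum-weight odd closed walk, or None
    if the graph is bipartite. Copy (v, 0) plays the role of v', (v, 1) of v''."""
    adj = {(v, side): [] for v in nodes for side in (0, 1)}
    for i, j in edges:
        w = x[i] / 2 + x[j] / 2
        for side in (0, 1):
            # every edge of Q joins the two copies, so paths alternate sides
            adj[(i, side)].append(((j, 1 - side), w))
            adj[(j, side)].append(((i, 1 - side), w))
    best = None
    for s in nodes:
        dist, prev = {(s, 0): 0.0}, {}
        heap = [(0.0, (s, 0))]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue  # stale heap entry
            for v, w in adj[u]:
                if d + w < dist.get(v, float("inf")):
                    dist[v], prev[v] = d + w, u
                    heapq.heappush(heap, (d + w, v))
        goal = (s, 1)
        if goal in dist and (best is None or dist[goal] < best[0]):
            walk, u = [], goal
            while u != (s, 0):
                walk.append(u[0])
                u = prev[u]
            best = (dist[goal], walk[::-1])
    return best
```

If the returned weight is below 1, the nodes of the walk give a violated inequality (38); otherwise the current fractional point satisfies all odd-cycle constraints.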
To find good solutions of MFR in a short time, Lippert et al.[17] have also experimented with a heuristic procedure, based on the above LP relaxation and randomized rounding. First, the LP relaxation of (37)-(39) is solved, obtaining an optimal solution $x^*$. If $x^*$ is fractional, each variable $x_f$ is rounded up to 1 with probability $x^*_f$. If the solution is feasible, stop. Otherwise, fix to 1 all the variables that were rounded up so far (i.e., add the constraints $x_f = 1$) and iterate a new LP. This method converged very quickly on all tests on real data, with a small gap between the final solution (an upper bound to the optimum) and the first LP value (a lower bound to the optimum). On real data coming from shotgun sequencing, out of about 800 fragments, fewer than 5 fragments had to be thrown out to retrieve the correct haplotypes.
5.2 Haplotyping a Population
For economic reasons, when polymorphism screens are conducted on large populations it is not feasible to examine the two copies of each chromosome separately, and genotype, rather than haplotype, data is usually obtained. A genotype describes the multiplicity of each SNP allele. For any individual, at each SNP, three possibilities arise: homozygous for the allele 0, homozygous for the allele 1, or heterozygous (a situation denoted by the symbol 2). Hence a genotype can be represented by a vector in $\{0, 1, 2\}^n$, where each position with a 2 is called an ambiguous position. A genotype $g \in \{0, 1, 2\}^n$ is resolved by a pair of haplotypes $h, q \in \{0, 1\}^n$, written $g = h \oplus q$, if for each SNP $j$, $g[j] \in \{0, 1\}$ implies $h[j] = q[j] = g[j]$, and $g[j] = 2$ implies $h[j] \ne q[j]$. A genotype is called ambiguous if it has at least two ambiguous positions (a genotype with at most one ambiguous position can be resolved uniquely).
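The relation $g = h \oplus q$ translates directly into code. In this sketch genotypes and haplotypes are tuples of integers, and the names `resolves`, `genotype_of` and `count_resolutions` are assumptions of the example. The last function uses the fact that a genotype with $k \ge 1$ ambiguous positions is resolved by $2^{k-1}$ unordered pairs of haplotypes, which is why one ambiguous position still gives a unique resolution.

```python
def resolves(g, h, q):
    """True if genotype g (entries 0, 1, 2) is resolved by the haplotype
    pair h, q (entries 0, 1), i.e. g = h (+) q."""
    if not (len(g) == len(h) == len(q)):
        return False
    return all((gj == 2 and hj != qj) or (gj != 2 and hj == qj == gj)
               for gj, hj, qj in zip(g, h, q))

def genotype_of(h, q):
    """The genotype h (+) q: 2 where the haplotypes differ, else the allele."""
    return tuple(hj if hj == qj else 2 for hj, qj in zip(h, q))

def count_resolutions(g):
    """Number of unordered haplotype pairs resolving g."""
    k = sum(1 for gj in g if gj == 2)
    return 1 if k == 0 else 2 ** (k - 1)
```

For instance, `genotype_of((0, 0, 1, 1), (0, 1, 1, 0))` yields `(0, 2, 1, 2)`, an ambiguous genotype with two possible resolutions.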
The Population Haplotyping Problem is the following: Given a set $\mathcal{G}$ of $m$ genotypes over $n$ SNPs, find a set $\mathcal{H}$ of haplotypes such that each genotype is resolved by one pair of haplotypes in $\mathcal{H}$. To turn this problem into an optimization problem, one has to specify the objective function. One objective, which has been studied by Gusfield[18,19], is based on a greedy algorithm for haplotype inference, known as Clark's rule[48]. The goal is to derive the haplotypes in $\mathcal{H}$ by successive applications of the following inference rule.
Inference Rule. Given a genotype $g$ and a compatible haplotype $h$ (i.e., a haplotype $h$ which agrees with $g$ at all non-ambiguous positions), obtain a new haplotype $q$, such that $g = h \oplus q$, by setting $q[j] = 1 - h[j]$ at all ambiguous positions and $q[j] = h[j]$ at the remaining positions.
In order to resolve the genotypes in $\mathcal{G}$, Clark in [48] proposed the following, nondeterministic, procedure: Let $\mathcal{G}'$ be the set of non-ambiguous genotypes. Start with setting $\mathcal{G} := \mathcal{G} - \mathcal{G}'$ and $\mathcal{H}$ to be the set of haplotypes obtained from $\mathcal{G}'$. Then, repeat the following: take a $g \in \mathcal{G}$ and a compatible $h \in \mathcal{H}$ and apply the inference rule, obtaining $q$. Set $\mathcal{G} := \mathcal{G} - \{g\}$, $\mathcal{H} := \mathcal{H} \cup \{q\}$ and iterate. When no such $g$ and $h$ exist, the algorithm has succeeded if $\mathcal{G} = \emptyset$ and failed otherwise.
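The procedure above can be sketched as follows. The nondeterministic choice of the pair $(g, h)$ is modeled here by a random choice, which is an assumption of this example, as are the function names; the original procedure does not prescribe how the pair is selected.

```python
import random

def compatible(h, g):
    """h agrees with g at all non-ambiguous positions."""
    return all(gj == 2 or gj == hj for hj, gj in zip(h, g))

def infer(h, g):
    """Inference rule: flip h at the ambiguous positions of g."""
    return tuple(1 - hj if gj == 2 else hj for hj, gj in zip(h, g))

def clark(genotypes, rng=None):
    """Run one (random) execution; return (haplotypes, unresolved genotypes)."""
    rng = rng or random.Random()
    def n_amb(g):
        return sum(1 for v in g if v == 2)
    unresolved = [g for g in genotypes if n_amb(g) >= 2]
    haplotypes = set()
    for g in genotypes:
        if n_amb(g) <= 1:  # non-ambiguous: its resolution is unique
            h = tuple(0 if v == 2 else v for v in g)
            haplotypes.update((h, infer(h, g)))
    while True:
        moves = [(g, h) for g in unresolved
                 for h in sorted(haplotypes) if compatible(h, g)]
        if not moves:
            return haplotypes, unresolved  # success iff unresolved is empty
        g, h = rng.choice(moves)
        unresolved.remove(g)
        haplotypes.add(infer(h, g))
```

Different seeds may leave different numbers of unresolved genotypes, which is the behavior that motivates the optimization problem discussed next.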
The non-determinism in the choice of the pair $g$, $h$ to which the inference rule is applied can lead to unsuccessful runs even for instances in which there exists an order of application of the rule which would resolve all genotypes in $\mathcal{G}$. To find the best possible order of application of the inference rule, Gusfield considered the following optimization problem: find the ordering of application of the inference rule that leaves the fewest unresolved genotypes in the end.
To solve the problem, Guseld[18;19] proposed
an Integer Programming approach. The problem is
rst transformed into a problem on a digraph G =
15
S
(N; A), dened as follows. Let N = g2G N (g),
where N (g) := f(h; g)jh is compatible with gg.
N (g ) corresponds to the set of possible haplotypes obtainable by setting each ambiguous position of a genotype
S g to one of the 2 possible values. Let N 0 = g2G 0 N (g) correspond to the subset of haplotypes unambiguously determined from
the set G 0 of unambiguous genotypes. For each
pair v = (h; g0 ), w = (q; g) in N , there is an arc
(v; w) 2 A if g is ambiguous, g0 6= g and g = h q.
Any directed tree rooted at a node $v \in N'$ specifies a feasible history of successive applications of the inference rule starting at node $v$, and the problem can be stated as follows: find the largest number of nodes in $N \setminus N'$ that can be reached by a set of node-disjoint directed trees, where each tree is rooted at a node in $N'$ and where, for every ambiguous genotype $g$, at most one node in $N(g)$ is reached. To solve this problem, Gusfield[18] proposed the following formulation, based on 0-1 variables $x_v$ associated with the nodes in $N$:
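$$\max \sum_{v \in N \setminus N'} x_v \tag{40}$$

subject to

$$\sum_{v \in N(g)} x_v \le 1 \quad \forall g \in \mathcal{G} \setminus \mathcal{G}' \tag{41}$$

$$x_w \le \sum_{v \in \delta^-(w)} x_v \quad \forall w \in N \setminus N' \tag{42}$$

with $x_v \in \{0, 1\}$ for $v \in N$.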
Gusfield observed that the LP solution, on real and simulated data, was almost always integer, thus requiring no branching in the branch-and-bound search. Also, although an integer solution may in principle contain directed cycles, this situation never occurred in the experiments, and hence there was no need to add subtour elimination-type inequalities to the model. The above model was used to resolve real and randomly generated input instances of up to 80 genotypes, in a matter of seconds.
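To make the model (40)-(42) concrete, the sketch below evaluates it by exhaustive enumeration of all 0-1 vectors on a tiny, hand-made digraph. Enumeration is used here only as a stand-in for an IP solver and is not the method used in the experiments; the node names, the function `best_selection` and the example graph are hypothetical. Like the model itself, the sketch does not exclude directed cycles.

```python
from itertools import product


def best_selection(nodes, roots, arcs, groups):
    """Maximize (40) subject to (41)-(42) by trying every 0-1 vector.

    groups lists the node sets N(g), one per ambiguous genotype g.
    """
    preds = {w: [v for (v, u) in arcs if u == w] for w in nodes}
    non_roots = [v for v in nodes if v not in roots]
    best_val, best_x = -1, None
    for bits in product((0, 1), repeat=len(nodes)):
        x = dict(zip(nodes, bits))
        # (41): at most one node chosen per ambiguous genotype
        if any(sum(x[v] for v in grp) > 1 for grp in groups):
            continue
        # (42): a non-root node needs at least one chosen predecessor
        if any(x[w] > sum(x[v] for v in preds[w]) for w in non_roots):
            continue
        val = sum(x[v] for v in non_roots)   # objective (40)
        if val > best_val:
            best_val, best_x = val, x
    return best_val, best_x


# Hypothetical instance: root r, two candidate nodes for one genotype
# (a1, a2) and one candidate node for another genotype (b1).
nodes = ["r", "a1", "a2", "b1"]
roots = {"r"}
arcs = [("r", "a1"), ("r", "a2"), ("a1", "b1")]
groups = [["a1", "a2"], ["b1"]]
value, x = best_selection(nodes, roots, arcs, groups)
print(value, x)   # 2 {'r': 1, 'a1': 1, 'a2': 0, 'b1': 1}
```

Choosing `a2` instead of `a1` would block `b1`, which is the kind of bad ordering that the optimization model is designed to avoid.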
One of the reasons for the popularity of Clark's rule is a common belief among biologists that it often produces the smallest set of haplotypes for a given set of genotypes. However, when the objective is to minimize the number of different haplotypes, it is easy to construct instances on which Clark's rule behaves extremely poorly. This new objective for population haplotyping is called parsimony, and requires determining a set $\mathcal{H}$ of haplotypes that resolves the set $\mathcal{G}$ of genotypes and such that $|\mathcal{H}|$ is minimum. For this problem, Gusfield[20] proposed the following Integer Programming model.
Denote by $\hat{\mathcal{H}}$ the set of those haplotypes which are compatible with some genotype in $\mathcal{G}$. Associate a 0-1 variable $x_h$ to every $h \in \hat{\mathcal{H}}$, such that $x_h = 1$ if $h$ is taken into the solution, and $x_h = 0$ otherwise. Fix any total order on $\hat{\mathcal{H}}$ and denote by $P$ the set of those pairs $(h_1, h_2)$ with $h_1, h_2 \in \hat{\mathcal{H}}$, $h_1 < h_2$. Introduce a 0-1 variable $y_{h_1,h_2}$ for every $(h_1, h_2) \in P$. Moreover, for every $g \in \mathcal{G}$, let $P_g := \{(h_1, h_2) \in P \mid h_1 \oplus h_2 = g\}$.
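We obtain the following IP formulation for the parsimony population haplotyping problem:

$$\min \sum_{h \in \hat{\mathcal{H}}} x_h \tag{43}$$

subject to

$$\sum_{(h_1,h_2) \in P_g} y_{h_1,h_2} \ge 1 \quad \forall g \in \mathcal{G} \tag{44}$$

$$x_{h_1} \ge y_{h_1,h_2} \quad \forall (h_1, h_2) \in P \tag{45}$$

$$x_{h_2} \ge y_{h_1,h_2} \quad \forall (h_1, h_2) \in P \tag{46}$$

$$x_h \in \{0, 1\}, \; y_{h_1,h_2} \in \{0, 1\} \quad \forall h \in \hat{\mathcal{H}}, \; \forall (h_1, h_2) \in P. \tag{47}$$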
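The parsimony objective can be checked on very small instances with the following Python sketch, which searches subsets of $\hat{\mathcal{H}}$ by increasing size instead of calling an IP solver. It is a brute-force stand-in, not the formulation itself. It assumes genotypes are tuples over {0, 1, 2} with 2 marking an ambiguous position, and, unlike the set $P$ above, it also admits pairs with $h_1 = h_2$ so that non-ambiguous genotypes can be handled. All names are hypothetical.

```python
from itertools import combinations, product

AMBIG = 2  # assumed marker for an ambiguous position


def expansions(g):
    """All haplotypes compatible with genotype g."""
    choices = [(0, 1) if gj == AMBIG else (gj,) for gj in g]
    return list(product(*choices))


def resolving_pairs(g):
    """Pairs (h1, h2), h1 <= h2 in lexicographic order, that resolve g."""
    pairs = set()
    for h in expansions(g):
        q = tuple(1 - hj if gj == AMBIG else hj for hj, gj in zip(h, g))
        pairs.add((min(h, q), max(h, q)))
    return pairs


def parsimony(genotypes):
    """Smallest haplotype set resolving every genotype (exponential time)."""
    candidates = sorted({h for g in genotypes for h in expansions(g)})
    pair_sets = [resolving_pairs(g) for g in genotypes]
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            if all(any(h1 in chosen and h2 in chosen for h1, h2 in ps)
                   for ps in pair_sets):
                return chosen
    return None


print(sorted(parsimony([(2, 2), (2, 0), (0, 2)])))
# [(0, 0), (0, 1), (1, 0)]
```

Here the genotypes `(2, 0)` and `(0, 2)` force the haplotypes `(0, 0)`, `(1, 0)` and `(0, 1)`, and the pair `(0, 1)`, `(1, 0)` then resolves `(2, 2)` at no extra cost.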
The computational experiments showed that the IP approach can be practical for instances of this problem of reasonable size. Gusfield used the commercial package CPLEX as the IP solver, and reported solutions for problems with up to 50 individuals and 30 SNPs, obtained, on average, in less than 5 minutes. The instances were generated by a standard program for haplotype simulation, which includes a parameter to set the level of recombination. Gusfield showed that, at low levels of recombination, the parsimony objective function can retrieve the correct haplotypes with 100% accuracy, and even at high levels of recombination, the objective makes it possible to retrieve about 80% of the correct haplotypes.
As shown by Lancia et al.[49], the IP model (43)-(47) can be used to obtain an approximation algorithm for the problem, when the number of ambiguous positions in each genotype is bounded.
For any given constant $k$, assume that every genotype $g \in \mathcal{G}$ contains at most $k$ ambiguous positions. Then, the number of variables and constraints in the above IP is polynomial and the LP relaxation can be solved to optimality in polynomial time. Given the optimal LP solution $(x^*, y^*)$, we can obtain a feasible integer solution $(x, y)$ by deterministically rounding up to 1 each variable whose value is at least $1/2^{k-1}$, and rounding it down to 0 otherwise. It is easy to check that the rounding does not destroy the feasibility of the solution at hand. In particular, the new variables still satisfy constraints (44), since each genotype has at most $2^{k-1}$ pairs of haplotypes that can resolve it, and hence for one of such pairs, $y^*$ must be at least $1/2^{k-1}$. Constraints (45) and (46) are preserved as well, because $x^*_{h_1} \ge y^*_{h_1,h_2}$ and $x^*_{h_2} \ge y^*_{h_1,h_2}$ imply that both $x$ variables are rounded up whenever $y_{h_1,h_2}$ is. Since the rounding increases the value of the objective function by at most a factor of $2^{k-1}$, the approach is a $2^{k-1}$-approximation algorithm.
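The rounding step can be sketched as follows. The code assumes that a fractional solution is available as two dictionaries (no LP solver is invoked here); the function names, the string labels for haplotypes and the small tolerance `eps` for floating-point values are assumptions of this illustration.

```python
def round_lp_solution(x, y, k, eps=1e-9):
    """Round up every variable with value >= 1/2^(k-1), round down the rest."""
    threshold = 1.0 / 2 ** (k - 1)
    x_int = {h: int(v >= threshold - eps) for h, v in x.items()}
    y_int = {p: int(v >= threshold - eps) for p, v in y.items()}
    return x_int, y_int


def is_feasible(x, y, pairs_by_genotype):
    """Check constraints (44)-(46) for a (possibly fractional) solution."""
    cover = all(sum(y[p] for p in pairs) >= 1
                for pairs in pairs_by_genotype.values())
    link = all(x[h1] >= y[(h1, h2)] and x[h2] >= y[(h1, h2)]
               for (h1, h2) in y)
    return cover and link


# One genotype with k = 2 ambiguous positions and its 2^(k-1) = 2 pairs.
k = 2
pairs_by_genotype = {"22": [("00", "11"), ("01", "10")]}
x_frac = {"00": 0.5, "11": 0.5, "01": 0.5, "10": 0.5}
y_frac = {("00", "11"): 0.5, ("01", "10"): 0.5}

x_int, y_int = round_lp_solution(x_frac, y_frac, k)
print(is_feasible(x_int, y_int, pairs_by_genotype))   # True
print(sum(x_int.values()), sum(x_frac.values()))      # 4 2.0
```

In this example the rounded objective equals $2^{k-1}$ times the fractional one, so the bound on the increase is attained.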
6 Conclusions
In this survey we have reviewed some of the Integer Programming approaches to Computational Biology problems of the last few years. For space reasons, not all approaches of this type have been described. For instance, a very recent Branch-and-Price approach was proposed for the solution of a protein fold prediction problem. The approach, by Xu et al.[21], which tries to determine the best threading of a protein sequence to a known 3D protein structure, bears many similarities to the models described in Section 3.2 and Section 3.3 for structure and sequence alignments.
Another problem for which Integer Programming approaches were proposed in the literature concerns the mapping of genomic markers, such as probes or radiation hybrids, on a set of clones (typically, Yeast Artificial Chromosomes).
For one such problem, Physical Mapping with end-probes, a Branch-and-Cut approach was proposed by Christof et al.[22]. The problem requires determining the order of occurrence of a set of probes on a set of clones, when some probes are known to be end-probes, i.e., coming from the ends of some clone.
Finally, the mapping problems allow us to mention the indirect use of Integer Programming approaches for Computational Biology problems. In fact, mapping problems can often be reduced to a very well-known combinatorial optimization problem, the TSP, for which the best solution methods are based on Integer Programming and Branch-and-Cut. For the Radiation Hybrid Mapping Problem, Agarwala et al.[23] used a state-of-the-art TSP software package, CONCORDE[50]. In a validating experiment, some publicly available maps were gathered from a radiation hybrid databank, and new maps were computed by the program. The quality of a map was assessed by counting how many times two markers that appear in some order on a genomic sequence deposited in GenBank are put in the opposite order by the map (i.e., the map is inconsistent with the sequence). It was shown that the public maps were inconsistent for at least 50% of the markers, while the maps computed by the program were almost perfectly consistent, thereby validating the TSP-based approach.
References
[1] Nemhauser G L, Wolsey L. Integer and Combinatorial Optimization. New York: John Wiley and Sons, 1988.
[2] Cook W J, Cunningham W H, Pulleyblank W R, Schrijver A. Combinatorial Optimization. New York: John Wiley and Sons, 1998.
[3] Schrijver A. Theory of Linear and Integer Programming. New York: John Wiley and Sons, 1986.
[4] Reinert K, Lenhof H-P, Mutzel P, Mehlhorn K, Kececioglu J. A branch-and-cut algorithm for multiple sequence alignment. In Proceedings of the Annual International Conference on Computational Molecular Biology (RECOMB), 1997, pp.241-249.
[5] Kececioglu J, Lenhof H-P, Mehlhorn K, Mutzel P, Reinert K, Vingron M. A polyhedral approach to sequence alignment problems. Discrete Applied Mathematics, 2000, 104: 143-186.
[6] Fischetti M, Lancia G, Serafini P. Exact algorithms for minimum routing cost trees. Networks, 2002, 39(3): 161-173.
[7] Gusfield D. Efficient methods for multiple sequence alignment with guaranteed error bounds. Bulletin of Mathematical Biology, 1993, 55: 141-154.
[8] Ben-Dor A, Lancia G, Perone J, Ravi R. Banishing bias from consensus sequences. In Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM), Lecture Notes in Computer Science, 1997, 1264: 247-261.
[9] Li M, Ma B, Wang L. On the closest string and substring problems. Journal of the ACM, 2002, 49(2): 157-171.
[10] Ma B. A polynomial time approximation scheme for the closest substring problem. In Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM), Lecture Notes in Computer Science, 2000, 1848: 99-107.
[11] Lanctot J, Li M, Ma B et al. Distinguishing string selection problems. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 1999, pp.633-642.
[12] Lenhof H-P, Reinert K, Vingron M. A polyhedral approach to RNA sequence structure alignment. Journal of Computational Biology, 1998, 5(3): 517-530.
[13] Lancia G, Carr R, Walenz B et al. 101 optimal PDB structure alignments: A branch-and-cut algorithm for the maximum contact map overlap problem. In Proceedings of the Annual International Conference on Computational Biology (RECOMB), 2001, pp.193-202.
[14] Caprara A, Lancia G, Ng S K. A Column-Generation Based Branch-and-Bound Algorithm for Sorting by Reversals. In Mathematical Support for Molecular Biology, Farach-Colton M, Roberts F S, Vingron M, Waterman M (eds.), DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 1999, 47: 213-226.
[15] Caprara A, Lancia G, Ng S K. Sorting permutations by reversals through branch and price. INFORMS Journal on Computing, 2001, 13(3): 224-244.
[16] Lancia G, Bafna V, Istrail S et al. SNPs problems, complexity and algorithms. In Proceedings of the Annual European Symposium on Algorithms (ESA), Lecture Notes in Computer Science, 2001, 2161, pp.182-193.
[17] Lippert R, Schwartz R, Lancia G et al. Algorithmic strategies for the SNPs haplotype assembly problem. Briefings in Bioinformatics, 2002, 3(1): 23-31.
[18] Gusfield D. A practical algorithm for optimal inference of haplotypes from diploid populations. In Proceedings of the Annual International Conference on Intelligent Systems for Molecular Biology (ISMB), 2000, pp.183-189.
[19] Gusfield D. Inference of haplotypes from samples of diploid populations: Complexity and Algorithms. Journal of Computational Biology, 2001, 8(3): 305-324.
[20] Gusfield D. Haplotype inference by pure parsimony. Technical Report CSE-2003-2, Department of Computer Science, University of California at Davis, 2003.
[21] Xu J, Li M, Kim D et al. RAPTOR: Optimal protein threading by linear programming. Journal of Bioinformatics and Computational Biology, 2003, 1(1): 95-117.
[22] Christof T, Jünger M, Kececioglu J et al. A branch-and-cut approach to physical mapping with end-probes. In Proceedings of the Annual International Conference on Computational Molecular Biology (RECOMB), 1997.
[23] Agarwala R, Applegate D, Maglott D et al. A fast and scalable radiation hybrid map construction and integration strategy. Genome Research, 2000, 10: 230-364.
[24] Khachiyan L G. A polynomial algorithm for linear programming. Soviet Mathematics Doklady, 1979, 20: 191-194.
[25] Karmarkar N. A new polynomial time algorithm for linear programming. Combinatorica, 1984, 4: 375-395.
[26] Grötschel M, Lovász L, Schrijver A. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1981, 1: 169-197.
[27] Barnhart C, Johnson E L, Nemhauser G L et al. Branch-and-price: Column generation for solving huge integer programs. Operations Research, 1998, 46(3): 316-329.
[28] Casey D. The Human Genome Project Primer on Molecular Genetics, U.S. Department of Energy, http://www.ornl.gov/hgmis/publicat/primer/intro.html, 1992.
[29] Fitch W M. An Introduction to Molecular Biology for Mathematicians and Computer Programmers. In Mathematical Support for Molecular Biology, Farach-Colton M, Roberts F S, Vingron M, Waterman M (eds.), DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 1999, 47: 1-31.
[30] Watson J D, Gilman M, Witkowski J et al. Recombinant DNA. Scientific American Books, W.H. Freeman and Co, 1992.
[31] Caprara A, Lancia G. Structural alignment of large-size proteins via lagrangian relaxation. In Proceedings of the Annual International Conference on Computational Molecular Biology (RECOMB), 2002, pp.100-108.
[32] Needleman S B, Wunsch C D. An efficient method applicable to the search for similarities in the amino acid sequences of two proteins. Journal of Molecular Biology, 1970, 48: 443-453.
[33] Wang L, Jiang T. On the complexity of multiple sequence alignment. Journal of Computational Biology, 1994, 1: 337-348.
[34] Sankoff D, Kruskal J. Time Warps, String Edits, and Macromolecules: The Theory and Practice of String Comparison. CSLI Publications, 1999.
[35] Kececioglu J. The maximum weight trace problem in multiple sequence alignment. In Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM), Lecture Notes in Computer Science 684, 1993, pp.106-119.
[36] Wu B Y, Lancia G, Bafna V et al. A polynomial-time approximation scheme for minimum routing cost spanning trees. SIAM Journal on Computing, 1999, 29(3): 761-778.
[37] Frances M, Litman A. On covering problems of codes. Theory of Computing Systems, 1997, 30(2): 113-119.
[38] Raghavan P, Thompson C D. Randomized rounding: A technique for provably good algorithms and algorithmic proofs. Combinatorica, 1987, 7: 365-374.
[39] Raghavan P. A probabilistic construction of deterministic algorithms: Approximating packing integer programs. Journal of Computer and System Sciences, 1988, 37: 130-143.
[40] Lemmen C, Lengauer T. Computational methods for the structural alignment of molecules. Journal of Computer-Aided Molecular Design, 2000, 14: 215-232.
[41] Berman H M, Westbrook J, Feng Z et al. The protein data bank. Nucleic Acids Research, 2000, 28: 235-242.
[42] Goldman D, Istrail S, Papadimitriou C. Algorithmic aspects of protein structure similarity. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS), 1999, pp.512-522.
[43] Godzik A, Skolnick J, Kolinski A. A topology fingerprint approach to inverse protein folding problem. Journal of Molecular Biology, 1992, 227: 227-238.
[44] Bafna V, Pevzner P A. Genome rearrangements and sorting by reversals. SIAM Journal on Computing, 1996, 25: 272-289.
[45] Caprara A. On the tightness of the alternating-cycle lower bound for sorting by reversals. Journal of Combinatorial Optimization, 1999, 3: 149-182.
[46] Caprara A. Sorting permutations by reversals and eulerian cycle decompositions. SIAM Journal on Discrete Mathematics, 1999, 12: 91-110.
[47] Grötschel M, Pulleyblank W R. Weakly bipartite graphs and the max-cut problem. Operations Research Letters, 1981, 1: 23-27.
[48] Clark A. Inference of haplotypes from PCR-amplified samples of diploid populations. Molecular Biology and Evolution, 1990, 7: 111-122.
[49] Lancia G, Pinotti C, Rizzi R. Haplotyping populations: Complexity and approximations. Technical Report, University of Trento, 2002.
[50] Applegate D, Bixby R, Chvátal V et al. CONCORDE: A code for solving traveling salesman problems. http://www.math.princeton.edu/tsp/concorde.html
Giuseppe Lancia is an associate professor in operations research at the University of Udine, Italy. He
holds the M.S. and the Ph.D. degrees in algorithms,
combinatorics and optimization from Carnegie Mellon
University, USA. His research interests include combinatorial optimization, mathematical programming, and
discrete mathematics, with a special emphasis on their
applications to computational molecular biology problems. In this field, Giuseppe Lancia has authored some
30 papers. He has been a member of the program
committee of prestigious conferences such as RECOMB
and WABI and a co-editor of a special issue of the INFORMS Journal on Computing. Furthermore, he has
spent extended periods as a visiting scientist at Sandia National Labs, Albuquerque NM, and Celera Genomics, Rockville MD, working on computational biology problems related to SNPs genotyping, protein
structure analysis, sequence alignment and genome rearrangements.