Math 239: Discrete Mathematics for the Life Sciences
Spring 2008
Lecture 9 — February 21
Lecturer: Lior Pachter
Scribe/Editor: Sudeep Juvekar/Allen Chen
9.1 What is an Alignment?
In this lecture, we will define different types of alignments and explain some of their properties. We begin with a suitable alphabet $\Sigma$. Throughout the discussion, we will choose $\Sigma = \{A, C, G, T\}$, the four nucleotides, although the definitions hold for all finite alphabets. Examples of other alphabets include $\Sigma' = \{A, C, G, T, N\}$, where $N$ denotes an unknown base. The alphabet $\Sigma'$ is widely used in the computational biology literature.
Notation. Consider $k$ sequences $\sigma^1, \sigma^2, \ldots, \sigma^k$ over $\Sigma$. The length of the $i$th sequence is denoted $|\sigma^i| = n_i$; the sequence lengths may differ. We use $\sigma^i_j$ to denote the $j$th character of the $i$th sequence. We also define the positions of the sequences, independent of the characters occupying them. Thus,
$$S_{\sigma^i} = \{(i, j) \mid j \in \{1, 2, \ldots, n_i\}\}$$
denotes the set of positions in the sequence $\sigma^i$. Therefore,
$$S_{\sigma^1 \cdots \sigma^k} = \bigcup_{i=1}^{k} S_{\sigma^i}$$
denotes the union of the positions of all sequences $\sigma^1, \sigma^2, \ldots, \sigma^k$.
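As a small illustration of this notation, the following Python sketch builds the set of positions for a list of sequences. The function name `positions` and the example sequences are assumptions made for illustration only; positions are 1-indexed to match the notes.

```python
def positions(seqs):
    """Return the set of positions (i, j), with i and j both 1-indexed."""
    return {
        (i, j)
        for i, seq in enumerate(seqs, start=1)
        for j in range(1, len(seq) + 1)
    }


# Hypothetical example: two sequences of lengths 3 and 2.
seqs = ["ACG", "TT"]
S = positions(seqs)
```

Here `S` contains five positions: three for the first sequence and two for the second. The characters themselves play no role.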
Sequence alignment is complicated by the fact that DNA has a double helical structure, in which the two strands are reverse complements of each other, as shown in Figure 9.1. Further, the two strands of the helix have opposite directionality, denoted by ($5' \to 3'$). This directionality and reverse complementarity must be incorporated in many computational biology applications, including sequence alignment.
Figure 9.1. Reverse complementarity of DNA strands.
Throughout this discussion, we use the following convention to incorporate the reverse complementarity of DNA: given any $k$ DNA sequences, we add $k$ more sequences to produce $2k$ sequences $\sigma^1, \sigma^2, \ldots, \sigma^{2k}$, where the odd sequences $\{\sigma^{2i-1} \mid i \in \{1, 2, \ldots, k\}\}$ denote the original set of sequences, and the even sequences $\sigma^{2i}$ are the reverse complements of $\sigma^{2i-1}$.
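The convention can be sketched in Python as follows. The helper names `reverse_complement` and `with_reverse_complements` are hypothetical, and the sketch assumes sequences over the alphabet {A, C, G, T} only.

```python
# Watson-Crick complement of each nucleotide.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse the sequence and complement every base."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def with_reverse_complements(seqs):
    """Interleave each sequence with its reverse complement.

    In 1-based numbering, odd entries of the result are the originals
    and even entries are their reverse complements.
    """
    doubled = []
    for seq in seqs:
        doubled.append(seq)
        doubled.append(reverse_complement(seq))
    return doubled
```

For example, `with_reverse_complements(["AACG", "T"])` returns `["AACG", "CGTT", "T", "A"]`.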
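Definition 9.1. Given a set of sequences $\sigma^1, \sigma^2, \ldots, \sigma^k$, a homology forest is a laminar family for $S_{\sigma^1 \cdots \sigma^k}$, that is, a family of subsets of $S_{\sigma^1 \cdots \sigma^k}$ (called edges below) in which any two members are either disjoint or nested.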
9.2 Local Alignment
We now define an alignment of sequences.
Definition 9.2. An alignment of sequences $\sigma^1, \sigma^2, \ldots, \sigma^k$ is an equivalence relation $\sim\ \subseteq S_{\sigma^1 \cdots \sigma^k} \times S_{\sigma^1 \cdots \sigma^k}$.
We say that the character $\sigma^i_j$ is aligned to the character $\sigma^{i'}_{j'}$ if $(i, j) \sim (i', j')$. There is a natural alignment associated with a homology forest, namely the equivalence relation given by its components. Thus, two positions are $\sim$-related if they belong to the same component (tree) of the homology forest. More formally, $(i, j)$ is $\sim$-related to $(i', j')$ iff there exist edges $e_1, e_2, \ldots, e_m$ and vertices $Z_1, Z_2, \ldots, Z_{m+1}$ such that $(i, j) = Z_1$, $(i', j') = Z_{m+1}$, and $\{Z_r, Z_{r+1}\} \subseteq e_r$ for every $r \in \{1, 2, \ldots, m\}$. It is easy to see that $\sim$ defines an equivalence relation.
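The passage from edges to equivalence classes can be sketched with a union-find structure. This is a minimal illustration, assuming each edge is given as a Python set of positions; the function name `alignment_classes` and the example data are hypothetical.

```python
def alignment_classes(positions, edges):
    """Return the equivalence classes of ~ as a set of frozensets.

    Two positions end up in the same class exactly when a chain of
    edges connects them.
    """
    parent = {p: p for p in positions}

    def find(p):
        # Walk up to the representative, shortening the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for edge in edges:
        members = list(edge)
        for q in members[1:]:
            parent[find(q)] = find(members[0])

    classes = {}
    for p in positions:
        classes.setdefault(find(p), set()).add(p)
    return {frozenset(c) for c in classes.values()}


# Hypothetical example: three sequences, two nested edges.
S = {(1, 1), (1, 2), (2, 1), (2, 2), (3, 1)}
edges = [{(1, 1), (2, 1)}, {(1, 1), (2, 1), (3, 1)}]
classes = alignment_classes(S, edges)
```

In this example the first positions of all three sequences form one class, and the positions (1, 2) and (2, 2) remain singletons.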
Figure 9.2. Alignment of four sequences.
Figure 9.2 shows the equivalence classes formed by a homology forest on a set of four sequences. Equivalent positions of the sequences are connected by a line. Note that by this definition, two positions in the same sequence can belong to the same equivalence class and hence can be “aligned”. We now define different classes of alignments that strengthen Definition 9.2.
Definition 9.3. A monotopoorthologous (MTO) homology forest is a laminar family in which no edge contains both $(i, a)$ and $(i, b)$ for any $i$ and any $a \neq b$.
It is easy to see that the homology forest in Figure 9.2 is not an MTO forest because a line connects different positions of sequence $\sigma^1$. Eliminating all such lines from the figure, however, will result in an MTO homology forest.
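The MTO condition is easy to test mechanically. The sketch below assumes that each edge is a Python set of (sequence, position) pairs; the function name `is_mto` is hypothetical.

```python
def is_mto(edges):
    """Return True if no edge contains two distinct positions of one sequence."""
    for edge in edges:
        seq_ids = [i for (i, _) in edge]
        # Each edge is a set, so a repeated sequence index means two
        # different positions of that sequence lie in the same edge.
        if len(seq_ids) != len(set(seq_ids)):
            return False
    return True
```

For instance, `is_mto([{(1, 1), (2, 1)}])` is `True`, whereas `is_mto([{(1, 1), (1, 3)}])` is `False` because both positions belong to sequence 1.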
We now define a local alignment based on the definition of an MTO homology forest above.
Definition 9.4. A local alignment of $\sigma^1, \sigma^2, \ldots, \sigma^k$ is the equivalence relation induced by the set of components of an MTO homology forest $H$, together with a partial order $\leq_H$ on the components $C(H)$ of $H$ such that if $(i, j_1) \in C_r$ and $(i, j_2) \in C_s$ for some $C_r \leq_H C_s$, then $j_1 \leq j_2$.
The name “local alignment” is used because the order on the components is defined using
“local information” of positions in individual sequences. The local alignment is represented
as a tuple (H, ≤H ). In further discussion, we will refer to the poset (C(H), ≤H ), as an
alignment poset.
For two sequences, a local alignment can also be represented as a dotplot, as shown in Figure 9.3. In the figure, the x-axis represents the positions of sequence $\sigma^1$ and the y-axis represents the positions of sequence $\sigma^2$. A dot at position $(i, j)$ means that $(1, i)$ and $(2, j)$ are aligned.
Figure 9.3. A dotplot for two sequences.
9.3 Partial Global Alignment
We now define a partial global alignment of a set of sequences.
Definition 9.5. A partial global alignment of a set of sequences $\sigma^1, \sigma^2, \ldots, \sigma^k$ is the equivalence relation induced by the components of an MTO homology forest $H$ for which there exists a partial order $\leq_H$ on its components $C(H)$, such that if $(i, j_1) \in C_r$ and $(i, j_2) \in C_s$ and $j_1 \leq j_2$, then $C_r \leq_H C_s$.
Note that there is a subtle difference between the definitions of local alignment and partial global alignment. Consider two components $C = \{(1, 2), (2, 2)\}$ and $D = \{(1, 3), (3, 3)\}$. By the definition of partial global alignment, $C \leq_H D$. However, the definition of local alignment asserts only that $D \not\leq_H C$.
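The difference between the two conditions can be checked on this example with a short Python sketch. It assumes that components are stored in a dictionary keyed by name and that the order is given as a set of pairs (r, s) meaning that component r is below or equal to component s. The function names are hypothetical, and the code does not verify that the supplied relation is actually a partial order.

```python
def satisfies_local(components, order):
    """Definition 9.4: C_r <= C_s forces j1 <= j2 on any shared sequence."""
    return all(
        j1 <= j2
        for (r, s) in order
        for (i1, j1) in components[r]
        for (i2, j2) in components[s]
        if i1 == i2
    )


def satisfies_partial_global(components, order):
    """Definition 9.5: j1 <= j2 on a shared sequence forces C_r <= C_s."""
    return all(
        (r, s) in order
        for r in components
        for s in components
        for (i1, j1) in components[r]
        for (i2, j2) in components[s]
        if i1 == i2 and j1 <= j2
    )


# The two components from the text.
components = {"C": {(1, 2), (2, 2)}, "D": {(1, 3), (3, 3)}}
reflexive_only = {("C", "C"), ("D", "D")}
c_below_d = reflexive_only | {("C", "D")}
d_below_c = reflexive_only | {("D", "C")}
```

With these definitions, leaving C and D incomparable (`reflexive_only`) satisfies the local condition but not the partial global one, `c_below_d` satisfies both, and `d_below_c` satisfies neither.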
Example. An important example of a partial global alignment is the null alignment shown in Figure 9.4. The homology forest in a null alignment is a set of disjoint vertices. The only order relations are those imposed by the definition of partial global alignment, namely the singleton component $\{(i, j_1)\}$ is $\leq_H$ the singleton component $\{(i, j_2)\}$ for all $i$ and $j_1 \leq j_2$. For simplicity, reverse complements are omitted from Figure 9.4.
Figure 9.4. Null alignment of three sequences.
Proposition 9.6. For two sequences of lengths $n$ and $m$, the number of partial global alignments is $\binom{m+n}{n}$.
Proof: Consider any partial global alignment of the two sequences in which exactly $i$ components have size 2 and all remaining components are singletons. Such an alignment can be constructed as follows: select any $i$ positions in the first sequence, which can be done in $\binom{n}{i}$ ways. Similarly, the $i$ positions of the second sequence can be chosen in $\binom{m}{i}$ ways. Because of the order condition in the definition of a partial global alignment, these $i$ positions of the two sequences can be aligned with each other in a unique way, namely, by aligning the pairs in order. Thus, there are $\binom{n}{i} \cdot \binom{m}{i}$ ways of aligning exactly $i$ pairs of characters. The total number of partial global alignments of the two sequences is therefore
$$\sum_{i=0}^{\min(m,n)} \binom{n}{i} \cdot \binom{m}{i}, \tag{9.1}$$
which is equal to $\binom{m+n}{n}$ by Vandermonde's identity.
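The identity can be checked numerically for small lengths. The following Python sketch evaluates the sum (9.1) directly; the function name `count_partial_global` is an assumption for illustration, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def count_partial_global(n, m):
    """Evaluate the sum (9.1): choose i aligned positions in each sequence."""
    return sum(comb(n, i) * comb(m, i) for i in range(min(n, m) + 1))


# Compare against the closed form for all small lengths.
agrees = all(
    count_partial_global(n, m) == comb(n + m, n)
    for n in range(8)
    for m in range(8)
)
```

For example, `count_partial_global(2, 3)` gives 1 + 6 + 3 = 10, which equals $\binom{5}{2}$. A numerical check of this kind supports the proposition but is of course not a proof.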
There is an alternative way to compute the sum in (9.1). Consider an $m \times n$ rectangular grid as shown in Figure 9.5. Any path in the grid from the lower left corner to the upper right corner involves $m + n$ moves, of which $n$ are horizontal and the remaining $m$ are vertical. Thus, the total number of paths is $\binom{m+n}{n}$. Now, consider a path like the one shown in the figure. Every such path has “corners”, where the path changes direction from horizontal to vertical or vice versa. The number of paths containing exactly $i$ horizontal-to-vertical corners can be found by choosing $i$ x-coordinates and $i$ y-coordinates for those corners, which can be done in $\binom{n}{i} \cdot \binom{m}{i}$ ways. Summing over $i$, the total number of paths is thus equal to the summation given in (9.1), which is therefore equal to $\binom{m+n}{n}$.
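The corner-counting argument can be illustrated by brute force on small grids. The sketch below enumerates all monotone paths as strings of horizontal steps ("R") and vertical steps ("U") and tallies them by the number of horizontal-to-vertical corners. The encoding and the function name `corner_counts` are assumptions made for this illustration.

```python
from collections import Counter
from itertools import combinations
from math import comb


def corner_counts(n, m):
    """Tally paths with n horizontal and m vertical steps by R-to-U corners."""
    tally = Counter()
    steps = n + m
    for right_slots in combinations(range(steps), n):
        path = ["U"] * steps
        for slot in right_slots:
            path[slot] = "R"
        # A horizontal-to-vertical corner is an "R" step followed by a "U" step.
        corners = sum(1 for a, b in zip(path, path[1:]) if (a, b) == ("R", "U"))
        tally[corners] += 1
    return tally


tally = corner_counts(3, 2)
```

For $n = 3$ and $m = 2$ the tally is 1, 6 and 3 paths with 0, 1 and 2 such corners respectively, matching $\binom{3}{i} \cdot \binom{2}{i}$ and summing to $\binom{5}{3} = 10$.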
Figure 9.5. Partial global alignments of two sequences.
9.4 Global Alignment
We now turn our attention to global alignments.
Definition 9.7. A linear extension of a partially ordered set (poset) $(\{C_1, C_2, \ldots, C_m\}, \preceq)$ is a bijection $\pi : \{1, 2, \ldots, m\} \to \{1, 2, \ldots, m\}$ such that if $C_i \preceq C_j$, then $\pi(i) \leq \pi(j)$.
Figure 9.4 shows one linear extension of the null alignment where an integer next to each
position denotes the value of π at that position.
Definition 9.8. A global alignment is a partial global alignment together with a linear
extension on the alignment poset.
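A global alignment is often represented using a $k \times n$ matrix $T$, where $k$ is the number of sequences and $n$ is the number of components of the alignment. Column $j$ of $T$ corresponds to the component that the linear extension $\pi$ places in position $j$, namely $C_{\pi^{-1}(j)}$. The matrix $T$ is defined as:
$$T_{i,j} = \begin{cases} \sigma^i_l & \text{if there exists } l \text{ such that } (i, l) \in C_{\pi^{-1}(j)}; \\ - & \text{otherwise.} \end{cases}$$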
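The matrix representation can be sketched in Python as follows. The sketch assumes that the components are already listed in the column order given by the linear extension, that the forest is MTO (so each component holds at most one position per sequence), and that an ASCII hyphen stands for the gap symbol. The function name `alignment_matrix` and the example data are hypothetical.

```python
def alignment_matrix(seqs, ordered_components, gap="-"):
    """Build the alignment matrix as a list of rows.

    seqs: dict mapping a sequence index i to its string.
    ordered_components: list of sets of 1-indexed positions (i, j),
        in the column order chosen by the linear extension.
    """
    matrix = []
    for i in sorted(seqs):
        row = []
        for component in ordered_components:
            hits = [j for (s, j) in component if s == i]
            row.append(seqs[i][hits[0] - 1] if hits else gap)
        matrix.append(row)
    return matrix


# Hypothetical example: the first characters are aligned, the rest are not.
seqs = {1: "AC", 2: "AG"}
columns = [{(1, 1), (2, 1)}, {(1, 2)}, {(2, 2)}]
T = alignment_matrix(seqs, columns)
```

Here `T` has rows `["A", "C", "-"]` and `["A", "-", "G"]`. Swapping the last two components gives a different linear extension of the same partial global alignment and hence a different matrix.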
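Example. The global alignment shown in Figure 9.4 has the following matrix representation:
$$\begin{pmatrix}
\sigma^1_1 & - & - & - & \sigma^1_2 & - & - & - & - & \sigma^1_3 \\
- & \sigma^3_1 & - & - & - & - & - & \sigma^3_2 & \sigma^3_3 & - \\
- & - & \sigma^5_1 & \sigma^5_2 & - & \sigma^5_3 & \sigma^5_4 & - & - & -
\end{pmatrix}.$$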
There is a graph representation of a global alignment, which is similar to the one shown
in Figure 9.5. The number of global alignments between two sequences of length m and
n is given by the number of paths between the lower left and the upper right corners of
Figure 9.6.
Figure 9.6. Global alignments of two sequences.
Definition 9.9. The alignment graph $G_{n,m}$ is the directed graph on the set of nodes $\{0, 1, \ldots, n\} \times \{0, 1, \ldots, m\}$ with three classes of directed edges, as follows:
• there are edges labeled by I (insertion, move up ↑) between pairs of nodes $(i, j) \to (i, j + 1)$;
• there are edges labeled by D (deletion, move right →) between pairs of nodes $(i, j) \to (i + 1, j)$; and
• there are edges labeled by H (homology, move diagonal ↗) between pairs of nodes $(i, j) \to (i + 1, j + 1)$.
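Directed paths in the alignment graph can be counted by dynamic programming over the nodes. The sketch below assumes that the paths of interest run from the node (0, 0) to the node (n, m); the function name `count_alignment_paths` is hypothetical.

```python
def count_alignment_paths(n, m):
    """Count directed paths from (0, 0) to (n, m) in the alignment graph."""
    paths = [[0] * (m + 1) for _ in range(n + 1)]
    paths[0][0] = 1
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                paths[i][j] += paths[i - 1][j]      # arrive by a D edge
            if j > 0:
                paths[i][j] += paths[i][j - 1]      # arrive by an I edge
            if i > 0 and j > 0:
                paths[i][j] += paths[i - 1][j - 1]  # arrive by an H edge
    return paths[n][m]
```

For two sequences of length 1 there are three paths (D then I, I then D, or a single H edge), so `count_alignment_paths(1, 1)` returns 3, and `count_alignment_paths(2, 2)` returns 13.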
Example. The following table shows an alignment of sequences σ 1 and σ 2 and the edges
of its directed graph.
G G A T T − − A C A
− − G C − T T A G −
→ → ↗ ↗ → ↑ ↑ ↗ ↗ →
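The correspondence between alignment columns and edge labels in this example can be sketched in Python. The sketch assumes the alignment is given as two equal-length rows with an ASCII hyphen as the gap symbol; the function name `edge_labels` is hypothetical.

```python
def edge_labels(row1, row2, gap="-"):
    """Translate each column of a pairwise alignment into I, D or H."""
    labels = []
    for a, b in zip(row1, row2):
        if a != gap and b != gap:
            labels.append("H")  # both characters present: diagonal move
        elif a != gap:
            labels.append("D")  # only the first sequence advances: move right
        elif b != gap:
            labels.append("I")  # only the second sequence advances: move up
        else:
            raise ValueError("a column cannot consist of two gaps")
    return labels


labels = edge_labels("GGATT--ACA", "--GC-TTAG-")
```

For the alignment in the table, `"".join(labels)` is `"DDHHDIIHHD"`, which matches the arrow sequence read with D as →, I as ↑ and H as ↗.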
Example. We give a typical example of a global alignment of multiple sequences:
G − A − − T C − A
− T C C − A G − G
− − G G G A C − T
− − − − − A C C T
9.5 Homework
Partial global alignments of three sequences. Let $N(n_1, n_2, n_3)$ denote the number of partial global alignments of three sequences of lengths $n_1, n_2, n_3$, respectively. Prove that
$$N(n_1, n_2, n_3) = N(n_1 - 1, n_2, n_3) + N(n_1, n_2 - 1, n_3) + N(n_1, n_2, n_3 - 1) - N(n_1 - 1, n_2 - 1, n_3 - 1).$$
Number of global alignments. Prove that the number of global alignments between two sequences of lengths $n$ and $m$, respectively, is given by
$$\sum_{i=0}^{\min(m,n)} \binom{n}{i} \cdot \binom{m}{i} \, 2^i.$$
9.6 References
Section 2.2, Sequence Alignment. Pachter and Sturmfels (2005): 49-53.