Progressive alignment

Bioinformatics
Multiple Alignment
Overview
• Introduction to multiple alignments
• Global multiple alignment
– Introduction
– Scoring
– Algorithms
Algorithms
[Figure: overview of algorithms. Labels: Multiple Alignment, Dynamic Programming, Pattern recognition, Heuristic Searches, HMM, Motif Searches, Database searches (Chapter 2)]
Introduction
• Global multiple alignment (ClustalW)
– Proteins, nucleotides
– Long stretches of conservation essential
– Identification of protein family profiles
– Score gaps
• Local multiple alignments (Motif Detection, Profile construction)
– Proteins, nucleotides
– Short stretches of conservation (12 NT, 6 AA)
– Identification of regulatory motifs (DNA, protein)
– No explicit gap scoring
– Explicit use of a profile
Introduction
Evolution
Primary sequence
• duplication
• speciation
Homologs in related organisms
Families of proteins
Multiple sequence alignment
Features characteristic of the whole family
Introduction
Multiple sequence alignment
Features characteristic of the protein family
• Profile (HMM): detect remote members of the family
• Phylogeny: reconstruct phylogenetic relationships
Scoring a multiple alignment
Assumptions:
– Columns are independent of each other
– Residues within a column are independent (i.e. representative members of a sequence family should be chosen, and all evolutionary subfamilies should be represented)
– Alignment score: sum of the scores of all columns plus a score for the gaps
$S(m) = G + \sum_i S(m_i)$
where $m_i$ is column $i$ of the alignment $m$ and $G$ is the gap score
Scoring
• The sum-of-pairs (SP) score is an approximation
$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$ (1)
where $s(a,b)$ is taken from a PAM or BLOSUM scoring matrix
• But for a three-way alignment of residues a, b, c the score should be
$\log \frac{p_{abc}}{q_a q_b q_c}$ (2)
instead of
$\log \frac{p_{ab}}{q_a q_b} + \log \frac{p_{bc}}{q_b q_c} + \log \frac{p_{ac}}{q_a q_c}$ (3)
• SP problem:
– N sequences with L in a column (score of L against L is 5)
$5N(N-1)/2$
– N-1 sequences with L and one with G (score of G against L is -4)
$5N(N-1)/2 - 9(N-1)$
– Relative difference between the two scores
$\frac{9(N-1)}{5N(N-1)/2} = \frac{18}{5N}$
RAL
RTL
CAL
RAG
• The relative difference in score between the correct and the incorrect alignment decreases with the number of sequences in the alignment. Counterintuitive!
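The SP arithmetic can be checked numerically. The sketch below is a minimal illustration, assuming only the two pair scores quoted on the slide (5 for L against L, -4 for G against L); the function names `sp_column_score`, `toy_score` and `relative_difference` are hypothetical and are not part of any alignment package.

```python
from itertools import combinations

def sp_column_score(column, s):
    """Sum-of-pairs score of one column: s(a, b) summed over all residue pairs."""
    return sum(s(a, b) for a, b in combinations(column, 2))

def toy_score(a, b):
    # Only the two substitution scores quoted on the slide are defined here.
    if a == b == "L":
        return 5
    if {a, b} == {"L", "G"}:
        return -4
    raise ValueError(f"no toy score for pair {a!r}, {b!r}")

def relative_difference(n):
    """(correct - incorrect) / correct for a column of n sequences."""
    correct = sp_column_score("L" * n, toy_score)
    incorrect = sp_column_score("L" * (n - 1) + "G", toy_score)
    return (correct - incorrect) / correct

# The two printed values per row should agree: the simulated ratio and 18/(5N).
for n in (3, 5, 10, 50):
    print(n, relative_difference(n), 18 / (5 * n))
```

The ratio shrinks as N grows, which is the counterintuitive behaviour noted above: more evidence for L makes the single G cost relatively less.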
Algorithm
Multidimensional dynamic programming
Tedious formalism (optimal alignment)
• Computation of the whole dynamic programming matrix: $L_1 \times L_2 \times \dots \times L_N$ entries
• Maximize over all $2^N - 1$ combinations of gaps in a column
• Time complexity $O(2^N L^N)$
Clever algorithm: Carrillo & Lipman (MSA)
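To get a feeling for why full multidimensional dynamic programming is impractical, the following sketch simply counts the work involved. It is an illustration only: `dp_cost` is a hypothetical helper, and it uses the slide's approximation of $L_1 \times \dots \times L_N$ matrix entries.

```python
from itertools import product
from math import prod

def dp_cost(lengths):
    """Return (cells, moves, work) for full multidimensional DP.

    cells: number of matrix entries, L_1 * L_2 * ... * L_N
    moves: residue/gap combinations per column, excluding the all-gap column
    work:  cells * moves, the quantity behind the O(2^N L^N) estimate
    """
    n = len(lengths)
    cells = prod(lengths)
    # Each sequence contributes either a residue (1) or a gap (0) to a column.
    moves = [m for m in product((0, 1), repeat=n) if any(m)]
    return cells, len(moves), cells * len(moves)

for n in (2, 3, 5, 10):
    print(n, dp_cost([100] * n))
```

Even for short sequences of length 100, the work grows by a factor of roughly 200 with every sequence added, which motivates the heuristic methods on the following slides.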
Algorithm
Progressive alignment: $N(N-1)/2$ pairwise sequence alignments, then a multiple sequence alignment ("once a gap, always a gap")
Similarity matrix:
     A    B    C
B  142
C   95  101
D   60   62   55
[Figure: progressive clustering turns the similarity matrix into a guide tree with leaf order D, C, B, A]
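The step from similarity matrix to guide tree can be sketched as a greedy clustering. The code below is a minimal sketch under two assumptions: the six similarities are read as a lower triangle (A-B = 142, A-C = 95, B-C = 101, A-D = 60, B-D = 62, C-D = 55), and clusters are compared by average similarity, which is one common choice and not necessarily the one used for the slide. The function `guide_tree` is hypothetical.

```python
from itertools import combinations

def guide_tree(names, sim):
    """Greedy average-linkage clustering on pairwise similarities.

    names: leaf labels; sim: dict mapping frozenset({a, b}) -> similarity.
    Returns a nested tuple that records the merge order.
    """
    # Each cluster is a pair (subtree, list of leaves in that subtree).
    clusters = [(name, [name]) for name in names]

    def avg_sim(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters with the highest average similarity.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: avg_sim(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

sim = {frozenset(pair): value for pair, value in {
    ("A", "B"): 142,
    ("A", "C"): 95, ("B", "C"): 101,
    ("A", "D"): 60, ("B", "D"): 62, ("C", "D"): 55,
}.items()}

print(guide_tree(["A", "B", "C", "D"], sim))
```

With these values A and B are joined first, then C, then D, which is consistent with the leaf order D, C, B, A of the guide tree.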
Algorithm
Progressive alignment methods
• Hierarchical (heuristic): succession of pairwise alignments
• Two sequences are aligned by standard pairwise alignment
• This alignment is fixed
• The next sequence is aligned to it
• Algorithms differ in
– Order of the alignment
– Progression:
» Alignment of a new sequence to a growing alignment
» Subfamilies are built up on a tree structure and alignments are aligned to alignments
– Process used to align and score sequences to alignments
• Heuristic approach:
– Align most similar pairs of sequences first
– Similarity is read from a guide tree (quick and dirty, and unsuitable for phylogenetic inference)
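The core step, aligning a new sequence to an alignment that stays fixed, can be sketched with ordinary two-dimensional dynamic programming. This is a simplified illustration, not the procedure of any particular program: the scores (`match`, `mismatch`, `gap`), the linear gap penalty, and the averaging of residue scores over a column are all assumptions, and `align_to_alignment` is a hypothetical name.

```python
def align_to_alignment(aln, seq, match=1, mismatch=-1, gap=-2):
    """Align seq to the fixed alignment aln (equal-length gapped strings).

    Existing columns are never changed; a new gap is added to every row
    at once, so gaps already present are kept ("once a gap, always a gap").
    """
    ncol = len(aln[0])

    def col_score(i, ch):
        # Average score of residue ch against column i; existing gaps score 0.
        total = 0
        for row in aln:
            if row[i] != "-":
                total += match if row[i] == ch else mismatch
        return total / len(aln)

    # F[i][j]: best score of the first i columns against the first j residues.
    F = [[0.0] * (len(seq) + 1) for _ in range(ncol + 1)]
    for i in range(1, ncol + 1):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, len(seq) + 1):
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, ncol + 1):
        for j in range(1, len(seq) + 1):
            F[i][j] = max(F[i - 1][j - 1] + col_score(i - 1, seq[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Traceback from the bottom-right corner, building rows right to left.
    rows = [[] for _ in aln]
    new = []
    i, j = ncol, len(seq)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + col_score(i - 1, seq[j - 1]):
            for out, row in zip(rows, aln):
                out.append(row[i - 1])
            new.append(seq[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            for out, row in zip(rows, aln):
                out.append(row[i - 1])
            new.append("-")
            i -= 1
        else:
            # Residue with no matching column: open a new all-gap column.
            for out in rows:
                out.append("-")
            new.append(seq[j - 1])
            j -= 1
    return ["".join(reversed(r)) for r in rows] + ["".join(reversed(new))]

print(align_to_alignment(["RAL", "RTL"], "RL"))
print(align_to_alignment(["RL", "RL"], "RAL"))
```

In the second call the new residue A forces a gap into both existing rows; those gaps would then be carried along in every later step.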
Algorithm
Disadvantage
But it is advantageous to use position-specific information from an existing alignment
e.g. mismatches at highly conserved positions should be penalized more than mismatches at variable positions
e.g. gap penalties might be higher in regions that contain no gaps than in regions that already contain gaps
PROFILE ALIGNMENT
(hidden Markov models, frequency matrices)

CTTGT
CATGT
CACTT
CATTG
[Figure: frequency matrix derived from these aligned sequences, giving the fraction of each nucleotide in every column]
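A frequency matrix of this kind can be computed directly from an alignment. The sketch below assumes the four example sequences read row-wise as CTTGT, CATGT, CACTT and CATTG; that reading is an assumption about the slide layout, so the printed values may not coincide everywhere with the matrix shown on the slide. `frequency_profile` is a hypothetical helper.

```python
def frequency_profile(aln, alphabet="ACGT"):
    """Per-column relative frequency of each symbol in a gap-free alignment."""
    ncol = len(aln[0])
    return {
        ch: [sum(row[i] == ch for row in aln) / len(aln) for i in range(ncol)]
        for ch in alphabet
    }

aln = ["CTTGT", "CATGT", "CACTT", "CATTG"]
profile = frequency_profile(aln)
for ch, freqs in profile.items():
    print(ch, freqs)
```

Each column of the profile sums to 1. A fully conserved column (here the first, all C) is where a mismatch should be penalized most.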
Algorithm
Profile-based progressive multiple alignment: CLUSTALW
– Construct distance matrix by pairwise dynamic programming
– Convert similarity scores to evolutionary distances
– Construct a guide tree (clustering, neighbour joining clustering)
– Progressively align in order of decreasing similarity
– Sequence-sequence
– Sequence-profile
– Profile-profile
» Weighting to compensate for defects in SP
» Closely related sequences: hard matrices (BLOSUM80); distantly related sequences: soft matrices (BLOSUM50)
» Gap penalties adapted
– To hydrophobicity of the residue
– Gap-open and gap-extend penalties increased if there are no gaps in a column
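The first two steps, pairwise dynamic programming followed by conversion to distances, might look as follows. This is a rough sketch and not CLUSTALW's actual procedure: the match, mismatch and gap scores are placeholders for a real substitution matrix, and the conversion `1 - score / self_score` is a simple normalisation chosen for illustration. Both function names are hypothetical.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def distance_matrix(seqs):
    """Distances for all N(N-1)/2 pairs; assumes non-empty sequences."""
    dist = {}
    for (i, a), (j, b) in combinations(enumerate(seqs), 2):
        # Normalise by the smaller self-alignment score (an assumption).
        best = min(nw_score(a, a), nw_score(b, b))
        dist[i, j] = 1 - nw_score(a, b) / best
    return dist

seqs = ["RAL", "RTL", "CAL", "RAG"]
for pair, d in distance_matrix(seqs).items():
    print(pair, round(d, 3))
```

The resulting distances would then feed the guide-tree construction, after which sequences and profiles are aligned in order of decreasing similarity.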
Algorithm
• Further improvement
– Iterative refinement
• Problem: in progressive alignment, subalignments are frozen
• Solution:
– Iterative alignment: remove a sequence from the alignment and realign it
– Repeat the realignment until the alignment score converges
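The remove-and-realign loop can be sketched on a toy scale. To keep the example short, the realignment step below is a brute-force search over all gap placements at a fixed alignment width, a stand-in for a proper dynamic programming realignment; the SP scoring values are assumptions, and `sp_score`, `placements` and `refine` are hypothetical names.

```python
from itertools import combinations

def sp_score(aln, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment; gap-gap pairs score 0."""
    total = 0
    for col in zip(*aln):
        for a, b in combinations(col, 2):
            if a == "-" and b == "-":
                continue
            if "-" in (a, b):
                total += gap
            else:
                total += match if a == b else mismatch
    return total

def placements(residues, width):
    """All ways to spread residues over width columns, keeping their order."""
    for cols in combinations(range(width), len(residues)):
        row = ["-"] * width
        for c, ch in zip(cols, residues):
            row[c] = ch
        yield "".join(row)

def refine(aln):
    """Remove each sequence in turn and realign it until the score converges."""
    aln = list(aln)
    width = len(aln[0])
    improved = True
    while improved:
        improved = False
        for k in range(len(aln)):
            residues = aln[k].replace("-", "")
            rest = aln[:k] + aln[k + 1:]
            best = max(placements(residues, width),
                       key=lambda row: sp_score(rest + [row]))
            # Accept the realignment only if the total score strictly improves.
            if sp_score(rest + [best]) > sp_score(aln):
                aln[k] = best
                improved = True
    return aln

print(refine(["RAL", "RTL", "RL-"]))
```

Because a change is accepted only when the score strictly increases, the loop terminates, although it may stop in a local optimum.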