Protein sequence alignments

Theodor Hanekamp
University of Wyoming
MOLB5650
Spring 2002, updated 2004
Preview
Lecture 1: Protein Sequence Alignments
Lecture 2: Protein Databases
Lecture 3: Protein Structure Alignments
Lecture 4: Protein Structure Predictions
Theodor Hanekamp © 2002 All rights reserved.
Lecture outline
1. What is the biological problem?
2. What is the computational solution? (advantages, disadvantages, how it works)
3. What tools are available and where?
Today’s topics
• Bioinformatics definition
• Orthologs, paralogs, xenologs, analogs
• Global & local sequence alignments
• Gapped and ungapped alignments, gap penalties
• Pairwise and multiple alignments
• Dot matrix analysis, Dynamic Programming (Needleman-Wunsch, Smith-Waterman), Heuristic methods (BLAST, FASTA)
• Substitution matrix, PAM250, BLOSUM62
What is bioinformatics ?
• “Bioinformatics is the application of quantitative and
analytical computational techniques to model
biological systems.”
Cynthia Gibas and Per Jambeck pg. 3
• “Developing analytical tools to discover knowledge
in the data is the second, and more scientific, aspect
of bioinformatics.”
Cynthia Gibas and Per Jambeck pg. 12
SOURCE:
C. Gibas & P. Jambeck Developing Bioinformatics Computer Skills, O’Reilly © 2001
The OMICS Revolution
GENOMICS
TRANSCRIPTOMICS
Interactomics
PROTEOMICS
METABOLOMICS
Foldomics
Kinomics
Many others
OMICS Glossary: From behaviouromics to variomics
SOURCE: http://www.genomicglossaries.com/content/omes.asp
Omics
bibliomics
inomics
phenomics
biomics
integromics
phylogenomics
cellomics
interactomics
phyloproteomics
chemogenomics
ionomics
physiogenomics
chromonomics
kinomics
physiomics
chronomics
ligandomics
postgenomics
clinomics
lipoproteomics
proteogenomics
cryptomics
metabolomics
proteomics
crystallomics
metabonomics
pseudogenomics
cytomics
metallomics
regulomics
degradomics
methylomics
riboproteomics
economics
neurogenomics
rnomics
epigenomics
oncogenomics
saccharomics
epitomics
operomics
separomics
fluxomics
pathogenomics
toxicogenomics
functomics
peptidomics
toxicomics
genomics
peptidomics
transcriptomics
glycomics
pharmacogenomics
transgenomics
gpcromics
pharmacomethylomics
vaccinomics
immunomics
pharmacophylogenomics
variomics
Where can I learn more ?
• David Mount, Bioinformatics, CSHL Press 2001
• Baxevanis & Ouellette, Bioinformatics, 2nd ed., Wiley Interscience 2001
• BioTechniques, http://www.biotechniques.com/
  BioComputing section: review articles on bioinformatics
Why do we want to align sequences?
1. Assign functions to unknown proteins
2. Determine the relatedness of organisms
3. Identify structurally and functionally important elements
4. Make predictions about the 3D structure
Alignment-based methods
• Needed if we have an unknown
DNA or protein sequence.
• Purpose:
To find sequences/regions of significant
similarity in a sequence repository or
database.
To identify all of the homologous sequences
in a database or repository.
To identify motifs or domains with a
sequence similarity that is significantly
better than chance expectation
The sequence alignment
problem
1. THESESENTENSESALIGN--NICELY
   ||||| || |    |||||  ||||||
2. THESEQENCE----ALIGNEDNICELY

1. THESESENTENSESALIGN--NICELY
   |||||    || | |||||  ||||||
2. THESE-Q--ENCE-ALIGNEDNICELY

1. THESESENTENSESALIGN--NICELY
   |||  ||  || | |||||  ||||||
2. THE--SEQ-ENCE-ALIGNEDNICELY

Each of the three alignments has 2 mismatches, 19 matches, and 4 residues of sequence 1 opposite a gap.
Sequence alignments
General goals
Find maximum degree of likeness
Find minimum evolutionary
distance
Ancestor relationships
Common ancestor
  -> (x steps) Sequence 1
  -> (y steps) Sequence 2
Sequence 1 and Sequence 2 are homologous
Ancestor relationships
Gene duplication: ancestral gene x -> x1 and x2 (paralogous)
Speciation: Species 1 and Species 2 each carry x1 and x2
x1 in Species 1 and x1 in Species 2 are orthologous (likewise for x2)
Homologs, Orthologs, and Paralogs
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html
Homologs: Genes evolved from a common ancestor
Orthologs: Genes evolved from a common ancestor by speciation.
Paralogs: Genes evolved from a common ancestor by gene duplication.
Similar protein sequences
Orthologs: arise via separation of a duplicated region and
speciation; they often have the same function in
different organisms (e.g. hemoglobins in species 1&2)
Paralogs: other members of multigene families, may have
similar functions (e.g. hemoglobin and myoglobin)
Xenologs: similar sequences that were “recently”
introduced through horizontal gene transfer
Analogs: emerge through convergence on the same function;
similar active sites but different sequence background (e.g.
chymotrypsin & subtilisin)
“Pseudogenes”: similar sequence, not translated
Types of sequence alignments
• Global vs. Local alignments
• Gapped vs. Ungapped alignments
• Pairwise alignments vs. Multiple
alignments
Local alignment
Finds domains and short regions of similarity between
a pair of sequences. The two sequences under
comparison do not necessarily need to have high levels
of similarity over their entire length in order to receive
locally high similarity scores. This feature of local
similarity searches gives them the advantage of being
useful when looking for domains within proteins or
looking for regions of genomic DNA that contain
introns. Local similarity searches do not have the
constraint that similarity between two sequences
needs to be observed over the entire length of each
gene.
Global alignment
Finds the optimal alignment over the entire length of
the two sequences under comparison. Algorithms of
this nature are not particularly suited to the
identification of genes that have evolved by
recombination or insertion of unrelated regions of
DNA. In instances such as this, a global similarity
score will be greatly reduced. When the genes being
aligned are of comparable length and are homologous
(descended from a common ancestor) over their entire
length, global alignment is appropriate.
Global vs. Local alignment
Global alignment
- match as many characters as possible from end
to end
- find an alignment with highest total score
- regions of high local similarity may be ignored in
favor of a higher overall score
Example:
THIS-ISAGLALALIGNMENT
||
||
|| ||| |
THEREISTHEAL--IGN-EDSEQ
Global vs. Local Alignment
Local alignment
- find subsequences with highest density of
matches
- find regions with high local scores
- sequence similarities may extend beyond the
local subsequence with a lower degree of
similarity
Example:
--------LALIGNM----
         |||||
--------EALIGNE----
Gapped vs. Ungapped
alignments
Ungapped
- the number of residue comparisons is roughly
proportional to the square of the average
sequence length
MATCHES
||
MAKERS
Gapped vs. Ungapped
alignments
Gapped
If gaps of any length were allowed at any position:
- computationally very expensive
- alignments would not be very meaningful
MATCHE-S
||   | |
MA--KERS
Need a manageable number of gaps!
Gap penalties
• Reduce number of gaps in the alignment
• Ensure a more meaningful alignment
• Opening a gap is costly
• Extending a gap is cheap
Examples:
Gap opening penalty = -12
Gap extension penalty = -1
Gap penalties
G = g + xn
G = gap penalty
g = cost of opening a gap (here: g = -12)
x = cost of extending the gap by one (here: x = -1)
n = length of the gap
* Gap penalties should be adjusted to the
substitution matrix that is being used
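The formula above can be sketched in a few lines of Python. This is a minimal illustration that assumes the slide's example values (g = -12, x = -1); the function name `gap_penalty` and the `count_first` switch are hypothetical. The switch covers the other common convention, in which the opening penalty already pays for the first gap position, so a gap of length n costs g + x(n - 1).

```python
def gap_penalty(n, g=-12, x=-1, count_first=True):
    """Affine penalty for a single gap of length n.

    count_first=True  -> G = g + x*n        (formula on this slide)
    count_first=False -> G = g + x*(n - 1)  (opening covers position 1)
    """
    if n < 1:
        raise ValueError("gap length must be at least 1")
    extensions = n if count_first else n - 1
    return g + x * extensions


if __name__ == "__main__":
    # Compare both conventions for a few gap lengths.
    for length in (1, 2, 4):
        print(length, gap_penalty(length), gap_penalty(length, count_first=False))
```

Whichever convention a program uses, one long gap stays cheaper than several short ones, which is the point of separating opening and extension costs.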
Impact of gap penalties
Case 1: Gap penalty: low    Mismatch cost: high
MARCHMADNESSANDBASKETBALL
-ARCHY----ISA--BASKET----CASE
Case 2: Gap penalty: medium    Mismatch cost: medium
MARCHMADNESSANDBASKETBALL
-ARCHY----ISA--BASKETCASE
Case 3: Gap penalty: high    Mismatch cost: low
MARCHMADNESSANDBASKETBALL
-ARCHYISABASKETCASE
Sequence alignment problem
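Gap open = -12, Gap extension = -1
(in this example the opening penalty covers the first gap position, so a gap of length n costs -12 + (n - 1)(-1))

1. THESESENTENSESALIGN--NICELY     gap in 1: -12 + (-1)
   ||||| || |    |||||  ||||||
2. THESEQENCE----ALIGNEDNICELY     gap in 2: -12 + (-3)
   Total gap penalty: -28

1. THESESENTENSESALIGN--NICELY     gap in 1: -12 + (-1)
   |||||    || | |||||  ||||||
2. THESE-Q--ENCE-ALIGNEDNICELY     gaps in 2: -12 + (-12) + (-1) + (-12)
   Total gap penalty: -50

1. THESESENTENSESALIGN--NICELY     gap in 1: -12 + (-1)
   |||  ||  || | |||||  ||||||
2. THE--SEQ-ENCE-ALIGNEDNICELY     gaps in 2: -12 + (-1) + (-12) + (-12)
   Total gap penalty: -50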
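The totals of this worked example can be checked with a short Python sketch. It assumes the convention implied by the slide's own numbers (the opening penalty pays for the first gap position); the helper names `gap_runs` and `total_gap_penalty` are illustrative only, and the second sequence is spelled NICELY throughout.

```python
import re


def gap_runs(aligned):
    """Lengths of every run of '-' in one aligned sequence."""
    return [len(run) for run in re.findall(r"-+", aligned)]


def total_gap_penalty(seq1, seq2, g=-12, x=-1):
    """Sum of gap penalties over both rows of a pairwise alignment.

    Assumption: a gap of length n costs g + x*(n - 1), i.e. the
    opening penalty g covers the first gap position.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    runs = gap_runs(seq1) + gap_runs(seq2)
    return sum(g + x * (n - 1) for n in runs)


TOP = "THESESENTENSESALIGN--NICELY"
ALIGNMENTS = [
    "THESEQENCE----ALIGNEDNICELY",
    "THESE-Q--ENCE-ALIGNEDNICELY",
    "THE--SEQ-ENCE-ALIGNEDNICELY",
]

if __name__ == "__main__":
    for bottom in ALIGNMENTS:
        print(bottom, total_gap_penalty(TOP, bottom))
```

All three alignments contain the same matches and mismatches, so under these penalties the first one, with only two gaps, is preferred.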
Rules of thumb for gap penalties
• Gap opening penalty:
should be 2 – 3 times larger than the most
negative value in the substitution matrix that
is being used
• Gap extension penalty:
should be 0.1 to 0.3 times the value of the
gap opening penalty
Pairwise vs. Multiple Alignments
Pairwise alignments:
- requires 2 sequences
Multiple alignments:
- requires more than 2 sequences
- computational problem is a lot more difficult
Pairwise Sequence
Alignments
Pairwise Alignment Methods
1. Dot matrix analysis
(Gibbs and McIntyre)
2. Dynamic programming algorithms
(Needleman-Wunsch, Smith-Waterman)
3. Heuristic Algorithms = Word or k-tuple
methods (BLAST, FASTA)
Algorithm
Definition:
• A systematic procedure for solving a
problem in a finite number of ordered steps.
Example: “calling from a payphone”
• An algorithm can be written in a computer
language and run as a program.
1. Dot matrix analysis
Advantages:
- all possible matches between 2 sequences are
displayed
- readily reveals insertions & deletions
- readily identifies direct and inverted repeats
- same algorithm is used for DNA, RNA and
proteins
Disadvantages:
- doesn’t show an actual sequence alignment
- qualitative evaluation of alignments
- statistical significance of alignment is not obvious
How dot matrix analysis works
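  M A K E A M A T C H M A K E R
M * . . . . * . . . . * . . . .
A . * . . * . * . . . . * . . .
T . . . . . . . * . . . . . . .
C . . . . . . . . * . . . . . .
H . . . . . . . . . * . . . . .
M * . . . . * . . . . * . . . .
A . * . . * . * . . . . * . . .
K . . * . . . . . . . . . * . .
E . . . * . . . . . . . . . * .
R . . . . . . . . . . . . . . *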
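A dot matrix like this one takes only a few lines to compute. The Python sketch below is illustrative (the function names `dot_matrix` and `render` are hypothetical); it marks every cell where the residue of the vertical sequence equals the residue of the horizontal sequence.

```python
def dot_matrix(rows, cols):
    """Grid of booleans: True where rows[i] is identical to cols[j]."""
    return [[r == c for c in cols] for r in rows]


def render(rows, cols):
    """Text version of the matrix: '*' for a match, '.' otherwise."""
    grid = dot_matrix(rows, cols)
    lines = ["  " + " ".join(cols)]
    for residue, row in zip(rows, grid):
        cells = " ".join("*" if hit else "." for hit in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("MATCHMAKER", "MAKEAMATCHMAKER"))
```

Because the comparison is a plain identity test, the same code works unchanged for DNA, RNA and protein sequences.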
How dot matrix analysis works
Features marked in the dot matrix of MATCHMAKER (vertical) vs. MAKEAMATCHMAKER (horizontal):
- Direct repeat
- Inverted repeat
- Aligned sequence
Variations of dot matrix analysis
1. Chemical similarity of the R-group of
amino acids (in D. Mount 2001)
2. “Symbol comparison tables”
(PAM250, BLOSUM) (States and
Boguski 1991)
3. Scoring table for amino acids found in
secondary structures (Risler 1988) =>
identifies distantly related proteins
Sources for dot matrix programs
• DNA Strider (vers. 1.3): Mac
• MacVector (vers. 7.1): Mac
• DOTLET: Mac, Win, Unix
  http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
• Dotter: Unix
  http://www.cgr.ki.se/cgr/gropus/sonnhammer/Dotter.html
• GCG (Genetics Computer Group): COMPARE and DOTPLOT
  http://nun.oit.unc.edu/gcgmanual/
Dot matrix analysis with DNA
Settings:
Vertical scale: lambda cI
Horizontal scale: phage P22 c2
Window size: 1
Stringency: 1
Dot matrix analysis with DNA
Settings:
Vertical scale: lambda cI
Horizontal scale: phage P22 c2
Window size: 11
Stringency: 7
Dot matrix analysis with DNA
Settings:
Vertical scale: lambda cI
Horizontal scale: phage P22 c2
Window size: 23
Stringency: 15
Dot matrix analysis with proteins
Settings:
Vertical scale: lambda cI
Horizontal scale: phage P22 c2
Window size: 1
Stringency: 1
Dot matrix analysis with proteins
Settings:
Vertical scale: lambda cI
Horizontal scale: phage P22 c2
Window size: 3
Stringency: 2
Take home message
• DNA sequence alignments
- use large windows (7 - 11)
- use high stringencies
• Protein sequence alignments
- use small windows (1 - 3)
- use lower stringencies
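The window and stringency settings from the preceding slides can be sketched as a simple filter. This is one plausible reading, stated as an assumption: a dot is kept at (i, j) only if at least `stringency` of the `window` residue pairs starting there are identical. Real dot matrix programs may center the window or score pairs with a substitution matrix; the function name `filtered_dots` is hypothetical.

```python
def filtered_dots(rows, cols, window=1, stringency=1):
    """Window start positions (i, j), 0-based, that pass the filter.

    A position passes when at least `stringency` of the `window`
    pairs rows[i + k], cols[j + k] are identical residues.
    """
    if not 1 <= stringency <= window:
        raise ValueError("need 1 <= stringency <= window")
    dots = []
    for i in range(len(rows) - window + 1):
        for j in range(len(cols) - window + 1):
            matches = sum(rows[i + k] == cols[j + k] for k in range(window))
            if matches >= stringency:
                dots.append((i, j))
    return dots


if __name__ == "__main__":
    s, t = "MATCHMAKER", "MAKEAMATCHMAKER"
    print(len(filtered_dots(s, t, window=1, stringency=1)))
    print(len(filtered_dots(s, t, window=3, stringency=3)))
```

Raising the window and stringency removes isolated chance matches and leaves the diagonals, which matters most for DNA, where a four-letter alphabet produces many random identities.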
2. Dynamic Programming
Definition:
- solves a problem by combining
solutions to subproblems that are
computed once and saved in a table or
matrix
- used when many solutions are possible
and an optimal solution needs to be
found
2. Dynamic Programming
Algorithms
Advantages:
- guaranteed to provide the optimal (i.e. highest
scoring) alignment (mathematically proven)
- user defined choice of substitution matrix
- user defined gap penalties
- may provide one or more sequence alignments
Disadvantages:
- relatively slow, computational steps increase as the
square or cube of the sequence lengths
Implementations of DP methods
1. Needleman-Wunsch/ Sellers
Global alignment
2. Smith-Waterman/ Sellers
Local alignment
Smith-Waterman algorithm
• Local alignment method
• Does not place any restrictions on the
evolutionary model
• Most rigorous method
• Very sensitive
• Computationally expensive
You need a scoring system
Calculate the probabilities that:
1. a particular aa pair is found in the alignment
2. the same aa is aligned by chance
3. the insertion of a gap of one or more
residues in one of the sequences would
improve the alignment
1 and 2 are retrieved from a substitution matrix
How DP works (3 steps)
1. Generate a sequence vs. sequence
matrix; fill in the best scores from [0,0] to
[n,m]. Keep track of pointers to allow
trace-back.
2. Identify highest score in matrix
3. Trace back to start to get alignment
position by position
What is the best score ?
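Given: Protein sequence S with residues s_i, i = 1 … m
S = s_1 s_2 s_3 … s_m
Protein sequence T with residues t_j, j = 1 … n
T = t_1 t_2 t_3 … t_n

B_{i,j} = \max \begin{cases}
  B_{i-1,j-1} + b(s_i, t_j) \\
  \max_{x \ge 1} \left( B_{i-x,j} - G_x \right) \\
  \max_{y \ge 1} \left( B_{i,j-y} - G_y \right)
\end{cases}

b(s_i, t_j) = substitution score for aligning s_i with t_j
G_x, G_y = size of the penalty for a gap of length x or y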
Step 1 Create and fill matrix
GLOBAL (Seq. T across, j = 1 … n; Seq. S down, i = 1 … m)

          M    A    T    C    H    E    S
     0   -2   -4   -6   -8  -10  -12  -14
T   -2
H   -4
A   -6
T   -8
C  -10
H  -12
E  -14
R  -16

Penalize 1st column and row
Position * gap penalty

LOCAL (Seq. T across, j = 1 … n; Seq. S down, i = 1 … m)

          M    A    T    C    H    E    S
     0    0    0    0    0    0    0    0
T    0
H    0
A    0
T    0
C    0
H    0
E    0
R    0

No penalty for 1st col. or row
No negative numbers allowed
Step 1 Create and fill matrix
BestScore[i,j] = BestScore[<i,<j] + Match[i,j] + GapPenalty

(Seq. T across, j = 1 … n; Seq. S down, i = 1 … m)

          M    A    T    C    H    E    S
     0   -2   -4   -6   -8  -10  -12  -14
T   -2   -1   -3   -3   -5   -7   -9  -11
H   -4
A   -6
T   -8
C  -10
H  -12
E  -14
R  -16

Scoring contributions:
Vertical: -2 (gap in T)
Horizontal: -2 (gap in S)
Diagonal: +1 if match, -1 if mismatch
There are only three ways of pairing at each step:
1. One residue from each sequence, either a match or mismatch
2. One residue from sequence T and a gap in sequence S
3. One residue from sequence S and a gap in sequence T
NOTE: Gaps don’t align with gaps
Step 2 Find highest score
Global alignment: (Needleman-Wunsch)
Find highest score in final row and final
column
Local alignment: (Smith-Waterman)
Highest score anywhere in the matrix
Step 2 Find highest score
GLOBAL (Seq. T across, j = 1 … n; Seq. S down, i = 1 … m)

          M    A    T    C    H    E    S
     0   -2   -4   -6   -8  -10  -12  -14
T   -2   -1   -3   -3   -5   -7   -9  -11
H   -4   -3   -2   -4   -4   -4   -6   -8
A   -6   -5   -2   -3   -5   -5   -5   -7
T   -8   -7   -4   -1   -3   -5   -6   -6
C  -10   -9   -6   -3    0   -2   -4   -6
H  -12  -11   -8   -5   -2    1   -1   -3
E  -14  -13  -10   -7   -4   -1    2    0
R  -16  -15  -12   -9   -6   -3    0    1

Highest score in last row and last column

LOCAL (Seq. T across, j = 1 … n; Seq. S down, i = 1 … m)

          M    A    T    C    H    E    S
     0    0    0    0    0    0    0    0
T    0    0    0    5    0    0    0    2
H    0    0    0    0    2   10    2    0
A    0    0    5    0    0    2    9    3
T    0    0    0   10    2    0    9    3
C    0    0    0    2   23   15    7    3
H    0    0    0    0   15   33   25   17
E    0    0    0    0    7   25   39   31
R    0    0    0    0    0   17   31   38

Highest score anywhere
Step 3 Trace back and align
- start at the highest score and create the alignment in reverse
order
- print S[i] and T[j] as aligned
- follow the pointer back to the cell that produced the score
- if the pointer leads to [i-1, j-1]: print S[i-1] and T[j-1] as aligned
- if the pointer leads to [i-1, j-N]: print T[j-1] … T[j-(N-1)] as
matched to gaps
- if the pointer leads to [i-N, j-1]: print S[i-1] … S[i-(N-1)] as
matched to gaps
Global alignment
(Seq. T across, j = 1 … n; Seq. S down, i = 1 … m)

          M    A    T    C    H    E    S
     0   -2   -4   -6   -8  -10  -12  -14
T   -2   -1   -3   -3   -5   -7   -9  -11
H   -4   -3   -2   -4   -4   -4   -6   -8
A   -6   -5   -2   -3   -5   -5   -5   -7
T   -8   -7   -4   -1   -3   -5   -6   -6
C  -10   -9   -6   -3    0   -2   -4   -6
H  -12  -11   -8   -5   -2    1   -1   -3
E  -14  -13  -10   -7   -4   -1    2    0
R  -16  -15  -12   -9   -6   -3    0    1

Alignment read off the trace-back:
Seq. T:  M - A T C H E S
Seq. S:  T H A T C H E R

Scoring contributions:
Vertical: -2 (gap in T)
Horizontal: -2 (gap in S)
Diagonal: +1 if match, -1 if mismatch
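The three steps (fill, find the score, trace back) can be put together in a short Python sketch. It assumes the scoring contributions of this slide (+1 match, -1 mismatch, -2 per gap position) and starts the trace-back at the bottom-right cell, the usual choice when end gaps are penalized; in this example that cell also holds the highest score of the last row and column. The function name `needleman_wunsch` and its parameters are illustrative.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment of s (rows) and t (columns), linear gap penalty.

    Returns (score, aligned_s, aligned_t).
    """
    m, n = len(s), len(t)

    def pair(i, j):
        # Score for aligning s[i-1] with t[j-1] (1-based matrix indices).
        return match if s[i - 1] == t[j - 1] else mismatch

    # Step 1: create the matrix; first row and column = position * gap.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        best[i][0] = i * gap
    for j in range(1, n + 1):
        best[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + pair(i, j),  # diagonal: no gap
                best[i - 1][j] + gap,             # vertical: gap in t
                best[i][j - 1] + gap,             # horizontal: gap in s
            )

    # Steps 2 and 3: trace back from the bottom-right cell.
    out_s, out_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and best[i][j] == best[i - 1][j - 1] + pair(i, j):
            out_s.append(s[i - 1])
            out_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and best[i][j] == best[i - 1][j] + gap:
            out_s.append(s[i - 1])
            out_t.append("-")
            i -= 1
        else:
            out_s.append("-")
            out_t.append(t[j - 1])
            j -= 1
    return best[m][n], "".join(reversed(out_s)), "".join(reversed(out_t))


if __name__ == "__main__":
    score, aligned_s, aligned_t = needleman_wunsch("THATCHER", "MATCHES")
    print(score)
    print(aligned_s)
    print(aligned_t)
```

For THATCHER against MATCHES the score is 1, matching the bottom-right cell of the matrix. Several alignments tie at that score, so which one is printed depends on the order in which the trace-back tests the three moves.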
Alignment paths & gap placement
Diagonal step: no gap
Horizontal step: gap in seq. S
Vertical step: gap in seq. T
Local alignment
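(Seq. T across, j = 1 … n; Seq. S down, i = 1 … m)

          M    A    T    C    H    E    S
     0    0    0    0    0    0    0    0
T    0    0    0    5    0    0    0    2
H    0    0    0    0    2   10    2    0
A    0    0    5    0    0    2    9    3
T    0    0    0   10    2    0    9    3
C    0    0    0    2   23   15    7    3
H    0    0    0    0   15   33   25   17
E    0    0    0    0    7   25   39   31
R    0    0    0    0    0   17   31   38

Alignment read off the trace-back (39 -> 33 -> 23 -> 10 -> 5):
Seq. T:  A T C H E
Seq. S:  A T C H E

Scoring: the cell values shown here do not come from the +1/-1/-2
scheme of the global example. The scores along the trace-back path
match BLOSUM50 values (A-A = 5, T-T = 5, C-C = 13, H-H = 10, E-E = 6),
and the drops of 8 between neighboring cells match a gap penalty of
-8 per position.
Stop alignment when BestScore[i,j] is zero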
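The local variant differs from the global one in three places: the first row and column stay at zero, no cell may drop below zero, and the trace-back starts at the highest score anywhere and stops at the first zero. The Python sketch below illustrates this with a simple, assumed scoring scheme (+2 match, -1 mismatch, -2 per gap position) rather than the substitution-matrix scores behind the table above; the function name `smith_waterman` is illustrative.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Local alignment of s (rows) and t (columns), linear gap penalty.

    Returns (score, aligned_s, aligned_t) for one best local alignment.
    """
    m, n = len(s), len(t)

    def pair(i, j):
        # Score for aligning s[i-1] with t[j-1] (1-based matrix indices).
        return match if s[i - 1] == t[j - 1] else mismatch

    # First row and column stay 0; negative scores are reset to 0.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    top, top_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                0,
                best[i - 1][j - 1] + pair(i, j),
                best[i - 1][j] + gap,
                best[i][j - 1] + gap,
            )
            if best[i][j] > top:
                top, top_cell = best[i][j], (i, j)

    # Trace back from the highest score until a zero is reached.
    out_s, out_t = [], []
    i, j = top_cell
    while i > 0 and j > 0 and best[i][j] > 0:
        if best[i][j] == best[i - 1][j - 1] + pair(i, j):
            out_s.append(s[i - 1])
            out_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] + gap:
            out_s.append(s[i - 1])
            out_t.append("-")
            i -= 1
        else:
            out_s.append("-")
            out_t.append(t[j - 1])
            j -= 1
    return top, "".join(reversed(out_s)), "".join(reversed(out_t))


if __name__ == "__main__":
    print(smith_waterman("THATCHER", "MATCHES"))
```

With these assumed scores the best local alignment of THATCHER and MATCHES is again ATCHE against ATCHE; the absolute score differs from the table because the scoring scheme differs.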
Similarity or substitution matrices
- attempt to quantify whether a mutation preserves or disrupts the
function of a protein
- reflect different degrees of evolutionary divergence
- provide a quantifiable measure for amino acid residue substitutions
Examples:
a) Point Accepted Mutation (PAM)
b) Blocks Substitution Matrix (BLOSUM)
Log Odds Score (Sij)
• Given: Seq. A (i1 … in) and Seq. B (j1 … jn)
• Sij = a measure for the probability of residue i
replacing residue j in an alignment
Sij = log2(qij / (pi pj))
• qij = observed frequency at which i replaces j
• pi pj = expected frequency at which i replaces
j if the pattern of mutations were random
Values for (Sij)
Sij > 0
residues replace each other more
often than expected by random chance
Sij = 0
residues replace each other as expected
by random chance
Sij < 0
residues replace each other less
often than expected by random chance
Example of Log odds score
Odds of winning a chess game = no. of times you won
a match /no. of times you lost a match
Odds of aligning 2 amino acids correctly = no. of times
they aligned in sequences known to be related/ no.
of times they aligned in seq. that are not related
Example: aligning C A T with C T N

Column:   C/C    A/T    T/N
Odds:     8/1    1/16   1/16

odds score = 8/1 x 1/16 x 1/16 = 8/256 = 1/32
log odds score (log2) = 3 - 4 - 4 = -5
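The arithmetic of this example shows why log odds are convenient: odds multiply along an alignment, while their logarithms simply add. A minimal Python sketch follows, using the three column odds from the slide; the helper `log_odds` and the variable names are illustrative.

```python
from fractions import Fraction
from math import log2, prod


def log_odds(q_ij, p_i, p_j):
    """S_ij = log2(q_ij / (p_i * p_j))."""
    return log2(q_ij / (p_i * p_j))


# Per-column odds from the example (columns C/C, A/T, T/N).
column_odds = [Fraction(8, 1), Fraction(1, 16), Fraction(1, 16)]

odds_score = prod(column_odds)                       # odds multiply
log_odds_score = sum(log2(o) for o in column_odds)   # log odds add

if __name__ == "__main__":
    print(odds_score)       # product of the three odds
    print(log_odds_score)   # sum of the three log2 values
```

The product is 1/32 and the sum is -5, and log2(1/32) is -5 as well, so adding the per-column scores gives the same answer as multiplying the odds and taking the logarithm once.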
Dayhoff or PAM250 matrix
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C  12
S   0   2
T  -2   1   3
P  -3   1   0   6
A  -2   1   1   1   2
G  -3   1   0  -1   1   5
N  -4   1   0  -1   0   0   2
D  -5   0   0  -1   0   1   2   4
E  -5   0   0  -1   0   0   1   3   4
Q  -5  -1  -1   0   0  -1   1   2   2   4
H  -3  -1  -1   0  -1  -2   2   1   1   3   6
R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -3  -4  -5   0   1   2  -1   9
Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
PAM vs. BLOSUM matrix
PAM matrix (Dayhoff)
1. Based on mutations in conserved and variable regions in global alignments
2. Limited # of observations
3. Derived from an explicit evolutionary model

BLOSUM matrix (Henikoff)
1. Based exclusively on mutations in local, highly conserved regions w/o gaps
2. Large # of observations
3. Derived with a sum-of-pairs evolutionary model
LALIGN
- local alignment method
- gives multiple alternative alignments
- defines the gap penalty as q + r (k-1).
Sources:
- http://fasta.bioch.virginia.edu/fasta/lalign.htm
- www.ch.embnet.org/software/LALIGN_form.html
- www-bio.unizh.ch/cgi-bin/man-cgi?lalign
3. Heuristic Algorithms
• Heuristic = a procedure that
progresses along empirical lines by
using rules of thumb to reach a
solution. The solution is not guaranteed
to be optimal.
Heuristic Alignment Algorithms
Word or k-tuple methods
Advantages:
- very fast
- reliable in a statistical sense
Disadvantages:
- less sensitive than dynamic
programming
Basic Local Alignment Search
Tool (BLAST)
Advantages:
• most popular for searching large
databases
Disadvantages of BLAST:
• Needs islands of strong homology
• The variants blastx, tblastn, tblastx use 6
frame translations and will miss sequences
with frameshifts
• Finds ONLY local alignments
How BLAST works
STEP 1. Make a list of 3-letter words of the protein seq.
Query: MATCHES

Word   Positions
MAT    1,2,3
ATC    2,3,4
TCH    3,4,5
CHE    ...
HES    ...
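Step 1 amounts to sliding a window of length 3 along the query. A minimal Python sketch, in which the function name `words` and the choice to record 1-based start positions are assumptions made for illustration:

```python
def words(sequence, k=3):
    """Map each k-letter word to the 1-based positions where it starts."""
    index = {}
    for start in range(len(sequence) - k + 1):
        word = sequence[start:start + k]
        index.setdefault(word, []).append(start + 1)
    return index


if __name__ == "__main__":
    for word, starts in words("MATCHES").items():
        print(word, starts)
```

A sequence of length L yields L - 2 overlapping 3-letter words, so the 7-residue query MATCHES gives 5 words.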
How BLAST works
STEP 2. Search for perfect matches in the database
• uses BLOSUM62 substitution matrix
• number of match scores = 8000 (i.e. 20 x 20 x 20)
• will find perfect and imperfect matches
  MAT, ... HAT, MIT, MAN
How BLAST works
STEP 3. Select cutoff score T
• T = neighborhood word score threshold
• reduces list of matches from 8000 to ~50
(based on highest score using BLOSUM62)
STEP 4. Repeat word search for all 3 letter
words
For a 250 aa seq. => 12,500 words
How BLAST works
STEP 5. Scan database for exact matches of
words (short list of words)
- if a match is found, use the word as a seed for a
possible ungapped alignment
- extend the alignment in each direction along
the sequence as long as the score
increases
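The seed-and-extend idea of step 5 can be sketched as follows. This is a deliberately simplified illustration, not BLAST's actual procedure: it assumes a score of +1 per identical pair and stops in each direction at the first pair that would not raise the score, whereas BLAST scores pairs with a substitution matrix and tolerates a limited drop below the best score seen. The function name `extend_seed` and the 0-based start positions are assumptions.

```python
def extend_seed(query, subject, q_start, s_start, k=3, match=1):
    """Ungapped extension of an exact k-letter seed.

    q_start and s_start are the 0-based start positions of the seed
    in query and subject. Returns (score, query_part, subject_part).
    """
    if query[q_start:q_start + k] != subject[s_start:s_start + k]:
        raise ValueError("seed words do not match")
    score = k * match
    q_end, s_end = q_start + k, s_start + k
    # Extend to the right while the next pair increases the score.
    while (q_end < len(query) and s_end < len(subject)
           and query[q_end] == subject[s_end]):
        score += match
        q_end += 1
        s_end += 1
    # Extend to the left in the same way.
    while (q_start > 0 and s_start > 0
           and query[q_start - 1] == subject[s_start - 1]):
        score += match
        q_start -= 1
        s_start -= 1
    return score, query[q_start:q_end], subject[s_start:s_end]


if __name__ == "__main__":
    # Seed word ATC: position 1 in MATCHES, position 2 in THATCHER.
    print(extend_seed("MATCHES", "THATCHER", 1, 2))
```

Starting from the seed ATC, the extension grows to ATCHE in both sequences and stops at the S/R and M/H mismatches on either side.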
Multiple Sequence Alignments
Multiple sequence alignments
Advantages:
- may identify structural & functional
domains
- may identify protein families
Disadvantages:
- still a difficult algorithmic problem
Multiple Sequence Alignments
Global alignment methods
• ClustalW (most popular)
• PileUp
Local alignment methods
• Dialign
How ClustalW works
• based on Progressive Pairwise Alignment
(PPA)
1. globally align most similar sequences first
2. construct a tree using neighbor-joining
(determines the order in which subsequent
seq. are incorporated into the alignment)
When to use ClustalW ?
ClustalW performs well when:
• aligning sequences of similar lengths
• aligning small to large protein families
of similar sequences
• few divergent sequences may be
included
Sources for ClustalW
• http://dot.imgen.bcm.tmc.edu:9331/multialign/Options/clustalw.html
• http://www.bionavigator.com
• http://www2.ebi.ac.uk/clustalw/
• http://bioweb.pasteur.fr/intro-uk.html
• http://pbil.ibcp.fr/
• http://www.clustalw.genome.ad.jp/
How Dialign works ?
• Local alignment approach
• identify gap-free fragments (called
diagonals) of high similarity
• builds the fragments into a multiple alignment
using an iterative approach
• works with DNA and proteins
When to use Dialign ?
Dialign performs well when:
• sequences have long terminal
extensions
• sequences have large insertions
• useful for finding conserved blocks
within a set of sequences
Sources for Dialign
http://www.hgmp.mrc.ac.uk/
http://mep.bio.psu.edu/alignment.html
http://genomatix.gsf.de/
http://bibiserv.techfak.uni-bielefeld.de/
Next lecture …
• SwissProt, PIR-PSD
• TrEMBL, Genpept
• Pfam
• Prosite
• SCOP
• CATH