STAT 530 Multiple Sequence Alignment and Protein Motifs

Ping Ma
Outline
Motivation and Introduction
Global MSA
– ClustalW steps
– ClustalW features
– ClustalW example
Local protein MSA (protein motifs)
– Prosite and Pfam
Multiple Sequence Alignment
MSA Uses:
– Establish evolutionary relationships (global)
– Find conserved nucleotides and amino acids (global)
– Characterize signature protein patterns or motifs
(local)
– Find acceptable substitutions (local)
Protein MSA gold standard: structural alignment
MSA with Dynamic Programming
Theoretically can perform dynamic programming on
multiple sequences
Computational complexity O(n^k) for k sequences, each
of length n
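To make the growth concrete, here is a small sketch that counts the cells of the k-dimensional dynamic programming table, which has (n + 1)^k entries when every sequence has length n. The function name is hypothetical and chosen only for illustration.

```python
def dp_table_cells(n: int, k: int) -> int:
    """Number of cells in the k-dimensional DP table for k sequences of length n."""
    return (n + 1) ** k


# Table size for sequences of length 100 as the number of sequences grows.
for k in (2, 3, 5, 10):
    print(k, dp_table_cells(100, k))
```

Even before counting the work per cell, the table itself becomes impractically large beyond a handful of sequences, which is why heuristic methods are used instead.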
Progressive MSA Method
Progressive (Feng and Doolittle 1987 J Mol Evol)
– Heuristic algorithm: an approximation strategy that does not aim
for a perfect alignment
– Build the alignment from the most closely related sequences, then
progressively add less-related sequences
– Often manual examination can improve alignments
Clustal:
– Higgins and Sharp. Comput Appl Biosci (CABIOS, now
Bioinformatics) 1989, 5:151-3.
ClustalW, NAR 1994, 22:4673-80
– W stands for weighting: more distant sequences receive more weight
– Reflect evolutionary distance
ClustalW Steps
Global pairwise alignment for all pairs
– n (n-1) / 2 pairwise alignments
– Two options: ad hoc fast alignment, or
Needleman-Wunsch dynamic programming
Calculate pairwise sequence distances
– Distance ~ f(1 / alignment score)
– Distance = # mismatches / # matches
– Approximate evolutionary distance
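The first two steps can be sketched as follows. This is a minimal illustration, not ClustalW's implementation: the scoring values (match, mismatch, linear gap penalty) are assumptions, a real run would use a substitution matrix and affine gaps, and ignoring gap columns when counting mismatches is also an assumption. The distance follows the "# mismatches / # matches" definition above.

```python
from itertools import combinations


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty; returns the two gapped strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b))


def mismatch_match_distance(x, y):
    """# mismatches / # matches over aligned columns; gap columns are ignored."""
    matches = sum(1 for p, q in zip(x, y) if p == q and p != '-')
    mismatches = sum(1 for p, q in zip(x, y) if p != q and '-' not in (p, q))
    return mismatches / matches if matches else float('inf')


def pairwise_distances(seqs):
    """Distances for all n(n-1)/2 pairs; seqs maps a name to a sequence."""
    return {(s, t): mismatch_match_distance(*needleman_wunsch(seqs[s], seqs[t]))
            for s, t in combinations(sorted(seqs), 2)}
```

For four sequences, `pairwise_distances` returns 4 * 3 / 2 = 6 entries, one per pair.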
ClustalW Steps
Construct a tree based on sequence distances
– e.g. solve the following matrix
Suppose that we are to align sequences A, B, C, D
ClustalW Steps
Progressively add sequences/alignments by the tree
order
– Starting from the smallest distance
– Add seq to seq, seq to align, align to align
– A and D form new node E; calculate the AE and DE distances
– Calculate the consensus for E, weighted by the AE and DE distances
– Calculate the pairwise distances among B, C, and E
– B and E form new node F…
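The merge order described above can be sketched with average-linkage (UPGMA-style) clustering. This is an illustrative assumption rather than ClustalW's own procedure, since ClustalW builds its guide tree by neighbor joining. The distance values below are made up so that A and D are closest, and the function name is hypothetical.

```python
from itertools import combinations


def average_linkage_order(names, dist):
    """Return the sequence of merges, always joining the closest pair of clusters.

    dist maps frozenset({x, y}) to the distance between leaves x and y.
    Clusters are tuples of leaf names; the distance between two clusters is
    the average of all leaf-to-leaf distances between them.
    """
    clusters = [(name,) for name in names]

    def cluster_distance(c1, c2):
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical pairwise distances for sequences A, B, C, D.
raw = {"AB": 0.4, "AC": 0.6, "AD": 0.1, "BC": 0.5, "BD": 0.4, "CD": 0.6}
dist = {frozenset(pair): value for pair, value in raw.items()}
for left, right in average_linkage_order("ABCD", dist):
    print(left, "+", right)
```

With these values A and D merge first (node E), then B joins that cluster (node F), and C is added last, which mirrors the order in the steps above.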
ClustalW Features: Consensus
Consensus is used to represent the aligned sequences
– If exact match, accept
– If inexact match, place the amino acid that minimizes the summed
substitution-matrix (e.g. BLOSUM or PAM) distance to the two characters
AVKDC
IVH–C
__________
LVN–C
ClustalW Features: Weighting
Scores for aligning more similar sequences are given
less weight
Weight adjusted based on branch length
– Weight for A = a = 0.2 + 0.3 / 2 = 0.35
– Weight for B = b = 0.1 + 0.3 / 2 = 0.25
– Weight for C = c = 0.5
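The arithmetic above can be reproduced with a short sketch: each branch length is split equally among the leaves below it, and a leaf's weight is the sum of its shares along the path to the root. The tree encoding (nested lists of `(child, branch_length)` pairs) and the function names are assumptions made for this illustration; the branch lengths are the ones used in the weights above.

```python
def count_leaves(node):
    """A leaf is a string; an internal node is a list of (child, branch_length)."""
    if isinstance(node, str):
        return 1
    return sum(count_leaves(child) for child, _ in node)


def leaf_weights(node, inherited=0.0):
    """Sum, for each leaf, its equal share of every branch on the path to the root."""
    if isinstance(node, str):
        return {node: inherited}
    weights = {}
    for child, length in node:
        share = length / count_leaves(child)  # branch shared equally by its leaves
        weights.update(leaf_weights(child, inherited + share))
    return weights


# A and B share an internal branch of length 0.3; C hangs directly off the root.
tree = [([("A", 0.2), ("B", 0.1)], 0.3), ("C", 0.5)]
for name, weight in sorted(leaf_weights(tree).items()):
    print(name, round(weight, 2))
```

Sequences that share most of their path to the root, such as A and B here, split the shared branch and so end up with less weight than an isolated sequence like C.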
Scoring an Alignment
ClustalW Features: Gaps
Sequence specific gap penalties
– Penalize gaps more in segments that are less
likely to have gaps
ClustalW Features: Gaps
Position Specific Gap Penalties
Progressive Alignment Limitations
“Once a gap, always a gap” rule: gaps can proliferate if
not handled carefully
Align1: ABCD-E
ABC-D-E
Align2: ABC-DE
ABC-D-E
Need many heuristic parameters
Does not guarantee global optimum
Errors in initial alignments are propagated
Manual improvements:
– Shift residues from one side of gap to the other
– Reduce gaps
ClustalW Input
ClustalW, Waiting…
– ClustalW web server: slow with many long sequences
– ClustalX: stand-alone Windows program
ClustalW Result Summary
ClustalW Output
– Input
– Pairwise sequence alignment scores
– MSA: formation of each node
ClustalW Alignment
* - identical
: - conserved
. - semi-conserved
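As a partial illustration of the annotation line, the sketch below marks only the `*` (identical) columns. The `:` and `.` symbols depend on Clustal's tables of strongly and weakly similar residue groups, which are deliberately left out here rather than guessed. The function name is hypothetical and gaps are assumed to be written as `-`.

```python
def identity_line(aligned):
    """Return '*' under columns where every sequence has the same residue."""
    marks = []
    for column in zip(*aligned):
        identical = len(set(column)) == 1 and column[0] != '-'
        marks.append('*' if identical else ' ')
    return ''.join(marks)


rows = ["AVKDC", "IVH-C"]
for row in rows:
    print(row)
print(identity_line(rows))
```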
ClustalW Tree
Branch length ~ distance
Local Sequence Patterns
Protein sequence motifs
– Usually conserved blocks in global MSA
– Functionally or structurally important blocks
– Can be used to predict the function of proteins with unknown
function
Can be represented by direct alignments, regular
expressions, or profiles (position specific scoring
matrices)
Local Sequence Patterns
Prosite: database of sequence motifs associated with
protein family membership.
– Example: Protein family X, members aligned, find
good blocks, represent them as regular expressions
– Prosite 1: [LIVM]-[ST]-A-[STAG]-H-C
– Prosite 2: [DNSTAGC]-[GSTAPIMVQH]-x(2)-G-[DE]-S-G-[GS]-[SAPHV]-[LIVHMSGDP]
– [ ] is a set of allowed residues; x(n) is any n amino acids
– Reading Prosite 1: one of aa LIVM, then S or T, then
A, then one of aa STAG, then H, then C.
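Because these patterns are essentially regular expressions, they translate directly. The sketch below converts the subset of Prosite syntax used here (single residues, `[ ]` sets, `x`, and `x(n)`) into a Python regular expression. The function name is hypothetical, other Prosite constructs such as exclusions and ranges are not handled, and the test sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple Prosite pattern into Python regular expression syntax."""
    parts = []
    for element in pattern.split('-'):
        m = re.fullmatch(r'(\[[A-Z]+\]|[A-Zx])(?:\((\d+)\))?', element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, count = m.groups()
        if residue == 'x':
            residue = '.'  # x means any amino acid
        parts.append(residue + (f'{{{count}}}' if count else ''))
    return ''.join(parts)


regex = prosite_to_regex("[LIVM]-[ST]-A-[STAG]-H-C")
print(regex)

# Scan a made-up protein sequence for the motif.
hit = re.search(regex, "MKLTAAHCG")
print(hit.group(0) if hit else "no match")
```

Here the motif matches the substring LTAAHC: L is in [LIVM], T is in [ST], followed by A, then A from [STAG], then H and C.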
Local Sequence Patterns
Motif profile:
– Position-specific scoring matrices (PSSM)
– Or position weight matrices (PWM)
– Probability or count of amino acid aa_i at position j
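A count or probability profile can be built directly from aligned motif instances. In the sketch below, the four instances are made-up sequences that fit the Prosite 1 pattern, the function names are hypothetical, and the pseudocount is an added assumption to avoid zero probabilities for unseen residues.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_counts(instances):
    """One Counter per motif position: how often each amino acid occurs there."""
    return [Counter(column) for column in zip(*instances)]


def position_probabilities(instances, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Per-position probabilities with a uniform pseudocount for each residue."""
    total = len(instances) + pseudocount * len(alphabet)
    return [{aa: (counts[aa] + pseudocount) / total for aa in alphabet}
            for counts in position_counts(instances)]


# Made-up aligned instances of a six-residue motif.
instances = ["LSASHC", "ITAAHC", "VSAGHC", "MTATHC"]
counts = position_counts(instances)
probs = position_probabilities(instances)
print(counts[2])                 # position 3 is always A
print(round(probs[2]["A"], 3))   # smoothed probability of A at position 3
```

Each column of the probability matrix sums to 1, and a candidate sequence can then be scored by combining the per-position probabilities of its residues.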
Local Sequence Patterns
Pfam: collection of MSA (both local and global)
covering many protein domains and families
(using HMM, next lecture)
Pfam Results
Trusted matches
– To manually curated domains
– To Pfam-B (computationally derived)
Potential matches: less significant e-value
Pfam Results
Summary
Protein global / local MSA
Global progressive alignment
– ClustalW: pairwise, tree, merge alignments
– Merge with minimum edit, sequence weighting,
sequence/position specific gaps
– Heuristics, does not guarantee optimal
Local sequence patterns
– Often from global MSA
– Prosite and Pfam
Next lecture: how to build HMMs for Pfam motifs