STAT 530
Multiple Sequence Alignment and Protein Motifs
Ping Ma

Outline
Motivation and Introduction
Global MSA
  – ClustalW steps
  – ClustalW features
  – ClustalW example
Local protein MSA (protein motifs)
  – Prosite and Pfam

Multiple Sequence Alignment
MSA Uses:
  – Establish evolutionary relationships (global)
  – Find conserved nucleotides and amino acids (global)
  – Characterize signature protein patterns or motifs (local)
  – Find acceptable substitutions (local)
Protein MSA gold standard: structural alignment

MSA with Dynamic Programming
Theoretically can perform dynamic programming on multiple sequences
Computational complexity O(n^k) for k sequences, each of length n

Progressive MSA Method
Progressive (Feng and Doolittle 1987 J Mol Evol)
  – Heuristic algorithm: an approximation strategy that does not aim for a perfect alignment
  – Build the alignment from the most related sequences, then progressively add less related sequences
  – Manual examination can often improve alignments
Clustal:
  – Higgins and Sharp. Comput Appl Biosci (CABIOS, now Bioinformatics) 1989, 5:151-3. ClustalW, NAR 1994, 22:4673-80
  – W stands for weighting: more distant seqs weigh more
  – Weights reflect evolutionary distance

ClustalW Steps
Global pairwise alignment for all pairs
  – n(n-1)/2 pairwise alignments for n sequences
  – Two options:
      ad hoc fast alignment
      Needleman-Wunsch dynamic programming
Calculate pairwise sequence distances
  – Distance ~ f(1 / alignment score)
  – Distance = # mismatches / # matches
  – Approximate evolutionary distance

ClustalW Steps
Construct a tree based on sequence distances
  – e.g. solve the following matrix
Suppose that we are to align sequences A, B, C, D

ClustalW Steps
Progressively add sequences/alignments by the tree order
  – Starting from the smallest distance
  – Add seq to seq, seq to alignment, alignment to alignment
AD form new node E; calc AE, DE distances
Calc E consensus, weighted by AE, DE distances
Calc B, C, E pairwise distances
BE form new node F…

ClustalW Features: Consensus
Consensus is used to represent the aligned sequences
  – If exact match, accept
  – If inexact match, place the AA for which the sum of match matrix (e.g. BLOSUM or PAM) distances to the two characters is minimized
    AVKDC
    IVH-C
    _____
    LVN-C

ClustalW Features: Weighting
Scores for aligning more similar sequences are given less weight
Weight adjusted based on branch length
  – Weight for A = a = 0.2 + 0.3 / 2 = 0.35
  – Weight for B = b = 0.1 + 0.3 / 2 = 0.25
  – Weight for C = c = 0.5

Scoring an Alignment

ClustalW Features: Gaps
Sequence-specific gap penalties
  – Penalize gaps more in segments that are less likely to have gaps

ClustalW Features: Gaps
Position-specific gap penalties

Progressive Alignment Limitations
"Once a gap, always a gap" rule: gaps can proliferate if not careful
    Align1: ABCD-E    ABC-D-E
    Align2: ABC-DE    ABC-D-E
Need many heuristic parameters
Does not guarantee global optimum
Errors in initial alignments are propagated
Manual improvements:
  – Shift residues from one side of a gap to the other
  – Reduce gaps

ClustalW Input

ClustalW, Waiting…
ClustalW web server: slow with many long sequences
ClustalX: stand-alone windows program

ClustalW Result Summary

ClustalW Output
Input
Pairwise sequence alignment scores
MSA: formation of each node

ClustalW Alignment
    *  = identical
    :  = conserved
    .  = semi-conserved

ClustalW Tree
Branch length ~ distance

Local Sequence Patterns
Protein sequence motifs
  – Usually conserved blocks in a global MSA
  – Functionally or structurally important blocks
  – Can be used to predict the function of proteins with unknown function
Can be represented by direct alignments, regular expressions, or profiles (position-specific scoring matrices)

Local Sequence Patterns
Prosite: database of sequence motifs associated with protein family membership.
  – Example: protein family X, members aligned, find good blocks, represent them as regular expressions
  – Prosite 1: [LIVM]-[ST]-A-[STAG]-H-C
  – Prosite 2: [DNSTAGC]-[GSTAPIMVQH]-x(2)-G-[DE]-S-G-[GS]-[SAPHV]-[LIVHMSGDP]
  – [ ] is a set of possible residues; x(n) is any n amino acids
  – Reading Prosite 1: one of L, I, V, M, then S or T, then A, then one of S, T, A, G, then H, then C.

Local Sequence Patterns
Motif profile:
  – Position-specific scoring matrices (PSSM)
  – Or position weight matrices (PWM)
  – Probability or count of amino acid aa_i at position j

Local Sequence Patterns
Pfam: collection of MSAs (both local and global) covering many protein domains and families (using HMM, next lecture)

Pfam Results
Trusted matches
  – To manually curated domains
  – To Pfam-B (computationally derived)
Potential matches: less significant e-value

Summary
Protein global / local MSA
Global progressive alignment
  – ClustalW: pairwise, tree, merge alignments
  – Merge with minimum edit, sequence weighting, sequence/position-specific gaps
  – Heuristic, does not guarantee an optimal alignment
Local sequence patterns
  – Often from global MSA
  – Prosite and Pfam
Next lecture: how to get Pfam motifs (HMM)
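
The merge order on the ClustalW Steps slides (AD form node E, then BE form node F) can be illustrated with a small sketch. The distance values below are made up, since the slide's matrix is not in the text, and the function name `merge_order` is hypothetical. The sketch uses simple average-linkage clustering as a stand-in; ClustalW itself builds its guide tree with neighbor joining, so this only shows the "closest pair first" idea.

```python
from itertools import combinations

# Hypothetical pairwise distances for sequences A, B, C, D (not from the slides).
dist = {
    frozenset("AD"): 0.10, frozenset("AB"): 0.30, frozenset("BD"): 0.35,
    frozenset("BC"): 0.50, frozenset("AC"): 0.60, frozenset("CD"): 0.65,
}

def merge_order(names, dist):
    """Repeatedly merge the two closest clusters (average linkage).

    Returns a list of (left, right, new_node) tuples in merge order.
    """
    clusters = {name: [name] for name in names}
    new_labels = iter("EFGHIJ")  # names for the internal nodes
    merges = []

    def average_distance(c1, c2):
        pairs = [(a, b) for a in clusters[c1] for b in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        left, right = min(combinations(sorted(clusters), 2),
                          key=lambda pair: average_distance(*pair))
        label = next(new_labels)
        clusters[label] = clusters.pop(left) + clusters.pop(right)
        merges.append((left, right, label))
    return merges

merges = merge_order("ABCD", dist)
for left, right, node in merges:
    print(left, "+", right, "->", node)
```

With these assumed distances the first two merges are A + D -> E and B + E -> F, matching the order described on the slide.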
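
The arithmetic on the weighting slide can be reproduced with a short sketch. The tree shape used here, ((A:0.2, B:0.1):0.3, C:0.5), is inferred from the slide's numbers because the figure is not in the text, and the nested-tuple encoding and the names `count_leaves` and `leaf_weights` are hypothetical. The rule assumed is the one the slide's sums follow: each branch length is shared equally among the leaves below that branch.

```python
def count_leaves(node):
    """Number of leaves under a node; a node is (name_or_children, branch_length)."""
    content, _ = node
    if isinstance(content, str):
        return 1
    return sum(count_leaves(child) for child in content)

def leaf_weights(node, inherited=0.0):
    """Return {leaf: weight}, sharing each branch length among the leaves below it."""
    content, length = node
    share = inherited + length / count_leaves(node)
    if isinstance(content, str):
        return {content: share}
    weights = {}
    for child in content:
        weights.update(leaf_weights(child, share))
    return weights

# Assumed tree: A and B join on a branch of length 0.3; C hangs off the root.
tree = ([([("A", 0.2), ("B", 0.1)], 0.3), ("C", 0.5)], 0.0)
weights = leaf_weights(tree)
for name in sorted(weights):
    print(name, round(weights[name], 2))
```

Under this assumed tree the sketch gives 0.35 for A, 0.25 for B, and 0.5 for C, the same values as on the slide.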
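
The Prosite notation on the Local Sequence Patterns slide maps closely onto ordinary regular expressions, which the sketch below tries to show. The function name `prosite_to_regex` and the test sequence are hypothetical, and only the elements that appear on the slide are handled (single residues, `[ ]` sets, `x`, and `x(n)`); the full Prosite syntax has more features that are not covered here. The second pattern is written with every element separated by a hyphen.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple Prosite pattern into a Python regular expression.

    Supports a single residue (A), a residue set ([ST]),
    x (any residue), and x(n) (any n residues).
    """
    parts = []
    for element in pattern.split("-"):
        wildcard = re.fullmatch(r"x(?:\((\d+)\))?", element)
        if wildcard:
            count = wildcard.group(1)
            parts.append("." if count is None else ".{" + count + "}")
        elif re.fullmatch(r"\[[A-Z]+\]|[A-Z]", element):
            parts.append(element)  # sets and single residues are already valid regex
        else:
            raise ValueError("unsupported Prosite element: " + repr(element))
    return "".join(parts)

prosite_1 = "[LIVM]-[ST]-A-[STAG]-H-C"
prosite_2 = "[DNSTAGC]-[GSTAPIMVQH]-x(2)-G-[DE]-S-G-[GS]-[SAPHV]-[LIVHMSGDP]"
regex_1 = prosite_to_regex(prosite_1)
regex_2 = prosite_to_regex(prosite_2)
print(regex_1)
print(regex_2)

# Made-up sequence containing V-T-A-A-H-C, which fits Prosite 1.
sequence = "MKVTAAHCGG"
match = re.search(regex_1, sequence)
print(match.group(), match.start())
```

Prosite 1 becomes `[LIVM][ST]A[STAG]HC`, and `x(2)` becomes `.{2}`, so scanning a sequence for the motif reduces to a standard `re.search` call.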
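
The motif profile slide describes a matrix of counts or probabilities of each amino acid at each position. A minimal sketch of that idea follows; the aligned motif instances are invented for illustration (they are not real Prosite or Pfam data), and the function names are hypothetical. Profiles used in practice usually add pseudocounts and convert to log-odds scores, which this sketch leaves out.

```python
from collections import Counter

# Hypothetical gap-free motif instances of equal length; not real data.
motif_instances = ["LTAAHC", "VSAGHC", "ITASHC", "LSAAHC"]

def position_counts(instances):
    """Count each amino acid in each alignment column."""
    length = len(instances[0])
    if any(len(seq) != length for seq in instances):
        raise ValueError("all motif instances must have the same length")
    return [Counter(column) for column in zip(*instances)]

def position_probabilities(counts):
    """Convert per-column counts to per-column relative frequencies."""
    probabilities = []
    for column in counts:
        total = sum(column.values())
        probabilities.append({aa: n / total for aa, n in column.items()})
    return probabilities

counts = position_counts(motif_instances)
probs = position_probabilities(counts)
for j, column in enumerate(probs, start=1):
    print(j, sorted(column.items()))
```

Each entry `probs[j][aa]` is the observed frequency of amino acid `aa` at column `j` (zero-based in the code), which is the quantity the slide calls the probability of aa_i at position j.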