Protein sequence alignments
Theodor Hanekamp, University of Wyoming
MOLB5650, Spring 2002, updated 2004

Preview
Lecture 1: Protein Sequence Alignments
Lecture 2: Protein Databases
Lecture 3: Protein Structure Alignments
Lecture 4: Protein Structure Predictions
Theodor Hanekamp © 2002 All rights reserved.

Lecture outline
1. What is the biological problem?
2. What is the computational solution? (advantages, disadvantages, how it works)
3. What tools are available and where?

Today’s topics
• Bioinformatics definition
• Orthologs, paralogs, xenologs, analogs
• Global & local sequence alignments
• Gapped and ungapped alignments, gap penalties
• Pairwise and multiple alignments
• Dot matrix analysis, Dynamic Programming (Needleman-Wunsch, Smith-Waterman), Heuristic methods (BLAST, FASTA)
• Substitution matrix, PAM250, BLOSUM62

What is bioinformatics?
• “Bioinformatics is the application of quantitative and analytical computational techniques to model biological systems.” Cynthia Gibas and Per Jambeck, pg. 3
• “Developing analytical tools to discover knowledge in the data is the second, and more scientific, aspect of bioinformatics.” Cynthia Gibas and Per Jambeck, pg. 12
SOURCE: C. Gibas & P. Jambeck, Developing Bioinformatics Computer Skills, O’Reilly © 2001

The OMICS Revolution
GENOMICS, TRANSCRIPTOMICS, PROTEOMICS, METABOLOMICS
Interactomics, Foldomics, Kinomics, many others
OMICS Glossary: From behaviouromics to variomics
SOURCE: http://www.genomicglossaries.com/content/omes.asp

Omics
  bibliomics       inomics                  phenomics
  biomics          integromics              phylogenomics
  cellomics        interactomics            phyloproteomics
  chemogenomics    ionomics                 physiogenomics
  chromonomics     kinomics                 physiomics
  chronomics       ligandomics              postgenomics
  clinomics        lipoproteomics           proteogenomics
  cryptomics       metabolomics             proteomics
  crystallomics    metabonomics             pseudogenomics
  cytomics         metallomics              regulomics
  degradomics      methylomics              riboproteomics
  economics        neurogenomics            rnomics
  epigenomics      oncogenomics             saccharomics
  epitomics        operomics                separomics
  fluxomics        pathogenomics            toxicogenomics
  functomics       peptidomics              toxicomics
  genomics         peptidomics              transcriptomics
  glycomics        pharmacogenomics         transgenomics
  gpcromics        pharmacomethylomics      vaccinomics
  immunomics       pharmacophylogenomics    variomics

Where can I learn more?
• David Mount, Bioinformatics, CSHL Press 2001
• Baxevanis & Ouellette, Bioinformatics, 2nd ed., Wiley Interscience 2001
• BioTechniques, http://www.biotechniques.com/, BioComputing section
• Review articles on bioinformatics

Why do we want to align sequences?
1. Assign functions to unknown proteins
2. Determine relatedness of organisms
3. Identify structurally and functionally important elements
4. Make predictions about the 3D structure

Alignment-based methods
• Needed if we have an unknown DNA or protein sequence.
• Purpose:
  - To find sequences/regions of significant similarity in a sequence repository or database.
  - To identify all of the homologous sequences in a database or repository.
  - To identify motifs or domains with a sequence similarity that is significantly better than chance expectation.

The sequence alignment problem
The same two sequences can be aligned in more than one way:

  1. THESESENTENSESALIGN--NICELY
     ||||| || |    |||||  ||||||
  2. THESEQENCE----ALIGNEDNICELY

  1. THESESENTENSESALIGN--NICELY
     |||||    || | |||||  ||||||
  2. THESE-Q--ENCE-ALIGNEDNICELY

  1. THESESENTENSESALIGN--NICELY
     |||  ||  || | |||||  ||||||
  2. THE--SEQ-ENCE-ALIGNEDNICELY

Numbers shown on the slide beside each alignment: 2, 19, 4.

Sequence alignments
General goals:
• Find maximum degree of likeness
• Find minimum evolutionary distance

Ancestor relationships
[Figure: a common ancestor diverges by x steps into Sequence 1 and by y steps into Sequence 2; Sequence 1 and Sequence 2 are homologous.]

Ancestor relationships
[Figure: gene x undergoes gene duplication into x1 and x2 (paralogous); after speciation, Species 1 and Species 2 each carry x1 and x2; corresponding genes in the two species are orthologous.]

Homologs, Orthologs, and Paralogs
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html
Homologs: Genes evolved from a common ancestor.
Orthologs: Genes evolved from a common ancestor by speciation.
Paralogs: Genes evolved from a common ancestor by gene duplication.

Similar protein sequences
Orthologs: arise via separation of a duplicated region and speciation; they often have the same function in different organisms (e.g. hemoglobins in species 1 & 2)
Paralogs: other members of multigene families; may have similar functions (e.g. hemoglobin and myoglobin)
Xenologs: similar sequences that were “recently” introduced through horizontal gene transfer
Analogs: emerge through convergence on the same function; similar active sites but different sequence background (e.g. chymotrypsin & subtilisin)
“Pseudogenes”: similar sequence, not translated

Types of sequence alignments
• Global vs. Local alignments
• Gapped vs. Ungapped alignments
• Pairwise alignments vs. Multiple alignments

Local alignment
Finds domains and short regions of similarity between a pair of sequences. The two sequences under comparison do not need to have high levels of similarity over their entire length in order to receive locally high similarity scores. This makes local similarity searches useful when looking for domains within proteins or for regions of genomic DNA that contain introns. Local similarity searches do not have the constraint that similarity between two sequences needs to be observed over the entire length of each gene.

Global alignment
Finds the optimal alignment over the entire length of the two sequences under comparison. Algorithms of this nature are not particularly suited to the identification of genes that have evolved by recombination or insertion of unrelated regions of DNA; in such cases the global similarity score will be greatly reduced. Global alignment is appropriate when the sequences being aligned are of comparable length and are homologous (descended from a common ancestor) over their entire length.

Global vs. Local alignment
Global alignment
- match as many characters as possible from end to end
- find an alignment with highest total score
- regions of high local similarity may be ignored in favor of a higher overall score
Example:

  THIS-ISAGLALALIGNMENT
  ||   ||   ||  ||| |
  THEREISTHEAL--IGN-EDSEQ

Global vs. Local alignment
Local alignment
- find subsequences with highest density of matches
- find regions with high local scores
- sequence similarities may extend beyond the local subsequence with a lower degree of similarity
Example:

  --------LALIGNM---
           |||||
  --------EALIGNE----

Gapped vs. Ungapped alignments
Ungapped
- the number of sequence comparisons is roughly proportional to the square of the average sequence length

  MATCHES
  ||
  MAKERS

Gapped vs. Ungapped alignments
Gapped
If gaps of any length at any position were allowed:
- computationally very expensive
- alignments would not be very meaningful

  MATCHE-S
  ||   | |
  MA--KERS

Need a manageable number of gaps!

Gap penalties
• Reduce number of gaps in the alignment
• Ensure a more meaningful alignment
• Opening a gap is costly
• Extending a gap is cheap
Examples:
  Gap opening penalty = -12
  Gap extension penalty = -1

Gap penalties
G = g + xn
  G = gap penalty
  g = cost of opening a gap (here: g = -12)
  x = cost of extending the gap by one (here: x = -1)
  n = length of the gap
* Gap penalties should be adjusted to the substitution matrix that is being used
Note: the worked example on the "Sequence alignment problem" slide below charges g for the first position of a gap and x only for each additional position, i.e. g + x(n - 1). A gap of length 2 therefore costs -12 + (-1) = -13 there, not -14.

Impact of gap penalties
Case 1: Gap penalty: low; Mismatch cost: high

  MARCHMADNESSANDBASKETBALL
  -ARCHY----ISA--BASKET----CASE

Case 2: Gap penalty: medium; Mismatch cost: medium

  MARCHMADNESSANDBASKETBALL
  -ARCHY----ISA--BASKETCASE

Case 3: Gap penalty: high; Mismatch cost: low

  MARCHMADNESSANDBASKETBALL
  -ARCHYISABASKETCASE

Sequence alignment problem
Gap open = -12, Gap extension = -1

  1. THESESENTENSESALIGN--NICELY      gap in 1: -12 + (-1)
     ||||| || |    |||||  ||||||
  2. THESEQENCE----ALIGNEDNICELY      gap in 2: -12 + (-3)
  Total gap cost: -28

  1. THESESENTENSESALIGN--NICELY      gap in 1: -12 + (-1)
     |||||    || | |||||  ||||||
  2. THESE-Q--ENCE-ALIGNEDNICELY      gaps in 2: -12, -12 + (-1), -12
  Total gap cost: -50

  1. THESESENTENSESALIGN--NICELY      gap in 1: -12 + (-1)
     |||  ||  || | |||||  ||||||
  2. THE--SEQ-ENCE-ALIGNEDNICELY      gaps in 2: -12 + (-1), -12, -12
  Total gap cost: -50

Rules of thumb for gap penalties
• Gap opening penalty: should be 2 to 3 times larger than the most negative value in the substitution matrix that is being used
• Gap extension penalty: should be 0.1 to 0.3 times the value of the gap opening penalty

Pairwise vs. Multiple Alignments
Pairwise alignments:
- require 2 sequences
Multiple alignments:
- require more than 2 sequences
- the computational problem is a lot more difficult
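
The gap costs on the "Sequence alignment problem" slide can be checked with a few lines of Python. This is only a sketch: the function names and the `count_first_extension` switch are made up for illustration. With the switch off, a gap of length n costs g + x(n - 1), which reproduces the totals on the slide; with it on, the cost is g + xn as written in the formula G = g + xn.

```python
import re


def gap_cost(length, gap_open=-12, gap_extend=-1, count_first_extension=False):
    """Affine cost of one gap of `length` positions (hypothetical helper)."""
    if length <= 0:
        return 0
    # Either charge the extension for every position (g + x*n) or only for
    # positions after the first one (g + x*(n - 1)).
    extensions = length if count_first_extension else length - 1
    return gap_open + gap_extend * extensions


def gap_lengths(aligned_row):
    """Lengths of the runs of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", aligned_row)]


def total_gap_cost(row_a, row_b, **options):
    """Sum the gap costs over both rows of a pairwise alignment."""
    return sum(gap_cost(n, **options)
               for row in (row_a, row_b)
               for n in gap_lengths(row))


top = "THESESENTENSESALIGN--NICELY"
bottoms = [
    "THESEQENCE----ALIGNEDNICELY",
    "THESE-Q--ENCE-ALIGNEDNICELY",
    "THE--SEQ-ENCE-ALIGNEDNICELY",
]
for number, bottom in enumerate(bottoms, start=1):
    print(number, total_gap_cost(top, bottom))
```

Only the gap contribution is computed here; match and mismatch scores from a substitution matrix would be added on top to obtain the full alignment score.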

Pairwise Sequence Alignments

Pairwise Alignment Methods
1. Dot matrix analysis (Gibbs and McIntyre)
2. Dynamic programming algorithms (Needleman-Wunsch, Smith-Waterman)
3. Heuristic algorithms = word or k-tuple methods (BLAST, FASTA)

Algorithm
Definition:
• A systematic procedure for solving a problem in a finite number of ordered steps. Example: “calling from a payphone”
• An algorithm can be written in a computer language and run as a program.

1. Dot matrix analysis
Advantages:
- all possible matches between 2 sequences are displayed
- readily reveals insertions & deletions
- readily identifies direct and inverted repeats
- same algorithm is used for DNA, RNA and proteins
Disadvantages:
- doesn’t show an actual sequence alignment
- qualitative evaluation of alignments
- statistical significance of alignment is not obvious

How dot matrix analysis works
[Figure: dot matrix with MAKEAMATCHMAKER on the horizontal axis and MATCHMAKER on the vertical axis; an asterisk marks every cell where the two residues are identical.]

How dot matrix analysis works
[Figure: the same dot matrix with three features labeled: Direct Repeat, Inverted Repeat, and Aligned sequence (a diagonal run of asterisks).]

Variations of dot matrix analysis
1. Chemical similarity of the R-group of amino acids (in D. Mount 2001)
2. “Symbol comparison tables” (PAM250, BLOSUM) (States and Boguski 1991)
3. Scoring table for amino acids found in secondary structures (Risler 1988) => identifies distantly related proteins

Sources for dot matrix programs
• DNA Strider (vers. 1.3): Mac
• MacVector (vers. 7.1): Mac
• DOTLET: Mac, Win, Unix
  http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
• Dotter: Unix
  http://www.cgr.ki.se/cgr/gropus/sonnhammer/Dotter.html
• GCG (Genetics Computer Group): COMPARE and DOTPLOT
  http://nun.oit.unc.edu/gcgmanual/

Dot matrix analysis with DNA
Settings: Vertical scale: lambda cI; Horizontal scale: phage P22 c2; Window size: 1; Stringency: 1

Dot matrix analysis with DNA
Settings: Vertical scale: lambda cI; Horizontal scale: phage P22 c2; Window size: 11; Stringency: 7

Dot matrix analysis with DNA
Settings: Vertical scale: lambda cI; Horizontal scale: phage P22 c2; Window size: 23; Stringency: 15

Dot matrix analysis with proteins
Settings: Vertical scale: lambda cI; Horizontal scale: phage P22 c2; Window size: 1; Stringency: 1

Dot matrix analysis with proteins
Settings: Vertical scale: lambda cI; Horizontal scale: phage P22 c2; Window size: 3; Stringency: 2

Take home message
• DNA sequence alignments
  - use large windows (7 - 11)
  - use high stringencies
• Protein sequence alignments
  - use small windows (1 - 3)
  - use lower stringencies

2. Dynamic Programming
Definition:
- solves a problem by combining solutions to subproblems that are computed once and saved in a table or matrix
- used when many solutions are possible and an optimal solution needs to be found

2. Dynamic Programming Algorithms
Advantages:
- guaranteed to provide the optimal (i.e. highest scoring) alignment (mathematically proven)
- user defined choice of substitution matrix
- user defined gap penalties
- may provide one or more sequence alignments
Disadvantages:
- relatively slow; the number of computational steps increases as the square or cube of the sequence lengths

Implementations of DP methods
1. Needleman-Wunsch/Sellers: global alignment
2. Smith-Waterman/Sellers: local alignment

Smith-Waterman algorithm
• Local alignment method
• Does not place any restrictions on the evolutionary model
• Most rigorous method
• Very sensitive
• Computationally expensive

You need a scoring system
Calculate the probabilities that:
1. a particular aa pair is found in the alignment
2. the same aa is aligned by chance
3. the insertion of a gap of one or more residues in one of the sequences would improve the alignment
1 and 2 are retrieved from a substitution matrix.

How DP works (3 steps)
1. Generate a sequence vs. sequence matrix; fill in the best scores from [0,0] to [n,m]. Keep track of pointers to allow trace-back.
2. Identify the highest score in the matrix.
3. Trace back to the start to get the alignment position by position.

What is the best score?
Given:
  Protein Sequence S with residues indexed by i, from 1 to m: S = s1 s2 s3 … sm
  Protein Sequence T with residues indexed by j, from 1 to n: T = t1 t2 t3 … tn
The best score B[i,j] of a cell is the maximum over the three ways of reaching it:

  B[i,j] = max of
    B[i-1,j-1] + b(si, tj)    si aligned with tj (diagonal move)
    B[i,j-y] - G(y)           gap of length y, reached from y cells back along index j
    B[i-x,j] - G(x)           gap of length x, reached from x cells back along index i

Here b(si, tj) is the substitution score for the pair and G(n) is the penalty for a gap of length n, written as a positive cost that is subtracted.

Step 1: Create and fill matrix
Seq. S = THATCHER labels the rows (i … m); Seq. T = MATCHES labels the columns (j … n).
GLOBAL: penalize the 1st column and row with position * gap penalty (gap penalty = -2):
  first row (columns -, M, A, T, C, H, E, S):       0  -2  -4  -6  -8 -10 -12 -14
  first column (rows -, T, H, A, T, C, H, E, R):    0  -2  -4  -6  -8 -10 -12 -14 -16
LOCAL: no penalty for the 1st column or row (all 0); no negative numbers allowed anywhere in the matrix.

Step 1: Create and fill matrix
BestScore[i,j] = BestScore[<i,<j] + Match[i,j] + GapPenalty
Matrix after the first row of residues has been filled:

          M    A    T    C    H    E    S
      0  -2   -4   -6   -8  -10  -12  -14
  T  -2  -1   -3   -3   -5   -7   -9  -11
  H  -4
  A  -6
  T  -8
  C -10
  H -12
  E -14
  R -16

Scoring contributions:
  Vertical: -2 (gap in T)
  Horizontal: -2 (gap in S)
  Diagonal: +1 if match, -1 if mismatch
There are only three ways of pairing at each step:
1. One residue from each sequence, either a match or a mismatch
2. One residue from sequence T and a gap in sequence S
3. One residue from sequence S and a gap in sequence T
NOTE: Gaps don’t align with gaps.

Step 2: Find highest score
Global alignment (Needleman-Wunsch): find the highest score in the final row and final column.
Local alignment (Smith-Waterman): find the highest score anywhere in the matrix.

Step 2: Find highest score
GLOBAL (highest score in last row and last column: 1)

          M    A    T    C    H    E    S
      0  -2   -4   -6   -8  -10  -12  -14
  T  -2  -1   -3   -3   -5   -7   -9  -11
  H  -4  -3   -2   -4   -4   -4   -6   -8
  A  -6  -5   -2   -3   -5   -5   -5   -7
  T  -8  -7   -4   -1   -3   -5   -6   -6
  C -10  -9   -6   -3    0   -2   -4   -6
  H -12 -11   -8   -5   -2    1   -1   -3
  E -14 -13  -10   -7   -4   -1    2    0
  R -16 -15  -12   -9   -6   -3    0    1

LOCAL (highest score anywhere: 39)

          M    A    T    C    H    E    S
      0   0    0    0    0    0    0    0
  T   0   0    0    5    0    0    0    2
  H   0   0    0    0    2   10    2    0
  A   0   0    5    0    0    2    9    3
  T   0   0    0   10    2    0    9    3
  C   0   0    0    2   23   15    7    3
  H   0   0    0    0   15   33   25   17
  E   0   0    0    0    7   25   39   31
  R   0   0    0    0    0   17   31   38

Step 3: Trace back and align
- start at the highest score and create the alignment in reverse order
- print S[i] and T[j] as aligned
- follow the pointer back to the previous highest score
- if the pointer leads to [i-1, j-1], print S[i-1] and T[j-1] as aligned
- if the pointer leads to [i-1, j-N], report T[j-1] … T[j-(N-1)] as matched to gaps
- if the pointer leads to [i-N, j-1], report S[i-1] … S[i-(N-1)] as matched to gaps
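
For the global case, the three steps above (fill the matrix, find the score, trace back) can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name is made up, the scores are the simple +1 / -1 / -2 scheme from the slides instead of a substitution matrix, the gap penalty is linear, and the traceback starts at the bottom-right cell. When several alignments share the best score, the traceback order used here may pick a different one than the slides show.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment of s (rows) and t (columns) with a linear gap penalty."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]

    # Step 1a: penalize the first column and row with position * gap penalty.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Diagonal contribution for aligning s[i-1] with t[j-1].
        return match if s[i - 1] == t[j - 1] else mismatch

    # Step 1b: every other cell is the best of the three possible moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # residue vs. residue
                score[i - 1][j] + gap,             # residue of s vs. gap in t
                score[i][j - 1] + gap,             # residue of t vs. gap in s
            )

    # Steps 2 and 3: start at the bottom-right cell and walk back to [0,0].
    i, j = len(s), len(t)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(s[i - 1])
            bottom.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(s[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(t[j - 1])
            j -= 1
    return score, "".join(reversed(top)), "".join(reversed(bottom))


matrix, aligned_s, aligned_t = needleman_wunsch("THATCHER", "MATCHES")
print(matrix[-1][-1])
print(aligned_s)
print(aligned_t)
```

A local (Smith-Waterman) variant would initialize the first row and column with 0, clamp every cell at 0, start the traceback at the highest cell anywhere in the matrix, and stop when a 0 is reached.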

Global alignment
Seq. S = THATCHER (rows), Seq. T = MATCHES (columns)

          M    A    T    C    H    E    S
      0  -2   -4   -6   -8  -10  -12  -14
  T  -2  -1   -3   -3   -5   -7   -9  -11
  H  -4  -3   -2   -4   -4   -4   -6   -8
  A  -6  -5   -2   -3   -5   -5   -5   -7
  T  -8  -7   -4   -1   -3   -5   -6   -6
  C -10  -9   -6   -3    0   -2   -4   -6
  H -12 -11   -8   -5   -2    1   -1   -3
  E -14 -13  -10   -7   -4   -1    2    0
  R -16 -15  -12   -9   -6   -3    0    1

Scoring contributions:
  Vertical: -2 (gap in T)
  Horizontal: -2 (gap in S)
  Diagonal: +1 if match, -1 if mismatch
Traceback from the bottom-right cell (scores along the path): 1, 2, 1, 0, -1, -2, -3, -1, 0
Resulting alignment:

  Seq. S:  T H A T C H E R
  Seq. T:  M - A T C H E S

Alignment paths & gap placement
[Figure: three path shapes through the matrix: diagonal move (no gap), move that places a gap in seq. S, move that places a gap in seq. T.]

Local alignment
Seq. S = THATCHER (rows), Seq. T = MATCHES (columns)

          M    A    T    C    H    E    S
      0   0    0    0    0    0    0    0
  T   0   0    0    5    0    0    0    2
  H   0   0    0    0    2   10    2    0
  A   0   0    5    0    0    2    9    3
  T   0   0    0   10    2    0    9    3
  C   0   0    0    2   23   15    7    3
  H   0   0    0    0   15   33   25   17
  E   0   0    0    0    7   25   39   31
  R   0   0    0    0    0   17   31   38

Scoring contributions as listed on the slide: Vertical: -2 (gap in T); Horizontal: -2 (gap in S); Diagonal: +1 if match, -1 if mismatch. The values in this matrix cannot come from that scheme (a single match already scores 5 or more); they appear to have been computed with a substitution matrix and a larger gap penalty.
Traceback from the highest score (scores along the path): 39, 33, 23, 10, 5, 0
Stop the alignment when BestScore[i,j] is zero.
Resulting alignment:

  Seq. S:  A T C H E
  Seq. T:  A T C H E

Similarity or substitution matrices
- attempt to quantify whether a mutation preserves or disrupts the function of a protein
- reflect different degrees of evolutionary divergence
- provide a quantifiable measure for amino acid residue substitutions
Examples:
a) Point Accepted Mutations (PAM)
b) Blocks Substitution Matrix (BLOSUM)

Log Odds Score (Sij)
• Given: Seq. A (i1 … in) and Seq. B (j1 … jn)
• Sij = a measure for the probability of residue i replacing residue j in an alignment
  Sij = log2 (qij / (pi pj))
• qij = observed frequency at which i replaces j
• pi pj = expected frequency at which i replaces j if the pattern of mutations were random
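
A small numerical sketch of the log odds formula above, in Python. The frequencies are invented for illustration (chosen as powers of two so the results come out as whole numbers) and the function name is hypothetical. The second part shows why logarithms are convenient: multiplying the odds of independent aligned positions is the same as adding their log odds scores.

```python
import math


def log_odds(q_ij, p_i, p_j):
    """Sij = log2(qij / (pi * pj)); all three inputs are frequencies."""
    return math.log2(q_ij / (p_i * p_j))


# Invented background frequencies: pi = pj = 1/8, so pi * pj = 1/64.
p_i = p_j = 1 / 8
more_than_chance = log_odds(1 / 32, p_i, p_j)   # observed twice as often as expected
as_chance = log_odds(1 / 64, p_i, p_j)          # observed exactly as often as expected
less_than_chance = log_odds(1 / 128, p_i, p_j)  # observed half as often as expected
print(more_than_chance, as_chance, less_than_chance)

# Odds multiply across aligned positions, log odds add.
position_odds = [8, 1 / 16, 1 / 16]             # invented odds for three positions
combined_odds = math.prod(position_odds)
summed_scores = sum(math.log2(odds) for odds in position_odds)
print(combined_odds, summed_scores)
```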

Values for (Sij)
Sij > 0: residues replace each other more often than expected by random chance
Sij = 0: residues replace each other as expected by random chance
Sij < 0: residues replace each other less often than expected by random chance

Example of Log odds score
Odds of winning a chess game = no. of times you won a match / no. of times you lost a match
Odds of aligning 2 amino acids correctly = no. of times they aligned in sequences known to be related / no. of times they aligned in sequences that are not related

  Alignment:   C     A     T
               C     T     N
  Odds:        8/1   1/16  1/16

  odds score = 8/256 = 1/32
  log odds score = 3 - 4 - 4 = -5

Dayhoff or PAM250 matrix

      C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
  C  12
  S   0   2
  T  -2   1   3
  P  -3   1   0   6
  A  -2   1   1   1   2
  G  -3   1   0  -1   1   5
  N  -4   1   0  -1   0   0   2
  D  -5   0   0  -1   0   1   2   4
  E  -5   0   0  -1   0   0   1   3   4
  Q  -5  -1  -1   0   0  -1   1   2   2   4
  H  -3  -1  -1   0  -1  -2   2   1   1   3   6
  R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
  K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
  M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
  I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
  L  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
  V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
  F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -3  -4  -5   0   1   2  -1   9
  Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
  W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17

PAM vs. BLOSUM matrix
PAM matrix (Dayhoff)
1. Based on mutations in conserved and variable regions in global alignments
2. Limited number of observations
3. Derived from an explicit evolutionary model
BLOSUM matrix (Henikoff)
1. Based exclusively on mutations in local, highly conserved regions without gaps
2. Large number of observations
3. Derived with a sum-of-pairs evolutionary model

LALIGN
- local alignment method
- gives multiple alternative alignments
- defines the gap penalty as q + r (k-1)
Sources:
- http://fasta.bioch.virginia.edu/fasta/lalign.htm
- www.ch.embnet.org/software/LALIGN_form.html
- www-bio.unizh.ch/cgi-bin/man-cgi?lalign

3. Heuristic Algorithms
• Heuristic = a procedure that progresses along empirical lines by using rules of thumb to reach a solution. The solution is not guaranteed to be optimal.

Heuristic Alignment Algorithms
Word or k-tuple methods
Advantages:
- very fast
- reliable in a statistical sense
Disadvantages:
- less sensitive than dynamic programming
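
The idea behind word (k-tuple) methods can be illustrated with a short Python sketch. It is deliberately simplified and all function names are made up: it only breaks a query into overlapping words and looks up exact occurrences of those words in a second sequence. Scoring with a substitution matrix, score thresholds, and the extension of hits into alignments are left out.

```python
from collections import defaultdict


def words(seq, k=3):
    """All overlapping words of length k, in order of their start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def word_index(seq, k=3):
    """Map every word of the subject sequence to the positions where it starts."""
    index = defaultdict(list)
    for position, word in enumerate(words(seq, k)):
        index[word].append(position)
    return index


def seed_hits(query, subject, k=3):
    """(query position, subject position) pairs where an identical word starts."""
    index = word_index(subject, k)
    return [(q_pos, s_pos)
            for q_pos, word in enumerate(words(query, k))
            for s_pos in index.get(word, [])]


print(words("MATCHES"))
print(seed_hits("MATCHES", "THATCHER"))
```

Hits whose two positions differ by the same offset lie on one diagonal of the dot matrix; a run of such hits is a natural seed for an ungapped alignment.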

Basic Local Alignment Search Tool (BLAST)
Advantages:
• most popular tool for searching large databases
Disadvantages of BLAST:
• Needs islands of strong homology
• The variants blastx, tblastn and tblastx use 6-frame translations and will miss sequences with frameshifts
• Finds ONLY local alignments

How BLAST works
STEP 1. Make a list of the 3-letter words of the protein sequence.
  MATCHES
  MAT   positions 1,2,3
  ATC   positions 2,3,4
  TCH   positions 3,4,5
  CHE   ...
  HES   ...

How BLAST works
STEP 2. Search for perfect matches in the database
• uses the BLOSUM62 substitution matrix
• number of match scores = 8000 (i.e. 20 x 20 x 20 possible 3-letter words)
• will find perfect and imperfect matches: MAT, ... HAT, MIT, MAN

How BLAST works
STEP 3. Select cutoff score T
• T = neighborhood word score threshold
• reduces the list of matches from 8000 to ~50 (based on highest score using BLOSUM62)
STEP 4. Repeat the word search for all 3-letter words
  For a 250 aa seq. => 12,500 words (about 250 words x ~50 matches each)

How BLAST works
STEP 5. Scan the database for exact matches of words (short list of words)
- if a match is found, use the word as a seed for a possible ungapped alignment
- extend the alignment in each direction along the sequence as long as the score increases

Multiple Sequence Alignments

Multiple sequence alignments
Advantages:
- may identify structural & functional domains
- may identify protein families
Disadvantages:
- still a difficult algorithmic problem

Multiple Sequence Alignments
Global alignment methods
• ClustalW (most popular)
• PileUp
Local alignment methods
• Dialign

How ClustalW works
• based on Progressive Pairwise Alignment (PPA)
1. globally align the most similar sequences first
2. construct a tree using neighbor-joining (determines the order in which subsequent sequences are incorporated into the alignment)

When to use ClustalW?
ClustalW performs well when:
• aligning sequences of similar lengths
• aligning small to large protein families of similar sequences
• few divergent sequences are included

Sources for ClustalW
• http://dot.imgen.bcm.tmc.edu:9331/multialign/Options/clustalw.html
• http://www.bionavigator.com
• http://www2.ebi.ac.uk/clustalw/
• http://bioweb.pasteur.fr/intro-uk.html
• http://pbil.ibcp.fr/
• http://www.clustalw.genome.ad.jp/

How Dialign works
• Local alignment approach
• identifies gap-free fragments (called diagonals) of high similarity
• builds segments into a multiple alignment using an iterative approach
• works with DNA and proteins

When to use Dialign?
Dialign performs well when:
• sequences have long terminal extensions
• sequences have large insertions
It is also useful for finding conserved blocks within a set of sequences.

Sources for Dialign
• http://www.hgmp.mrc.ac.uk/
• http://mep.bio.psu.edu/alignment.html
• http://genomatix.gsf.de/
• http://bibiserv.techfak.uni-bielefeld.de/

Next lecture …
• SwissProt, PIR-PSD
• TrEMBL, Genpept
• Pfam
• Prosite
• SCOP
• CATH