Welcome to Introduction to Bioinformatics I.

Scenario 4: Sequence alignment
• Bring up course web site
• Go to Scenario 4
• Open the first sequence alignment notes

Scenario 3: Our Story
You: Our first defense at CDC
Outbreak: . . . Anthrax?
Samples:
• Confirm agent
• Identify strain

Scenario 3: Our Story
Toxin gene-specific primers

Scenario 3: Our Story
If DNA from bacterium with toxin gene: PCR
If DNA NOT from bacterium with toxin gene?

Scenario 3: Our Story
If DNA from bacterium with toxin gene
If DNA NOT from bacterium with toxin gene? PCR (no product)

Scenario 3: Our Story
DG47
AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG

>gi|16031490|emb|AJ413935.1|BAN413935 Bacillus anthracis partial lef gene, isolate Microsoft-6259
Length = 2417
Score = 155 bits (78), Expect = 2e-35
Identities = 138/158 (87%)
Strand = Plus / Plus

Query: 1    aatattgacgctttactacatcagtccatcggaagtacgttgtataataaaatatatctg 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1267 aatattgacgctttactacatcagtccatcggaagtacgttgtataataaaatatatctg 1326

Query: 61   tatgaaaacatgaatataaataacttaacagcaacgttaggtgccgatttagtagattcc 120
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1327 tatgaaaacatgaatataaataacctaacagcaacgttaggtgccgatttagtagattcc 1386

Query: 121  acagataatacaaaaattaatcgaggtatattcaatga 158
            ||||||||||||||||||||||||||||||||||||||
Sbjct: 1387 acagataatacaaaaattaatcgaggtatattcaatga 1424

Scenario 3: Our Story
PCR
Toxin gene present

Scenario 3: Our Story
DG47
AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG
Do it!

Scenario 3: Our Story
Maybe it’s not from the toxin gene??

Scenario 3: Our Story
DG47
AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG
Translate
NIDALLHQSIGSTLYNKIYLYENMNINNLTATLGADLVDSTDNTKINRGIFNEFKKNFKYSIS
Do it!
DG47 nucleotide sequence: Matches nothing in GenBank
DG47 amino acid sequence: 100% match to toxin gene

Scenario 3: Our Story
Compare nucleotide sequences by hand
DG47 vs lef
Do it!
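The by-hand comparison can also be sketched in a few lines of Python. This is only an illustration: it assumes the two sequences are already lined up position by position with no gaps, and the function name `percent_identity` is made up for this example. The two 60-nucleotide strings are the first row of the DG47 vs lef comparison shown in these slides.

```python
def percent_identity(seq_a, seq_b):
    """Count matching positions between two equal-length, pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length (gaps are not handled)")
    matches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a == b)
    return matches, 100.0 * matches / len(seq_a)


# First 60 positions of DG47 and of the lef gene (starting at position 1831)
dg47 = "AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG"
lef = "AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG"

matches, pct = percent_identity(dg47, lef)
print(f"{matches}/{len(dg47)} identical ({pct:.0f}%)")
```

This only counts column-by-column matches; it does not search for the best way to line the sequences up, which is what the alignment methods later in these notes are for.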
Scenario 3: Our Story
Compare nucleotide sequences by hand

DG47     1    AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG
              |||||||| |||||| ||||||| ||||| |||||||| ||||| |||||||| ||| ||
lef gene 1831 AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG

DG47     61   TATGAAAACATGAATATAAATAACTTAACAGCAACGTTAGGTGCCGATTTAGTAGATTCC
              |||||||| |||||||| |||||| | ||||||||  ||||||| |||||||| ||||||
lef gene 1891 TATGAAAATATGAATATCAATAACCTTACAGCAACCCTAGGTGCGGATTTAGTTGATTCC

DG47     121  ACAGATAATACAAAAATTAATCGAGGTATATTCAATGAGTTCAAAAAAAATTTCAAATAC
              || |||||||| ||||||||| ||||||| |||||||| |||||||||||||||||||||
lef gene 1951 ACTGATAATACTAAAATTAATAGAGGTATTTTCAATGAATTCAAAAAAAATTTCAAATAT

DG47     181  AGTATTTCTA
              ||||||||||
lef gene 2011 AGTATTTCTA

89% identical!

Scenario 3: Our Story
Compare nucleotide sequences by hand
DG47
AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG
+ lef gene

Sequence 1  lcl|PCR Product DG47  Length 190
Sequence 2  lcl|M29081: Bacillus anthracis lethal factor (lef) gene, 1831-2020.  Length 190
No significant similarity was found

Scenario 3: Our Story

DG47     1    AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG
              |||||||| |||||| ||||||| ||||| |||||||| ||||| |||||||| ||| ||
lef gene 1831 AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG

89% identical!
Why can’t Blast figure out what you can plainly see?

Sequence 1  lcl|PCR Product DG47  Length 190
Sequence 2  lcl|M29081: Bacillus anthracis lethal factor (lef) gene, 1831-2020.  Length 190
No significant similarity was found

Scenario 3: How does Blast work?
• Clearly we need to understand more about how sequence alignment really works!
• Theory behind nucleotide vs nucleotide Blast
• Working BlastN program
• Theory behind protein-protein Blast
• How to get Blast to do what you want

“Flavours” of sequence alignment

Global Alignment
- Needleman-Wunsch algorithm
- Compares two sequences across their whole length
- Mostly only useful when you already know sequences might be similar
- Not useful for comparing a short query to an entire genome.
- Not discussed further in this class.

Local Alignment
- Allows alignment of subsequences of the target and the query
- Usually what we want; the query can be searched against entire genomes or large databases.

Crude Local Alignment Methods

The “Dot Matrix” method (Gibbs and McIntyre, 1970)
Represents the query and target sequences as a matrix (a two-dimensional array) using a sliding window of similarity
The human eye can powerfully distinguish the identity line from the noise

The “Dot Matrix” method (Gibbs and McIntyre, 1970)
Normally a “window size” and “stringency” are specified
i.e. if the window size is 8 and stringency is 6, a dot is only placed if at least 6 of the current 8 positions in the query match the target

The “Dot Matrix” method (Gibbs and McIntyre, 1970)
[Figure: dot matrix of GGTAATA against GGTAATA; window = 2, stringency = 2]

Problems with the Dot Matrix method
1. Requires human supervision!
2. A memory and processor time pig (a complete m*n matrix is calculated each time)
3. No explicit handling of gaps
4. No good quantitative score of alignment quality

The Smith-Waterman Algorithm (no gaps version)
[Figure: scoring matrix for two short sequences, filled in along the diagonals]
Match Extension = +1
NoMatch Penalty = -2
Negative values are reset to zero!!
Download SmithWaterman1.py

Smith Waterman – Dynamic Programming
An optimal alignment can be found starting from the highest scoring box and working backwards.
Dynamic Programming is a method for recording the solutions to subproblems, then working backwards to find an overall solution.
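The no-gaps scoring and the "work backwards from the highest scoring box" step can be sketched as follows. This is not the course's SmithWaterman1.py, just a minimal illustration written for these notes; the function names and the example sequences are assumptions, and only the scoring values (+1 match, -2 no match, negatives reset to zero) come from the slide.

```python
def smith_waterman_ungapped(query, target, match=1, mismatch=-2):
    """Fill the no-gaps scoring matrix.

    Each box extends its diagonal neighbour: +match on a match,
    +mismatch otherwise, and any negative value is reset to zero.
    Returns (best score, (row, column) of the best box, full matrix).
    """
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if query[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(0, score[i - 1][j - 1] + step)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    return best, best_pos, score


def traceback_ungapped(query, target, score, end):
    """Walk back along the diagonal from the best box until a zero is reached."""
    i, j = end
    while i > 0 and j > 0 and score[i][j] > 0:
        i, j = i - 1, j - 1
    return query[i:end[0]], target[j:end[1]]


# Hypothetical example: the query sits inside a longer target
best, end, matrix = smith_waterman_ungapped("GGTAATA", "CCGGTAATACC")
print(best, traceback_ungapped("GGTAATA", "CCGGTAATACC", matrix, end))
```

Note that the whole matrix of len(query) * len(target) boxes is still filled in, even though only one diagonal ends up mattering for the result.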
If we incorporate gaps, we must start keeping track of this “traceback” pathway.

The Smith-Waterman Algorithm (with gaps)
[Figure: scoring matrix for two short sequences, with gaps allowed]
Match Extension = +1
NoMatch Penalty = -2
Gap Penalty = -3
Take the Max of:
- 0;
- adding Query Gap;
- adding Target Gap;
- Match/No match;
Download SmithWaterman2.py

Problems with Smith-Waterman
Still a pig! Memory and processor time requirements are huge when the query and/or the database gets large (a complete m*n matrix is still calculated each time!!)
Do we really need to calculate the whole matrix?

BlastN – “word” based heuristics
Notice that in a typical S-W matrix, most of the boxes are empty!!!
What if we find exact matches of some seed words, then just work in the area surrounding these seeds trying to extend the alignment?
This is exactly the heuristic that Blast employs to avoid calculating the whole matrix! (see figure on page 6 of Alignment notes)

BlastN Procedure
- Filter the query sequence for repetitive “low complexity” sequences
- Identify the subsequences of size word in the query
- Find the exact matches in the target of all the words
- Use a modified S-W to extend the hits around the seed words
- Score and report on the best matches
More on scoring next class!!!
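The "identify the words, then find their exact matches" steps can be sketched as below. This is a simplified illustration, not Blast's actual implementation: the function name `find_seeds` is made up, low-complexity filtering and the extension step are left out, and the default word size of 11 is an assumption based on the usual BlastN setting.

```python
from collections import defaultdict


def find_seeds(query, target, word_size=11):
    """Return (query_pos, target_pos) pairs where a query word matches the target exactly."""
    # Index every word of the target by the positions where it starts
    index = defaultdict(list)
    for t in range(len(target) - word_size + 1):
        index[target[t:t + word_size]].append(t)

    # Look up each word of the query in that index
    seeds = []
    for q in range(len(query) - word_size + 1):
        for t in index.get(query[q:q + word_size], []):
            seeds.append((q, t))
    return seeds


# Hypothetical toy example with a short word size
print(find_seeds("ACGTAC", "TTACGTT", word_size=4))
```

Only the neighbourhood of each seed would then need to be scored, rather than the complete m*n matrix. The flip side is that two sequences with no exact word match of the chosen size produce no seeds at all, however similar they look overall.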