Scenario 3: Our Story

Welcome to
Introduction to Bioinformatics
I. Scenario 4: Sequence alignment
• Bring up course web site
• Go to Scenario 4
• Open the first sequence alignment notes
Scenario 3: Our Story
You: Our first defense at CDC
Outbreak: . . . Anthrax?
Samples:
• Confirm agent
• Identify strain
Scenario 3: Our Story
Toxin gene-specific primers
Scenario 3: Our Story
If DNA from bacterium with toxin gene
PCR
If DNA NOT from bacterium with toxin gene?
Scenario 3: Our Story
If DNA from bacterium with toxin gene
If DNA NOT from bacterium with toxin gene?
PCR
(no product)
Scenario 3: Our Story
DG47
AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG
>gi|16031490|emb|AJ413935.1|BAN413935 Bacillus anthracis partial lef gene, isolate Microsoft-6259
Length = 2417 Score = 155 bits (78), Expect = 2e-35 Identities = 138/158 (87%) Strand = Plus / Plus
Query: 1    aatattgacgctttactacatcagtccatcggaagtacgttgtataataaaatatatctg 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1267 aatattgacgctttactacatcagtccatcggaagtacgttgtataataaaatatatctg 1326
Query: 61   tatgaaaacatgaatataaataacttaacagcaacgttaggtgccgatttagtagattcc 120
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1327 tatgaaaacatgaatataaataacctaacagcaacgttaggtgccgatttagtagattcc 1386
Query: 121  acagataatacaaaaattaatcgaggtatattcaatga 158
            ||||||||||||||||||||||||||||||||||||||
Sbjct: 1387 acagataatacaaaaattaatcgaggtatattcaatga 1424
Scenario 3: Our Story
PCR
Toxin gene present
Scenario 3: Our Story
DG47
AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG
Do it!
Scenario 3: Our Story
Maybe it’s not from the toxin gene??
Scenario 3: Our Story
DG47
AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG
Translate
NIDALLHQSIGSTLYNKIYLYENMNINNLTATLGADLVDSTDNTKINRGIFNEFKKNFKYSIS
Do it!
DG47 nucleotide sequence: Matches nothing in GenBank
DG47 amino acid sequence: 100% match to toxin gene
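The "Translate" step can be sketched in a few lines of Python. This is a minimal illustration, not the tool used in class: it assumes the standard genetic code, reads only the first reading frame, and silently drops any leftover bases at the end. The names `CODON_TABLE` and `translate` are hypothetical.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate reading frame 1 of a DNA string; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[dna[p:p + 3]] for p in range(0, usable, 3))


dg47_start = "AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG"
print(translate(dg47_start))
```

Run on the 64 bases of DG47 shown on the slide, this gives the first 21 residues of the protein sequence above.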
Scenario 3: Our Story
Compare nucleotide sequences by hand
DG47 vs lef
Do it!
Scenario 3: Our Story
Compare nucleotide sequences by hand
DG47        1 AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG
              |||||||| |||||| ||||||| ||||| |||||||| ||||| |||||||| ||| ||
lef gene 1831 AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG
DG47       61 TATGAAAACATGAATATAAATAACTTAACAGCAACGTTAGGTGCCGATTTAGTAGATTCC
              |||||||| |||||||| |||||| | ||||||||  ||||||| |||||||| ||||||
lef gene 1891 TATGAAAATATGAATATCAATAACCTTACAGCAACCCTAGGTGCGGATTTAGTTGATTCC
DG47      121 ACAGATAATACAAAAATTAATCGAGGTATATTCAATGAGTTCAAAAAAAATTTCAAATAC
              || |||||||| ||||||||| ||||||| |||||||| ||||||||||||||||||||
lef gene 1951 ACTGATAATACTAAAATTAATAGAGGTATTTTCAATGAATTCAAAAAAAATTTCAAATAT
DG47      181 AGTATTTCTA
              ||||||||||
lef gene 2011 AGTATTTCTA
89% identical!
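Counting matches by hand is easy to automate once the two sequences are already lined up. The sketch below assumes two equal-length strings with no gaps; `percent_identity` is a hypothetical helper name. It is applied here only to the first 60-base row of the comparison, so the percentage it prints describes that row, not the whole 190-base alignment.

```python
def percent_identity(seq_a, seq_b):
    """Return (matching positions, percent identity) for two pre-aligned strings."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches, 100.0 * matches / len(seq_a)


dg47_row1 = "AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG"
lef_row1 = "AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG"
matches, pct = percent_identity(dg47_row1, lef_row1)
print(f"{matches}/{len(dg47_row1)} positions match ({pct:.1f}%)")
```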
Scenario 3: Our Story
Compare nucleotide sequences by hand
DG47
AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG
+
lef gene
Sequence 1: lcl|PCR Product DG47 (Length 190)
Sequence 2: lcl|M29081: Bacillus anthracis lethal factor (lef) gene, 1831-2020 (Length 190)
No significant similarity was found
Scenario 3: Our Story
DG47        1 AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG
              |||||||| |||||| ||||||| ||||| |||||||| ||||| |||||||| ||| ||
lef gene 1831 AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG
89% identical!
Why can’t Blast figure out
what you can plainly see?
Sequence 1: lcl|PCR Product DG47 (Length 190)
Sequence 2: lcl|M29081: Bacillus anthracis lethal factor (lef) gene, 1831-2020 (Length 190)
No significant similarity was found
Scenario 3: How does Blast work?
• Clearly we need to understand more about how
sequence alignment really works!
• Theory behind nucleotide vs nucleotide Blast
• Working BlastN program
• Theory behind protein-protein Blast
• How to get Blast to do what you want
“Flavours” of sequence alignment
Global Alignment - Needleman-Wunsch algorithm
- Compares two sequences across their whole length
- Mostly only useful when you already know sequences might be similar
- Not useful for comparing a short query to an entire genome.
- Not discussed further in this class.
Local Alignment
- Allows alignment of subsequences of the target and the query
- Usually what we want; the query can be searched against entire
genomes or large databases.
Crude Local Alignment Methods
The “Dot Matrix” method (Gibbs and McIntyre, 1970)
Represents the query and target sequences as a matrix (a two-dimensional array) using a sliding window of similarity
The human eye can powerfully distinguish the identity line from the noise
The “Dot Matrix” method (Gibbs and McIntyre, 1970)
Normally a “window size” and a “stringency” are specified
e.g. if the window size is 8 and the stringency is 6, a dot is placed only
if at least 6 of the current 8 positions in the query match the target
The “Dot Matrix” method (Gibbs and McIntyre, 1970)
[Figure: dot matrix comparing GGTAATA with GGTAATA; window = 2, stringency = 2]
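The window and stringency rule can be tried out with a short script. This is a rough sketch under stated assumptions, not the original Gibbs and McIntyre program: `dot_matrix` is a hypothetical name, the window is anchored at its left edge, and the matrix is printed as text with `*` for a dot and `.` for no dot.

```python
def dot_matrix(query, target, window=2, stringency=2):
    """Return one text row per query window; '*' where enough positions match."""
    rows = []
    for i in range(len(query) - window + 1):
        row = ""
        for j in range(len(target) - window + 1):
            # Count matching positions inside the window starting at (i, j).
            matches = sum(query[i + k] == target[j + k] for k in range(window))
            row += "*" if matches >= stringency else "."
        rows.append(row)
    return rows


for line in dot_matrix("GGTAATA", "GGTAATA", window=2, stringency=2):
    print(line)
```

With the GGTAATA example, the identity line shows up as the main diagonal, and the two off-diagonal dots come from the repeated TA.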
Problems with the Dot Matrix method
1. Requires human supervision!
2. A memory and processor time pig
(a complete m*n matrix is calculated each time)
3. No explicit handling of gaps
4. No good quantitative score of alignment quality
The Smith-Waterman Algorithm (no gaps version)
[Figure: Smith-Waterman scoring matrix (no gaps)]
Match Extension = +1
NoMatch Penalty = -2
Negative values are reset to zero!!
Download
SmithWaterman1.py
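As a rough illustration of the no-gaps scoring rule (this is a sketch, not the course's SmithWaterman1.py), each cell takes the value of its upper-left neighbour plus +1 for a match or -2 for a mismatch, and negative values are reset to zero. The function name `sw_nogap` and the example strings are assumptions made for the illustration.

```python
def sw_nogap(query, target, match=1, mismatch=-2):
    """Fill the no-gaps scoring matrix; return (matrix, best score, best cell)."""
    rows, cols = len(query) + 1, len(target) + 1
    H = [[0] * cols for _ in range(rows)]  # row 0 and column 0 stay at zero
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if query[i - 1] == target[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + step)  # negatives reset to zero
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return H, best, best_pos


H, best, best_pos = sw_nogap("GTAA", "GGTAAT")
for row in H:
    print(row)
print("best score", best, "at", best_pos)
```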
Smith Waterman – Dynamic Programming
An optimal alignment can be found starting from the
highest scoring box and working backwards.
Dynamic Programming is a method for recording the
solutions to subproblems, then working backwards
to find an overall solution. If we incorporate gaps,
we must start keeping track of this “traceback”
pathway.
The Smith-Waterman Algorithm (with gaps)
[Figure: Smith-Waterman scoring matrix (with gaps)]
Match Extension = +1
NoMatch Penalty = -2
Gap Penalty = -3
Take the Max of:
0;
adding Query Gap;
adding Target Gap;
Match/No match;
Download
SmithWaterman2.py
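The "take the max of" rule and the traceback can be sketched as follows. This is an illustrative version, not the course's SmithWaterman2.py: it assumes a single fixed gap penalty, recomputes the traceback from the score matrix instead of storing pointers, and breaks ties in favour of the diagonal move. The name `smith_waterman` and the example sequences are hypothetical.

```python
def smith_waterman(query, target, match=1, mismatch=-2, gap=-3):
    """Local alignment; returns (score, aligned query, aligned target)."""
    rows, cols = len(query) + 1, len(target) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if query[i - 1] == target[j - 1] else mismatch
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + step,  # match / no match
                H[i][j - 1] + gap,       # adding a gap in the query
                H[i - 1][j] + gap,       # adding a gap in the target
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback: start at the highest-scoring box and walk back to a zero.
    i, j = best_pos
    q_out, t_out = [], []
    while H[i][j] > 0:
        step = match if query[i - 1] == target[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            q_out.append(query[i - 1]); t_out.append(target[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i][j - 1] + gap:
            q_out.append("-"); t_out.append(target[j - 1])
            j -= 1
        else:
            q_out.append(query[i - 1]); t_out.append("-")
            i -= 1
    return best, "".join(reversed(q_out)), "".join(reversed(t_out))


score, q_aln, t_aln = smith_waterman("ACGTGATC", "ACGTCGATC")
print(score)
print(q_aln)
print(t_aln)
```

In this example the target carries one extra base, so the best local alignment pays the gap penalty once in order to join the two four-base matches on either side.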
Problems with Smith-Waterman
Still a pig! Memory and processor time
requirements are huge when the query
and/or the database get large...
(a complete m*n matrix is still calculated each time!!)
Do we really need to calculate the whole matrix?
BlastN – “word” based heuristics
Notice that in a typical S-W matrix, most of the
boxes are empty!!!
What if we find exact matches of some seed
words, then just work in the area surrounding
these seeds trying to extend the alignment?
This is exactly the heuristic that blast employs to
avoid calculating the whole matrix!
(see figure on page 6 of Alignment notes)
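The seed-word idea can be sketched as a lookup table of query words that is then scanned against the target. This is a simplified illustration, not BLAST's actual implementation; `find_seeds`, the small word size, and the example strings are assumptions chosen to keep the output readable.

```python
from collections import defaultdict


def find_seeds(query, target, word_size=4):
    """Return (query position, target position) for every exact word match."""
    # Index every word of length word_size in the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)

    # Scan the target once, looking each of its words up in the index.
    seeds = []
    for j in range(len(target) - word_size + 1):
        for i in index.get(target[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


print(find_seeds("GGTAATA", "CCGTAACC", word_size=4))
```

Only the matrix boxes near the reported seeds would then need to be scored, which is how the full m*n calculation is avoided.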
BlastN Procedure
Filter the query sequence for repetitive “low
complexity” sequences
Identify all subsequences (“words”) of the chosen word size in the query
Find the exact matches in the target for all of the words
Use a modified S-W to extend the hits around the seed words
Score and report on the best matches
More on scoring next class!!!
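The extension step can be illustrated with a much-simplified, ungapped stand-in for the modified S-W. The sketch below assumes the +1/-2 scoring used earlier and stops extending once the running score falls more than `max_drop` below the best seen so far. The drop-off rule, its value, and the names `_extend` and `extend_seed` are assumptions made for this example, not details taken from BLAST itself.

```python
def _extend(query, target, i, j, step, match, mismatch, max_drop):
    """Walk along the diagonal from (i, j); return (best gain, length kept)."""
    gain = best = 0
    length = best_len = 0
    while 0 <= i < len(query) and 0 <= j < len(target):
        gain += match if query[i] == target[j] else mismatch
        length += 1
        if gain > best:
            best, best_len = gain, length
        elif best - gain > max_drop:
            break  # score has dropped too far; give up on this direction
        i += step
        j += step
    return best, best_len


def extend_seed(query, target, qi, tj, word_size,
                match=1, mismatch=-2, max_drop=5):
    """Extend an exact seed at (qi, tj) in both directions without gaps.

    Returns (score, query start, query end, target start), end exclusive.
    """
    right, right_len = _extend(query, target, qi + word_size, tj + word_size,
                               +1, match, mismatch, max_drop)
    left, left_len = _extend(query, target, qi - 1, tj - 1,
                             -1, match, mismatch, max_drop)
    score = word_size * match + left + right
    return score, qi - left_len, qi + word_size + right_len, tj - left_len


dg47 = "AATATTGACGCTTTACTACATCAG"  # first 24 bases of DG47
lef = "AATATTGATGCTTTATTACATCAA"   # first 24 bases of the lef region
print(extend_seed(dg47, lef, 0, 0, word_size=8))
```

Starting from the eight-base exact match at the beginning of both sequences, the extension steps over the two isolated mismatches and stops before the final one.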