Pairwise sequence alignment and dynamic programming

ALGORITHMS FOR BIOINFORMATICS AND
VISUALIZATION
LECTURE 4
Raluca Uricaru
Based on:
MIT OpenCourseWare, Computational Biology: Genomes,
Networks, Evolution
We would like to highlight the similarity between the sequences
Our ultimate goal: infer the evolutionary scenario

The space of global alignments
•  some of the possible alignments between ELV and VIS

   ELV   -ELV   --ELV   ELV-   E-LV   EL-V
   VIS   VIS-   VIS--   -VIS   VIS-   -VIS

•  how many are there?
•  for 2 sequences of length 100, there are about 10^77 possible alignments

How can we compute the best alignments?
•  Given sequences S1 = ACGTCATCA and S2 = TAGTGTCA
•  And an additive score function based on
   –  score of each mutation (A↔G, C↔T, other)
   –  cost of insertion/deletion
   –  reward of match
•  We need an algorithm that infers the best alignment
•  Observation: enumeration would take too long

Solution – Dynamic Programming
•  solves an instance of a problem by taking advantage of solutions to subparts of the problem
   –  reduces the problem of the best alignment of two sequences to the best alignments of all prefixes of the sequences
   –  avoids recalculating scores already computed, e.g. as for the Fibonacci sequence
•  Needleman & Wunsch, J. Mol. Biol. 1970

Dynamic programming idea
•  given a sequence S1 of length n and a sequence S2 of length m,
•  build an (n+1) x (m+1) matrix F
•  F(i,j) = score of the best alignment of S1[1..i] and S2[1..j]
•  systematically fill in the matrix; the optimal score ends up in F(n, m)
•  trace back from the optimal score to recover the alignment

Filling in the matrix
•  compute scores for prefixes of increasing length (start from top-left)
•  only 3 possibilities to advance on the 2 sequences
   –  a gap in one sequence
   –  a gap in the other
   –  a match, or a mismatch

   F(i, j) = max { F(i-1, j)   - gap,
                   F(i-1, j-1) + score(S1[i], S2[j]),
                   F(i, j-1)   - gap }

Increased sequence availability
•  new problems, such as querying sequence databases
   Query – new sequence
   Subject – many old sequences
   Goal – find related sequences
•  most sequences are unrelated
•  individual alignments need not be perfect
•  queries must be fast

Speeding up searches
•  Exploit the nature of the problem
   –  if we reject any match with identity < 90%, why bother looking at sequences without long stretches of identical characters?
   –  pre-screen sequences for common long stretches
•  Put the speed where you need it
   –  pre-processing the database is done offline
   –  once the query arrives, act fast

Solution – content-based indexing
•  e.g. BLAST, J. Mol. Biol. 1990
•  46 versions, among which Gapped BLAST (24000 citations)

Whole Genome Alignment
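
The matrix-filling recurrence and traceback from the dynamic programming slides above can be sketched in Python as follows. This is only a minimal illustration, not the original Needleman & Wunsch formulation: the function name `needleman_wunsch`, the uniform `mismatch` score (the slides distinguish A↔G, C↔T and other mutations) and the default values `match=1`, `mismatch=-1`, `gap=1` are all assumptions chosen for the example.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=1):
    """Global alignment sketch; returns (score, aligned_s1, aligned_s2).

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    n, m = len(s1), len(s2)

    def score(a, b):
        # Reward for a match, penalty for a mismatch (uniform, by assumption).
        return match if a == b else mismatch

    # F[i][j] = score of the best alignment of s1[:i] and s2[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * gap  # s1[:i] aligned against gaps only
    for j in range(1, m + 1):
        F[0][j] = -j * gap  # s2[:j] aligned against gaps only

    # Fill the matrix from the top-left corner, prefix by prefix.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j] - gap,                              # gap in s2
                F[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),  # match/mismatch
                F[i][j - 1] - gap,                              # gap in s1
            )

    # Trace back from F[n][m] to recover one optimal alignment.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]):
            a1.append(s1[i - 1])
            a2.append(s2[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - gap:
            a1.append(s1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(s2[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(a1)), "".join(reversed(a2))


if __name__ == "__main__":
    best, top, bottom = needleman_wunsch("ACGTCATCA", "TAGTGTCA")
    print(best)
    print(top)
    print(bottom)
```

The sketch fills all (n+1) x (m+1) cells once, so time and memory both grow as n x m, compared with the roughly 10^77 alignments that enumeration would have to visit for two sequences of length 100. When several alignments tie for the best score, the traceback returns just one of them (it prefers the diagonal move).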
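
The "pre-screen, then act fast" idea from the database search slides above can be illustrated with a small exact k-mer index. This is a deliberately simplified sketch and not BLAST's actual algorithm: the helper names `build_kmer_index` and `prescreen`, the dictionary-of-sets layout and the choice of k are all hypothetical.

```python
from collections import defaultdict


def build_kmer_index(subjects, k):
    """Offline step: map every k-mer to the ids of the subjects containing it.

    `subjects` is assumed to be a dict {subject_id: sequence}.
    """
    index = defaultdict(set)
    for sid, seq in subjects.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].add(sid)
    return index


def prescreen(query, index, k):
    """Online step: ids of subjects sharing at least one exact k-mer with the query."""
    hits = set()
    for pos in range(len(query) - k + 1):
        # .get avoids creating empty entries in the defaultdict
        hits |= index.get(query[pos:pos + k], set())
    return hits


if __name__ == "__main__":
    db = {"s1": "ACGTCATCA", "s2": "TTTTTTTT", "s3": "GGTCATGG"}
    idx = build_kmer_index(db, 4)          # done once, offline
    print(prescreen("TAGTGTCA", idx, 4))   # done per query, online
```

Only the subjects returned by `prescreen` would then be passed to a full alignment routine. The unrelated majority of the database is discarded with a few dictionary lookups per query.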