Pairwise sequence alignment and dynamic programming

ALGORITHMS FOR BIOINFORMATICS AND
VISUALIZATION
LECTURE 4
Raluca Uricaru
Based on:
MIT OpenCourseWare, Computational Biology: Genomes,
Networks, Evolution
We would like to highlight similarity between the sequences
Our ultimate goal: infer the evolutionary scenario

The space of global alignments
•  some of the possible alignments between ELV and VIS

      ELV    -ELV    --ELV    ELV-    E-LV    EL-V
      VIS    VIS-    VIS--    -VIS    VIS-    -VIS

•  how many are there?
   for 2 sequences of length 100 there are 10^77 possible alignments

How can we compute the best alignments?
•  Given sequences S1 = ACGTCATCA and S2 = TAGTGTCA
•  And an additive score function based on
   –  score of each mutation (A↔G, C↔T, other)
   –  cost of insertion/deletion
   –  reward of match
•  We need an algorithm that infers the best alignment
•  Observation: enumeration would take too long

Solution: Dynamic Programming
•  solves an instance of a problem by taking advantage of solutions to subparts of the problem
   –  reduces the problem of the best alignment of two sequences to best alignments of all prefixes of the sequences
   –  avoids recalculating scores already computed, e.g. as for the Fibonacci sequence
•  Needleman & Wunsch, J. Mol. Biol. 1970

Dynamic programming idea
•  given a sequence S1 of length n and a sequence S2 of length m,
•  build an (n+1) x (m+1) matrix F
•  F(i,j) = score of the best alignment of S1[1..i] and S2[1..j]
•  systematically fill in the matrix and obtain the optimal score in F(n, m)
•  trace back from the optimal score to recover the alignment

Filling in the matrix
•  compute scores for prefixes of increasing length (start from top-left)
•  only 3 possibilities to advance on the 2 sequences
   –  a gap in one sequence
   –  a gap in the other
   –  a match, or a mismatch

                     F(i-1, j)   - gap
   F(i, j) = max {   F(i-1, j-1) + score
                     F(i, j-1)   - gap     }

Increased sequence availability
new problems, like querying sequence databases
Query – new sequence
Subject – many old sequences
Goal – find related sequences
•  most sequences are unrelated
•  individual alignments need not be perfect
•  queries must be fast

Speeding up searches
•  Exploit the nature of the problem
   –  if we reject any match with id% < 90, then why bother looking at sequences without long stretches of identical characters?
   –  pre-screen sequences for common long stretches
•  Put the speed where you need it
   –  pre-processing the database is offline
   –  once the query arrives, act fast

Solution: content-based indexing
e.g. BLAST, J. Mol. Biol. 1990
46 versions, among which Gapped BLAST (24,000 citations)

Whole Genome Alignment
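
Looking back at the "Dynamic programming idea" and "Filling in the matrix" slides, the matrix fill and the traceback can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's reference implementation. The function name needleman_wunsch and the scoring values (match = +1, mismatch = -1, gap = 1) are assumptions chosen for the example. A single mismatch score stands in for the separate A↔G / C↔T / other mutation scores mentioned in the slides.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=1):
    """Global alignment sketch; returns (score, aligned_s1, aligned_s2).

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    n, m = len(s1), len(s2)

    # F[i][j] = score of the best alignment of s1[:i] and s2[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]

    # First column / first row: a prefix aligned against nothing is all gaps.
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] - gap
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] - gap

    # Fill the matrix from the top-left, using the three-way recurrence.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j] - gap,      # gap in s2
                F[i - 1][j - 1] + s,    # match or mismatch
                F[i][j - 1] - gap,      # gap in s1
            )

    # Trace back from F[n][m] to recover one optimal alignment.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                a1.append(s1[i - 1])
                a2.append(s2[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - gap:
            a1.append(s1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(s2[j - 1])
            j -= 1

    return F[n][m], "".join(reversed(a1)), "".join(reversed(a2))


if __name__ == "__main__":
    # The two sequences from the slides.
    score, top, bottom = needleman_wunsch("ACGTCATCA", "TAGTGTCA")
    print(score)
    print(top)
    print(bottom)
```

The fill takes time and memory proportional to (n+1) x (m+1), compared with enumerating every alignment. When several alignments share the optimal score, this traceback returns only one of them (it prefers the diagonal move, then a gap in the second sequence).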
