ALGORITHMS FOR BIOINFORMATICS AND
VISUALIZATION
LECTURE 4
Raluca Uricaru
Pairwise sequence alignment and dynamic programming
Based on:
MIT OpenCourseWare, Computational Biology: Genomes,
Networks, Evolution
We would like to highlight similarity between the sequences
Our ultimate goal: infer the evolutionary scenario

The space of global alignments
• some of the possible alignments between ELV and VIS

  ELV    -ELV    --ELV    ELV-    E-LV    EL-V
  VIS    VIS-    VIS--    -VIS    VIS-    -VIS

• how many are there? For two sequences of length 100, there are roughly 10^77 possible alignments

How can we compute the best alignments?
• Given sequences S1 = ACGTCATCA and S2 = TAGTGTCA
• And an additive score function based on
  – score of each mutation (AG, CT, other)
  – cost of insertion/deletion
  – reward of match
• We need an algorithm that infers the best alignment
• Observation: enumeration would take too long

Solution: Dynamic Programming
• solves an instance of a problem by taking advantage of solutions to subparts of the problem
  – reduces the problem of the best alignment of two sequences to the best alignments of all prefixes of the sequences
  – avoids recalculating scores already computed, e.g. as for the Fibonacci sequence
• Needleman & Wunsch, J. Mol. Biol. 1970

Dynamic programming idea
• given a sequence S1 of length n and a sequence S2 of length m,
• build an (n+1) x (m+1) matrix F
• F(i,j) = score of the best alignment of S1[1..i] and S2[1..j]
• systematically fill in the matrix and compute the optimal score in F(n, m)
• trace back from the optimal score to find the alignment solution

Filling in the matrix
• compute scores for prefixes of increasing length (start from top-left)
• only 3 possibilities to advance on the 2 sequences
  – a gap in one sequence
  – a gap in the other
  – a match, or a mismatch

  F(i, j) = max { F(i-1, j) - gap,
                  F(i-1, j-1) + score,
                  F(i, j-1) - gap }

  where score is the match or mismatch score for S1[i] and S2[j], and gap is the insertion/deletion cost

Increased sequence availability: new problems, such as querying sequence databases
Query: new sequence
Subject: many old sequences
Goal: find related sequences
• most sequences are unrelated
• individual alignments need not be perfect
• queries must be fast

Speeding up searches
• Exploit the nature of the problem
  – if we reject any match with identity below 90%, why bother looking at sequences without long stretches of identical characters?
  – pre-screen sequences for common long stretches
• Put the speed where you need it
  – pre-processing the database is offline
  – once the query arrives, act fast

Solution: content-based indexing
e.g. BLAST, J. Mol. Biol. 1990
46 versions, among which Gapped BLAST (24000 citations)

Whole Genome Alignment
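
Worked sketch for the "Filling in the matrix" slide. The recurrence and the trace-back step from the dynamic programming slides can be illustrated with a short Python sketch. This is only one possible reading of the slides, not the course's reference implementation. The function name needleman_wunsch and the scoring values (match = 1, mismatch = -1, gap = 1) are assumptions chosen for illustration; the slides' score function distinguishes mutation types, whereas this sketch uses a single mismatch score. The border initialization F(i,0) = -i * gap is the usual choice for global alignment and is also an assumption, since the slides do not state it.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=1):
    """Global alignment by dynamic programming.

    Scoring values are illustrative assumptions, not taken from the slides.
    Returns (optimal score, aligned s1, aligned s2).
    """
    n, m = len(s1), len(s2)

    # F[i][j] = score of the best alignment of s1[:i] and s2[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * gap          # prefix of s1 aligned against gaps only
    for j in range(1, m + 1):
        F[0][j] = -j * gap          # prefix of s2 aligned against gaps only

    # Fill the matrix from the top-left, one prefix pair at a time.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if s1[i - 1] == s2[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j] - gap,          # gap in s2
                F[i - 1][j - 1] + score,    # match or mismatch
                F[i][j - 1] - gap,          # gap in s1
            )

    # Trace back from F[n][m] to recover one optimal alignment.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            score = match if s1[i - 1] == s2[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + score:
                a1.append(s1[i - 1])
                a2.append(s2[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - gap:
            a1.append(s1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(s2[j - 1])
            j -= 1

    return F[n][m], "".join(reversed(a1)), "".join(reversed(a2))


if __name__ == "__main__":
    # The two example sequences from the slides.
    best, top, bottom = needleman_wunsch("ACGTCATCA", "TAGTGTCA")
    print(best)
    print(top)
    print(bottom)
```

The sketch fills (n+1) x (m+1) cells with constant work per cell, in contrast to enumerating all alignments. When several alignments share the optimal score, the trace-back returns only one of them, preferring the diagonal move.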