Large-scale sequence comparison: Spaced seeds and suffix arrays
Martin C. Frith
Computational Biology Research Center, AIST, Tokyo

Contents
• Motivation: why is large-scale sequence comparison important?
  – Genome alignment
  – Tag mapping
• Sequence alignment: basic definitions
• Sequence alignment: basic algorithms
  – Smith-Waterman-Gotoh
  – Seed-and-extend (BLAST)
• Spaced seeds
• Suffix arrays
• Unifications

Motivation 1: Genome alignment

What is “genome alignment”?
Find and align similar pieces of the genomes

Human genome  …atttcgatcgaatg---ctattgatcgatttg…
               |||x||||x||x||xxx||||x|||xx||x||
Mouse genome  …attacgatggattgaaactatcgatttatatg…

Uses of genome alignments
The human and chimp genomes are mostly very similar:

Human  …actgctacgtacgtacggcatgctacgtagggcatgca-gctagcatgc…
        ||||||x|||||||||||||||||||||||||||||||x||||||||||
Chimp  …actgcttcgtacgtacggcatgctacgtggggcatgcaggctagcatgc…

So why does this region have so many differences?

Human  …cgtagctacgtacgtagcattatcgtcgtatcgtgctat-gtgatcgta…
        ||||||x|||||x||x|||||||x|||||xxxx||||||x|||||||||
Chimp  …cgtagccacgtaagttgcattatggtcgt----tgctatcgtgatcgta…

It may have a function that “makes us human”

Uses of genome alignments
The human and mouse genomes are mostly quite different:

Human  …actgctacgtacgtacggcatgcta-----ggcatgcatgctagcatgc…
        ||xx||x||||xx|||||xx|x|||xxxxx|||x||x|x||xxx|xx||
Mouse  …acaacttcgta--tacggggttctacgtggggcttgaaagc---cgggc…

So why is this region so similar?
Human  …cgtagctacgtacgtagcattatcgtcgtatcgtgctatagtgatcgta…
        |||||||||||||x||||||||||||||||||xxx||||||||||||||
Mouse  …cgtagctacgtacatagcattatcgtcgtatc---ctatagtgatcgta…

It may have a function essential for life in mammals

Uses of genome alignments

Human    …actgctacgtacgtacggcatgctacgtaggg…
          ||x||x||x|||||x||x||x||x||x||x||
Chicken  …accgcaacatacgtgcgccacgcgacttatgg…

Are these differences random, or is there a pattern?
The genetic code
This region may code for a protein

Summary
• Genome alignment is a powerful way to decode genome sequences

Motivation 2: Tag mapping

What is “tag mapping”?
• “Tag” = short DNA sequence
• “Map” = match it to a genome sequence

Lots of human tags:
  catcgtagctta  catgctgatcgt  agctatgcaatg  cgtacgtagcta

Human genome (with two tags placed at their matching positions):
                cgtacgtagcta   agctatgcaatg
…catgctacgtagctacgtacgtagctagctagctatgcaatgc…

The matches may not be perfect: sequencing errors, genetic variation

Uses of tag mapping
1. Find transcription start sites
2. Find the binding sites in the genome of a particular type of protein
3. Discover genetic variation
4. … (many other uses)

Uses of tag mapping
Find gene start sites
[Figure: extract messenger RNA from cells, cut off the starts of the mRNA, sequence them, and map them to the genome]
Tags: gctacg, aatcat
Genome: …catgctacgtagctacgcaatcattgacgc…

Uses of tag mapping
Find protein-binding sites in the genome

Summary
• Tag generation and mapping is a powerful way to learn about genome biology

Sequence alignment: basic definitions

2 kinds of alignment
• Global alignment
  – Align the whole sequences:

tttgc--tgctagctgcg-acatcaggtatataccc
xx|||xx||x||xxx|||x|||||x||||xxx|x|x
catgctatggta---gcgtacatctggtagctagct

• Local alignment
  – Find and align similar sub-sequences:

ctatcgtcgctaacgtacctactacgtagctagtcgtacgtacac
             |||||x|||||xx||||||
actgatcgtacatcgtacgtacta--tagctatgctgcgcgtatg

• We usually want local alignment
  – Global is a special case of local

Problem definition
• Before developing alignment algorithms, we need a precise definition of the problem

Alignment scores

Score matrix:
      a   c   g   t
  a  +2  -2  -1  -2
  c  -2  +2  -2  -1
  g  -1  -2  +2  -2
  t  -2  -1  -2  +2

Gap score: -3

Example:
   t   a   g   t   t   c   a
   |   |   x   |   x   |   |
   t   a   a   t   -   c   a
  +2  +2  -1  +2  -3  +2  +2

Alignment score = 6

Problem definition: find the alignment with maximum score

More than one alignment
• Find alignments with high scores
  – Not just the highest-scoring one

catgctagctagctagctagctagctagctagcatcgatcgtacgtacgtagctatagcgaactgcgt
atatgctagctacgtagctgatcgtagctatcgatgctagctgatcgtactattatattcggcatctat

• Complication:
  – Small variants of the highest-scoring alignment also have high scores
  – We don’t want lots of minor variants
  – For a precise definition, see:
    • MS Waterman & M Eggert 1987, J Mol Biol. 197:723-8

Gap scores

atgctagcgcgtatgctgtagctgatcgta
||||||||||       ||||||x||||||
atgctagcgc-------gtagctcatcgta
          (gap length 7)

• Linear gap penalty:
  – gap score = −g×7
  – E.g. g=3
• Affine gap penalty:
  – gap score = −(a + b×7)
  – E.g. a=11, b=1
  – We usually use this

Bad scoring schemes
• Not all scoring schemes are reasonable (for local alignment)

      a   c   g   t
  a  +9  -1  -1  -1
  c  -1  +9  -1  -1
  g  -1  -1  +9  -1
  t  -1  -1  -1  +9

• The score should be negative on average
  – Otherwise, even random sequences will produce large alignments!
  – With equal letter frequencies, this matrix averages (4×9 − 12×1)/16 = +1.5 per aligned pair
• Some popular alignment tools get this wrong!

Alignment score statistics
• Even random sequences will have some highest-scoring alignment
• So we need to know whether a particular alignment score is higher than would be expected for random sequences
  – If it is, the alignment is significant
• The score distribution for random sequences can be calculated
  – ftp://ftp.ncbi.nih.gov/pub/spouge/BLAST_Gumbel/

A problem with the definition

catgctagctagctagctagctagctagctagcatcgatcgtacgtacgtagctatagcgaactgcgt
atatgctagctacgtagctgatcgtagctatcgatgctagctgatcgtactattatattcggcatctat

Regions, left to right: Very similar (score=400) | Completely different | Very similar (score=500)

• If the “completely different” part can be aligned with score > −400, then the highest-scoring alignment will include it!
• No-one has a perfect solution

Summary
• Local alignment is usually defined in terms of maximizing a score
• There are various subtleties
• The definition is not perfect

Sequence alignment: basic algorithms

Alignment algorithms
• Exact (Smith-Waterman-Gotoh)
  – Guaranteed to find the highest alignment score
  – Slow
• Heuristic (seed-and-extend)
  – Not guaranteed to find the highest alignment score
  – Fast
  – Avoids the problem of aligning completely different regions

Smith-Waterman-Gotoh algorithm
(first and second sequences on the two axes of the matrix)

       a  c  g  a
   t   0  0  0  0
   a   2  0  0  2
   t   0  1  0  0
   g   0  0  3  1
   c   0  2  1  1
   a   2  0  0  3
   a   2  0  0  2
   t   0  1  0  0
   c   0  2  0  0

• Each cell has the score of the best alignment ending at that point

Smith-Waterman-Gotoh algorithm
• Run time: proportional to product of sequence lengths
• Recent implementations can calculate ~3 billion cells per second
• Human genome × mouse genome:
  – About 3 billion × 3 billion
  – Run time: ~95 CPU years
  – Feasible with 1000 CPUs?

Seed-and-extend algorithms (e.g. BLAST)

catgctagctagctagctagctagctagctagcatcgatcgtacgtacgtagctatagcgaactgcgt
atatgctagctacgtagctgatcgtagctatcgatgctagctgatcgtactattatattcggcatctat

1. Find “seeds”
2. Extend alignments from these seeds

Seeds
• What are “seeds”?
  – The simplest seeds are exact matches of a fixed size (e.g. size = 11)
• Seed finding step:
  – Find all exact matches of size 11

Seed finding algorithm
1. Pre-processing step
  – Record the positions of all 11-letter strings (11-mers) in the first sequence
2. Search step
  – Scan the second sequence: for each 11-mer, look up its positions in the first sequence.
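The two steps of the seed finding algorithm can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular tool: the function names `build_kmer_index` and `find_seeds`, the dictionary-based table, and the short test sequences are all assumptions made for the example, and a seed size of 3 is used instead of 11 to keep the output small.

```python
from collections import defaultdict


def build_kmer_index(seq, k):
    """Pre-processing step: record the start positions of every k-mer in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(index, query, k):
    """Search step: for each k-mer of query, look up its positions in the first sequence.

    Returns (position in first sequence, position in query) pairs.
    """
    hits = []
    for j in range(len(query) - k + 1):
        # .get() avoids inserting empty entries into the defaultdict
        for i in index.get(query[j:j + k], ()):
            hits.append((i, j))
    return hits


# Hypothetical toy input: seed size 3 instead of 11
index = build_kmer_index("atcgatcga", 3)
seeds = find_seeds(index, "ttcgag", 3)
print(dict(index))
print(seeds)
```

A hash table is used here only for brevity; for DNA, a direct-address table indexed by the k-mer's numeric code is a common alternative.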
Example: seed size = 3

First sequence:  a t c g a t c g a
Positions:       0 1 2 3 4 5 6 7 8

Position table (start positions, grouped by 3-mer):
  atc: 0 4
  cga: 2 6
  gat: 3
  tcg: 1 5
  Stored as: 0 4 2 6 3 1 5

3-mer table (where each 3-mer's group starts in the position table):
  aaa  aac  aag  …  atc  atg  att  …  cga  cgc  …
   0    0    0   …   0    2    2   …   2    4   …

Extension

catgctagctagctagctagctagctagctagcatcgatcagggattg
atatgctagctacgtagctgatcgtagctatcgatgctagctgatcgt

• Do Smith-Waterman-Gotoh in a small part of the matrix
• Stop extending when the score drops by a lot
  – “X-drop algorithm”

Three step method
• Find seeds
• Do extension without gaps
  – Reject if the alignment score is very low
  – Much faster than gapped extension
• Do extension with gaps

Choice of seed size
• Too large: low sensitivity
  – A large exact match is required
• Too small: very long running time
  – We get lots of small exact matches just by chance, even in random sequences
  – We have to do lots of extensions

Random seed hits
M = length of the first sequence
N = length of the second sequence
S = seed size

• Possible seed start positions in the first sequence = M-S+1
• A seed hit can occur at (M-S+1) × (N-S+1) positions
• At each position, the match probability is: 1/4^S
=> Expected number of hits = (M-S+1) × (N-S+1) / 4^S

Random seed hits
Length of the human genome ≈ 3 billion
Length of the mouse genome ≈ 3 billion
What seed size should we use?
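To get a feel for the numbers, the expected-hits formula (M-S+1) × (N-S+1) / 4^S can be evaluated directly. The sketch below assumes random sequences with four equally likely letters, as the formula does; the function name `expected_random_hits` and the choice of seed sizes to try are illustrative.

```python
def expected_random_hits(m, n, s, alphabet_size=4):
    """Expected number of chance seed hits between random sequences.

    m, n: sequence lengths; s: seed size.
    Implements (M-S+1) * (N-S+1) / 4^S for a 4-letter alphabet.
    """
    return (m - s + 1) * (n - s + 1) / alphabet_size ** s


GENOME = 3_000_000_000  # approximate human and mouse genome lengths

for seed_size in (12, 16, 20):
    hits = expected_random_hits(GENOME, GENOME, seed_size)
    print(f"seed size {seed_size}: about {hits:.3g} random hits")
```

Each extra letter of seed size divides the number of chance hits by 4, which is why a few letters make the difference between hundreds of billions of extensions and a few million.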
Random seed hits (human × mouse, each ≈ 3 billion letters)

Seed size   Random hits
12          536 billion
16          2 billion
20          8 million

Problems with seed-and-extend
• Limited sensitivity
  – Exact matches are required
• Trouble with repeats
  – Genomes have lots of “repeats”, e.g. tatatatatatatatatatatata
  – These cause (lots)^2 seed hits
  – E.g. human and chimp each have ~1 million repeated Alu elements => (1 million)^2 hits
  – In practice, repeat-masking is essential

Improvements
• Limited sensitivity
  – Spaced seeds
• Trouble with repeats
  – Suffix arrays

Spaced seeds

Quiz question
• We have a random DNA sequence of length 5.
• Which of these words is more likely to occur?
  aaaa
  gatc

What are “spaced seeds”?
• Matches where we ignore certain positions
  – E.g. ignore the 3rd and 6th positions out of 8:

a c a t t g a c
a c c t t c a c

How do spaced seeds help?
• They increase sensitivity in an important special case
  – Protein-coding sequences
• Surprisingly, they increase sensitivity in general too

Spaced seeds for protein-coding sequence
Protein-coding sequence tends to mutate at every 3rd position

…actgctacgtacgtacggcatgctacgtaggg…
 ||x||x||x|||||x||x||x||x||x||x||
…accgcaacatacgtgcgccacgcgacttatgg…

Here, we cannot find any match with un-spaced seeds, but we can with spaced seeds

Spaced seeds in general
• Example: try to detect similarities with length 64, and 70% identity

tatgctagctacctgatatgccatcgactaatgatcgtgtgga
tagctagctactagctagctggatgctagtcgtttctatgtcc

Match probability at each position = 0.7

• Which of these seed patterns is better?
  – Both have 11 match positions
  [Figure: two seed patterns, an un-spaced seed spanning 11 positions and a spaced seed spanning 18 positions]

Spaced seed example
• Expected number of hits:
  – Un-spaced seed (span 11): (64-11+1) × 0.7^11 = 1.07
  – Spaced seed (span 18): (64-18+1) × 0.7^11 = 0.93
• Probability of getting a seed hit:
  – Hard to calculate, but can be done with a dynamic programming algorithm
  – Un-spaced seed: = 0.30
  – Spaced seed: = 0.466
• So the spaced seed is more sensitive!

Analogy: word counts
• We have a random DNA sequence of length 5.
• Which of these words is more likely to occur?
  aaaa
  gatc
• Expected number of occurrences
  – gatc: (5 - 4 + 1) × (1/4)^4 = 1/128
  – aaaa: (5 - 4 + 1) × (1/4)^4 = 1/128

Analogy: word counts (aaaa)
• Sequences where aaaa occurs twice: aaaaa
• Sequences where aaaa occurs once: aaaac aaaag aaaat caaaa gaaaa taaaa
  Prob(aaaa occurs twice): 1 / 4^5
  Prob(aaaa occurs once):  6 / 4^5
  Prob(aaaa occurs):       7 / 4^5

Analogy: word counts (gatc)
• Sequences where gatc occurs twice: none
• Sequences where gatc occurs once: gatca gatcc gatcg gatct agatc cgatc ggatc tgatc
  Prob(gatc occurs twice): 0
  Prob(gatc occurs once):  8 / 4^5
  Prob(gatc occurs):       8 / 4^5

Analogy: word counts
• Surprise! gatc is more likely to occur
  – Because it cannot overlap itself

Suffix arrays

What is a “suffix array”?

A sequence:                        a c g a t c g a t g
Positions in the sequence:         0 1 2 3 4 5 6 7 8 9
Positions sorted alphabetically
(by the suffix starting there):    0 3 7 1 5 9 2 6 4 8

This is a suffix array

What can you do with a suffix array?
• Find matches by binary search:

A sequence:                        a c g a t c g a t g
Positions in the sequence:         0 1 2 3 4 5 6 7 8 9
Positions sorted alphabetically:   0 3 7 1 5 9 2 6 4 8

Query sequence: a t c
  – Suffixes starting with “a”:   positions 0 3 7
  – Suffixes starting with “at”:  positions 3 7
  – Suffixes starting with “atc”: position 3

What use are suffix arrays?
• Earlier, we saw how to find 3-mer matches using lookup tables
• Binary search is slower than lookup
  – So what use are suffix arrays?
• Suffix arrays are more flexible:
  – With a suffix array, we can find matches of any size
  – With a lookup table, we can only find matches of one size, fixed in advance
  – Lookup tables do not work for large match sizes, because they need 4^S units of memory

Suffix arrays and repeats

Sequence: acgattgtgtgtgtgaacgatgcctagtgtg
Suffix array (as drawn on the slide): 12 0 11 3 16 7 13 20 2 19 18 5 9 2 6 4 14 8
Query sequence: tgtgtgtgtgtg
  – The search narrows the range of matching suffixes one letter at a time: “t”, then “tg”, then “tgt”, …

• We can adapt the match size to the repetitiveness: use larger matches for repeats

Suffix array construction
• Any sorting algorithm can be used
  – E.g. radix sort
  – But it might be slow: worse-than-linear running time
• There are specialized suffix array construction algorithms
  – Running time linearly proportional to the sequence length

Unifications

Seed finding vs. suffix array

First sequence:  a t c g a t c g a
Positions:       0 1 2 3 4 5 6 7 8

Position table:
  atc: 0 4
  cga: 2 6
  gat: 3
  tcg: 1 5

3-mer table:
  aaa  aac  aag  …  atc  atg  att  …  cga  cgc  …
   0    0    0   …   0    2    2   …   2    4   …

This is almost a suffix array! We just need to finish sorting it

Hybrid method

First sequence:  a t c g a t c g a
Positions:       0 1 2 3 4 5 6 7 8

Suffix array (positions within each 3-mer group now fully sorted):
  atc: 4 0
  cga: 6 2
  gat: 3
  tcg: 5 1

3-mer table:
  aaa  aac  aag  …  atc  atg  att  …  cga  cgc  …
   0    0    0   …   0    2    2   …   2    4   …

• Use lookup for short matches
• Use binary search for long matches
• Best of both worlds

Spaced seeds & suffix arrays
• We can make a “spaced suffix array”:

Sequence:   a t g c t a g c c
Positions:  0 1 2 3 4 5 6 7 8

Spaced suffixes:
  0  at_ct_gc_
  1  tg_ta_cc
  2  gc_ag_c
  3  ct_gc_
  4  ta_cc
  5  ag_c
  6  gc_
  7  cc
  8  c

Sorted suffixes:
  5  ag_c
  0  at_ct_gc_
  8  c
  7  cc
  3  ct_gc_
  6  gc_
  2  gc_ag_c
  4  ta_cc
  1  tg_ta_cc

An implementation
• LAST
  – http://last.cbrc.jp
• Sequence comparison & alignment
  – Seed-and-extend, similar to BLAST
  – Uses spaced suffix array to find seeds

Example
• Compare the human and mouse genomes
  – Took about 1 day on 1 CPU
• The next figure is a dot-plot
  – Human chromosomes: along the top
  – Mouse chromosomes: down the side

Conclusions
• Sequence comparison is arguably the most fundamental task in computational biology
• Usual problem definition: maximum score
• Exact algorithm: Smith-Waterman-Gotoh
• Heuristic approach: seed-and-extend
• Spaced seeds improve sensitivity
• Suffix arrays allow more flexible seed-finding
• These seed-finding methods can be unified

Looking for a job? http://www.cbrc.jp/
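As a closing illustration, the spaced suffix array from the “Spaced seeds & suffix arrays” slide can be reproduced with a short Python sketch. This is not LAST's implementation: the function names, the pattern string "110" (match, match, ignore, repeating, inferred from the slide's at_ct_gc_ example), and the naive sort on materialized keys are all assumptions chosen for clarity rather than speed.

```python
import bisect


def spaced_key(seq, start, pattern):
    """Suffix of seq starting at `start`, with ignored positions shown as '_'.

    pattern is a cyclic string of '1' (must match) and '0' (ignored).
    """
    return "".join(
        c if pattern[i % len(pattern)] == "1" else "_"
        for i, c in enumerate(seq[start:])
    )


def spaced_suffix_array(seq, pattern="110"):
    """Positions of seq, sorted by their spaced suffixes (naive sort)."""
    return sorted(range(len(seq)), key=lambda i: spaced_key(seq, i, pattern))


def find_spaced(seq, sa, query, pattern="110"):
    """Binary search for positions whose spaced suffix starts with the spaced query.

    The key list is rebuilt on every call for simplicity; a real tool would
    compare against the sequence in place instead.
    """
    q = spaced_key(query, 0, pattern)
    keys = [spaced_key(seq, i, pattern)[:len(q)] for i in sa]
    lo = bisect.bisect_left(keys, q)
    hi = bisect.bisect_right(keys, q)
    return sorted(sa[lo:hi])


seq = "atgctagcc"
sa = spaced_suffix_array(seq)
print(sa)
# The query differs from the sequence at an ignored position, yet still matches
print(find_spaced(seq, sa, "gcaag"))
```

With pattern "1" (no ignored positions), the same functions give an ordinary suffix array, which is one way to see that the spaced version is a generalization rather than a separate data structure.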