Large-scale sequence comparison: Spaced seeds and suffix arrays

Martin C. Frith
Computational Biology Research Center,
AIST, Tokyo
Contents
• Motivation: why is large-scale sequence
comparison important?
– Genome alignment
– Tag mapping
• Sequence alignment: basic definitions
• Sequence alignment: basic algorithms
– Smith-Waterman-Gotoh
– Seed-and-extend (BLAST)
• Spaced seeds
• Suffix arrays
• Unifications
Motivation 1:
Genome alignment
What is “genome alignment”?
Human genome
Mouse genome
Find and align similar
pieces of the genomes
…atttcgatcgaatg---ctattgatcgatttg…
|||x||||x||x||xxx||||x|||xx||x||
…attacgatggattgaaactatcgatttatatg…
Uses of genome alignments
The human and chimp genomes are mostly very similar:
Human
Chimp
…actgctacgtacgtacggcatgctacgtagggcatgca-gctagcatgc…
||||||x|||||||||||||||||||||||||||||||x||||||||||
…actgcttcgtacgtacggcatgctacgtggggcatgcaggctagcatgc…
So why does this region have so many differences?
Human
Chimp
…cgtagctacgtacgtagcattatcgtcgtatcgtgctat-gtgatcgta…
||||||x|||||x||x|||||||x|||||xxxx||||||x|||||||||
…cgtagccacgtaagttgcattatggtcgt----tgctatcgtgatcgta…
It may have a function that “makes us human”
Uses of genome alignments
The human and mouse genomes are mostly quite different:
Human
Mouse
…actgctacgtacgtacggcatgcta-----ggcatgcatgctagcatgc…
||xx||x||||xx|||||xx|x|||xxxxx|||x||x|x||xxx|xx||
…acaacttcgta--tacggggttctacgtggggcttgaaagc---cgggc…
So why is this region so similar?
Human
Mouse
…cgtagctacgtacgtagcattatcgtcgtatcgtgctatagtgatcgta…
|||||||||||||x||||||||||||||||||xxx||||||||||||||
…cgtagctacgtacatagcattatcgtcgtatc---ctatagtgatcgta…
It may have a function essential for life in mammals
Uses of genome alignments
Human
Chicken
…actgctacgtacgtacggcatgctacgtaggg…
||x||x||x|||||x||x||x||x||x||x||
…accgcaacatacgtgcgccacgcgacttatgg…
Are these differences random, or is there a pattern?
The genetic code
Uses of genome alignments
Human
Chicken
…actgctacgtacgtacggcatgctacgtaggg…
||x||x||x|||||x||x||x||x||x||x||
…accgcaacatacgtgcgccacgcgacttatgg…
This region may code for a protein
Summary
• Genome alignment is a powerful way to
decode genome sequences
Motivation 2:
Tag mapping
What is “tag mapping”?
• “Tag” = short DNA sequence
• “Map” = match it to a genome sequence
Lots of human tags:
catcgtagctta
catgctgatcgt
agctatgcaatg
cgtacgtagcta
The matches may not be perfect:
sequencing errors,
genetic variation
Human genome:
cgtacgtagcta
agctatgcaatg
…catgctacgtagctacgtacgtagctagctagctatgcaatgc…
Uses of tag mapping
1. Find transcription start sites
2. Find the binding sites in the genome
of a particular type of protein
3. Discover genetic variation
4. … (many other uses)
Uses of tag mapping
Find gene start sites:
Cells → extract messenger RNA → cut off the starts of the mRNA → sequence, map to genome
gctacg
aatcat
…catgctacgtagctacgcaatcattgacgc…
Uses of tag mapping
Find protein-binding
sites in the genome
Summary
• Tag generation and mapping is a
powerful way to learn about genome
biology
Sequence alignment:
basic definitions
2 kinds of alignment
• Global alignment
– Align the whole sequences:
tttgc--tgctagctgcg-acatcaggtatataccc
xx|||xx||x||xxx|||x|||||x||||xxx|x|x
catgctatggta---gcgtacatctggtagctagct
• Local alignment
– Find and align similar sub-sequences:
ctatcgtcgctaacgtacctactacgtagctagtcgtacgtacac
             |||||x|||||xx||||||
actgatcgtacatcgtacgtacta--tagctatgctgcgcgtatg
• We usually want local alignment
– Global is a special case of local
Problem definition
• Before developing alignment
algorithms, we need a precise definition
of the problem
ctatcgtcgctaacgtacctactacgtagctagtcgtacgtacac
             |||||x|||||xx||||||
actgatcgtacatcgtacgtacta--tagctatgctgcgcgtatg
Alignment scores
Score matrix:
     a   c   g   t
a   +2  -2  -1  -2
c   -2  +2  -2  -1
g   -1  -2  +2  -2
t   -2  -1  -2  +2
Gap score: -3
Example:
 t  a  g  t  t  c  a
 |  |  x  |  x  |  |
 t  a  a  t  -  c  a
+2 +2 -1 +2 -3 +2 +2
Alignment score = 6
Problem definition: find the alignment with maximum score
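The scoring rule above can be checked with a few lines of Python. This is a minimal sketch: the matrix and gap score are copied from the slide, while the names `SCORES`, `GAP_SCORE` and `alignment_score` are made up for illustration.

```python
# Substitution scores and gap score as given on the slide.
SCORES = {
    "a": {"a": 2, "c": -2, "g": -1, "t": -2},
    "c": {"a": -2, "c": 2, "g": -2, "t": -1},
    "g": {"a": -1, "c": -2, "g": 2, "t": -2},
    "t": {"a": -2, "c": -1, "g": -2, "t": 2},
}
GAP_SCORE = -3


def alignment_score(top, bottom, scores=SCORES, gap=GAP_SCORE):
    """Sum the column scores of a gapped alignment ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        # A column with a gap gets the gap score; otherwise look up the pair.
        total += gap if "-" in (x, y) else scores[x][y]
    return total


print(alignment_score("tagttca", "taat-ca"))  # the slide's example: 6
```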
More than one alignment
• Find alignments with high scores
– Not just the highest-scoring one
catgctagctagctagctagctagctagctagcatcgatcgtacgtacgtagctatagcgaactgcgt
atatgctagctacgtagctgatcgtagctatcgatgctagctgatcgtactattatattcggcatctat
• Complication:
– Small variants of the highest-scoring alignment
also have high scores
– We don’t want lots of minor variants
– For a precise definition, see:
• MS Waterman & M Eggert 1987, J Mol Biol. 197:723-8
Gap scores
atgctagcgcgtatgctgtagctgatcgta
||||||||||       ||||||x||||||
atgctagcgc-------gtagctcatcgta
(gap length = 7)
• Linear gap penalty:
– gap score = −g×7
– E.g. g=3
• Affine gap penalty:
– gap score = −(a + b×7)
– E.g. a=11, b=1
– We usually use this
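The two gap-scoring rules can be compared directly. A small sketch using the slide's example parameters (g=3, a=11, b=1); the function names are illustrative only.

```python
def linear_gap_score(length, g=3):
    """Linear penalty: every gap position costs the same."""
    return -g * length


def affine_gap_score(length, a=11, b=1):
    """Affine penalty: a fixed cost to open a gap plus a per-position cost."""
    return -(a + b * length)


for length in (1, 7, 20):
    print(length, linear_gap_score(length), affine_gap_score(length))
```

With these parameters, a gap of length 1 costs more under the affine rule (−12 versus −3), while the gap of length 7 costs less (−18 versus −21): opening a gap is expensive but extending it is cheap.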
Bad scoring schemes
• Not all scoring schemes are reasonable (for
local alignment)
     a   c   g   t
a   +9  -1  -1  -1
c   -1  +9  -1  -1
g   -1  -1  +9  -1
t   -1  -1  -1  +9
• The score should be negative on average
– Otherwise, even random sequences will produce
large alignments!
• Some popular alignment tools get this wrong!
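One way to test a score matrix is to compute the average score of one aligned pair of random letters. The sketch below assumes uniform letter frequencies (1/4 each), which is a simplification; `expected_pair_score` is an illustrative name.

```python
def expected_pair_score(scores, freqs=None):
    """Average score of one aligned pair of independent random letters."""
    letters = sorted(scores)
    if freqs is None:
        # Assumption: all letters are equally likely.
        freqs = {x: 1 / len(letters) for x in letters}
    return sum(freqs[x] * freqs[y] * scores[x][y]
               for x in letters for y in letters)


def simple_matrix(match, mismatch, letters="acgt"):
    """Matrix with one score for matches and one for mismatches."""
    return {x: {y: match if x == y else mismatch for y in letters}
            for x in letters}


bad = simple_matrix(9, -1)  # the +9/-1 matrix on this slide
print(expected_pair_score(bad))  # positive, so random sequences align
```

For the +9/−1 matrix the average is (4×9 − 12×1)/16 = +1.5, which is positive. For the earlier +2/−2/−1 matrix the same calculation gives −0.75.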
Alignment score statistics
• Even random sequences will have some
highest-scoring alignment
• So we need to know whether a particular
alignment score is higher than would be
expected for random sequences
– If it is, the alignment is significant
• The score distribution for random sequences
can be calculated
– ftp://ftp.ncbi.nih.gov/pub/spouge/BLAST_Gumbel/
A problem with the definition
catgctagctagctagctagctagctagctagcatcgatcgtacgtacgtagctatagcgaactgcgt
atatgctagctacgtagctgatcgtagctatcgatgctagctgatcgtactattatattcggcatctat
[Very similar (score=400)] [Completely different] [Very similar (score=500)]
• If the “completely different” part can be
aligned with score > −400, then the highest-scoring alignment will include it!
• No-one has a perfect solution
Summary
• Local alignment is usually defined in
terms of maximizing a score
• There are various subtleties
• The definition is not perfect
Sequence alignment:
basic algorithms
Alignment algorithms
• Exact (Smith-Waterman-Gotoh)
– Guaranteed to find the highest alignment score
– Slow
• Heuristic (seed-and-extend)
– Not guaranteed to find the highest alignment score
– Fast
– Avoids the problem of aligning completely different
regions
Smith-Waterman-Gotoh
algorithm
First sequence
Second
sequence
     a  c  g  a
t    0  0  0  0
a    2  0  0  2
t    0  1  0  0
g    0  0  3  1
c    0  2  1  1
a    2  0  0  3
a    2  0  0  2
t    0  1  0  0
c    0  2  0  0
• Each cell has the score of the best alignment
ending at that point
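The matrix-filling idea can be sketched as follows. This is a simplified Smith-Waterman with a linear gap score, not Gotoh's affine-gap version, and the match, mismatch and gap values are assumptions chosen for illustration (they are not the scores used for the matrix on the slide).

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-3):
    """Local alignment score with linear gaps.

    Returns (best score, matrix). Cell H[i][j] holds the score of the
    best local alignment ending at seq2[i-1] and seq1[j-1].
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new alignment here
                H[i - 1][j - 1] + s,  # extend with an aligned pair
                H[i - 1][j] + gap,    # gap in seq1
                H[i][j - 1] + gap,    # gap in seq2
            )
            best = max(best, H[i][j])
    return best, H


score, matrix = smith_waterman("tacgt", "gacga")
print(score)
```

The two nested loops visit every cell once, which is why the run time is proportional to the product of the sequence lengths.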
Smith-Waterman-Gotoh
algorithm
• Run time: proportional to product of
sequence lengths
• Recent implementations can calculate
~3 billion cells per second
• Human genome × mouse genome:
– About 3 billion × 3 billion
– Run time: ~95 CPU years
– Feasible with 1000 CPUs?
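The run-time estimate is simple arithmetic, sketched here with the slide's round numbers (`sw_cpu_years` is an illustrative name).

```python
def sw_cpu_years(len1, len2, cells_per_second):
    """CPU years needed to fill a len1 x len2 alignment matrix."""
    seconds = len1 * len2 / cells_per_second
    return seconds / (365.25 * 24 * 3600)


# 3 billion x 3 billion cells at 3 billion cells per second.
print(round(sw_cpu_years(3e9, 3e9, 3e9)))  # about 95
```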
Seed-and-extend algorithms
(e.g. BLAST)
1. Find “seeds”:
catgctagctagctagctagctagctagctagcatcgatcgtacgtacgtagctatagcgaactgcgt
atatgctagctacgtagctgatcgtagctatcgatgctagctgatcgtactattatattcggcatctat
2. Extend alignments from these seeds:
catgctagctagctagctagctagctagctagcatcgatcgtacgtacgtagctatagcgaactgcgt
atatgctagctacgtagctgatcgtagctatcgatgctagctgatcgtactattatattcggcatctat
Seeds
• What are “seeds”?
– The simplest seeds are exact matches of a
fixed size (e.g. size = 11)
• Seed finding step:
– Find all exact matches of size 11
Seed finding algorithm
1. Pre-processing step
– Record the positions of all 11-letter
strings (11-mers) in the first sequence
2. Search step
– Scan the second sequence:
for each 11-mer, look up its positions in
the first sequence.
Example: seed size = 3
First sequence: a t c g a t c g a
Positions:      0 1 2 3 4 5 6 7 8
3-mer table (start index in the position table):
aaa  0
aac  0
aag  0
…
atc  0
atg  2
att  2
…
cga  2
cgc  4
…
Position table (grouped by 3-mer):
atc: 0 4
cga: 2 6
gat: 3
tcg: 1 5
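The two steps of seed finding can be sketched in Python. For brevity this uses a dictionary of lists rather than the two flat arrays shown on the slide; the function names are illustrative.

```python
from collections import defaultdict


def build_kmer_index(seq, k):
    """Pre-processing: record the start positions of every k-mer in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(index, query, k):
    """Search: for each k-mer of the query, look up its positions.

    Returns (position in first sequence, position in query) pairs.
    """
    hits = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            hits.append((i, j))
    return hits


index = build_kmer_index("atcgatcga", 3)
print(index["atc"], index["cga"], index["gat"], index["tcg"])
print(find_seeds(index, "ggatc", 3))
```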
Extension
catgctagctagctagctagctagctagctagcatcgatcagggattg
atatgctagctacgtagctgatcgtagctatcgatgctagctgatcgt
Do Smith-Waterman-Gotoh
in a small part of the matrix
catgctagctagctagctagctagctagctagcatcgatcagggattg
atatgctagctacgtagctgatagctat
Stop extending when
the score drops by a lot
“X-drop algorithm”
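The stopping rule can be illustrated with the simpler, ungapped case. This sketch extends to the right only and does not do the gapped (Smith-Waterman-Gotoh) extension described on the slide; the scores and the `x_drop` default are assumptions.

```python
def xdrop_extend_right(seq1, seq2, i, j, x_drop=5, match=2, mismatch=-3):
    """Ungapped extension to the right from seq1[i] and seq2[j].

    Stops when the running score falls more than x_drop below the best
    score seen so far. Returns (best score, length of best extension).
    """
    score = best = 0
    best_len = 0
    n = 0
    while i + n < len(seq1) and j + n < len(seq2):
        score += match if seq1[i + n] == seq2[j + n] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score > x_drop:
            break  # the score dropped by a lot: give up
    return best, best_len


# Eight matching letters followed by unrelated sequence.
print(xdrop_extend_right("acgtacgt" + "a" * 10, "acgtacgt" + "c" * 10, 0, 0))
```

A single mismatch does not stop the extension, because the drop (3) is below the threshold; a run of mismatches does.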
Three step method
• Find seeds
• Do extension without gaps
– Reject if the alignment score is very low
– Much faster than gapped extension
• Do extension with gaps
Choice of seed size
• Too large: low sensitivity
– A large exact match is required
• Too small: very long running time
– We get lots of small exact matches just by
chance, even in random sequences
– We have to do lots of extensions
Random seed hits
M = length of the first sequence
N = length of the second sequence
S = seed size
catgctatcgtatgtagctagctagctagctagctagctagcatcgatcagggattg
(a sequence of length M, containing a seed of size S)
Possible seed start positions = M-S+1
A seed hit can occur at (M-S+1) × (N-S+1) positions
At each position, the match probability is: 1/4^S
=> Expected number of hits = (M-S+1) × (N-S+1) / 4^S
Random seed hits
Length of the human genome ≈ 3 billion
Length of the mouse genome ≈ 3 billion
What seed size should we use?
Seed size    Random hits
12           536 billion
16           2 billion
20           8 million
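The table can be reproduced from the expected-hits formula. A small sketch, assuming uniformly random DNA and exactly 3 billion letters per genome (`expected_random_hits` is an illustrative name).

```python
def expected_random_hits(m, n, s, alphabet_size=4):
    """Expected number of chance seed hits between random sequences."""
    return (m - s + 1) * (n - s + 1) / alphabet_size ** s


genome = 3_000_000_000
for s in (12, 16, 20):
    print(s, f"{expected_random_hits(genome, genome, s):.3g}")
```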
Problems with
seed-and-extend
• Limited sensitivity
– Exact matches are required
• Trouble with repeats
– Genomes have lots of “repeats”, e.g.
tatatatatatatatatatatata
– These cause (lots)^2 seed hits
– E.g. human and chimp each have ~1 million
repeated Alu elements => (1 million)^2 hits
– In practice, repeat-masking is essential
Improvements
• Limited sensitivity
–Spaced seeds
• Trouble with repeats
–Suffix arrays
Spaced seeds
Quiz question
• We have a random DNA sequence
of length 5.
• Which of these words is more likely to occur?
aaaa
gatc
What are “spaced seeds”?
• Matches where we ignore certain positions
– E.g. ignore the 3rd and 6th positions out of 8:
a c a t t g a c
a c c t t c a c
How do spaced seeds help?
• They increase sensitivity in an important
special case
– Protein-coding sequences
• Surprisingly, they increase sensitivity in
general too
Spaced seeds for
protein-coding sequence
Protein-coding sequence tends to mutate at every
3rd position
…actgctacgtacgtacggcatgctacgtaggg…
||x||x||x|||||x||x||x||x||x||x||
…accgcaacatacgtgcgccacgcgacttatgg…
Here, we cannot find any match with un-spaced
seeds, but we can with spaced seeds
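This claim can be checked on the example above. The sketch compares an un-spaced pattern with a spaced pattern that ignores every 3rd position; both patterns have 8 match positions and were chosen for illustration (they are not from the slides). For simplicity it only looks for hits at the same offset in both sequences.

```python
def seed_hits(seq1, seq2, pattern):
    """Start positions where seq1 and seq2 agree at every '1' of pattern.

    Positions marked '0' in the pattern are ignored.
    """
    care = [k for k, c in enumerate(pattern) if c == "1"]
    limit = min(len(seq1), len(seq2)) - len(pattern) + 1
    return [i for i in range(limit)
            if all(seq1[i + k] == seq2[i + k] for k in care)]


human = "actgctacgtacgtacggcatgctacgtaggg"
chicken = "accgcaacatacgtgcgccacgcgacttatgg"
print(seed_hits(human, chicken, "11111111"))      # un-spaced
print(seed_hits(human, chicken, "110110110110"))  # spaced
```

The un-spaced pattern finds nothing, because the longest run of matches is 5 letters, while the spaced pattern hits at every third position from 0 to 18.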
Spaced seeds in general
• Example: try to detect similarities with length
64, and 70% identity
64
tatgctagctacctgatatgccatcgactaatgatcgtgtgga
tagctagctactagctagctggatgctagtcgtttctatgtcc
Match probability at each position = 0.7
• Which of these seed patterns is better?
Both have 11
match positions
Spaced seed example
Un-spaced seed (span 11) in a similarity of length 64:
tatgctagctacctgatatgccatcgactaatgatcgtgtgga
tagctagctactagctagctggatgctagtcgtttctatgtcc
Match probability at each position = 0.7
• Probability of getting a seed hit:
– Hard to calculate
• Expected number of hits:
– (64-11+1) × 0.7^11 = 1.07
Spaced seed example
Spaced seed (span 18, 11 match positions) in a similarity of length 64:
tatgctagctacctgatatgccatcgactaatgatcgtgtgga
tagctagctactagctagctggatgctagtcgtttctatgtcc
Match probability at each position = 0.7
• Probability of getting a seed hit:
– Hard to calculate
• Expected number of hits:
– (64-18+1) × 0.7^11 = 0.93
Spaced seed example
Un-spaced seed (span 11) vs. spaced seed (span 18) in a similarity of length 64:
tatgctagctacctgatatgccatcgactaatgatcgtgtgga
tagctagctactagctagctggatgctagtcgtttctatgtcc
Match probability at each position = 0.7
• Probability of getting a seed hit:
– Hard to calculate, but can be done with a dynamic
programming algorithm
– Un-spaced seed: = 0.30
– Spaced seed: = 0.466
• So the spaced seed is more sensitive!
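One possible form of such a dynamic programming algorithm is sketched below. It tracks the probability of "no hit so far" for each possible recent history of matches and mismatches. This is an illustrative formulation, not necessarily the one used to produce the numbers above; the pattern notation ('1' = must match, '0' = ignored) is an assumption.

```python
def hit_probability(pattern, length, p):
    """Probability of at least one seed hit in `length` aligned columns.

    Each column matches independently with probability p. The pattern
    must start with '1'.
    """
    if not pattern.startswith("1"):
        raise ValueError("pattern must start with '1'")
    span = len(pattern)
    seed = int(pattern, 2)        # bitmask of required match positions
    keep = (1 << (span - 1)) - 1  # remember only the last span-1 columns
    no_hit = {0: 1.0}             # history -> probability of no hit so far
    for _ in range(length):
        nxt = {}
        for state, prob in no_hit.items():
            for bit, q in ((1, p), (0, 1.0 - p)):
                window = (state << 1) | bit  # the last `span` columns
                if (window & seed) == seed:
                    continue  # seed hit: leaves the "no hit" set
                key = window & keep
                nxt[key] = nxt.get(key, 0.0) + prob * q
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


print(hit_probability("11", 4, 0.7))   # un-spaced, 2 match positions
print(hit_probability("101", 4, 0.7))  # spaced, 2 match positions
```

The number of histories grows as 2^(span-1), so for a span of 18 this pure-Python version has up to 2^17 states per column and may take a while. The spaced pattern on the slide was a figure and is not recoverable from the text; a widely cited pattern with 11 match positions and span 18 is 111010010100110111, but whether the slide used it is an assumption.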
Analogy: word counts
• We have a random DNA sequence
of length 5.
• Which of these words is more likely to occur?
aaaa
gatc
• Expected number of occurrences
– gatc: (5 - 4 + 1) × (1/4)^4 = 1/128
– aaaa: (5 - 4 + 1) × (1/4)^4 = 1/128
Analogy: word counts
• We have a random DNA sequence
of length 5.
Sequences where aaaa occurs twice:
aaaaa
Sequences where aaaa occurs once:
aaaac aaaag aaaat caaaa gaaaa taaaa
Prob(aaaa occurs twice): 1 / 4^5
Prob(aaaa occurs once): 6 / 4^5
Prob(aaaa occurs): 7 / 4^5
Analogy: word counts
• We have a random DNA sequence
of length 5.
Sequences where gatc occurs once:
gatca gatcc gatcg gatct agatc cgatc ggatc tgatc
Prob(gatc occurs twice): 0
Prob(gatc occurs once): 8 / 4^5
Prob(gatc occurs): 8 / 4^5
Analogy: word counts
• We have a random DNA sequence
of length 5.
• Which of these words is more likely to occur?
aaaa
gatc
• Surprise! gatc is more likely to occur
– Because it cannot overlap itself
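The counts can be confirmed by brute force, since there are only 4^5 = 1024 sequences of length 5. A short sketch (the function name is illustrative):

```python
from itertools import product


def count_sequences_containing(word, length, alphabet="acgt"):
    """Number of sequences of the given length that contain word."""
    return sum(word in "".join(letters)
               for letters in product(alphabet, repeat=length))


print(count_sequences_containing("aaaa", 5))  # 7 of 1024
print(count_sequences_containing("gatc", 5))  # 8 of 1024
```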
Suffix arrays
What is a “suffix array”?
A sequence:
a c g a t c g a t g
Positions in the sequence:
0 1 2 3 4 5 6 7 8 9
Positions sorted alphabetically:
0 3 7 1 5 9 2 6 4 8
This is a suffix array
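In code, "positions sorted alphabetically" means sorting the start positions by the suffix that begins at each one. A minimal sketch using Python's built-in sort, which is simple but not efficient for genome-sized sequences:

```python
def suffix_array(seq):
    """Start positions of all suffixes, sorted by the suffix text."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])


seq = "acgatcgatg"
for pos in suffix_array(seq):
    print(pos, seq[pos:])
```

For this sequence the sorted order is 0 3 7 1 5 9 2 6 4 8: a suffix that is a prefix of another (such as "g" versus "gatcgatg") sorts first.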
What can you do with a suffix
array?
• Find matches by binary search:
A sequence:
a c g a t c g a t g
Positions in the sequence:
0 1 2 3 4 5 6 7 8 9
Positions sorted alphabetically:
0 3 7 1 5 9 2 6 4 8
Query sequence:
a t c
– Suffixes starting with “a”: positions 0 3 7
– Suffixes starting with “at”: positions 3 7
– Suffixes starting with “atc”: position 3
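The binary search can be sketched as two bisections, one for each end of the block of suffixes that start with the query. The suffix array is built here by plain sorting so that the example is self-contained; the function names are illustrative.

```python
def suffix_array(seq):
    """Start positions of all suffixes, sorted by the suffix text."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def find_matches(seq, sa, query):
    """Sorted start positions of query in seq, found by binary search."""
    m = len(query)
    # First suffix whose m-letter prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First suffix whose m-letter prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


seq = "acgatcgatg"
sa = suffix_array(seq)
for query in ("a", "at", "atc"):
    print(query, find_matches(seq, sa, query))
```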
What use are suffix arrays?
• Earlier, we saw how to find 3-mer matches
using lookup tables
• Binary search is slower than lookup
– So what use are suffix arrays?
• Suffix arrays are more flexible:
– With a suffix array, we can find matches of any
size
– With a lookup table, we can only find matches of
one size, fixed in advance
– Lookup tables do not work for large match sizes,
because they need 4^S units of memory
Suffix arrays and repeats
Sequence: acgattgtgtgtgtgaacgatgcctagtgtg
Suffix array: 12 0 11 3 16 7 13 20 2 19 18 5 9 2 6 4 14 8
Query sequence: TGTgtgtgtgtg
Matched so far: t, tg, tgt
• We can adapt the match size to the
repetitiveness: use larger matches for repeats
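One way to adapt the match size is to lengthen the match until it occurs at most some number of times. The sketch below does this with a suffix array built by plain sorting; the `max_hits` threshold and all function names are assumptions for illustration, not a description of any particular tool.

```python
def suffix_array(seq):
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def match_range(seq, sa, query):
    """Half-open range of suffix-array entries that start with query."""
    m = len(query)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


def adaptive_seed(seq, sa, query, start, max_hits):
    """Shortest match from query[start] occurring at most max_hits times.

    Returns (match length, sorted positions in seq), or None if the
    query ends before the match becomes rare enough.
    """
    for length in range(1, len(query) - start + 1):
        lo, hi = match_range(seq, sa, query[start:start + length])
        if hi - lo <= max_hits:
            return length, sorted(sa[lo:hi])
    return None


seq = "acgattgtgtgtgtgaacgatgcctagtgtg"
sa = suffix_array(seq)
print(adaptive_seed(seq, sa, "tgtgtgtgtgtg", 0, 2))
```

In this repetitive example, "t" occurs 10 times and "tgt" 5 times, so a limit of 2 hits forces the match to grow to 7 letters.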
Suffix array construction
• Any sorting algorithm can be used
– E.g. radix sort
– But it might be slow: worse-than-linear running
time
• There are specialized suffix array
construction algorithms
– Running time linearly proportional to the sequence
length
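As a middle ground between plain sorting and the specialized linear-time algorithms, here is a sketch of the classic "prefix doubling" approach. It is still worse than linear, but it avoids comparing whole suffixes. It is shown only as an illustration and is not one of the linear-time algorithms mentioned above.

```python
def suffix_array_doubling(seq):
    """Suffix array by prefix doubling: sort by the first 1, 2, 4, ... letters."""
    n = len(seq)
    if n == 0:
        return []
    rank = [ord(c) for c in seq]  # rank by first letter
    sa = list(range(n))
    k = 1
    while True:
        def key(i):
            # Rank of the first k letters, then of the next k letters
            # (-1 if the suffix is too short to have any).
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: fully sorted
            return sa
        k *= 2


print(suffix_array_doubling("acgatcgatg"))
```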
Unifications
Seed finding vs. suffix array
First sequence: a t c g a t c g a
Positions:      0 1 2 3 4 5 6 7 8
3-mer table (start index in the position table):
aaa  0
aac  0
aag  0
…
atc  0
atg  2
att  2
…
cga  2
cgc  4
…
Position table (grouped by 3-mer):
atc: 0 4
cga: 2 6
gat: 3
tcg: 1 5
This is almost a suffix array!
We just need to finish sorting it
Hybrid method
First sequence: a t c g a t c g a
Positions:      0 1 2 3 4 5 6 7 8
3-mer table (start index in the suffix array):
aaa  0
aac  0
aag  0
…
atc  0
atg  2
att  2
…
cga  2
cgc  4
…
Suffix array (grouped by 3-mer):
atc: 4 0
cga: 6 2
gat: 3
tcg: 5 1
Use lookup for short matches
Use binary search for long matches
Best of both worlds
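The hybrid idea can be sketched as a k-mer table whose entries point to ranges of a suffix array. A query first jumps to its k-mer's range by lookup, then binary search narrows the range for longer matches. This sketch uses a dictionary instead of a flat table, and all names are illustrative.

```python
def build_hybrid_index(seq, k):
    """Suffix array of suffixes with at least k letters, plus a k-mer table.

    The table maps each k-mer to its half-open range in the suffix array.
    """
    sa = sorted(range(len(seq) - k + 1), key=lambda i: seq[i:])
    table = {}
    for rank, pos in enumerate(sa):
        kmer = seq[pos:pos + k]
        first, _ = table.get(kmer, (rank, rank))
        table[kmer] = (first, rank + 1)
    return sa, table


def find(seq, sa, table, k, query):
    """Sorted positions of query (length >= k) in seq."""
    m = len(query)
    if m < k:
        raise ValueError("query is shorter than the table's k-mer size")
    lo, hi = table.get(query[:k], (0, 0))  # lookup step
    end = hi
    while lo < hi:  # binary search for the start of the matching block
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, end
    while lo < hi:  # binary search for the end of the matching block
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


seq = "atcgatcga"
sa, table = build_hybrid_index(seq, 3)
print(sa)
print(find(seq, sa, table, 3, "atc"), find(seq, sa, table, 3, "atcgat"))
```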
Spaced seeds & suffix arrays
• We can make a “spaced suffix array”:
Sequence: atgctagcc
Spaced suffixes:
Position  Spaced suffix
0         at_ct_gc_
1         tg_ta_cc
2         gc_ag_c
3         ct_gc_
4         ta_cc
5         ag_c
6         gc_
7         cc
8         c
Sorted suffixes:
Position  Spaced suffix
5         ag_c
0         at_ct_gc_
8         c
7         cc
3         ct_gc_
6         gc_
2         gc_ag_c
4         ta_cc
1         tg_ta_cc
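The spaced suffix array in this example can be reproduced by sorting positions on their spaced suffixes. The sketch assumes the pattern "110" (ignore every 3rd position) repeated cyclically, which matches the example; the function names are illustrative.

```python
def spaced_suffix(seq, start, pattern):
    """Suffix from `start`, with '_' wherever the repeated pattern has '0'."""
    return "".join(
        c if pattern[k % len(pattern)] == "1" else "_"
        for k, c in enumerate(seq[start:])
    )


def spaced_suffix_array(seq, pattern):
    """Positions sorted by their spaced suffixes."""
    return sorted(range(len(seq)),
                  key=lambda i: spaced_suffix(seq, i, pattern))


seq = "atgctagcc"
for pos in spaced_suffix_array(seq, "110"):
    print(pos, spaced_suffix(seq, pos, "110"))
```

Because the ignored positions fall at the same offsets in every spaced suffix, they never influence the sort order.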
An implementation
• LAST
– http://last.cbrc.jp
• Sequence comparison & alignment
– Seed-and-extend, similar to BLAST
– Uses spaced suffix array to find seeds
Example
• Compare the human and mouse
genomes
– Took about 1 day on 1 CPU
• The next figure is a dot-plot
– Human chromosomes: along the top
– Mouse chromosomes: down the side
Conclusions
• Sequence comparison is arguably the most
fundamental task in computational biology
• Usual problem definition: maximum score
• Exact algorithm: Smith-Waterman-Gotoh
• Heuristic approach: seed-and-extend
• Spaced seeds improve sensitivity
• Suffix arrays allow more flexible seed-finding
• These seed-finding methods can be unified
Looking for a job?
http://www.cbrc.jp/