Principles of aligning sequences

Sequencing a genome and Basic
Sequence Alignment
Lecture 10
Global Sequence
1
Introduction
• Annotation of DNA sequences
• Discovering genomes: the shot-gun approach
• Sequence alignment and sequence matching
Annotation of sequences
• As discussed before, once a gene's sequences (DNA and/or mRNA) have been determined (obtained), the data must be annotated (Klug 2010):
– which sequences correspond to UTRs, exons/introns, coding sequences (cds) and the polyA signal
– other sequences of interest, including promoter sites and other regulatory regions (enhancers…)
• Annotation also contains important supplementary material: other organisms that have the same gene, the corresponding protein sequence, and journal articles related to the sequences.
Sequence similarity
• In many cases, annotating a gene sequence involves a sequence homology “test” against existing sequences whose function is known.
• The assumption is that both sequences are homologous [have a common ancestor; were once the same sequence] but now differ because of a series of mutations: substitutions, deletions, insertions.
• The basic concept behind this process is sequence alignment and determining the strength of the match for the aligned sequences.
Sequence Alignment (Pair-wise): A
simple global match
• The assignment of residue-to-residue correspondences:
– A global match: align all of one sequence with all of another.
– The figure shows two sequences of nucleic acids.
– Some positions have the same base (nucleotide), so there is a match at that position between the strands. This is represented by a vertical line and a blue highlight.
– Others do not match and have no vertical line and no blue highlight.
This figure, adapted from Klug, is a comparison of a “leptin gene” from a dog (top) and a human, Homo sapiens (bottom).
A simple global Match
• The non-matches are presumed to correspond to mutations; in this case, substitution mutations.
• In DNA (nucleic acid) mutations:
– A transition (purine <-> purine or pyrimidine <-> pyrimidine, e.g. A <-> G or C <-> T) is more probable than a transversion (purine <-> pyrimidine, e.g. A <-> T).
– A substitution mutation is more probable than an insertion/deletion.
• The relative probability of such mutations has to be taken into account when determining the strength of the match (we will discuss this in greater detail later).
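A minimal sketch of how the relative probability of substitutions could feed into a match score. The function names and the weights (1.0, -0.5, -1.0) are illustrative assumptions, not values from the lecture; the only fixed fact is that A and G are purines and C and T are pyrimidines.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a: str, b: str) -> str:
    """Classify one aligned pair of DNA bases."""
    a, b = a.upper(), b.upper()
    if a not in PURINES | PYRIMIDINES or b not in PURINES | PYRIMIDINES:
        raise ValueError("expected bases A, C, G or T")
    if a == b:
        return "match"
    # Same chemical class on both sides means a transition.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def weighted_score(seq1: str, seq2: str,
                   match: float = 1.0,
                   transition: float = -0.5,
                   transversion: float = -1.0) -> float:
    """Score two equal-length, ungapped DNA strings column by column.

    The default weights are hypothetical: they only encode the idea that
    a transition is penalised less than a transversion.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    weights = {"match": match, "transition": transition,
               "transversion": transversion}
    return sum(weights[classify_substitution(x, y)]
               for x, y in zip(seq1, seq2))


print(classify_substitution("A", "G"))  # transition
print(classify_substitution("A", "T"))  # transversion
print(weighted_score("ACGT", "GCGA"))   # 0.5
```

In the last call the columns are one transition (A/G), two matches and one transversion (T/A), which gives -0.5 + 1 + 1 - 1 = 0.5.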
Global sequence alignment: different size sequences
• A global alignment between sequences of different sizes requires the inclusion of gaps [dashes] in order to optimise the matching process.
• Example 1, which considers only substitution mutations, produces a much lower number of matches than Example 2, which considers all three types of point mutation (substitution, insertion, deletion).
• These examples calculate a simple matching score.
• In DNA you would need to factor in the relative probability of substitutions.
• In amino acids the calculation is more complicated.
Example 1
I am from Cork
I am not from Cork
****
(4 matches out of 18; based on
length of bottom string)
Example 2
I am---- from Cork
I am not from Cork
****    **********
(14 matches out of 18; based on
length of bottom string)
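The simple matching score can be sketched as a column-by-column count. This assumes the two rows have already been padded with gap characters to the same length; the function name `match_score` is hypothetical, and the gapped top row is one way of writing Example 2 so that both rows have 18 columns.

```python
def match_score(top: str, bottom: str, gap: str = "-") -> tuple[int, int]:
    """Return (matches, length of bottom row) for two aligned rows.

    Assumes both rows are already gap-padded to equal length.
    A column counts as a match only if the characters are identical
    and are not gap characters.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for x, y in zip(top, bottom) if x == y and x != gap)
    return matches, len(bottom)


print(match_score("I am---- from Cork",
                  "I am not from Cork"))  # (14, 18)
```

Spaces are treated as ordinary characters here, which is what makes the gapped alignment come out at 14 of 18.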
Example of DNA sequence alignment
Adapted from Klug p. 384
Sequence alignment: Amino Acids
Key: “*” match; “-” gap; “:” conserved substitution; “.” semi-conserved substitution.
• In DNA the sequence “itself” is most important; all bases have the “same” basic properties.
• However, amino acid sequences produce a 3-D structure, which depends on the properties of the amino acids in the sequence.
• Amino acids with similar side-chain properties will have overlapping “effects” on the 3-D structure of the protein.
• The above figure takes this into account by referring to two types of substitution: conserved and semi-conserved.
Sequence Alignment: a local Match
A local match:
• Find a region in one sequence that matches a region in the other.
• A local match is generally used if there is a large difference in size between the sequences.
• The overhangs at the beginning and end of the query string are not treated as gaps.
• In the example:
– A global alignment gives a score of 9 out of 13;
– A local alignment gives a score of 8 out of 10 (do not count overhangs…)
– In general, the alignment with the highest score is the one that is taken.
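A rough sketch of the local-match idea: slide the shorter (query) sequence along the longer one and keep the offset with the most matching columns, ignoring the overhangs. This is a simplified, ungapped comparison and is not the full dynamic-programming method normally used for local alignment; the function name and the test sequences are made up for illustration.

```python
def best_ungapped_local_match(query: str, target: str) -> tuple[int, int]:
    """Return (offset, matches) for the best placement of query on target.

    Only the columns covered by the query are scored, so the parts of
    the target that overhang on either side cost nothing.
    No gaps are allowed inside the match (a simplifying assumption).
    """
    if not query or len(query) > len(target):
        raise ValueError("query must be non-empty and no longer than target")
    best_offset, best_matches = 0, -1
    for offset in range(len(target) - len(query) + 1):
        window = target[offset:offset + len(query)]
        matches = sum(x == y for x, y in zip(query, window))
        if matches > best_matches:
            best_offset, best_matches = offset, matches
    return best_offset, best_matches


# Hypothetical sequences: the query fits best starting at position 2.
print(best_ungapped_local_match("GATTACA", "CCGATTGCATT"))  # (2, 6)
```

Here 6 of the 7 query columns match at offset 2, and the two leading and two trailing bases of the target are simply not counted.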
Sequence Alignment: pairwise: a
motif match
• A motif match can find:
– a “perfect” match between a small sequence and one or more regions in a larger sequence.
• This plays an important part in looking for repeating sequences [tandem repeats] and other important “small” sequences.
• The motif match, like the others, does not of course have to be “contiguous”; it can also include a conserved distributed pattern.
• You are not from Cork
• You are not normal
• They are not happy about…
*** *** (the conserved motif “are not”)
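A small sketch of an exact ("perfect") motif match: report every position at which a short motif occurs in a longer sequence, including overlapping occurrences, which is what matters for tandem repeats. The function name and the DNA string are illustrative assumptions.

```python
def find_motif(sequence: str, motif: str) -> list[int]:
    """Return the 0-based start positions of every exact occurrence of motif.

    Overlapping occurrences are reported, because the search restarts
    one character after each hit.
    """
    if not motif:
        raise ValueError("motif must not be empty")
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


print(find_motif("GATATATGCATATACTT", "ATAT"))       # [1, 3, 9]
print(find_motif("You are not from Cork", "are not"))  # [4]
```

A distributed (non-contiguous) pattern would need a different approach, for example a regular expression; that case is not covered by this sketch.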
Multiple sequence alignment
• Similar to the previous, except you look for areas conserved between all the sequences in the alignment:
– My name is denis and I am from cork
– My name is kieran and I am not from cork
– We name the dog “canis familiaris”
– Conserved region: name
• Used to align multiple sequences, which can then be checked for conserved motifs/sequences across many species; this helps to determine protein functionality, promoter signals, enhancer and silencer regions…
• From this, phylogenetic relationships can be determined (evolution: refer to Understanding Bioinformatics, chapter 7).
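Once several sequences have been aligned, checking for conserved positions is a column-by-column test. The sketch below assumes the alignment has already been done (all rows gap-padded to the same length); it does not perform the multiple alignment itself. The function name and the three short rows are hypothetical.

```python
def conserved_columns(aligned: list[str], gap: str = "-") -> list[int]:
    """Return the 0-based columns where every row has the same residue.

    Assumes the rows are already aligned and of equal length.
    Columns containing a gap are never reported as conserved.
    """
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("aligned rows must have the same length")
    return [i for i in range(length)
            if aligned[0][i] != gap
            and all(row[i] == aligned[0][i] for row in aligned)]


rows = ["ACGTTA",
        "ACCTTA",
        "AC-TTG"]
print(conserved_columns(rows))  # [0, 1, 3, 4]
```

Runs of adjacent conserved columns (0 to 1 and 3 to 4 here) are the candidates for conserved motifs.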
GENOMES: Sequencing and assembling
• The supplementary lecture covers how to produce and determine the sequence of DNA strands. However, the size of the strands is limited to a few thousand base pairs.
• To sequence an organism’s entire genome:
– Must use the “shot-gun” approach
– Cut the genome into small fragments whose sequence can be determined.
– Use computational techniques (sequence alignment) to join them back together in the correct order.
Shot-gun
• The shot-gun approach requires two genetic technologies (refer to the supplementary material for more detail) and one computational technique (overlapping contigs):
– Restriction enzymes: cut up the double-stranded DNA into fragments
– Fast DNA sequencing of fragments (sequences)
– Combining overlapping contiguous DNA
sequences
Overlapping Contiguous Fragments
Adapted from [1] p. 377
Overlapping Fragments: example
• Original sentence:
• This is the school of computing bioinformatics course.
• Cut 2 copies of the sentence into fragments
• This is
• The school of
• Computing bioinformatics course
• This is the
• School of computing
• Bioinformatics course
Overlapping Fragments: example
• Check for overlaps (prefix and suffix)
• This is
• This is the
• The school of
• School of computing
• Computing bioinformatics course
• Bioinformatics course
• Result of alignment of fragments is:
– This is the school of computing bioinformatics course
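The overlap check between two fragments can be sketched as follows: find the longest suffix of the left fragment that is also a prefix of the right fragment. The function name `overlap` is hypothetical, and comparing case-insensitively is an assumption made so that "the" and "The" in the sentence example count as the same text.

```python
def overlap(left: str, right: str) -> int:
    """Length of the longest suffix of left that equals a prefix of right.

    Comparison is case-insensitive (an assumption for the sentence
    example). Returns 0 when the fragments do not overlap.
    """
    left, right = left.lower(), right.lower()
    # Try the longest possible overlap first, then shrink.
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


print(overlap("This is the", "The school of"))        # 3  ("the")
print(overlap("The school of", "School of computing"))  # 9  ("school of")
print(overlap("This is", "Bioinformatics course"))    # 0
```

With real reads a minimum overlap length would normally be required, since very short overlaps can occur by chance.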
Example of Contigs alignment:
The above diagram shows a DNA example of how overlapping contiguous sequences are aligned. However, it is an oversimplification: actual segments are many times larger than shown, and overlaps do not always fall at the ends of segments. Adapted from: Klug 7th p 378
Example 2:
• Reconstruct the following fragments
1. the men and women merely players;\n
2. one man in his time
3. All the world's
4. their entrances,\nand one man
5. a stage,\nAnd all the men and women
6. They have their exits and their entrances,\n
7. world's a stage,\nAnd all
8. their entrances,\nand one man
9. in his time plays many parts.
10. merely players;\nThey have
Example 2 Solution
All the world’s a stage,
And all the men and women merely players;
They have their exits and their entrances,
And one man in his time plays many parts.
• The order in which the fragments join together is:
• 3, 7, 5, 1, 10, 6, 4, 8, 2, 9
Example 2 Solution in detail.
1. the men and women merely players;(\n)
2. one man in his time
3. All the world's
4. their entrances,(\n) and one man
5. a stage,(\n) And all the men and women
6. They have their exits and their entrances,(\n)
7. world's a stage,(\n) And all
8. their entrances,(\n) and one man
9. in his time plays many parts.
10. merely players;(\n) They have
Order of the statements
3: all the world’s,
7: all the world’s a stage,
And all
5: all the world’s a stage,
And all the men and women
1: all the world’s a stage,
And all the men and women merely players;
10: all the world’s a stage,
And all the men and women merely players;
They have
6: all the world’s a stage,
And all the men and women merely players;
They have their exits and their entrances
4: all the world’s a stage,
And all the men and women merely players;
They have their exits and their entrances
And one man
8: all the world’s a stage,
And all the men and women merely players;
They have their exits and their entrances
And one man
2: all the world’s a stage,
And all the men and women merely players;
They have their exits and their entrances
And one man in his time
9: all the world’s a stage,
And all the men and women merely players;
They have their exits and their entrances
And one man in his time plays many parts
Algorithm to join contigs
• We need two relationships between fragments:
• (1) Which fragment has a prefix that matches no suffix of any other fragment? (This tells us which fragment comes first.)
• (2) For a given fragment, which other fragment's prefix shares the longest overlap with its suffix? (This tells us which fragment follows any fragment.)
Potential Exam question
• Briefly describe the three main types of
sequence alignment (6 marks)
• Explain how you would determine the DNA sequence of a genome, given that technology can only determine the DNA sequences of relatively small DNA strands (14 marks).
• Explain two important elements of an algorithm that can solve the problem. (10 marks)