Sequence alignment and Evolution

MSCBIO 2070/02-710: Computational Genomics, Spring 2016
HW1: Sequence alignment and Evolution
Due: 24:00 EST, Feb 15, 2016 by autolab
Your goals in this assignment are to
1. Complete a genome assembler
2. Implement sequence alignment algorithms (global and local)
3. Hidden Markov Model
4. Evolution and Phylogeny
What to hand in.
• One report (in pdf format) addressing each of the following questions.
• All source code. If a skeleton is provided, you just need to complete the script and send it back.
Submit a zip file containing the completed code (if any) and the pdf file (if any) to autolab. The zip
file should have the following structure
./S2016HW1.pdf
./Q1/
put all code related to Q1 here, if any
./Q2/
put all code related to Q2 here, if any
1. [20 points] Genome Assembler
In genome assembly, many short sequences (reads) from a sequencing machine are assembled into long
sequences. By ordering overlapping reads we are likely to reconstruct the sequenced genome. For
example, given three sequences AGGTCGTAG, CGTAGAGCTGGGAG and GGGAGGTTGAAA,
order them by the overlapping parts as follows,
AGGTCGTAG
    CGTAGAGCTGGGAG
             GGGAGGTTGAAA
The final genomic sequence is likely to be,
AGGTCGTAGAGCTGGGAGGTTGAAA
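The merge of the three example reads can be sketched in a few lines of Python. This is only an illustration under the assumption that the reads are already in the right order; the helper names overlap and merge are hypothetical and not part of the provided skeleton.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge(left, right):
    """Append `right` to `left`, writing the shared overlap only once."""
    return left + right[overlap(left, right):]


# The three example reads, already listed in left-to-right order.
reads = ["AGGTCGTAG", "CGTAGAGCTGGGAG", "GGGAGGTTGAAA"]
genome = reads[0]
for read in reads[1:]:
    genome = merge(genome, read)
print(genome)  # AGGTCGTAGAGCTGGGAGGTTGAAA
```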
Real genome assemblers are more sophisticated, but the idea is the same. We make two assumptions
for simplification here.
• There is no sequencing error and all the overlaps are perfect matches.
• No reads are nested in any other read. In other words, we only have overlaps of type A, no type
B.
Type A
AAAAAA
   AAAAAA
Type B
AAAAAAAA
  AAAA
Your task for this question is to read short reads from the input file, identify the overlaps, find the right
order of these reads and reconstruct the sequenced genome. There is one text file reads.txt which
contains all short reads and one python skeleton GenomeAssembler.py you need to complete. The
input format of the text file is defined as follows,
1, ATCG . . .
2, TCGA . . .
3, CGAT . . .
...
In each line, the first number is the name (barcode), and the string following the barcode
is the short read itself. There is no header in the input file. Barcodes and reads are separated by
commas.
Your task
(a) (4 points) Complete the function ReadDataFromfile(filename) - Read the text file into memory and return one variable dat which contains all barcodes and corresponding reads. If you are
not comfortable using a single variable to record all the information, you may use your own
variables, but make sure the output variables can be used directly as the input arguments in
(b).
Hint: You could define dat as a dictionary which records barcodes and sequences at the same
time.
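One possible way to follow the hint is sketched below. It assumes each line has the form "barcode, read" as described above and keeps barcodes as strings; the skeleton in GenomeAssembler.py may expect a different return type, so treat this as an illustration rather than the reference solution.

```python
def ReadDataFromfile(filename):
    """Return a dict mapping each barcode (str) to its read (str)."""
    dat = {}
    with open(filename) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split on the first comma only: barcode on the left, read on the right.
            barcode, read = line.split(",", 1)
            dat[barcode.strip()] = read.strip().upper()
    return dat
```

Calling ReadDataFromfile("reads.txt") would then give a dictionary that can be passed on to the functions in the later parts.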
(b) (3 points) Complete the function MeanLengthofReads(dat) - You need to complete the
function and report the mean length of the input reads. Include the mean length in your report.
The mean length of the input reads is 226.9 bases.
(c) (8 points) Compute the overlap lengths between all pairs of reads. Include the longest overlap
segment and two corresponding barcodes in your report. Also report the first (left-most) read
and its barcode.
The longest overlap segment is 165 bases long. It is formed by 1 → 4.
ACATCTGTGAGTGAGAACAGGTGTCACCTTGAAGGTGGGAGTGATCAAAAGGACCT
TGTACAAGAGCTTCAGGAAGAGAAACCTTCATCTTCACATTTGGTTTCTAGACCAT
CTACCTCATCTAGAAGGAGAGCAATTAGTGAGACAGAAGAAAATTCAGATGAA
The left-most read is read 9.
ATGTGCAATACCAACATGTCTGTACCTACTGATGGTGCTGTAACCACCTCACAGA
TTCCAGCTTCGGAACAAGAGACCCTGGTTAGACCAAAGCCATTGCTTTTGAAGTT
ATTAAAGTCTGTTGGTGCACAAAAAGACACTTATACTATGAAAGAGGTTCTTTTT
TATCTTGGCCAGTATATTATGACTAAACGATTATATGATGAGAAGCAACAACATA
TTGTATATTGTTCAAATGATCTTCTAGGAGATTTGTTTGGCG
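A minimal sketch of the all-pairs overlap computation is given below, using toy reads rather than reads.txt. The min_overlap threshold is an assumption added here to ignore chance overlaps of one or two bases, and the function names are hypothetical.

```python
from itertools import permutations


def overlap(left, right, min_overlap=3):
    """Longest suffix of `left` equal to a prefix of `right` (0 if shorter than min_overlap)."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def overlap_table(dat, min_overlap=3):
    """Overlap length for every ordered pair of barcodes (a, b), meaning a -> b."""
    return {(a, b): overlap(dat[a], dat[b], min_overlap)
            for a, b in permutations(dat, 2)}


def leftmost_read(dat, table):
    """Barcode of the read whose prefix is not overlapped by any other read."""
    candidates = [b for b in dat
                  if all(table[(a, b)] == 0 for a in dat if a != b)]
    if len(candidates) != 1:
        raise ValueError("expected exactly one left-most read")
    return candidates[0]


# Toy data: the three example reads with shuffled barcodes.
dat = {"1": "CGTAGAGCTGGGAG", "2": "GGGAGGTTGAAA", "3": "AGGTCGTAG"}
table = overlap_table(dat)
print(max(table.values()))       # length of the longest overlap
print(leftmost_read(dat, table)) # barcode of the left-most read
```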
(d) (4 points) Assemble the reads into the sequenced genome. Report the final genome and the
order of barcodes in which all reads are assembled.
9 → 3 → 5 → 10 → 1 → 4 → 8 → 2 → 6 → 7
ATGTGCAATACCAACATGTCTGTACCTACTGATGGTGCTGTAACCACCTCACAGATTCCA
GCTTCGGAACAAGAGACCCTGGTTAGACCAAAGCCATTGCTTTTGAAGTTATTAAAGTCT
GTTGGTGCACAAAAAGACACTTATACTATGAAAGAGGTTCTTTTTTATCTTGGCCAGTAT
ATTATGACTAAACGATTATATGATGAGAAGCAACAACATATTGTATATTGTTCAAATGAT
CTTCTAGGAGATTTGTTTGGCGTGCCAAGCTTCTCTGTGAAAGAGCACAGGAAAATATAT
ACCATGATCTACAGGAACTTGGTAGTAGTCAATCAGCAGGAATCATCGGACTCAGGTACA
TCTGTGAGTGAGAACAGGTGTCACCTTGAAGGTGGGAGTGATCAAAAGGACCTTGTACAA
GAGCTTCAGGAAGAGAAACCTTCATCTTCACATTTGGTTTCTAGACCATCTACCTCATCT
AGAAGGAGAGCAATTAGTGAGACAGAAGAAAATTCAGATGAATTATCTGGTGAACGACAA
AGAAAACGCCACAAATCTGATAGTATTTCCCTTTCCTTTGATGAAAGCCTGGCTCTGTGT
GTAATAAGGGAGATATGTTGTGAAAGAAGCAGTAGCAGTGAATCTACAGGGACGCCATCG
AATCCGGATCTTGATGCTGGTGTAAGTGAACATTCAGGTGATTGGTTGGATCAGGATTCA
GTTTCAGATCAGTTTAGTGTAGAATTTGAAGTTGAATCTCTCGACTCAGAAGATTATAGC
CTTAGTGAAGAAGGACAAGAACTCTCAGATGAAGATGATGAGGTATATCAAGTTACTGTG
TATCAGGCAGGGGAGAGTGATACAGATTCATTTGAAGAAGATCCTGAAATTTCCTTAGCT
GACTATTGGAAATGCACTTCATGCAATGAAATGAATCCCCCCCTTCCATCACATTGCAAC
AGATGTTGGGCCCTTCGTGAGAATTGGCTTCCTGAAGATAAAGGGAAAGATAAAGGGGAA
ATCTCTGAGAAAGCCAAACTGGAAAACTCAACACAAGCTGAAGAGGGCTTTGATGTTCCT
GATTGTAAAAAAACTATAGTGAATGATTCCAGAGAGTCATGTGTTGAGGAAAATGATGAT
AAAATTACACAAGCTTCACAATCACAAGAAAGTGAAGACTATTCTCAGCCATCAACTTCT
AGTAGCATTATTTATAGCAGCCAAGAAGATGTGAAAGAGTTTGAAAGGGAAGAAACCCAA
GACAAAGAAGAGAGTGTGGAATCTAGTTTGCCCCTTAATGCCATTGAACCTTGTGTGATT
TGTCAAGGTCGACCTAAAAATGGTTGCATTGTCCATGGCAAAACAGGACATCTTATGGCC
TGCTTTACATGTGCAAAGAAGCTAAAGAAAAGGAATAAGCCCTGCCCAGTATGTAGACAA
CCAATTCAAATGATTGTGCTAACTTATTTCCCCTAG
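Under the two simplifying assumptions (perfect overlaps, no nested reads), the assembly can be sketched as a walk from the left-most read along the best suffix-prefix overlaps. The code below is an illustrative greedy version on toy reads; the min_overlap threshold and the function names are assumptions, not part of the provided skeleton.

```python
def overlap(left, right, min_overlap=3):
    """Longest suffix of `left` equal to a prefix of `right` (0 if shorter than min_overlap)."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble(dat, min_overlap=3):
    """Return (order of barcodes, assembled sequence) for a barcode -> read dict."""
    successor = {}            # barcode -> (next barcode, overlap length)
    has_predecessor = set()
    for a in dat:
        best, best_len = None, 0
        for b in dat:
            if a != b:
                size = overlap(dat[a], dat[b], min_overlap)
                if size > best_len:
                    best, best_len = b, size
        if best is not None:
            successor[a] = (best, best_len)
            has_predecessor.add(best)
    starts = [b for b in dat if b not in has_predecessor]
    if len(starts) != 1:
        raise ValueError("expected exactly one left-most read")
    order = [starts[0]]
    genome = dat[starts[0]]
    # Follow the chain; the length guard protects against accidental cycles.
    while order[-1] in successor and len(order) < len(dat):
        nxt, size = successor[order[-1]]
        genome += dat[nxt][size:]
        order.append(nxt)
    return order, genome


dat = {"1": "CGTAGAGCTGGGAG", "2": "GGGAGGTTGAAA", "3": "AGGTCGTAG"}
order, genome = assemble(dat)
print(" -> ".join(order))  # 3 -> 1 -> 2
print(genome)              # AGGTCGTAGAGCTGGGAGGTTGAAA
```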
(e) (1 point) What is this genome?
Hint: Uniprot
http://www.uniprot.org/blast/
• A8WFP2: MDM2 protein, Homo sapiens (Human)
• Q00987: E3 ubiquitin-protein ligase Mdm2, Homo sapiens (Human)
• Q00987-11: Isoform 11 of E3 ubiquitin-protein ligase Mdm2, Homo sapiens (Human)
2. [14 points] Sequence alignment
Warning: You should implement the algorithm from scratch in Python. If you use some
existing functions to complete your tasks, only partial grades will be assigned to your
answer.
Your task
(a) (6 points) Write a program to perform global alignment on two DNA sequences in DNA.fasta.
Use a match score +5, mismatch penalty of -4 and a gap penalty of -5. Run your algorithm,
report your alignment and the score.
The score for the alignment is 14. The alignment is not unique and you just need to report one
of them.
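For reference, a compact sketch of global alignment by dynamic programming (Needleman-Wunsch) with the scores given above is shown below. It is run on a toy pair of sequences rather than DNA.fasta, and it returns only one of the possibly many optimal alignments.

```python
def global_alignment(a, b, match=5, mismatch=-4, gap=-5):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(global_alignment("ACGT", "ACT"))  # (10, 'ACGT', 'AC-T')
```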
(b) (8 points) Write a program to find the best locally aligned region of two proteins in PROTEIN.fasta. Use BLOSUM62 and set gap open/extend penalties as -5. Run your algorithm and report the alignment result. The BLOSUM62 matrix has been extracted into BLOSUM62Matrix.txt.
The score for the local alignment is 202. The alignment is not unique and you just need to report
one of them. One way to check your algorithm is to compare its performance with EMBOSS
Water (http://www.ebi.ac.uk/Tools/psa/emboss_water/)
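A sketch of local alignment (Smith-Waterman) with a linear gap penalty of -5 follows. To stay self-contained it uses a toy scoring function, toy_score, as a stand-in for a BLOSUM62 lookup; for the actual question you would pass a function that reads scores from BLOSUM62Matrix.txt instead. Both function names are hypothetical.

```python
def toy_score(x, y):
    """Stand-in for a BLOSUM62 lookup: +5 for identical residues, -4 otherwise."""
    return 5 if x == y else -4


def local_alignment(a, b, sub=toy_score, gap=-5):
    """Smith-Waterman with a linear gap penalty; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_alignment("TTTACGTAAA", "GGACGTCC"))  # (20, 'ACGT', 'ACGT')
```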
3. [24 points] Hidden Markov Model
Your task
(a) (4 points) A transcription factor (TF) is a protein that binds to certain DNA sequences in
promoters and affects transcription. The binding sites of a TF that binds sequences of length
n can be described using a 4 × n matrix in which the ith column gives the probabilities for the
different nucleotides at position i of the binding site. Describe an HMM for a promoter that can
have multiple disjoint binding sites for a given TF, draw the topology and label all parameters.
Here we define three states: start, binding site, and non-binding site. The transition probabilities $p_{s \to b}$, $p_{s \to n}$, $p_{n \to n}$, $p_{b \to n}$ and
$p_{n \to b}$ are all labelled. Emission probability matrices are defined for the
binding site state and the non-binding site state.
• Each position in the binding site needs to be defined as a separate state, with transition
probability 1 from one position to the next. It is also acceptable to define A, C, G, T as four separate states instead of a single
non-binding site state, but the single non-binding site state is preferred.
• The start state is not compulsory in your topology. It is also fine to include an end state or
to include the transition "binding site → binding site", which is unlikely to happen.
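The topology described above can be written down as plain dictionaries. The sketch below assumes one background state N, one state per binding-site position (B1 to Bn), a uniform background emission, no start or end state, and a return to N with probability 1 after the last position; the function name, the state names and the value of p_nb are all illustrative assumptions.

```python
def build_promoter_hmm(pwm, p_nb=0.01, background=None):
    """Build (states, transitions, emissions) from a 4 x n matrix.

    pwm is a list of n dicts, one per binding-site position, mapping each
    nucleotide to its probability at that position.
    """
    n = len(pwm)
    background = background or {nt: 0.25 for nt in "ACGT"}
    states = ["N"] + [f"B{i}" for i in range(1, n + 1)]
    trans = {s: {} for s in states}
    trans["N"]["N"] = 1.0 - p_nb         # stay outside a binding site
    trans["N"]["B1"] = p_nb              # enter a binding site
    for i in range(1, n):
        trans[f"B{i}"][f"B{i + 1}"] = 1.0  # positions are traversed in order
    trans[f"B{n}"]["N"] = 1.0            # sites are disjoint: always leave after position n
    emit = {"N": dict(background)}
    for i, column in enumerate(pwm, start=1):
        emit[f"B{i}"] = dict(column)     # column i of the matrix
    return states, trans, emit


# Toy matrix for a binding site of length 3.
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
states, trans, emit = build_promoter_hmm(pwm, p_nb=0.05)
print(states)
```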
(b) (2 points) Suppose we know that the distance between binding sites should be between 100
and 200 nucleotides. Can we construct an efficient HMM to capture this? Sketch an HMM
topology or explain why an HMM is not suitable.
The HMM proposed above cannot satisfy the length constraint. Each state depends only
on the previous state, so the model cannot incorporate higher-order information such as the distance since the last binding site.
• You can also modify the existing HMM to make it work by playing the same trick as we did
for the binding site. Treat the first 100 positions between binding sites
as separate states, each with transition probability 1 to the next. Starting from the 101st position, you can
define another state similar to the non-binding site state above.
• You could argue either that your HMM is efficient or that it is not. We don't have a strong opinion on this point.
(c) Consider the following HMM, where transition probabilities are on the edges and emission
probabilities are given in tables next to the nodes:
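[State diagram with four states q0, q1, q2 and q3; transition probabilities are on the edges and each state has an emission table.]
Emission probabilities (rows: states, columns: symbols):
      a     c     g     t
q0    0.0   0.0   0.0   1.0
q1    0.1   0.8   0.1   0.0
q2    0.0   0.9   0.1   0.0
q3    0.2   0.3   0.5   0.0
Transition probabilities: q0 → q1 = 0.2, q1 → q1 = X, q1 → q0 = 0.1, q1 → q2 = 0.1, q2 → q2 = 0.5, q2 → q3 = 0.5, q3 → q0 = 1.0. The two remaining edge labels, 0.2 and 0.6, are on edges leaving q0.
Figure 1: HMM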
• (4 points) Suppose we start in state q0 . Give two paths that could emit the string tagcat.
What are their probabilities?
First we need to evaluate the unknown transition probability x (labelled X in Figure 1).
$x = P(q_1 \to q_1) = 1 - 0.1 - 0.1 = 0.8$
In fact, there are only three possible paths.
Path I: q0 (t) → q1 (a) → q1 (g) → q2 (c) → q3 (a) → q0 (t)
$P = 1 \times 0.2 \times 0.1 \times 0.8 \times 0.1 \times 0.1 \times 0.9 \times 0.5 \times 0.2 \times 1 \times 1 = 1.44 \times 10^{-5}$
Path II: q0 (t) → q1 (a) → q2 (g) → q2 (c) → q3 (a) → q0 (t)
$P = 1 \times 0.2 \times 0.1 \times 0.1 \times 0.1 \times 0.5 \times 0.9 \times 0.5 \times 0.2 \times 1 \times 1 = 9 \times 10^{-6}$
Path III: q0 (t) → q1 (a) → q1 (g) → q1 (c) → q1 (a) → q0 (t)
$P = 1 \times 0.2 \times 0.1 \times 0.8 \times 0.1 \times 0.8 \times 0.8 \times 0.8 \times 0.1 \times 0.1 \times 1 = 8.192 \times 10^{-6}$
You could also report the probabilities as posterior probabilities given the sequence
tagcat, for example
$P(\text{Path I} \mid \text{tagcat}) = \frac{1.44 \times 10^{-5}}{1.44 \times 10^{-5} + 9 \times 10^{-6} + 8.192 \times 10^{-6}} = 0.456$
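The three paths and their probabilities can be checked by brute-force enumeration. The sketch below encodes the transition and emission values used in the calculations above; the two other edges leaving q0 are omitted, on the assumption that they cannot be used for this string because the symbol after the first t is a, which neither q0 nor q2 can emit. TRANS, EMIT and the function names are illustrative.

```python
from itertools import product

TRANS = {
    "q0": {"q1": 0.2},   # other edges out of q0 omitted (assumption, see text)
    "q1": {"q0": 0.1, "q1": 0.8, "q2": 0.1},
    "q2": {"q2": 0.5, "q3": 0.5},
    "q3": {"q0": 1.0},
}
EMIT = {
    "q0": {"t": 1.0},
    "q1": {"a": 0.1, "c": 0.8, "g": 0.1},
    "q2": {"c": 0.9, "g": 0.1},
    "q3": {"a": 0.2, "c": 0.3, "g": 0.5},
}


def path_probability(path, seq):
    """Joint probability of a state path and the emitted string."""
    prob = EMIT[path[0]].get(seq[0], 0.0)
    for prev, cur, symbol in zip(path, path[1:], seq[1:]):
        prob *= TRANS[prev].get(cur, 0.0) * EMIT[cur].get(symbol, 0.0)
    return prob


def emitting_paths(seq, start="q0"):
    """All state paths starting in `start` that emit `seq` with non-zero probability."""
    paths = {}
    for rest in product(TRANS, repeat=len(seq) - 1):
        path = (start,) + rest
        prob = path_probability(path, seq)
        if prob > 0:
            paths[path] = prob
    return paths


paths = emitting_paths("tagcat")
total = sum(paths.values())
for path, prob in sorted(paths.items(), key=lambda kv: -kv[1]):
    print(" -> ".join(path), f"{prob:.3e}", f"posterior={prob / total:.3f}")
```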
• (8 points) Suppose we start in state q0 with probability 1. Compute and show the Viterbi
dynamic programming matrix for the string tacccgt. What is the highest probability path
for this string?
If you use probabilities,
    t  a     c       c               c               g               t
q0  1  0     0       0               0               0               1.8432 × 10^-4
q1  0  0.02  0.0128  8.192 × 10^-3   5.2429 × 10^-3  4.1943 × 10^-4  0
q2  0  0     0.0018  1.152 × 10^-3   7.3728 × 10^-4  5.2429 × 10^-5  0
q3  0  0     0       2.7 × 10^-4     1.728 × 10^-4   1.8432 × 10^-4  0
If you use log2 values of probabilities,
    t       a       c       c       c       g       t
q0  0       -Inf    -Inf    -Inf    -Inf    -Inf    -12.41
q1  -Inf    -5.64   -6.29   -6.93   -7.58   -11.22  -Inf
q2  -Inf    -Inf    -9.12   -9.76   -10.41  -14.22  -Inf
q3  -Inf    -Inf    -Inf    -11.85  -12.50  -12.41  -Inf
If you use loge /ln values of probabilities,
    t       a       c       c       c       g       t
q0  0       -Inf    -Inf    -Inf    -Inf    -Inf    -8.60
q1  -Inf    -3.91   -4.36   -4.80   -5.25   -7.78   -Inf
q2  -Inf    -Inf    -6.32   -6.77   -7.21   -9.86   -Inf
q3  -Inf    -Inf    -Inf    -8.22   -8.66   -8.60   -Inf
The most probable path is q0 → q1 → q1 → q1 → q2 → q3 → q0
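The matrix and the path can be reproduced with a short Viterbi sketch. As before, TRANS and EMIT hold the values used in the worked answers, and the two other edges leaving q0 are left out on the assumption that they cannot contribute for tacccgt (the symbol after the first t is a); all names are illustrative.

```python
TRANS = {
    "q0": {"q1": 0.2},   # other edges out of q0 omitted (assumption, see text)
    "q1": {"q0": 0.1, "q1": 0.8, "q2": 0.1},
    "q2": {"q2": 0.5, "q3": 0.5},
    "q3": {"q0": 1.0},
}
EMIT = {
    "q0": {"t": 1.0},
    "q1": {"a": 0.1, "c": 0.8, "g": 0.1},
    "q2": {"c": 0.9, "g": 0.1},
    "q3": {"a": 0.2, "c": 0.3, "g": 0.5},
}


def viterbi(seq, trans, emit, start="q0"):
    """Return (list of DP columns, most probable state path) using plain probabilities."""
    states = list(trans)
    # First column: all probability mass is on the start state.
    V = [{s: (emit[s].get(seq[0], 0.0) if s == start else 0.0) for s in states}]
    back = [{}]
    for symbol in seq[1:]:
        col, ptr = {}, {}
        for s in states:
            # Best predecessor of state s for this position.
            best_prev = max(states, key=lambda r: V[-1][r] * trans[r].get(s, 0.0))
            col[s] = (V[-1][best_prev] * trans[best_prev].get(s, 0.0)
                      * emit[s].get(symbol, 0.0))
            ptr[s] = best_prev
        V.append(col)
        back.append(ptr)
    # Trace back from the most probable final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    path.reverse()
    return V, path


V, path = viterbi("tacccgt", TRANS, EMIT)
print(" -> ".join(path))
print(V[-1]["q0"])
```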
(d) (4 points) Recall from class that a profile HMM looks something like this
[Diagram of a profile HMM with a Begin state, match states M1 to M6, insertion states I0 to I6, deletion states D1 to D6, and an End state.]
Figure 2: profileHMM
where the I, D, M states are insertion, deletion, and match states, respectively.
Suppose you have two functions defined for you: Viterbi(M, x), which returns the most probable
path through an HMM M for an observed sequence x, and Train(x1, x2, ..., xk, p1, p2, ..., pk),
which takes in sequences and their paths and returns an HMM of the form above with all
parameters set optimally, so that the sequences x1, ..., xk are likely to be output by the HMM.
Explain how you can use these functions to perform a multiple sequence alignment (short pseudocode would be helpful). Explain one reason why this approach may fail to produce the correct
alignment.
Given k sequences x1, ..., xk with lengths l1, ..., lk:
i. Calculate the average length $\bar{l}$ of these sequences and define $\bar{l}$ match states, $\bar{l}$ deletion
states, $\bar{l}$ insertion states and two additional states ("begin" and "end"). This fixes
the number of states in the profile HMM we are going to use.
ii. Set up an initial group of parameters, in other words, M0. (You could instead start by
initializing p1, ..., pk, in which case the following step will be a little different.)
iii. p1 = Viterbi(M0, x1), ..., pk = Viterbi(M0, xk)
iv. while the HMM has not converged (or p1, ..., pk are not stable) do
        M ← Train(x1, ..., xk, p1, ..., pk)
        p1 ← Viterbi(M, x1), ..., pk ← Viterbi(M, xk)
    end
v. return p1, ..., pk and M
vi. Based on the output, generate the alignment profile correspondingly.
There is a high chance that this workflow gets stuck in a local optimum, and there is no guarantee
of convergence. You should be careful about the initial parameters.
• The lengths of these sequences may not be the same. So first we need to figure out how
many states we are going to use. The average length is one choice.
• Some of the reasons may still apply if you don’t use profile HMM, so be specific to argue
why the profile HMM may fail.
• Some reference material on the profile HMM:
HMM for sequence alignment: profile HMM, http://goo.gl/5qMXF7
Profile Hidden Markov Models, https://www.cs.princeton.edu/~mona/Lecture/HMM1.pdf
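The iteration above can be sketched in Python as follows. The two functions from the question are passed in as callables named viterbi and train; their signatures here (lists of sequences and paths instead of separate arguments) and the max_iter cap are assumptions made for illustration.

```python
def viterbi_training(sequences, initial_model, viterbi, train, max_iter=100):
    """Alternate between decoding paths and re-training until the paths stop changing.

    viterbi(model, x) -> most probable path for sequence x
    train(sequences, paths) -> model fitted to the sequences and their paths
    """
    model = initial_model
    paths = [viterbi(model, x) for x in sequences]
    for _ in range(max_iter):
        model = train(sequences, paths)
        new_paths = [viterbi(model, x) for x in sequences]
        if new_paths == paths:   # paths are stable: treat as converged
            break
        paths = new_paths
    return paths, model
```

The returned paths say, for each residue, which match or insertion state emitted it, and residues assigned to the same match state form one column of the alignment. The max_iter cap is there because convergence is not guaranteed.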
(e) (2 points) True or false: (if true, give a one-sentence justification; if false, give a
counterexample.) When learning an HMM for a fixed set of observations, assume we do not know the
true number of hidden states (which is often the case), we can always increase the training data
likelihood by permitting more hidden states.
True. In the worst case, we could give one hidden state for each output value in the training
sequence, and achieve perfect fitting. (Note: if you consider the scenario where the number of
states is even larger than the number of observed output values, and answered ’False’, you will
also get full credit.)
4. [12 points] Evolution and Phylogeny
Your task
(a) Suppose you are given two sequences that are of length 200 bases. They differ by 22 transitions
and 6 transversions.
• (2 points) Using Jukes-Cantor model, calculate the expected number of substitutions that
occurred since these sequences diverged from a common ancestor.
$K_{JC} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} P_D\right) = -\frac{3}{4} \ln\left(1 - \frac{4}{3} \times \frac{22 + 6}{200}\right) = 0.155$
The expected number of substitutions is 0.155 × 200 = 31
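The Jukes-Cantor figure can be checked numerically with a small helper. The function name is hypothetical; the formula is the Jukes-Cantor distance used above, with the proportion of differing sites as input.

```python
import math


def jukes_cantor(num_differences, length):
    """Jukes-Cantor distance (expected substitutions per site)."""
    p = num_differences / length
    if p >= 0.75:
        raise ValueError("distance is undefined when 3/4 or more of the sites differ")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


k = jukes_cantor(22 + 6, 200)
print(round(k, 3))      # 0.155
print(round(k * 200))   # 31
```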
• (2 points) Redo the computation with Kimura two parameter model.
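$K_{K2P} = -\frac{1}{2} \ln(1 - 2P - Q) - \frac{1}{4} \ln(1 - 2Q) = -\frac{1}{2} \ln\left(1 - 2 \times \frac{22}{200} - \frac{6}{200}\right) - \frac{1}{4} \ln\left(1 - 2 \times \frac{6}{200}\right) = 0.159$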
The expected number of substitutions is 0.159 × 200 = 32
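The same check for the Kimura two-parameter distance, where P and Q are the proportions of sites showing transitions and transversions. The function name is hypothetical.

```python
import math


def kimura_2p(transitions, transversions, length):
    """Kimura two-parameter distance (expected substitutions per site)."""
    P = transitions / length      # proportion of transitional differences
    Q = transversions / length    # proportion of transversional differences
    return -0.5 * math.log(1.0 - 2.0 * P - Q) - 0.25 * math.log(1.0 - 2.0 * Q)


k = kimura_2p(22, 6, 200)
print(round(k, 3))      # 0.159
print(round(k * 200))   # 32
```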
• (4 points) Now you are given two sequences of length 200 which differ by 54 transitions
and 18 transversions. Repeat the computation with JC and Kimura two parameter models.
Based on these results, which model do you think is preferable when sequences are less and
more divergent respectively?
$K_{JC} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} P_D\right) = -\frac{3}{4} \ln\left(1 - \frac{4}{3} \times \frac{54 + 18}{200}\right) = 0.490$
The expected number of substitutions is 0.49 × 200 = 98
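$K_{K2P} = -\frac{1}{2} \ln(1 - 2P - Q) - \frac{1}{4} \ln(1 - 2Q) = -\frac{1}{2} \ln\left(1 - 2 \times \frac{54}{200} - \frac{18}{200}\right) - \frac{1}{4} \ln\left(1 - 2 \times \frac{18}{200}\right) = 0.547$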
The expected number of substitutions is 0.547 × 200 = 109
When the sequences are less divergent, the two models give similar results. When
the sequences are more divergent, the Kimura two-parameter model is preferred for capturing
the real number of substitutions.
(b) (4 points) You are given the following set of aligned sequences:
1, TCAA
2, GCAT
3, TTTT
4, GATA
5, GAAC
6, ATAG
Find the parsimony score for the tree ((((1, 2),(3, 4)), 5), 6). Indicate the sequence at each
vertex of the tree.
The parsimony score is 11. You can calculate the parsimony score for each site and sum them
up to get the final value.
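The per-site calculation can be sketched with Fitch's algorithm, which is one standard way to compute the parsimony score of a fixed binary tree. The sketch treats the last listed sequence (ATAG) as taxon 6 and represents the tree as nested tuples; the function names are illustrative.

```python
def fitch(tree, site):
    """Return (candidate state set, number of changes) for one alignment column.

    tree is a leaf name or a pair (left, right); site maps leaf name -> character.
    """
    if not isinstance(tree, tuple):
        return {site[tree]}, 0
    left_set, left_cost = fitch(tree[0], site)
    right_set, right_cost = fitch(tree[1], site)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # Children disagree: take the union and count one substitution.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Sum of per-site Fitch scores over all alignment columns."""
    length = len(next(iter(seqs.values())))
    return sum(fitch(tree, {name: s[i] for name, s in seqs.items()})[1]
               for i in range(length))


seqs = {1: "TCAA", 2: "GCAT", 3: "TTTT", 4: "GATA", 5: "GAAC", 6: "ATAG"}
tree = ((((1, 2), (3, 4)), 5), 6)
print(parsimony_score(tree, seqs))  # 11
```

The per-site scores are 3, 3, 1 and 4. The state sets returned at the internal nodes are the candidates from which the sequence at each vertex can be chosen.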