MSCBIO 2070/02-710: Computational Genomics, Spring 2016
HW1: Sequence alignment and Evolution
Due: 24:00 EST, Feb 15, 2016 by autolab

Your goals in this assignment are to
1. Complete a genome assembler
2. Implement sequence alignment algorithms (global and local)
3. Work with Hidden Markov Models
4. Study evolution and phylogeny

What to hand in.
• One report (in PDF format) addressing each of the following questions.
• All source code. If a skeleton is provided, you just need to complete the script and send it back.

Submit a zip file containing the completed code (if any) and the PDF file (if any) to autolab. The zip file should have the following structure:

./S2016HW1.pdf
./Q1/   put all code related to Q1 here, if any
./Q2/   put all code related to Q2 here, if any

1. [20 points] Genome Assembler

In genome assembly, many short sequences (reads) from a sequencing machine are assembled into long sequences. By ordering overlapping reads we are likely to reconstruct the sequenced genome. For example, given the three sequences AGGTCGTAG, CGTAGAGCTGGGAG and GGGAGGTTGAAA, order them by their overlapping parts as follows:

AGGTCGTAG
    CGTAGAGCTGGGAG
             GGGAGGTTGAAA

The final genomic sequence is likely to be

AGGTCGTAGAGCTGGGAGGTTGAAA

Real genome assemblers are more sophisticated, but the idea is the same. We make two simplifying assumptions here.
• There are no sequencing errors and all overlaps are perfect matches.
• No read is nested inside another read. In other words, we only have overlaps of type A, never type B.

Type A
AAAAAA
   AAAAAA

Type B
AAAAAAAA
  AAAA

Your task for this question is to read the short reads from the input file, identify the overlaps, find the right order of the reads and reconstruct the sequenced genome. There is one text file, reads.txt, which contains all the short reads, and one Python skeleton, GenomeAssembler.py, which you need to complete. The input format of the text file is as follows:

1, ATCG . . .
2, TCGA . . .
3, CGAT . . .
...
On each line, the first number is the name (barcode) and the string following it is the short read itself. There is no header in the input file. Barcodes and reads are separated by commas.

Your task

(a) (4 points) Complete the function ReadDataFromfile(filename). Read the text file into memory and return one variable dat which contains all barcodes and their corresponding reads. If you are not comfortable using a single variable to record all the information, you may use your own variables, but make sure the output variables can be used directly as the input arguments in (b).
Hint: You could define dat as a dictionary which records barcodes and sequences at the same time.

(b) (3 points) Complete the function MeanLengthofReads(dat) and include the mean length of the input reads in your report.

The mean length of the input reads is 226.9 bases.

(c) (8 points) Compute the overlap lengths between all pairs of reads. Include the longest overlap segment and the two corresponding barcodes in your report. Also report the first (left-most) read and its barcode.

The longest overlap segment is 165 bases long. It is formed by 1 → 4.

ACATCTGTGAGTGAGAACAGGTGTCACCTTGAAGGTGGGAGTGATCAAAAGGACCT
TGTACAAGAGCTTCAGGAAGAGAAACCTTCATCTTCACATTTGGTTTCTAGACCAT
CTACCTCATCTAGAAGGAGAGCAATTAGTGAGACAGAAGAAAATTCAGATGAA

The left-most read is read 9.

ATGTGCAATACCAACATGTCTGTACCTACTGATGGTGCTGTAACCACCTCACAGA
TTCCAGCTTCGGAACAAGAGACCCTGGTTAGACCAAAGCCATTGCTTTTGAAGTT
ATTAAAGTCTGTTGGTGCACAAAAAGACACTTATACTATGAAAGAGGTTCTTTTT
TATCTTGGCCAGTATATTATGACTAAACGATTATATGATGAGAAGCAACAACATA
TTGTATATTGTTCAAATGATCTTCTAGGAGATTTGTTTGGCG

(d) (4 points) Assemble the reads into the sequenced genome. Report the final genome and the order of barcodes in which all reads are assembled.
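One possible way to approach parts (c) and (d) is sketched below. It is not the provided skeleton: the function names overlap_length and assemble and the min_overlap threshold are hypothetical choices made for illustration. The sketch relies on the two simplifying assumptions stated earlier (perfect overlaps, no nested reads) and additionally assumes the reads form a single chain in which each read's longest overlap points to its true successor.

```python
def overlap_length(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`.

    Only proper overlaps (shorter than both reads) are considered, in line
    with the assumption that no read is nested inside another.
    """
    for k in range(min(len(left), len(right)) - 1, 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def assemble(dat, min_overlap=3):
    """Order the reads in `dat` (barcode -> read) and merge them.

    `min_overlap` is an assumed threshold that filters out short chance
    overlaps, e.g. for the right-most read, which has no true successor.
    Returns (order_of_barcodes, genome).
    """
    # For each read, keep the other read it overlaps most on its right side.
    successor = {}
    for a, read_a in dat.items():
        candidates = [(overlap_length(read_a, read_b), b)
                      for b, read_b in dat.items() if b != a]
        if not candidates:
            continue
        k, b = max(candidates, key=lambda pair: pair[0])
        if k >= min_overlap:
            successor[a] = (k, b)

    # The left-most read is the only one that is nobody's successor.
    followers = {b for _, b in successor.values()}
    starts = [a for a in dat if a not in followers]
    if len(starts) != 1:
        raise ValueError("reads do not form a single chain")

    # Walk the chain, appending only the non-overlapping tail of each read.
    order = [starts[0]]
    genome = dat[starts[0]]
    while order[-1] in successor and len(order) < len(dat):
        k, nxt = successor[order[-1]]
        genome += dat[nxt][k:]
        order.append(nxt)
    return order, genome


# The three example reads from the introduction of this question.
reads = {1: "AGGTCGTAG", 2: "CGTAGAGCTGGGAG", 3: "GGGAGGTTGAAA"}
order, genome = assemble(reads)
print(order, genome)
```

On the three example reads this recovers the order 1, 2, 3 and the sequence AGGTCGTAGAGCTGGGAGGTTGAAA. Real data with repeats or sequencing errors would need a more careful strategy than picking each read's single best overlap.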
9 → 3 → 5 → 10 → 1 → 4 → 8 → 2 → 6 → 7

ATGTGCAATACCAACATGTCTGTACCTACTGATGGTGCTGTAACCACCTCACAGATTCCA
GCTTCGGAACAAGAGACCCTGGTTAGACCAAAGCCATTGCTTTTGAAGTTATTAAAGTCT
GTTGGTGCACAAAAAGACACTTATACTATGAAAGAGGTTCTTTTTTATCTTGGCCAGTAT
ATTATGACTAAACGATTATATGATGAGAAGCAACAACATATTGTATATTGTTCAAATGAT
CTTCTAGGAGATTTGTTTGGCGTGCCAAGCTTCTCTGTGAAAGAGCACAGGAAAATATAT
ACCATGATCTACAGGAACTTGGTAGTAGTCAATCAGCAGGAATCATCGGACTCAGGTACA
TCTGTGAGTGAGAACAGGTGTCACCTTGAAGGTGGGAGTGATCAAAAGGACCTTGTACAA
GAGCTTCAGGAAGAGAAACCTTCATCTTCACATTTGGTTTCTAGACCATCTACCTCATCT
AGAAGGAGAGCAATTAGTGAGACAGAAGAAAATTCAGATGAATTATCTGGTGAACGACAA
AGAAAACGCCACAAATCTGATAGTATTTCCCTTTCCTTTGATGAAAGCCTGGCTCTGTGT
GTAATAAGGGAGATATGTTGTGAAAGAAGCAGTAGCAGTGAATCTACAGGGACGCCATCG
AATCCGGATCTTGATGCTGGTGTAAGTGAACATTCAGGTGATTGGTTGGATCAGGATTCA
GTTTCAGATCAGTTTAGTGTAGAATTTGAAGTTGAATCTCTCGACTCAGAAGATTATAGC
CTTAGTGAAGAAGGACAAGAACTCTCAGATGAAGATGATGAGGTATATCAAGTTACTGTG
TATCAGGCAGGGGAGAGTGATACAGATTCATTTGAAGAAGATCCTGAAATTTCCTTAGCT
GACTATTGGAAATGCACTTCATGCAATGAAATGAATCCCCCCCTTCCATCACATTGCAAC
AGATGTTGGGCCCTTCGTGAGAATTGGCTTCCTGAAGATAAAGGGAAAGATAAAGGGGAA
ATCTCTGAGAAAGCCAAACTGGAAAACTCAACACAAGCTGAAGAGGGCTTTGATGTTCCT
GATTGTAAAAAAACTATAGTGAATGATTCCAGAGAGTCATGTGTTGAGGAAAATGATGAT
AAAATTACACAAGCTTCACAATCACAAGAAAGTGAAGACTATTCTCAGCCATCAACTTCT
AGTAGCATTATTTATAGCAGCCAAGAAGATGTGAAAGAGTTTGAAAGGGAAGAAACCCAA
GACAAAGAAGAGAGTGTGGAATCTAGTTTGCCCCTTAATGCCATTGAACCTTGTGTGATT
TGTCAAGGTCGACCTAAAAATGGTTGCATTGTCCATGGCAAAACAGGACATCTTATGGCC
TGCTTTACATGTGCAAAGAAGCTAAAGAAAAGGAATAAGCCCTGCCCAGTATGTAGACAA
CCAATTCAAATGATTGTGCTAACTTATTTCCCCTAG

(e) (1 point) What is this genome? Hint: UniProt http://www.uniprot.org/blast/

• A8WFP2: MDM2 protein, Homo sapiens (Human)
• Q00987: E3 ubiquitin-protein ligase Mdm2, Homo sapiens (Human)
• Q00987-11: Isoform 11 of E3 ubiquitin-protein ligase Mdm2, Homo sapiens (Human)

2. [14 points] Sequence alignment

Warning: You should implement the algorithm from scratch in Python.
If you use existing library functions to complete your tasks, only partial credit will be given.

Your task

(a) (6 points) Write a program to perform global alignment on the two DNA sequences in DNA.fasta. Use a match score of +5, a mismatch penalty of -4 and a gap penalty of -5. Run your algorithm and report your alignment and its score.

The score for the alignment is 14. The alignment is not unique and you just need to report one of them.

(b) (8 points) Write a program to find the best locally aligned region of the two proteins in PROTEIN.fasta. Use BLOSUM62 and set both the gap open and gap extend penalties to -5. Run your algorithm and report the alignment result. The BLOSUM62 matrix has been extracted into BLOSUM62Matrix.txt.

The score for the local alignment is 202. The alignment is not unique and you just need to report one of them. One way to check your algorithm is to compare its output with that of EMBOSS Water (http://www.ebi.ac.uk/Tools/psa/emboss_water/).

3. [24 points] Hidden Markov Model

Your task

(a) (4 points) A transcription factor (TF) is a protein that binds to certain DNA sequences in promoters and affects transcription. The binding sites of a TF that binds sequences of length n can be described using a 4 × n matrix in which the i-th column gives the probabilities of the different nucleotides at position i of the binding site. Describe an HMM for a promoter that can have multiple disjoint binding sites for a given TF, draw the topology and label all parameters.

Here we define three states: start, binding site and non-binding site. The transition probabilities $p_{s \to b}$, $p_{s \to n}$, $p_{n \to n}$, $p_{b \to n}$ and $p_{n \to b}$ are all labelled. Emission probability matrices are defined for the binding-site state and the non-binding-site state.
• Each position in the binding site needs to be defined as a separate state, with transition probability 1 to the next position. It is also acceptable to define A, C, G and T as four separate states instead of a single non-binding-site state, but the single state is preferred.
• The start state is not compulsory in your topology. It is also fine to include an end state, or to include the transition "binding site → binding site", which is unlikely to happen.

(b) (2 points) Suppose we know that the distance between binding sites should be between 100 and 200 nucleotides. Can we construct an efficient HMM to capture this? Sketch an HMM topology or explain why an HMM is not suitable.

The HMM proposed above cannot satisfy the length constraint. Each state depends only on the adjacent state, so the model has no way to incorporate higher-order information.
• You can also modify the existing HMM to make it work by playing the same trick as for the binding site. Treat the first 100 positions between binding sites as separate states, each with transition probability 1 to the next. Starting from position 101, you can define another state similar to the non-binding-site state above.
• You could argue either way about whether your HMM is efficient. We don't have a strong opinion on this point.

(c) Consider the following HMM, where transition probabilities are on the edges and emission probabilities are given in tables next to the nodes:

[Figure 1: HMM. States q0, q1, q2 and q3, with emission tables
q0: a=0.0 c=0.0 g=0.0 t=1.0
q1: a=0.1 c=0.8 g=0.1 t=0.0
q2: a=0.0 c=0.9 g=0.1 t=0.0
q3: a=0.2 c=0.3 g=0.5 t=0.0
and transitions q0→q1 = 0.2, q1→q1 = x, q1→q2 = 0.1, q1→q0 = 0.1, q2→q2 = 0.5, q2→q3 = 0.5, q3→q0 = 1.0, as used in the solutions that follow. Two further edges leaving q0 are labelled 0.2 and 0.6.]

• (4 points) Suppose we start in state q0. Give two paths that could emit the string tagcat. What are their probabilities?

First we need to evaluate the value of x.

$x = P(q_1 \to q_1) = 1 - 0.1 - 0.1 = 0.8$

In fact, there are only three possible paths.
Path I: q0(t) → q1(a) → q1(g) → q2(c) → q3(a) → q0(t)
$P = 1 \times 0.2 \times 0.1 \times 0.8 \times 0.1 \times 0.1 \times 0.9 \times 0.5 \times 0.2 \times 1 \times 1 = 1.44 \times 10^{-5}$

Path II: q0(t) → q1(a) → q2(g) → q2(c) → q3(a) → q0(t)
$P = 1 \times 0.2 \times 0.1 \times 0.1 \times 0.1 \times 0.5 \times 0.9 \times 0.5 \times 0.2 \times 1 \times 1 = 9 \times 10^{-6}$

Path III: q0(t) → q1(a) → q1(g) → q1(c) → q1(a) → q0(t)
$P = 1 \times 0.2 \times 0.1 \times 0.8 \times 0.1 \times 0.8 \times 0.8 \times 0.8 \times 0.1 \times 0.1 \times 1 = 8.192 \times 10^{-6}$

You could also report each probability as the posterior probability of the path given the sequence tagcat, for example

$P(\text{Path I} \mid \text{tagcat}) = \frac{1.44 \times 10^{-5}}{1.44 \times 10^{-5} + 9 \times 10^{-6} + 8.192 \times 10^{-6}} = 0.456$

• (8 points) Suppose we start in state q0 with probability 1. Compute and show the Viterbi dynamic programming matrix for the string tacccgt. What is the highest probability path for this string?

If you use probabilities,

     q0               q1               q2               q3
t    1                0                0                0
a    0                0.02             0                0
c    0                0.0128           0.0018           0
c    0                8.192 × 10^-3    1.152 × 10^-3    2.7 × 10^-4
c    0                5.2429 × 10^-3   7.3728 × 10^-4   1.728 × 10^-4
g    0                4.1943 × 10^-4   5.2429 × 10^-5   1.8432 × 10^-4
t    1.8432 × 10^-4   0                0                0

If you use log2 values of probabilities,

     q0        q1        q2        q3
t    0         -Inf      -Inf      -Inf
a    -Inf      -5.64     -Inf      -Inf
c    -Inf      -6.29     -9.12     -Inf
c    -Inf      -6.93     -9.76     -11.85
c    -Inf      -7.58     -10.41    -12.50
g    -Inf      -11.22    -14.22    -12.41
t    -12.41    -Inf      -Inf      -Inf

If you use natural-log (ln) values of probabilities,

     q0        q1        q2        q3
t    0         -Inf      -Inf      -Inf
a    -Inf      -3.91     -Inf      -Inf
c    -Inf      -4.36     -6.32     -Inf
c    -Inf      -4.80     -6.77     -8.22
c    -Inf      -5.25     -7.21     -8.66
g    -Inf      -7.78     -9.86     -8.60
t    -8.60     -Inf      -Inf      -Inf

The most probable path is q0 → q1 → q1 → q1 → q2 → q3 → q0.

(d) (4 points) Recall from class that a profile HMM looks something like this:

[Figure 2: profile HMM with Begin and End states, match states M1 to M6, insertion states I0 to I6 and deletion states D1 to D6]

where the I, D and M states are insertion, deletion and match states, respectively. Suppose you have two functions defined for you: Viterbi(M, x), which returns the most probable path through an HMM M for an observed sequence x, and Train(x1, x2, ..., xk, p1, p2, ..., pk), which takes in sequences and their paths and returns an HMM of the form above with all the optimal parameters set so that the sequences x1, ..., xk are likely to be output by the HMM. Explain how you can use these functions to perform a multiple sequence alignment (short pseudocode would be helpful). Explain one reason why this approach may fail to produce the correct alignment.

Given k sequences with lengths $l_1, \ldots, l_k$:
i. Calculate the average length of these sequences, $\bar{l}$, and define $\bar{l}$ match states, $\bar{l}$ deletion states, $\bar{l}$ insertion states and two additional states ("begin" and "end"). This fixes the number of states in the profile HMM we are going to use.
ii. Set up an initial group of parameters, in other words, M0. (You could instead start by initializing p1, ..., pk, in which case the following steps will be a little different.)
iii. p1 = Viterbi(M0, x1), ..., pk = Viterbi(M0, xk)
iv. while the HMM has not converged (or p1, ..., pk are not stable) do
        M ← Train(x1, ..., xk, p1, ..., pk)
        p1 ← Viterbi(M, x1), ..., pk ← Viterbi(M, xk)
    end
v. return p1, ..., pk and M
vi. Based on the output, generate the corresponding alignment profile.

There is a high chance that this workflow gets stuck in a local optimum. Also, there is no guarantee of convergence. You should be careful about the initial parameters.
• The lengths of these sequences may not be the same, so first we need to figure out how many states we are going to use. The average length is one choice.
• Some of the reasons may still apply if you don't use a profile HMM, so be specific in arguing why the profile HMM may fail.
• Some reference material on the profile HMM:
  HMM for sequence alignment: profile HMM, http://goo.gl/5qMXF7
  Profile Hidden Markov Models, https://www.cs.princeton.edu/~mona/Lecture/HMM1.pdf

(e) (2 points) True or false: (if true, give a one-sentence justification; if false, give a counterexample.)
When learning an HMM for a fixed set of observations, assuming we do not know the true number of hidden states (which is often the case), we can always increase the training data likelihood by permitting more hidden states.

True. In the worst case, we could assign one hidden state to each output value in the training sequence and achieve a perfect fit. (Note: if you considered the scenario where the number of states is already larger than the number of observed output values and answered 'False', you will also get full credit.)

4. [12 points] Evolution and Phylogeny

Your task

(a) Suppose you are given two sequences that are each 200 bases long. They differ by 22 transitions and 6 transversions.

• (2 points) Using the Jukes-Cantor model, calculate the expected number of substitutions that have occurred since these sequences diverged from a common ancestor.

$K_{JC} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}P_D\right) = -\frac{3}{4}\ln\left(1 - \frac{4}{3} \times \frac{22+6}{200}\right) = 0.155$

Here $P_D$ is the fraction of sites at which the two sequences differ. The expected number of substitutions is 0.155 × 200 = 31.

• (2 points) Redo the computation with the Kimura two-parameter model.

$K_{K2P} = -\frac{1}{2}\ln(1 - 2P - Q) - \frac{1}{4}\ln(1 - 2Q) = -\frac{1}{2}\ln\left(1 - 2 \times \frac{22}{200} - \frac{6}{200}\right) - \frac{1}{4}\ln\left(1 - 2 \times \frac{6}{200}\right) = 0.159$

Here $P$ and $Q$ are the fractions of sites showing transitions and transversions, respectively. The expected number of substitutions is 0.159 × 200 = 32.

• (4 points) Now you are given two sequences of length 200 which differ by 54 transitions and 18 transversions. Repeat the computation with the JC and Kimura two-parameter models. Based on these results, which model do you think is preferable when sequences are less and more divergent, respectively?

$K_{JC} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}P_D\right) = -\frac{3}{4}\ln\left(1 - \frac{4}{3} \times \frac{54+18}{200}\right) = 0.490$

The expected number of substitutions is 0.490 × 200 = 98.

$K_{K2P} = -\frac{1}{2}\ln(1 - 2P - Q) - \frac{1}{4}\ln(1 - 2Q) = -\frac{1}{2}\ln\left(1 - 2 \times \frac{54}{200} - \frac{18}{200}\right) - \frac{1}{4}\ln\left(1 - 2 \times \frac{18}{200}\right) = 0.547$

The expected number of substitutions is 0.547 × 200 = 109.

When the sequences are less divergent, the two models give similar results. When the sequences are more divergent, the Kimura two-parameter model is preferred because it better captures the real number of substitutions.
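The four distances above can be checked numerically. The sketch below is only an illustration; the function names jukes_cantor and kimura_2p are hypothetical, and the formulas are the ones used in the solutions to part (a).

```python
import math


def jukes_cantor(p_diff):
    """Jukes-Cantor distance from the fraction of differing sites."""
    return -0.75 * math.log(1 - 4.0 * p_diff / 3.0)


def kimura_2p(p_ts, q_tv):
    """Kimura two-parameter distance from the transition fraction p_ts
    and the transversion fraction q_tv."""
    return -0.5 * math.log(1 - 2 * p_ts - q_tv) - 0.25 * math.log(1 - 2 * q_tv)


length = 200
results = []
for transitions, transversions in [(22, 6), (54, 18)]:
    p_ts = transitions / length
    q_tv = transversions / length
    k_jc = jukes_cantor(p_ts + q_tv)
    k_k2p = kimura_2p(p_ts, q_tv)
    results.append((k_jc, k_k2p))
    # Distances are per site; multiply by the length for expected counts.
    print(f"{transitions} ts, {transversions} tv: "
          f"JC = {k_jc:.3f} ({k_jc * length:.0f} substitutions), "
          f"K2P = {k_k2p:.3f} ({k_k2p * length:.0f} substitutions)")
```

The gap between the two estimates grows from about one substitution in the first case to about eleven in the second, which is the pattern the closing remark of part (a) describes.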
(b) (4 points) You are given the following set of aligned sequences:

1, TCAA
2, GCAT
3, TTTT
4, GATA
5, GAAC
6, ATAG

Find the parsimony score for the tree ((((1, 2),(3, 4)), 5), 6). Indicate the sequence at each vertex of the tree.

The parsimony score is 11. You can calculate the parsimony score for each site and sum them up to get the final value.
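The per-site calculation can be sketched with Fitch's small-parsimony algorithm, which the handout does not name; it is used here as one standard way to score a fixed binary tree. The function names are hypothetical, and the sketch treats the last listed sequence, ATAG, as leaf 6 so that it matches the tree.

```python
def fitch_site(tree, states):
    """Return (candidate_state_set, cost) for one alignment column.

    `tree` is a leaf name or a 2-tuple of subtrees; `states` maps each
    leaf name to its character at this site.
    """
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: take the union and count one substitution.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Return (per_site_costs, total_cost) for aligned sequences `seqs`."""
    n_sites = len(next(iter(seqs.values())))
    per_site = [
        fitch_site(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(n_sites)
    ]
    return per_site, sum(per_site)


seqs = {1: "TCAA", 2: "GCAT", 3: "TTTT", 4: "GATA", 5: "GAAC", 6: "ATAG"}
tree = ((((1, 2), (3, 4)), 5), 6)
per_site, total = parsimony_score(tree, seqs)
print(per_site, total)
```

The per-site costs come out as 3, 3, 1 and 4, which sum to the reported score of 11. Assigning a concrete sequence to each internal vertex would need an additional top-down pass that picks one state from each candidate set.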