Sequence Statistics - Rowan University

BIOINFORMATICS
ECE 402/504
Lecture 6-7
Sequence Statistics
ROBI POLIKAR
SIGNAL PROCESSING & PATTERN RECOGNITION
LABORATORY @ ROWAN UNIVERSITY
© 2011, All Rights Reserved, Robi Polikar.
These lecture notes are prepared by Robi Polikar. Unauthorized use,
including duplication, even in part, is not allowed without an explicit
written permission. Such permission will be given – upon request – for
noncommercial educational purposes if you agree to all of the following:
1. Restrict the usage of this material for noncommercial and nonprofit
educational purposes only; AND
2. The entire presentation is kept together as a whole, including this page
and this entire notice; AND
3. You include the following link/reference on your site:
© Robi Polikar http://engineering.rowan.edu/~polikar.
Bioinformatics, © 2011 Robi Polikar, Rowan University
THIS WEEK IN BIOINFORMATICS
• Probabilistic Models of Genome Sequences
• Sequence Statistics
• Gene Finding
Photo / diagram credits
Courtesy: National Human Genome Research Institute
CH: N. Cristianini & M. W. Hahn, Introduction to Computational Genomics, Cambridge, © 2007
ZB: M. Zvelebil & J. O. Baum, Understanding Bioinformatics, Garland Science, © 2008
RP: Robi Polikar, All Rights Reserved © 2011.
DEFINITIONS
• A DNA sequence s is a finite string over the alphabet N_DNA = {A, C, G, T} of nucleotides. A genome is the set of all DNA sequences associated with an organism. An RNA sequence, on the other hand, is a string over the alphabet N_RNA = {A, C, G, U}.
  • Elements of a sequence can be written as si or s(i).
  • We can use MATLAB-like syntax to refer to subsequences.
    • s = GTCGAGCTAGGATCA ⇒ s(3:6) = CGAG, s(10) = G
• Similarly, an amino acid sequence is a string over the 20-letter alphabet A = {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V}.
• There are two commonly used models to represent the underlying genomic phenomenon that generates the DNA sequences (both of which are, strictly speaking, wrong!):
  • Multinomial sequence model
  • Markov sequence model
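The subsequence notation above is 1-based and includes both endpoints, whereas many programming languages index from 0 and exclude the end of a slice. The following minimal Python sketch shows the translation; the helper name subseq is hypothetical and introduced only for illustration.

```python
def subseq(s, start, end=None):
    """Return s(start:end) in the slides' 1-based, inclusive convention."""
    if end is None:          # s(i): a single element
        end = start
    return s[start - 1:end]  # Python slices are 0-based and end-exclusive

s = "GTCGAGCTAGGATCA"
print(subseq(s, 3, 6))   # s(3:6) -> CGAG (four letters: positions 3, 4, 5, 6)
print(subseq(s, 10))     # s(10)  -> G
```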
MULTINOMIAL MODEL
• A DNA sequence is generated by randomly drawing letters from the alphabet N_DNA = {A, C, G, T}, where the letters are assumed to be independent and identically distributed (i.i.d.).
  • For any given sequence position i, we independently draw one of these four letters from the same distribution over the alphabet N_DNA.
  • This fixed distribution simply assigns a probability to each letter, p = (pA, pC, pG, pT), such that the probability of observing nucleotide x at position i of the sequence s is px = P(s(i) = x).
  • Since each experiment (drawing a letter from this distribution) has a finite number of outcomes, this is a discrete distribution defined by its probability mass function.
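The multinomial probability mass function is

$$P(X_1 = x_1, \ldots, X_k = x_k) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}, \qquad \sum_{i=1}^{k} p_i = 1, \quad \sum_{i=1}^{k} x_i = n$$

where
  • x_i is the number of times outcome i is observed (e.g., x_1 is the number of times an A is observed),
  • k is the number of possible outcomes (4 for DNA),
  • p_i is the probability of success for the ith outcome, so that pA + pC + pG + pT = 1,
  • n is the number of trials (the sequence length).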
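As a small numerical illustration of the multinomial probability mass function, the Python sketch below evaluates it for given counts and letter probabilities. The function name multinomial_pmf and the example values are assumptions made for illustration only.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """n! / (x_1! ... x_k!) * p_1^x_1 ... p_k^x_k, with n = sum of the counts."""
    n = sum(counts)
    coeff = factorial(n)
    for x in counts:
        coeff //= factorial(x)   # every partial quotient is an integer
    return coeff * prod(p ** x for p, x in zip(probs, counts))

# Probability of seeing each of A, C, G, T exactly once in n = 4 draws
# when all four letters are equally likely: 4! / 4^4.
print(multinomial_pmf([1, 1, 1, 1], [0.25] * 4))
```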
MULTINOMIAL MODEL
• Because of the i.i.d. assumption, the multinomial model allows us to compute the likelihood of observing any given sequence simply by multiplying the individual probabilities. Given s = s1s2…sn,

$$P(s) = \prod_{i=1}^{n} p(s(i))$$

• Note that the i.i.d. assumption makes this an unrealistic model: we know that the DNA sequence is not completely random.
• However, this model explains quite a bit of the behavior of DNA data.
  • Finding the regions of DNA where this model is violated does in fact lead to interesting findings.
  • We can easily evaluate the validity of this assumption by looking at the frequency distributions of the letters in specific regions, and checking whether these distributions change from region to region.
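The product of per-letter probabilities underflows quickly for long sequences, so in practice one usually works with its logarithm. The Python sketch below computes the log-likelihood of a sequence under the multinomial model; the function names are hypothetical, and estimating the letter probabilities by relative frequency is an assumption made for illustration.

```python
import math
from collections import Counter

def estimate_base_probs(seq):
    """Relative frequency of A, C, G, T in seq, used as estimates of pA, pC, pG, pT."""
    counts = Counter(ch for ch in seq.upper() if ch in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}

def multinomial_log_likelihood(seq, probs):
    """log P(s) = sum over i of log p(s(i)) under the i.i.d. model."""
    return sum(math.log(probs[ch]) for ch in seq.upper())

s = "GTCGAGCTAGGATCA"
uniform = {b: 0.25 for b in "ACGT"}
print(multinomial_log_likelihood(s, uniform))                 # 15 * log(0.25)
print(multinomial_log_likelihood(s, estimate_base_probs(s)))  # never lower than the uniform value
```

The sketch assumes every letter of the sequence has a nonzero probability; a zero entry would make the logarithm undefined.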
MARKOV SEQUENCE MODEL
• This is a more complex, but possibly more accurate, model based on Markov chains.
• A Markov chain is a series of discrete observations, called states, where the probability of observing the next state is given by fixed transition probabilities, collected in the transition matrix T.
  • In the context of bioinformatics, the states are the individual nucleotides (or amino acids).
• In a Markov chain, the probability of observing any one of a finite number of outcomes depends on the previous observations; specifically, on the transition probabilities of observing a particular outcome after another one.
  • If the process is in state x at step i, then the probability of being in state y at step i+1 is given by the transition probability pxy:

$$p_{xy} = P(s_{i+1} = y \mid s_i = x)$$

  • The transition probabilities between all possible pairs of states then determine the transition matrix.
MARKOV CHAINS
• Formally, a first-order Markov chain is a sequence of discrete-valued random variables X1, X2, … that satisfies the Markov property:

$$P(X_{n+1} = x \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = P(X_{n+1} = x \mid X_n = x_n)$$

  • i.e., the probability of observing the next state (outcome) depends only on the present state. In other words, given the present state, future and past states are independent.
  • The multinomial model can be interpreted as a zeroth-order Markov chain, where there is no dependence on the previous outcomes.
  • A Markov chain of order m is a process where the next state depends on the m previous states.
• The probability of observing a particular sequence is then given by

$$P(s) = P(s_n \mid s_{n-1})\, P(s_{n-1} \mid s_{n-2}) \cdots P(s_2 \mid s_1)\, \pi(s_1) = \pi(s_1) \prod_{i=2}^{n} p_{s_{i-1} s_i}$$

where π(s1) is the probability for the starting state to be s1.
MARKOV CHAINS
• For the 4-letter DNA nucleotide sequence, we have the following model. Given T and π, we can compute the probability of any given sequence.

[Figure: state diagram of the four-state (A, C, G, T) Markov chain, with arrows between the states labeled by the transition probabilities pxy. Credit: RP]

$$T = \begin{bmatrix} p_{AA} & p_{AC} & p_{AG} & p_{AT} \\ p_{CA} & p_{CC} & p_{CG} & p_{CT} \\ p_{GA} & p_{GC} & p_{GG} & p_{GT} \\ p_{TA} & p_{TC} & p_{TG} & p_{TT} \end{bmatrix}, \qquad \pi = (\pi_A, \pi_C, \pi_G, \pi_T)$$

• Later, we will use Markov chains, and specifically hidden Markov models, in detailed sequence analysis.
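To make the role of T and π concrete, here is a minimal Python sketch that evaluates the probability of a sequence under a first-order Markov chain, in log space. The numerical values of pi and T below are hypothetical, chosen only so that each row of T sums to 1; they are not estimated from any genome.

```python
import math

def markov_log_prob(seq, pi, T):
    """log P(s) = log pi(s1) + sum over i = 2..n of log p_{s(i-1) s(i)}."""
    seq = seq.upper()
    logp = math.log(pi[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(T[prev][cur])
    return logp

# Hypothetical parameters for illustration: T[x][y] is the probability of y following x.
pi = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
T = {
    "A": {"A": 0.40, "C": 0.20, "G": 0.20, "T": 0.20},
    "C": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "G": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "T": {"A": 0.20, "C": 0.20, "G": 0.20, "T": 0.40},
}
print(math.exp(markov_log_prob("AAT", pi, T)))  # pi_A * p_AA * p_AT
```

If every row of T equals the same distribution, the result reduces to the multinomial (zeroth-order) likelihood, which is a convenient sanity check.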
STATISTICAL SEQUENCE ANALYSIS
• There are many sophisticated statistical analyses that can be performed on any given sequence.
• Some of the simplest ones, however, still provide very important information:
  • Base composition: the relative frequency (proportion) of the individual nucleotides in a given sequence.
    • The sequence can be the entire genome, giving us the global proportions of these nucleotides, or a subsequence, giving us the local proportions. Plotting the local fluctuations across the sequence can provide even more interesting clues about a given species.
  • Nucleotide density: the running proportion of nucleotides across the genome
  • Dimer count: the number of each dimer (2-mer) in the genome
  • N-mer count: the number of each N-mer in the genome
  • Codon count: the number of each codon in the genome
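The difference between global and local base composition can be sketched in a few lines of Python. The function name base_composition is hypothetical, and ignoring non-ACGT symbols is an assumption made to keep the example short.

```python
from collections import Counter

def base_composition(seq):
    """Relative frequency of A, C, G, T among the unambiguous bases of seq."""
    counts = Counter(ch for ch in seq.upper() if ch in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}

s = "GTCGAGCTAGGATCA"
print(base_composition(s))      # global proportions over all 15 letters
print(base_composition(s[:5]))  # local proportions over the first 5 letters, GTCGA
```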
INTRODUCING MATLAB’S BIOINFORMATICS TOOLBOX
getgenbank Retrieve sequence information from GenBank database
Data = getgenbank(AccessionNumber)
getgenbank(..., 'PartialSeq', PartialSeqValue, ...)
getgenbank(..., 'FileFormat', FileFormatValue, ...)
getgenbank(..., 'ToFile', ToFileValue, ...)
getgenbank(..., 'SequenceOnly', SequenceOnlyValue, ...)
AccessionNumber: String specifying a unique alphanumeric identifier for a sequence record.
PartialSeqValue: Two-element array of integers [StartBP, EndBP] that specifies a subsequence to retrieve. StartBP is an integer between 1
and EndBP; EndBP is an integer between StartBP and the length of the sequence.
ToFileValue: String specifying either a file name or a path and file name for saving the GenBank data. If you specify only a file
name, the file is saved to the MATLAB Current Directory.
FileFormatValue: String specifying the format for the file specified with the 'ToFile' property. Choices are 'GenBank' (default) or 'FASTA'.
SequenceOnlyValue: Controls the return of only the sequence as a character array. Choices are true or false (default).
getgenbank retrieves nucleotide information from the GenBank database, maintained by the National Center for Biotechnology Information
(NCBI) at http://www.ncbi.nlm.nih.gov/Genbank/
Data = getgenbank(AccessionNumber) searches for the accession number in the GenBank database and returns a MATLAB structure containing
information for the sequence.
getgenbank(..., 'PartialSeq', PartialSeqValue, ...) returns the specified subsequence in the Sequence field of the MATLAB structure.
getgenbank(..., 'ToFile', ToFileValue, ...) saves the data returned from the GenBank database to a file. To append GenBank data to an existing file,
specify that file name, and the data will be added to the end of the file.
genbankread Read data from GenBank file
GenBankData = genbankread(File) reads in a GenBank-formatted file, File, and creates a structure, GenBankData, containing fields
corresponding to the GenBank keywords. Each separate sequence listed in the output structure GenBankData is stored as a separate element of
the structure.
GETGENBANK - GENBANKREAD
getgenbank('nm_000520', 'ToFile', 'TaySachs_Gene.txt')
s = genbankread('TaySachs_Gene.txt')
s.SourceOrganism
ans =
Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
Euteleostomi; Mammalia; Eutheria; Euarchontoglires;
Primates; Haplorrhini; Catarrhini; Hominidae; Homo.
s.Definition
ans =
Homo sapiens hexosaminidase A (alpha
polypeptide) (HEXA), mRNA.
s.Sequence
ans =
agttgccgacgcccggcacaatccgctgcacgtagcaggagcctcaggtccaggccggaagtgaaagggcagggtgtgggtcctcctggggtcgcaggcgcagagccgcctctggtcacgtgattcgccgataagtcacgggggcgccgctcacctgaccagggtctcacgt
ggccagccccctccgagaggggagaccagcgggccatgacaagctccaggctttggttttcgctgctgctggcggcagcgttcgcaggacgggcgacggccctctggccctggcctcagaacttccaaacctccgaccagcgctacgtcctttacccgaacaactttcaattcc
agtacgatgtcagctcggccgcgcagcccggctgctcagtcctcgacgaggccttccagcgctatcgtgacctgcttttcggttccgggtcttggccccgtccttacctcacagggaaacggcatacactggagaagaatgtgttggttgtctctgtagtcacacctggatgtaac
cagcttcctactttggagtcagtggagaattataccctgaccataaatgatgaccagtgtttactcctctctgagactgtctggggagctctccgaggtctggagacttttagccagcttgtttggaaatctgctgagggcacattctttatcaacaagactgagattgaggactt
tccccgctttcctcaccggggcttgctgttggatacatctcgccattacctgccactctctagcatcctggacactctggatgtcatggcgtacaataaattgaacgtgttccactggcatctggtagatgatccttccttcccatatgagagcttcacttttccagagctcatgaga
aaggggtcctacaaccctgtcacccacatctacacagcacaggatgtgaaggaggtcattgaatacgcacggctccggggtatccgtgtgcttgcagagtttgacactcctggccacactttgtcctggggaccaggtatccctggattactgactccttgctactctgggtctg
agccctctggcacctttggaccagtgaatcccagtctcaataatacctatgagttcatgagcacattcttcttagaagtcagctctgtcttcccagatttttatcttcatcttggaggagatgaggttgatttcacctgctggaagtccaacccagagatccaggactttatgagg
aagaaaggcttcggtgaggacttcaagcagctggagtccttctacatccagacgctgctggacatcgtctcttcttatggcaagggctatgtggtgtggcaggaggtgtttgataataaagtaaagattcagccagacacaatcatacaggtgtggcgagaggatattccag
tgaactatatgaaggagctggaactggtcaccaaggccggcttccgggcccttctctctgccccctggtacctgaaccgtatatcctatggccctgactggaaggatttctacatagtggaacccctggcatttgaaggtacccctgagcagaaggctctggtgattggtggag
aggcttgtatgtggggagaatatgtggacaacacaaacctggtccccaggctctggcccagagcaggggctgttgccgaaaggctgtggagcaacaagttgacatctgacctgacatttgcctatgaacgtttgtcacacttccgctgtgaattgctgaggcgaggtgtccag
gcccaacccctcaatgtaggcttctgtgagcaggagtttgaacagacctgagccccaggcaccgaggagggtgctggctgtaggtgaatggtagtggagccaggcttccactgcatcctggccaggggacggagccccttgccttcgtgccccttgcctgcgtgcccctgtgct
tggagagaaaggggccggtgctggcgctcgcattcaataaagagtaatgtggcatttttctataataaacatggattacctgtgtttaaaaaaaaaagtgtgaatggcgttagggtaagggcacagccaggctggagtcagtgtctgcccctgaggtcttttaagttgaggg
ctgggaatgaaacctatagcctttgtgctgttctgccttgcctgtgagctatgtcactcccctcccactcctgaccatattccagacacctgccctaatcctcagcctgctcacttcacttctgcattatatctccaaggcgttggtatatggaaaaagatgtaggggcttggaggt
gttctggacagtggggagggctccagacccaacctggtcacagaagagcctctcccccatgcatactcatccacctccctcccctagagctattctcctttgggtttcttgctgcttcaattttatacaaccattatttaaatattattaaacacatattgttctcta
SEQDISP
seqdisp Format long sequence output for easy viewing
seqdisp(Seq) displays a sequence in rows, with a default row length of 60 and a default column width of 10.
seqdisp(Seq, ...'PropertyName', PropertyValue, ...) calls seqdisp with optional properties that use property name/property
value pairs. You can specify one or more properties in any order. Each PropertyName must be enclosed in single
quotation marks and is case insensitive. These property name/property value pairs are as follows:
seqdisp(Seq, ...'Row', RowValue, ...) specifies the length of each row for the displayed sequence.
seqdisp(Seq, ...'Column', ColumnValue, ...) specifies the number of letters to display before adding a space. RowValue must
be larger than and evenly divisible by ColumnValue.
seqdisp(Seq, ...'ShowNumbers', ShowNumbersValue, ...) controls the display of numbers at the start of each row. Choices
are true (default) to show numbers, or false to hide numbers
>> seqdisp(s.Sequence, 'Row', 100, 'Column', 10)
• To get sequence data only, without the additional information, use
Hflu = getgenbank('NC_000907','SequenceOnly',true);
  • This is the genome of Haemophilus influenzae, which, despite its name, is a bacterium and not the influenza virus.
• Here are some additional useful MATLAB functions for sequence analysis:
  • basecount(): counts nucleotides in a sequence
  • ntdensity(): computes the nucleotide density along a sequence
  • dimercount(): counts the dimers (2-mers) in a sequence
  • codoncount(): counts codons in a sequence
BASECOUNT
Basecount Count nucleotides in sequence
NTStruct = basecount(SeqNT) counts the number of each type of base in a SeqNT, a nucleotide sequence, and returns the counts in NTStruct, a
1-by-1 MATLAB structure containing the fields A, C, G, and T. For sequences with the character U, the number of U is added to the T field.
seqNT can be one of the following: String of codes specifying a nucleotide sequence OR Row vector of integers specifying a nucleotide sequence
OR MATLAB structure containing a Sequence field that contains a nucleotide sequence, such as returned by emblread, fastaread, genbankread,
getembl, or getgenbank.
• If a sequence contains ambiguous nucleotide characters (R, Y, K, M, S, W, B, D, H, V, or N), or gaps indicated by a hyphen (-), then these are
counted in the field Others, and the warning message appears: Warning: Ambiguous symbols appear in the sequence. These will be in Others.
• If a sequence contains undefined nucleotide characters (E, F, H, I, J, L, O, P, Q, X, or Z), then these characters are counted in the field Others, and
the following warning message appears. Warning: Unknown symbols appear in the sequence. These will be in Others.
• If the property 'Others' is set to 'full', ambiguous characters are listed separately in the fields R, Y, K, M, S, W, B, D, H, V, N, and Gap.
NTStruct = basecount(SeqNT, ...'Chart', ChartValue, ...) creates a chart showing the relative proportions of the nucleotides. ChartValue can be
'pie' or 'bar'.
NTStruct = basecount(SeqNT, ...'Others', OthersValue, ...) specifies how to count ambiguous characters (R, Y, K, M, S, W, B, D, H, V, and N), or gaps
indicated by a hyphen (-). Choices are 'full' (lists the ambiguous characters in separate fields) or 'bundle' (lists the ambiguous characters together
in the field Others). Default is 'bundle'.
NTStruct = basecount(SeqNT, ...'Structure', StructureValue, ...) suppresses the unknown characters warning when set to 'full'.
basecount(SeqNT) — Displays fields for four nucleotides, and, if there are ambiguous and unknown characters, add an Others field with the
ambiguous and unknown character counts.
basecount(SeqNT, 'Others', 'full') — Displays fields for 4 nucleotides, 11 ambiguous nucleotides, gaps, and, if there are unknown characters, adds
an Others field with the unknown counts.
basecount(SeqNT, 'Structure', 'full') — Displays fields for four nucleotides and an Others field. If there are ambiguous or unknown characters,
adds the counts to the Others field; otherwise displays 0 in the Others field.
basecount(SeqNT, 'Others', 'full', 'Structure', 'full') — Displays fields for 4 nucleotides, 11 ambiguous nucleotides, gaps, and an Others field. If
there are unknown characters, adds the counts to the Others field; otherwise displays 0 in the Others field.
NUCLEOTIDE LETTER CODES
int2nt Convert nucleotide sequence from integer to letter representation
SeqChar = int2nt(SeqInt)
nt2int Convert nucleotide sequence from letter to integer representation
SeqInt = nt2int(SeqChar)
[Table: nucleotide letter codes, including the ambiguous characters.]
NUCLEOTIDE DENSITY
ntdensity Plot density of nucleotides along sequence
ntdensity(SeqNT) plots the density of nucleotides A, C, G, and T in sequence SeqNT.
seqNT can be one of the following:
• String of codes specifying a nucleotide sequence.
• Row vector of integers specifying a nucleotide sequence.
• MATLAB structure containing a Sequence field that contains a nucleotide sequence, such as returned by emblread, fastaread, genbankread,
getembl, or getgenbank.
Density = ntdensity(SeqNT) returns a MATLAB structure with the density of nucleotides A, C, G, and T.
... = ntdensity(SeqNT, ...'PropertyName', PropertyValue, ...) calls ntdensity with optional properties that use property name/property value pairs.
You can specify one or more properties in any order. Each PropertyName must be enclosed in single quotation marks and is case insensitive.
These property name/property value pairs are as follows:
... = ntdensity(..., 'Window', WindowValue, ...) uses a window of length WindowValue for the density calculation. Default WindowValue is
length(SeqNT)/20.
[Density, HighCG] = ntdensity(..., 'CGThreshold', CGThresholdValue, ...) returns indices for regions where the CG content of SeqNT is greater than
CGThresholdValue. Default CGThresholdValue is 5.
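To show what a windowed density computation involves, the following Python sketch slides a fixed-length window along the sequence and records the fraction of each nucleotide in it. This is not the toolbox's implementation of ntdensity, whose handling of sequence ends may differ; the function name nt_density is hypothetical, and only the default window of one twentieth of the sequence length mirrors the documented default.

```python
def nt_density(seq, window=None):
    """Fraction of A, C, G, T in every full window, moving one position at a time."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    if window is None:
        window = max(1, n // 20)      # mirrors the documented default length(SeqNT)/20
    window = min(window, n)
    counts = {b: seq[:window].count(b) for b in "ACGT"}
    density = {b: [counts[b] / window] for b in "ACGT"}
    for i in range(window, n):
        leaving, entering = seq[i - window], seq[i]
        if leaving in counts:         # symbols other than A, C, G, T are ignored
            counts[leaving] -= 1
        if entering in counts:
            counts[entering] += 1
        for b in "ACGT":
            density[b].append(counts[b] / window)
    return density

d = nt_density("AAAACCCC", window=4)
print(d["A"])   # falls from 1.0 to 0.0 as the window moves from the A run into the C run
```

Updating the counts incrementally keeps the cost linear in the sequence length, which matters for a genome of a few million letters.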
DIMER / NMER COUNT
dimercount Count dimers in nucleotide sequence
Dimers = dimercount(SeqNT) counts the nucleotide dimers in SeqNT, a nucleotide sequence, and returns the dimer counts in Dimers, a MATLAB
structure containing the fields AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, and TT.
For sequences that have dimers with the character U, these dimers are added to the corresponding dimers containing a T.
If the sequence contains ambiguous nucleotide characters (R, Y, K, M, S, W, B, D, H, V, or N), or gaps indicated by a hyphen (-), then these are
counted in the field Others, and the following warning message appears: Warning: Ambiguous symbols appear in the sequence. These will be in Others.
If the sequence contains undefined nucleotide characters (E, F, H, I, J, L, O, P, Q, X, or Z), then dimers containing these characters are counted in
the field Others, and the following warning message appears: Warning: Unknown symbols appear in the sequence. These will be in Others.
[Dimers, Percent] = dimercount(SeqNT) returns Percent, a 4-by-4 matrix with the relative proportions of the dimers in SeqNT. The rows
correspond to A, C, G, and T in the first element of the dimer, and the columns correspond to A, C, G, and T in the second element of the dimer.
... = dimercount(SeqNT, 'Chart', ChartValue) creates a chart showing the relative proportions of the dimers. ChartValue can be 'pie' or 'bar'.
nmercount Count n-mers in nucleotide or amino acid sequence
Nmer = nmercount(Seq, Length) counts the n-mers or patterns of a specific length in Seq, a nucleotide sequence or amino acid sequence, and
returns the n-mer counts in a cell array.
Nmer = nmercount(Seq, Length, C) returns only the n-mers with cardinality of at least C.
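The dimer counts and the 4-by-4 proportion matrix described above can be imitated with a short Python sketch. The function names are hypothetical and this is not the toolbox code; in particular, normalizing by the sum of the 16 regular dimer counts (leaving Others out of the denominator) is an assumption.

```python
from collections import Counter

BASES = "ACGT"

def dimer_counts(seq):
    """Count overlapping dimers; any dimer with a non-ACGT symbol goes to 'Others'."""
    seq = seq.upper().replace("U", "T")   # U is counted as T
    counts = Counter()
    for i in range(len(seq) - 1):
        d = seq[i:i + 2]
        counts[d if set(d) <= set(BASES) else "Others"] += 1
    return counts

def dimer_proportions(counts):
    """4-by-4 matrix: rows = first base, columns = second base, in the order A, C, G, T."""
    total = sum(counts[a + b] for a in BASES for b in BASES)
    return [[counts[a + b] / total for b in BASES] for a in BASES]

c = dimer_counts("AACGTN")        # dimers: AA, AC, CG, GT, TN
print(dict(c))
print(dimer_proportions(c)[0])    # row for dimers starting with A
```

Dividing each row of the count matrix by its row sum, instead of by the grand total, would give an estimate of the transition matrix T of a first-order Markov chain.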
CODON COUNT
codoncount Count codons in nucleotide sequence
Codons = codoncount(SeqNT) counts the codons in SeqNT, a nucleotide sequence, and returns the codon counts in Codons, a MATLAB structure
containing fields for the 64 possible codons (AAA, AAC, AAG, ..., TTG, TTT).
For sequences that have codons with the character U, these codons are added to the corresponding codons containing a T.
If the sequence contains ambiguous nucleotide characters (R, Y, K, M, S, W, B, D, H, V, or N), or gaps indicated by a hyphen (-), then these are
counted in the field Others, and the following message appears: Warning: Ambiguous symbols appear in the sequence. These will be in Others.
If the sequence contains undefined nucleotide characters (E, F, H, I, J, L, O, P, Q, X, or Z), then codons containing these characters are counted in
the field Others, and the following warning message appears: Warning: Unknown symbols appear in the sequence. These will be in Others.
[Codons, CodonArray] = codoncount(SeqNT) returns CodonArray, a 4-by-4-by-4 array containing the raw count data for each codon. The three
dimensions correspond to the three positions in the codon, and the indices to each element are represented by 1 = A, 2 = C, 3 = G, and 4 = T. For
example, the element (2,3,4) in the array contains the number of CGT codons.
... = codoncount(SeqNT, ...'Frame', FrameValue, ...) counts the codons in the reading frame specified by FrameValue, which can be 1 (default), 2,
or 3.
... = codoncount(SeqNT, ...'Reverse', ReverseValue, ...) controls the return of the codon count for the reverse complement sequence of SeqNT.
Choices are true or false (default).
... = codoncount(SeqNT, ...'Figure', FigureValue, ...) controls the display of a heat map of the codon counts. Choices are true or false (default).
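The effect of the reading frame and of the reverse complement on codon counts is easy to see in a small Python sketch. The function names are hypothetical, and unlike codoncount this sketch does not bundle codons with ambiguous symbols into an Others field.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def codon_counts(seq, frame=1, reverse=False):
    """Count non-overlapping codons in reading frame 1, 2, or 3."""
    seq = seq.upper().replace("U", "T")
    if reverse:
        seq = reverse_complement(seq)
    start = frame - 1
    return Counter(seq[i:i + 3] for i in range(start, len(seq) - 2, 3))

s = "ATGAAATGA"
print(dict(codon_counts(s)))                # frame 1: ATG, AAA, TGA
print(dict(codon_counts(s, frame=2)))       # frame 2: TGA, AAT
print(dict(codon_counts(s, reverse=True)))  # reverse complement TCATTTCAT: TCA, TTT, CAT
```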
BASIC SEQUENCE STATISTICS
Hflu = getgenbank('NC_000907','SequenceOnly',true);
[Hflu_counts] = basecount(Hflu);
Hflu_density = ntdensity(Hflu)
[Dimers, Percent] = dimercount(Hflu, 'chart', 'pie')

Dimers =
    AA: 219880
    AC: 92410
    AG: 88457
    AT: 166837
    CA: 121618
    CC: 68014
    CG: 72523
    CT: 88551
    GA: 94125
    GC: 95529
    GG: 66448
    GT: 91314
    TA: 131955
    TC: 94745
    TG: 119996
    TT: 217512
    Others: 223

Percent =
    0.1202    0.0505    0.0483    0.0912
    0.0665    0.0372    0.0396    0.0484
    0.0514    0.0522    0.0363    0.0499
    0.0721    0.0518    0.0656    0.1189

[Figure: pie charts of the nucleotide proportions (A, C, G, T, Others) and of the dimer proportions (AA through TT, Others). Credit: RP]
BASIC SEQUENCE STATISTICS
Nucleotide density
• Default window size: 1.8M/20 ≈ 90,000
• Try different windows. What is the advantage or disadvantage of longer windows?

[Figure: densities of A, C, G, and T along the genome. x-axis: position, 0 to 2 × 10^6; y-axis: density, 0.2 to 0.35.]

Also, the variation of the densities along the sequence indicates that the i.i.d. assumption is in fact violated. But significantly so?

[Figure: A-T and C-G density along the genome. x-axis: position, 0 to 2 × 10^6; y-axis: density, 0.4 to 0.7. Credit: RP]

So, H. influenzae is AT-heavy, CG-light!

What is happening here? Horizontal gene transfer?
HORIZONTAL GENE TRANSFER
• In general, an organism receives its genes from its parents (ancestors). This is called vertical transfer.
  • Much of our knowledge of genetics and evolution is based on vertical transfer.
• However, an organism, usually a virus, can also insert its own genetic material into the genome of another. This is called horizontal gene transfer (HGT).
  • It is now understood that HGT plays, and has played since the beginning of life, an important role in evolution.
• One way to find HGTs in a genome is to look for regions of the genome where sequence statistics differ significantly from the rest of the genome.
  • Note that H. influenzae contains about 30 kb of unusual GC content in the region from about 1.56 Mb to 1.59 Mb of the genome.
  • This is attributed to an ancient insertion of viral DNA into the H. influenzae genome.
CHANGE POINT ANALYSIS
• Finding the regions of the genome where the statistical characteristics change is called change point analysis.
  • The simplest way is to divide the genome into consecutive (or overlapping) windows, and set a threshold for the quantity being measured (such as the GC content). Any window where this quantity exceeds the threshold indicates a change point.
  • This requires an appropriate choice of window length and threshold, both of which are statistical problems that can be handled by hypothesis testing (remember?).
  • We will later see that hidden Markov models play an important role in change point analysis.
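One way to tie the window-and-threshold idea to hypothesis testing is sketched below in Python. Under the i.i.d. model, the GC fraction of a window of length w has standard deviation sqrt(p(1-p)/w), where p is the genome-wide GC fraction, so each window can be given a z-score. The function name, the non-overlapping windows, and the threshold of 3 are all assumptions made for illustration, not a prescribed method.

```python
import math

def gc_outlier_windows(seq, window, z_threshold=3.0):
    """Flag non-overlapping windows whose GC fraction deviates from the global one."""
    seq = seq.upper()
    is_gc = [ch in "GC" for ch in seq]
    p = sum(is_gc) / len(seq)                  # genome-wide GC fraction
    sigma = math.sqrt(p * (1 - p) / window)    # std. dev. of a window's GC fraction if i.i.d.
    if sigma == 0:
        return []                              # no variation possible (all GC or no GC)
    flagged = []
    for start in range(0, len(seq) - window + 1, window):
        gc = sum(is_gc[start:start + window]) / window
        z = (gc - p) / sigma
        if abs(z) > z_threshold:
            flagged.append((start + 1, start + window, gc, z))  # 1-based, inclusive
    return flagged

# Synthetic example: a 1,000-letter GC-only island inside a 19,000-letter background.
background = "ACGT" * 4750
genome = background[:5000] + "GC" * 500 + background[5000:]
for first, last, gc, z in gc_outlier_windows(genome, window=1000):
    print(first, last, gc, round(z, 1))
```

Longer windows reduce sigma and therefore the noise, but they also blur the boundaries of short anomalous regions, which is the trade-off behind the choice of window length.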
CODONCOUNT
Hflu = getgenbank('NC_000907','SequenceOnly',true);
codoncount(Hflu)
returns codon counts for the first reading frame, with the corresponding warning message.
[Output: codon counts and warning message not reproduced.]
Note that since the codons actually refer to RNA sequences, we should in fact be using uracil (U) instead of thymine (T). However, using the DNA sequences is common practice, hence the “Ts” that appear in the codon list.
CODONCOUNT
• We can also count the codons in each of the six reading frames, and plot the results as heat maps.
% mitochondria is assumed to already hold a nucleotide sequence
% (for example, one retrieved with getgenbank and 'SequenceOnly', true).
for frame = 1:3
    figure
    subplot(2,1,1);
    % one of the first three (forward) reading frames
    codoncount(mitochondria,'frame',frame,'figure',true);
    title(sprintf('Codons for frame %d',frame));
    subplot(2,1,2);
    % one of the last three (reverse) reading frames
    codoncount(mitochondria,'reverse',true,...
        'frame',frame,...
        'figure',true);
    title(sprintf('Codons for reverse frame %d',frame));
end
[Figure: heat maps of the codon counts in the six reading frames (panels: Codons for frame 1, 2, 3 and Codons for reverse frame 1, 2, 3). Each panel is a grid of the 64 codons, AAA through TTT, with a color scale marked from 50 to 200.]
GENOME ANNOTATION
GENE FINDING
• In genome annotation, we typically start by computing basic statistical quantities, such as base counts, dimer counts, codon counts, etc., from which we can make some observations on the relative distributions of the nucleotides and codons.
• The second, and arguably more important, step is gene finding or gene prediction.
  • Given a sequence that is millions of letters long, how do we find where the genes begin and end?
  • The easiest approach is to look for start and stop codons; after all, the protein-coding regions begin and end with these special codons.
  • However, there are several challenges:
    • Not all genes are protein-coding genes.
    • In eukaryotes, the protein-coding region of a gene actually appears in distinct non-neighboring segments, called exons, separated by non-coding regions, called introns. Hence in eukaryotes, gene finding is much more difficult than in prokaryotes, where the genes appear in one continuous region.
    • Even in prokaryotes, where we can simply look for start and stop codons, these pairs can occur randomly, resulting in segments that do not actually code for a gene.
      • NOT every reading frame that starts with a START codon and ends with a STOP codon is actually a gene.
GENE FINDING
• Each coding region is preceded (on the 5’ end) by a promoter region that regulates (enhances or inhibits) the transcription of the gene, and is flanked by untranslated regions (UTRs), where the transcription actually starts. Both need to be identified.
  • In prokaryotes, promoters consist of two segments of consensus sequences: a 6 nt region of TATAAT at the -10 location (the Pribnow box), and a 6 nt region of TTGACA at the -35 location.
  • The consensus sequences are conserved on average, but do NOT necessarily appear intact in all promoters.
[Figure: gene structure showing exons and introns. Credit: ZB]
GENE FINDING VS. GENE PREDICTION
• Traditionally, gene finding is done through extrinsic approaches, called homology-based methods.
  • In homology-based methods, the target genome is queried against known mRNA or protein sequences.
  • A high degree of similarity between a known gene and a segment of a new genome provides significant evidence that the similar region is also a gene-encoding region.
  • However, this approach requires previously sequenced and annotated genes, and obtaining them is an expensive, tedious and difficult procedure.
• Alternatively, we can use computational approaches to determine whether the statistical properties of a given segment provide any evidence that it may correspond to a gene. These are called ab initio approaches.
  • In Latin, ab initio means “from the beginning.” When used in the context of science, it roughly means “from first principles.”
  • Ab initio approaches are often adequate for prokaryotic, but not for eukaryotic gene finding, which requires more sophisticated approaches, such as hidden Markov models.
  • Ab initio approaches are in fact “gene prediction” approaches rather than gene finding, as the predicted genes still need to be verified as functional using in vivo techniques.
GENE PREDICTION
OPEN READING FRAME
 While we mentioned ORF previously, we are now ready to make a formal
definition:
 An open reading frame is any subsequence whose length L is a multiple of 3, starting
with the start codon ATG, ending with one of the three stop codons {TAA, TAG,
TGA}, with no additional stop codons in the middle.
 There are three possible reading frames on each strand of the DNA, giving a total of
6 possible reading frames.
 Hence, any gene prediction algorithm needs to locate ORFs on both strands.
 So, how long can an ORF be?
 Many random stretches of DNA can start and end with such codons just by chance, without actually being a gene.
 It is understood that the longer the ORF, the more likely it is to be a protein-coding segment.
 The traditional, and simplest, approach is then to compute the probability of observing an ORF of length L in a random sequence that is generated by a probability model (such as the multinomial distribution).
• The more strict the conditions, the fewer ORFs we will find.
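The formal definition above can be checked mechanically. The following is a minimal sketch in base MATLAB (no toolbox calls); the two candidate strings and the variable names are made up for illustration and are not from the lecture.

```matlab
% Test whether each candidate string satisfies the ORF definition:
% length a multiple of 3, starts with ATG, ends with a stop codon,
% and contains no stop codon in between.
candidates = {'ATGGCTTGGAAATAA', 'ATGTAGGCTTAA'};  % toy inputs; the second has an internal TAG
stops = {'TAA', 'TAG', 'TGA'};

isORF = false(size(candidates));
for c = 1:numel(candidates)
    seq = candidates{c};
    n = floor(numel(seq) / 3);
    codons = cellstr(reshape(seq(1:3*n), 3, []).');   % one codon per cell

    lengthOK = mod(numel(seq), 3) == 0;
    startOK  = strcmp(codons{1}, 'ATG');
    endOK    = ismember(codons{end}, stops);
    innerOK  = ~any(ismember(codons(2:end-1), stops)); % no stop before the last codon

    isORF(c) = lengthOK && startOK && endOK && innerOK;
end
disp(isORF)
```

The first candidate passes all four conditions; the second fails only the "no additional stop codons" condition.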
FINDING ORFS
 Here is a simple ORF-finding algorithm (see p. 31, Cristianini & Hahn):
 Given a DNA sequence s and k ∈ Z+, decompose s into triplets in each of its three reading frames, and find all stretches of triplets that start with the start codon ATG and end with one of the stop codons {TAA, TAG, TGA}. Repeat for the reverse complement of s, which covers the remaining three of the 6 reading frames.
Output all ORFs whose lengths are greater than the predetermined threshold k.
 How do we choose k?
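The algorithm above can be sketched in a few lines of base MATLAB. This is an illustrative sketch, not the textbook's or the toolbox's implementation: the toy sequence, the threshold value, and the variable names are assumptions, ORF length is measured in codons (start and stop included), and an ATG found inside an already open ORF is treated as part of that ORF.

```matlab
% Scan three reading frames on each strand and keep ORFs longer than k codons.
s = 'ATGAAACCCGGGTTTTAACATGCCCTGA';   % toy sequence (hypothetical)
k = 4;                                 % threshold in codons (hypothetical)
stops = {'TAA', 'TAG', 'TGA'};

% Reverse complement using base MATLAB only.
comp = s;
comp(s == 'A') = 'T';  comp(s == 'T') = 'A';
comp(s == 'C') = 'G';  comp(s == 'G') = 'C';
strands = {s, fliplr(comp)};

orfs = zeros(0, 4);   % rows: [strand, frame, start index on that strand, length in codons]
for strand = 1:2
    seq = strands{strand};
    for frame = 1:3
        startPos = 0;                          % 0 means no ORF is currently open
        for pos = frame:3:(numel(seq) - 2)
            codon = seq(pos:pos+2);
            if startPos == 0 && strcmp(codon, 'ATG')
                startPos = pos;                % open an ORF at the first ATG
            elseif startPos > 0 && ismember(codon, stops)
                nCodons = (pos + 3 - startPos) / 3;
                orfs(end+1, :) = [strand, frame, startPos, nCodons];
                startPos = 0;                  % close the ORF at the first stop
            end
        end
    end
end

longOrfs = orfs(orfs(:, 4) > k, :);            % ORFs longer than the threshold
disp(longOrfs)
```

For this toy input the scan finds a 6-codon ORF in frame 1 and a 3-codon ORF in frame 2 of the forward strand; only the first survives k = 4. Start indices on strand 2 refer to positions in the reverse complement.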
SEQSHOWORFS()
Seqshoworfs Display open reading frames in sequence
seqshoworfs identifies and highlights all open reading frames using the standard or an alternative genetic code.
seqshoworfs(SeqNT) displays the sequence with all open reading frames highlighted, and it returns a structure of start and stop positions for
each ORF in each reading frame. The standard genetic code is used with start codon 'AUG' and stop codons 'UAA', 'UAG', and 'UGA'.
seqshoworfs(SeqNT, ...'PropertyName', PropertyValue, ...) calls seqshoworfs with optional properties that use property name/property value
pairs. You can specify one or more properties in any order. Each PropertyName must be enclosed in single quotes and is case insensitive. These
property name/property value pairs are as follows:
seqshoworfs(SeqNT, ...'Frames', FramesValue, ...) specifies the reading frames to display. The default is to display the first, second, and third
reading frames with ORFs highlighted in each frame.
seqshoworfs(SeqNT, ...'GeneticCode', GeneticCodeValue, ...) specifies the genetic code to use for finding open reading frames.
seqshoworfs(SeqNT, ...'MinimumLength', MinimumLengthValue, ...) sets the minimum number of codons for an ORF to be considered valid. The
default value is 10.
seqshoworfs(SeqNT, ...'AlternativeStartCodons', AlternativeStartCodonsValue, ...) uses alternative start codons if AlternativeStartCodons is set to
true. For example, in the human mitochondrial genetic code, AUA and AUU are known to be alternative start codons. For more details on
alternative start codons, see http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=t#SG1
seqshoworfs(SeqNT, ...'Color', ColorValue, ...) specifies the color used to highlight the open reading frames in the output display. The default
color scheme is blue for the first reading frame, red for the second, and green for the third frame.
seqshoworfs(SeqNT, ...'Columns', ColumnsValue, ...) specifies how many columns per line to use in the output. The default value is 64.
DIFFERENT GENETIC CODES?
 Yes, while most organisms use the standard genetic code, there are some structures that
use alternative codes. These are indicated in Matlab by the following codes:
STANDARD VS. MITOCHONDRIAL CODE
[Figure: codon tables for the Standard Code and the Vertebrate Mitochondrial Code]
AN EXAMPLE
 Let's take a look at the M. genitalium genome (NC_000908)
Mgen = getgenbank('NC_000908','SequenceOnly',true);
s1=seqshoworfs(Mgen,'Frames', 1); %Find the ORFs in Frame 1 (default MinimumLength of 10 codons)
s1_90=seqshoworfs(Mgen,'Frames', 1, 'MinimumLength', 90);%Find the ORFs of at least 90 codons in Frame 1
s2_90=seqshoworfs(Mgen,'Frames', 2, 'MinimumLength', 90);%Find the ORFs of at least 90 codons in Frame 2
s3_90=seqshoworfs(Mgen,'Frames', 3, 'MinimumLength', 90);%Find the ORFs of at least 90 codons in Frame 3
s1c_90=seqshoworfs(Mgen,'Frames', -1, 'MinimumLength', 90);%Find the ORFs of at least 90 codons in Frame -1
s2c_90=seqshoworfs(Mgen,'Frames', -2, 'MinimumLength', 90);%Find the ORFs of at least 90 codons in Frame -2
s3c_90=seqshoworfs(Mgen,'Frames', -3, 'MinimumLength', 90);%Find the ORFs of at least 90 codons in Frame -3
 The number of ORFs returned:
s1: 1074
s1_90: 114; s1c_90: 70
s2_90: 92; s2c_90: 92
s3_90: 95; s3c_90: 82
Total over the six frames with MinimumLength 90: 545
AN EXAMPLE
[Figure: two of the many open reading frames found when no threshold is set.]
AN EXAMPLE
[Figure: one of the many, but far fewer, open reading frames found when the threshold is set to 90.]
HOW TO CHOOSE K?
 This is done by none other than the statistical procedure that you are all too familiar with: hypothesis testing.
 We ask the question: what is the probability that an ORF will appear by chance and not due to a genomic structure?
 Null hypothesis: The ORF occurs by chance (and is not really a coding region)
 Alternative hypothesis: The ORF is a coding region.
 We first choose a significance level, α, which determines our willingness to commit a type I error: a false positive, i.e., rejecting the null hypothesis when in fact it is true.
 In our context, this is the probability that we will call the sequence a coding region, when in fact it is not (hence, false alarm).
 We then compute the p-value: the probability that we can get an ORF with a length at least as extreme as k by chance, i.e., if the null hypothesis were true.
 If the p-value is less than α, we call the result significant, and reject the null hypothesis in favor of the alternative hypothesis.
 Traditionally, the significance level is selected as 0.05.
 We can reduce type I error by reducing α, but that increases type II error, a false negative, i.e., the probability of failing to reject the null hypothesis when in fact the alternative hypothesis is true: we fail to recognize a genuine ORF, and attribute it to random chance!
COMPUTING P-VALUE
 Computing the p-value depends on the probability distribution from which the data (the
sequence) are generated.
 In most classical applications, the data are assumed to be drawn from a normal (Gaussian)
distribution, hence the probability of observing such an extreme value of the measured
variable is computed under the assumption that the null hypothesis is true, and that the
data do in fact come from a normal distribution.
 In our context, the data come from a multinomial distribution, which makes the calculations
much easier:
 Let's ask the question again: what is the probability that an ORF of k or more codons appears by chance, i.e., under the null hypothesis?
 If the distribution of codons is uniform ($P_{AAA} = P_{AAC} = P_{ACG} = \dots = P_{TTT} = 1/64$), the probability of picking a stop codon at any given position is 3/64, hence the probability of picking a non-stop codon is 61/64.
 The probability of a run of k non-stop codons, under the iid condition, is then $(61/64)^k$.
 If α = 0.05, we can compute that $(61/64)^{62} = 0.051$. This is the probability of a random run of k = 62 non-stop codons before a stop codon appears. Any k > 62 ⇒ smaller probability of occurring by chance.
 We also have a start codon and a stop codon to count, so if we discard all ORFs with length k < 64 (62+1+1), we will have removed about 95% of all ORFs that appear by chance!
 According to this analysis, we should only select those ORFs that are longer than 64 codons.
 For k = 102, the probability drops below 0.01 ⇒ an ORF with k > 102 is significant with 99% confidence.
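The numbers on this slide can be reproduced directly. The sketch below assumes the uniform codon model stated above; the variable names are illustrative only.

```matlab
% Probability of a run of k non-stop codons under the uniform iid codon model.
pNonStop = 61/64;

pRun62  = pNonStop^62;     % run of 62 non-stop codons
pRun102 = pNonStop^102;    % run of 102 non-stop codons

% Smallest integer k for which (61/64)^k <= alpha, for two significance levels.
alphas = [0.05 0.01];
kExact = log(alphas) ./ log(pNonStop);   % real-valued solution of (61/64)^k = alpha
kMin   = ceil(kExact);

fprintf('(61/64)^62  = %.4f\n', pRun62);
fprintf('(61/64)^102 = %.4f\n', pRun102);
fprintf('alpha = %.2f -> k = %.2f, rounded up to %d\n', [alphas; kExact; kMin]);
```

The real-valued solution for α = 0.05 is about 62.4, which is where the run length of 62 used above comes from; rounding up instead gives the first run length whose probability is strictly below α.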
NON-UNIFORM CODON DISTRIBUTION
 Of course, not all codons appear with equal probability, a phenomenon known as codon bias.
 If an organism has far more As and Ts in its nucleotide sequence, the codons that include As and Ts will also appear more often, which, mind you, include the start codon (ATG) and the stop codons {TAA, TAG, TGA}.
 Instead of assuming equal probabilities, we can very easily compute the probability of obtaining a stop codon as $P(\text{stop}) = P(\text{TAA}) + P(\text{TAG}) + P(\text{TGA})$
 Then the probability of a sequence of k non-stop codons (following a start codon), assuming DNA is generated in an iid fashion, is
$P(\text{sequence of } k \text{ non-stop codons}) = \left(1 - P(\text{stop})\right)^k$
 We then set this expression equal to α and solve for k.
 How to compute P(stop)?
 Homework: Compute P(stop) and the appropriate value of k, then estimate the total number of genes based on the total number of ORFs larger than k for the M. genitalium and H. influenzae genomes.
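As a worked version of the "solve for k" step, here is a short derivation under the iid assumption stated above. The last line is an additional assumption, not from the slide: if nucleotides (rather than codons) are drawn independently with frequencies $p_A$, $p_G$, $p_T$ (symbols introduced here for illustration), each stop codon probability is a product of nucleotide frequencies.

```latex
\begin{align*}
\bigl(1 - P(\text{stop})\bigr)^{k} &\le \alpha \\
k \,\ln\bigl(1 - P(\text{stop})\bigr) &\le \ln \alpha \\
k &\ge \frac{\ln \alpha}{\ln\bigl(1 - P(\text{stop})\bigr)}
  \qquad \text{(the inequality flips because } \ln(1 - P(\text{stop})) < 0\text{)} \\[6pt]
P(\text{stop}) &= p_T\,p_A\,p_A + p_T\,p_A\,p_G + p_T\,p_G\,p_A
  = p_T\,p_A\,(p_A + 2\,p_G)
\end{align*}
```

With the uniform model, $P(\text{stop}) = 3/64$ and the bound reduces to $k \ge \ln\alpha / \ln(61/64)$.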
%Compute the total probability of a stop codon occurring in a sequence.
Mgen = upper(getgenbank('NC_000908','SequenceOnly',true));
Nmer = nmercount(Mgen, 3);          %Cell array: column 1 = 3-mer, column 2 = count
Codon = upper(Nmer(:,1));           %3-mers actually present in the sequence
Count = cell2mat(Nmer(:,2));
Total_3mer = sum(Count);
%Look up each stop codon by exact match; a codon that never occurs contributes 0
Prob_TAA = sum(Count(strcmp(Codon,'TAA'))) / Total_3mer;
Prob_TAG = sum(Count(strcmp(Codon,'TAG'))) / Total_3mer;
Prob_TGA = sum(Count(strcmp(Codon,'TGA'))) / Total_3mer;
Total_stop_prob = Prob_TAA+Prob_TAG+Prob_TGA;
%k=input('Enter k:')
%Prob_of_k_run = (1-Total_stop_prob)^k
alpha = input('Enter alpha: ');
%Smallest k for which (1-Total_stop_prob)^k <= alpha
k = ceil(log(alpha)/log(1-Total_stop_prob));
ESTIMATING THE P-VALUE
 What if we do not wish to assume a particular distribution on DNA sequence generation?
 Then computing the p-value becomes a little more involved.
 We need to estimate the sequence distribution from the data itself, which can be done in a few different ways. Two popular approaches:
 Randomization: Randomly permute the original DNA sequence, by shuffling its
nucleotides (one or three at a time), creating a different sequence of the same
letters (or codons) each time.
 Bootstrapping: Sample the original DNA sequence many times, drawing N letters
from the N-long sequence at random with replacement.
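The two resampling schemes differ only in how indices are drawn. A minimal sketch in base MATLAB follows; the toy sequence and variable names are made up for illustration.

```matlab
% Randomization vs. bootstrapping on a toy sequence.
s = 'ATGAAATTTGGGCCCTAA';            % toy sequence (hypothetical), length divisible by 3
N = numel(s);

% Randomization, one nucleotide at a time: a permutation (no replacement),
% so the nucleotide composition is exactly preserved.
permNt = s(randperm(N));

% Randomization, three nucleotides at a time: shuffle whole codons,
% so the codon composition is exactly preserved.
codons = reshape(s, 3, []);          % each column is one codon
permCodon = reshape(codons(:, randperm(size(codons, 2))), 1, []);

% Bootstrapping: draw N letters with replacement, so the composition
% matches the original only on average.
bootNt = s(randi(N, 1, N));

disp(permNt); disp(permCodon); disp(bootNt);
```

Because the draws are random, the printed sequences change from run to run; only the preserved properties noted in the comments are guaranteed.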
RANDOMIZATION
 Example (Example 2.6 from your text)
 Permute the DNA sequence for M. genitalium
 Compute the total number of ORFs in the original and in the permuted sequence, and record the length of each ORF.
• 11922 ORFs in the original, 17367 in the randomized one.
• The list of ORF lengths in the random sequence becomes the null distribution, i.e., the distribution of ORF lengths that arise by chance.
 Keep those candidate genes from the original DNA only if their length is longer than
most (or all) of the candidate genes found in the randomized DNA.
• Longest ORF in random sequence: 402
• Using this as a threshold, you will find 326 ORFs in the original DNA longer than this.
 This will miss short genes, however. So you may want to use a looser (lower)
threshold, say the longest 1% or 5% of the ORFs in the random sequence.
 You will then find shorter ORFs, but will also have false alarms.
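The strict and the looser thresholds can be compared with an empirical p-value: the fraction of null (randomized) ORFs that are at least as long as a candidate. The sketch below uses made-up ORF lengths purely for illustration; they are not the M. genitalium results.

```matlab
% Empirical p-values from a null distribution of ORF lengths.
nullLengths = [12 15 18 21 24 27 30 33 36 39 42 45 48 54 60 66 75 90 ...
               102 120 150 180 240 300 402];   % made-up ORF lengths from a shuffled sequence
obsLengths  = [30 95 310 410 1200 3150];       % made-up ORF lengths from the original sequence
alpha = 0.05;

% Fraction of null ORFs at least as long as each observed ORF.
pEmp = arrayfun(@(L) mean(nullLengths >= L), obsLengths);

% Looser rule: keep ORFs that fall in the longest 5% tail of the null distribution.
keepLoose = obsLengths(pEmp < alpha);

% Strict rule: keep only ORFs longer than every ORF in the randomized sequence.
keepStrict = obsLengths(obsLengths > max(nullLengths));

disp(keepLoose); disp(keepStrict);
```

Here the looser rule keeps one extra candidate (length 310) that the strict rule rejects, which is the trade-off described above: more short genes recovered, at the cost of more false alarms.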
RANDOMIZATION EXAMPLE
% Randomly permute the Mgen sequence
Index = randperm(length(Mgen));
Rand_gen = Mgen(Index);
RANDOMIZATION EXAMPLE
[Figure: bar plots titled "Original M. genitalium nucleotide density" and "Randomized M. genitalium nucleotide density" (bars for A, C, G, T), each with a companion plot of dinucleotide counts (AA through TT). Credit: RP]
RANDOMIZATION EXAMPLE
%Randomly permute the Mgen sequence
Index = randperm(length(Mgen));
Rand_gen = Mgen(Index);
Results from Dr. P's implementation, with the textbook implementation's values in parentheses:
L_Mgen = 1960 2162 1928 1998 1861 2015
Total_length_Mgen = 11924 (11922)
L_Rand_gen = 2961 2884 2898 2826 2892 2850
Total_length_Rand_gen = 17311 (17367)
Max_length_Rand_gen_ORF = 369 (402)
Max_length_Mgen_ORF = 3150
Number_significant_ORFs=length(find (L_Mgen_ORF >Max_length_Rand_gen_ORF));
Number_significant_ORFs = 366 (326)
SEQTOOL
seqtool Open Sequence Tool window to interactively explore biological sequences
seqtool opens the Sequence Tool window. For examples of using this window, see Sequence Tool.
seqtool(Seq) opens the Sequence Tool window and loads Seq, a sequence, into the window.
seqtool('close') closes the Sequence Tool window.
seqtool(Seq, 'Alphabet', AlphabetValue) specifies an alphabet for the sequence, Seq. Default is 'AA', except when all of the symbols in the
sequence are A, C, G, T, and -, then AlphabetValue defaults to 'NT'. Use 'AA' when you want to force an amino acid sequence alphabet.
Try seqtool('NM_000520')
LAB ASSIGNMENTS
 Compute P(stop) and the appropriate value of k, then estimate the total number of genes based on the total number of ORFs larger than k for the M. genitalium (NC_000908) and H. influenzae (NC_000907) genomes.
 Now, assume that you cannot compute the p-value for these DNA sequences, and that you want to estimate k based on the length of the longest ORF in a randomized DNA sequence. How many ORFs did you find (repeat for each of NC_000907 and NC_000908)?
 Find all ORFs in human (NC_012920), chimp (NC_001643) and mouse (NC_005089)
mtDNA. Note that the genetic code for mitochondrial DNA is slightly different than the
standard code. What is the appropriate threshold k, and how do you determine this
value? How many ORFs do you get for different values of k? What fraction of sequence
represents (potential) protein coding sequences for k=1, k=optimal and k=others?
 Repeat the ORF search on randomized mtDNA sequences. How long is the longest ORF
in the randomized sequence?