Single DNA Sequence Analysis Tools

BME 110: CompBio Tools
Todd Lowe
May 1, 2012
Today’s topics
•  Getting sets of sequences: The UCSC
Genome Table Browser
•  General Toolbox: EMBOSS tool suite
•  ORF prediction
–  gORF at NCBI (single sequences)
–  GeneMark (full-genome)
•  Specialized sites:
–  Restriction enzyme searches
–  PCR primer design
Basic ORF Finding
1.  Look for “long” open reading frames
2.  Scan sequence at nucleotide #1 (Frame #1), begin ORF at first start codon: ATG, GTG, TTG
3.  Continue scanning to first stop codon: TAA, TGA, or TAG
4.  That is your ORF!
5.  Repeat, starting at nuc #2 (Frame #2), then again, starting at nuc #3 (Frame #3)
6.  Take reverse complement of sequence (e.g. 5’-CGAAC -> 5’-GTTCG)
7.  Scan sequence starting at nuc #1 (Frame –1), nuc #2 (Frame –2), nuc #3 (Frame –3)
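The scanning procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind any particular ORF finder; the function names (`reverse_complement`, `orfs_in_frame`, `find_orfs`) and the `min_length` parameter are made up for this example. Coordinates on the minus frames are relative to the reverse-complemented sequence.

```python
STARTS = {"ATG", "GTG", "TTG"}   # start codons listed on the slide
STOPS = {"TAA", "TGA", "TAG"}    # stop codons listed on the slide
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement, e.g. CGAAC -> GTTCG."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset):
    """Yield (start, end) pairs for ORFs in one reading frame.

    start is the 0-based index of the first base of the start codon;
    end is one past the last base of the stop codon.
    """
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon in STARTS:
            start = i                 # begin ORF at the first start codon
        elif start is not None and codon in STOPS:
            yield start, i + 3        # first stop codon closes the ORF
            start = None


def find_orfs(seq, min_length=0):
    """Scan frames +1..+3 and -1..-3; return (frame, start, end, orf) tuples."""
    seq = seq.upper()
    results = []
    for sign, strand in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(strand, offset):
                if end - start >= min_length:
                    frame = sign * (offset + 1)
                    results.append((frame, start, end, strand[start:end]))
    return results


print(find_orfs("CCATGAAATAGCC"))  # [(3, 2, 11, 'ATGAAATAG')]
```

Passing `min_length=300` would mimic a "long ORFs only" filter.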
On-line ORF Finding @ NCBI
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
•  Shows all ORFs with a minimum length of 50, 100, or 300
nucleotides
•  Also shows all start (green) & stop (red) codons
•  Allows “alternate” start codons (TTG, GTG, CTG)
•  When you’ve selected an ORF you like, click the “Accept”
button; you can then display its nucleotide or protein sequence
for further analyses
•  Example: “Missing gene region” in Pyrobaculum
calidifontis: chr:1487337-1491109
Historical Note: Annotating the Yeast Genome
•  Yeast is a eukaryote, but does not have many
introns
•  A strict cutoff for ORF length was used:
an ORF of at least 300 nucleotides was required to be
considered a gene in the original genome annotation
(1996)
•  Since then, many smaller ORFs have been
found experimentally
EMBOSS Tools
•  Large package of on-line tools
•  Go to http://mobyle.pasteur.fr/
Transform sequence
–  Convert file format types
–  Reverse complement (revseq)
–  Extract a portion of sequence (extractseq)
–  Search & replace subsequences (biosed)
–  Translate (transeq, prettyseq) or back-translate (backtranseq/backtranambig)
–  Randomize sequence (shuffleseq)
Analyze sequences
–  G/C content (geecee)
–  Codon usage (cusp), codon adaptation (cai), codon bias (chips)
–  Word composition (wordcount)
–  Needleman-Wunsch global alignment (needle)
–  Smith-Waterman local alignment (water)
Many others…
Demo
•  Get the DNA sequence for PAE1265 in
Pyrobaculum aerophilum
•  Using EMBOSS at Mobyle site, calculate:
–  G/C percentage (DNA)
–  What is the most common 4-letter word?
–  Take the reverse complement
–  Extract nucleotides 100-150; What is the G/C
content?
–  Translate the DNA sequence in frame 2
–  What species have the 4 most similar gene
sequences? (Use NCBI BLAST)
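For checking the demo answers offline, the same calculations can be approximated in plain Python. This is a sketch only: it is not EMBOSS, the function names are invented for this example, and EMBOSS programs may treat ambiguous bases or partial codons differently. The toy sequence is a placeholder, not PAE1265.

```python
from collections import Counter

# Standard genetic code, built from the conventional TCAG ordering
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def gc_fraction(seq):
    """Fraction of bases that are G or C (compare: geecee)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def most_common_word(seq, size=4):
    """Most frequent overlapping word of the given size (compare: wordcount)."""
    seq = seq.upper()
    words = Counter(seq[i:i + size] for i in range(len(seq) - size + 1))
    return words.most_common(1)[0]


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string (compare: revseq)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def extract(seq, start, end):
    """Sub-sequence using 1-based, inclusive coordinates (compare: extractseq)."""
    return seq[start - 1:end]


def translate(seq, frame=1):
    """Translate in forward frame 1, 2, or 3; unknown codons become X."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame - 1, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


toy = "CATGAAATAG"                 # placeholder sequence for illustration
print(gc_fraction(toy))            # 0.3
print(translate(toy, frame=2))     # MK*
print(extract(toy, 2, 4))          # ATG
```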
More Sophisticated ORF Prediction
•  Can analyze an entire genome at once, using codon
frequencies, not just one gene at a time
•  GeneMark (http://opal.biology.gatech.edu/GeneMark)
–  Two modes:
•  GeneMark.hmm (based on previous genome information)
•  GeneMarkS (uses only your sequence info, when no models from
similar genomes are available)
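To illustrate how codon frequencies can help separate real genes from chance ORFs, here is a deliberately simple sketch. It is not GeneMark's algorithm (GeneMark's statistical models are considerably more elaborate); it only scores an ORF by how closely its codons match a frequency table learned from known genes. The function names and the pseudocount value are assumptions made for this example.

```python
import math
from collections import Counter

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def codons(seq):
    """Split a sequence into in-frame codons, dropping any partial codon."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def codon_frequencies(training_genes, pseudocount=1.0):
    """Estimate codon frequencies from known genes (pseudocount avoids zeros)."""
    counts = Counter()
    for gene in training_genes:
        counts.update(c for c in codons(gene) if c in ALL_CODONS)
    total = sum(counts.values()) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def coding_score(orf, freqs):
    """Average log2-odds per codon versus a uniform 1/64 background.

    Positive scores mean the ORF uses codons the way the training genes do.
    """
    scored = [c for c in codons(orf) if c in freqs]
    if not scored:
        return 0.0
    return sum(math.log2(freqs[c] * 64) for c in scored) / len(scored)
```

In this toy scheme a candidate ORF with a score well above zero looks more "gene-like" than one near or below zero; the threshold would need to be chosen empirically.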
Cutting DNA: Restriction enzymes
•  Enzymes isolated from prokaryotes that cut
DNA at specific recognition sequences
•  In nature, they act as a host defense against viruses
Examples:
Nla III:   5’ ... CATG^ ... 3’
BamH I:    5’ ... G^GATCC ... 3’
Dra III:   5’ ... CACNNN^GTG ... 3’
> 3,000 RE’s found to date with >200 specificities!
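Finding where an enzyme cuts is a pattern search. The sketch below uses Python's `re` module with the three example sites, where `^` marks the top-strand cut and `N` matches any base. The `ENZYMES` dictionary and `cut_positions` function are names invented for this example. Only the given strand is searched, which is sufficient here because all three example sites are palindromic.

```python
import re

# Recognition sites from the examples; ^ marks the cut on the top strand
ENZYMES = {
    "NlaIII": "CATG^",
    "BamHI": "G^GATCC",
    "DraIII": "CACNNN^GTG",
}


def cut_positions(seq, site):
    """Return top-strand cut positions for one recognition site.

    Each position is the number of bases to the left of the cut.
    """
    offset = site.index("^")
    pattern = site.replace("^", "").replace("N", "[ACGT]")
    # A lookahead lets overlapping occurrences of the site be reported too
    matches = re.finditer(f"(?=({pattern}))", seq.upper())
    return [m.start() + offset for m in matches]


print(cut_positions("AAGGATCCTT", ENZYMES["BamHI"]))      # [3]
print(cut_positions("TTCATGCATGAA", ENZYMES["NlaIII"]))   # [6, 10]
print(cut_positions("AACACTTTGTGAA", ENZYMES["DraIII"]))  # [8]
```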
Many RE’s Create “Sticky” Ends
Before cutting:
5’-ATTGATGG^AATTCTTATGGATAG-3’
3’-TAACTACCTTAA^GAATACCTATC-5’
After cutting, “sticky ends”:
5’-ATTGATGG         AATTCTTATGGATAG-3’
3’-TAACTACCTTAA  +      GAATACCTATC-5’
•  Useful to increase efficiency & specificity of rejoining ends
New England BioLabs Tools
•  http://tools.neb.com/
•  NEBcutter – Displays which restriction
enzymes cut in your sequence, and where
•  REBsites – Displays a “virtual” digest of your
DNA, showing how it would look on an
agarose gel
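A "virtual" digest boils down to turning cut positions into fragment lengths. The sketch below is an illustration of that arithmetic, not how the NEB tools are implemented; `digest_fragments` and its parameters are hypothetical names. Fragments are listed largest first, matching the top-to-bottom order of bands on a gel, since larger fragments migrate more slowly.

```python
def digest_fragments(seq_length, cuts, circular=False):
    """Return fragment lengths (largest first) for a set of cut positions.

    cuts are positions strictly inside the sequence, each counted as the
    number of bases to the left of the cut.
    """
    cuts = sorted({c for c in cuts if 0 < c < seq_length})
    if not cuts:
        return [seq_length]          # uncut DNA stays in one piece
    if circular:
        # Each fragment runs from one cut to the next, wrapping around
        fragments = [b - a for a, b in zip(cuts, cuts[1:])]
        fragments.append(seq_length - cuts[-1] + cuts[0])
    else:
        bounds = [0] + cuts + [seq_length]
        fragments = [b - a for a, b in zip(bounds, bounds[1:])]
    return sorted(fragments, reverse=True)


print(digest_fragments(1000, [200, 700]))                 # [500, 300, 200]
print(digest_fragments(1000, [200, 700], circular=True))  # [500, 500]
```

Note that a circular molecule (such as a plasmid) yields one fewer fragment than a linear one cut at the same positions.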
Primer3
http://frodo.wi.mit.edu/
Key Input:
1.  Sequence
2.  Targets (region to be amplified)
3.  Product Size Ranges
4.  Primer size (usually 18-22 bp)
5.  Primer Tm (melting temperature, which guides the choice of annealing temperature)
(Rest are usually OK as defaults)
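As a quick sanity check on a candidate primer, length, G/C content, and a rough Tm can be computed by hand. The sketch below uses the simple "2 + 4" rule of thumb (2 °C per A/T, 4 °C per G/C), which is only a crude approximation for short oligos; Primer3 itself uses a more detailed thermodynamic model, so its Tm values will differ. The function names and the example primer are made up for illustration.

```python
def rule_of_thumb_tm(primer):
    """Rough Tm in degrees C: 2 per A/T plus 4 per G/C (short oligos only)."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


def primer_summary(primer, min_len=18, max_len=22):
    """Summarize a primer against the usual 18-22 size range."""
    p = primer.upper()
    gc = p.count("G") + p.count("C")
    return {
        "length": len(p),
        "length_ok": min_len <= len(p) <= max_len,
        "gc_percent": 100.0 * gc / len(p),
        "approx_tm": rule_of_thumb_tm(p),
    }


print(primer_summary("ATGCATGCATGCATGCATGC"))
# {'length': 20, 'length_ok': True, 'gc_percent': 50.0, 'approx_tm': 60}
```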