Bioinformatics 生物信息学理论和实践 唐继军 [email protected] 北京林业大学计算生物学中心 www.bjfuccb.com Download and install programs • Unzip or untar • unzip • If file.tar.gz, tar xvfz file.tar.gz • Go to the directory and “./configure” • Then “make” System subroutine system ("ls –ltr"); sub ReadFasta { my ($fname) = @_; open(FILE, $fname) or die "Cannot open $fname\n"; my $data = ""; my @dnas = (); while(my $line = <FILE>) { if ($line =~ /^>/) { if ($data ne "") { push(@dnas, $data); } $data = ""; } $data .= $line; } if ($data ne "") { push(@dnas, $data); } close FILE; return @dnas; } print "Please input file name:\n"; my $fname = <STDIN>; my @dnas = ReadFasta($fname); my $len = $#dnas + 1; for (my $i = 0; $i < $len; $i++) { for (my $j = $i+1; $j < $len; $j++) { for (my $k = $j+1; $k < $len; $k++) { $fname = "$i\_$j\_$k"; print $fname; open(OUT, ">$fname"); print OUT $dnas[$i]; print OUT $dnas[$j]; print OUT $dnas[$k]; close OUT; system ("./clustalw2 $i\_$j\_$k"); } } } Debug • Notice there are problems in a program is hard • Find the source of the problem is even harder • Good debug tool: print • Better tool: debugger Perl debugger • • • • • • perl –d program arguments n: next line s: step in r: run until the end of the current sub <RETURN>, repeat c: continue to the next breakpoint Check source • l • List next several lines • l 8-10 • List line 8-10 • l 100 • List line 100 • l subname • List subroutine subname • f restrcit.pl • Switch to view restrict.pl Breakpoint • b 100 • Add a breakpoint at line 100 of the current file • b subname • Add a breakpoint at this subroutine •B • Remove a break point • B 100 will remove a breakpoint at line 100 • B * will remove all breakpoints See variable • p $var • Print the value of the variable • y var • Display my variable • V display variables • V var • w $var • Watch this var, stop when the value is changed Working with Single DNA Sequences Learning Objectives • Discover how to manipulate your DNA sequence on a computer, analyze its composition, predict its restriction map, and amplify it with PCR • Find out about gene-prediction methods, their potential, and their limitations • Understand how genomes and sequences and assembled Outline 1. Cleaning your DNA of contaminants 2. Digesting your DNA in the computer 3. Finding protein-coding genes in your DNA sequence 4. Assembling a genome Cleaning DNA Sequences • In order to sequence genomes, DNA sequences are often cloned in a vector (plasmid, YAC, or cosmide) • Sequences of the vector can be mixed with your DNA sequence • Before working with your DNA sequence, you should always clean it with VecScreen VecScreen • http://www.ncbi.nlm.nih.gov/VecScreen /VecScreen.html • Runs a special version of Blast • A system for quickly identifying segments of a nucleic acid sequence that may be of vector origin What to do if hits found • If hits are in the extremity, can just remove them • If in the middle, or vectors are not what you are using, the safest thing is to throw the sequence away Computing a Restriction Map • It is possible to cut DNA sequences using restriction enzymes • Each type of restriction enzyme recognizes and cuts a different sequence: • EcoR1: GAATTC • BamH1: GGATCC • There are more than 900 different restriction enzymes, each with a different specificity • The restriction map is the list of all potential cleavage sites in a DNA molecule • You can compile a restriction map with www.firstmarket.com/cutter Cannot get it work! http://biotools.umassmed.edu/tacg4 Making PCR with a Computer • Polymerase Chain Reaction (PCR) is a method for amplifying DNA • PCR is used for many applications, including • Gene cloning • Forensic analysis • Paternity tests • PCR amplifies the DNA between two anchors • These anchors are called the PCR primer Designing PCR Primers • PCR primes are typically 20 nucleotides long • The primers must hybridize well with the DNA • On biotools.umassmed.edu, find the best location for the primers: • Most stable • Longest extension Analyzing DNA Composition • DNA composition varies a lot • Stability of a DNA sequence depends on its G+C content (total guanine and cytosine) • High G+C makes very stable DNA molecules • Online resources are available to measure the GC content of your DNA sequence • Also for counting words and internal repeats http://helixweb.nih.gov/emboss/html/ Counting words • • • • ATGGCTGACT A, T, G, G, C, T, G, A, C, T AT, TG, GG, GC, CT, TG, GA, AC, CT ATG, TGG, GGC, GCT, CTG, TGA, GAC, ACT www.genomatix.de/cgi-bin/tools/tools.pl EMBOSS servers • European Molecular Biology Open Software Suite • http://pro.genomics.purdue.edu/emboss/ ORF • EMBOSS • NCBI ncbi.nlm.nih.gov/gorf/gorf.html Internal repeats • A word repeated in the sequence, long enough to not occur by chance • Can be imperfect (regular expression) • Dot plot is the best way to spot it arbl.cvmbs.colostate.edu/molkit Predicting Genes • The most important analysis carried out on DNA sequences is gene prediction • Gene prediction requires different methods for eukaryotes and prokaryotes • Most gene-prediction methods use hidden Markov Models Predicting Genes in Prokaryotic Genome • In prokaryotes, protein-coding genes are uninterrupted • No introns • Predicting protein-coding genes in prokaryotes is considered a solved problem • You can expect 99% accuracy Finding Prokaryotic Genes with GeneMark • GeneMark is the state of the art for microbial genomes • GeneMark can • Find short proteins • Resolve overlapping genes • Identify the best start codon • Use exon.gatech.edu/GeneMark • Click the “heutistic models” Predicting Eukaryotic Genes • Eukaryotic genes (human, for example) are very hard to predict • Precise and accurate eukaryotic gene prediction is still an open problem • ENSEMBL contains 21,662 genes for the human genome • There may well be more genes than that in the genome, as yet unpredicted • You can expect 70% accuracy on the human genome with automatic methods Finding Eukaryotic Genes with GenomeScan • GenomeScan is the state of the art for eukaryotic genes • GenomeScan works best with • Long exons • Genes with a low GC content • It can incorporate experimental information • Use genes.mit.edu/genomescan Producing Genomic Data • Until recently, sequencing an entire genome was very expensive and difficult • Only major institutes could do it • Today, scientists estimate that in 10 years, it will cost about $1000 to sequence a human genome • With sequencing so cheap, assembling your own genomes is becoming an option • How could you do it? Sequencing and Assembling a Genome (I) • To sequence a genome, the first task is to cut it into many small, overlapping pieces • Then clone each piece Sequencing and Assembling a Genome (II) • Each piece must be sequenced • Sequencing machines cannot do an entire sequence at once • They can only produce short sequences smaller than 1 Kb • These pieces are called reads • It is necessary to assemble the reads into contigs Sequencing and Assembling a Genome (III) • The most popular program for assembling reads is PHRAP • Available at www.phrap.org • Other programs exist for joining smaller datasets • For example, try CAP3 at pbil.univ-lyon1.fr/cap3.php
© Copyright 2026 Paperzz