Phred/Phrap/Consed Analysis
A User’s View

International Training Course on Bioinformatics Applied to Genomic Studies
Rio de Janeiro 2001

Arthur Gruber
Faculty of Veterinary Medicine and Zootechny
University of São Paulo
BRAZIL

What is Phred/Phrap/Consed?

Phred/Phrap/Consed is a worldwide distributed package for:
a. Trace file (chromatogram) reading;
b. Quality (confidence) assignment to each individual base;
c. Vector and repeat sequence identification and masking;
d. Sequence assembly;
e. Assembly visualization and editing;
f. Automatic finishing.

Why assemble?

• Current DNA sequencing methods generate reads of 500-700 bp – resolution limit of electrophoresis
• Whole genomes or large clones need to be fragmented – clone library
• Short fragments are randomly sequenced (shotgun approach) – reads are assembled to form the final consensus sequence

[Figure: shotgun sequencing workflow]
1. Whole genome or BAC/cosmid clone
2. DNA fragmentation: sonic disruption, nebulization
3. Small fragments: 1.0 - 2.0 kb
4. Clone library: pUC18
5. DNA sequencing: random clones
6. Partial assembly: contigs
7. Finishing: quality, coverage of both strands, gap filling
8. Whole genome or BAC/cosmid clone: final consensus sequence

How to deal with the enormous amount of reads generated by high-throughput DNA sequencers?

[Figure: Sanger Centre]

Phred

Genome Research 8: 175-185, 1998
Genome Research 8: 186-194, 1998

Phred is a program that performs several tasks:
a. Reads trace files – compatible with most file formats: SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
b. Calls bases – assigns a base to each identified peak, with a lower error rate than the standard base-calling programs.
c. Assigns quality values to the bases – a “Phred value” based on an error rate estimate calculated for each individual base.
d. Creates output files – base calls and quality values are written to output files.
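
The link between a Phred value and the estimated error rate mentioned in item c can be made concrete with a few lines of Python. This is only a minimal sketch of the standard relation q = -10 x log10(p); the function names are hypothetical and are not part of Phred itself.

```python
import math


def phred_quality(p_error):
    """Quality value q for an estimated error probability p (0 < p <= 1)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_probability(q):
    """Estimated error probability p for a quality value q."""
    return 10.0 ** (-q / 10.0)


def expected_errors(qualities):
    """Expected number of wrong base calls in a read, given its quality values."""
    return sum(error_probability(q) for q in qualities)
```

For instance, `expected_errors` applied to a read of 500 bases that all have q = 20 gives about 5 expected errors, which is one way to see why regions of higher quality matter for the final consensus.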
Trace File

[Figures: three example chromatogram regions]
• High quality region – no ambiguities (Ns)
• Medium quality region – some ambiguities (Ns)
• Poor quality region – low confidence

Phred value formula

q = -10 x log10(p)

where
q – quality value
p – estimated error probability for a base call

Examples:
q = 20 means p = 10^-2 (1 error in 100 bases)
q = 40 means p = 10^-4 (1 error in 10,000 bases)

The structure of a phd file

Each line between BEGIN_DNA and END_DNA holds a base call, its quality value and its peak position in the trace. Three excerpts of the same file are shown (start, middle and end of the read).

    BEGIN_SEQUENCE 01EBV10201A02.g

    BEGIN_COMMENT

    CHROMAT_FILE: EBV10201A02.g
    ABI_THUMBPRINT:
    PHRED_VERSION: 0.990722.g
    CALL_METHOD: phred
    QUALITY_LEVELS: 99
    TIME: Thu May 24 00:18:58 2001
    TRACE_ARRAY_MIN_INDEX: 0
    TRACE_ARRAY_MAX_INDEX: 12153
    TRIM:
    CHEM: term
    DYE: big

    END_COMMENT

    BEGIN_DNA
    t 8 5
    c 13 17
    a 19 26
    c 19 32

    t 24 2221
    a 24 2232
    a 22 2245
    a 27 2261
    g 25 2272
    c 19 2286
    c 12 2302
    t 19 2314
    g 12 2324
    g 15 2331
    g 19 2346
    g 23 2363
    t 33 2378
    g 36 2390
    c 44 2404
    c 44 2419
    t 39 2433
    a 39 2446
    a 34 2460
    t 35 2470
    g 34 2482

    t 16 8191
    g 19 8200
    t 13 8211
    c 13 8229
    g 4 8241
    n 4 8253
    c 4 8263
    t 10 8276
    t 9 8286
    c 12 8301
    t 16 8313
    c 12 8329
    c 12 8336
    c 15 8343
    t 19 8356
    c 9 8371
    g 13 8386
    g 14 8397
    a 7 8417
    g 9 8427
    g 4 8445

    t 6 11908
    a 6 11921
    g 6 11927
    t 6 11947
    c 6 11953
    a 6 11964
    g 6 11981
    c 4 11994
    n 4 12015
    c 4 12037
    n 4 12044
    n 4 12058
    n 4 12071
    n 4 12085
    n 4 12098
    n 4 12111
    n 4 12124
    c 4 12144
    n 4 12151
    END_DNA

    END_SEQUENCE

Phrap

Phragment Assembly Program or… Phil’s Revised Assembly Program

Phrap is a program for assembling shotgun DNA sequence data.

Key Features:
a. Uses the entire read content – no need for trimming.
b. Combines user-supplied (e.g. Repbase) and internally computed data – better assembly accuracy in the presence of repeats.
c. The contig sequence is a mosaic of the highest-quality parts of the reads – it is not a consensus!
d. Provides extensive information about the assembly – contained in the phrap.out, *.ace and *.screen.contigs.qual files.
e. Handles very large datasets – hundreds of thousands of reads are easily handled.
f. Generates output files – these contain important data and enable visualization by other programs.

Phrap output files

• *.contigs – fasta file containing the contigs
  - Contigs with more than one read
  - Singletons (single reads that match some other contig but could not be merged consistently with it)
• *.singlets – fasta file of the singlet reads
  - Reads with no match to any other read
• *.ace – allows viewing the assembly with Consed
• *.view – required for viewing the assembly with Phrapview

Consed

Genome Research 8: 195-202, 1998

Consed is a program for viewing and editing assemblies produced by Phrap.

Key Features:
a. Assembly viewer – allows visualization of contigs, the assembly (aligned reads), quality values of reads and the final sequence.
b. Trace file viewer – single and multiple trace files can be displayed, allowing comparison of a given sequence across several reads.
c. Navigation – identifies and lists regions that are below a given quality threshold, contain high-quality discrepancies, have single-strand coverage, etc.
d. Autofinish – an automatic set of functions for gap closure, improvement of sequence quality, determination of the relative orientation of contigs, and identification of regions covered by a single read or by reads of a single strand. The program automatically picks primers and chooses the templates.
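
The navigation idea in item c (listing regions below a given quality threshold) can be illustrated with a small Python sketch. It does not reproduce Consed's own algorithm or defaults: the function name, the threshold of 20 and the `min_length` parameter are assumptions made for illustration only.

```python
def low_quality_regions(qualities, threshold=20, min_length=1):
    """Return (start, end) index pairs, end exclusive, for every run of
    consecutive positions whose quality value is below `threshold`."""
    regions = []
    start = None  # index where the current low-quality run began, if any
    for i, q in enumerate(qualities):
        if q < threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(qualities) - start >= min_length:
        regions.append((start, len(qualities)))
    return regions


# Quality values taken from the second excerpt of the phd file shown earlier.
example_quals = [24, 24, 22, 27, 25, 19, 12, 19, 12, 15, 19,
                 23, 33, 36, 44, 44, 39, 39, 34, 35, 34]
print(low_quality_regions(example_quals, threshold=20))
```

Raising `min_length` ignores isolated low-quality bases, so that only longer stretches are reported as candidates for extra reads.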
Phred/Phrap/Consed Pipeline

Input: chromatogram files

Quality (confidence) value assignment
  Program: Phred
  Output: phd files – *.phd

Conversion – phd to fasta
  Program: phd2fasta.pl
  Output: nucleotide sequences – seqs_fasta
          quality values – seqs_fasta.screen.qual

Vector screening and masking
  Program: Cross_Match (local alignment program) x vector.seq
  Output: screened/masked file – seqs_fasta.screen

Assembly
  Program: Phrap
  Output: assembled contigs – seqs_fasta.screen.contigs
          assembly file – seqs_fasta.screen.ace#

Assembly viewing/editing
  Program: Consed

Directories: Chromat_dir, Phd_dir, Edit_dir

Finishing Problems

Finishing can be a tedious and difficult task due to:

DNA sequencing problems
a. High GC content – genomes with a high GC content are more prone to artifacts such as compressions, sudden signal drops and bad-quality regions. Try using Dye Primer instead of Dye Terminator, changing the chemistry, adding DMSO, increasing the annealing temperature, using deaza-dGTP instead of dGTP, etc.
b. Palindromic regions – lead to strong secondary structures that cause sudden signal drops. Try using deaza-dGTP instead of dGTP, or amplify the problematic region by PCR and sequence the product.
c. Homopolymeric regions – can reduce DNA synthesis efficiency for some chemistries. Try using Dye Primer instead of Dye Terminator, or changing the chemistry (dRhodamine instead of BigDye).

DNA assembly problems
a. High content of repeats – highly repeated elements reduce the accuracy of DNA assembly. Identify the repeat unit, screen for it with Cross_Match or RepeatMasker and mask it. Assemble again and add the repetitive region only at the end. Map the repetitive region with restriction enzymes to estimate its size and the number of repeat units.
b. High AT content – some highly biased genomes (e.g. Plasmodium falciparum; plastid genomes) can pose a problem for assembly programs. Very difficult to solve. Try to determine a restriction map and combine the mapping with the DNA sequencing data.
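
Going back to the conversion step of the pipeline (phd to fasta), the sketch below shows in Python what such a conversion involves: reading the base, quality and position columns of a phd file and writing a fasta record plus a matching quality record. It is not the phd2fasta.pl script and does not claim to match its exact output; `parse_phd` and `to_fasta_and_qual` are hypothetical names, and the four-base example is just the start of the phd file shown earlier.

```python
def parse_phd(text):
    """Return (name, bases, qualities, positions) from phd-formatted text."""
    name = None
    bases, quals, positions = [], [], []
    in_dna = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "BEGIN_SEQUENCE" and len(fields) > 1:
            name = fields[1]
        elif fields[0] == "BEGIN_DNA":
            in_dna = True
        elif fields[0] == "END_DNA":
            in_dna = False
        elif in_dna and len(fields) >= 3:
            # Each DNA line is: base, quality value, peak position in the trace.
            bases.append(fields[0])
            quals.append(int(fields[1]))
            positions.append(int(fields[2]))
    return name, "".join(bases), quals, positions


def to_fasta_and_qual(name, bases, quals):
    """Build a fasta record and a space-separated quality record."""
    qual_line = " ".join(str(q) for q in quals)
    fasta = ">" + name + "\n" + bases + "\n"
    qual = ">" + name + "\n" + qual_line + "\n"
    return fasta, qual


example = """BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
END_DNA
END_SEQUENCE
"""

name, bases, quals, positions = parse_phd(example)
fasta, qual = to_fasta_and_qual(name, bases, quals)
print(fasta, qual, sep="")
```

Keeping the quality values in a separate record with the same header mirrors the pipeline's split between the sequence file and the quality file, so later steps can look up the confidence of every base.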