NGS Bioinformatics Lab 8-23-2013

•  File formats galore
•  Sequencing data QC
•  Data visualization
•  Operating on genomic intervals

Software, Sites, Materials

Course Materials:
http://dldcc-web.brc.bcm.edu/lilab/benji/MBRB_2013/index.html
The most up-to-date slides for all three of my lectures will be uploaded here

Software: (Install for 8-23-2013)
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
http://www.java.com/en/
https://projects.gnome.org/gedit/ (decent text editor)

Browsers:
http://genome.ucsc.edu/
http://epigenomegateway.wustl.edu/

Web-based analysis:
https://main.g2.bx.psu.edu
http://david.abcc.ncifcrf.gov

File Formats
•  Fastq – raw data
•  SAM – aligned reads, universal format
•  BAM – binary SAM file (inter-convertible)
http://genome.ucsc.edu/FAQ/FAQformat.html
•  BED
•  bigBed
•  GTF
•  WIG
•  bigWig

Fastq files contain machine and base call information
•  Raw data should be cleaned
   •  Low quality scores
   •  Illumina Pass Filter flag (signal over the first 25 cycles)
   •  Adapter contamination
•  Big files, no need to decompress them
•  Back them up and submit to the NCBI Sequence Read Archive (SRA)

What are base quality scores?
•  Defined on base calls
•  Each call is an estimate of the true nucleotide
•  Random variable, it can be wrong!
•  Phred is Sanger
•  Phred quality (Q) = -10 * log10( P(Error) )
•  At Q=30, 1 in 1000 bases is wrong on average
•  Illumina quality character: ASCII code = Q + 33
•  '!' = 33, Phred = Q0 :(    '@' = 64, Phred = Q31 :)
•  You need to care about base quality
•  FastQC analysis exercise

The SAM file format!
•  Standard format for short read alignment data
•  Wealth of info for those skilled in the art
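
The Phred relationship from the base quality slide above can be checked numerically. The following is a minimal sketch, assuming the Phred+33 encoding stated on the slide; the function names are illustrative and do not come from FastQC or any other tool.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def decode_quality_string(qual, offset=33):
    """Turn a Fastq quality string into a list of Phred scores.

    offset=33 is the Phred+33 encoding from the slide ('!' -> Q0).
    """
    return [ord(ch) - offset for ch in qual]


# '!' is ASCII 33 (Q0) and '@' is ASCII 64 (Q31), as on the slide.
scores = decode_quality_string("!@")
probs = [phred_to_error_prob(q) for q in scores]
print(scores, probs)
```

A Q0 call carries no information (error probability 1), while Q31 corresponds to an error probability a little under 1 in 1000.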
GTF Format
•  GTF stands for Gene Transfer Format.
•  The tab-delimited file includes the fields below:
   –  <seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]
   (http://mblab.wustl.edu/GTF22.html)
seqname  source     feature     start  end    score  strand  frame  attributes
chr1     Cufflinks  transcript  84015  84983  1      -       .      gene_id "ENSGALG00000009775"; transcript_id "ENSGALT00000015896";
chr1     Cufflinks  exon        84015  84983  1      -       .      gene_id "ENSGALG00000009775"; transcript_id "ENSGALT00000015896"; exon_number "1";
chr1     Cufflinks  transcript  6268   21192  1000   +       .      gene_id "CUFF.1"; transcript_id "ENSGALT00000015891"; FPKM "26.6821513228";
chr1     Cufflinks  exon        6268   6477   1000   +       .      gene_id "CUFF.1"; transcript_id "ENSGALT00000015891"; exon_number "1"; FPKM "26.6821513228";
chr1     Cufflinks  exon        16287  16386  1000   +       .      gene_id "CUFF.1"; transcript_id "ENSGALT00000015891"; exon_number "2"; FPKM "26.6821513228";
chr1     Cufflinks  exon        18353  18470  1000   +       .      gene_id "CUFF.1"; transcript_id "ENSGALT00000015891"; exon_number "3"; FPKM "26.6821513228";
chr1     Cufflinks  exon        19705  19806  1000   +       .      gene_id "CUFF.1"; transcript_id "ENSGALT00000015891"; exon_number "4"; FPKM "26.6821513228";
chr1     Cufflinks  exon        20015  20196  1000   +       .      gene_id "CUFF.1"; transcript_id "ENSGALT00000015891"; exon_number "5"; FPKM "26.6821513228";
chr1     Cufflinks  exon        20399  20505  1000   +       .      gene_id "CUFF.1"; transcript_id "ENSGALT00000015891"; exon_number "6"; FPKM "26.6821513228";
chr1     Cufflinks  exon        20595  21192  1000   +       .      gene_id "CUFF.1"; transcript_id "ENSGALT00000015891"; exon_number "7"; FPKM "26.6821513228";
Format often used in RNA-seq analysis workflows
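
To make the nine GTF fields concrete, here is a minimal sketch of splitting one GTF line into named fields and an attribute dictionary. It is an illustration, not the parser used by Cufflinks or any other tool, and it assumes no attribute value contains a semicolon or an embedded quote. The function name `parse_gtf_line` is hypothetical; the example line is the first CUFF.1 exon from the slide.

```python
def parse_gtf_line(line):
    """Split one tab-delimited GTF line into a dict of named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError("expected at least 9 tab-delimited fields")
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields[:9]

    # Attributes look like: key "value"; key "value";
    # Assumption: values never contain ';' themselves.
    attributes = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based
        "end": int(end),      # and the end position is inclusive
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


example_line = (
    "chr1\tCufflinks\texon\t6268\t6477\t1000\t+\t.\t"
    'gene_id "CUFF.1"; transcript_id "ENSGALT00000015891"; '
    'exon_number "1"; FPKM "26.6821513228";'
)
record = parse_gtf_line(example_line)
exon_length = record["end"] - record["start"] + 1  # inclusive coordinates
print(record["attributes"]["transcript_id"], exon_length)
```

Because GTF coordinates are 1-based and inclusive, the exon length is end - start + 1.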
BED Format
•  BED format provides a flexible way to define the data lines that are displayed in an annotation track
•  BED lines have up to 12 tab-delimited fields
   •  required fields: chrom, chromStart, chromEnd
   •  optional fields: name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts
   •  Important: lower-numbered fields must always be populated if higher-numbered fields are used

First ten lines of our mouse promoter file. The header line identifies the track name.
Why am I using the first three optional fields?
If my promoters are all the same size, what do you suppose is the score field?

BedGraph Format
•  Allows display of continuous-valued data in track format
•  Useful for probability scores and transcriptome data
•  Similar to the wiggle (WIG) format, but differs in that data exported in bedGraph format are preserved in their original state

BedGraph files are a preferred file format in the Epigenome Browser
They are very easy to work with, in my opinion

bigWig Format
•  For display of dense, continuous data
•  Elements must be equally sized
•  bigWig files are in an indexed binary format
•  Created initially from wiggle (wig) type files
•  Only the portions of the files needed to display a particular region are transferred to UCSC
   •  bigWig file remains on your web accessible server
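
Returning to the BED layout described above, the rule that lower-numbered fields must be populated before higher-numbered ones follows from BED being purely positional. The sketch below reads a BED line into named fields; it is illustrative only. The example record (`GeneA`, its coordinates and score) is hypothetical and not taken from the course promoter file.

```python
BED_FIELDS = [
    "chrom", "chromStart", "chromEnd",            # required
    "name", "score", "strand",                    # first three optional fields
    "thickStart", "thickEnd", "itemRgb",
    "blockCount", "blockSizes", "blockStarts",
]


def is_header(line):
    """Track and browser lines configure the display; they are not data."""
    return line.startswith(("track", "browser", "#"))


def parse_bed_line(line):
    """Map the tab-delimited values onto field names by position."""
    values = line.rstrip("\n").split("\t")
    if not 3 <= len(values) <= 12:
        raise ValueError("BED lines have between 3 and 12 fields")
    record = dict(zip(BED_FIELDS, values))
    record["chromStart"] = int(record["chromStart"])  # 0-based start
    record["chromEnd"] = int(record["chromEnd"])      # end is exclusive
    return record


# Hypothetical two-line file: a track header followed by one promoter.
lines = [
    'track name="example_promoters"',
    "chr1\t3000\t4000\tGeneA\t1000\t+",
]
records = [parse_bed_line(l) for l in lines if not is_header(l)]
length = records[0]["chromEnd"] - records[0]["chromStart"]
print(records[0]["name"], records[0]["strand"], length)
```

Since values are assigned by position, a strand can only be given if name and score are also present. BED starts are 0-based and ends are exclusive, so the length is simply chromEnd - chromStart.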
The processed data we will work with today are in bigWig format

Evaluate Raw Data Quality with FastQC
•  In this exercise, we will examine QC metrics at three stages of a dataset's life cycle
   •  First, fresh off the sequencer
   •  Second, after base quality / adapter trimming
   •  Third, after short read alignment
•  We will use a Java-based application called FastQC
•  Installation options
   •  Windows binary executable
   •  Mac OSX dmg
•  Graphical User Interface :)
•  View immediately, create HTML reports
•  Above shows Mac OSX version, Basic Statistics tab for the whole paired-end file aligned to mouse genome
•  11 metrics provided in total
•  What changes between each stage?

Epigenetic profiling of HSC and LSC: Data Visualization, Operating on Genomic Intervals
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE29130
GMP: granulocyte-macrophage progenitor, a myeloid precursor for monoblasts and myeloblasts
Bernt KM, Zhu N, Sinha AU, Vempati S, Faber J, Krivtsov AV, Feng Z, Punt N, Daigle A, Bullinger L, Pollock RM, Richon VM, Kung AL, Armstrong SA. MLL-rearranged leukemia is dependent on aberrant H3K79 methylation by DOT1L. Cancer Cell. 2011 Jul 12;20(1):66-78. PMID: 21741597

In this exercise, our focus is the MLL-AF9 fusion methyltransferase and H3K79me2 experiments. We will examine the chromatin signals and interrogate enrichment at gene promoter regions.

Where did our bigWig files come from?
SRA -> fastq -> sam -> bam -> bed -> bedgraph -> bigWig
   get.GSE29130.Chip-seq.job
   fastq-dump.job
   extractfastq.job
   alignbowtie2.job
   samtools.sirdu.job
   btools.bamToBed.job
   btools.extendBed.job
   btools.sortBed.job
   btools.bambgbw.job
I created them from scratch, so to speak
The nine job files, from top to bottom, represent the different steps
9 Jobs x 4 Experiments: MLL-AF9, H3K79me2_mLSC, H3K79me2_mGMP, H3K79me2_mHSC

Adding Custom Tracks
•  From the mm9 genome browser, choose Tools -> Table Browser
•  Click on "add custom tracks"
•  From a separate browser window, copy the bigWig and bed file "UCSC Genome Browser Tracks" lines from http://dldcc-web.brc.bcm.edu/lilab/benji/MBRB_2013/GSE29130.track.list.txt and paste them into the "Paste URLs" box
•  Click submit to load the tracks
•  The mouse gene promoter bed and ChIP-seq bigWig tracks should now appear on Manage Custom Tracks
Custom track files can also be uploaded via the "Choose File" option
To upload many large files, you want to use a web server as we did above

Data Visualization: Changing Track Display Settings
•  The snapshot depicts the Tgfb1 promoter region
•  Signal intensities appear comparable, but the axes have different display scales by default
•  We need a common scale for the H3K79me2 samples
•  There are several ways to access individual track settings
•  Hint: use the tab key to cycle through boxes quickly
[Figure: browser snapshots, Before and After]
•  On the right, the three H3K79me2 experiments have V-max = 5
•  Now you can see H3K79me2 enrichment is greater in GMP than LSC
•  In contrast, MLL-AF9 binding appears minimal (V-max = 1.5)

MLL-AF9 to Meis1 Promoter Region
Meis1 is an MLL-fusion target identified by Bernt et al. as well as a previous study in Genes and Development
Let's use the Meis1 promoter to make a quick and dirty cut-off to separate MLL-AF9 signal from noise.

Quick and Dirty MLL-AF9 Signal Filter
•  From Table Browser, select "create filter" [ 1 ] to bring up "Filter on Fields" [ 2 ]
•  Set dataValue > 0.318759 (mean of Meis1 promoter)
•  Set data output 10E7 lines
•  Press submit [ 3 ]
•  Returning to Table Browser, set output format to "custom track"
•  Output the custom track with a name (no spaces) and a description
•  Select "BED format" for output
•  Get custom track in browser (table or genome)
Genome-wide summary statistics show 1.16 million of 2.62 billion bases have signal > mean of the Meis1 promoter

Custom MLL-AF9 Signal Track
chr6:52,155,000-52,187,500
[Figure: browser view of the custom track with gene labels HOXA3, HOXA7, HOXA9, HOXA10]
The four Hoxa genes and Mir196b were predicted by the Bernt et al. paper's empirical null distribution model
Pretty cool for arbitrary, eh?

Operating on genomic intervals
•  From Table Browser, choose the mm9_promoter track
•  Click on "create intersection" to bring up the Intersect window
•  Select your MLL-AF9 custom track
•  Select the "all records" overlap option
•  Click submit
•  Screen returns to Table Browser
•  Click on summary statistics to see the number of intersecting promoters. Press back to return to Table Browser
•  On Table Browser, select output format "BED – browser extensible data"
•  Click on "get output"
•  On the next screen, choose "get BED"
•  Our BED file of MLL-AF9 bound promoters contains the information necessary for functional enrichment analyses as well as additional intersections with other data, such as H3K79me2 levels
•  BED files can be created from almost any annotation track in the UCSC browser
•  Propose a query you would like to make on the MLL-AF9 promoters and plan your attack
•  Alternatively, take the fifth column of the promoter file (Entrez gene identifiers) and run an enrichment analysis at http://david.abcc.ncifcrf.gov
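
The two Table Browser operations used in this lab, filtering signal by a cut-off and intersecting the result with promoters, can be sketched in a few lines. This is only an illustration of the idea, not how the Table Browser is implemented. All intervals and gene names below are hypothetical; only the threshold 0.318759 comes from the slides. Coordinates are assumed to be BED-style (0-based start, exclusive end).

```python
MEIS1_PROMOTER_MEAN = 0.318759  # cut-off from the slides


def filter_signal(intervals, threshold):
    """Keep (chrom, start, end) for intervals whose value exceeds threshold."""
    return [(c, s, e) for c, s, e, value in intervals if value > threshold]


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def intersect_promoters(promoters, signal):
    """Return promoters that overlap any signal interval on the same chromosome."""
    by_chrom = {}
    for chrom, start, end in signal:
        by_chrom.setdefault(chrom, []).append((start, end))
    hits = []
    for chrom, start, end, name in promoters:
        if any(overlaps(start, end, s, e) for s, e in by_chrom.get(chrom, [])):
            hits.append((chrom, start, end, name))
    return hits


# Hypothetical bedGraph-like signal: (chrom, start, end, value)
signal = [
    ("chr6", 100, 200, 0.5),
    ("chr6", 300, 400, 0.1),   # below the cut-off, dropped
    ("chr11", 50, 150, 0.9),
]
# Hypothetical promoters: (chrom, start, end, name)
promoters = [
    ("chr6", 150, 250, "geneA"),   # overlaps chr6:100-200
    ("chr6", 350, 450, "geneB"),   # only overlaps the dropped interval
    ("chr11", 150, 250, "geneC"),  # adjacent to chr11:50-150, no shared base
]

kept = filter_signal(signal, MEIS1_PROMOTER_MEAN)
bound = intersect_promoters(promoters, kept)
print([name for _, _, _, name in bound])
```

The nested scan is fine for a classroom-sized example; for genome-scale files, sorted intervals or a dedicated interval tool would be the practical choice.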