Genomic Data Manipulation

BIO508 Spring 2012
Problems 06: Genomes and Sequencing

1. Congratulations! After shelling out about $300 to the Broad, which based on the back of my envelope is about what it would cost these days, you've just acquired your first sequenced bacterial genome. They've provided some raw 100bp paired end Illumina reads for you to download here, along with a few tools you'll use later during the assignment:
http://huttenhower.sph.harvard.edu/moodle/mod/resource/view.php?id=248

Recall that Illumina reads are (currently) about 100bp in length, and they can be sequenced in pairs from both ends of a longer DNA fragment. Libraries of fragments on the order of 160-180bp in length are typically generated so that paired end reads from either end overlap a bit to doubly sequence the middle (since the sequencing error rate increases toward the end of each read). The PE reads are thus provided in two FASTQ files, each containing one end of each pair. The first entry in both files is the first read, ends 1 and 2 in the two different files; the second entry is the second read, ends 1 and 2; and so forth.

NB: For this problem set, you'll be submitting a .zip or .tar.gz file containing a smattering of files you create throughout the assignment. It should be named either problems06.zip or problems06.tar.gz, and it should contain each of the *files_starred_like_this.txt* in the assignment, plus one additional problems06.doc, problems06.docx, problems06.txt, or similar file containing the answers to the written questions (also *starred like this*).

a. (0) Look at your FASTQ files using a text editor or less, and read about the FASTQ format so you're familiar with the data:
http://en.wikipedia.org/wiki/FASTQ_format
b. (2) There are a dozen ways to answer this easily, so if you find yourself doing something complicated, stop! *What genus is your bacterium?*
c. (6) Optional but useful: after looking at the FASTQ files, write a Python script *fastq2fasta.py* that will convert a FASTQ file to a standard FASTA. This is easy using re and can be done in order-of-magnitude 10 lines of code.

2. Depending on how much you pay the Broad (and most sequencing centers), they can perform more or less than the most basic processing for sequencing data, specifically the conversion from image files (the gigantic raw files that come straight from the sequencer) to reads with quality scores and some other metadata. We're hardcore and will do the rest ourselves, so our first steps will be to clean up and filter our genome.

a. (0) Illumina assigns a quality score to each base in the same spirit as earlier Phred quality scores for Sanger sequencing data. The exact numerical scoring system can vary, but the general idea is that low means bad and high means good. Typical values range from -5 to 40, and indicate the probability p of a base being miscalled, $p = 10^{-\mathrm{quality}/10}$.
b. (0) A common method for quality trimming was popularized by the BWA aligner (http://bio-bwa.sourceforge.net/) and shortens any sequence of original length L with quality scores q_i at each position i to a new length l to accommodate a minimum quality value Q:

$$\arg\max_{l} \sum_{i=l+1}^{L} (Q - q_i)$$

All that gibberish aside, Joseph Fass at UC Davis has been kind enough to implement this algorithm for us already, and I've converted it into the TrimBWAstyle.py script in the packet you downloaded above. Run this script twice on your two paired end files to trim their reads; the syntax is:
python TrimBWAstyle.py < bug_1.fastq > bug_trimmed1.fastq
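To make the two formulas in Problem 2 concrete, here is a minimal sketch of the quality-to-probability conversion and of the trimming rule exactly as written above. It is not the contents of TrimBWAstyle.py, which may differ in details such as quality-score encoding or early stopping. The function names phred_to_error_prob and bwa_trim_length are hypothetical, and the qualities are assumed to be already decoded into a list of integers.

```python
def phred_to_error_prob(quality):
    """Probability that a base is miscalled: p = 10 ** (-quality / 10)."""
    return 10 ** (-quality / 10.0)


def bwa_trim_length(quals, min_quality):
    """Return the new length l that maximizes sum(Q - q_i) over i = l+1 .. L.

    quals is a list of integer quality scores for one read (hypothetical
    input format); min_quality is the threshold Q. Ties keep the longer read.
    """
    best_length = len(quals)  # l = L: nothing trimmed, empty sum of 0
    best_sum = 0
    running_sum = 0
    # Walk backward from the end of the read, extending the trimmed
    # suffix by one base per step. Index new_length is position l + 1.
    for new_length in range(len(quals) - 1, -1, -1):
        running_sum += min_quality - quals[new_length]
        if running_sum > best_sum:
            best_sum = running_sum
            best_length = new_length
    return best_length


# A read whose last two bases are poor is cut back to its first three.
example_quals = [30, 30, 30, 10, 5]
trimmed_length = bwa_trim_length(example_quals, 20)
```

A read that is good throughout keeps its full length, because every term Q - q_i is negative and the empty sum (no trimming) wins.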
c. (2) Run two head commands (or do the equivalent with a text editor if you don't want to install Cygwin on Windows; this goes for all the subsequent commands) to save the first 100 lines of bug_trimmed1.fastq and bug_trimmed2.fastq as *head_trimmed1.fastq* and *head_trimmed2.fastq*, respectively. Submit these two files so we can see your progress so far.

3. Ok, back to Python. The trimming script above will mark low-quality sequences, but it won't remove their pairs from the two PE FASTQ files. We'll just have to do that ourselves.

a. (0) Use paste to turn the two FASTQ files, bug_trimmed1.fastq and bug_trimmed2.fastq, into a single quasi-FASTQ file, bug.fasterq, in which each tab-delimited line contains the two corresponding lines from the individual inputs. This should be one quick-and-easy command; again, if you don't want to install Cygwin, you can do this reasonably easily in Excel or OpenOffice.
b. (5) Create a script named *remove_bad_seqs.py* that you'll use to filter this quasi-FASTQ file; you want to retain only sequences of at least a given length, so we'll A) take a command line argument to set that length and B) use 75 when we actually run it (a good value for mapping to bacterial genomes using 100bp reads). Note that this should not have a __name__ == "__main__" block like your previous exercises, but should be a single-purpose file that you create entirely by filling in the following blanks:

#!/usr/bin/env python
import ___
import ___
if len( sys.argv ) != 2:
    raise _________( "Usage: remove_bad_seqs.py <n> < <data.fasterq>" )
iN = int(sys.argv[1])
aastrLines = []
for astrLine in csv.reader( sys.stdin, csv.excel_tab ):
    aastrLines.append( astrLine )
    if len( aastrLines ) == 4:
        fBad = False
        for strLine in aastrLines[1]:
            if len( strLine ) < __:
                fBad = ____
                break
        if not ____:
            print( "\n".join( "\t".join( a ) for a in aastrLines ) )
        aastrLines = __
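To picture the quasi-FASTQ layout from step a, the sketch below mimics in Python what the paste and cut commands do to a pair of four-line FASTQ records. It is only an illustration of the file layout under simplifying assumptions: the names paste_lines and cut_column and the two toy reads are hypothetical, columns are counted from 0 here, and unlike the real paste command this version stops at the end of the shorter input.

```python
def paste_lines(lines1, lines2):
    """Yield tab-joined pairs of corresponding lines, like the paste command."""
    for left, right in zip(lines1, lines2):
        yield left.rstrip("\n") + "\t" + right.rstrip("\n")


def cut_column(lines, column):
    """Yield one tab-delimited column (0-based), like the cut command."""
    for line in lines:
        yield line.rstrip("\n").split("\t")[column]


# One toy read pair (made up for illustration): end 1 and end 2.
end1 = ["@read1/1", "ACGT", "+", "IIII"]
end2 = ["@read1/2", "TTGA", "+", "IIHG"]

# Each pasted line holds the matching lines of both ends, so the
# second line of every four-line record carries both sequences.
pasted = list(paste_lines(end1, end2))
```

Splitting the columns back out recovers the two original records unchanged, which is why filtering whole rows of the pasted file keeps the two ends in step.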
c. (0) Run this script with the command line argument 75 for your length and bug.fasterq as input to create bug_filtered.fasterq as output.
d. (0) Run two cut commands to re-split out the first and second columns of this quasi-FASTQ file as bug_filtered1.fastq and bug_filtered2.fastq. This should be two quick-and-easy commands.
e. (3) Run two head commands to save the first 100 lines of bug_filtered1.fastq and bug_filtered2.fastq as *head_filtered1.fastq* and *head_filtered2.fastq*, respectively. Submit these two files.

4. You should have a pretty good idea what sort of bug you're looking at after answering Problem #1, so let's download a related reference genome to compare it with. Go to the GenBank FTP site for Bifidobacterium longum infantis ATCC 15697 at:
ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Bifidobacterium_longum_infantis_ATCC_15697_uid17189/
Download the raw finished genome sequence file (CP001095.fna) and its annotated features in GFF format (CP001095.gff).

a. (0) It took me longer to figure this step out than the entire rest of the assignment put together, so pay attention! Open up the .fna file (which is just a FASTA file; the extension stands for "FASTA Nucleic Acid") in a text editor. See the first header line where it says:
>gi|213522389|gb|CP001095.1| Bifidobacterium longum subsp. infantis ATCC 15697, complete genome
Delete everything except the CP001095.1 part, so it reads:
>CP001095.1
For whatever reason, the whitespace in this FASTA header inscrutably confuses some of the software we'll use later on. Trust me, I just saved you hours of heartbreak.
b. (2) Run head again to save 100 lines in *head_cp001095.fasta*.

5. Remember that an aligner is a program that maps short reads to the location(s) in a longer sequence identical or near-identical to them. This is very much like a BLAST local alignment search, but specifically for the case when the query sequence is very short, the target is long, and the match is expected to be near-identical; this allows the process to run much faster than BLAST.

a. (0) Notice above where I mentioned the BWA aligner? That's not the one we're going to use. Instead, download the Bowtie aligner from:
http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.7/
If at any point you're interested in why we do what we do with Bowtie, its tutorial is excellent and includes all of the commands we'll be using here:
http://bowtie-bio.sourceforge.net/tutorial.shtml
b. (0) Bowtie (and BWA, and most other short read manipulation tools) save read alignment positions in a format called SAM (Sequence Alignment/Map) or BAM (the binary, computer-readable version; SAM is a plain text format like FASTA or GFF). In addition to Bowtie for alignment, we'll be using a standard set of programs for manipulating its .sam and .bam output files, Samtools; download them here:
http://sourceforge.net/projects/samtools/files/samtools/0.1.12/
Note that if you're on a Mac, you'll either have to build these yourself (easy if you have the standard developer tools installed; just run make) or switch to Windows for a bit.
c. (0) Downloading these two programs will give you a .zip or .tar.gz or .tar.bz2 file, which you can expand using unzip or tar -xzf <filename.tar.gz> or tar -xjf <filename.tar.bz2>, respectively. Note that you should thus end up with a directory named whatever you want (probably something like problems06) containing your data files (problems06.docx, CP001095.fna, and so forth) and two subdirectories, bowtie-0.12.7 and samtools-0.1.12a_i386-win32 (or something similar).
I'll assume that you have your files laid out like this when I provide commands to run, but pay attention and make changes accordingly if you choose otherwise. Also notice that I use MacOS/Linux path separators /, and you should use \ on Windows.
d. (0) Bowtie and Samtools both require an index to run, which is a way of quickly looking up specific sequences inside a FASTA text file. This operates very much like a Python dictionary, which lets us look up values very quickly given their keys; an index lets us look up locations very quickly given the sequence we're looking for. The formats of the two programs' indices are different, though, so generate them separately by running the following commands:
./bowtie-0.12.7/bowtie-build CP001095.fna CP001095
./samtools-0.1.12a_i386-win32/samtools faidx CP001095.fna
The first will generate a bunch of .ebwt files; the second a .fai file. Documentation on these two steps can be found on the Bowtie and Samtools web pages, respectively.
e. (3) *Why do the Bowtie index files end in .ebwt?*
f. (4) Let's first generate a text .sam file aligning our bacterium's reads to the B. longum infantis 15697 genome so you can take a look with less and see what it's like. Run the following command:
./bowtie-0.12.7/bowtie -S CP001095 -1 bug_filtered1.fastq -2 bug_filtered2.fastq > bug_CP001095.sam
Note that Bowtie is automatically aware that the end pairs have to align together. Take a look at your output bug_CP001095.sam, and run head to get us another 100 line file *head_bug_CP001095.sam*.
g. (0) Now let's turn this into a binary BAM file that the computer can use. Run:
./samtools-0.1.12a_i386-win32/samtools view -bS -o bug_CP001095.bam bug_CP001095.sam
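Before the text SAM file from step f disappears into binary, it may help to see how one of its lines breaks down. The sketch below splits an alignment line into the eleven mandatory tab-delimited SAM fields and checks the "unmapped" bit (0x4) of the FLAG field. The names SamRecord, parse_sam_line, and is_unmapped are hypothetical, the example line is made up rather than taken from real Bowtie output, and optional tag fields after the eleventh column are simply ignored.

```python
from collections import namedtuple

# The eleven mandatory SAM columns, in order.
SamRecord = namedtuple("SamRecord", [
    "qname", "flag", "rname", "pos", "mapq", "cigar",
    "rnext", "pnext", "tlen", "seq", "qual",
])


def parse_sam_line(line):
    """Return a SamRecord for an alignment line, or None for a header line."""
    if line.startswith("@"):  # header lines such as @HD, @SQ, @PG
        return None
    fields = line.rstrip("\n").split("\t")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]),
        int(fields[4]), fields[5], fields[6], int(fields[7]),
        int(fields[8]), fields[9], fields[10],
    )


def is_unmapped(record):
    """True if the 0x4 bit of FLAG is set, meaning the read did not align."""
    return bool(record.flag & 0x4)


# A made-up alignment line for illustration only.
example_line = "read1\t99\tCP001095.1\t1001\t255\t4M\t=\t1101\t104\tACGT\tIIII\n"
record = parse_sam_line(example_line)
```

Note that the third field, the reference name, is where the edited FASTA header from Problem 4 shows up.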
h. (0) By default, the alignments in SAM/BAM files are listed in the order they were found; most visualization programs, however, require them to be sorted in the genomic order of the target sequence. Fortunately, Samtools can do that for us; run:
./samtools-0.1.12a_i386-win32/samtools sort bug_CP001095.bam bug_CP001095_sorted
Finally, we need to index the alignments for rapid access just like we indexed the original sequence:
./samtools-0.1.12a_i386-win32/samtools index bug_CP001095_sorted.bam
This generates the end goal of this entire shenanigan, bug_CP001095_sorted.bam.bai, an index file for the binary sorted alignments of your brand new genome against the Bli15697 reference.
i. (4) Samtools not only converts .sam files to .bam, but vice versa, so you can see them again by eye. So we can give you some credit for getting this far, run the following command to submit the *header_bug_CP001095_sorted.txt* file:
./samtools-0.1.12a_i386-win32/samtools view -H bug_CP001095_sorted.bam > header_bug_CP001095_sorted.txt

6. Let's take a look at what we've got here! Download the nice Broad GenomeView tool from:
http://sourceforge.net/projects/genomeview/files/GenomeView/1.9991/
Note that this is Java and will run anywhere; feel free to use the Java Web Start version if you'd prefer, and keep an eye on the memory usage (you may need -Xmx1024m or similar if you're running from the command line). There's extensive documentation available at:
http://genomeview.org/content/quick-start-guide

a. (0) Start up GenomeView. It should look like this: [screenshot]
b. (0) Go to File/Load features... Click OK for "Local file", then browse to your Bli15697 reference genome, CP001095.fna. Open it up, and GenomeView should look like this: [screenshot]
For the love of anything, pay attention to that "Entry" field at the top! If it doesn't look right, go back and read my instructions above about editing the .fna file header. The next few steps will not work if the various FASTA and BAM file headers don't match.
c. (0) Go to File/Load features... Click OK for "Local file", then browse to the Bli15697 genome annotations in the CP001095.gff file. Open that up, and GenomeView should add in those tracks to look like this: [screenshot]
d. (0) Notice that for these particular annotations, the "CDS" and "gene" tracks are mostly redundant, but the CDS IDs are a bit more readable. Likewise, the "source", "start_codon", and "stop_codon" tracks are basically useless. Turn off "gene", "source", "start_codon", and "stop_codon" by clicking the green √ beside them in the "Track list" (mid-upper right) so it turns into a red X. Finally, "misc_feature", "exon", and "sig_peptide" are too chatty as full tracks, so click on the rightmost red X beside their names in the "Track list" (the "Collapse" column) so they turn into green √s. After all that, if you zoom in a bit, GenomeView should look something like this: [screenshot]
e. (0) Get your genome reads in here! One more time: File/Load features..., OK for "Local file", browse to your sorted BAM index bug_CP001095_sorted.bam.bai. Congratulations, you can see each and every one of your genome reads that passed QC aligned against the Bli15697 reference!
f. (2) *What do the blue, green, and orange colors in the GenomeView read track mean?*
g. (2) *What do the red, green, yellow, and blue markers in the GenomeView reference genome track mean? Why is this useful?*
h. (3) *What does it mean when many reads pile up in a particular region? When there are no reads in a particular region? What genomic features tend to be enriched in regions with lots of reads? Are there any features enriched in regions with few or no reads?*
i. (3) Some polymorphisms arise from sequencing errors and others from honest-to-goodness genetic variation. Find an interesting legitimate-looking SNP and provide a screenshot as *screenshot_snp.png* or similar (.jpg, .gif, whatever). I've provided an example below of an apparent SNP in the active site of inosine 5'-monophosphate dehydrogenase (note that this obviously means you can't use IMPDH for your SNP!) [screenshot]
j. (4) *Find something cool in the genome, and convince me it's cool!* You can answer this question as many times as you'd like, as long as it's convincing; provide details in your doc file, and you'll get four extra points a pop.
k. (2) *Based on the evidence you've seen so far, where do you think your bug was isolated from?*
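The read pile-ups that GenomeView draws in Problem 6 amount to a per-position count of how many aligned reads cover each base of the reference. A minimal sketch of that tally follows. It assumes every read aligns without gaps, so a read is fully described by its 1-based start position and its length; the function name read_depth and the toy alignments are hypothetical, and real tools work from the CIGAR strings in the BAM file instead.

```python
from collections import Counter


def read_depth(alignments):
    """Count how many reads cover each reference position.

    alignments is an iterable of (start, length) pairs with 1-based starts,
    assuming ungapped alignments (a simplification for illustration).
    """
    depth = Counter()
    for start, length in alignments:
        for position in range(start, start + length):
            depth[position] += 1
    return depth


# Two overlapping toy reads of length 4, starting at positions 1 and 3.
depth = read_depth([(1, 4), (3, 4)])
```

Positions never touched by a read simply report a depth of zero, which is what an empty stretch of the read track looks like numerically.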
7. ( ) It would be great if we could run a whole genome assembly from these reads, but A) I haven't provided you with enough coverage to get a half-decent assembly because B) there's no good assembly software that runs on Windows. But fret not! I (or really, the Human Microbiome Project) have conveniently provided a draft assembly of your bug (probably run using SOAPdenovo, for those who are curious). Rather than mapping individual reads, let's align the entire set of resulting contigs against the Bli15697 reference to see what synteny and conservation look like (we'll discuss comparative genomics in more detail later on in the class).

a. (0) It turns out that your bug is Bifidobacterium longum infantis ATCC 55813; from now on, for clarity's sake, I'll refer to it as Bli55813 (not to be confused with the reference strain, Bli15697). You can download its assembled contigs from:
ftp://ftp.ncbi.nih.gov/genbank/genomes/HUMAN_MICROBIOM/Bacteria/Bifidobacterium_longum_infantis_ATCC_55813_uid31437/
Note that you need only the assembled contigs as plain FASTA files, ACHI00000000.contig.fna.tgz, and you can ignore all the other stuff in the directory (although it is interesting, if you want to poke around!) Also note that the GenBank FTP site is occasionally mysteriously unavailable - your tax dollars at work. In such a case, you can alternatively retrieve a near-identical genomic FASTA file from:
http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=ACHI01
by scrolling down and clicking on ACHI01.fasta.gz. This lets you skip the next step, too, by the way.
b. (0) Untar this package of 140 contigs into files ACHI01000001.fna through ACHI01000140.fna. Using cat or a text editor, combine them into a single FASTA file:
cat ACHI01000*.fna > ACHI01000000.fna
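A quick sanity check after concatenating is to count the FASTA header lines in the combined file, which should match the 140 contigs mentioned in step b. The sketch below does that for any iterable of lines; the function name count_fasta_records and the toy input are hypothetical.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines, which start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


# Toy input with two records, for illustration only.
toy_fasta = [">contig1\n", "ACGTACGT\n", "ACGT\n", ">contig2\n", "TTGACA\n"]
toy_count = count_fasta_records(toy_fasta)

# On the real file, an open file handle can be passed directly:
# with open("ACHI01000000.fna") as handle:
#     print(count_fasta_records(handle))
```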
c. (0) Let's get a fully-annotated version of the reference strain Bli15697 from Ensembl, which has its own GenBank-like format. Check out the list of Ensembl bacteria at:
http://www.ebi.ac.uk/genomes/bacteria.html
Find "Bifidobacterium longum subsp. infantis ATCC 15697" and click on it, taking you to:
http://www.ebi.ac.uk/ena/data/view/Taxon:391904
Click the + to expand "Assembled & Annotated Sequences (EMBL-Bank)", then click the 1 under "Taxon & its descendants" to go to the genome page. Finally, right click on the "TEXT" link by "Download:" in the upper right, choose "Save Link As", and save it as the file CP001095.embl (instead of just .txt, so as not to confuse some programs later).
d. (0) Grab and install UGENE from:
http://ugene.unipro.ru/download.html
Fire it up, create a new project wherever you'd like when prompted, and then go to Tools/Build dotplot. Click on the "..." button for the "File with first sequence", browse to your reference genome CP001095.embl, and select it. Then click on "..." for the "File with second sequence", browse to your concatenated ACHI01000000.fna file, and select it. Check "Join all sequences found in file" to end up with a dialog something like this (then hit Next): [screenshot]
e. (0) When UGENE asks you to configure the dotplot, let's enable inverted as well as direct repeats (check "Search for inverted repeats") and make them red. Click on the black box beside inverted repeats, select a nice red, and click OK. Also crank down the stringency to only 95% identity. You should end up with something like this (then click OK): [screenshot]
f. (4) Whoa, colors! UGENE is showing you a variety of important things: the reference vs. query dotplot on top, then the reference strain Bli15697 with annotated features, and the query strain Bli55813 contigs below. It would take a lot of typing to explain what all of these views are, but feel free to poke around, zoom in and out, and most importantly read the UGENE documentation liberally as needed:
http://ugene.unipro.ru/documentation.html
If you'd like us to check this, provide a screenshot file *screenshot_dotplot.png* when you get this far.
g. (0) Note that there are a zillion features annotated by Ensembl in the reference genome, but nothing but the contig boundaries in your query. There are sophisticated algorithms for calling Open Reading Frames (ORFs) based on machine learning, but it's easy to call them just by looking for appropriately separated start and stop codons on the same strand. Let's do that: click on the query sequence panel (on the bottom, where it says "Sequence [dna]"), then click on the "Find ORFs" button in the toolbar (a magnifying glass with a little horizontal blue bar). Make sure to search both strands using the standard genetic code, with a minimum length of 300bp and using the whole "Sequence range". When you're done, it should look like this (and hit Search): [screenshot]
h. (3) Click "Save as annotations", create a new table file wherever you'd like, and click Create to get these new annotations to show up in UGENE. You should now have a bright yellow ORF track on your query genome listing every possible putative ORF (including a bunch of false positives): [screenshot]
Again, if you'd like us to check it, submit a screenshot called *screenshot_orfs.png*.
i. (0) If you read the Bli15697 genome paper, you might have noticed they were particularly excited about the β-galactosidase operon anchored around reference ORF Blon_2334. What does that operon look like in our Bli55813 genome? Hint: if you search around, you should end up with a view something like this: [screenshot]
j. (2) *Is Blon_2334 conserved?*
k. (3) *Is the β-gal operon conserved? Syntenically?*
l. (3) *What's unfortunate about our contigs? What might be some solutions to remedy the situation?*
m. (4) *What does this potentially imply about the putative IS3 and SBP family cluster in Bli15697?*
n. (5) *What are some more ways in which you could investigate this more directly than by using a semi-qualitative dotplot?*
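The naive ORF calling described in step g, looking for appropriately separated start and stop codons on the same strand, can be sketched in a few lines. This is not UGENE's implementation, which may handle alternative start codons and other details differently. It assumes ATG as the only start codon, the three standard stop codons, and the 300bp minimum from step g; the names find_orfs, find_orfs_one_strand, and reverse_complement are hypothetical, and coordinates are 0-based and half-open on the forward strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of a DNA string; unknown bases become N."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(seq))


def find_orfs_one_strand(seq, min_length):
    """Yield (start, end) for each ATG-to-stop stretch in all three frames."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == START_CODON:
                    start = i  # first in-frame ATG since the last stop
            elif codon in STOP_CODONS:
                if i + 3 - start >= min_length:
                    yield (start, i + 3)
                start = None


def find_orfs(seq, min_length=300):
    """Return (strand, start, end) tuples for ORFs on both strands."""
    seq = seq.upper()
    n = len(seq)
    orfs = [("+", s, e) for s, e in find_orfs_one_strand(seq, min_length)]
    # Map reverse-strand hits back onto forward-strand coordinates.
    reverse = reverse_complement(seq)
    orfs += [("-", n - e, n - s)
             for s, e in find_orfs_one_strand(reverse, min_length)]
    return orfs


# Toy sequence with one 15bp ORF (ATG, three AAA codons, TAA).
toy_seq = "CC" + "ATG" + "AAA" * 3 + "TAA" + "GG"
toy_orfs = find_orfs(toy_seq, min_length=12)
```

Because it reports every qualifying stretch, a caller like this produces plenty of false positives, as noted in step h.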