
Genomic Data Manipulation
BIO508 Spring 2014
Problems 08
Transcriptional Regulation
1. (0) The central dogma's been around for 50 years: transcription factors activate transcription, transcripts are
translated, and the proteins go out to do the work of the cell. At least in unicellular microbes like
Saccharomyces cerevisiae - the genetics of which have also been studied for nearly as long - enough of the cell's
regulation is transcriptional that unraveling promoter binding sites and interactions should be no problem.
So why haven't we solved yeast regulatory biology yet?
The goal of this assignment will be to look at regulatory interactions under a very well-defined set of yeast
growth conditions, attempt to delineate regulatory modules, look for transcription factor interactions, and
reconstruct these data into a regulatory network. Over the course of these questions, you should be able to see
A) several of the steps that are still used in regulatory network reconstruction (often in conjunction with
additional data describing epigenetics or TF binding) and B) why this is a hard problem, even in yeast!
You know the drill: upload a .zip or .tar.gz file named problems08.zip or problems08.tar.gz containing each of
the *files_starred_like_this.txt* and one problems08.doc/.docx/.txt file answering any
written questions *starred like this*.
2. First, let's build some groups of coexpressed genes. We'll go back to our old familiar
dilution_rate_02_knn.txt file:
http://growthrate.princeton.edu/data/dilution_rate_02_knn.txt
Recall that this contains 36 S. cerevisiae expression conditions, 6 different nutrient limitations (carbon, nitrogen,
phosphate, sulfate, leucine, and uracil), each at 6 different growth rates (controlled by continuous chemostat
culture).
a. (0) Open this file up in MeV as a two-channel array. Adjust the data using a variance filter with a standard
deviation cutoff of 0.5, and make sure to open up the new data filter on the left, right click, and "Set as
Data Source" the resulting filtered expression matrix. Also right click on the heatmap on the right side
and save this filtered expression matrix as dilution_rate_02_knn_filtered.txt, which should
contain a bit over 1,500 genes.
b. (4) Run a FOM (figure of merit) from Clustering, calculating medians up to a maximum cluster count of 25. Take a look at
the plot - there should be two obvious candidates for "good" cluster numbers, one just below 15 and one
just above 20. Using the latter (larger) cluster count, run k-means from Clustering using median
calculations (and default settings otherwise, except for the number of clusters). Expand the resulting table
view on the left, select any cluster, right click on the table to the right, and "Save All Clusters". For the
cluster count n you chose, this should result in n output text files that you should submit as *k-1.txt*
through *k-n.txt*.
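To make the variance filter from step a concrete, here is a minimal sketch of a standard-deviation cutoff applied to rows of expression values. It is not MeV's implementation: the use of the sample standard deviation, the strict "greater than" comparison, the handling of missing values as None, and the toy values are all assumptions made for illustration.

```python
import statistics


def sd_filter(rows, cutoff=0.5):
    """Keep (gene_id, values) rows whose sample standard deviation exceeds cutoff.

    Missing values are represented as None and ignored (an assumption).
    """
    kept = []
    for gene_id, values in rows:
        present = [v for v in values if v is not None]
        # A standard deviation needs at least two observations.
        if len(present) >= 2 and statistics.stdev(present) > cutoff:
            kept.append((gene_id, values))
    return kept


# Toy log-ratio rows; the values are made up, not taken from the real data set.
toy = [
    ("YAL001C", [0.1, 0.0, -0.1, 0.05]),  # nearly flat across conditions
    ("YBR002W", [1.5, -1.2, 0.8, -0.9]),  # clearly variable
]
print([gene_id for gene_id, _ in sd_filter(toy)])  # ['YBR002W']
```

Genes that barely change across the 36 conditions carry little information for clustering, which is why they are dropped before running k-means.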
3. Is there any reason to believe that these coregulated modules are telling us about real biology? Let's find out!
As we discussed last week, one of the oldest and easiest ways to do this is to test whether any previously
characterized pathways or functional modules are enriched in your gene set(s) of interest. We'll use the
GOrilla online GO TermFinder implementation at:
http://cbl-gorilla.cs.technion.ac.il
This uses the hypergeometric distribution as discussed briefly in class.
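As a reminder of what the hypergeometric test computes, the sketch below gives the upper-tail probability of seeing at least k annotated genes in a target set of n genes drawn from a background of N genes, K of which carry the annotation. The function name and the example counts are hypothetical, and this does not reproduce GOrilla's exact statistics or its multiple-testing correction.

```python
from math import comb


def hypergeom_enrichment_p(N, K, n, k):
    """P(X >= k) when drawing n genes without replacement from N, K of them annotated."""
    upper = min(K, n)
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total


# Hypothetical counts: 1,500 background genes, 40 annotated with some GO term,
# a 150-gene cluster, 15 of which carry the term.
print(hypergeom_enrichment_p(N=1500, K=40, n=150, k=15))
```

Note that N and K come from the background list, so the same cluster can look enriched or unremarkable depending on which background you supply.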
a. (0) Set your organism to Saccharomyces cerevisiae and make sure you're running in "Two unranked lists of
genes" mode so that you can provide a background gene list below.
b. (0) Copy/paste the gene list from the largest coexpressed module (should be order-of-magnitude 150
genes) into the GOrilla "Target set" box.
c. (2) Do the same with the gene list from your filtered input PCL file (should be ~1,500 genes), pasting it into the
"Background set" box. Click on "Paste a ranked..." above this box to read its help text. *Why is providing
the proper background gene set vital when performing any TermFinder or enrichment analysis?*
d. (0) Click "Search Enriched GO terms" to get the ball rolling, and after a second or two you should have
an enrichment figure and a table of enriched GO terms.
e. (2) *How many GO Terms were significantly enriched in your gene set using these (default)
parameters, and what were they? List at most the names of the first 5 Biological Process terms.*
f. (1) Right click on the graph of GO terms, select "Save image as..." (or equivalent), and save that PNG file
as *scerevisiae_go_large.png*.
g. (2) *Does this look like a biologically plausible gene cluster? Why or why not?*
h. (2x5) Repeat steps a-g for a cluster of ~100 genes and one of ~50 genes, making sure to answer questions
e and g and saving these views as *scerevisiae_go_medium.png* and
*scerevisiae_go_small.png*. If these three gene sets look too similar to each other (which they
might - keep an eye out for very repetitive GO terms), poke around a bit using larger or smaller sets to
find three distinct and interesting gene lists. Keep track of these cluster IDs!
i. (0) Hint: don't worry if the answer to e is zero at most once (do worry if it's zero two or three times).
4. Now let's go fish out some upstream regulatory sequences for these putatively coregulated modules. Head to:
http://www.biomart.org
which is my default for obtaining conveniently formatted DNA sequences (and other gene ID information).
a. (0) Go to Bio Portal/Sequence retrieval/Ensembl/Saccharomyces cerevisiae genes.
b. (0) Request the transcript flank and enter 500 bases for Upstream Flank. Expand "Header Information",
and uncheck "Ensembl Transcript ID".
c. (1) Run cut -f2 dilution_rate_02_knn_filtered.txt | grep '^Y' | grep -v YORF >
dilution_rate_02_knn_filtered_genes.txt, or perform the equivalent cut and filter using Excel.
Save and submit the resulting file *dilution_rate_02_knn_filtered_genes.txt*.
d. (0) Expand Filters, limit to genes with EntrezGene IDs, leave the type set to Ensembl genes, and "upload
file" your dilution_rate_02_knn_filtered_genes.txt ID list file.
e. (3) Click Go down at the bottom. Congratulations, you should be previewing a bunch of sequence in
FASTA format! Of course, we don't want to view this in your web browser, so click the Download button
to download a FASTA file containing the 500 bp upstream promoter sequences of every S. cerevisiae
transcript in your array data. Save and rename this file to *scerevisiae_filtered_up500b.fasta*
for submission and further investigation.
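For anyone without a Unix shell, the cut/grep pipeline in step c can be approximated in Python. This is a minimal sketch, assuming the ORF IDs sit in the second tab-delimited column as the cut -f2 call implies; the function name and the toy lines are hypothetical.

```python
def extract_orf_ids(lines, column=1):
    """Mimic: cut -f2 | grep '^Y' | grep -v YORF on an iterable of text lines."""
    ids = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 1:
            # cut passes through lines that contain no delimiter at all.
            value = fields[0]
        else:
            # Lines with too few fields yield an empty string, as with cut.
            value = fields[column] if column < len(fields) else ""
        # grep '^Y' keeps IDs starting with Y; grep -v YORF drops the header.
        if value.startswith("Y") and "YORF" not in value:
            ids.append(value)
    return ids


# Toy input with a made-up layout; check the real column order in your own file.
toy_lines = [
    "ID\tYORF\tNAME\n",
    "1\tYAL012W\tCYS3\n",
    "2\tEWEIGHT\t\n",
    "3\tYDR384C\tATO3\n",
]
print("\n".join(extract_orf_ids(toy_lines)))
```

To run it on the real file, pass an open file handle, e.g. extract_orf_ids(open("dilution_rate_02_knn_filtered.txt")), and write the result out one ID per line.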
5. (2) The first further investigation we're going to do is to throw out repetitive and low-complexity DNA
sequences. These are things like long stretches of AAAA or TTTT or CGCGC and junk like that. They have a
tendency to "look" unusual to regulatory motif finders just because they're unusual, but not because they're
actually interesting for transcriptional regulation. Let's run your scerevisiae_filtered_up500b.fasta
file through the Tandem Repeats Finder program at:
http://tandem.bu.edu/trf/trf.download.html
Download the appropriate program for your platform (the online web service is too slow), open your FASTA
file, check Options/"Masked Sequence File", and Run/Start Search using the default parameters.
Mac users, you should save the command line version of the executable, run "chmod 755 trf407b.macleopard.exe" if necessary, and use the following default values:
File = path to your sequence file
Match = 2
Mismatch = 7
Delta = 7
PM = 80
PI = 10
Minscore = 50
MaxPeriod = 500
And also append -m to generate your masked output file.
Wait a while, exit the program, delete all those unnecessary HTML files, and rename the resulting masked
sequence file scerevisiae_filtered_up500b.fasta.2.7.7.80.10.50.500.mask to
*scerevisiae_filtered_up500b_masked.fasta* for submission.
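For the command-line version, the parameter values above are passed positionally after the sequence file, followed by -m. The sketch below only assembles that argument list and the expected mask file name; the executable name "trf" is an assumption (yours may be called something like trf407b.macleopard.exe), and both helper functions are hypothetical.

```python
def trf_command(fasta_path, executable="trf", match=2, mismatch=7, delta=7,
                pm=80, pi=10, minscore=50, maxperiod=500):
    """Build the argument list: File Match Mismatch Delta PM PI Minscore MaxPeriod -m."""
    params = [match, mismatch, delta, pm, pi, minscore, maxperiod]
    return [executable, fasta_path] + [str(p) for p in params] + ["-m"]


def trf_mask_name(fasta_path, params=(2, 7, 7, 80, 10, 50, 500)):
    """Name of the masked output file, following the pattern given in the text."""
    return fasta_path + "." + ".".join(str(p) for p in params) + ".mask"


fasta = "scerevisiae_filtered_up500b.fasta"
print(" ".join(trf_command(fasta)))
print(trf_mask_name(fasta))
```

The list returned by trf_command can be handed to subprocess.run once you have confirmed the executable name and location on your own machine.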
6. (5) Now it's time for some Python so that we can extract just sequences of interest from this file, e.g. the
upstream regions of all genes in a particular cluster. We'll write a script that takes two command line
arguments, a text file containing the gene IDs whose sequences should be extracted, and an optional count
representing the maximum number of sequences to extract (for reasons that will become clear below). Fill in
the blanks in the following Python, and submit it as *grep_fasta.py*.
#!/usr/bin/env python
______ ___
if ___( ___.____ ) < 2:
    _____ _________( "Usage: grep_fasta.py <genes.txt> [n] < <seqs.fasta>" )
strGenes = ___.____[_]
iN = ___(___.____[_]) if ( ___( ___.____ ) > 2 ) else 0
setstrGenes = set()
for _______ in ____( strGenes ):
    ___________.add( _______.strip( ) )
strID = strSeq = iHits = 0
for _______ in ___._____:
    if _______[0] == ">":
        if strID in ___________:
            print( ">" + strID )
            print( ______ )
            _____ += 1
            if iN and ( _____ >= __ ):
                break
        strID = _______[1:].strip( )
        strSeq = __
    else:
        strSeq += strLine
7. Remember those three clusters you set aside earlier? Perform the following steps with all three of them; each
point score is per cluster i.
a. (1x3) Run cut -f2 k-i.txt | grep '^Y' | grep -v YORF > k-i_genes.txt for your
cluster number i (or create the equivalent output using Excel/jEdit). Save and submit *k-i_genes.txt*.
b. (1x3) Run python grep_fasta.py k-i_genes.txt < scerevisiae_filtered_up500b_masked.fasta > k-i.fasta.
Save and submit *k-i.fasta*.
c. (0) Visit MEME at:
http://meme.nbcr.net/meme/cgi-bin/meme.cgi
Enter your email and Browse to your k-i.fasta file. Report a maximum of 10 motifs, leave the other
parameters on their defaults, and Start Search.
NB: If MEME complains about the length of your input sequence file, just delete genes from the bottom
until it's short enough.
d. (2x3) When the MEME email arrives, save and submit the file as *k-i_meme.txt*.
e. (3x3) *How many motifs were significant (max. 10)? How many of the genes in the ith group include
these motifs in their upstream sequences (and what percentage of the group is that)?* Note that this
should be easy to answer using grep '^>' k-i_meme.txt | sed 's/\t.*//' | sort | uniq | wc -l.
f. (1x3) *After a quick eyeball, how many of these motifs actually look unique and not just copies of
each other?*
8. Wouldn't it be nice to identify whether any of these putative TFBS motifs correspond to known consensus
sequences? Let's take a quick look using the TOMTOM motif search and comparison tool at:
http://meme.sdsc.edu/meme/cgi-bin/tomtom.cgi
a. (2x2x3) Select two reasonably different motifs from each of these three files, upload them to the
TOMTOM MEME search box, and Start the search. For each matrix, *what was its original motif
sequence from MEME, what known motif(s) does TOMTOM think it matches (if any), and do you
believe it?*
9. (5) Summarize your findings from this study: *What have you discovered about transcriptional regulators
of growth rate and/or nutrient utilization in S. cerevisiae? What have you discovered about transcription
factor binding site discovery and regulatory network inference?*
10. ( ) In order to build these coregulated modules into a network, let's get a list of putative transcription factors
in S. cerevisiae that could be doing the regulating. Visit my favorite list of yeast TFs and TFBSs at:
http://fraenkel.mit.edu/improved_map/
Download the v1.tamo file by clicking on "Motif data in TAMO format".
a. (0) Run grep Source v1.tamo | sed 's/Source: *//' | sort | uniq >
v1_symbols.txt (or an equivalent jEdit command) to generate a list of gene symbols, not IDs, that are
included as TFs in this dataset.
b. (2) Visit YEASTRACT (my second favorite list of yeast TFs) at:
http://www.yeastract.com/formorftogene.php
Paste your list of TF symbols into the box (should be exactly 124), click Transform, copy/paste the
resulting ORF ID list (the things starting with Y and some letters and numbers) into a new text file, and
search-and-replace the lower case "c" and "w" into upper case. Save the resulting file as
*v1_genes.txt*.
c. (8) Write and submit a Python script *clusters2cytoscape.py* that will convert your list of
potential TFs and your k-i.txt cluster files into a pair of edge and node descriptor files appropriate for
import into Cytoscape.
i. Takes at least two command line arguments: an output file of the form nodes.txt and one or
more input cluster files of the form k-i.txt. The script should take your v1_genes.txt file as
input on standard input, making the complete execution call clusters2cytoscape.py nodes.txt
k-*[0-9].txt < v1_genes.txt > edges.txt.
ii. nodes.txt should contain as output two tab-delimited columns, with one gene identifier per line
in the first column. The second column should contain a color string, either #FF0000 for
transcription factors (from your v1_genes.txt list) or #00FFFF for all other genes. An example
from mine is:
YLR228C #FF0000
YOR372C #FF0000
YAL012W #00FFFF
YDR384C #00FFFF
...
iii. edges.txt should contain as output two tab-delimited columns, the left being a TF and the right a
target gene. You should output an edge for each TF/gene pair that co-occurs in at least one
cluster. For example, in my MeV output, the TF YNL103W and the genes YOL032W, YER158C, and
YDL147W all appear together in cluster #11. My edges.txt file thus starts with:
YNL103W YOL032W
YNL103W YER158C
YNL103W YDL147W
...
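The co-occurrence rule in part iii and the coloring rule in part ii can be illustrated on in-memory sets before you worry about file handling. This is a toy sketch, not the requested script: it reads no files, the function names are hypothetical, and skipping self-edges while still linking a TF to other TFs in the same cluster are assumptions you may decide differently.

```python
def cooccurrence_edges(tfs, clusters):
    """Return sorted (TF, target) pairs for genes sharing at least one cluster with a TF."""
    edges = set()
    for members in clusters:
        for tf in members & tfs:
            for gene in members:
                if gene != tf:  # assumption: no self-edges
                    edges.add((tf, gene))
    return sorted(edges)


def node_colors(tfs, clusters):
    """Map every clustered gene to #FF0000 (TF) or #00FFFF (any other gene)."""
    genes = set().union(*clusters)
    return {g: ("#FF0000" if g in tfs else "#00FFFF") for g in sorted(genes)}


# The cluster #11 example from the text: one TF and three target genes.
tfs = {"YNL103W"}
clusters = [{"YNL103W", "YOL032W", "YER158C", "YDL147W"}]

for tf, gene in cooccurrence_edges(tfs, clusters):
    print(tf + "\t" + gene)
for gene, color in node_colors(tfs, clusters).items():
    print(gene + "\t" + color)
```

Collecting edges in a set means a TF/gene pair that shares several clusters is still written only once.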
d. (4) Create and submit *nodes.txt* and *edges.txt*.
e. (4) Fire up Cytoscape. Choose File/Import/"Network from Table", browse to your edges.txt file, and
make your Source Interaction column 1 and your Target Interaction column 2. Import, then go to
File/Import/"Attribute from Table", browse to your nodes.txt file, and Import. Under the VizMapper,
enable a passthrough mapping for Node Color from Column 2, set the default
EDGE_TGTARROW_SHAPE to an arrow, and run Layout/yFiles/Organic. You should get something that
looks like the screenshot below, which you can save (File/Export/"Current Network View as Graphics")
and submit as *network.png*.
f. (4) *Explain why this network is horribly wrong.*
g. (10) Refine your network generation process to take the presence/absence of sequence motifs into account.
Rerun this entire clustering and motif discovery process using the entire non-filtered
dilution_rate_02_knn.txt file (all ~5,500 genes), change clusters2cytoscape.py so that it only
connects TFs to genes whose promoters include their TFBSs (using either MDScan output and searching
your scerevisiae_filtered_up500b_masked.fasta file or using YEASTRACT's promoter
search), and regenerate and submit your new *better_nodes.txt*, *better_edges.txt*, and
*better_network.png* files. If anyone actually gets this far, seriously consider using a process like
this for your final project in Your Favorite Organism rather than S. cerevisiae!