Genomic Data Manipulation
BIO508 Spring 2014
Problems 08
Transcriptional Regulation

1. (0) The central dogma's been around for 50 years: transcription factors activate transcription, transcripts are translated, and the proteins go out to do the work of the cell. At least in unicellular microbes like Saccharomyces cerevisiae - the genetics of which have also been studied for nearly as long - enough of the cell's regulation is transcriptional that unraveling promoter binding sites and interactions should be no problem.

   So why haven't we solved yeast regulatory biology yet? The goal of this assignment will be to look at regulatory interactions under a very well-defined set of yeast growth conditions, attempt to delineate regulatory modules, look for transcription factor interactions, and reconstruct these data into a regulatory network. Over the course of these questions, you should be able to see A) several of the steps that are still used in regulatory network reconstruction (often in conjunction with additional data describing epigenetics or TF binding) and B) why this is a hard problem, even in yeast!

   You know the drill: upload a .zip or .tar.gz file named problems08.zip or problems08.tar.gz containing each of the *files_starred_like_this.txt* and one problems08.doc/.docx/.txt file answering any written questions *starred like this*.

2. First, let's build some groups of coexpressed genes. We'll go back to our old familiar dilution_rate_02_knn.txt file:

   http://growthrate.princeton.edu/data/dilution_rate_02_knn.txt

   Recall that this contains 36 S. cerevisiae expression conditions: 6 different nutrient limitations (carbon, nitrogen, phosphate, sulfate, leucine, and uracil), each at 6 different growth rates (controlled by continuous chemostat culture).

   a. (0) Open this file up in MeV as a two-channel array.
      Adjust the data with a variance filter using a standard deviation cutoff of 0.5, and make sure to open up the new data filter on the left, right click, and "Set as Data Source" the resulting filtered expression matrix. Also right click on the heatmap on the right side and save this filtered expression matrix as dilution_rate_02_knn_filtered.txt, which should contain a bit over 1,500 genes.
   b. (4) Run a FOM from Clustering, calculating medians up to a maximum cluster count of 25. Take a look at the plot - there should be two obvious candidates for "good" cluster numbers, one just below 15 and one just above 20. Using the latter (larger) cluster count, run k-means from Clustering using median calculations (and default settings otherwise, except for the number of clusters). Expand the resulting table view on the left, select any cluster, right click on the table to the right, and "Save All Clusters". For the cluster count n you chose, this should result in n output text files that you should submit as *k-1.txt* through *k-n.txt*.

3. Is there any reason to believe that these coregulated modules are telling us about real biology? Let's find out! As we discussed last week, one of the oldest and easiest ways to do this is to test whether any previously characterized pathways or functional modules are enriched in your gene set(s) of interest. We'll use the GOrilla online GO TermFinder implementation at:

   http://cbl-gorilla.cs.technion.ac.il

   This uses the hypergeometric distribution as discussed briefly in class.

   a. (0) Set your organism to Saccharomyces cerevisiae and make sure you're running in "Two unranked lists of genes" mode so that you can provide a background gene list below.
   b. (0) Copy/paste the gene list from the largest coexpressed module (should be on the order of 150 genes) into the GOrilla "Target set" box.
   c. (2) Do the same thing with the gene list from your filtered input PCL file (should be ~1,500 genes), pasting it into the "Background set" box.
      Click on "Paste a ranked..." above this box to read its help text. *Why is providing the proper background gene set vital when performing any TermFinder or enrichment analysis?*
   d. (0) Click "Search Enriched GO terms" to get the ball rolling, and after a second or two you should have an enrichment figure and a table of enriched GO terms.
   e. (2) *How many GO Terms were significantly enriched in your gene set using these (default) parameters, and what were they? List at most the names of the first 5 Biological Process terms.*
   f. (1) Right click on the graph of GO terms, select "Save image as..." (or equivalent), and save that PNG file as *scerevisiae_go_large.png*.
   g. (2) *Does this look like a biologically plausible gene cluster? Why or why not?*
   h. (2x5) Repeat steps a-g for a cluster of ~100 genes and one of ~50 genes, making sure to answer questions e and g and saving these views as *scerevisiae_go_medium.png* and *scerevisiae_go_small.png*. If these three gene sets look too similar to each other (which they might - keep an eye out for very repetitive GO terms), poke around a bit using larger or smaller sets to find three distinct and interesting gene lists. Keep track of these cluster IDs!
   i. (0) Hint: don't worry if the answer to e is zero at most once (do worry if it's zero two or three times).

4. Now let's go fish out some upstream regulatory sequences for these putatively coregulated modules. Head to:

   http://www.biomart.org

   which is my default for obtaining conveniently formatted DNA sequences (and other gene ID information).

   a. (0) Go to Bio Portal/Sequence retrieval/Ensembl/Saccharomyces cerevisiae genes.
   b. (0) Request the transcript flank and enter 500 bases for Upstream Flank. Expand "Header Information", and uncheck "Ensembl Transcript ID".
   c. (1) Run cut -f2 dilution_rate_02_knn_filtered.txt | grep '^Y' | grep -v YORF > dilution_rate_02_knn_filtered_genes.txt, or perform the equivalent cut and filter using Excel.
      Save and submit the resulting file *dilution_rate_02_knn_filtered_genes.txt*.
   d. (0) Expand Filters, limit to genes with EntrezGene IDs, leave the type set to Ensembl genes, and "upload file" your dilution_rate_02_knn_filtered_genes.txt ID list file.
   e. (3) Click Go down at the bottom. Congratulations, you should be previewing a bunch of sequence in FASTA format! Of course, we don't want to view this in your web browser, so click the Download button to download a FASTA file containing the 500b upstream promoter sequences of every S. cerevisiae transcript in your array data. Save and rename this file to *scerevisiae_filtered_up500b.fasta* for submission and further investigation.

5. (2) The first further investigation we're going to do is to throw out repetitive and low-complexity DNA sequences. These are things like long stretches of AAAA or TTTT or CGCGC and junk like that. They have a tendency to "look" unusual to regulatory motif finders just because they're unusual, but not because they're actually interesting for transcriptional regulation. Let's run your scerevisiae_filtered_up500b.fasta file through the Tandem Repeats Finder program at:

   http://tandem.bu.edu/trf/trf.download.html

   Download the appropriate program for your platform (the online web service is too slow), open your FASTA file, check Options/"Masked Sequence File", and Run/Start Search using the default parameters. Mac users, you should save the command line version of the executable, run "chmod 755 trf407b.macleopard.exe" if necessary, and use the following default values:

      File      = path to your sequence file
      Match     = 2
      Mismatch  = 7
      Delta     = 7
      PM        = 80
      PI        = 10
      Minscore  = 50
      MaxPeriod = 500

   And also append -m to generate your masked output file. Wait a while, exit the program, delete all those unnecessary HTML files, and rename the resulting masked sequence file scerevisiae_filtered_up500b.fasta.2.7.7.80.10.50.500.mask to *scerevisiae_filtered_up500b_masked.fasta* for submission.

6. (5) Now it's time for some Python so that we can extract just sequences of interest from this file, e.g. the upstream regions of all genes in a particular cluster. We'll write a script that takes two command line arguments: a text file containing the gene IDs whose sequences should be extracted, and an optional count representing the maximum number of sequences to extract (for reasons that will become clear below). Fill in the blanks in the following Python, and submit it as *grep_fasta.py*.

      #!/usr/bin/env python

      ______ ___

      if ___( ___.____ ) < 2:
          _____ _________( "Usage: grep_fasta.py <genes.txt> [n] < <seqs.fasta>" )
      strGenes = ___.____[_]
      iN = ___(___.____[_]) if ( ___( ___.____ ) > 2 ) else 0

      setstrGenes = set()
      for _______ in ____( strGenes ):
          ___________.add( _______.strip( ) )

      strID = strSeq = iHits = 0
      for _______ in ___._____:
          if _______[0] == ">":
              if strID in ___________:
                  print( ">" + strID )
                  print( ______ )
                  _____ += 1
                  if iN and ( _____ >= __ ):
                      break
              strID = _______[1:].strip( )
              strSeq = __
          else:
              strSeq += strLine

7. Remember those three clusters you set aside earlier? Perform the following steps with all three of them; each point score is per cluster i.

   a. (1x3) Run cut -f2 k-i.txt | grep '^Y' | grep -v YORF > k-i_genes.txt for your cluster number i (or create the equivalent output using Excel/jEdit). Save and submit *k-i_genes.txt*.
   b. (1x3) Run python grep_fasta.py k-i_genes.txt < scerevisiae_filtered_up500b_masked.fasta > k-i.fasta. Save and submit *k-i.fasta*.
   c. (0) Visit MEME at:

      http://meme.nbcr.net/meme/cgi-bin/meme.cgi

      Enter your email and Browse to your k-i.fasta file. Report a maximum of 10 motifs, leave the other parameters on their defaults, and Start Search. NB: If MEME complains about the length of your input sequence file, just delete genes from the bottom until it's short enough.
   d. (2x3) When the MEME email arrives, save and submit the file as *k-i_meme.txt*.
   e. (3x3) *How many motifs were significant (max. 10)? How many of the genes in the ith group include these motifs in their upstream sequences (and what percentage of the group is that)?* Note that this should be easy to answer using grep '^>' k-i_meme.txt | sed 's/\t.*//' | sort | uniq | wc -l.
   f. (1x3) *After a quick eyeball, how many of these motifs actually look unique and not just copies of each other?*

8. Wouldn't it be nice to identify whether any of these putative TFBS motifs correspond to known consensus sequences? Let's take a quick look using the TOMTOM motif search and comparison tool at:

   http://meme.sdsc.edu/meme/cgi-bin/tomtom.cgi

   a. (2x2x3) Select two reasonably different motifs from each of these three files, upload them to the TOMTOM MEME search box, and Start the search. For each matrix, *what was its original motif sequence from MEME, what known motif(s) does TOMTOM think it matches (if any), and do you believe it?*

9. (5) Summarize your findings from this study: *What have you discovered about transcriptional regulators of growth rate and/or nutrient utilization in S. cerevisiae? What have you discovered about transcription factor binding site discovery and regulatory network inference?*

10. ( ) In order to build these coregulated modules into a network, let's get a list of putative transcription factors in S. cerevisiae that could be doing the regulating. Visit my favorite list of yeast TFs and TFBSs at:

   http://fraenkel.mit.edu/improved_map/

   Download the v1.tamo file by clicking on "Motif data in TAMO format".

   a. (0) Run grep Source v1.tamo | sed 's/Source: *//' | sort | uniq > v1_symbols.txt (or an equivalent jEdit command) to generate a list of gene symbols, not IDs, that are included as TFs in this dataset.
   b. (2) Visit YEASTRACT (my second favorite list of yeast TFs) at:

      http://www.yeastract.com/formorftogene.php

      Paste your list of TF symbols into the box (should be exactly 124), click Transform, copy/paste the resulting ORF ID list (the things starting with Y and some letters and numbers) into a new text file, and search-and-replace the lower case "c" and "w" into upper case. Save the resulting file as *v1_genes.txt*.
   c. (8) Write and submit a Python script *clusters2cytoscape.py* that will convert your list of potential TFs and your k-i.txt cluster files into a pair of edge and node descriptor files appropriate for import into Cytoscape.
      i. Takes at least two command line arguments: an output file of the form nodes.txt and one or more input cluster files of the form k-i.txt. The script should take your v1_genes.txt file as input on standard in, making the complete execution call clusters2cytoscape.py nodes.txt k-*[0-9].txt < v1_genes.txt > edges.txt.
      ii. nodes.txt should contain as output two tab-delimited columns, with one gene identifier per line in the first column. The second column should contain a color string, either #FF0000 for transcription factors (from your v1_genes.txt list) or #00FFFF for all other genes. An example from mine is:

         YLR228C    #FF0000
         YOR372C    #FF0000
         YAL012W    #00FFFF
         YDR384C    #00FFFF
         ...

      iii. edges.txt should contain as output two tab-delimited columns, the left being a TF and the right a target gene. You should output an edge for each TF/gene pair that co-occurs in at least one cluster. For example, in my MeV output, the TF YNL103W and the genes YOL032W, YER158C, and YDL147W all appear together in cluster #11. My edges.txt file thus starts with:

         YNL103W    YOL032W
         YNL103W    YER158C
         YNL103W    YDL147W
         ...

   d. (4) Create and submit *nodes.txt* and *edges.txt*.
   e. (4) Fire up Cytoscape. Choose File/Import/"Network from Table", browse to your edges.txt file, and make your Source Interaction column 1 and your Target Interaction column 2.
      Import, then go to File/Import/"Attribute from Table", browse to your nodes.txt file, and Import. Under the VizMapper, enable a passthrough mapping for Node Color from Column 2, set the default EDGE_TGTARROW_SHAPE to an arrow, and run Layout/yFiles/Organic. You should get something that looks like the screenshot below, which you can save (File/Export/"Current Network View as Graphics") and submit as *network.png*.
   f. (4) *Explain why this network is horribly wrong.*
   g. (10) Refine your network generation process to take the presence/absence of sequence motifs into account. Rerun this entire clustering and motif discovery process using the entire non-filtered dilution_rate_02_knn.txt file (all ~5,500 genes), change clusters2cytoscape.py so that it only connects TFs to genes whose promoters include their TFBSs (using either MDScan output and searching your scerevisiae_filtered_up500b_masked.fasta file or using YEASTRACT's promoter search), and regenerate and submit your new *better_nodes.txt*, *better_edges.txt*, and *better_network.png* files. If anyone actually gets this far, seriously consider using a process like this for your final project in Your Favorite Organism rather than S. cerevisiae!
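
The motif-presence check described in the final refinement step could be approached in many ways. The following is only a minimal sketch of one of them, under stated assumptions: each TF's binding site is summarized as a single IUPAC consensus string, promoters are already loaded into a dictionary of gene ID to sequence, and masked bases appear as N. The function names (promoter_has_motif, filter_edges) and the toy IDs and sequences are hypothetical illustrations, not part of the assignment, MEME, MDScan, or YEASTRACT.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
# A consensus "N" means "any base"; a masked "N" in a promoter
# sequence is NOT in any class, so masked regions never match.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Complement table covering the ambiguity codes as well.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(consensus):
    """Reverse complement of an IUPAC consensus string."""
    return consensus.upper().translate(COMPLEMENT)[::-1]


def consensus_to_regex(consensus):
    """Compile an IUPAC consensus string into a regular expression."""
    return re.compile("".join(IUPAC[c] for c in consensus.upper()))


def promoter_has_motif(seq, consensus):
    """True if the consensus occurs on either strand of seq."""
    seq = seq.upper()
    candidates = (consensus, reverse_complement(consensus))
    return any(consensus_to_regex(c).search(seq) for c in candidates)


def filter_edges(edges, tf_consensus, promoters):
    """Keep only (tf, gene) pairs whose promoter contains the TF's motif.

    Pairs with an unknown TF consensus or a missing promoter are dropped.
    """
    return [
        (tf, gene)
        for tf, gene in edges
        if tf in tf_consensus
        and gene in promoters
        and promoter_has_motif(promoters[gene], tf_consensus[tf])
    ]


if __name__ == "__main__":
    # Toy data only; real inputs would come from your own files.
    toy_promoters = {"geneA": "GGTGACTCAGG", "geneB": "GGGGGGGGGGG"}
    toy_consensus = {"TF1": "TGAYTCA"}
    toy_edges = [("TF1", "geneA"), ("TF1", "geneB")]
    for tf, gene in filter_edges(toy_edges, toy_consensus, toy_promoters):
        print(tf + "\t" + gene)
```

An exact consensus match is a blunt instrument compared with scoring a position weight matrix, so a filter like this will both miss weak sites and accept chance occurrences of short motifs; treat it as a starting point for thinking about the problem rather than a finished method.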