5-hydroxymethylcytosine in Eukaryotic Genomes

by Thomas John Haas

A thesis in partial satisfaction of the requirement for the degree of Master of Science in Plant Biology in the Graduate Division of the University of California, Berkeley

Committee in charge:
Professor Daniel Zilberman, Chair
Professor Robert L. Fischer
Professor Patricia Zambryski

Spring 2013

Copyright 2013 by Thomas John Haas

Abstract

The genomes of some eukaryotic organisms contain a modified cytosine, 5-methylcytosine (5-mC), as an epigenetic mark to help control genetic elements. In some mammals, the TET dioxygenases catalyze modification of 5-mC to 5-hydroxymethylcytosine (5-hmC). These enzymes and the resultant modified base likely act as a part of the epigenetic system of mammals and play important roles in demethylation and genetic regulation. Genomic analysis shows that putative TET-like dioxygenases exist in other eukaryotic organisms, but the presence of 5-hmC in the genomes and the roles that it and the TET-like dioxygenases may play in the epigenomic landscapes of these organisms are unknown. In this thesis, the presence of 5-hmC and the activity of putative TET-like dioxygenases in selected fungi, green algae and plants are studied. First, several 5-hmC detection methods, including immunodetection, mass spectrometry, and oxidative bisulfite sequencing, are attempted. 5-hmC is likely present in some fungi and green algae, and 5-hmC production is decreased in selected fungi and green algae by competitive inhibition of dioxygenases with 2-hydroxyglutarate.
Additionally, some fungal and green algal TET-like dioxygenases may be components of novel putative transposons, and the potential for putative transposon-associated dioxygenases to act in an autoderepressive manner is hypothesized. Furthermore, the structure of these putative transposons is used as a template for the speculation and hypothetical design of sequence-specific epigenetic modification systems. Finally, an appendix details a new mapping technique termed “shotgun mapping.” Shotgun mapping is a unique application and analysis of high-throughput sequencing data from mapping populations to quickly and accurately determine the physical location of genetic mutations. Shotgun mapping proof-of-concept is shown by mapping a novel era1 genetic mutation in Arabidopsis thaliana. The shotgun mapping approach is validated by including PCR-based mapping and Sanger sequencing data of the discovered era1 mutation.

Table of Contents

Chapter 1 Thesis Introduction
Chapter 2 Methods for and results of chemical, biochemical and sequencing-based detection of 5-hydroxymethylcytosine in various eukaryotes
Chapter 3 Hypothesis for function of 5-hydroxymethylcytosine in putative fungal transposons and schemata for engineering fused and modular epigenetic modification systems
Chapter 4 Thesis Summary
References
Appendix A “Shotgun mapping”: Mapping genetic mutations with high-throughput sequencing data

Figure List

1.1 Overview of cytosine modification reactions
2.1 Analysis of 5-hmC specificity of PvuRts1I
2.2 5-hmC antibody test blots
2.3 Immunodetection of 5-hmC
2.4 2-HG inhibition of algal and fungal dioxygenases
2.5 Overview of bisulfite-based sequencing methods
2.6 Testing of methylated DNA standards libraries
3.1 Overview of putative transposon-associated TET-like dioxygenases
3.2 Example scheme of a modular, sequence-specific epigenetic modification system
A.1 Explanation of ratios of col to ler alleles from wild-type and mutant chromosomes for mapping
A.2 “Shotgun mapping” analysis
A.3 Alignments of Sanger sequencing data

Table List

2.1 Summary of 5-hmC immunodetection dot blot experiments
2.2 Comparison of bases as seen after different sequencing methods
2.3 Fisher’s exact test contingency table and formula for BS/oxBS paired datasets with “signed p-value”
A.1 Summarized, selected PCR-based mapping data

Acknowledgements

First, I’d like to thank Daniel and the rest of the Zilberman lab. Yvonne, thanks for your work, help and advice. A second thanks to you for allowing me to work on the mapping project and include it in this thesis. Toshiro, thanks for your perl-fu, whisk(e)y and ribaldry. Jessica, Jason, Assaf and Devin, thanks for letting me take advantage of copious amounts of your materials and time. Ping-Hung and Xiao, thanks for your moral support, which has been much needed. I’m also grateful to my thesis committee members Pat Zambryski (who also helped me through my first teaching experience) and Bob Fischer (and the rest of the Fischer lab). I’d also like to thank my fellow classmates, especially Rie Uzawa, Sara Sirivanchai, Jake Brunkard and Megan Cohen. Thank you all for discussions and support, especially in the past year. You all helped me through a very challenging time. Finally, I’d like to thank my parents, Dan and Lisa, and my great friends in the Bay Area, Nate, Paul, Alison and Alyssa, who have all been wonderfully supportive over the past two and a half years.

Chapter 1: Thesis Introduction

1.1: Overview of nucleic acid methylation

DNA base modifications are known to play a number of important roles in the protection and regulation of an organism’s genetic material. In many bacteria, methylation of cytosine and adenine is often an important component of an organism’s restriction modification system. For example, methylation can be used as a self-recognition mechanism.
By methylating its own DNA, the organism can then target exogenous, non-methylated DNA for degradation with methylation-sensitive restriction enzymes1. This mechanism is advantageous to the survival of the organism because it allows for the destruction of non-endogenously methylated viral DNA with restriction enzymes that do not cut endogenously methylated sequences.

Some viruses have appropriated the methylation strategies of host restriction modification systems. For example, T4 phage and other viruses encode DNA methylases, employing them to modify their own DNA to imitate their bacterial hosts’ endogenously methylated DNA, thus blocking the binding or activity of host methylation-sensitive restriction enzymes2. These viral “antirestriction” tactics help evade the self-recognition strategy of the host, executed via its restriction modification systems.

In eukaryotes, the modified DNA base 5-methylcytosine (5-mC) plays important roles in epigenetic regulation3. In organisms that employ it, cytosine methylation is one element in the complex landscape of the genetic regulatory system. DNA methylation is counted among several other common genetic and epigenetic factors used in the larger regulatory scheme, which includes the modification of histones with small functional groups and larger protein elements and the binding of trans-acting transcriptional factors to chromatin4. Eukaryotic organisms that do employ cytosine methylation generally use 5-mC as a mark through which genetic elements can be repressed, including silencing of transposons, imprinting of genes or entire chromosomes, and regulation of cell type identity and developmental stages through repression of specific genetic elements5. While not universal, methylated cytosine is found at significant levels in many branches of the tree of life, including many vertebrate and invertebrate animals, plants and green algae, and fungi6.
The biochemical mechanism for methylation of nucleic acids is conserved in all known eukaryotic, prokaryotic, and viral systems and uses an S-adenosylmethionine (SAM)-dependent DNA methyltransferase. During cytosine methylation, the thio-bound methyl group of SAM is transferred to the 5 carbon (C5) of the cytosine (C), generating the modified base 5-mC and S-adenosylhomocysteine as products (Fig. 1.1A)7.

In eukaryotes, not just any cytosine will be methylated. In many organisms, including mammals, predominantly cytosine-guanine dinucleotide (CG) sites are methylated in a symmetric fashion, in which the C on both strands is methylated3,6. Symmetric methylation can then be reproduced after DNA replication by targeting “maintenance” DNA methyltransferases to hemimethylated sites. These maintenance methyltransferases then methylate daughter strand cytosines at CG sites in which parent strand cytosines are methylated. This mechanism for symmetric methylation allows CG methylation patterns to be faithfully and reliably propagated through cell generations.

Fig. 1.1: Overview of cytosine modification reactions. (A) DNA methyltransferase catalyzes the reaction of cytosine with S-adenosylmethionine (SAM) to produce 5-methylcytosine (5-mC) and S-adenosylhomocysteine (SAH). (B) 5-mC is hydroxylated by TET-like dioxygenases to produce 5-hydroxymethylcytosine in a 2-ketoglutarate (2-KG)- and Fe(II)-dependent co-oxidation reaction.

Additionally, some organisms are able to target methylation to sites other than CG dinucleotides3. For example, in plants, significant amounts of methylation are found at CHG and CHH trinucleotide sites, where “H” is anything other than guanine. This de novo methylation is targeted, in part, by RNA-directed DNA methylation pathways (RdDM), wherein RdDM machinery is directed to specific genomic sites by small RNAs8.

1.2: Hydroxymethylation of pyrimidines

Further modification of methylated nucleic acids is possible.
In the mid-20th century, researchers found a hydroxymethylated pyrimidine, 5-hydroxymethylcytosine (5-hmC), as a component of T4 phage DNA9,10. Since then, this and other analogous modifications have been found in viruses and members of Kinetoplastida and Mammalia, including humans and mice11–15.

Hydroxylation of the 5-methyl groups of pyrimidines is known to occur via three different reactions. First, hydroxylation occurs randomly through reactive oxygen species. This reaction is likely to occur at a regular but low level in all organisms, creating trace amounts of 5-hydroxymethylated pyrimidines, including 5-hmC in organisms that contain large amounts of 5-mC. Second, in vitro experiments have shown that it is possible for SAM-dependent DNA methylases to directly add alkyl aldehydes, such as formaldehyde or acetaldehyde, to cytosine, producing a 5-hydroxyalkylcytosine derivative16. Using formaldehyde, this reaction produces 5-hmC. However, whether or not this reaction occurs in living systems is not known. Lastly, enzymes containing a dioxygenase domain may hydroxylate 5-methylated pyrimidines11,17. This last reaction will be the focus of further discussion into the creation of 5-hmC.

1.3: Dioxygenases and 5-hmC production

The dioxygenase domain is a component of the TET and TET-like proteins in humans and mice and the JBP enzymes in some members of the Kinetoplastida, as well as many other proteins in many varied organisms18. The dioxygenase domains in the TET-like subset have been specifically shown to catalyze the creation of 5-hmC in humans and mice11. The basic dioxygenase reaction is a co-oxidation of two substrates with molecular oxygen. In the TET-like dioxygenases used for oxidation of 5-mC to 5-hmC, the domain is dependent on Fe(II) and the co-oxidant, 2-ketoglutarate (2-KG), in which the C2 carbon is oxidized and the C1 carboxyl group leaves, generating carbon dioxide and the dicarboxylic acid succinate19.
These dioxygenase domains have also been shown to be competitively inhibited by a 2-KG analog, 2-hydroxyglutarate (2-HG), which will be revisited later20.

Many other organisms have predicted TET-like proteins or dioxygenases with homology to the dioxygenase domains of TET-like enzymes. In addition to those mentioned above, organisms that have putative dioxygenase-containing enzymes predicted to act on 5-mC include members of the basidiomycete fungi, chlorophyte algae, Heteroloboseans, and certain viruses18. Additionally, these and many other organisms have predicted dioxygenase domains in other protein families, which may be used for pyrimidine modification in RNA or play roles in DNA base excision repair.

1.4: Glucosylation of 5-hydroxymethylated pyrimidines

The hydroxyl group opens up possibilities for further modification of hydroxymethylated pyrimidines in DNA. Two well-known natural modifications, in which a glucose moiety is added, have been found. These glucosylated bases are the J base (β-D-glucosylhydroxymethyluracil) in certain trypanosomes and relatives and β-D-glucosylhydroxymethylcytosine (5-gmC) in certain viruses (e.g., T4 phage)9,21. In T4 phage, 5-mC and 5-hmC are likely intermediates in the pathway to 5-gmC, which is produced by a β-glucosyltransferase. Like viral methylation and hydroxymethylation of cytosine, glucosylation of already-modified cytosines is likely another antirestriction strategy employed by viruses, while in Leishmania spp. glucosylation (as the J base) acts as a marker to stop transcriptional read-through13.

This glucosyl modification has already been used for detection of 5-hmC via transfer of a chemically modified or isotopically labeled glucose. Because of the specificity of this reaction, it may be possible to further label or functionalize DNA containing 5-hmC with engineered T4 β-glucosyltransferases or with other reactions using chemically modified glucosyl moieties via “click chemistry” approaches22.
1.5: 5-hydroxymethylcytosine in the epigenetic landscape

Curiously, 5-hmC has been detected at relatively high levels in certain tissues and cell types in mice and humans and is therefore likely to occur in other vertebrates containing TET-like enzymes12,22–24. Because 5-hmC is a modification of 5-mC, an important part of the epigenetic systems of metazoans, fungi and plants, 5-hmC likely also plays some role in the epigenetic systems of the organisms that contain it. What this role is, exactly, is not clear, but several putative and hypothetical functions have been recently discussed and tested in mammalian systems. Three putative functions will be discussed here: first, as a method for passive demethylation; second, as a signal for active demethylation; third, as an epigenetic mark.

1.5.1: Passive demethylation

Passive demethylation, also known as replication-dependent or -coupled demethylation, relies on the dilution of symmetrically-methylated (i.e., CG) sites via inaction of maintenance methyltransferases at those sites after DNA replication. Thus, after one cell division of a progenitor cell, the passively demethylated loci will be hemimethylated in the daughter cells and thereafter unmethylated in all but two hemimethylated descendant cells. 5-hmC is a possible mark for bypass of hemimethylated CG sites by maintenance methylases, in which 5-hmC at a CG site “hides” the methyl mark from being read and prevents the methylation of the daughter strand25. As a mechanism for demethylation of large amounts of 5-mC, such as in imprint erasure, passive demethylation is a more efficient demethylation mechanism than proposed active demethylation mechanisms discussed below.

This passive mechanism has recently been shown in mouse primordial germ cells (PGCs), whereby removal of 5-mC via a 5-hmC intermediate can occur in a passive manner. In these cells, TET dioxygenases act to hydroxylate the majority of methylated sites in the genome.
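The replication-dependent dilution that underlies passive demethylation can be illustrated with a short calculation. The sketch below is an idealization and not part of the thesis methods: it assumes a single progenitor cell with one symmetrically methylated CG site, perfectly synchronous divisions, and a complete absence of maintenance methylation at that site. The function name `passive_dilution` and its return keys are hypothetical.

```python
def passive_dilution(divisions: int) -> dict:
    """Follow one symmetrically methylated CG site in a single progenitor
    cell through `divisions` rounds of replication, assuming maintenance
    methyltransferases never act on the site (idealized assumption)."""
    if divisions < 0:
        raise ValueError("divisions must be non-negative")

    cells = 2 ** divisions

    # The two original methylated strands are never copied as methylated,
    # so from the first division onward they sit in two hemimethylated cells.
    fully_methylated = 1 if divisions == 0 else 0
    hemimethylated = 0 if divisions == 0 else 2
    unmethylated = cells - fully_methylated - hemimethylated

    # Each cell holds two strands at the site; exactly two strands in the
    # whole population remain methylated.
    methylated_strand_fraction = 2 / (2 * cells)

    return {
        "cells": cells,
        "fully_methylated": fully_methylated,
        "hemimethylated": hemimethylated,
        "unmethylated": unmethylated,
        "methylated_strand_fraction": methylated_strand_fraction,
    }


if __name__ == "__main__":
    for n in range(5):
        print(n, passive_dilution(n))
```

Under these assumptions the methylated strand fraction halves with every division, which matches the statement that all but two descendant cells are unmethylated at the site.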
Following cell division, hemihydroxymethylated CG sites are not maintained by methyltransferases, leading to a dilution of 5-hmC and effectively removing 5-mC from the population of descendant cells, thus removing imprinting marks26. Whether this mechanism for demethylation occurs in other cell types or in instances other than imprint erasure in mouse PGCs is not known.

1.5.2: Active demethylation

Active demethylation is the enzymatic removal and/or replacement of methylated cytosine or its 5-methyl group. Active demethylation via removal of bases by DNA glycosylase and replacement of residues by base excision repair machinery plays important roles in both plants and animals27–30. 5-hmC may act as a mark for active demethylation in some organisms through three related mechanisms, briefly described below: direct removal and replacement of 5-hmC bases; deamination of 5-hmC bases, which triggers replacement by recognition of the resulting mismatch with guanine; and continued oxidation of 5-hmC, which is followed by decarboxylation. While active demethylation via a 5-hmC intermediate may be plausible for small regions, it would not be as efficient a mechanism for genome-wide demethylation as passive demethylation. If it relied on base excision repair systems, active demethylation would also increase the chance of double-stranded breaks during genome-wide demethylation events at developmental transitions.

1.5.2.1: Direct removal and repair

It is plausible that 5-hmC itself could be recognized by components of an organism’s base excision repair system. Once recognized, 5-hmC residues could be removed by a DNA glycosylase and replaced by cytosine using base excision repair machinery, generating an unmethylated site.

1.5.2.3: Deamination, removal and repair

Alternatively, it is plausible that 5-hmC could be recognized by cytosine deaminases, such as members of the AID/APOBEC family, and their interactors31,32.
The 4-amino group of 5-hmC could then be removed by the cytosine deaminase, creating 5-hydroxymethyluracil (5-hmU). The resulting 5-hmU:G mismatch, analogous to a T:G mismatch, could then be recognized by base excision repair machinery, which would remove the offending base and replace it with a cytosine, generating an unmethylated site.

1.5.2.4: Oxidation-based mechanisms

Mammalian TET-like dioxygenases are able to further oxidize 5-hmC to 5-formylcytosine and then to 5-carboxycytosine (5-caC), and these modified bases have been found in mouse DNA33,34. These oxidized bases can themselves trigger deamination and DNA glycosylase-mediated base excision repair by recognition of the oxidized forms35,36. It has also been proposed but not shown that 5-caC may be decarboxylated, returning the site to an unmethylated cytosine without the need for deamination, excision and repair. This decarboxylation mechanism, while simple and elegant, has been experimentally shown to be unlikely to occur in significant amounts24.

1.5.3: As an epigenetic mark

It is plausible that 5-hmC may itself be a novel epigenetic mark and play important roles in an organism’s epigenetic system outside of simply being an intermediate in demethylation. Recent work using pull-down and mass spectrometry-based proteomics techniques has shown that highly specific 5-hmC “reader” proteins likely exist, though whether or not these 5-hmC interactors act as intermediaries between 5-hmC-marked loci and regulatory components or other chromatin modifying systems has not been strictly shown37. Curiously, in mouse embryonic stem cells, high concentrations of 5-hmC do tend to overlap with bivalent histone marks (wherein both active and repressive histone marks are present), which mark transcriptionally “poised” genes, or those genes that can readily be expressed at high levels12,38. However, the concentration of 5-hmC at these sites may simply reflect its role as a demethylation intermediate, where repressive 5-mC marks are in the process of being removed.
Interestingly, 5-hmC is a candidate for a component of an epigenetic oxygen sensor system39. This hypothetical system uses the dioxygenase-catalyzed reaction, which requires both molecular oxygen and 2-KG as a co-factor. 2-KG, a product of the tricarboxylic acid cycle, also requires high oxygen concentration for its production. Thus, dioxygenase, acting as a biochemical oxygen sensor, may transmit this information to the epigenome via 5-hmC production. While no specific evidence exists for this mechanism, it does remain an intriguing and plausible epigenetic mechanism for some organisms, and may extend to other chromatin modifying systems that require 2-KG as a co-factor.

1.6: Introduction to thesis project

My work in this thesis attempts to investigate the presence of 5-hmC in a diverse selection of eukaryotic genomes and to discover the genomic distribution of the modification in some 5-hmC-containing genomes. Additionally, I hypothesize about the presence of specific genetic constructs found in putative transposons that may use 5-hmC production to disguise themselves from host repression mechanisms, and I use these constructs as a template to propose novel molecular tools for epigenetic editing of specific loci. Finally, an appendix detailing the mapping of an Arabidopsis thaliana mutant using an analysis of high-throughput genomic sequencing data is added as an example of other work I have completed in the course of my graduate studies.

Chapter 2: Methods for and results of chemical, biochemical and sequencing-based detection of 5-hydroxymethylcytosine in various eukaryotes

2.1: Introduction to 5-hmC detection methods

Several chemical and biochemical methods have been used to detect and quantify 5-hmC. The first discovery of 5-hmC, published in 1953, used paper chromatography and analysis of UV spectra to detect the base in phage DNA10.
Since that time, analytical chemical methods (thin layer chromatography and liquid chromatography/mass spectrometry), biochemical methods (biochemical modification of 5-hmC, 5-hmC-sensitive restriction enzyme digestion, and immunodetection), and sequencing-based methods have been used for 5-hmC detection. In this section, I will first briefly outline modern methods for 5-hmC detection and quantification not used in the experiments completed for this study. Then, I will discuss attempted detection methods and results in detail.

2.2: Unused methods

2.2.1: Thin layer chromatography

Thin layer chromatography (TLC) is a technique used to separate the components of a solution of chemicals by the movement of the chemical components in a solvent (mobile phase) through a solid matrix (the adsorbent, or stationary phase). Depending on a chemical’s solubility in the solvent and its attraction to the adsorbent, the chemicals will move through the adsorbent at different rates, thereby separating the components of the mixture. To detect 5-hmC and other bases using TLC, enzyme-fractionated DNA is end labeled with a radioactive isotope (32P or 33P) using polynucleotide kinase and then digested to nucleotides using phosphodiesterase and nuclease. This mixture is run on a TLC plate, often two-dimensionally to allow for greater separation of bases, and then exposed to electron-sensitive film or camera to create an autoradiograph. This autoradiograph can be compared to standard mixtures to detect the presence of 5-hmC and quantify the amount present in the mixture.

2.2.2: High-performance liquid chromatography

Like TLC, high-performance liquid chromatography (HPLC) is used to separate components of a solution of chemicals by passing the mobile phase through a column-based stationary phase. Also like TLC, chemicals are separated based on their interaction with the stationary phase. 5-hmC and other bases can be detected and quantified using HPLC by DNA digestion to nucleotides.
The mixture of nucleotides is then run on HPLC equipment and individual bases are detected by using various spectroscopic detectors and comparing these data to known standards.

2.2.3: Glucosylation and chemical labeling

As discussed in Section 1.4, the hydroxyl group on 5-hmC opens up many possibilities for chemical labeling. β-glucosyltransferase (β-GT) from T4 phage has been used to biochemically label 5-hmC with glucose in vitro. In this reaction, β-GT transfers the glucose moiety from a UDP-glucose to 5-hmC, producing 5-gmC.

This specific modification is an important reaction and useful for 5-hmC detection for several reasons. First, this reaction is specific to 5-hmC, so only 5-hmC bases are glucosylated. Second, the glucose moiety is large and chemically distinct. These properties allow glucosylation to be used in tandem with other methods, such as various chromatographic and spectroscopic methods, allowing 5-gmC residues to be more easily observed. Additionally, 5-gmC can be used as a blocking group for restriction enzyme-based detection, as it is naturally used in some phages, and as an epitope that can be specifically detected with antibodies. Lastly, a modified glucose can be transferred. This glucose can be radioisotopically labeled for specific detection and quantification of 5-hmC, or it can be chemically modified so that other chemistries may be used to detect, quantify and/or pull down 5-hmC, such as has been used in azide-labeled glucose/biotin click chemistry22.

2.3: 5-hmC-specific restriction enzymes

2.3.1: Introduction to 5-hmC detection with restriction enzymes

Restriction enzymes sensitive to the presence of 5-hmC or a modified 5-hmC are a convenient method for potential genome-wide quantification and partial single base resolution of 5-hmC. Two basic approaches have been proposed or used for this method of 5-hmC detection, briefly outlined here.

In the first approach, two restriction enzymes are used.
These restriction enzymes, called isoschizomers, recognize the same base sequence but, whereas one restriction enzyme will be “methylation sensitive” (i.e., it will usually not cut sites containing 5-mC or 5-hmC), the other will be “methylation insensitive” (i.e., it will usually cut sites containing 5-mC or 5-hmC)40. Using these isoschizomers with β-GT-modified DNA, in which the 5-hmC residues contain a large “blocking group” that prevents the cutting of sites normally recognized by the methylation insensitive restriction enzyme, cytosines at restriction enzyme-recognized sites can be determined to be C, 5-mC, or 5-hmC41.

For example, the pair MspI and HpaII are isoschizomers that recognize the sequence C/CGG; however, while HpaII is sensitive to 5-mC, MspI will cut sequences in which the internal cytosine is 5-mC or 5-hmC40. To differentiate between methylated and hydroxymethylated sites, a blocking group is added to 5-hmC via β-GT, creating 5-gmC. Thus, the 5-position modification status of the internal C can be determined: CCGG sites are cut by both HpaII and MspI, C(5-mC)GG sites are cut by only MspI, and C(5-hmC/5-gmC)GG sites are not cut by either, due to the blocking action of the glucose moiety added to the hydroxyl group. By size separation and bulk DNA quantification of both digestions, the number of hydroxymethylated CCGG sites can be determined. By single or paired end shotgun sequencing, highly hydroxymethylated CCGG sites can be determined by dataset comparison, as CCGG sites that are shown as “cut” in both datasets are likely unmethylated, CCGG sites that are often cut in MspI but not HpaII datasets may be methylated, and CCGG sites that are often not cut in either may be hydroxymethylated.

A simpler second approach requires the use of a restriction enzyme with particular activity at recognition sites that contain 5-hmC. Due to the groups’ similar size, nearly all known methylation insensitive restriction enzymes are also hydroxymethylation insensitive42.
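Returning briefly to the first approach, the HpaII/MspI cut patterns described above reduce to a small lookup. The sketch below is illustrative only and assumes idealized, complete digestion of β-GT-treated DNA; the function name `classify_ccgg_site` and the returned labels are hypothetical, not from any published pipeline.

```python
def classify_ccgg_site(cut_by_hpaii: bool, cut_by_mspi: bool) -> str:
    """Infer the state of the internal cytosine of a CCGG site from
    HpaII and MspI digests of beta-GT-treated (glucosylated) DNA.

    Assumes complete digestion and no other blocking modifications.
    """
    if cut_by_hpaii and cut_by_mspi:
        # Neither enzyme is blocked: unmodified cytosine.
        return "C"
    if cut_by_mspi:
        # HpaII is blocked by methylation; MspI still cuts.
        return "5-mC"
    if not cut_by_hpaii:
        # The glucose added to 5-hmC blocks both enzymes.
        return "5-hmC (as 5-gmC)"
    # HpaII cutting without MspI cutting is not predicted by the scheme.
    return "inconsistent"


if __name__ == "__main__":
    for hpaii in (True, False):
        for mspi in (True, False):
            print(hpaii, mspi, classify_ccgg_site(hpaii, mspi))
```

In sequencing data the same logic would be applied to cut frequencies rather than single yes/no calls, so any thresholds would need to be chosen empirically.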
However, one restriction enzyme, PvuRts1I, has been shown to be hydroxymethylation insensitive but sensitive to both methylated and unmethylated C in certain positions of its recognition sequence42, (5-hmC)N11-12/N9-10G. In other words, PvuRts1I will only cut the recognition sequence when the first cytosine is hydroxymethylated but not when it is methylated or unmethylated. Assuming this action of PvuRts1I, genomic DNA could be digested and analyzed for bulk quantification of 5-hmC based on fragment size distribution compared to undigested sample, or be used for shotgun sequencing library construction to map recognition sites that may contain 5-hmC, without the use of isoschizomers and multiple dataset analysis.

2.3.2: Testing 5-hmC-sensitive restriction enzyme detection techniques

To test the feasibility of using this restriction enzyme method for 5-hmC analysis, PvuRts1I was first tested against a set of standards. These 338 bp double-stranded DNA standards contained either all unmethylated C, all 5-mC, or all 5-hmC bases. Before experimentation, the standard sequence was calculated to contain 25 recognition sequences, such that 5-hmC standards should be digested at many of these sites to produce sub-75 bp fragments.

The experiment was repeated twice. In both cases, agarose gel electrophoresis showed that, while the 5-hmC standard was indeed digested to small fragments, there was significant digestion of the 5-mC standard that was not reported in the original paper (Fig. 2.1A)42. To confirm this result, the same samples were run on a high-sensitivity Bioanalyzer DNA chip. This analysis confirmed the significant nuclease activity of PvuRts1I on the 5-mC standards (Fig. 2.1B). Due to this finding, it was decided that PvuRts1I is not specific enough to discern between methylated and hydroxymethylated sequences and thus is not suitable for bulk quantification or sequencing-based analysis of 5-hmC.

Fig. 2.1: Analysis of 5-hmC specificity of PvuRts1I.
PvuRts1I-digested methylated DNA standards were analyzed by (A) agarose gel electrophoresis and (B) Agilent 2100 Bioanalyzer. Red brackets indicate areas of fragments that are not visible in image. Size difference in (B) is due to molecular weight and charge difference of highly modified standards. C, unmethylated cytosine standard; M, 5-mC standard; H, 5-hmC standard; MOCK, mock digestion; DIG, digestion with PvuRts1I.

2.4: Immunodetection

2.4.1: Introduction

Immunodetection techniques are relatively fast and convenient methods to reveal the presence of specific antigens in a sample. Common immunodetection methods, such as those used here, employ one antibody, called the primary antibody, to target an antigen of interest. Often another antibody, called the secondary antibody, is used to target the primary antibody. The secondary (or sometimes primary) antibody is conjugated to a signaling molecule or enzyme. This signaling molecule acts as a generator of a fluorescent, chemiluminescent, chromogenic, or light- or electron-opaque signal and as a signal amplifier, allowing for the visualization of the location and density of the antigen of interest.

Antibodies are often raised against fragments of large biopolymers and bind to an epitope of those polymers; however, antibodies can also be raised against much smaller molecules. Three commercially-available antibodies that detect 5-hmC have been produced. Additionally, it is possible to detect 5-hmC by 5-hmC modification and use of antibodies against those specific modifications. The two published modified immunodetection techniques are the GLIB “click chemistry” technique and the cytosine 5-methylenesulfonate (CMS) technique. The GLIB technique, mentioned previously, transfers an azide-modified glucose moiety to 5-hmC via β-GT, which is then further modified with a biotin-containing group, creating biotin-N3-5-gmC22. The biotin is then detected using a biotin antibody or streptavidin.
The CMS technique makes use of CMS, produced from the reaction of sodium bisulfite with 5-hmC (explained in more detail below)43,44. A CMS antibody is then used for detection. However, for all work described here, only immunodetection of underivatized 5-hmC was performed.

Fig. 2.2: 5-hmC antibody test blots. Dot blots of methylated DNA standards (10 ng each) were used to test efficacy of several 5-hmC antibodies. Each column represents a different test. The schematic in the last row shows membrane layout. MB, mouse brain DNA (500 ng).

2.4.2: 5-hmC antibody testing

To compare the effectiveness of the 5-hmC antibodies, dot blots with 10 ng of double-stranded DNA standards containing either all C, all 5-mC, or all 5-hmC bases were conducted. While none of the antibodies produced a signal from C and 5-mC dots, the Active Motif 5-hmC antibody produced the strongest signal from the 5-hmC dot (Fig. 2.2). For all work described below, only the Active Motif 5-hmC antibody was used.

Fig. 2.3: Immunodetection of 5-hmC. Examples of immunodetection of 5-hmC using genomic DNA of various eukaryotes. Each column (A-C) shows a separate dot blot experiment with methylated DNA standard controls. Organisms or standards used are listed next to each dot. See Table 2.1 for a summary of immunodetection experiments. (Images of blots were edited by cut and paste for placement in figure. Neutral gray represents whitespace after cut and paste.)

2.4.3: Immunodetection results

To detect 5-hmC in genomes, dot blotting of DNA from various organisms with 5-hmC antibody was completed. Each blot contained three DNA standard dots, with C and 5-mC standards serving as negative controls and 5-hmC as positive control. Additionally, many blots also contained Mus musculus (mouse) brain, mouse liver, and Apis mellifera (bee) worker-caste head DNA.
Mouse brain and liver have previously been reported to contain significant amounts of 5-hmC, while bee worker head contains only traces of 5-mC and thus should contain no 5-hmC. These three samples were used as further controls.

Samples with unknown amounts of 5-hmC were chosen based on the existence of putative TET/JBP family dioxygenases [18]. These samples were composed of DNA from several species of Ascomycete and Basidiomycete fungi and Chlorophyte green algae. Other species not known to contain putative TET/JBP family dioxygenases were also tested, such as the Zygomycete fungus Phycomyces blakesleeanus and several plant species.

Based on these dot blots, 5-hmC appears to be present in the genomes of all tested fungal species and three of four green algal species (Fig. 2.3B,C; Table 2.1). Though these dot blots are not quantitative, in some cases the amount of 5-hmC present in these genomes appears to rival the amount present in mouse brain, which has been shown to be 0.3 to 0.6% of the genome, depending on cell type [24,45]. Additionally, while some plant genomes (Selaginella moellendorffii, Physcomitrella patens) did not show 5-hmC-positive signal on dot blots, others did sometimes register weak signals (Arabidopsis thaliana, Oryza sativa) (Fig. 2.3; Table 2.1).

| Organism | Positive tests / Times tested | Lowest positive test amount (ng) | 5-hmC present? | Notes |
|---|---|---|---|---|
| Chlamydomonas reinhardtii | all (>5) | 50 | Yes | Usually stronger than both Chlorella; weak signal at 50ng |
| Chlorella sorokiniana | all (>5) | 50 | Yes | Weak signal at 50ng |
| Chlorella sp. NC64A | all (>5) | 50 | Yes | Usually stronger than C. sorokiniana; weak signal at 50ng |
| Ostreococcus lucimarinus | 1/1 | 1000 | Inconclusive | DATA NOT SHOWN. Weak signal; single test: limited amount of DNA |
| Neurospora tetrasperma | all (>5) | 200 | Yes | Weak signal at 200ng |
| Neurospora crassa | all (>5) | 200 | Yes | Weak signal at 200ng |
| Phycomyces blakesleeanus | 2/2 | 200 | Yes | Weak signal at 200ng |
| Agaricus bisporus | 2/2 | 50 | Yes | Weak signal at 50ng |
| Uncinocarpus reesii | 1/1 | 500 | Yes | Single test: limited amount of DNA |
| Postia placenta | 1/1 | 1000 | Yes | Single test: limited amount of DNA |
| Laccaria bicolor | 2/2 | 40 | Yes | Strong signal at 40ng |
| Coprinopsis cinerea | 1/1 | 500 | Yes | Single test: limited amount of DNA |
| Oryza sativa | 1/1 | 1000 | Inconclusive | Very weak signal |
| Arabidopsis thaliana | 2/4 | 1000 | Inconclusive | DATA NOT SHOWN. Weak signals; different tissue samples for each test |
| Physcomitrella patens, Selaginella moellendorffii | 0/1 (each) | N/A | No | DATA NOT SHOWN |
| Mus musculus, brain | all (>5) | 500 | Yes | Used as genomic positive control; consistently stronger signals than liver |
| Mus musculus, liver | all (>5) | 500 | Yes | Used as genomic positive control |
| Apis mellifera, worker head | none (>10) | N/A | No | Used as genomic negative control |

Table 2.1: Summary of 5-hmC immunodetection dot blot experiments.

Fig. 2.4: 2-HG inhibition of algal and fungal dioxygenases. DNA from cultures grown with L-2-hydroxyglutarate (2-HG) was tested for inhibition of TET-like dioxygenases by dot blot-based immunodetection of 5-hmC. Control row, no 2-HG added before harvest and DNA extraction; +2-HG row, 2-HG added to culture ~18h before harvest and DNA extraction.

2.4.4: Dioxygenase inhibition results

A 2-KG analog, 2-HG, can be used to inhibit 2-KG-dependent dioxygenases.
To show that the 5-hmC in these organisms is produced by a 2-KG-dependent dioxygenase and not by random chemical oxidation or by the modified SAM-dependent DNA methyltransferase mechanism described in Chapter 1, 2-HG was used to inhibit hydroxylation of 5-mC in three species of green algae (Chlorella sp. NC64A, Chlorella sorokiniana, Chlamydomonas reinhardtii) and two species of fungi (Neurospora tetrasperma, N. crassa).

To test for dioxygenase inhibition in these organisms, 2-HG and additional growth medium were added to liquid cultures of the selected green algae and fungi. For controls, only additional growth medium was added. After 16 to 20 hours of inhibition, DNA was extracted from the samples and dot blotted using 5-hmC antibody. In 2-HG-treated cultures, a marked decrease in 5-hmC signal was seen (Fig. 2.4). This signal decrease reflects a decrease in the amount of 5-hmC in the genome, likely due to inhibition by 2-HG of TET-like dioxygenase-mediated hydroxylation of 5-mC. It should be noted that no macroscopic phenotype was observed in cultures after up to 24 hours of 2-HG treatment and that similar amounts of DNA were always extracted from control and treated cultures.

2.5: Mass spectrometry

2.5.1: Introduction

Mass spectrometry (MS) has been used to discover and quantify 5-hmC in mammalian genomes. For MS methods, a large quantity of very clean DNA is required to reliably detect 5-hmC, which, even in relatively 5-hmC-rich eukaryotic genomes and tissues, is known to make up only a fraction of a percent of the total cytosines. Total digestion of DNA can be accomplished by enzymatic means, using a mixture of nucleases and phosphodiesterases to produce a mixture of nucleotides and/or nucleosides, or by chemical means, using formic acid to produce cleaved nucleobases.
While enzymatic digestion produces heavier entities, which can be easier to detect with many MS equipment set-ups, these digestions leave salt and buffering agent residues that often interfere with clean and accurate detection. On the other hand, digestions with pure formic acid/water mixtures leave no undesirable reaction residues but create smaller molecular entities, which can be more difficult to detect, and under some conditions can cause random oxidation of nucleobases.

2.5.2: MS detection attempt

Based on the above criteria and after consultation with MS specialists, the formic acid digestion method was chosen [21,46,47]. Genomic DNA samples from Neurospora crassa and N. tetrasperma, both of which had previously been determined to have significant amounts of 5-hmC based on antibody detection methods (see previous section), were digested with 88% formic acid v/v H2O, and the digested residues were analyzed by MS. Unfortunately, the MS results were inconclusive. While the expected mass of the 5-hmC nucleobase was not detected, neither were the masses for C, 5-mC, T, G, or A (data not shown). These results could be due to oxidation of nucleobases by formic acid under non-ideal digestion conditions or to the presence of large amounts of unknown contaminants.

2.5.3: Method alterations for future MS experiments

Due to time constraints, this experiment could not be repeated or optimized. However, future experimental protocols might be altered to produce cleaner results. First, samples originally produced on the benchtop could be produced in a clean hood to reduce the possibility of contamination from the air. Second, the formic acid digestion protocol could be altered to better remove any atmospheric oxygen present, which hastens base oxidation.
Third, as in other detection methods, samples could be modified with β-GT before being cleaned to produce digestible genomic DNA, creating 5-gmC or click chemistry-modified bases that would be significantly heavier and thus more easily detected. Lastly, other chemical modifications could be employed, such as silylation. Silylation, which is the derivatization of a molecule with a -SiR3 moiety, could be used to, for example, produce 5-(trimethylsiloxy)methyldeoxycytosine via the reaction of trimethylsilyl chloride with the hydroxyl group of 5-hmC [48]. This much heavier derivative would be much more easily detected than the naked nucleobase.

2.6: Sequencing-based detection methods

2.6.1: Introduction to bisulfite sequencing

Standard DNA sequencing methods cannot distinguish 5-mC from unmethylated C, and high-throughput methods that rely on immunoprecipitation of 5-mC-containing DNA fragments yield relatively low resolution. However, whole-genome, single-base resolution of 5-mC is possible using a technique called bisulfite sequencing (BS-seq). In BS-seq, libraries of genomic DNA are bisulfite converted before PCR amplification and sequencing.

In bisulfite conversion, denatured DNA is reacted with sodium bisulfite (NaHSO3) at low pH. Unmethylated C is sulfonated at the C6 position, which facilitates deamination at C4, replacing the amino moiety with a hydroxyl moiety. A second reaction, desulfonation at high pH, removes the sulfite at C6 and facilitates the hydrogenation of the nitrogen at the N3 position and conversion of the C4 hydroxyl into a keto moiety. Overall, this reaction turns the unmethylated C into deoxyuracil (U) [43,44,49]. However, when 5-mC reacts with sodium bisulfite and is desulfonated, 5-methylU (thymine) is not produced. Instead, the methyl group at C5 interferes with deamination and the sulfite at C6 is removed, leaving the 5-mC unchanged [43,44,49]. (See Fig. 1 of Huang et al. for an overview of the reaction [44].)
After amplification of the DNA library with a DNA polymerase tolerant of template-strand U, this process effectively replaces unmethylated C with T, whereas 5-mC in the template strand is read by the polymerase as C and amplified as such (Fig. 2.5A). It is therefore possible to determine the position of 5-mC when bisulfite-treated DNA is sequenced and compared to a reference sequence. In this comparison, bases that are sequenced as T in the bisulfite-converted sample but are C in the reference were unmethylated, whereas bases that are sequenced as C in the bisulfite-converted sample and also read as C in the reference were methylated.

Fig. 2.5: Overview of bisulfite-based sequencing methods. (A) In BS-seq, unmethylated C is converted to U during bisulfite conversion but 5-mC and 5-hmC are not. Comparison of BS-seq data to a reference sequence shows which bases were modified and which were not. (B) Potassium perruthenate oxidation of 5-hmC yields a formylated intermediate (5-fC). This base is converted to U during bisulfite conversion. (C) oxBS-seq is a form of BS-seq modified with the oxidation step outlined in (B). Comparison of oxBS-seq data to a reference sequence shows which cytosines were methylated and which were either unmethylated or hydroxymethylated.

2.6.2: Single-base resolution of 5-hydroxymethylcytosine?

Until recently, it was not possible to distinguish 5-hmC from 5-mC using bisulfite-based sequencing methods. Like 5-mC, 5-hmC is not converted to U after standard bisulfite treatment. During the reaction of 5-hmC with sodium bisulfite, the hydroxyl moiety on the methyl group at C5 is replaced with sulfite, creating cytosine 5-methylenesulfonate (CMS) [43,44]. This sulfonate group does not leave after the deamination step and, during amplification and sequencing, CMS is read as a C. While CMS can be detected by other methods, this fact means that 5-hmC cannot be distinguished from 5-mC during bisulfite sequencing [44].
In fact, the equivalency of 5-mC and 5-hmC under bisulfite sequencing methods means that methylation datasets from organisms that do contain significant amounts of 5-hmC are contaminated with significant amounts of 5-hmC signal that has been interpreted as 5-mC.

Another issue with determining the position of 5-hmC bases is a difference in how the base is read by DNA polymerases after bisulfite conversion. In qRT-PCR and primer extension assays of bisulfite-treated DNA, the putative CMS adduct has been shown to stall DNA polymerases and produce significantly less full-length product from 5-hmC-containing templates than from 5-mC-containing or C-only templates [44]. In addition to contamination from 5-hmC signal, 5-hmC-containing genomic regions may therefore actually be underrepresented in high-throughput bisulfite sequencing datasets.

2.6.3: Oxidative bisulfite sequencing

While enzyme kinetics-based epigenomic sequencing and the advent of semiconductor- and nanopore-based sequencing may yield exciting and accurate methods to detect all modified bases at single-base resolution, these emerging technologies are not currently accurate enough to produce reliable data for most full-length eukaryotic genomes [23,50]. To overcome the issues with single-base resolution of 5-hmC, a technique called oxidative bisulfite sequencing (oxBS-seq) has recently been developed [51]. oxBS-seq uses an oxidative step with potassium perruthenate (KRuO4) prior to bisulfite conversion; as in BS-seq, bisulfite conversion is followed by library amplification and sequencing. KRuO4 is a strong oxidizer of primary and secondary alcohols and converts these hydroxyl groups to carbonyls, producing an aldehyde or ketone. In DNA, this reaction allows for the oxidation of 5-hmC to 5-formylcytosine (5-fC) (Fig. 2.5B).
Table 2.2: Comparison of bases as seen after different sequencing methods

| Actual base in genome | C | 5-mC | 5-hmC |
|---|---|---|---|
| appears as… | | | |
| …in reference sequence | C | C | C |
| …in BS-seq dataset | T | C | C |
| …in oxBS-seq dataset | T | C | T |

Unlike 5-mC and 5-hmC, 5-fC is converted to U during bisulfite conversion, with the carbonyl at C5 leaving the base during desulfonation (ostensibly by decarboxylation after being converted to a carboxyl group). This method's conversion of both unmethylated C and 5-hmC allows determination of 5-hmC sites through base-by-base comparison of an oxBS-seq and BS-seq dataset pair (Table 2.2). In the BS-seq dataset, unmethylated C bases will be converted to T, while 5-mC and 5-hmC bases will remain C. In the oxBS-seq dataset, both unmethylated C and 5-hmC bases will be converted to T, while 5-mC bases will remain C (Fig. 2.5C). Thus, positions at which a significant fraction of C (in the BS-seq dataset) has changed to T in the oxBS-seq dataset are likely to be consistently 5-hmC in the genome, whereas positions that show conversion in both datasets and positions that show conversion in neither dataset are likely unmethylated C and 5-mC in the genome, respectively.

2.6.4: oxBS-seq of Ostreococcus and Chlorella: Processing

Genomic DNA from two species of green algae, Ostreococcus lucimarinus and Chlorella sp. NC64A, was used to create BS-seq and oxBS-seq library pairs (as explained above) for each organism. These two organisms were chosen for oxBS-seq for several reasons. First, both organisms have previously sequenced, published genomes. Second, 5-hmC was immunodetected in the Chlorella genome at moderately high levels compared to other organisms used in the assays. (O. lucimarinus, which did not have 5-hmC signal at high enough levels to be counted as definitely present, did show a weak, low-level signal.) Third, both genomes are small for eukaryotic organisms, allowing very high sequencing coverage of both organisms.
This high coverage gives the possibility of high statistical confidence in determining the consistent presence of 5-hmC at a particular location. Fourth, cytosine methylation in Ostreococcus occurs in a discrete and periodic manner, in which particular cytosines in a genomic library are more or less either unmethylated or nearly fully methylated (Jason Huff, UC-Berkeley, unpublished data). This digital-type methylation of particular cytosines means that, if present, 5-hmC may also occur in a discrete manner and thus be more easily detectable.

Additionally, before library creation, each sample was spiked with a 1:1:1 mixture of the same standards used in the restriction enzyme testing. After sequencing, standards were filtered from sequencing data and datasets were mapped to the appropriate genomes and processed for methylation determination.

To determine the significance of differences between sequencing data at a given position, a Fisher's exact test was run on data from every cytosine position in the genome. With two datasets (BS-seq and oxBS-seq) and the potential modification status of a position from each dataset, we can tally the values for each category and construct a 2 × 2 contingency table at each cytosine position (Table 2.2, 2.3). Using Fisher's exact test, a p-value can then be computed for each position.

Table 2.3: Fisher's exact test contingency table and formula for BS/oxBS paired datasets with "signed p-value"

For each base:

| | BS-seq | oxBS-seq |
|---|---|---|
| Converted | T_B | T_O |
| Not converted | C_B | C_O |

$$p = \frac{(T_B + T_O)!\,(C_B + C_O)!\,(T_B + C_B)!\,(T_O + C_O)!}{T_B!\,T_O!\,C_B!\,C_O!\,(T_B + T_O + C_B + C_O)!}$$

$$p_{\text{signed}} = \begin{cases} -p, & \dfrac{T_O}{T_O + C_O} < \dfrac{T_B}{T_B + C_B} \\[2ex] \phantom{-}p, & \dfrac{T_O}{T_O + C_O} \ge \dfrac{T_B}{T_B + C_B} \end{cases}$$

However, this p-value is unsigned; that is, it gives us information about each position, but it does not tell us whether a given position is favored to have a 5-hmC.
For example, to say that a given position has a good chance of often being 5-hmC and not C or 5-mC, the position must have a low p-value (below our chosen significance cutoff). The position also must show an increase in converted bases (T) in the oxBS dataset compared to the BS dataset, as positions at which 5-hmC bases are often found will show as converted in the oxBS dataset due to the chemistry of oxidative bisulfite conversion. We can sign our p-value by checking the percentage of converted bases in both datasets at a given position. Positions at which the percentage of converted bases in the oxBS dataset is higher than in the BS dataset are given a positive sign; positions at which the percentage of converted bases in the BS dataset is higher than in the oxBS dataset are given a negative sign. This "signed p-value" allows filtering for genomic positions at which 5-hmC may occur by searching for p-values between 0 and the chosen significance cutoff.

2.6.5: oxBS-seq of Ostreococcus and Chlorella: Results

2.6.5.1: Standards

The 1:1:1 methylated DNA standards (described above) were included in each library to check for oxidative bisulfite conversion using the recently developed technique. Assuming full conversion of the standards, standard sequences filtered from bisulfite-converted libraries are expected to show that ~33% of C positions in standard reads have been C→T converted, while standards filtered from oxidative bisulfite-converted libraries are expected to show that ~67% of C positions in standard reads have been C→T converted, per the chemistries described above. Standards filtered from these sequenced libraries did not show these expected conversion frequencies.
For example, analyzing the first few bases of the standard reads in the Chlorella BS-seq library, ~33% of C positions in standards were expected to be C→T converted but ~50% were actually converted, while in its oxBS-seq library pair, ~67% of C positions in standards were expected to be C→T converted but ~43% were actually converted. These numbers were consistent in the Ostreococcus library pairs as well. This deviation from the expected standard conversions indicates that our 5-hmC detection technique may have failed and that these data cannot be used in analysis of the 5-hmC content of these genomes. However, it may be suspected that, due to the CMS derivative produced during bisulfite conversion and difficulties with amplification of CMS-containing bases (see above), the 5-hmC standard in a BS-seq library would not be well amplified and thus would be essentially absent from an amplified library, giving the 50% conversion rate seen.

Fig. 2.6: Testing of methylated DNA standards libraries. Libraries of a single methylated DNA standard were subjected to either bisulfite conversion only or both oxidation and bisulfite conversion. Converted libraries were then PCR amplified and analyzed by agarose gel electrophoresis. Rows 1 & 2, unmethylated C standard; rows 3 & 4, 5-mC standard; rows 5 & 6, 5-hmC standard. Rows 1, 3, & 5, bisulfite converted and PCR amplified; rows 2, 4, & 6, oxidized, bisulfite converted and PCR amplified.

This reasoning does not explain the deviation from expected standards conversion in the oxBS-seq library. To further investigate these deviations, standard-only libraries were created and test amplified. In total, six standards mock libraries were created, one for each standard and conversion type (C only, 5-mC only, and 5-hmC only; bisulfite- and oxidative bisulfite-converted libraries). After amplification, libraries were run on an agarose gel for visualization (Fig. 2.6).
This experiment shows that, indeed, the 5-hmC standard is not amplified after bisulfite treatment. Also, surprisingly, the 5-hmC standard is not amplified after oxidative bisulfite treatment. Additionally, after oxidative bisulfite treatment, while there appears to be no effect on the 5-mC standard, the unmethylated C standard is significantly underamplified compared to its bisulfite-only pair. While these mock libraries were not quantified post-amplification, it is clear that oxidative treatment before bisulfite treatment does alter standard ratios from those expected in sequenced libraries and may be the reason for the deviations from expected standard ratios seen in the Chlorella and Ostreococcus BS/oxBS library pairs.

2.6.5.2: Can we detect 5-hmC in Chlorella and Ostreococcus sequencing data?

Oxidative bisulfite sequencing was recently developed and used to detect 5-hmC at single-base resolution in mouse embryonic stem cells, and that study claims to have detected regions of 5-hmC enrichment [51]. This study, however, cannot make such claims due to failure of the libraries' internal controls. Based on the standards mock library amplification experiment and findings from other studies, sequences containing 5-hmC bases (whether only a few or, as in these standards, highly concentrated) will be severely underrepresented in both bisulfite- and oxidative bisulfite-converted libraries. Thus, this study cannot expect to detect 5-hmC by sequencing BS/oxBS library pairs made with this oxidation protocol.

For example, let us look at a hypothetical base in a specific genomic position that, in a pool of DNA from many cells, is unmethylated C 20% of the time, 5-mC 20% of the time, and 5-hmC 60% of the time. If the conversion techniques work as they should and there is no amplification bias, this position should be easily detected as a candidate for often being 5-hmC.
This particular base should be seen as 20% T and 80% C in BS-seq data and 80% T and 20% C in oxBS-seq data, which, given enough representations of this position in our datasets, quickly reaches a very low p-value by Fisher's exact test. However, if 5-hmC bases are essentially not represented in the amplified libraries, as shown in the standards mock library experiment, both hypothetical libraries will show this particular base as 50% T and 50% C, and 5-hmC at this position will be undetectable.

Despite these issues, the BS/oxBS paired datasets can still be analyzed using the Fisher's exact test "signed p-value" method described above. Assume that there is a significant number of 5-hmC-rich positions in each genome that can be detected without the issues described above. If this is the case, we should see a relatively large number of cytosine positions with very small "positive" p-values, few very small "negative" p-values, and many non-significant p-values, where the very small positive p-values indicate positions at which 5-hmC is often present.

Applying this analysis to a subset of Chlorella CG sites, approximately equal numbers of very small positive and negative p-values are seen. Out of 65955 CG sites with a Fisher's exact test p-value less than or equal to 10⁻⁵, only 32862 (~49.8%) are "positive." Tightening the significance cut-off to 10⁻¹⁰, 674 out of 1466 (~46%) are "positive," while at a 10⁻¹⁵ cut-off, 31 out of 58 (~53.4%) are "positive." Because there are approximately as many "positive" as "negative" sites at all significance cut-offs, we can conclude that 5-hmC is not detectable in these paired datasets using the methods presented here.
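The read-out rules of Table 2.2, the hypothetical 20% C / 20% 5-mC / 60% 5-hmC position, and the signed p-value of Table 2.3 can be sketched in a few lines of Python. This is only an illustrative sketch, not the custom scripts used for the analysis: all function and variable names are hypothetical, the coverage of 50 reads per dataset is an arbitrary assumption, and the loss of 5-hmC templates is modeled as complete. Note also that `fisher_two_sided` assumes the usual two-sided form of Fisher's exact test (summing the probabilities of all tables no more likely than the observed one), whereas the formula in Table 2.3 gives the probability of the single observed table, which is computed here by `table_probability`.

```python
from math import comb

# How each genomic base is read after conversion and amplification (Table 2.2).
READOUT = {
    "BS": {"C": "T", "5mC": "C", "5hmC": "C"},
    "oxBS": {"C": "T", "5mC": "C", "5hmC": "T"},
}


def expected_t_fraction(composition, method, hmc_amplifies=True):
    """Expected fraction of reads showing T at one position.

    composition maps each base to the fraction of templates carrying it.
    With hmc_amplifies=False, 5-hmC templates are assumed to be lost entirely.
    """
    pool = {b: f for b, f in composition.items() if hmc_amplifies or b != "5hmC"}
    converted = sum(f for b, f in pool.items() if READOUT[method][b] == "T")
    return converted / sum(pool.values())


def table_probability(t_b, c_b, t_o, c_o):
    """Probability of one 2 x 2 table with fixed margins (formula in Table 2.3)."""
    n_bs, n_ox = t_b + c_b, t_o + c_o
    return comb(n_bs, t_b) * comb(n_ox, t_o) / comb(n_bs + n_ox, t_b + t_o)


def fisher_two_sided(t_b, c_b, t_o, c_o):
    """Two-sided Fisher's exact test: sum over tables no more likely than observed."""
    n_bs, n_ox, n_t = t_b + c_b, t_o + c_o, t_b + t_o
    p_obs = table_probability(t_b, c_b, t_o, c_o)
    total = 0.0
    for x in range(max(0, n_t - n_ox), min(n_t, n_bs) + 1):
        p = table_probability(x, n_bs - x, n_t - x, n_ox - (n_t - x))
        if p <= p_obs * (1 + 1e-9):  # small tolerance for floating-point ties
            total += p
    return min(total, 1.0)


def signed_p_value(t_b, c_b, t_o, c_o):
    """Positive when oxBS shows at least as much conversion as BS, else negative."""
    if t_b + c_b == 0 or t_o + c_o == 0:
        raise ValueError("position needs coverage in both datasets")
    p = fisher_two_sided(t_b, c_b, t_o, c_o)
    more_converted_in_oxbs = t_o / (t_o + c_o) >= t_b / (t_b + c_b)
    return p if more_converted_in_oxbs else -p


def simulate_counts(composition, coverage, hmc_amplifies=True):
    """Idealized (T_B, C_B, T_O, C_O) read counts at a given coverage."""
    t_b = round(coverage * expected_t_fraction(composition, "BS", hmc_amplifies))
    t_o = round(coverage * expected_t_fraction(composition, "oxBS", hmc_amplifies))
    return t_b, coverage - t_b, t_o, coverage - t_o


if __name__ == "__main__":
    position = {"C": 0.2, "5mC": 0.2, "5hmC": 0.6}
    for amplifies in (True, False):
        counts = simulate_counts(position, coverage=50, hmc_amplifies=amplifies)
        print(amplifies, counts, signed_p_value(*counts))
```

Under these assumptions, the idealized position gives a very small positive signed p-value, while the same position with 5-hmC templates lost gives identical BS and oxBS counts and a p-value near 1, which is the undetectable case described in the text. The same `expected_t_fraction` function applied to a 1:1:1 standards mix reproduces the expected ~33% (BS) and ~67% (oxBS) conversion, and 50% when the 5-hmC standard is assumed not to amplify.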
2.7: Materials & Methods

2.7.1: Genomic DNA, organismal growth conditions, and methylated DNA standards

DNA from the following organisms, either extracted by the author from frozen or living tissue or culture or received as DNA samples, was used in this study: Oryza sativa seedling frozen tissue (Jessica Rodrigues, UC-Berkeley); Mus musculus brain and liver frozen tissues, Apis mellifera worker head DNA, Selaginella moellendorffii DNA, Uncinocarpus reesii DNA, Postia placenta DNA, Laccaria bicolor DNA, Coprinopsis cinerea DNA (Assaf Zemach, UC-Berkeley); Ostreococcus lucimarinus CCMP#2972 DNA (Jason Huff, UC-Berkeley); Physcomitrella patens living tissue (Tom Kleist, UC-Berkeley); Chlorella sorokiniana culture (Melissa Roth, UC-Berkeley); Chlamydomonas reinhardtii CC503 cw-92 mt+ culture (Melis Lab, UC-Berkeley); growing cultures of Coprinus cinereus (Coprinopsis cinerea) Okayama 7 FGSC#9003, Neurospora crassa 74-OR23-1VA FGSC#2489, Neurospora tetrasperma FGSC#2509, Agaricus bisporus H97 FGSC#10389 and JB137-S8 FGSC#10392 (Fungal Genetics Stock Center, Kansas City, MO); frozen cultures of Chlorella sp. NC64A ATCC#50258, Laccaria bicolor ATCC#MYA-4686, Phycomyces blakesleeanus ATCC#8743B (American Type Culture Collection, Manassas, VA); Arabidopsis thaliana Columbia-0 and Landsberg erecta mature leaf tissue.

Green algae cultures were grown in TAP liquid medium (UTEX Culture Collection of Algae, Austin, TX) at 20°C under a 16/8 light/dark cycle. Fungal cultures were grown in 1× or ½× Difco YM liquid medium (BD, Franklin Lakes, NJ) at 20 to 25°C. Arabidopsis thaliana was greenhouse grown in a soil mixture at 20 to 25°C under a 16/8 light/dark cycle.

The 5-hmC, 5-mC & Cytosine DNA Standard Pack from Diagenode (Denville, NJ) was used for dot blot standard controls. The Methylated DNA Standard Kit from Active Motif (Carlsbad, CA) was used for restriction enzyme testing and high-throughput sequencing experiments.
2.7.2: DNA extraction

Tissues were frozen with liquid nitrogen and ground using a mortar and pestle. An appropriate volume of 1× or 2× CTAB DNA extraction buffer containing 0.2% 2-mercaptoethanol v/v was added to the tissue, which was then ground again. After a 1-3h incubation at 65°C, the mixture was centrifuged at room temperature (RT) and the supernatant was mixed with an equal volume of phenol:chloroform:isoamyl alcohol (25:24:1), inverted for 30s, and centrifuged at RT. The aqueous phase was then mixed with an equal volume of chloroform:isoamyl alcohol (24:1), inverted for 30s, and centrifuged at RT. The aqueous phase was mixed with 0.7 to 1 volumes of 2-propanol, inverted to mix, and centrifuged at RT. The supernatant was removed and the pellet was dried and resuspended in water or elution buffer, to which 2.5 to 3 volumes of 1.3% 3M sodium acetate v/v ethanol were added. This mixture was chilled at -20°C for at least 1h and centrifuged at 4°C, after which the supernatant was removed and the pellet was dried, resuspended in water or elution buffer, and used for experimentation. All centrifugation steps used 12000 to 18000 rcf.

2.7.3: Restriction enzyme digestion and analysis

PvuRts1I recombinant restriction enzyme (Active Motif, Carlsbad, CA) was used according to the manufacturer's recommended reaction conditions with 250 µg of Active Motif Methylated DNA Standards. Mock digestion controls were performed with water in place of the PvuRts1I enzyme solution. After digestion, reactions were heat inactivated for 10m at 65°C. After heat inactivation, samples were either used directly in 3% agarose w/v TAE electrophoresis gels or cleaned with QIAGEN MinElute PCR Purification Kits (QIAGEN, Valencia, CA) and submitted to QB3 (Berkeley, CA) for Agilent 2100 Bioanalyzer (Santa Clara, CA) analysis.
2.7.4: Dot blotting and antibody testing

To produce membranes for dot blots (adapted from Brown [52]), positively charged nylon membrane (Roche Applied Science, Indianapolis, IN) was cut to size, wetted in water for 10m, and dried. NaOH and EDTA (pH 8.2) were added to DNA samples to give final concentrations of 0.4M NaOH and 10mM EDTA. DNA was denatured by heating to 100°C for 10m and crash-cooled on ice for 5m before application to the membrane. Denatured DNA was applied to membranes 2µl at a time, and spots were allowed to dry before application of another dot. After application of all samples, the membrane was dried, briefly rinsed in 2× SSC buffer, and dried again.

For blotting, antibodies were tested with standards in several different combinations of concentration, buffer, blocking agent, and incubation time until a reasonable qualitative signal-to-noise ratio was reached. For experiments, dot blot membranes were rehydrated in PBST (0.1% Tween-20 v/v PBS buffer) for 5m at RT and transferred to blocking buffer (1% BSA or 5% dried milk w/v PBST, as appropriate for the antibody) on a rotator or rocker for 30m at RT or overnight at 4°C. Blocking buffer was then poured off, and blocking buffer with primary antibody (Active Motif mouse α-5-hmC mAb, 1:5000 PBST/BSA; Diagenode mouse α-5-hmC mAb, 1µg/5ml PBST/BSA; or Abcam rat α-5-hmC mAb (Cambridge, MA), 1:1000 PBST/milk) was added and incubated at RT for 2h. After primary incubation, the solution was poured off and the membrane was washed 5 × 5m in PBST. Membranes were then incubated in secondary antibody (either Invitrogen goat α-mouse mAb::HRP or Invitrogen goat α-rat mAb::HRP, 1:5000 or 1:10000 PBST/milk) for 1h at RT and then washed 5 × 5m in PBST. Membranes were then incubated with SuperSignal West Pico Chemiluminescent Substrate (Thermo Fisher Scientific, Waltham, MA) and exposed to Kodak BioMax film (Carestream Health, Toronto, Ontario, Canada).
2.7.5: Dioxygenase inhibition

One volume of the appropriate growth medium was added to liquid cultures of green algae and fungi, together with either an aqueous solution of L-2-hydroxyglutarate (Sigma-Aldrich, St. Louis, MO) to a concentration of 10mM or water (controls only). Cultures were allowed to grow for 16 to 20h before DNA extraction and dot blotting, as described above.

2.7.6: Acid hydrolysis of DNA and mass spectrometry

Before chemical digestion for mass spectrometry, DNA was isolated and purified using the protocol in Section 2.7.2 with some modifications to increase purity. First, DNA was subjected to two rounds each of the chloroform:isoamyl alcohol step and the isopropanol precipitation. Then, after precipitation with the sodium acetate/ethanol solution, the pellet was resuspended in water and precipitated with 100% ethanol twice. After these steps, the dried pellet was suspended in ultrapure, nuclease-free water.

Adapted from the formic acid digestion protocols in Gommers-Ampt et al., Djuric et al. and Eick et al. [21,46,47], 50µg of DNA was transferred to acid-washed glass tubes with 9 volumes of 98% formic acid. The tube was quickly flushed with filtered nitrogen, sealed with a PTFE-lined cap, and heated to between 141 and 145°C for 60m in an aluminum heating block, using heavy mineral oil for greater thermal contact. After heating, the tube was allowed to cool to RT on the benchtop and the formic acid/water mixture was evaporated in a vacuum chamber. The dried residue was resuspended in 100% methanol and submitted to the QB3/Chemistry Mass Spectrometry Facility at UC-Berkeley for analysis using electrospray ionization (nanospray) in positive ion mode.
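As a reference point for the masses sought in such spectra, the expected m/z values of the free nucleobases can be computed from their elemental compositions. The following is a minimal sketch, assuming that each base is observed as a singly protonated [M+H]+ ion in positive ion mode; that choice of ion, and all names in the code, are illustrative assumptions and do not describe the facility's actual analysis.

```python
# Monoisotopic masses (in u) of the most abundant isotope of each element.
ATOM_MASS = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}

# Mass added by protonation (assumed ion type: singly charged [M+H]+).
PROTON_MASS = 1.00727647

# Elemental compositions of the free nucleobases released by acid hydrolysis.
NUCLEOBASES = {
    "C": {"C": 4, "H": 5, "N": 3, "O": 1},
    "5-mC": {"C": 5, "H": 7, "N": 3, "O": 1},
    "5-hmC": {"C": 5, "H": 7, "N": 3, "O": 2},
    "T": {"C": 5, "H": 6, "N": 2, "O": 2},
    "A": {"C": 5, "H": 5, "N": 5},
    "G": {"C": 5, "H": 5, "N": 5, "O": 1},
}


def monoisotopic_mass(formula):
    """Sum of monoisotopic atomic masses for an elemental composition."""
    return sum(ATOM_MASS[element] * count for element, count in formula.items())


def protonated_mz(formula):
    """m/z of the singly protonated ion, [M+H]+."""
    return monoisotopic_mass(formula) + PROTON_MASS


if __name__ == "__main__":
    for name, formula in NUCLEOBASES.items():
        print(f"{name:6s} M = {monoisotopic_mass(formula):.4f}  "
              f"[M+H]+ = {protonated_mz(formula):.4f}")
```

In this sketch 5-hmC differs from 5-mC by exactly one oxygen atom (about 15.995 u), which is the mass shift that would distinguish the two bases in a spectrum.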
2.7.7: Bisulfite and oxidative bisulfite library construction

DNA used for BS-seq and oxBS-seq libraries was first isolated and purified using the protocol in Section 2.7.2, sonicated to a mean size of approximately 300bp, purified using Agencourt AMPure XP magnetic beads (Beckman Coulter, Indianapolis, IN) or homemade magnetic beads (Tzung-Fu Hsieh, UC-Berkeley) according to the manufacturer's recommended protocol, resuspended in elution buffer, and checked for appropriate size distribution on a 1% agarose w/v TAE electrophoresis gel.

Based on the protocol provided in Lister et al. [53], cleaned, sheared DNA was end repaired with T4 DNA polymerase, Klenow DNA polymerase, and T4 PNK (New England Biolabs, Ipswich, MA) with appropriate reagents for 30m at 20°C, magnetic bead cleaned, then modified with 3' A base additions using Klenow exo- (New England Biolabs) and appropriate buffers for 30m at 37°C. To this reaction, annealed, fully methylated Illumina paired end adapters (5' PGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG; 5' ACACTCTTTCCCTACACGACGCTCTTCCGATCT) (Bioneer, Alameda, CA) and DNA Quick Ligase (New England Biolabs) with appropriate buffer were added and incubated for 20m at RT. After adapter ligation, the reaction was cleaned with magnetic beads. For libraries to be oxidized, DNA was resuspended in water after purification.

For oxBS-seq libraries, an oxidation protocol based on Booth et al. was used [51]. DNA prepared as discussed above was denatured by adding NaOH to a final concentration of 0.05M, heated for 30m at 37°C, and snap-cooled on ice for 5m. Next, a KRuO4 solution (15mM in 0.05M NaOH) was added to the DNA solution, for a final KRuO4 concentration of 0.6mM, and left on ice for 1h with occasional light finger vortexing. Oxidized DNA was purified with Roche mini Quick Spin Oligo Columns (Sephadex G-25 beads). First, buffer was washed out of the column with 200µl water followed by centrifugation for 1m at 1000 rcf (× 4) before adding the DNA/KRuO4 solution.
Purified solution was collected by centrifuging for 4 min at 1000 rcf. BS- and oxBS-seq libraries were bisulfite converted twice using the FFPE protocol of the QIAGEN EpiTect Kit without carrier RNA. Bisulfite-converted libraries were PCR amplified using exTaq HS (Takara, Mountain View, CA), Illumina PE1 (5' AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT) and PE2 (5' CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT) primers, and appropriate reagents with 14 to 18 amplification cycles. Amplified libraries were cleaned and concentrated with magnetic beads, analyzed by Bioanalyzer DNA HS (UC-Berkeley QB3 facility), and submitted to the QB3/Genomics Sequencing Laboratory for sequencing on Illumina HiSeq 2000 sequencers.

2.7.8: Processing and analysis of high-throughput sequencing data

Data from sequenced libraries were aligned and analyzed using bowtie (http://bowtie-bio.sourceforge.net), SAMtools (http://samtools.sourceforge.net), BAMtools (http://bamtools.sourceforge.net), and the dzlab-tools package (http://dzlab.pmb.berkeley.edu/tools). Results were further filtered, organized, and automated with custom Perl, Python and bash shell scripts (available from the author) and Microsoft Excel (Microsoft, Redmond, WA).

Chapter 3: Hypothesis for function of 5-hydroxymethylcytosine in putative fungal transposons and schemata for engineering sequence-specific epigenetic modification systems

3.1: Introduction: Relationships among eukaryotic TET/JBP family dioxygenases and dioxygenase-containing transposons

Putative TET/JBP family dioxygenases, enzymes known to catalyze the production of 5-hmC, are known to be encoded in many eukaryotic genomes18. Among the organisms that likely contain TET/JBP-like dioxygenases are the fungi Laccaria bicolor and Coprinopsis cinerea and the Chlorophyte alga Chlamydomonas reinhardtii. This study has shown, by immunodetection, that all three of these organisms likely contain significant amounts of 5-hmC in their genomes.
Detailed bioinformatics analysis has shown that TET/JBP-like dioxygenases occur in high copy number in these organisms and may be associated with putative transposases and other DNA binding and enzymatic proteins in a specific genetic orientation18 (Fig. 3.1A). These facts lead to the hypothesis that such genes may together form transposons or transposon-like cassettes.

Fig. 3.1: Overview of putative transposon-associated TET-like dioxygenases. (A) Basic structure of putative transposons in Laccaria and Coprinopsis. In these transposon-like cassettes, dioxygenases and DNA binding domains are found parallel to each other, while putative transposases are found antiparallel to the first two sequences. (B) A maximum likelihood tree of selected eukaryotic dioxygenases shows evolutionary relationships among homologs.

To further develop this hypothesis, this study completed bioinformatics analysis of these dioxygenases. A maximum likelihood phylogenetic tree was created using aligned amino acid sequences of selected TET/JBP family dioxygenases from vertebrate and invertebrate animals, Discicristate-group organisms (Trypanosoma species and Naegleria gruberi), and the aforementioned fungi and green alga. As an outgroup, an Escherichia coli protein AlkBH2, a 2-KG-dependent dioxygenase known to function as a DNA repair enzyme, was used. Based on known phylogeny, the expected tree would form two clades: one clade consisting of the green alga, the fungi, and the animals, with a sub-clade of the fungi and animals, and a second clade consisting of the Discicristate organisms. Instead, clades consisting of animal sequences and Trypanosoma sequences group most closely together. These groups form a larger clade with the putative transposon-associated green algal and fungal sequences, and sister to this larger clade are the N. gruberi sequences (Fig. 3.1B).
Indeed, the grouping of the fungal and green algal dioxygenases outside of the known TET and JBP sequences suggests that they are closely related, and it is plausible that they may function as or have been derived from rogue genetic elements. Additionally, the large grouping of N. gruberi dioxygenases, seemingly not closely related to the JBP family dioxygenases of the fellow Discicristate organisms in the genus Trypanosoma, suggests the same may be true for these genes.

3.2: Hypothesis for function of putative fungal transposons

3.2.1: Introduction: A hypothesis for autoderepressing transposons

If these dioxygenases are indeed derived from rogue genetic elements and are associated with sequences that either are or once were active transposon-like elements, what could be their purpose? First, transposons, transposon-like sequences, and repeats are often cytosine methylated in eukaryotic organisms that contain DNA methyltransferases8. This 5-mC acts as a repressive mark. Let us assume that these elements are highly methylated in their host organisms and repressed by the hosts’ genetic regulatory systems. For these elements to be expressed and to transpose more efficiently, their repressive marks must be removed. Assuming some amount of low-level transcription of these genes, if these elements can themselves remove their repressive marks, they might be able to increase their own expression and transposition, creating a positive feedback loop. One can speculate that these transposon-like cassettes do encode an autoderepression mechanism to achieve this effect. The essential structure of the putative fungal transposons includes a putative transposase, a DNA binding protein that could target a sequence or sequences of the cassette, and a dioxygenase (Fig. 3.1A). It is the dioxygenase that, by associating with and being targeted to transposon sequences by the DNA binding protein, would act as the direct derepressive mechanism.
By catalyzing the formation of 5-hmC from 5-mC, the dioxygenase effectively removes repressive methylation marks in the transposon by one of the mechanisms discussed in Chapter 1.

3.2.2: Outline of methods for “autoderepressing transposons” hypothesis testing

This study has not attempted to show that these putative transposons are, in fact, active transposons. However, there are several methods that could be used to show transposon activity as well as the role that the transposon-associated dioxygenases may play in transposon-associated gene expression and activity. First, a transcriptomic approach, such as RNA-seq, or a PCR-based approach, such as qRT-PCR, could be used to detect and quantify expression of specific transposon-associated genes. These two types of methods would verify that these genes are actively transcribed by the host. Second, transposon display techniques can show active transposition of transposons54–56. Lastly, if dioxygenase activity does increase expression and transposition of these transposons, then dioxygenase inhibition should decrease expression of transposon-associated genes as well as their transposition. Using a protocol similar to that explained in Chapter 2, 2-HG treatment of these organisms in conjunction with RNA-seq or qRT-PCR and transposon display should cause a marked decrease in expression and transposition of these transposons when compared to an untreated control.

3.3: Proposal for sequence-specific epigenetic modification technology

3.3.1: Introduction

Whether or not these putative transposons act as hypothesized in the previous section, their regular structure shown by Iyer et al. suggests the concept of sequence-specific epigenetic modification18. The development of a technology that is capable of sequence-specific epigenetic modification would be a boon for many fields of basic biological and medical research.
Conceivably, this technology could be used to either silence or activate expression of genes through the cell’s epigenetic system, thereby altering cellular phenotypes that are (partially) epigenetically controlled, such as certain aspects of the cell cycle and division, development and differentiation, and metabolism. This concept requires the production of two essential, interacting components (Fig. 3.2). First, a targeter is required. This targeter can be a DNA binding protein that binds to a unique sequence or semi-specific set of sequences, or a single-stranded nucleic acid that is complementary to the target sequence. Currently, engineering of DNA binding proteins, such as zinc fingers and transcription activator-like effectors, is possible, and it is likely that this technology will make targeting of any nucleotide sequence much easier in the near future. For now, a proof-of-concept experiment using a protein targeter might use a transcription factor or other DNA binding protein that is known to target a specific set of sequences. Second, a modifier is required. This enzymatic module modifies a specific epigenetic mark in the region to which it is targeted. Possibilities for modifications are DNA methylation (with a DNA methyltransferase), DNA demethylation (with a dioxygenase or a DNA glycosylase, as appropriate for the organism being modified), or any number of histone modifications. In addition to the targeter and modifier components, other secondary components are needed. First, a linker to bind the primary components is required. This linker may either covalently bind the targeter and modifier in a fused system or facilitate interaction between targeter and modifier in a modular system, such as through the use of high-affinity, interacting tags on the components (e.g., biotin-streptavidin system). (Both of these concepts are discussed more below.)
In addition to binding the two components together, the linker must allow flexibility and movement of the two components relative to one another so that the modifier can reach the three-dimensional region of chromatin surrounding the sequence to which the targeter is bound. Second, a nuclear localization signal is required to be a part of the system so that it can be delivered to the nucleus and access chromatin.

Fig. 3.2: Example scheme of a modular, sequence-specific epigenetic modification system. In this example, the modifier module, a dioxygenase, can link to the targeter module, a DNA binding protein, through streptavidin-biotin interaction. This system can activate gene expression by binding to a specific genomic region and hydroxylating 5-mC bases surrounding that region in a gene normally repressed via cytosine methylation.

3.3.2: Fused versus modular systems

In the fused system, all components are produced as one large, naturally covalently-linked polypeptide from a single, engineered genetic construct. This fused system may be advantageous where organisms are genetically engineered with the construct and must express the components in a one-to-one manner. In the modular system, the primary components are produced separately and later must be linked through high-affinity interactor tags. By using a modular system delivered to cells, modifier components can be microbially produced in volume and each can be mixed in a one-to-one ratio with microbially produced targeters to create a linked, “off the shelf” targeter-modifier system. With this method, a single locus or many loci may be targeted for a single or many epigenetic modifications without genetic engineering of the cells to be modified. This modular system may be advantageous when the system can be delivered to cells via injection or other specific means.
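The component requirements described above (a targeter, a modifier, a linker that is either a covalent fusion or a pair of interacting tags, and a nuclear localization signal) can be summarized as a small data model. The sketch below is purely illustrative: the class names, the `design_problems` check, and the assignment of biotin to the targeter and streptavidin to the modifier are assumptions made for the example, not features of any existing system.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical tag pairs assumed to bind each other with high affinity.
AFFINITY_PAIRS = {frozenset({"biotin", "streptavidin"})}


@dataclass(frozen=True)
class Component:
    name: str
    role: str                  # "targeter" or "modifier"
    tag: Optional[str] = None  # interaction tag, used only in modular systems


@dataclass(frozen=True)
class EditorSystem:
    targeter: Component
    modifier: Component
    architecture: str          # "fused" or "modular"
    has_nls: bool              # nuclear localization signal present?


def design_problems(system: EditorSystem) -> List[str]:
    """Return the requirements from Section 3.3 that a design fails to meet."""
    problems = []
    if system.targeter.role != "targeter" or system.modifier.role != "modifier":
        problems.append("components have the wrong roles")
    if not system.has_nls:
        problems.append("missing nuclear localization signal")
    if system.architecture == "modular":
        # Modular systems rely on a matching pair of interaction tags as the linker.
        tags = frozenset({system.targeter.tag, system.modifier.tag})
        if tags not in AFFINITY_PAIRS:
            problems.append("modular components lack a matching tag pair")
    elif system.architecture != "fused":
        problems.append("architecture must be 'fused' or 'modular'")
    return problems


modular = EditorSystem(
    targeter=Component("DNA binding protein", "targeter", tag="biotin"),
    modifier=Component("dioxygenase", "modifier", tag="streptavidin"),
    architecture="modular",
    has_nls=True,
)
fused_without_nls = EditorSystem(
    targeter=Component("DNA binding protein", "targeter"),
    modifier=Component("dioxygenase", "modifier"),
    architecture="fused",
    has_nls=False,
)

print(design_problems(modular))
print(design_problems(fused_without_nls))
```

In this toy model the fused architecture needs no tags because the linker is part of the polypeptide, whereas the modular architecture is only considered complete when both components carry tags that pair with each other.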
3.3.3: Potential uses of sequence-specific epigenetic modifier technology

How could this technology be used, given a working implementation with appropriate delivery methods? A few ideas upon which one can speculate include:
1. Epigenetic modification of loci important in cell cycle checkpoint, cell growth, or apoptosis to slow or stop cancer cell growth;
2. Demethylation of abnormally methylated loci in pancreatic cells of diabetic or pre-diabetic individuals57;
3. Reversion of differentiated cells into another state of potency by epigenetic means;
4. Altering normal phenotype or development by inducing a fused construct in response to an inductive signal, such as altering flowering time, fruit/seed development, or dormancy in crop plants.

3.4: Methods

3.4.1: Phylogenetic tree construction

Dioxygenase amino acid sequences selected for construction of a phylogenetic tree were found in Iyer et al.18 and aligned using MUSCLE (http://www.ebi.ac.uk/Tools/msa/muscle). The maximum likelihood phylogenetic tree was constructed with aligned sequences using PhyML (http://www.phylogeny.fr) with bootstrapping58.

Chapter 4: Thesis Summary

4.1: Introduction

Methylcytosine has long been known to be an important component of certain prokaryotic and eukaryotic genomes. Recently, hydroxymethylcytosine, a modification of methylcytosine catalyzed by Fe(II) and 2-oxoglutarate-dependent dioxygenases, was found to be a significant component of mammalian genomes. In mammals, 5-hmC is suspected to play important roles in DNA methylation/demethylation dynamics as a demethylation intermediate and may constitute a novel epigenetic mark. Dioxygenase protein domains that likely participate in pyrimidine oxidation, including those that are very similar to mammalian TET-like dioxygenases, have been found in several other eukaryotic lineages, including plants, fungi, and Discicristate organisms (e.g., Trypanosoma and Naegleria).
Because of its presumably wide evolutionary distribution and its potential importance to understanding the complexity of eukaryotic genomes, it is important that we understand more about this DNA modification.

4.2: Detection of 5-hydroxymethylcytosine

Many methods for detection of 5-hmC exist. Several methods that were not used in this study, such as chromatographic methods and specific chemical labeling, were outlined, and results of experiments to detect the presence of 5-hmC in various eukaryotic organisms were discussed in detail. First, the 5-hmC-specific PvuRts1I restriction enzyme method was tested. Results from these experiments show that PvuRts1I is likely not as specific as previously published results indicate, and this method was subsequently abandoned. Then, immunodetection of 5-hmC using a 5-hmC-specific antibody was used in dot blot analysis of DNA from various eukaryotes with satisfactory results. This method showed the previously unreported presence of 5-hmC in several fungi and green algae, as well as potentially at low levels in some angiosperms. Immunodetection was then used to show inhibition of 5-hmC production in certain fungi and green algae through application of the 2-KG analog 2-HG, showing that at least some of the 5-hmC present in these organisms is likely due to production by 2-KG-dependent dioxygenases. Next, mass spectrometry of chemically-digested DNA was attempted to show the presence of 5-hmC in two Neurospora species, which had shown positive signals from immunodetection methods. Unfortunately, this method did not work as expected. Lastly, a high-throughput sequencing-based method called oxidative bisulfite sequencing was attempted in order to detect 5-hmC at single-base resolution in two green algae. Analysis of the libraries and their included standards shows that, while 5-hmC may indeed be present, this study was not able to detect it using this method.
Possible issues with the conversion and amplification of 5-hmC-rich DNA sequences that may be reasons for sequencing-based detection problems were also discussed in this section.

4.3: Transposon function hypothesis and epigenetic editor proposal

Bioinformatic analysis of TET/JBP-like dioxygenases in certain organisms has shown that these enzymes are often associated with putative transposon-like elements. It was hypothesized that, with the help of DNA targeting proteins, these dioxygenases may help derepress their associated elements. By hydroxylation of repressive 5-mC marks likely found on these elements, these transposon-associated dioxygenases may effectively override the host’s repression mechanisms, creating an autoderepressing transposon. Several methods were proposed to test the activity of these elements and, if they are indeed actively transposing, to discern the role of dioxygenases in that activity. Based on the structure of these putative transposons, a hypothetical sequence-specific epigenetic modification technology was proposed. Basic system components, two possible system constructions, using fused and modular components, and potential applications of the hypothetical technology were also discussed.

References

1. Wilson, G. G. & Murray, N. E. Restriction and modification systems. Annual Review of Genetics 25, 585–627 (1991).
2. Hall, R. M. The DNA adenine methyltransferase (dam+) gene of bacteriophage T4 reverses the mutator phenotype of an Escherichia coli dam mutant. Journal of Bacteriology 172, 2812–3 (1990).
3. Suzuki, M. M. & Bird, A. DNA methylation landscapes: provocative insights from epigenomics. Nature Reviews Genetics 9, 465–76 (2008).
4. Cedar, H. & Bergman, Y. Linking DNA methylation and histone modification: patterns and paradigms. Nature Reviews Genetics 10, 295–304 (2009).
5. Cedar, H. & Bergman, Y. Programming of DNA methylation patterns. Annual Review of Biochemistry (2012).
6. Zemach, A., McDaniel, I. E., Silva, P. & Zilberman, D.
Genome-wide evolutionary analysis of eukaryotic DNA methylation. Science 328, 916–9 (2010).
7. Bestor, T. H. The DNA methyltransferases of mammals. Human Molecular Genetics 9, 2395–2402 (2000).
8. Law, J. A. & Jacobsen, S. E. Establishing, maintaining and modifying DNA methylation patterns in plants and animals. Nature Reviews Genetics 11, 204–20 (2010).
9. Lehman, I. R. & Pratt, E. A. On the Structure of the Glucosylated Nucleotides of Coliphages Hydroxymethylcytosine. Journal of Biological Chemistry 235, 3254–3259 (1960).
10. Wyatt, G. R. & Cohen, S. S. The bases of the nucleic acids of some bacterial and animal viruses: the occurrence of 5-hydroxymethylcytosine. The Biochemical Journal 55, 774–82 (1953).
11. Tahiliani, M. et al. Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1. Science 324, 930–5 (2009).
12. Pastor, W. A. et al. Genome-wide mapping of 5-hydroxymethylcytosine in embryonic stem cells. Nature 473, 394–7 (2011).
13. van Luenen, H. G. A. M. et al. Glucosylated Hydroxymethyluracil, DNA Base J, Prevents Transcriptional Readthrough in Leishmania. Cell 150, 909–921 (2012).
14. Militello, K. T. et al. African trypanosomes contain 5-methylcytosine in nuclear DNA. Eukaryotic Cell 7, 2012–6 (2008).
15. Kriaucionis, S. & Heintz, N. The nuclear DNA base 5-hydroxymethylcytosine is present in Purkinje neurons and the brain. Science 324, 929–30 (2009).
16. Liutkeviciute, Z., Lukinavicius, G., Masevicius, V., Daujotyte, D. & Klimasauskas, S. Cytosine-5-methyltransferases add aldehydes to DNA. Nature Chemical Biology 5, 400–2 (2009).
17. Wu, H. & Zhang, Y. Mechanisms and functions of Tet protein-mediated 5-methylcytosine oxidation. Genes & Development 25, 2436–2452 (2011).
18. Iyer, L. M., Tahiliani, M., Rao, A. & Aravind, L. Prediction of novel families of enzymes involved in oxidative and other complex modifications of bases in nucleic acids. Cell Cycle 8, 1698–710 (2009).
19. Loenarz, C. & Schofield, C. J.
Physiological and biochemical aspects of hydroxylations and demethylations catalyzed by human 2-oxoglutarate oxygenases. Trends in Biochemical Sciences 36, 7–18 (2011).
20. Xu, W. et al. Oncometabolite 2-hydroxyglutarate is a competitive inhibitor of α-ketoglutarate-dependent dioxygenases. Cancer Cell 19, 17–30 (2011).
21. Gommers-Ampt, J. H., Leeuwen, F. Van & Vliegenthart, J. F. G. A Novel Modified Base Present in the DNA of the Parasitic Protozoan T. brucei. Cell 75, 1129–1136 (1993).
22. Song, C.-X. et al. Selective chemical labeling reveals the genome-wide distribution of 5-hydroxymethylcytosine. Nature Biotechnology 29, 68–72 (2011).
23. Song, C.-X. et al. Sensitive and specific single-molecule sequencing of 5-hydroxymethylcytosine. Nature Methods 9, 75–7 (2012).
24. Globisch, D. et al. Tissue distribution of 5-hydroxymethylcytosine and search for active demethylation intermediates. PLoS ONE 5, e15367 (2010).
25. Williams, K., Christensen, J. & Helin, K. DNA methylation: TET proteins—guardians of CpG islands? EMBO Reports 13, 28–35 (2011).
26. Hackett, J. A. et al. Germline DNA demethylation dynamics and imprint erasure through 5-hydroxymethylcytosine. Science 339, 448–52 (2013).
27. Xiao, W. et al. Imprinting of the MEA Polycomb Gene Is Controlled by Antagonism between MET1 Methyltransferase and DME Glycosylase. Developmental Cell 5, 891–901 (2003).
28. Gehring, M. et al. DEMETER DNA glycosylase establishes MEDEA polycomb gene self-imprinting by allele-specific demethylation. Cell 124, 495–506 (2006).
29. Choi, Y. et al. DEMETER, a DNA Glycosylase Domain Protein, Is Required for Endosperm Gene Imprinting and Seed Viability in Arabidopsis. Cell 110, 33–42 (2002).
30. Cortellino, S. et al. Thymine DNA glycosylase is essential for active DNA demethylation by linked deamination-base excision repair. Cell 146, 67–79 (2011).
31. Nabel, C. S., Manning, S. A. & Kohli, R. M.
The curious chemical biology of cytosine: deamination, methylation, and oxidation as modulators of genomic potential. ACS Chemical Biology 7, 20–30 (2012).
32. Bhutani, N., Burns, D. M. & Blau, H. M. DNA Demethylation Dynamics. Cell 146, 866–872 (2011).
33. Ito, S. et al. Tet proteins can convert 5-methylcytosine to 5-formylcytosine and 5-carboxylcytosine. Science 333, 1300–3 (2011).
34. He, Y.-F. et al. Tet-Mediated Formation of 5-Carboxylcytosine and Its Excision by TDG in Mammalian DNA. Science 333, 1303–1307 (2011).
35. Shen, L. et al. Genome-wide Analysis Reveals TET- and TDG-Dependent 5-Methylcytosine Oxidation Dynamics. Cell 153, 1–15 (2013).
36. Song, C.-X. et al. Genome-wide Profiling of 5-Formylcytosine Reveals Its Roles in Epigenetic Priming. Cell 153, 1–14 (2013).
37. Spruijt, C. G. et al. Dynamic Readers for 5-(Hydroxy)Methylcytosine and Its Oxidized Derivatives. Cell 152, 1146–1159 (2013).
38. Jin, S.-G., Wu, X., Li, A. X. & Pfeifer, G. P. Genomic mapping of 5-hydroxymethylcytosine in the human brain. Nucleic Acids Research 39, 5015–24 (2011).
39. Wang, L., Chia, N. C., Lu, X. & Ruden, D. M. Hypothesis: Environmental regulation of 5-hydroxymethylcytosine by oxidative stress. Epigenetics 6, 853–856 (2011).
40. New England Biolabs Incorporated. Isoschizomers (2013). Available at <https://www.neb.com/tools-and-resources/selection-charts/isoschizomers>
41. Song, C.-X., Yu, M., Dai, Q. & He, C. Detection of 5-hydroxymethylcytosine in a combined glycosylation restriction analysis (CGRA) using restriction enzyme Taq(α)I. Bioorganic & Medicinal Chemistry Letters 21, 5075–7 (2011).
42. Szwagierczak, A. et al. Characterization of PvuRts1I endonuclease as a tool to investigate genomic 5-hydroxymethylcytosine. Nucleic Acids Research 39, 5149–56 (2011).
43. Hayatsu, H. & Shiragami, M. Reaction of bisulfite with the 5-hydroxymethyl group in pyrimidines and in phage DNAs. Biochemistry 18, 632–637 (1979).
44. Huang, Y. et al.
The behaviour of 5-hydroxymethylcytosine in bisulfite sequencing. PLoS ONE 5, e8888 (2010).
45. Münzel, M. et al. Quantification of the sixth DNA base hydroxymethylcytosine in the brain. Angewandte Chemie (International ed. in English) 49, 5375–7 (2010).
46. Djuric, Z., Luongo, D. A. & Harper, D. A. Quantitation of 5-(hydroxymethyl)uracil in DNA by gas chromatography with mass spectral detection. Chemical Research in Toxicology 4, 687–691 (1991).
47. Eick, D., Fritz, H. J. & Doerfler, W. Quantitative determination of 5-methylcytosine in DNA by reverse-phase high-performance liquid chromatography. Analytical Biochemistry 135, 165–71 (1983).
48. Rosch, L., John, P. & Reitmeier, R. Silicone Compounds, Organic. Ullmann’s Encyclopedia of Industrial Chemistry (2000).
49. Hayatsu, H., Wataya, Y., Kai, K. & Iida, S. Reaction of sodium bisulfite with uracil, cytosine, and their derivatives. Biochemistry 9, 2858–2865 (1970).
50. Flusberg, B. A. et al. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nature Methods 7, 461–5 (2010).
51. Booth, M. J. et al. Quantitative sequencing of 5-methylcytosine and 5-hydroxymethylcytosine at single-base resolution. Science 336, 934–7 (2012).
52. Brown, T. Dot and slot blotting of DNA. Current Protocols in Molecular Biology / edited by Frederick M. Ausubel ... [et al.] Chapter 2, Unit 2.9B (2001).
53. Lister, R. et al. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133, 523–36 (2008).
54. Van den Broeck, D. et al. Transposon Display identifies individual transposable elements in high copy number lines. The Plant Journal 13, 121–9 (1998).
55. Syed, N. H. & Flavell, A. J. Sequence-specific amplification polymorphisms (SSAPs): a multi-locus approach for analyzing transposon insertions. Nature Protocols 1, 2746–52 (2006).
56. Grzebelus, D., Jagosz, B. & Simon, P. W. The DcMaster Transposon Display maps polymorphic insertion sites in the carrot (Daucus carota L.) genome.
Gene 390, 67–74 (2007).
57. Volkmar, M. et al. DNA methylation profiling identifies epigenetic dysregulation in pancreatic islets from type 2 diabetic patients. The EMBO Journal 31, 1405–26 (2012).
58. Dereeper, A. et al. Phylogeny.fr: robust phylogenetic analysis for the non-specialist. Nucleic Acids Research 36, W465–9 (2008).
59. Cutler, S., Ghassemian, M., Bonetta, D., Cooney, S. & McCourt, P. A Protein Farnesyl Transferase Involved in Abscisic Acid Signal Transduction in Arabidopsis. Science 273, 1239–1241 (1996).
60. Running, M. P., Fletcher, J. C. & Meyerowitz, E. M. The WIGGUM gene is required for proper regulation of floral meristem size in Arabidopsis. Development 125, 2545–53 (1998).

Appendix A: “Shotgun mapping”: Mapping genetic mutations with high-throughput sequencing data

A.1: Introduction

Classical genetics utilized phenotype analysis and the principles of genetic linkage and recombination to produce linkage, or genetic, maps of chromosomes. Since the advent of molecular genetics, researchers have added PCR-based techniques and sequenced genomes to create physical maps of chromosomes. Recent advances in high-throughput sequencing allow researchers pursuing forward genetics projects to quickly and cheaply map genetic mutations to specific genomic regions. This appendix outlines an approach to using RNA-seq data for “rough” mapping of a genetic mutation. Validated with data from standard PCR-based mapping techniques and Sanger sequencing, “shotgun mapping” has proven to be a valuable first step in mapping genetic mutations. While this approach does not allow for the precision of certain PCR-based “fine” mapping techniques, it does allow researchers to quickly and accurately discover a genomic region (in this example, approx. 1 Mbase in Arabidopsis thaliana) on which they can focus molecular and computational resources to discover the precise physical location of the mutant locus.
Additionally, while this example used RNA-seq data, data from any high-throughput sequencing method could be used.

A.1: Background

A morphological mutant phenotype was discovered in an Arabidopsis thaliana Columbia-0 (col) ecotype h1.1 h1.2 double T-DNA mutant (At1g06760, SALK_128430C; AT2G30620, GABI_406H11; Yvonne Kim, UC-Berkeley). Features of this mutant phenotype included delayed germination, a compact stature (i.e., short internode length), abnormal numbers of floral organs (specifically petals) and rosette leaves at maturity, abnormal phyllotaxy, late flowering, and other rarer morphological defects. The mutant phenotype appeared in the F2 generation after hybridization with a non-mutant background and was thus assumed to be a recessive mutation. Additionally, the mutation was present in plants that were not homozygous for the double mutation and, in some cases, in plants homozygous wild-type for both genes, suggesting that a novel mutation had been discovered. In order to map the mutant locus, mutant A. thaliana col plants were crossed with wild-type A. thaliana Landsberg erecta (ler) ecotype plants (mut col female × wt ler male). Plants of the F1 generation were allowed to self-fertilize, and selfed F2 seed was used to produce a mapping population. F2 individuals were phenotyped as wild-type-like or mutant based on abnormal floral organ number phenotype, where plants with at least one flower having greater than four petals were considered mutants. These two populations were then used in two separate mapping experiments.

A.2: Shotgun mapping using RNA-seq

Tissue from rosette leaves of mutant and wild-type phenotype populations was collected. RNA was extracted from the bulked tissue for creation of two mutant phenotype (technical replicates) and one wild-type-like phenotype Illumina RNA-seq libraries.
While the initial purpose of creating the RNA-seq libraries from the mapping population was to look in the mutant phenotype population libraries for candidate genes that displayed aberrant expression levels compared to the wild-type-like population library, we found a novel use of the data that allowed for mapping with a resolution on par with classical and “rough” PCR-based mapping. Due to linkage to the col background mutation, F2 plants selected for the mutation should be enriched for col loci and depleted in ler loci around the mutation, with bulk incidence of col loci increasing and ler loci decreasing, respectively, on the mutation-containing chromosomes while walking toward the mutant locus (Fig. A.1). RNA-seq reads were mapped to col and ler genomes to compile data on gene expression. Ecotype-specific mappings, based on polymorphisms between col and ler genomes, were used as a genotype proxy, and these data were analyzed to discover any regions in the mutant population RNA-seq dataset where col-specific reads were overrepresented relative to ler-specific reads.

Fig. A.1: Explanation of ratios of col to ler alleles from wild-type and mutant chromosomes for mapping. Assuming no contamination from the other selected population, the expectation is that the mutant chromosomes used for libraries should contain all col alleles around the mutant locus, while the wild type-like chromosomes used for libraries should contain ler and col alleles in a 2:1 ratio (given a 2:1 ratio of heterozygous mutant to homozygous ler individuals) around the locus. These numbers reflect the selection of plants homozygous for the mutant locus in the mutant mapping population and either homozygous or heterozygous for the wild-type ler locus in the wild type-like mapping population.
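The expected allele ratios in Fig. A.1 follow from 1:2:1 Mendelian segregation in the F2. The short sketch below works through that arithmetic under simplifying assumptions that are not tested here: a fully recessive mutation, perfect phenotyping, no recombination between the marker and the mutant locus, and equal representation of both alleles in the reads. Variable and function names are illustrative.

```python
from fractions import Fraction
from math import log2

# Expected F2 genotype frequencies at the mutant locus after selfing a col/ler F1.
F2_GENOTYPES = {
    ("col", "col"): Fraction(1, 4),  # homozygous mutant (col background)
    ("col", "ler"): Fraction(1, 2),  # heterozygous, wild-type-like
    ("ler", "ler"): Fraction(1, 4),  # homozygous ler, wild-type-like
}


def allele_fractions(pool):
    """Fraction of col and ler alleles in a bulked pool of genotypes."""
    counts = {"col": Fraction(0), "ler": Fraction(0)}
    for genotype, weight in pool.items():
        for allele in genotype:
            counts[allele] += weight
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


# Bulk the F2 by phenotype: only col/col plants show the recessive mutant phenotype.
mutant_pool = {g: w for g, w in F2_GENOTYPES.items() if g == ("col", "col")}
wt_like_pool = {g: w for g, w in F2_GENOTYPES.items() if g != ("col", "col")}

mutant = allele_fractions(mutant_pool)
wt_like = allele_fractions(wt_like_pool)

print("mutant pool:", mutant)
print("wild-type-like pool:", wt_like)
print("wild-type-like log2(ler/col):", log2(wt_like["ler"] / wt_like["col"]))
```

Under these assumptions the wild-type-like pool gives a ler:col ratio of 2:1, that is, log2(ler/col) = 1, while the mutant pool contains only col alleles at the locus, so its log2 ratio is driven strongly negative in real data.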
The analysis of ecotype-specific mappings of RNA-seq data shows a region of chromosome 5 in which col reads were on average approximately 16 times more prevalent than ler reads, with a minimum in the ler:col read ratio being reached around Chr5:16 Mbase (Fig. A.2E). In this same region of chromosome 5 in wild type-phenotype plants, col reads were approximately 2 times less prevalent than ler reads. On no other chromosome did the read ratio diverge so greatly from unity (Figs. A.2A-D). Thus, based on this analysis, the physical location of the mutant locus appeared to be on chromosome 5 near the 16 Mbase mark.

A.3: Validation of shotgun sequencing-based mapping with PCR-based mapping techniques

To validate the shotgun sequencing-based mapping approach and further resolve the region containing the putative mutation, DNA extracted from tissue collected from mutant phenotype individuals was used for PCR-based mapping. Using INDEL markers (Table A.1), “rough” mapping of all Arabidopsis chromosomes in mutant phenotypes of the F2 mapping population indicated linkage of a section of Chromosome 5, from approximately 9.5 to 18.8 Mbases, to the mutation. Finer mapping of a population of ~250 mutant phenotype individuals, using INDEL, SSLP and CAPS markers, narrowed down the region to between approximately 15.8 and 16.5 Mbases. Another round of increasingly finer mapping further narrowed down the region to between approximately 16.07 and 16.26 Mbases. A final round of “fine” mapping was completed on five individuals from the general mapping population that were heterozygous for one of the previous markers at 16.07 and 16.26 Mbase. This round reduced the mutation-containing region to approximately 119 kbases, at approximately Chr5:16.083-16.202 Mbases (TAIR10 coordinates), roughly between At5g40240 and At5g40460 (Table A.1).
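The read-ratio analysis from Section A.2 can be sketched as follows. This is a minimal illustration, not the custom scripts used in this study: the per-gene read counts are invented, and the pseudocount and three-gene smoothing window are assumptions chosen only to keep the example small.

```python
from math import log2


def log2_ratio(ler_reads, col_reads, pseudocount=1.0):
    """log2(ler/col) with a pseudocount so genes lacking ler reads stay finite."""
    return log2((ler_reads + pseudocount) / (col_reads + pseudocount))


def smoothed_minimum(genes, window=3):
    """Find the window of consecutive genes with the lowest mean log2(ler/col).

    genes: list of (position_bp, col_reads, ler_reads), sorted by position.
    Returns (position of the window's centre gene, mean log2 ratio).
    """
    ratios = [log2_ratio(ler, col) for _, col, ler in genes]
    best = None
    for start in range(len(genes) - window + 1):
        mean = sum(ratios[start:start + window]) / window
        centre = genes[start + window // 2][0]
        if best is None or mean < best[1]:
            best = (centre, mean)
    return best


# Hypothetical ecotype-specific read counts along chromosome 5 (mutant pool).
toy_chr5 = [
    (14_000_000, 40, 20),
    (15_000_000, 60, 10),
    (15_800_000, 90, 5),
    (16_100_000, 160, 9),
    (16_500_000, 95, 6),
    (17_400_000, 70, 12),
    (18_800_000, 45, 22),
]

position, mean_ratio = smoothed_minimum(toy_chr5)
print(f"lowest mean log2(ler/col) = {mean_ratio:.2f} near {position / 1e6:.1f} Mbase")
```

With these invented counts the smoothed minimum falls at the 16.1 Mbase entry. With real data, the window size and pseudocount would need to be chosen with gene density and read depth in mind.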
Computational searches uncovered a section of sequence in ERA1/WIGGUM (At5g40280) that was missing from the mutant RNA-seq dataset (Daniel Zilberman, unpublished data)59,60. The era1 phenotype has been shown to include defects in meristem organization, which explains the observed phenotype discussed in Section A.1. Due to the similarities in phenotype and the missing sequence fragments in the sequencing dataset, we further investigated the mutant population’s era1 allele.

Fig. A.2: Shotgun mapping analysis. Analysis of ecotype-specific RNA-seq read mappings (A-E) for each Arabidopsis thaliana chromosome. The vertical axis shows log2 (ler reads / col reads) for every gene along the chromosome. Red, wild-type phenotype libraries; blue, mutant phenotype libraries.

Table A.1: Summarized, selected PCR-based mapping data

Marker Name    Marker Type  TAIR8 Chr. 5 Position (bp)  COL  LER  HET  NA  TOTAL  % COL (w/o NAs)
5-24           INDEL         9478916                    157    4   44  12    217   76.6%
EcoRI-5.1432   CAPS         14316457                     74    1    7   1     83   90.2%
EcoRI-5.1484   CAPS         14840703                     76    0    5   2     83   93.8%
Hind3-5.1535   CAPS         15350649                     80    0    1   2     83   98.8%
5-43           INDEL        15810645                    201    0    6   9    216   97.1%
Hind3-5.1609   CAPS         16081197                     82    0    1   0     83   98.8%
EcoRI-5.1610   CAPS         16097698                      3    0    2   0      5   60.0%
DraI-5.1610    CAPS         16101040                      4    0    1   0      5   80.0%
DraI-5.1611    CAPS         16114939                      5    0    0   0      5  100.0%
SspI-5.1612    CAPS         16115519                      5    0    0   0      5  100.0%
SphI-5.1616    CAPS         16157378                      5    0    0   0      5  100.0%
SspI-5.1621    CAPS         16213089                      5    0    0   0      5  100.0%
SspI-5.1622    CAPS         16216179                      4    0    1   0      5   80.0%
BamHI-5.1624   CAPS         16239636                      3    0    2   0      5   60.0%
EcoRI-5.1624   CAPS         16243726                      3    0    2   0      5   60.0%
XbaI-5.1625    CAPS         16249866                      3    0    2   0      5   60.0%
EcoRI-5.1626   CAPS         16262592                     81    0    1   1     83   98.8%
5-48           INDEL        16533089                    205    1    7   4    217   96.2%
SSLP6317       SSLP         17040972                     74    0    3   6     83   96.1%
SSLP6437       SSLP         17363749                     79    0    4   0     83   95.2%
5-57           INDEL        18803896                    160   12   33  12    217   78.0%

Table A.1 (cont.)
Marker Name    Primer, forward        Primer, reverse
5-24           TGTGGCACAGGGTTTGTAAG   AAAGCCAGCCAATGTTTCAC
EcoRI-5.1432   GGAAATCTTCAAAACTTCAA   AGAGGTGCTCTGCTTAAATA
EcoRI-5.1484   GAGGAACTTGAAAAATGAGA   GATCATGAGAATTTTCCAAC
Hind3-5.1535   AACAGCATAAGAATCAAACC   ACATATCTTTTGGACTTTCG
5-43           GAAGTGTGGCTCTCCAATCC   AAAGCACAAGCCATTTGACC
Hind3-5.1609   GTTCAACCTCTGCAAATACT   TTACCCATCTCTGATTCTGT
EcoRI-5.1610   GAGGAACATAACACCCATAG   CTTACACCTCCATCACCTAC
DraI-5.1610    CTCCATCCTTAATGAGTCAC   TTCAACACTTCTCCTTCTTC
DraI-5.1611    AACTGAATTATCACGGATGT   CAATAAATCGATTCCCTAGA
SspI-5.1612    TCTAGGGAATCGATTTATTG   ACTTGGCTGTTATATTCAGG
SphI-5.1616    ACTAACCTATTTCCCCATTT   TGTTACACAAGCGATCTAAA
SspI-5.1621    ATCTCAAAGAAGGAGAGGAT   ACAAATCACCATTCAAGATT
SspI-5.1622    TTTTGCACCCTAAAAGTATT   AATCAGTTCATACGAAAAGG
BamHI-5.1624   ATTCACACTGGAAATTTGTT   AGAGTCAAGTCAAGAACGAG
EcoRI-5.1624   TTAACTCTGGCTTTTGATTT   CTAGTTACCTGAGTCCCTGA
XbaI-5.1625    TGGGAGTGACATAGAGAGAT   ATTGCATCTTTGTTTAGACC
EcoRI-5.1626   GTTTGCATAGGAAACAAAGT   CTAATTGCTTCAAAGAAACC

5-48, SSLP6317, SSLP6437, 5-57 (primer sequences in source order; the forward/reverse assignment is not recoverable from the source layout): TGCGTTGCAAGAAATTATCG CAGACGTATCAAATGACAAATG AAGGATCTCGTCTTCAATAG GGACAAAGAGGGCGTTGATA AACACCAAAGCTGCCAGAAT GACTACTGCTCAAACTATTCGG GTACTTAGCGTCGCACAC TCAGGCTGCAGTAGTTTGGA

1_mut_F2                       TGACAGCTCACTCTTTATGCATTCGTATCGCTGTTAATGCCATACCTTCA
5_mut_F2                       TGACAGCTCACTCTTTATGCATTCGTATCGCTGTTAATGCCATACCTTCA
9_mut_F2                       TGACAGCTCACTCTTTATGCATTCGTATCGCTGTTAATGCCATACCTTCA
12_mut_F2                      TGACAGCTCACTCTTTATGCATTCGTATCGCTGTTAATGCCATACCTTCA
13_mut_F2                      TGACAGCTCACTCTTTATGCATTCGTATCGCTGTTAATGCCATACCTTCA
COL_WT                         TGACAGCTCACTCTTTATGCATTCGTATCGCTGTTAATGCCATACCTTCA
LER_WT                         TGACAGCGCACTCTTTATGCATTCGTATCNCTGTTAATGCCATACCTTCA
REF_SEQ_At5g40280_exon6-exon8  TGACAGCTCACTCTTTATGCATTCGTATCGCTGTTAATGCCATACCTTCA
                               ******* ********************* ********************

1_mut_F2                       GTCAT---------------------------------------------
5_mut_F2                       GTCAT---------------------------------------------
9_mut_F2                       GTCAT---------------------------------------------
12_mut_F2                      GTCAT---------------------------------------------
13_mut_F2                      GTCAT---------------------------------------------
COL_WT                         GTCATGTTGTTTTTTTAATTCTTGCTTAATTCTACTTACTCACTGATCGT
LER_WT                         GTCATGTTGTTTTTTTAATTCTTGCTTAATTCTACTTACTCACTGATCGT
REF_SEQ_At5g40280_exon6-exon8  GTCATGTTGTTTTTTTAATTCTTGCTTAATTCTACTTACTCACTGATCGT
                               *****

1_mut_F2                       --------------------------------------------------
5_mut_F2                       --------------------------------------------------
9_mut_F2                       --------------------------------------------------
12_mut_F2                      --------------------------------------------------
13_mut_F2                      --------------------------------------------------
COL_WT                         TAGGATGCATGATATGGGAGAAATGGATGTTCGTGCATGCTACACTGCAA
LER_WT                         TAGGATGCATGATATGGGAGAAATGGATGTTCGTGCATGCTACACTGCAA
REF_SEQ_At5g40280_exon6-exon8  TAGGATGCATGATATGGGAGAAATGGATGTTCGTGCATGCTACACTGCAA

1_mut_F2                       -TTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTTTGTGGA
5_mut_F2                       -TTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTTTGTGGA
9_mut_F2                       -TTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTTTGTGGA
12_mut_F2                      -TTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTNTNNNNA
13_mut_F2                      -TTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTTTGTGGA
COL_WT                         TTTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTTTGTGGA
LER_WT                         TTTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTTTGTGGA
REF_SEQ_At5g40280_exon6-exon8  TTTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTTTGTGGA
                                ****************************************** *    *

Fig. A.3: Alignments of Sanger sequencing data. A region around exon 7 of ERA1 in mutant mapping population (“mut_F2”) and wild-type (“WT”) individuals was PCR amplified, Sanger sequenced, and aligned against a reference sequence. Red, intron 6; blue, exon 7; orange, intron 7.

A.4: Discovery of a genetic mutation using shotgun sequencing data and Sanger sequencing

RNA-seq data alignments of the region showed that the missing sequence corresponded to exon 7 of ERA1.
It was hypothesized that the missing exon in ERA1 cDNA was due to either a misspliced pre-mRNA (arising as a consequence of a DNA point mutation) or a deletion of all or part of the exon in the mutant genome. The region around exon 7 of ERA1 was PCR amplified from five mapping population mutant individuals and from col and ler wild-type individuals, and the fragments were Sanger sequenced. Analysis of the Sanger sequencing data showed that a 96 bp region of ERA1 intron 6 and exon 7 was missing in all mutant individuals but not in col or ler wild-type individuals (Fig. A.3). Due to the deletion of this large section of intron 6 and exon 7, the intron 6/exon 7 splice acceptor site is missing. To explain the lack of the remainder of exon 7 in spliced mRNA/cDNA, it is hypothesized that the exon 6/intron 6 splice donor site uses the next available splice acceptor site at the intron 7/exon 8 boundary, leading to a mature mRNA that is missing what remains of exon 7 in the mutant individuals (Fig. A.3).

A.5: Summary

High-throughput sequencing data provides researchers with the ability to quickly and confidently map genetic mutations. Using simple analytical techniques on this data, researchers pursuing forward genetics projects can roughly map the location of a given mutation to relatively small regions of chromosomes. In the example given here, a mutation in Arabidopsis thaliana was mapped to approximately 1 Mbase using RNA-seq data from a mapping population. After rough mapping, researchers can use this information to focus on finer mapping via PCR-based techniques as well as detailed computational searches for potential mutant loci. As shown here with molecular mapping data, computational results, and Sanger sequencing, “shotgun mapping” can be a useful and accurate first step when mapping genetic mutations.
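The 96 bp deletion reported in Section A.4 can be recovered by counting gap columns in the alignment of Fig. A.3. The sketch below is only an illustration: the function name is hypothetical, and the two rows are the 1_mut_F2 and reference sequences transcribed from the figure and concatenated across its four blocks.

```python
def gap_runs(aligned_sample, aligned_reference):
    """Return (start_column, length) for each run of gaps in the sample
    that is aligned against bases in the reference (0-based columns)."""
    if len(aligned_sample) != len(aligned_reference):
        raise ValueError("aligned rows must have equal length")
    runs, start = [], None
    for i, (s, r) in enumerate(zip(aligned_sample, aligned_reference)):
        in_gap = s == "-" and r != "-"
        if in_gap and start is None:
            start = i
        elif not in_gap and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(aligned_sample) - start))
    return runs


# Rows transcribed from Fig. A.3 (four blocks of 50 columns each).
reference = (
    "TGACAGCTCACTCTTTATGCATTCGTATCGCTGTTAATGCCATACCTTCA"
    "GTCATGTTGTTTTTTTAATTCTTGCTTAATTCTACTTACTCACTGATCGT"
    "TAGGATGCATGATATGGGAGAAATGGATGTTCGTGCATGCTACACTGCAA"
    "TTTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTTTGTGGA"
)
mutant_1 = (
    "TGACAGCTCACTCTTTATGCATTCGTATCGCTGTTAATGCCATACCTTCA"
    "GTCAT" + "-" * 45
    + "-" * 50
    + "-" + "TTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTTTGTGGA"
)

for start, length in gap_runs(mutant_1, reference):
    print(f"deletion of {length} bp starting at alignment column {start + 1}")
```

The single gap run spans 45, 50 and 1 columns of the second, third and fourth alignment blocks, which sums to the 96 bp stated in the text.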
A.6: Methods

A.6.1: RNA-seq library construction and sequencing data analysis

RNA was isolated from mature leaf tissue of greenhouse-grown control and mutant phenotype Arabidopsis thaliana plants; the tissue was frozen in liquid nitrogen and ground with mortar and pestle. Total RNA was isolated and purified with a QIAGEN RNeasy Mini Plant kit according to the manufacturer’s recommended protocol. To purify mRNA from total RNA, two rounds of the QIAGEN Oligotex Direct mRNA Mini Kit followed by the Invitrogen RiboMinus Plant Kit were used, according to the manufacturers’ recommended protocols. After mRNA isolation, mRNA was fragmented with Ambion RNA Fragmentation Reagents (Invitrogen) according to the manufacturer’s recommended protocol. Fragmented mRNA was used to synthesize double-stranded cDNA: first strand synthesis used Invitrogen SuperScript III and random hexamer primers, followed by an RNaseH and DNA Pol I reaction for RNA digestion and second strand synthesis, each according to the manufacturer’s recommended protocol. After cDNA synthesis, RNA-seq libraries were constructed using the end repair, 3’ A base addition, adapter ligation, PCR amplification, and library analysis steps outlined in Section 2.7.7. For RNA-seq libraries, unmethylated adapters were used and all reaction clean-up steps used the QIAGEN MinElute PCR Purification Kit. RNA-seq libraries were sequenced at the QB3/Genomic Sequencing Laboratory on Illumina HiSeq 2000 sequencers. Sequencing data was processed and analyzed as outlined in Section 2.7.8.

A.6.2: PCR-based mapping

DNA was isolated from mature leaf tissue of greenhouse-grown control and mutant phenotype Arabidopsis thaliana plants using the DNA extraction protocol outlined in Section 2.7.2, except that tissue was ground with extraction buffer directly in microcentrifuge tubes and the phenol:chloroform:isoamyl alcohol step was skipped.
GoTaq Green Master Mix and reaction-appropriate primers (Table A.1) were used for PCR amplifications according to the manufacturer’s recommended protocol. For CAPS markers, 5 µl of each PCR product was digested according to the manufacturer’s recommended protocol with 0.5 µl reaction-appropriate restriction enzyme, 2 µl enzyme buffer, and 12.5 µl water. Products were analyzed on 1 to 4% (w/v) agarose TAE electrophoresis gels, as appropriate for product size and needed resolution. Mapping data was organized and analyzed in Microsoft Excel.

A.6.3: Sanger sequencing

DNA from homozygous mutants in the PCR mapping population was used for PCR-based amplification of fragments of A. thaliana ERA1 and Sanger sequencing of the amplified fragments. After amplification, samples were cleaned using Agencourt AMPure XP magnetic beads and prepared for Sanger sequencing at the UC-Berkeley DNA Sequencing Facility by including the facility-suggested concentration of sequencing primers. Fragment and TAIR reference sequences were then aligned for analysis using MUSCLE.
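The CAPS digest volumes given in Section A.6.2 sum to a 20 µl reaction. As a small worked example, the sketch below scales the shared components to a master mix for several digests. The function name and the 10% pipetting overage are assumptions for illustration and are not part of the protocol described above.

```python
# Per-reaction volumes in microliters, as listed in Section A.6.2.
PER_REACTION_UL = {
    "PCR product": 5.0,
    "restriction enzyme": 0.5,
    "enzyme buffer": 2.0,
    "water": 12.5,
}


def digest_master_mix(n_reactions, overage=0.10):
    """Volumes (in µl) of the shared components for n_reactions digests.

    The PCR product is left out because it differs per sample and is added
    to each tube individually. The overage fraction is an assumed allowance
    for pipetting loss.
    """
    scale = n_reactions * (1 + overage)
    return {
        component: round(volume * scale, 2)
        for component, volume in PER_REACTION_UL.items()
        if component != "PCR product"
    }


print(sum(PER_REACTION_UL.values()))  # 20.0
print(digest_master_mix(10))
```

Each tube would then receive 15 µl of master mix plus 5 µl of its own PCR product.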