5-hydroxymethylcytosine in Eukaryotic Genomes
by
Thomas John Haas
A thesis in partial satisfaction of the
requirement for the degree of
Master of Science
in
Plant Biology
in the
Graduate Division
of the
University of California, Berkeley
Committee in charge:
Professor Daniel Zilberman, Chair
Professor Robert L. Fischer
Professor Patricia Zambryski
Spring 2013
5-hydroxymethylcytosine in Eukaryotic Genomes
Copyright 2013
by
Thomas John Haas
Abstract
5-hydroxymethylcytosine in Eukaryotic Genomes
by
Thomas John Haas
Master of Science in Plant Biology
University of California, Berkeley
Professor Daniel Zilberman, Chair
The genomes of some eukaryotic organisms contain a modified cytosine, 5-methylcytosine (5-mC), as an epigenetic mark to help control genetic elements. In some mammals, the TET
dioxygenases catalyze modification of 5-mC to 5-hydroxymethylcytosine (5-hmC). These
enzymes and the resultant modified base likely act as a part of the epigenetic system of mammals
and play important roles in demethylation and genetic regulation. Genomic analysis shows that
putative TET-like dioxygenases exist in other eukaryotic organisms, but the presence of 5-hmC
in the genomes and the roles that it and the TET-like dioxygenases may play in the epigenomic
landscapes of these organisms are unknown.
In this thesis, the presence of 5-hmC and the activity of putative TET-like dioxygenases in
selected fungi, green algae and plants are studied. First, several 5-hmC detection methods,
including immunodetection, mass spectrometry, and oxidative bisulfite sequencing, are
attempted. 5-hmC is likely present in some fungi and green algae, and 5-hmC production is
decreased in selected fungi and green algae by competitive inhibition of dioxygenases with 2-hydroxyglutarate. Additionally, some fungal and green algal TET-like dioxygenases may be
components of novel putative transposons, and the potential for putative transposon-associated
dioxygenases to act in an autoderepressive manner is hypothesized. Furthermore, the structure of
these putative transposons is used as a template for the speculation and hypothetical design of
sequence-specific epigenetic modification systems.
Finally, an appendix details a new mapping technique termed “shotgun mapping.” Shotgun
mapping is a unique application and analysis of high-throughput sequencing data from mapping
populations to quickly and accurately determine the physical location of genetic mutations.
Shotgun mapping proof-of-concept is shown by mapping a novel era1 genetic mutation in
Arabidopsis thaliana. The shotgun mapping approach is validated by including PCR-based
mapping and Sanger sequencing data of the discovered era1 mutation.
Table of Contents
Chapter 1: Thesis Introduction
Chapter 2: Methods for and results of chemical, biochemical and sequencing-based detection of 5-hydroxymethylcytosine in various eukaryotes
Chapter 3: Hypothesis for function of 5-hydroxymethylcytosine in putative fungal transposons and schemata for engineering fused and modular epigenetic modification systems
Chapter 4: Thesis Summary
References
Appendix A: “Shotgun mapping”: Mapping genetic mutations with high-throughput sequencing data
Figure List
Fig. 1.1: Overview of cytosine modification reactions
Fig. 2.1: Analysis of 5-hmC specificity of PvuRts1I
Fig. 2.2: 5-hmC antibody test blots
Fig. 2.3: Immunodetection of 5-hmC
Fig. 2.4: 2-HG inhibition of algal and fungal dioxygenases
Fig. 2.5: Overview of bisulfite-based sequencing methods
Fig. 2.6: Testing of methylated DNA standards libraries
Fig. 3.1: Overview of putative transposon-associated TET-like dioxygenases
Fig. 3.2: Example scheme of a modular, sequence-specific epigenetic modification system
Fig. A.1: Explanation of ratios of col to ler alleles from wild-type and mutant chromosomes for mapping
Fig. A.2: “Shotgun mapping” analysis
Fig. A.3: Alignments of Sanger sequencing data
Table List
Table 2.1: Summary of 5-hmC immunodetection dot blot experiments
Table 2.2: Comparison of bases as seen after different sequencing methods
Table 2.3: Fisher’s exact test contingency table and formula for BS/oxBS paired datasets with “signed p-value”
Table A.1: Summarized, selected PCR-based mapping data
Acknowledgements
First, I’d like to thank Daniel and the rest of the Zilberman lab. Yvonne, thanks for your work,
help and advice. A second thanks to you for allowing me to work on the mapping project and
include it in this thesis. Toshiro, thanks for your perl-fu, whisk(e)y and ribaldry. Jessica, Jason,
Assaf and Devin, thanks for letting me take advantage of copious amounts of your materials and
time. Ping-Hung and Xiao, thanks for your moral support, which has been much needed. I’m
also grateful to my thesis committee members Pat Zambryski (who also helped me through my
first teaching experience) and Bob Fischer (and the rest of the Fischer lab).
I’d also like to thank my fellow classmates, especially Rie Uzawa, Sara Sirivanchai, Jake
Brunkard and Megan Cohen. Thank you all for discussions and support, especially in the past
year. You all helped me through a very challenging time.
Finally, I’d like to thank my parents, Dan and Lisa, and my great friends in the Bay Area, Nate,
Paul, Alison and Alyssa, who have all been wonderfully supportive over the past two and a half
years.
Chapter 1:
Thesis Introduction
1.1: Overview of nucleic acid methylation
DNA base modifications are known to play a number of important roles in the protection and
regulation of an organism’s genetic material. In many bacteria, methylation of cytosine and
adenine is often an important component of an organism’s restriction modification system. For
example, methylation can be used as a self-recognition mechanism. By methylating its own
DNA, the organism can then target exogenous, non-methylated DNA for degradation with
methylation-sensitive restriction enzymes1. This mechanism is advantageous to the survival of
the organism because it allows for the destruction of non-endogenously methylated viral DNA
with restriction enzymes that do not cut endogenously methylated sequences.
Some viruses have appropriated the methylation strategies of host restriction modification
systems. For example, T4 phage and other viruses encode DNA methylases, employing them
to modify their own DNA to imitate their bacterial hosts’ endogenously methylated DNA, thus
repelling the binding or activity of host methylation-sensitive restriction enzymes2. These viral
“antirestriction” tactics help evade the self-recognition strategy of the host, executed via its
restriction modification systems.
In eukaryotes, the modified DNA base 5-methylcytosine (5-mC) plays important roles in
epigenetic regulation3. In organisms that employ it, cytosine methylation is one element in the
complex landscape of the genetic regulatory system. DNA methylation is counted among several
other common genetic and epigenetic factors used in the larger regulatory scheme, which
includes the modification of histones with small functional groups and larger protein elements
and the binding of trans-acting transcriptional factors to chromatin4. Eukaryotic organisms that
do employ cytosine methylation generally use 5-mC as a mark through which genetic elements
can be repressed, including silencing of transposons, imprinting of genes or entire chromosomes,
and regulation of cell type identity and developmental stages through repression of specific
genetic elements5.
While not universal, methylated cytosine is found at significant levels in many branches of the
tree of life, including many vertebrate and invertebrate animals, plants and green algae, and
fungi6. The biochemical mechanism for methylation of nucleic acids is conserved in all known
eukaryotic, prokaryotic, and viral systems and uses an S-adenosylmethionine (SAM)-dependent
DNA methyltransferase. During cytosine methylation, the thio-bound methyl group of SAM is
transferred to carbon 5 (C5) of the cytosine (C), generating the modified base 5-mC and S-adenosylhomocysteine as products (Fig. 1.1A)7.
In eukaryotes, not just any cytosine will be methylated. In many organisms, including mammals,
predominantly cytosine-guanine dinucleotide (CG) sites are methylated in a symmetric fashion,
in which the cytosines on both strands are methylated3,6. Symmetric
methylation can then be reproduced after DNA replication by targeting “maintenance” DNA
methyltransferases to hemimethylated sites. These maintenance methyltransferases then
methylate daughter strand cytosines at CG sites in which parent strand cytosines are methylated.
This mechanism for symmetric methylation allows CG methylation patterns to be faithfully and
reliably propagated through cell generations.
Fig. 1.1: Overview of cytosine modification reactions. (A) DNA methyltransferase
catalyzes the reaction of cytosine with S-adenosylmethionine (SAM) to produce 5-methylcytosine (5-mC) and S-adenosylhomocysteine (SAH). (B) 5-mC is
hydroxylated by TET-like dioxygenases to produce 5-hydroxymethylcytosine in a 2-ketoglutarate (2-KG)- and Fe(II)-dependent co-oxidation reaction.
Additionally, some organisms have an ability to target methylation to sites other than CG
dinucleotides3. For example, in plants, significant amounts of methylation are found at CHG and
CHH trinucleotide sites, where “H” is anything other than guanine. This de novo methylation is
targeted, in part, by RNA-directed DNA methylation (RdDM) pathways, wherein RdDM
machinery is directed to specific genomic sites by small RNAs8.
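The sequence contexts described above can be made concrete with a short sketch. It classifies a cytosine as CG, CHG or CHH from the one or two bases that follow it on the same strand. The function name `cytosine_context` and the example sequence are illustrative assumptions and are not part of the analyses in this thesis.

```python
def cytosine_context(seq: str, i: int) -> str:
    """Classify the cytosine at index i of a 5'->3' sequence as CG, CHG or CHH.

    "H" means any base other than guanine. Only the strand given in `seq`
    is considered; the opposite strand would be classified separately.
    """
    seq = seq.upper()
    if seq[i] != "C":
        raise ValueError("position i is not a cytosine")
    following = seq[i + 1 : i + 3]  # the one or two bases 3' of the cytosine
    if len(following) >= 1 and following[0] == "G":
        return "CG"
    if len(following) == 2 and following[1] == "G":
        return "CHG"
    if len(following) == 2:
        return "CHH"
    return "unknown"  # too close to the 3' end to classify


if __name__ == "__main__":
    example = "ACGTCAGTCTTC"  # hypothetical sequence
    for position in (1, 4, 8, 11):
        print(position, cytosine_context(example, position))
```

Classifying each strand separately reflects the fact that CG and CHG sites are symmetric while CHH sites are not.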
1.2: Hydroxymethylation of pyrimidines
Further modification of methylated nucleic acids is possible. In the mid-20th century, researchers
found a hydroxymethylated pyrimidine, 5-hydroxymethylcytosine (5-hmC), as a component of
T4 phage DNA9,10. Since this time, this and other analogous modifications have been found in
viruses and members of Kinetoplastida and Mammalia, including humans and mice11–15.
Hydroxylation of the 5-methyl groups of pyrimidines is known to occur via three different
reactions. First, hydroxylation occurs randomly through reactive oxygen species. This reaction is
likely to occur at a regular but low level in all organisms, creating a trace amount of 5-hydroxymethylated pyrimidines, including 5-hmC in organisms that contain
large amounts of 5-mC. Second, in vitro experiments have shown that it is possible for SAM-dependent DNA methylases to directly add alkyl aldehydes, such as formaldehyde or
acetaldehyde, to cytosine, producing a 5-hydroxyalkylcytosine derivative16. Using formaldehyde,
this reaction produces 5-hmC. However, whether or not this reaction occurs in living systems is
not known. Lastly, enzymes containing a dioxygenase domain may hydroxylate 5-methylated
pyrimidines11,17. This last reaction will be the focus of further discussion of the creation of 5-hmC.
1.3: Dioxygenases and 5-hmC production
The dioxygenase domain is a component of the TET and TET-like proteins in humans and mice
and the JBP enzymes in some members of the Kinetoplastida, as well as many other proteins in
many varied organisms18. These dioxygenase domains in the TET-like subset have been
specifically shown to catalyze the creation of 5-hmC in humans and mice11. The basic
dioxygenase reaction is a co-oxidation of two substrates with molecular oxygen. In the TET-like
dioxygenases used for oxidation of 5-mC to 5-hmC, the domain is dependent on Fe(II) and the
co-substrate, 2-ketoglutarate (2-KG), whose C2 carbon is oxidized and whose C1 carboxyl group
leaves, generating carbon dioxide and the dicarboxylic acid succinate19. These dioxygenase
domains have also been shown to be competitively inhibited by a 2-KG analog, 2-hydroxyglutarate (2-HG), which will be revisited later20.
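Competitive inhibition of this kind is commonly described with the Michaelis-Menten rate law, in which the inhibitor raises the apparent Km for the substrate without changing Vmax. The sketch below assumes that textbook model, with 2-KG as the substrate and 2-HG as the inhibitor. The function name and all parameter values are arbitrary placeholders in arbitrary units, not measured constants for any TET-like dioxygenase.

```python
def competitive_rate(substrate: float, inhibitor: float,
                     vmax: float = 1.0, km: float = 1.0, ki: float = 1.0) -> float:
    """Reaction rate under competitive inhibition (Michaelis-Menten form).

    v = Vmax * [S] / (Km * (1 + [I] / Ki) + [S])
    All default values are placeholders chosen only for illustration.
    """
    apparent_km = km * (1.0 + inhibitor / ki)
    return vmax * substrate / (apparent_km + substrate)


if __name__ == "__main__":
    # Hypothetical concentrations of 2-KG (substrate) and 2-HG (inhibitor).
    for two_hg in (0.0, 1.0, 10.0):
        print(two_hg, round(competitive_rate(substrate=1.0, inhibitor=two_hg), 3))
```

Under this model, a sufficiently high substrate concentration overcomes the inhibitor, which is the defining feature of competitive inhibition.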
Many other organisms have predicted TET-like proteins or dioxygenases with homology to the
dioxygenase domains of TET-like enzymes. In addition to those mentioned above, organisms
that have putative dioxygenase-containing enzymes predicted to act on 5-mC include members
of the basidiomycete fungi, chlorophyte algae, Heteroloboseans, and certain viruses18.
Additionally, these and many other organisms have predicted dioxygenase domains in other
protein families, which may be used for pyrimidine modification in RNA or play roles in DNA
base excision repair.
1.4: Glucosylation of 5-hydroxymethylated pyrimidines
The hydroxyl group opens up possibilities for further modification of hydroxymethylated
pyrimidines in DNA. Two well-known natural modifications, in which a glucose moiety is
added, have been found. These glucosylated bases are the J base (β-D-glucosylhydroxymethyluracil) in certain trypanosomes and relatives and β-D-glucosylhydroxymethylcytosine (5-gmC) in certain viruses (e.g., T4 phage)9,21. In T4 phage, 5-mC and 5-hmC are likely intermediates in the pathway to 5-gmC, which is produced by a β-glucosyltransferase. Like viral methylation and hydroxymethylation of cytosine, glucosylation of
already-modified cytosines is likely another antirestriction strategy employed by viruses, whereas in
Leishmania spp. the J base acts as a marker to stop transcriptional read-through13.
This glucosyl modification has already been used for detection of 5-hmC via transfer of a
chemically modified or isotopically labeled glucose. Because of the specificity of this reaction, it
may be possible to further label or functionalize DNA containing 5-hmC with engineered T4 β-glucosyltransferases or with other reactions using chemically modified glucosyl moieties via
“click chemistry” approaches22.
1.5: 5-hydroxymethylcytosine in the epigenetic landscape
Curiously, 5-hmC has been detected at relatively high levels in certain tissues and cell types in
mice and humans and is therefore likely to occur in other vertebrates containing TET-like
enzymes12,22–24. Because 5-hmC is a modification of 5-mC, an important part of the epigenetic
systems of metazoans, fungi and plants, 5-hmC likely also plays some role in the epigenetic
systems of the organisms that contain it. What this role is, exactly, is not clear, but several
putative and hypothetical functions have been recently discussed and tested in mammalian
systems. Three putative functions will be discussed here: First, as a method for passive
demethylation; second, as a signal for active demethylation; third, as an epigenetic mark.
1.5.1: Passive demethylation
Passive demethylation, also known as replication-dependent or replication-coupled demethylation, relies on
the dilution of symmetrically methylated (i.e., CG) sites via inaction of maintenance
methyltransferases at those sites after DNA replication. Thus, after one cell division of a
progenitor cell, the passively demethylated loci will be hemimethylated in the daughter cells and
thereafter unmethylated in all but two hemimethylated descendant cells. 5-hmC is a possible
mark for bypass of hemimethylated CG sites by maintenance methylases, in which 5-hmC at a
CG site “hides” the methyl mark from being read and prevents the methylation of the daughter
strand25. As a mechanism for demethylation of large amounts of 5-mC, such as in imprint
erasure, passive demethylation is more efficient than the proposed
active demethylation mechanisms discussed below.
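The dilution arithmetic described above can be sketched for a single symmetrically methylated CG site. The model is idealized: it assumes that maintenance methylation is completely absent at the site and that no de novo methylation occurs. The function name and the returned field names are illustrative assumptions.

```python
def passive_dilution(n_divisions: int) -> dict:
    """Fate of one symmetrically methylated CG site after n cell divisions
    in the complete absence of maintenance methylation (idealized model)."""
    if n_divisions < 0:
        raise ValueError("n_divisions must be non-negative")
    cells = 2 ** n_divisions
    if n_divisions == 0:
        fully, hemi = 1, 0  # the progenitor cell: both strands methylated
    else:
        fully, hemi = 0, 2  # each parental strand ends up in exactly one cell
    methylated_strands = 2 * fully + hemi
    return {
        "cells": cells,
        "fully_methylated": fully,
        "hemimethylated": hemi,
        "unmethylated": cells - fully - hemi,
        "methylated_strand_fraction": methylated_strands / (2 * cells),
    }


if __name__ == "__main__":
    for n in range(5):
        print(n, passive_dilution(n))
```

In this model the fraction of methylated strands halves with every division, so the mark becomes rare in the population of descendant cells after only a few rounds of replication.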
This passive mechanism has recently been shown in mouse primordial germ cells (PGCs),
whereby removal of 5-mC via a 5-hmC intermediate can occur in a passive manner. In these
cells, TET dioxygenases act to hydroxylate the majority of methylated sites in the genome.
Following cell division, hemihydroxymethylated CG sites are not maintained by
methyltransferases, leading to a dilution of 5-hmC and effectively removing 5-mC from the
population of descendant cells, thus removing imprinting marks26. Whether this mechanism for
demethylation occurs in other cell types or in instances other than imprint erasure in mouse
PGCs is not known.
1.5.2: Active demethylation
Active demethylation is the enzymatic removal and/or replacement of methylated cytosine or its
5-methyl group. Active demethylation via removal of bases by DNA glycosylase and
replacement of residues by base excision repair machinery plays important roles in both plants
and animals27–30. 5-hmC may act as a mark for active demethylation in some organisms through
three (related) mechanisms, briefly described below, which are direct removal and replacement
of 5-hmC bases, deamination of 5-hmC bases, which triggers replacement by recognition of T:G
mismatch, and continued oxidation of 5-hmC, which is followed by decarboxylation.
While active demethylation via a 5-hmC intermediate may be plausible for small regions, it
would not be as efficient a mechanism for genome-wide demethylation as passive demethylation.
If using base excision repair systems, active demethylation would also increase chances of
double-stranded breaks during genome-wide demethylation events during developmental
transitions.
1.5.2.1: Direct removal and repair
It is plausible that 5-hmC itself could be recognized by components of an organism’s base
excision repair system. Once recognized, 5-hmC residues could be removed by a DNA
glycosylase and replaced by cytosine using base excision repair machinery, generating an
unmethylated site.
1.5.2.2: Deamination, removal and repair
Alternatively, it is plausible that 5-hmC could be recognized by cytosine deaminases, such as
members of the AID/APOBEC family, and their interactors31,32. The 4-amino group of the 5-hmC could then be removed by the cytosine deaminase, creating 5-hydroxymethyluracil (5-hmU). The
resulting 5-hmU:G mismatch, analogous to a T:G mismatch, could then be recognized by base excision repair machinery, which would remove
the offending base and replace it with a cytosine, generating an unmethylated site.
1.5.2.3: Oxidation-based mechanisms
Mammalian TET-like dioxygenases are able to further oxidize 5-hmC to 5-formylcytosine and
then to 5-carboxycytosine (5-caC), and these modified bases have been found in mouse
DNA33,34. These groups themselves can trigger deamination and DNA glycosylase-mediated
base excision repair by recognition of oxidized forms35,36.
It has also been proposed but not shown that 5-caC may be decarboxylated, returning the site to
an unmethylated cytosine without the need for deamination, excision and repair. This
decarboxylation mechanism, while simple and elegant, has been experimentally shown to be
unlikely to occur in significant amounts24.
1.5.3: As an epigenetic mark
It is plausible that 5-hmC may itself be a novel epigenetic mark and play important roles in an
organism’s epigenetic system outside of simply being an intermediate in demethylation. Recent
work using pull-down and mass spectrometry-based proteomics techniques has shown that
highly specific 5-hmC “reader” proteins likely exist, though whether or not these 5-hmC
interactors act as intermediaries between 5-hmC-marked loci and regulatory components or other
chromatin modifying systems has not been strictly shown37. Curiously, in mouse embryonic stem
cells, high concentrations of 5-hmC do tend to overlap with bivalent histone marks (wherein both
active and repressive histone marks are present), which mark transcriptionally “poised” genes, or
those genes that can readily be expressed at high levels12,38. However, the concentration of 5-hmC at these sites may simply reflect its role as a demethylation intermediate, where repressive 5-mC
marks are in the process of being removed.
Interestingly, 5-hmC is a candidate for a component of an epigenetic oxygen sensor system39.
This hypothetical system uses the dioxygenase-catalyzed reaction, which requires both molecular
oxygen as well as 2-KG as a co-factor. 2-KG, a product of the tricarboxylic acid cycle, also
requires high oxygen concentration for its production. Thus, dioxygenase, acting as a
biochemical oxygen sensor, may transmit this information to the epigenome via 5-hmC
production. While no specific evidence exists for this mechanism, it does remain an intriguing
and plausible epigenetic mechanism for some organisms, and may extend to other chromatin
modifying systems that require 2-KG as a co-factor.
1.6: Introduction to thesis project
My work in this thesis attempts to investigate the presence of 5-hmC in a diverse selection of
eukaryotic genomes and to discover the genomic distribution of the modification in some 5-hmC-containing genomes. Additionally, I hypothesize that specific genetic
constructs found in putative transposons may use 5-hmC production to disguise themselves
from host repression mechanisms, and I use these constructs as a template to propose novel
molecular tools for epigenetic editing of specific loci. Finally, an appendix detailing the mapping
of an Arabidopsis thaliana mutant using an analysis of high-throughput genomic sequencing data
is added as an example of other work I have completed in the course of my graduate studies.
Chapter 2:
Methods for and results of chemical, biochemical
and sequencing-based detection of
5-hydroxymethylcytosine in various Eukaryotes
2.1: Introduction to 5-hmC detection methods
Several chemical and biochemical methods have been used to detect and quantify 5-hmC. The
first discovery of 5-hmC, published in 1953, used paper chromatography and analysis of UV
spectra to detect the base in phage DNA10. Since that time, analytical chemical methods (thin
layer chromatography and liquid chromatography/mass spectrometry), biochemical methods
(biochemical modification of 5-hmC, 5-hmC-sensitive restriction enzyme digestion, and
immunodetection), and sequencing-based methods have been used for 5-hmC detection. In this
section, I will first briefly outline modern methods for 5-hmC detection and quantification not
used in the experiments completed for this study. Then, I will discuss attempted detection
methods and results in detail.
2.2: Unused methods
2.2.1: Thin layer chromatography
Thin layer chromatography (TLC) is a technique used to separate the components of a solution
of chemicals by the movement of the chemical components in a solvent (mobile phase) through a
solid matrix (the adsorbent, or stationary phase). Depending on a chemical’s solubility in the
solvent and its attraction to the adsorbent, the chemicals will move through the adsorbent at
different rates, thereby separating the components of the mixture.
To detect 5-hmC and other bases using TLC, enzyme-fractionated DNA is end labeled with a
radioactive isotope (32P or 33P) using polynucleotide kinase and then digested to nucleotides
using phosphodiesterase and nuclease. This mixture is run on a TLC plate, often two-dimensionally to allow for greater separation of bases, and then exposed to electron-sensitive
film or camera to create an autoradiograph. This autoradiograph can be compared to standard
mixtures to detect the presence of 5-hmC and quantify the amount present in the mixture.
2.2.2: High-performance liquid chromatography
Like TLC, high-performance liquid chromatography (HPLC) is used to separate components of a
solution of chemicals by passing the mobile phase through a column-based stationary phase.
Also like TLC, chemicals are separated based on their interaction with the stationary phase. 5-hmC and other bases can be detected and quantified using HPLC after the DNA is digested to
nucleotides. The mixture of nucleotides is then run on HPLC equipment and individual bases are
detected using various spectroscopic detectors and by comparing these data to known
standards.
2.2.3: Glucosylation and chemical labeling
As discussed in Section 1.4, the hydroxyl group on 5-hmC opens up many possibilities for
chemical labeling. β-glucosyltransferase (β-GT) from T4 phage has been used to biochemically
label 5-hmC with glucose in vitro. In this reaction, β-GT transfers the glucose moiety from a
UDP-glucose to 5-hmC, producing 5-gmC.
This specific modification is an important reaction and useful for 5-hmC detection for several
reasons. First, this reaction is specific to 5-hmC, so only 5-hmC bases are glucosylated. Second,
the glucose moiety is large and chemically distinct. These properties allow glucosylation to be
used in tandem with other methods, such as various chromatographic and spectroscopic methods,
allowing 5-gmC residues to be more easily observed. Additionally, 5-gmC can be used as a
blocking group for restriction enzyme-based detection, as it is naturally used in some phages, and
as an epitope that can be specifically detected with antibodies. Lastly, a modified glucose can be
transferred. This glucose can be radioisotopically labeled for specific detection and
quantification of 5-hmC, or it can be chemically modified so that other chemistries may be used
to detect, quantify and/or pull down 5-hmC, such as has been used in azide-labeled
glucose/biotin click chemistry22.
2.3: 5-hmC-specific restriction enzymes
2.3.1: Introduction to 5-hmC detection with restriction enzymes
Restriction enzymes sensitive to the presence of 5-hmC or a modified 5-hmC are a convenient
method for potential genome-wide quantification and partial single base resolution of 5-hmC.
Two basic approaches have been proposed or used for this method of 5-hmC detection, briefly
outlined here.
In the first approach, two restriction enzymes are used. These restriction enzymes, called
isoschizomers, recognize the same base sequence but, whereas one restriction enzyme will be
“methylation sensitive” (i.e., it will usually not cut sites containing 5-mC or 5-hmC), the other
will be “methylation insensitive” (i.e., it will usually cut sites containing 5-mC or 5-hmC)40.
Using these isoschizomers with β-GT-modified DNA, in which the 5-hmC residues contain a
large “blocking group” that prevents the cutting of sites normally recognized by the methylation
insensitive restriction enzyme, cytosines at restriction enzyme-recognized sites can be
determined to be C, 5-mC, or 5-hmC41.
For example, the pair MspI and HpaII are isoschizomers that recognize the sequence C/CGG;
however, while HpaII is sensitive to 5-mC, MspI will cut sequences in which the internal
cytosine is 5-mC or 5-hmC40. To differentiate between methylated and hydroxymethylated sites,
a blocking group is added to 5-hmC via β-GT, creating 5-gmC. Thus, the C5 modification status
of the internal C can be determined: CCGG sites are cut by both HpaII and MspI, C(5-mC)GG
sites are cut by only MspI, and C(5-hmC/5-gmC)GG sites are not cut by either, due to the
blocking action of the glucose moiety added to the hydroxyl group. By size separation and bulk
DNA quantification of both digestions, the number of hydroxymethylated CCGG sites can be
determined. By single- or paired-end shotgun sequencing, highly hydroxymethylated CCGG sites
can be determined by dataset comparison, as CCGG sites that are shown as “cut” in both datasets
are likely unmethylated, CCGG sites that are often cut in MspI but not HpaII datasets may be
methylated, and CCGG sites that are often not cut in either may be hydroxymethylated.
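The decision logic of the isoschizomer approach can be summarized in a short sketch. It assumes β-GT-treated DNA and idealized, complete digestion by both enzymes. The function name `classify_ccgg` and the returned labels are illustrative assumptions rather than part of any published protocol.

```python
def classify_ccgg(hpaii_cuts: bool, mspi_cuts: bool) -> str:
    """Infer the status of the internal cytosine of a CCGG site in
    beta-GT-treated DNA from whether HpaII and MspI cut the site."""
    if hpaii_cuts and mspi_cuts:
        return "C"  # unmodified: cut by both enzymes
    if mspi_cuts and not hpaii_cuts:
        return "5-mC"  # methylation blocks HpaII only
    if not hpaii_cuts and not mspi_cuts:
        return "5-hmC (glucosylated to 5-gmC)"  # glucose blocks both enzymes
    return "inconsistent"  # HpaII cutting without MspI cutting is not expected


if __name__ == "__main__":
    for hpaii, mspi in ((True, True), (False, True), (False, False), (True, False)):
        print(hpaii, mspi, classify_ccgg(hpaii, mspi))
```

In sequencing data, each site would be represented by cut frequencies rather than a single yes or no, so a real analysis would need thresholds or a statistical test in place of these Boolean inputs.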
A simpler second approach requires the use of a restriction enzyme with particular activity at
recognition sites that contain 5-hmC. Due to the groups’ similar size, nearly all known
methylation insensitive restriction enzymes are also hydroxymethylation insensitive42. However,
a restriction enzyme, PvuRts1I, has been reported to cut its recognition sequence42,
(5-hmC)N11-12/N9-10G, only when the cytosine at the first position is hydroxymethylated, and not
when that cytosine is methylated or unmethylated. Assuming
this action of PvuRts1I, genomic DNA could be digested and analyzed for bulk quantification of
5-hmC based on fragment size distribution compared to undigested sample, or be used for
shotgun sequencing library construction to map recognition sites that may contain 5-hmC,
without the use of isoschizomers and multiple dataset analysis.
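As a rough illustration of how candidate PvuRts1I sites could be located in a sequence, the sketch below searches both strands for a cytosine followed by 20 to 22 bases and then a guanine, which is one reading of the recognition sequence given above. The pattern, the function names and the choice to report every candidate cytosine are assumptions made for illustration. This is not necessarily how recognition sites were counted for the experiments in this thesis.

```python
import re

# C, then 11-12 + 9-10 = 20-22 arbitrary bases, then G.
# The lookahead allows overlapping candidate sites to be reported.
SITE = re.compile(r"(?=C[ACGT]{20,22}G)")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase or lowercase DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def candidate_sites(seq: str) -> tuple:
    """Return 0-based positions of candidate cytosines on the given strand
    and on its reverse complement (positions are relative to each strand)."""
    top_strand = seq.upper()
    bottom_strand = reverse_complement(seq)
    top = [m.start() for m in SITE.finditer(top_strand)]
    bottom = [m.start() for m in SITE.finditer(bottom_strand)]
    return top, bottom


if __name__ == "__main__":
    demo = "C" + "A" * 20 + "G"  # hypothetical 22 bp sequence
    print(candidate_sites(demo))
```

A sequence match only marks a position where cutting is possible; whether the enzyme cuts would still depend on the modification status of the cytosine.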
2.3.2: Testing 5-hmC-sensitive restriction enzyme detection techniques
To test the feasibility of using this restriction enzyme method for 5-hmC analysis, PvuRts1I was
first tested against a set of standards. These 338 bp double-stranded DNA standards contained
either all unmethylated C, all 5-mC, or all 5-hmC bases. Before experimentation, the standard
sequence was calculated to contain 25 recognition sequences, such that 5-hmC standards should
be digested at many of these sites to produce sub-75 bp fragments. The experiment was repeated
twice. In both cases, agarose gel electrophoresis showed that, while 5-hmC standard was indeed
digested to small fragments, there was significant digestion of 5-mC standard that was not
reported in the original paper (Fig. 2.1A)42. To confirm this result, the same samples were run on
a high-sensitivity Bioanalyzer DNA chip. This analysis confirmed the significant nuclease
activity of PvuRts1I on the 5-mC standards (Fig. 2.1B). Due to this finding, it was decided that
PvuRts1I is not specific enough to discern between methylated and hydroxymethylated
sequences and thus is not suitable for bulk quantification or sequencing-based analysis of 5-hmC.
Fig. 2.1: Analysis of 5-hmC specificity of PvuRts1I. PvuRts1I-digested methylated DNA standards were analyzed by (A) agarose gel electrophoresis and (B) Agilent 2100 Bioanalyzer. Red brackets indicate areas of fragments that are not visible in the image. Size difference in (B) is due to molecular weight and charge difference of highly modified standards. C, unmethylated cytosine standard; M, 5-mC standard; H, 5-hmC standard; MOCK, mock digestion; DIG, digestion with PvuRts1I.
2.4: Immunodetection
2.4.1: Introduction
Immunodetection techniques are relatively fast and convenient methods to reveal the presence of
specific antigens in a sample. Common immunodetection methods, such as those used here,
employ one antibody, called the primary antibody, to target an antigen of interest. Often another
antibody, called the secondary antibody, is used to target the primary antibody. The secondary
(or sometimes primary) antibody is conjugated to a signaling molecule or enzyme. This signaling
molecule acts as a generator of a fluorescent, chemiluminescent, chromogenic, or light- or electron-opaque signal and as a signal amplifier, allowing for the visualization of the location and density of
the antigen of interest.
Antibodies are often raised against fragments of large biopolymers and bind to an epitope of
those polymers; however, antibodies can also be raised against much smaller molecules. Three
commercially-available antibodies that detect 5-hmC have been produced. Additionally, it is
possible to detect 5-hmC by 5-hmC modification and use of antibodies against those specific
modifications. Two published immunodetection techniques of this kind are the GLIB
“click chemistry” technique and the cytosine 5-methylenesulfonate (CMS) technique. The GLIB
technique, mentioned previously, transfers an azide-modified glucose moiety to 5-hmC via β-GT, which is then further modified with a biotin-containing group, creating biotin-N3-5-gmC22.
The biotin is then detected using a biotin antibody or streptavidin. The CMS technique makes
use of CMS, produced from the reaction of sodium bisulfite with 5-hmC (explained in more
detail below)43,44. A CMS antibody is then used for detection. However, for all work described
here, only immunodetection of underivatized 5-hmC was performed.
Fig. 2.2: 5-hmC antibody test blots. Dot blots of methylated DNA standards (10 ng each) were
used to test efficacy of several 5-hmC antibodies. Each column represents a different test. The
schematics in the last row show membrane layout. MB, mouse brain DNA (500 ng).
2.4.2: 5-hmC antibody testing
To compare the effectiveness of the 5-hmC antibodies, dot blots with 10 ng of double-stranded DNA standards containing either all C, all 5-mC, or all 5-hmC bases were conducted.
While none of the antibodies produced a signal from C and 5-mC dots, the Active Motif 5-hmC
antibody produced the strongest signal from the 5-hmC dot (Fig. 2.2). For all work described
below, only the Active Motif 5-hmC antibody was used.
Fig. 2.3: Immunodetection of 5-hmC. Examples of immunodetection of 5-hmC using genomic
DNA of various Eukaryotes. Each column (A-C) shows a separate dot blot experiment with
methylated DNA standard controls. Organisms or standards used are listed next to each dot. See
Table 2.1 for a summary of immunodetection experiments. (Images of blots were edited by cut
and paste for placement in figure. Neutral gray represents whitespace after cut and paste.)
2.4.3: Immunodetection results
To detect 5-hmC in genomes, dot blotting of DNA from various organisms with 5-hmC antibody
was completed. Each blot contained three DNA standard dots, with C and 5-mC standard serving
as negative controls and 5-hmC as positive control. Additionally, many blots also contained Mus
musculus (mouse) brain, mouse liver, and Apis mellifera (bee) worker-caste head DNA. Mouse
brain and liver have previously been reported to contain significant amounts of 5-hmC, while bee
worker head contains only traces of 5-mC, and thus should contain no 5-hmC. These three
samples were used as further controls.
Samples with unknown amounts of 5-hmC were chosen based on the existence of putative
TET/JBP family dioxygenases18. These samples were composed of DNA from several species of
Ascomycete and Basidiomycete fungi and the Chlorophyte green algae. Other species not known
to contain putative TET/JBP family dioxygenases were also tested, such as the Zygomycete
fungus Phycomyces blakesleeanus and several plant species.
Based on these dot blots, 5-hmC is present in the genomes of all tested fungal species and three
of four green algal species (Fig. 2.3B,C; Table 2.1). Though these dot blots are not quantitative,
in some cases the amount of 5-hmC present in these genomes appears to rival the magnitude of
the amount of 5-hmC present in mouse brain, which has been shown to be 0.3 to 0.6% of the
genome, depending on cell type24,45. Additionally, while some plant genomes (Selaginella
moellendorffii, Physcomitrella patens) did not show 5-hmC-positive signal on dot blots, others
did sometimes register weak signals (Arabidopsis thaliana, Oryza sativa) (Fig. 2.3; Table 2.1).
| Organism | Positive tests / Times tested | Lowest positive test amount (ng) | 5-hmC present? | Notes |
|---|---|---|---|---|
| Chlamydomonas reinhardtii | all (>5) | 50 | Yes | Usually stronger than both Chlorella; weak signal at 50ng |
| Chlorella sorokiniana | all (>5) | 50 | Yes | Weak signal at 50ng |
| Chlorella sp. NC64A | all (>5) | 50 | Yes | Usually stronger than C. sorokiniana; weak signal at 50ng |
| Ostreococcus lucimarinus | 1/1 | 1000 | Inconclusive | DATA NOT SHOWN. Weak signal; single test: limited amount of DNA. |
| Neurospora tetrasperma | all (>5) | 200 | Yes | Weak signal at 200ng |
| Neurospora crassa | all (>5) | 200 | Yes | Weak signal at 200ng |
| Phycomyces blakesleeanus | 2/2 | 200 | Yes | Weak signal at 200ng |
| Agaricus bisporus | 2/2 | 50 | Yes | Weak signal at 50ng |
| Uncinocarpus reesii | 1/1 | 500 | Yes | Single test: limited amount of DNA |
| Postia placenta | 1/1 | 1000 | Yes | Single test: limited amount of DNA |
| Laccaria bicolor | 2/2 | 40 | Yes | Strong signal at 40ng |
| Coprinopsis cinerea | 1/1 | 500 | Yes | Single test: limited amount of DNA |
| Oryza sativa | 1/1 | 1000 | Inconclusive | Very weak signal |
| Arabidopsis thaliana | 2/4 | 1000 | Inconclusive | DATA NOT SHOWN. Weak signals; different tissue samples for each test |
| Physcomitrella patens | 0/1 | N/A | No | |
| Selaginella moellendorffii | 0/1 | N/A | No | |
| Mus musculus, brain | all (>5) | 500 | Yes | Used as genomic positive control; consistently stronger signals than liver |
| Mus musculus, liver | all (>5) | 500 | Yes | Used as genomic positive control |
| Apis mellifera, worker head | none (>10) | N/A | No | Used as genomic negative control |
Note: one further “DATA NOT SHOWN” entry belongs to Physcomitrella patens and/or Selaginella moellendorffii; its row cannot be recovered from the layout.
Table 2.1: Summary of 5-hmC immunodetection dot blot experiments.
Fig. 2.4: 2-HG inhibition of algal and fungal dioxygenases. DNA from cultures grown with L-2-hydroxyglutarate (2-HG) was tested for inhibition of TET-like dioxygenases by dot blot-based
immunodetection of 5-hmC. Control row, no 2-HG added before harvest and DNA extraction;
+2-HG row, 2-HG added to culture ~18h before harvest and DNA extraction.
2.4.4: Dioxygenase inhibition results
A 2-KG analog, 2-HG, can be used to inhibit 2-KG-dependent dioxygenases. To show that the 5-hmC in these organisms is produced by a 2-KG-dependent dioxygenase and not by random
chemical oxidation or by a modified SAM-dependent DNA methyltransferase mechanism
described in Chapter 1, 2-HG was used to inhibit hydroxylation of 5-mC in three species of
green algae (Chlorella sp. NC64A, Chlorella sorokiniana, Chlamydomonas reinhardtii) and two
species of fungi (Neurospora tetrasperma, N. crassa). To test for dioxygenase inhibition in these
organisms, 2-HG and additional growth medium were added to liquid cultures of selected green
algae and fungi. For controls, only additional growth medium was added to liquid cultures. After
16 to 20 hours of inhibition, DNA was extracted from samples and dot blotted using 5-hmC
antibody. In 2-HG-treated cultures, a marked decrease in 5-hmC signal was seen (Fig. 2.4). This
signal decrease resulted from a decrease in the amount of 5-hmC in the genome, likely due to
inhibition by 2-HG of TET-like dioxygenase-mediated hydroxylation of 5-mC. It should be
noted that no macroscopic phenotype was noticed in cultures after up to 24 hours of 2-HG
treatment and that similar amounts of DNA were always extracted from control and treated
cultures.
2.5: Mass spectrometry
2.5.1: Introduction
Mass spectrometry (MS) has been used to discover the presence of and quantify 5-hmC in
mammalian genomes. For MS methods, a large quantity of very clean DNA is required to
reliably detect 5-hmC, which, even in relatively 5-hmC-rich eukaryotic genomes and tissues, is
known to make up only a fraction of a percent of the total cytosines. Total digestion of DNA can
be accomplished by enzymatic means, using a mixture of nucleases and phosphodiesterases to
produce a mixture of nucleotides and/or nucleosides, or by chemical means with formic acid to
produce cleaved nucleobases. While enzymatic digestion produces heavier entities, which can be
easier to detect with many MS equipment set-ups, such digestions leave salt and buffering agent
residues that often interfere with clean and accurate detection. On the other hand, digestions with
pure formic acid/water mixtures leave no undesirable reaction residues but create smaller
molecular entities, which can be more difficult to detect, and under some conditions can cause
random oxidation of nucleobases.
2.5.2: MS detection attempt
Based on the above criteria and after consultation with MS specialists, the formic acid digestion
method was chosen21,46,47. Genomic DNA samples from Neurospora crassa and N. tetrasperma,
both of which had previously been determined to have significant amounts of 5-hmC based on
antibody detection methods (see previous section), were digested with 88% formic acid v/v H2O
and digested residues were analyzed by MS. Unfortunately, MS results were inconclusive. While
the expected mass of 5-hmC nucleobase was not detected, neither were masses for C, 5-mC, T,
G, or A (data not shown). These results could be the result of oxidation of nucleobases by formic
acid under non-ideal digestion reaction conditions or the presence of large amounts of unknown
contaminants.
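As a point of reference for the masses sought in such an experiment, the monoisotopic masses of the free nucleobases can be computed from their molecular formulas. The sketch below is illustrative only: the function names are hypothetical, and the choice of singly protonated [M+H]+ ions is an assumption based on the positive ion mode electrospray analysis described in the Materials & Methods, not a record of the facility's actual settings.

```python
import re

# Monoisotopic masses (u) of the most abundant isotope of each element.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}
PROTON = 1.00727647  # proton mass, added for an assumed [M+H]+ ion

# Molecular formulas of the free nucleobases released by acid hydrolysis.
NUCLEOBASES = {
    "C": "C4H5N3O",
    "5-mC": "C5H7N3O",
    "5-hmC": "C5H7N3O2",
    "T": "C5H6N2O2",
    "G": "C5H5N5O",
    "A": "C5H5N5",
}


def monoisotopic_mass(formula: str) -> float:
    """Sum isotope masses for a simple formula such as 'C5H7N3O2'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def protonated_mz(formula: str) -> float:
    """m/z of the singly charged [M+H]+ ion (assumed ion species)."""
    return monoisotopic_mass(formula) + PROTON


if __name__ == "__main__":
    for name, formula in NUCLEOBASES.items():
        print(f"{name:6s} {formula:9s} [M+H]+ = {protonated_mz(formula):.4f}")
```

Under these assumptions, 5-hmC differs from 5-mC by the mass of one oxygen atom (about 15.995 u), which is the shift that would distinguish the two bases in a spectrum.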
2.5.3: Method alterations for future MS experiments
Due to time constraints, this experiment could not be repeated or optimized. However,
future experimental protocols might be altered to produce cleaner results. First, samples
originally prepared on the benchtop could be prepared in a clean hood to reduce the possibility
of airborne contamination. Second, the formic acid digestion protocol could be altered to better
remove atmospheric oxygen, which hastens base oxidation. Third, as in other detection
methods, samples could be modified with β-GT before they are cleaned to produce digestible
genomic DNA, creating 5-gmC or click chemistry-modified bases that would be
significantly heavier and thus more easily detected. Lastly, other chemical modifications could
be employed, such as silylation. Silylation, which is the derivatization of a molecule with a -SiR3
moiety, could be used to, for example, produce 5-(trimethylsiloxy)methyldeoxycytosine via the
reaction of trimethylsilylchloride with the hydroxyl group of 5-hmC48. This much heavier
derivative would be much more easily detected than the naked nucleobase.
2.6: Sequencing-based detection methods
2.6.1: Introduction to bisulfite sequencing
Normal DNA sequencing methods are not able to distinguish 5-mC from an unmethylated C, and
high-throughput methods that rely on immunoprecipitation of 5-mC-containing DNA fragments
yield relatively low resolutions. However, whole-genome, single-base resolution of 5-mC is
possible using a technique called bisulfite sequencing (BS-seq). In BS-seq, libraries of genomic
DNA are bisulfite converted before PCR amplification and sequencing. In bisulfite conversion,
denatured DNA is reacted with sodium bisulfite (NaHSO3) under low pH. Unmethylated C is
sulfonated at the 6 position of the base, which facilitates deamination at the 4 carbon, replacing the amino
moiety with a hydroxyl moiety. A second desulfonation reaction at high pH removes the
sulfonate group from the 6 position and facilitates the protonation of the nitrogen at the 3 position and conversion of the
hydroxyl at the 4 position into a keto moiety. Overall, this reaction turns the unmethylated C into uracil
(U)43,44,49. However, when 5-mC reacts with sodium bisulfite and is desulfonated, 5-methyluracil
(thymine) is not produced. Instead, the methyl group at the 5 position interferes with deamination and the
sulfonate group at the 6 position is removed, leaving the 5-mC unchanged43,44,49. (See Fig. 1 of Huang et al. for an
overview of the reaction44.)
After amplification of the DNA library with a DNA polymerase tolerant of template-strand U,
this process effectively replaces unmethylated C with T, whereas 5-mC in the template strand is
read by the polymerase as C and amplified as such (Fig. 2.5A). It is therefore possible to
determine the position of 5-mC when bisulfite-treated DNA is sequenced and compared to a
reference sequence. In this comparison, bases that are sequenced as T in the bisulfite-converted
sample but are C in the reference were unmethylated, whereas bases that are sequenced as C in
the bisulfite-converted sample and also read as C in the reference were methylated.
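The comparison described above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in this study: the function name and the representation of a pileup as a string of read bases are hypothetical, and only forward-strand reads (C or T at a reference C) are considered.

```python
from collections import Counter
from typing import Optional


def methylation_level(reference_base: str, read_bases: str) -> Optional[float]:
    """Fraction of informative reads that protect a reference cytosine.

    reference_base: base in the reference sequence at this position.
    read_bases: bases observed at this position in bisulfite-converted
                reads, e.g. "CCTCT" (hypothetical pileup string).
    Returns None when the position is not a reference C or has no
    informative (C or T) reads.
    """
    if reference_base.upper() != "C":
        return None
    counts = Counter(read_bases.upper())
    unconverted = counts["C"]  # read as C: protected from conversion
    converted = counts["T"]    # read as T: unmethylated C converted via U
    informative = unconverted + converted
    if informative == 0:
        return None
    return unconverted / informative


if __name__ == "__main__":
    print(methylation_level("C", "CCCT"))  # three of four reads protected
```

A position called "methylated" this way is, strictly, only protected from conversion; the function cannot say which modification provided the protection.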
Fig. 2.5: Overview of bisulfite-based sequencing methods. (A) In BS-seq, unmethylated C is
converted to U during bisulfite conversion but 5-mC and 5-hmC are not. Comparison of BS-seq
data to a reference sequence shows which bases were modified and which were not. (B)
Potassium perruthenate oxidation of 5-hmC yields a formylated intermediate (5-fC). This base is
converted to U during bisulfite conversion. (C) oxBS-seq is a form of BS-seq modified with the
oxidation step outlined in (B). Comparison of oxBS-seq data to a reference sequence shows
which cytosines were methylated and which were either unmethylated or hydroxymethylated.
2.6.2: Single-base resolution of 5-hydroxymethylcytosine?
Until recently, it was not possible to distinguish 5-hmC from 5-mC using bisulfite-based
sequencing methods. Like 5-mC, 5-hmC is not converted to U after standard bisulfite treatment.
During the reaction of 5-hmC with sodium bisulfite, the hydroxyl moiety on the 5-methyl group is
replaced with sulfite, creating cytosine 5-methylene sulfonate (CMS)43,44. This sulfonate group
does not leave after the deamination step and, during amplification and sequencing, CMS is read
as a C. While CMS can be detected by other methods, this fact means that 5-hmC cannot be
distinguished from 5-mC during bisulfite sequencing44. In fact, the equivalency of 5-mC and 5-hmC under bisulfite sequencing methods means that methylation datasets from organisms that do
contain significant amounts of 5-hmC are contaminated with significant amounts of 5-hmC
signal that has been interpreted as 5-mC.
Another issue with determining the position of 5-hmC bases is a difference in how the base is
read by DNA polymerases after bisulfite conversion. In qRT-PCR and primer extension assays
of bisulfite-treated DNA, putative CMS adduct has been shown to stall DNA polymerases and
produce significantly less full-length product from 5-hmC-containing templates compared to 5-mC-containing or C-only templates44. Thus, in addition to contamination from 5-hmC signal, 5-hmC-containing genomic regions may actually be underrepresented in high-throughput bisulfite
sequencing datasets.
2.6.3: Oxidative bisulfite sequencing
While enzyme kinetics-based epigenomic sequencing and the advent of semiconductor- and
nanopore-based sequencing may yield exciting and accurate methods to detect all modified bases
at single-base resolution, these emerging technologies are not currently accurate enough to
produce reliable data for most full-length eukaryotic genomes23,50. To overcome the issues with
single-base resolution of 5-hmC, a technique called oxidative bisulfite sequencing (oxBS-seq)
has recently been developed51. oxBS-seq uses an oxidative step with potassium perruthenate
(KRuO4) prior to bisulfite conversion, and like BS-seq, bisulfite conversion is followed by
library amplification and sequencing. KRuO4 is a strong oxidizer of primary and secondary
alcohols and causes conversion of these hydroxyl groups to carbonyls, producing an aldehyde or
ketone. In DNA, this reaction allows for the oxidation of 5-hmC to 5-formylcytosine (5-fC) (Fig.
2.5B).
Table 2.2: Comparison of bases as seen after different sequencing methods
| Actual base in genome: | C | 5-mC | 5-hmC |
|---|---|---|---|
| appears as… | | | |
| …in reference sequence: | C | C | C |
| …in BS-seq dataset: | T | C | C |
| …in oxBS-seq dataset: | T | C | T |
Unlike 5-mC and 5-hmC, 5-fC is converted to U during bisulfite conversion, with the
carbonyl at the 5 position leaving the base during desulfonation (ostensibly by decarboxylation after being
converted to a carboxyl group). This method’s conversion of both unmethylated C and 5-hmC
allows determination of 5-hmC sites through base-by-base comparison of an oxBS-seq and BS-seq dataset pair (Table 2.2). In the BS-seq dataset, unmethylated C bases will be converted to T,
while 5-mC and 5-hmC bases will remain C. In the oxBS-seq dataset, both unmethylated C and
5-hmC bases will be converted to T, while 5-mC bases will remain C (Fig. 2.5C). Thus, positions
at which significant amounts of C (from the BS-seq dataset) have changed to T in the oxBS-seq
dataset are likely to be consistently 5-hmC in the genome, whereas positions that have shown
conversion in both datasets and positions that have not shown conversion in both datasets are
likely unmethylated C and 5-mC bases in the genome, respectively.
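The logic of this paired comparison can be written down directly from Table 2.2. The sketch below is illustrative only; the names `READOUT`, `infer_base`, and `estimate_fractions` are hypothetical, and the subtraction used to estimate the 5-hmC fraction assumes complete conversion and unbiased amplification in both libraries.

```python
# Expected read-out of each genomic base state, following Table 2.2.
READOUT = {
    "C":     {"BS": "T", "oxBS": "T"},
    "5-mC":  {"BS": "C", "oxBS": "C"},
    "5-hmC": {"BS": "C", "oxBS": "T"},
}


def infer_base(bs_call: str, oxbs_call: str) -> str:
    """Infer the genomic base state from one BS call and one oxBS call."""
    for base, calls in READOUT.items():
        if calls["BS"] == bs_call and calls["oxBS"] == oxbs_call:
            return base
    # T in BS-seq but C in oxBS-seq is not predicted by either chemistry.
    return "inconsistent"


def estimate_fractions(bs_c_fraction: float, oxbs_c_fraction: float) -> dict:
    """Estimate base-state fractions at one position in a pool of cells.

    bs_c_fraction:   fraction of BS-seq reads called C (5-mC + 5-hmC).
    oxbs_c_fraction: fraction of oxBS-seq reads called C (5-mC only).
    The 5-hmC estimate can come out negative from sampling noise alone;
    it is deliberately left unclamped here.
    """
    return {
        "C": 1.0 - bs_c_fraction,
        "5-mC": oxbs_c_fraction,
        "5-hmC": bs_c_fraction - oxbs_c_fraction,
    }


if __name__ == "__main__":
    print(infer_base("C", "T"))
    print(estimate_fractions(0.8, 0.2))
```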
2.6.4: oxBS-seq of Ostreococcus and Chlorella: Processing
Genomic DNA of two species of green algae, Ostreococcus lucimarinus and Chlorella sp.
NC64A, were used to create BS-seq and oxBS-seq library pairs (as explained above) for each
organism. These two organisms were chosen for oxBS-seq for several reasons. First, both
organisms have previously sequenced, published genomes. Second, 5-hmC was immunodetected
in the genome of Chlorella at moderately high levels compared to other organisms used in the
assays. (O. lucimarinus, which did not have 5-hmC signal at high enough levels to be counted as
definitely present, did show a weak, low level signal.) Third, both genomes are small for
eukaryotic organisms, allowing very high sequencing coverage of both organisms. This high
coverage gives the possibility of high statistical confidence in determining the consistent
presence of 5-hmC at a particular location. Fourth, cytosine methylation in Ostreococcus occurs
in a discrete and periodic manner, in which particular cytosines in a genomic library are more or less either unmethylated or nearly fully methylated (Jason Huff, UC-Berkeley, unpublished
data). This digital-type methylation of particular cytosines means that, if present, 5-hmC may
also occur in a discrete manner and thus be more easily detectable.
Additionally, before library creation, each sample was spiked with a 1:1:1 mixture of the same
standards used in the restriction enzyme testing. After sequencing, standards were filtered from
sequencing data and datasets were mapped to appropriate genomes and processed for
methylation determination. To determine the significance of differences between sequencing
data at a given position, a Fisher’s exact test was run on data from every cytosine position in the
genome. With two datasets (BS-seq and oxBS-seq) and the potential modification status of a
position from each dataset, we can tally the values for each category and construct a 2 × 2
contingency table at each cytosine position (Tables 2.2 and 2.3). Using Fisher’s exact test, a p-value
can then be computed for each position.
Table 2.3: Fisher’s exact test contingency table and formula for BS/oxBS paired datasets with “signed p-value”
For each base:
| | BS-seq | oxBS-seq |
|---|---|---|
| Converted | T_B | T_O |
| Not Converted | C_B | C_O |
$$p = \frac{(T_B + T_O)!\,(C_B + C_O)!\,(T_B + C_B)!\,(T_O + C_O)!}{T_B!\,T_O!\,C_B!\,C_O!\,(T_B + T_O + C_B + C_O)!}$$
$$\text{signed } p = \begin{cases} -p, & \dfrac{T_O}{T_O + C_O} < \dfrac{T_B}{T_B + C_B} \\[2ex] p, & \dfrac{T_O}{T_O + C_O} \geq \dfrac{T_B}{T_B + C_B} \end{cases}$$
However, this p-value is unsigned – that is, it gives us information about each position, but it
does not tell us if a given position may be favored to have a 5-hmC at the position. For example,
to say that a given position may have a good chance of often being 5-hmC and not C or 5-mC,
the position must have a low p-value (below our chosen significance cutoff). The position also
must show a shift to an increase in converted bases (T) in the oxBS dataset compared to the BS
dataset, as positions at which 5-hmC bases are often found will show as converted in the oxBS
dataset due to the chemistry of oxidative bisulfite conversion. We can sign our p-value by
checking the percentage of converted bases in both datasets at a given position. Positions at
which the percentage of converted bases in the oxBS dataset is higher than in the BS dataset are
given a positive sign; positions at which the percentage of converted bases in the BS dataset is
higher than in the oxBS dataset are given a negative sign. This “signed p-value” allows filtering
for genomic positions at which 5-hmC may occur by searching for p-values between 0 and the
chosen significance cutoff.
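The signed p-value can be sketched with the standard library alone. This is a hedged illustration rather than the code used for this analysis: the function names are hypothetical, and the argument order (T_B, T_O, C_B, C_O) follows Table 2.3. The first function evaluates the formula in Table 2.3, which is the probability of one particular table; the second sums that probability over all tables with the same margins that are no more probable than the observed one, which is the usual two-sided convention and an assumption about how the test was run here.

```python
from math import comb
from typing import Optional


def table_probability(tb: int, to: int, cb: int, co: int) -> float:
    """Hypergeometric probability of one 2 x 2 table (formula of Table 2.3)."""
    n = tb + to + cb + co
    return comb(tb + to, tb) * comb(cb + co, cb) / comb(n, tb + cb)


def fisher_exact_two_sided(tb: int, to: int, cb: int, co: int) -> float:
    """Two-sided Fisher's exact p-value for a BS/oxBS contingency table."""
    converted = tb + to   # row total: reads called T
    bs_depth = tb + cb    # column total: BS-seq reads
    ox_depth = to + co    # column total: oxBS-seq reads
    observed = table_probability(tb, to, cb, co)
    p = 0.0
    # x runs over every feasible value of T_B given the fixed margins.
    for x in range(max(0, converted - ox_depth), min(converted, bs_depth) + 1):
        prob = table_probability(x, converted - x, bs_depth - x,
                                 ox_depth - (converted - x))
        if prob <= observed * (1 + 1e-7):  # tolerance for float ties
            p += prob
    return min(p, 1.0)


def signed_p_value(tb: int, to: int, cb: int, co: int) -> Optional[float]:
    """Positive if oxBS-seq shows at least as much conversion as BS-seq."""
    if tb + cb == 0 or to + co == 0:
        return None  # no coverage in one dataset, so no comparison
    p = fisher_exact_two_sided(tb, to, cb, co)
    return p if to / (to + co) >= tb / (tb + cb) else -p


if __name__ == "__main__":
    # Ten reads per library: 2 of 10 converted in BS, 8 of 10 in oxBS.
    print(signed_p_value(2, 8, 8, 2))
```

Candidate 5-hmC positions are then those whose signed value lies between 0 and the chosen cutoff. One practical caveat: at very high coverage the p-value can underflow to zero, in which case the sign would need to be stored separately.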
2.6.5: oxBS-seq of Ostreococcus and Chlorella: Results
2.6.5.1: Standards
The inclusion of 1:1:1 methylated DNA standards (described above) into each library was used
as a method to check for oxidative bisulfite conversion using the recently developed technique.
Assuming full conversion of the standards, standard sequences filtered from bisulfite-converted
libraries are expected to show that ~33% of C positions in standard reads have been C→T
converted, while standards filtered from oxidative bisulfite-converted libraries are expected to
show that ~67% of C positions in standard reads have been C→T converted, per the chemistries
described above.
Standards filtered from these sequenced libraries did not show these expected conversion
frequencies. For example, in the first few bases of the standard reads in the Chlorella BS-seq library, ~33% of C positions in standards were expected to be C→T converted but ~50% were actually
converted, while in its oxBS-seq library pair, ~67% of C positions in standards were
expected to be C→T converted but only ~43% were actually converted. These numbers were
consistent in the Ostreococcus library pairs as well.
This deviation from the expected standard conversions indicates that our 5-hmC detection
technique may have failed and that these data cannot be used in analysis of 5-hmC content of
these genomes. However, it may be suspected that, due to the CMS derivative produced during
bisulfite conversion and difficulties with amplification of CMS-containing bases (see above), the
5-hmC standard in a BS-seq library would not be well amplified and thus essentially not present
in an amplified library, giving the 50% conversion rate seen.
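The expected conversion fractions follow from simple counting over the three standards. The worked arithmetic below assumes the standards contribute equally to the library (the 1:1:1 spike-in), that conversion is complete, and, in the last line, that the 5-hmC standard is entirely absent from the amplified BS-seq library; the subscripts are labels introduced here for illustration.

```latex
\begin{align*}
f_{\text{BS}} &= \frac{\overbrace{1}^{\text{C}} + \overbrace{0}^{\text{5-mC}} + \overbrace{0}^{\text{5-hmC}}}{3} = \frac{1}{3} \approx 33\% \\[1ex]
f_{\text{oxBS}} &= \frac{1 + 0 + 1}{3} = \frac{2}{3} \approx 67\% \\[1ex]
f_{\text{BS, no 5-hmC}} &= \frac{1 + 0}{2} = \frac{1}{2} = 50\%
\end{align*}
```

The last line reproduces the ~50% seen in the BS-seq libraries; no equally simple dropout of a single standard yields the ~43% seen in the oxBS-seq libraries.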
Fig. 2.6: Testing of methylated DNA standards libraries. Libraries of a single methylated
DNA standard were subjected to either only bisulfite conversion or both oxidation and bisulfite
conversion. Converted libraries were then PCR amplified and analyzed by agarose gel
electrophoresis. Rows 1 & 2, unmethylated C standard; rows 3 & 4, 5-mC standard; rows 5 & 6,
5-hmC standard. Rows 1, 3, & 5, bisulfite converted and PCR amplified; Rows 2, 4, & 6,
oxidized, bisulfite converted and PCR amplified.
This reasoning does not explain the deviation from expected standards conversion in the oxBS-seq library. To further investigate these deviations, standard-only libraries were created and test
amplified. In total, six mock standards libraries were created, one for each combination of standard
(C only, 5-mC only, or 5-hmC only) and conversion type (bisulfite or oxidative bisulfite). After amplification, libraries were run on an agarose gel for visualization (Fig.
2.6). This experiment shows that, indeed, 5-hmC standard is not amplified after bisulfite
treatment. Also, surprisingly, 5-hmC is not amplified after oxidative bisulfite treatment.
Additionally, after oxidative bisulfite treatment, while there appears to be no effect on 5-mC
standard, unmethylated C standard is significantly underamplified compared to its bisulfite-only
pair. While these mock libraries were not quantified post-amplification, it is clear that oxidative
treatment before bisulfite treatment does alter standard ratios from the expected in sequenced
libraries and may be the reason for deviations from expected standard ratios seen in Chlorella
and Ostreococcus BS/oxBS library pairs.
2.6.5.2: Can we detect 5-hmC in Chlorella and Ostreococcus sequencing data?
Oxidative bisulfite sequencing was recently developed and used to detect 5-hmC at single base
resolution in mouse embryonic stem cells, and that study claims to have detected regions of 5-hmC enrichment51. The present study, however, cannot make such claims due to
failure of the libraries’ internal controls. Based on the standards mock library amplification
experiment and findings from other studies, sequences containing 5-hmC bases (whether only a
few or, as in these standards, highly concentrated) will be severely underrepresented in both
bisulfite- and oxidative bisulfite-converted libraries. Thus, this study cannot expect to detect 5-hmC by sequencing BS/oxBS library pairs prepared with this oxidation protocol.
For example, let us look at a hypothetical base in a specific genomic position that, in a pool of
DNA from many cells, is unmethylated C 20% of the time, 5-mC 20% of the time, and 5-hmC
60% of the time. If the conversion techniques work as they should and there is no amplification
bias, this position should be easily detected as a candidate for often being 5-hmC. This particular
base should be seen as 20% T and 80% C in BS-seq data and 80% T and 20% C in oxBS-seq
data, which, given enough representations of this position in our datasets, quickly reaches a very
low p-value by a Fisher’s exact test. However, if 5-hmC bases are essentially not represented in
the amplified libraries as shown in the standard mock library experiment, both hypothetical
libraries will show this particular base as 50% T and 50% C and 5-hmC at this position will be
undetectable.
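The hypothetical position above can be worked through numerically. The sketch below is illustrative: `expected_t_fraction` and its `recovery` argument (the relative survival of each base state through amplification, 1.0 meaning no bias) are hypothetical names introduced here, and the conversion rules are taken from Table 2.2 with complete conversion assumed.

```python
from typing import Dict, Optional

# Probability that each base state is read as T after conversion (Table 2.2).
CONVERTED = {
    "BS":   {"C": 1.0, "5-mC": 0.0, "5-hmC": 0.0},
    "oxBS": {"C": 1.0, "5-mC": 0.0, "5-hmC": 1.0},
}


def expected_t_fraction(composition: Dict[str, float], method: str,
                        recovery: Optional[Dict[str, float]] = None) -> float:
    """Expected fraction of reads called T at one position.

    composition: fraction of templates in each base state.
    method:      "BS" or "oxBS".
    recovery:    hypothetical relative amplification efficiency per state.
    """
    recovery = recovery or {}
    weights = {base: frac * recovery.get(base, 1.0)
               for base, frac in composition.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no template survives amplification")
    return sum(w * CONVERTED[method][base] for base, w in weights.items()) / total


if __name__ == "__main__":
    position = {"C": 0.2, "5-mC": 0.2, "5-hmC": 0.6}
    for label, rec in (("no bias", None), ("5-hmC lost", {"5-hmC": 0.0})):
        bs = expected_t_fraction(position, "BS", rec)
        ox = expected_t_fraction(position, "oxBS", rec)
        print(f"{label}: BS {bs:.0%} T, oxBS {ox:.0%} T")
```

With no bias the two libraries differ by 60 percentage points; with complete loss of 5-hmC templates both read 50% T, so the paired comparison carries no information about 5-hmC at that position.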
Despite these issues, the BS/oxBS paired datasets can still be analyzed using the Fisher’s exact
test “signed p-value” method described above. Assume that there is a significant number of 5-hmC-rich positions in each genome that can be detected without the issues described above. If
this is the case, we should see a relatively large number of cytosine positions with very small
“positive” p-values, few very small “negative” p-values, and many non-significant p-values,
where the very small positive p-values indicate positions at which 5-hmC is often present.
Applying this analysis to a subset of Chlorella CG sites, approximately equal numbers of very
small positive and negative p-values are seen. Out of 65955 CG sites with a Fisher’s exact test p-value less than or equal to 10^-5, only 32862 (~49.8%) are “positive.” Tightening the significance
cut-off to 10^-10, 674 out of 1466 (~46%) are “positive,” while at a 10^-15 cut-off, 31 out of 58
(~53.4%) are “positive.” Because there are approximately as many “positive” as “negative” sites
at all significance cut-offs, we can conclude that 5-hmC is not detectable in these paired datasets
using the methods presented here.
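The tally itself is straightforward to reproduce. In the sketch below, the counts are those reported in the text above, while the function `positive_fraction` and the idea of passing it a list of signed p-values are hypothetical conveniences, not the original analysis code.

```python
from typing import Iterable, Optional

# (cutoff exponent, "positive" sites, all significant sites) as reported above.
REPORTED = [(-5, 32862, 65955), (-10, 674, 1466), (-15, 31, 58)]


def positive_fraction(signed_p_values: Iterable[float],
                      cutoff: float) -> Optional[float]:
    """Share of significant positions whose signed p-value is positive.

    A value that underflowed to 0.0 carries no usable sign and is counted
    as not positive; a real pipeline would store the sign separately.
    """
    significant = [p for p in signed_p_values if abs(p) <= cutoff]
    if not significant:
        return None
    return sum(1 for p in significant if p > 0) / len(significant)


if __name__ == "__main__":
    for exponent, positive, total in REPORTED:
        print(f"p <= 1e{exponent}: {positive}/{total} "
              f"= {positive / total:.1%} positive")
```

If 5-hmC were being detected, the positive share would be expected to rise well above one half as the cutoff tightens; the reported counts instead stay near 50% at every cutoff.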
2.7: Materials & Methods
2.7.1: Genomic DNA, organismal growth conditions, and methylated DNA standards
DNA extracted from frozen or living tissue or culture by the author or DNA samples from the
following organisms were used in this study: Oryza sativa seedling frozen tissue (Jessica
Rodrigues, UC-Berkeley); Mus musculus brain and liver frozen tissues, Apis mellifera worker
head DNA, Selaginella moellendorffii DNA, Uncinocarpus reesii DNA, Postia placenta DNA,
Laccaria bicolor DNA, Coprinopsis cinerea DNA (Assaf Zemach, UC-Berkeley); Ostreococcus
lucimarinus CCMP#2972 DNA (Jason Huff, UC-Berkeley); Physcomitrella patens living tissue
(Tom Kleist, UC-Berkeley); Chlorella sorokiniana culture (Melissa Roth, UC-Berkeley);
Chlamydomonas reinhardtii CC503 cw-92 mt+ culture (Melis Lab, UC-Berkeley); growing
cultures of Coprinus cinereus (Coprinopsis cinerea) Okayama 7 FGSC#9003, Neurospora
crassa 74-OR23-1VA FGSC#2489, Neurospora tetrasperma FGSC#2509, Agaricus bisporus
H97 FGSC#10389 and JB137-S8 FGSC#10392 (Fungal Genetics Stock Center, Kansas City,
MO); frozen cultures of Chlorella sp. NC64A ATCC#50258, Laccaria bicolor ATCC#MYA-4686,
Phycomyces blakesleeanus ATCC#8743B (American Type Culture Collection, Manassas,
VA); Arabidopsis thaliana Columbia-0 and Landsberg erecta mature leaf tissue.
Green algae cultures were grown in TAP liquid medium (UTEX Culture Collection of Algae,
Austin, TX) at 20°C under a 16/8 light/dark cycle. Fungal cultures were grown in 1× or ½×
Difco YM liquid medium (BD, Franklin Lakes, NJ) at 20 to 25°C. Arabidopsis thaliana was
greenhouse grown in a soil mixture at 20 to 25°C under a 16/8 light/dark cycle.
The 5-hmC, 5-mC & Cytosine DNA Standard Pack from Diagenode (Denville, NJ) was used for
dot blot standard controls. The Methylated DNA Standard Kit from Active Motif (Carlsbad, CA)
was used for restriction enzyme testing and high-throughput sequencing experiments.
2.7.2: DNA extraction
Tissues were frozen with liquid nitrogen and ground using a mortar and pestle. An appropriate
volume of 1× or 2× CTAB DNA extraction buffer containing 0.2% 2-mercaptoethanol v/v was
added to the tissue, which was ground again. After a 1-3h incubation at 65°C, the mixture was centrifuged at
room temperature (RT) and the supernatant was mixed with an equal volume of
phenol:chloroform:isoamyl alcohol (25:24:1), inverted for 30s, and centrifuged at RT. The
aqueous phase was then mixed with an equal volume of chloroform:isoamyl alcohol (24:1),
inverted for 30s, and centrifuged at RT. The aqueous phase was mixed with 0.7 to 1 volumes of
2-propanol, inverted to mix, and centrifuged at RT. The supernatant was removed and the pellet
was dried and resuspended in water or elution buffer, to which 2.5 to 3 volumes of 1.3% 3M
sodium acetate v/v ethanol was added. This mixture was chilled at -20°C for at least 1h and
centrifuged at 4°C, after which the supernatant was removed and the pellet was dried,
resuspended in water or elution buffer, and used for experimentation. All centrifugation steps
used 12000 to 18000 rcf.
2.7.3: Restriction enzyme digestion and analysis
PvuRts1I recombinant restriction enzyme (Active Motif, Carlsbad, CA) was used according to
manufacturer’s recommended reaction conditions with 250 µg of Active Motif Methylated DNA
Standards. Mock digestion controls were performed with water in place of PvuRts1I enzyme
solution. After digestion, reactions were heat inactivated for 10m at 65°C. After heat
inactivation, samples were either used directly in 3% agarose w/v TAE electrophoresis gels or
cleaned with QIAGEN MinElute PCR Purification Kits (QIAGEN, Valencia, CA) and submitted
to QB3 (Berkeley, CA) for Agilent 2100 Bioanalyzer (Santa Clara, CA) analysis.
2.7.4: Dot blotting and antibody testing
Adapted from Brown52, to produce membranes for dot blots, positively charged nylon membrane
(Roche Applied Science, Indianapolis, IN) was cut to size, wetted in water for 10m, and dried.
NaOH and EDTA (pH 8.2) were added to DNA samples to give final concentrations of 0.4M
NaOH and 10mM EDTA. DNA was denatured by heating to 100°C for 10m and crash-cooled on
ice for 5m before application to membrane. Denatured DNA was applied to membranes 2µl at a
time and spots were allowed to dry before application of another dot. After application of all
samples, the membrane was dried, briefly rinsed in 2× SSC buffer, and dried again.
For blotting, antibodies were tested with standards in several different combinations of
concentration, buffer, blocking agent, and incubation time until a reasonable qualitative signal-to-noise ratio was reached. For experiments, dot blot membranes were rehydrated in PBST (0.1%
Tween-20 v/v PBS buffer) for 5m at RT and transferred to blocking buffer (1% BSA or 5% dried
milk w/v PBST, as appropriate for antibody) on a rotator or rocker for 30m at RT or overnight at
4°C. Blocking buffer was then poured off and blocking buffer with primary antibody – Active
Motif mouse α-5-hmC mAb 1:5000 PBST/BSA; Diagenode mouse α-5-hmC mAb, 1µg/5ml
PBST/BSA; AbCam rat α-5-hmC mAb (Cambridge, MA) 1:1000 PBST/milk – was added and
incubated at RT for 2h. After primary incubation, the solution was poured off and the membrane was washed
5 × 5m in PBST. Membranes were then incubated in secondary antibody (either Invitrogen
goat α-mouse mAb::HRP or Invitrogen goat α-rat mAb::HRP, 1:5000 or 1:10000 in PBST/milk) for
1h at RT and then washed 5 × 5m in PBST. Membranes were then incubated with SuperSignal
West Pico Chemiluminescent Substrate (Thermo Fisher Scientific, Waltham, MA) and exposed
to Kodak BioMax film (Carestream Health, Toronto, Ontario, Canada).
2.7.5: Dioxygenase inhibition
Liquid cultures of green algae and fungi had one volume of appropriate growth medium and
either an aqueous solution of L-2-hydroxyglutarate (Sigma-Aldrich, St. Louis, MO) to a
concentration of 10mM or water (controls only) added to them. Cultures were allowed to grow
for 16 to 20h before DNA extraction and dot blotting, as described above.
2.7.6: Acid hydrolysis of DNA and mass spectrometry
Before chemical digestion for mass spectrometry, DNA was isolated and purified using the
protocol in Section 2.7.2 with some modification to increase purity. First, DNA was subjected to
two rounds each of the chloroform:isoamyl alcohol step and the isopropanol precipitation. Then,
after precipitating with the sodium acetate/ethanol solution, the pellet was resuspended in water
and precipitated with 100% ethanol twice. After these steps, the dried pellet was suspended in
ultrapure, nuclease-free water. Adapted from formic acid digestion protocols in Gommers-Ampt
et al., Djuric et al., and Eick et al.21,46,47, 50µg of DNA was transferred to acid-washed glass tubes
with 9 volumes of 98% formic acid. The tube was quickly flushed with filtered nitrogen, sealed
with a PTFE-lined cap, and heated to between 141 and 145°C for 60m in an aluminum heating
block, using heavy mineral oil for greater thermal contact. After heating, the tube was allowed to
cool to RT on the benchtop and the formic acid/water mixture was evaporated in a vacuum
chamber. The dried residue was resuspended in 100% methanol and submitted to the
QB3/Chemistry Mass Spectrometry Facility at UC-Berkeley for analysis by positive ion mode
electrospray ionization (nanospray).
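The 9 volumes of 98% formic acid used here are consistent with the 88% formic acid (v/v) quoted in Section 2.5.2. The check below assumes that the aqueous DNA sample counts as one volume of water and that volumes are additive on mixing; both are simplifying assumptions.

```latex
\[
\frac{9\ \text{vol} \times 98\%}{9\ \text{vol} + 1\ \text{vol}}
  = \frac{882\%}{10}
  = 88.2\% \approx 88\%\ \text{formic acid (v/v)}
\]
```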
2.7.7: Bisulfite and oxidative bisulfite library construction
DNA used for BS-seq and oxBS-seq libraries was first isolated and purified using the protocol in
Section 2.7.2, sonicated to a mean size of approximately 300bp, purified using Agencourt
AMPure XP magnetic beads (Beckman Coulter, Indianapolis, IN) or homemade magnetic beads
(Tzung-Fu Hsieh, UC-Berkeley) according to manufacturer’s recommended protocol,
resuspended in elution buffer, and checked for appropriate size distribution on a 1% agarose w/v
TAE electrophoresis gel.
Based on the protocol provided in Lister et al.53, cleaned, sheared DNA was end repaired with
T4 DNA polymerase, Klenow DNA polymerase, and T4 PNK (New England Biolabs, Ipswich,
MA) with appropriate reagents for 30m at 20°C, magnetic bead cleaned, then modified with 3’ A
base additions using Klenow exo- (New England Biolabs) and appropriate buffers for 30m at
37°C. To this reaction, annealed, fully methylated Illumina paired end adapters (5' PGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG; 5'
ACACTCTTTCCCTACACGACGCTCTTCCGATCT) (Bioneer, Alameda, CA) and DNA
Quick Ligase (New England Biolabs) with appropriate buffer were added and incubated for 20m
at RT. After adapter ligation, the reaction was cleaned with magnetic beads. For libraries to be
oxidized, DNA was resuspended in water after purification.
For oxBS-seq libraries, an oxidation protocol based on Booth et al. was used51. DNA prepared as
discussed above was denatured by adding NaOH to a final concentration of 0.05M, heated for
30m at 37°C, and snap-cooled on ice for 5m. Next, a KRuO4 solution (15mM in 0.05M NaOH)
was added to the DNA solution, for a final KRuO4 concentration of 0.6mM, and left on ice for
1h with occasional light finger vortexing. Oxidized DNA was purified with Roche mini Quick
Spin Oligo Columns (Sephadex G-25 beads). First, buffer was washed out of the column four
times, each time with 200µl water followed by centrifugation for 1m at 1000 rcf, before adding the DNA/KRuO4 solution.
Purified solution was collected by centrifuging for 4m at 1000 rcf.
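The KRuO4 addition above is a simple dilution, and the volume of stock needed can be checked with the relation C1·V1 = C2·(V1 + Vsample). The following is a minimal sketch only; the function name and the 24µl sample volume are hypothetical and are not taken from the protocol, while the 15mM stock and 0.6mM final concentrations are those given in the text.

```python
def stock_volume_ul(sample_volume_ul, stock_mm, final_mm):
    """Volume of stock (µl) to add to a sample so the mixture reaches final_mm.

    Solves C1 * V1 = C2 * (V1 + Vsample) for V1.
    """
    if not 0 < final_mm < stock_mm:
        raise ValueError("final concentration must be positive and below the stock")
    return sample_volume_ul * final_mm / (stock_mm - final_mm)


# Hypothetical 24 µl of denatured DNA solution (assumed volume, for illustration).
# Stock and final KRuO4 concentrations are the values stated in the text.
volume = stock_volume_ul(24.0, stock_mm=15.0, final_mm=0.6)
print(f"add {volume:.2f} µl of 15 mM KRuO4 stock")
```

Under these concentrations the stock is diluted 25-fold, so one volume of stock is added per 24 volumes of DNA solution.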
BS- and oxBS-seq libraries were bisulfite converted twice using the FFPE protocol of the
QIAGEN EpiTect Kit without carrier RNA. Bisulfite converted libraries were PCR amplified
using exTaq HS (Takara, Mountain View, CA), Illumina PE1 (5'
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT) and PE2 (5'
CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT) primers, and
appropriate reagents with 14 to 18 amplification cycles. Amplified libraries
were cleaned and concentrated with magnetic beads and analyzed by Bioanalyzer DNA HS
(UC-Berkeley QB3 facility) and submitted to QB3/Genomics Sequencing Laboratory for sequencing
on Illumina HiSeq 2000 sequencers.
2.7.8: Processing and analysis of high-throughput sequencing data
Data from sequenced libraries were aligned and analyzed using bowtie
(http://bowtie-bio.sourceforge.net), SAMtools (http://samtools.sourceforge.net), BAMtools
(http://bamtools.sourceforge.net), and the dzlab-tools package
(http://dzlab.pmb.berkeley.edu/tools). Results were further analyzed, filtered, and organized,
and the workflow automated, with custom Perl, Python and bash shell scripts (available from the
author) and Microsoft Excel (Microsoft, Redmond, WA).
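The logic of comparing BS-seq and oxBS-seq libraries can be illustrated with a short sketch. In BS-seq, both 5-mC and 5-hmC are read as C, while in oxBS-seq only 5-mC is read as C51, so the 5-hmC level at a site can be estimated as the difference between the two C fractions. This is not the dzlab-tools implementation or the custom scripts used in this study; the function names and read counts below are hypothetical.

```python
def c_fraction(c_reads, t_reads):
    """Fraction of reads at a cytosine position that still read as C after conversion."""
    total = c_reads + t_reads
    if total == 0:
        raise ValueError("no coverage at this site")
    return c_reads / total


def estimate_5mc_5hmc(bs_c, bs_t, oxbs_c, oxbs_t):
    """Return (5-mC, 5-hmC) estimates for one cytosine.

    BS-seq C fraction   ~ 5-mC + 5-hmC
    oxBS-seq C fraction ~ 5-mC only
    """
    bs = c_fraction(bs_c, bs_t)
    oxbs = c_fraction(oxbs_c, oxbs_t)
    return oxbs, bs - oxbs


# Hypothetical read counts at a single cytosine, for illustration only.
mc, hmc = estimate_5mc_5hmc(bs_c=40, bs_t=10, oxbs_c=30, oxbs_t=20)
print(f"5-mC ~ {mc:.2f}, 5-hmC ~ {hmc:.2f}")
```

Because the estimate is a difference of two sampled fractions, it can be slightly negative at sites with little or no 5-hmC, so in practice such a calculation would need adequate coverage and a statistical test before a site is called.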
Chapter 3:
Hypothesis for function of 5-hydroxymethylcytosine
in putative fungal transposons and schemata for engineering
sequence-specific epigenetic modification systems
3.1: Introduction: Relationships among eukaryotic TET/JBP family dioxygenases and
dioxygenase-containing transposons
Putative TET/JBP family dioxygenases, enzymes that catalyze the production of 5-hmC,
are known to be encoded in many eukaryotic genomes18. Among the organisms that likely
contain TET/JBP-like dioxygenases are the fungi Laccaria bicolor and Coprinopsis cinerea and
the Chlorophyte alga Chlamydomonas reinhardtii. This study has shown, by immunodetection,
that all three of these organisms likely contain significant amounts of 5-hmC in their genomes.
Detailed bioinformatics analysis has shown that TET/JBP-like dioxygenases occur in high copy
number in these organisms and may be associated with putative transposases and other DNA
binding and enzymatic proteins in a specific genetic orientation18 (Fig. 3.1A). These facts lead to
the hypothesis that such genes may together form transposons or transposon-like cassettes.
Fig. 3.1: Overview of putative transposon-associated TET-like dioxygenases. (A) Basic
structure of putative transposons in Laccaria and Coprinopsis. In these transposon-like cassettes,
dioxygenases and DNA binding domains are found parallel to each other, while putative
transposases are found antiparallel to the first two sequences. (B) A maximum likelihood tree of
selected eukaryotic dioxygenases shows evolutionary relationships among homologs.
To further develop this hypothesis, this study completed bioinformatics analysis of these
dioxygenases. A maximum likelihood phylogenetic tree was created using aligned amino acid
sequences of selected TET/JBP family dioxygenases from vertebrate and invertebrate animals,
Discicristate-group organisms (Trypanosoma species and Naegleria gruberi), and the
aforementioned fungi and green alga. As an outgroup, the Escherichia coli protein AlkBH2, a 2-KG-dependent dioxygenase known to function as a DNA repair enzyme, was used.
Based on known phylogeny, the expected tree would form two clades: One clade consisting of
the green alga, the fungi, and the animals, with a sub-clade of the fungi and animals, and a
second clade consisting of the Discicristate organisms. Instead, clades consisting of animal
sequences and Trypanosoma sequences group most closely together. These groups form a larger
clade with the putative transposon-associated green algal and fungal sequences, and sister to this
larger clade are the N. gruberi sequences (Fig. 3.1B). Indeed, the grouping of the fungal and
green algal dioxygenases outside of the known TET and JBP sequences suggests that they are
closely related, and it is plausible that they may function as or have been derived from rogue
genetic elements. Additionally, the large grouping of N. gruberi dioxygenases, seemingly not
closely related to those JBP family dioxygenases of the fellow Discicristate organisms in the
genus Trypanosoma, suggests the same may be true for these genes.
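The difference between the expected and observed topologies can be made explicit by listing the clades each tree implies. The sketch below encodes both trees as nested tuples and compares their clades. It is a simplification for illustration only: each group is reduced to a single leaf, the outgroup is omitted, and the observed tree assumes that the green algal and fungal sequences group together, which the text suggests but does not state outright.

```python
def clades(tree):
    """Return (leaves, clades) for a tree written as nested tuples of leaf names.

    A clade is recorded as a frozenset of the leaf names under one internal node.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


# Topology expected from organismal phylogeny (as described in the text).
expected = (("alga", ("fungi", "animals")), ("Trypanosoma", "Naegleria"))
# Topology observed for the dioxygenases (simplified; alga + fungi grouping is assumed).
observed = ((("animals", "Trypanosoma"), ("alga", "fungi")), "Naegleria")

_, expected_clades = clades(expected)
_, observed_clades = clades(observed)
missing = expected_clades - observed_clades

for clade in sorted(missing, key=sorted):
    print("expected but not observed:", sorted(clade))
```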
3.2: Hypothesis for function of putative fungal transposons
3.2.1: Introduction: A hypothesis for autoderepressing transposons
If these dioxygenases are indeed derived from rogue genetic elements and are associated with
sequences that either are or once were active transposon-like elements, what could be their
purpose? First, transposons, transposon-like sequences, and repeats are often cytosine methylated
in eukaryotic organisms that contain DNA methyltransferases8. This 5-mC acts as a repressive
mark. Let us assume that these elements are highly methylated in their host organisms and
repressed by the hosts’ genetic regulatory systems. For these elements to more efficiently be
expressed and transpose, their repressive marks must be removed. Assuming some amount of
low-level transcription of these genes, if these elements can themselves remove their repressive
marks, they might be able to increase their own expression and transposition, creating a positive
feedback loop.
One can speculate that these transposon-like cassettes encode an autoderepression
mechanism to achieve this effect. The essential structure of the putative fungal transposons
includes a putative transposase, a DNA binding protein that could target a sequence or sequences
of the cassette, and a dioxygenase (Fig. 3.1A). It is the dioxygenase that, by associating with and
being targeted to transposon sequences by the DNA binding protein, would act as the direct
derepressive mechanism. By catalyzing the formation of 5-hmC from 5-mC, the dioxygenase
effectively removes repressive methylation marks in the transposon by one of the mechanisms
discussed in Chapter 1.
3.2.2: Outline of methods for “autoderepressing transposons” hypothesis testing
This study has not attempted to show that these putative transposons are, in fact, active
transposons. However, there are several methods that could be used to show transposon activity
as well as the role that the transposon-associated dioxygenases may play in transposon-associated
gene expression and activity. First, a transcriptomic approach, such as RNA-seq, or a
PCR-based approach, such as qRT-PCR, could be used to detect and quantify expression of
specific transposon-associated genes. These two types of methods would verify that these genes
are actively transcribed by the host. Second, transposon display techniques can show active
transposition of transposons54–56. Lastly, if dioxygenase activity does increase expression and
transposition of these transposons, then dioxygenase inhibition should decrease expression of
transposon-associated genes as well as their transposition. Using a protocol similar to that
explained in Chapter 2, 2-HG treatment of these organisms in conjunction with RNA-seq or
qRT-PCR and transposon display should cause a marked decrease in expression and
transposition of these transposons when compared to an untreated control.
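For the qRT-PCR comparison of 2-HG-treated and untreated cultures, one common way to express the result is the 2^(-ΔΔCt) relative quantification method. The sketch below assumes that method, a reference gene with stable expression, and roughly 100% amplification efficiency; none of these choices are specified in this study, and the Ct values are hypothetical.

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene in treated vs. control samples (2^-ddCt).

    Each Ct is first normalized to a reference gene measured in the same sample.
    """
    delta_treated = ct_target_treated - ct_ref_treated
    delta_control = ct_target_control - ct_ref_control
    return 2.0 ** -(delta_treated - delta_control)


# Hypothetical Ct values for a transposon-associated gene and a reference gene.
fold = relative_expression(ct_target_treated=26.0, ct_ref_treated=18.0,
                           ct_target_control=24.0, ct_ref_control=18.0)
print(f"expression in 2-HG-treated culture relative to control: {fold:.2f}")
```

A fold change below 1, as in this made-up example, would be the outcome predicted by the autoderepression hypothesis.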
3.3: Proposal for sequence-specific epigenetic modification technology
3.3.1: Introduction
Whether or not these putative transposons act as hypothesized in the previous section, their
regular structure shown by Iyer et al. suggests the concept of sequence-specific epigenetic
modification18. The development of a technology that is capable of sequence-specific epigenetic
modification would be a boon for many fields of basic biological and medical research.
Conceivably, this technology could be used to either silence or activate expression of genes
through the cell’s epigenetic system, thereby altering cellular phenotypes that are (partially)
epigenetically controlled, such as certain aspects of the cell cycle and division, development and
differentiation, and metabolism.
This concept requires the production of two essential, interacting components (Fig. 3.2). First, a
targeter is required. This targeter can be a DNA binding protein that binds to a unique sequence
or semi-specific set of sequences or a single-stranded nucleic acid that is complementary to the
target sequence. Currently, engineering of DNA binding proteins, such as zinc fingers and
transcription activator-like effectors, is possible, and it is likely that this technology will make
targeting of any nucleotide sequence much easier in the near future. For now, a proof-of-concept
experiment using a protein targeter might use a transcription factor or other DNA binding protein
that is known to target a specific set of sequences. Second, a modifier is required. This enzymatic
module modifies a specific epigenetic mark in the region in which it is targeted. Possibilities for
modifications are DNA methylation (with a DNA methyltransferase), DNA demethylation (with
a dioxygenase or a DNA glycosylase, as appropriate for the organism being modified), or any
number of histone modifications.
In addition to the targeter and modifier components, other secondary components are needed.
First, a linker to bind the primary components is required. This linker may either covalently bind
the targeter and modifier in a fused system or facilitate interaction between targeter and modifier
in a modular system, such as through the use of high-affinity, interacting tags on the components
(e.g., biotin-streptavidin system). (Both of these concepts are discussed more below.) In addition
to binding the two components together, the linker must allow flexibility and movement of the
two components relative to one another so that the modifier can reach the three-dimensional
region of chromatin surrounding the sequence to which the targeter is bound. Second, a nuclear
localization signal is required to be a part of the system so that it can be delivered to the nucleus
and access chromatin.
Fig. 3.2: Example scheme of a modular, sequence-specific epigenetic modification system.
In this example, the modifier module, a dioxygenase, can link to the targeter module, a DNA
binding protein, through streptavidin-biotin interaction. This system can activate gene expression
by binding to a specific genomic region and hydroxylating 5-mC bases surrounding that region
in a gene normally repressed via cytosine methylation.
3.3.2: Fused versus modular systems
In the fused system, all components are produced as one large, naturally covalently-linked
polypeptide from a single, engineered genetic construct. This fused system may be advantageous
where organisms are genetically engineered with the construct and must express the components
in a one-to-one manner.
In the modular system, the primary components are produced separately and later must be linked
through high-affinity interactor tags. By using a modular system delivered to cells, modifier
components can be microbially produced in volume and each can be mixed in a one-to-one ratio
with microbially produced targeters to create a linked, “off the shelf” targeter-modifier system.
With this method, a single locus or many loci may be targeted for a single or many epigenetic
modifications without genetic engineering of the cells to be modified. This modular system may
be advantageous when the system can be delivered to cells via injection or other specific means.
3.3.3: Potential uses of sequence-specific epigenetic modifier technology
How could this technology be used, given a working implementation with appropriate delivery
methods? A few ideas upon which one can speculate include: 1. Epigenetic modification of loci
important in cell cycle checkpoint, cell growth, or apoptosis to slow or stop cancer cell growth;
2. Demethylation of abnormally methylated loci in pancreatic cells of diabetic or pre-diabetic
individuals57; 3. Reversion of differentiated cells into another state of potency by epigenetic
means; 4. Altering normal phenotype or development by inducing a fused construct in response
to an inductive signal, such as altering flowering time, fruit/seed development, or dormancy in
crop plants.
3.4: Methods
3.4.1: Phylogenetic tree construction
Dioxygenase amino acid sequences selected for construction of a phylogenetic tree were found
in Iyer et al.18 and aligned using MUSCLE (http://www.ebi.ac.uk/Tools/msa/muscle). The
maximum likelihood phylogenetic tree was constructed with aligned sequences using PhyML
(http://www.phylogeny.fr) with bootstrapping58.
Chapter 4:
Thesis Summary
4.1: Introduction
Methylcytosine has long been known to be an important component of certain prokaryotic and
eukaryotic genomes. Recently, hydroxymethylcytosine, a modification of methylcytosine
catalyzed by Fe(II) and 2-oxoglutarate-dependent dioxygenases, was found to be a significant
component of mammalian genomes. In mammals, 5-hmC is suspected to play important roles in
DNA methylation/demethylation dynamics as a demethylation intermediate and may constitute a
novel epigenetic mark. Dioxygenase protein domains that likely participate in pyrimidine
oxidation, including those that are very similar to mammalian TET-like dioxygenases, have been
found in several other eukaryotic lineages, including plants, fungi, and Discicristate organisms
(e.g., Trypanosoma and Naegleria). Because of its presumably wide evolutionary distribution
and its potential importance to understanding the complexity of eukaryotic genomes, it is
important that we understand more about this DNA modification.
4.2: Detection of 5-hydroxymethylcytosine
Many methods for detection of 5-hmC exist. Several methods that were not used in this
study, such as chromatographic methods and specific chemical labeling, were outlined, and
results of experiments to detect the presence of 5-hmC in various eukaryotic organisms were
discussed in detail. First, the 5-hmC-specific PvuRts1I restriction enzyme method was tested.
Results from these experiments show that PvuRts1I is likely not as specific as previously
published results indicate and this method was subsequently abandoned. Then, immunodetection
of 5-hmC using a 5-hmC-specific antibody was used in dot blot analysis of DNA from various
eukaryotes with satisfactory results. This method showed the previously unreported presence of
5-hmC in several fungi and green algae, as well as potentially at low levels in some angiosperms.
Immunodetection was then used to show inhibition of 5-hmC production in certain fungi and green algae
through application of the 2-KG analog 2-HG, showing that at least some of the 5-hmC present
in these organisms is likely due to production by 2-KG-dependent dioxygenases. Next, mass
spectrometry of chemically-digested DNA was attempted to show the presence of 5-hmC in two
Neurospora species, which had shown positive signals from immunodetection methods.
Unfortunately, this method did not work as expected. Lastly, a high-throughput sequencing-based
method called oxidative bisulfite sequencing was attempted in order to detect 5-hmC at
single-base resolution in two green algae. Analysis of the libraries and their included standards
shows that, while 5-hmC may indeed be present, this study was not able to detect it using this
method. Possible issues with the conversion and amplification of 5-hmC-rich DNA sequences
that may be reasons for sequencing-based detection problems were also discussed in this section.
4.3: Transposon function hypothesis and epigenetic editor proposal
Bioinformatic analysis of TET/JBP-like dioxygenases in certain organisms has shown that these
enzymes are often associated with putative transposon-like elements. It was hypothesized that,
with the help of DNA targeting proteins, these dioxygenases may help derepress their associated
elements. By hydroxylation of repressive 5-mC marks likely found on these elements, these
transposon-associated dioxygenases may effectively override the host’s repression mechanisms,
creating an autoderepressing transposon. Several methods to test their activity and, if they are
indeed actively transposing elements, discern the role of dioxygenases in their activity were
proposed. Based on the structure of these putative transposons, a hypothetical sequence-specific
epigenetic modification technology was proposed. Basic system components, two possible
system constructions, using fused and modular components, and potential applications of the
hypothetical technology were also discussed.
References
1.
Wilson, G. G. & Murray, N. E. Restriction and modification systems. Annual Review of
Genetics 25, 585–627 (1991).
2.
Hall, R. M. The DNA adenine methyltransferase (dam+) gene of bacteriophage T4
reverses the mutator phenotype of an Escherichia coli dam mutant. Journal of
Bacteriology 172, 2812–3 (1990).
3.
Suzuki, M. M. & Bird, A. DNA methylation landscapes: provocative insights from
epigenomics. Nature Reviews Genetics 9, 465–76 (2008).
4.
Cedar, H. & Bergman, Y. Linking DNA methylation and histone modification: patterns
and paradigms. Nature Reviews Genetics 10, 295–304 (2009).
5.
Cedar, H. & Bergman, Y. Programming of DNA methylation patterns. Annual Review of
Biochemistry (2012).
6.
Zemach, A., McDaniel, I. E., Silva, P. & Zilberman, D. Genome-wide evolutionary
analysis of eukaryotic DNA methylation. Science 328, 916–9 (2010).
7.
Bestor, T. H. The DNA methyltransferases of mammals. Human Molecular Genetics 9,
2395–2402 (2000).
8.
Law, J. A. & Jacobsen, S. E. Establishing, maintaining and modifying DNA methylation
patterns in plants and animals. Nature Reviews Genetics 11, 204–20 (2010).
9.
Lehman, I. R. & Pratt, E. A. On the Structure of the Glucosylated Nucleotides of
Coliphages Hydroxymethylcytosine. Journal of Biological Chemistry 235, 3254–3259
(1960).
10.
Wyatt, G. R. & Cohen, S. S. The bases of the nucleic acids of some bacterial and animal
viruses: the occurrence of 5-hydroxymethylcytosine. The Biochemical Journal 55, 774–82
(1953).
11.
Tahiliani, M. et al. Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in
mammalian DNA by MLL partner TET1. Science 324, 930–5 (2009).
12.
Pastor, W. A. et al. Genome-wide mapping of 5-hydroxymethylcytosine in embryonic
stem cells. Nature 473, 394–7 (2011).
13.
van Luenen, H. G. A. M. et al. Glucosylated Hydroxymethyluracil, DNA Base J, Prevents
Transcriptional Readthrough in Leishmania. Cell 150, 909–921 (2012).
14.
Militello, K. T. et al. African trypanosomes contain 5-methylcytosine in nuclear DNA.
Eukaryotic Cell 7, 2012–6 (2008).
15.
Kriaucionis, S. & Heintz, N. The nuclear DNA base 5-hydroxymethylcytosine is present
in Purkinje neurons and the brain. Science 324, 929–30 (2009).
16.
Liutkeviciute, Z., Lukinavicius, G., Masevicius, V., Daujotyte, D. & Klimasauskas, S.
Cytosine-5-methyltransferases add aldehydes to DNA. Nature Chemical Biology 5, 400–2
(2009).
17.
Wu, H. & Zhang, Y. Mechanisms and functions of Tet protein-mediated 5-methylcytosine
oxidation. Genes & Development 25, 2436–2452 (2011).
18.
Iyer, L. M., Tahiliani, M., Rao, A. & Aravind, L. Prediction of novel families of enzymes
involved in oxidative and other complex modifications of bases in nucleic acids. Cell
Cycle 8, 1698–710 (2009).
19.
Loenarz, C. & Schofield, C. J. Physiological and biochemical aspects of hydroxylations
and demethylations catalyzed by human 2-oxoglutarate oxygenases. Trends in
Biochemical Sciences 36, 7–18 (2011).
20.
Xu, W. et al. Oncometabolite 2-hydroxyglutarate is a competitive inhibitor of α-ketoglutarate-dependent dioxygenases. Cancer Cell 19, 17–30 (2011).
21.
Gommers-Ampt, J. H., Leeuwen, F. Van & Vliegenthart, J. F. G. A Novel Modified Base
Present in the DNA of the Parasitic Protozoan T. brucei. Cell 75, 1129–1136 (1993).
22.
Song, C.-X. et al. Selective chemical labeling reveals the genome-wide distribution of 5-hydroxymethylcytosine. Nature Biotechnology 29, 68–72 (2011).
23.
Song, C.-X. et al. Sensitive and specific single-molecule sequencing of 5-hydroxymethylcytosine. Nature Methods 9, 75–7 (2012).
24.
Globisch, D. et al. Tissue distribution of 5-hydroxymethylcytosine and search for active
demethylation intermediates. PloS ONE 5, e15367 (2010).
25.
Williams, K., Christensen, J. & Helin, K. DNA methylation: TET proteins—guardians of
CpG islands? EMBO Reports 13, 28–35 (2011).
26.
Hackett, J. A. et al. Germline DNA demethylation dynamics and imprint erasure through
5-hydroxymethylcytosine. Science 339, 448–52 (2013).
27.
Xiao, W. et al. Imprinting of the MEA Polycomb Gene Is Controlled by Antagonism
between MET1 Methyltransferase and DME Glycosylase. Developmental Cell 5, 891–901
(2003).
28.
Gehring, M. et al. DEMETER DNA glycosylase establishes MEDEA polycomb gene self-imprinting by allele-specific demethylation. Cell 124, 495–506 (2006).
29.
Choi, Y. et al. DEMETER, a DNA Glycosylase Domain Protein, Is Required for
Endosperm Gene Imprinting and Seed Viability in Arabidopsis. Cell 110, 33–42 (2002).
30.
Cortellino, S. et al. Thymine DNA glycosylase is essential for active DNA demethylation
by linked deamination-base excision repair. Cell 146, 67–79 (2011).
31.
Nabel, C. S., Manning, S. A. & Kohli, R. M. The curious chemical biology of cytosine:
deamination, methylation, and oxidation as modulators of genomic potential. ACS
Chemical Biology 7, 20–30 (2012).
32.
Bhutani, N., Burns, D. M. & Blau, H. M. DNA Demethylation Dynamics. Cell 146, 866–
872 (2011).
33.
Ito, S. et al. Tet proteins can convert 5-methylcytosine to 5-formylcytosine and 5-carboxylcytosine. Science 333, 1300–3 (2011).
34.
He, Y.-F. et al. Tet-Mediated Formation of 5-Carboxylcytosine and Its Excision by TDG
in Mammalian DNA. Science 333, 1303–1307 (2011).
35.
Shen, L. et al. Genome-wide Analysis Reveals TET- and TDG-Dependent 5-Methylcytosine Oxidation Dynamics. Cell 153, 1–15 (2013).
36.
Song, C.-X. et al. Genome-wide Profiling of 5-Formylcytosine Reveals Its Roles in
Epigenetic Priming. Cell 153, 1–14 (2013).
37.
Spruijt, C. G. et al. Dynamic Readers for 5-(Hydroxy)Methylcytosine and Its Oxidized
Derivatives. Cell 152, 1146–1159 (2013).
38.
Jin, S.-G., Wu, X., Li, A. X. & Pfeifer, G. P. Genomic mapping of 5-hydroxymethylcytosine in the human brain. Nucleic Acids Research 39, 5015–24 (2011).
39.
Wang, L., Chia, N. C., Lu, X. & Ruden, D. M. Hypothesis: Environmental regulation of 5-hydroxymethylcytosine by oxidative stress. Epigenetics 6, 853–856 (2011).
40.
New England Biolabs Incorporated. Isoschizomers. (2013). Available at
<https://www.neb.com/tools-and-resources/selection-charts/isoschizomers>
41.
Song, C.-X., Yu, M., Dai, Q. & He, C. Detection of 5-hydroxymethylcytosine in a
combined glycosylation restriction analysis (CGRA) using restriction enzyme Taq(α)I.
Bioorganic & Medicinal Chemistry Letters 21, 5075–7 (2011).
42.
Szwagierczak, A. et al. Characterization of PvuRts1I endonuclease as a tool to investigate
genomic 5-hydroxymethylcytosine. Nucleic Acids Research 39, 5149–56 (2011).
43.
Hayatsu, H. & Shiragami, M. Reaction of bisulfite with the 5-hydroxymethyl group in
pyrimidines and in phage DNAs. Biochemistry 18, 632–637 (1979).
44.
Huang, Y. et al. The behaviour of 5-hydroxymethylcytosine in bisulfite sequencing. PloS
ONE 5, e8888 (2010).
45.
Münzel, M. et al. Quantification of the sixth DNA base hydroxymethylcytosine in the
brain. Angewandte Chemie (International ed. in English) 49, 5375–7 (2010).
46.
Djuric, Z., Luongo, D. A. & Harper, D. A. Quantitation of 5-(hydroxymethyl)uracil in
DNA by gas chromatography with mass spectral detection. Chemical Research in
Toxicology 4, 687–691 (1991).
47.
Eick, D., Fritz, H. J. & Doerfler, W. Quantitative determination of 5-methylcytosine in
DNA by reverse-phase high-performance liquid chromatography. Analytical Biochemistry
135, 165–71 (1983).
48.
Rosch, L., John, P. & Reitmeier, R. Silicone Compounds, Organic. Ullmann’s
Encyclopedia of Industrial Chemistry (2000).
49.
Hayatsu, H., Wataya, Y., Kai, K. & Iida, S. Reaction of sodium bisulfite with uracil,
cytosine, and their derivatives. Biochemistry 9, 2858–2865 (1970).
50.
Flusberg, B. A. et al. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nature Methods 7, 461–5 (2010).
51.
Booth, M. J. et al. Quantitative sequencing of 5-methylcytosine and 5-hydroxymethylcytosine at single-base resolution. Science 336, 934–7 (2012).
52.
Brown, T. Dot and slot blotting of DNA. Current protocols in molecular biology / edited
by Frederick M. Ausubel ... [et al.] Chapter 2, Unit2.9B (2001).
53.
Lister, R. et al. Highly integrated single-base resolution maps of the epigenome in
Arabidopsis. Cell 133, 523–36 (2008).
54.
Van den Broeck, D. et al. Transposon Display identifies individual transposable elements
in high copy number lines. The Plant Journal 13, 121–9 (1998).
55.
Syed, N. H. & Flavell, A. J. Sequence-specific amplification polymorphisms (SSAPs): a
multi-locus approach for analyzing transposon insertions. Nature Protocols 1, 2746–52
(2006).
56.
Grzebelus, D., Jagosz, B. & Simon, P. W. The DcMaster Transposon Display maps
polymorphic insertion sites in the carrot (Daucus carota L.) genome. Gene 390, 67–74
(2007).
57.
Volkmar, M. et al. DNA methylation profiling identifies epigenetic dysregulation in
pancreatic islets from type 2 diabetic patients. The EMBO Journal 31, 1405–26 (2012).
58.
Dereeper, A. et al. Phylogeny.fr: robust phylogenetic analysis for the non-specialist.
Nucleic Acids Research 36, W465–9 (2008).
59.
Cutler, S., Ghassemian, M., Bonetta, D., Cooney, S. & McCourt, P. A Protein Farnesyl
Transferase Involved in Abscisic Acid Signal Transduction in Arabidopsis. Science 273,
1239–1241 (1996).
60.
Running, M. P., Fletcher, J. C. & Meyerowitz, E. M. The WIGGUM gene is required for
proper regulation of floral meristem size in Arabidopsis. Development 125, 2545–53
(1998).
Appendix A:
“Shotgun mapping”:
Mapping genetic mutations with high-throughput sequencing data
A.1: Introduction
Classical genetics utilized phenotype analysis and the principles of genetic linkage and
recombination to produce linkage, or genetic, maps of chromosomes. Since the advent of
molecular genetics, researchers have added PCR-based techniques and sequenced genomes to
create physical maps of chromosomes. Recent advances in high-throughput sequencing allow
researchers pursuing forward genetics projects to quickly and cheaply map genetic mutations to
specific genomic regions.
This appendix outlines an approach to using RNA-seq data for “rough” mapping of a genetic
mutation. Validating this approach with data from standard PCR-based mapping techniques and
Sanger sequencing, “shotgun mapping” has proven to be a valuable first step in mapping genetic
mutations. While this approach does not allow for the precision of certain PCR-based “fine”
mapping techniques, it does allow researchers to quickly and accurately discover a genomic
region (in this example, approx. 1 Mbase in Arabidopsis thaliana) on which they can focus
molecular and computational resources to discover the precise physical location of the mutant
locus. Additionally, while this example used RNA-seq data, data from any high-throughput
sequencing method could be used.
A.1: Background
A morphological mutant phenotype was discovered in an Arabidopsis thaliana Columbia‐0 (col)
ecotype h1.1 h1.2 double T-DNA mutant (At1g06760, SALK_128430C; AT2G30620,
GABI_406H11; Yvonne Kim, UC-Berkeley). Features of this mutant phenotype included
delayed germination, a compact stature (i.e., short internode length), abnormal numbers of floral
organs (specifically petals) and rosette leaves at maturity, abnormal phyllotaxy, late flowering,
and other rarer morphological defects. The mutant phenotype appeared in the F2 generation after
hybridization with a non‐mutant background and was thus assumed to be a recessive mutation.
Additionally, the mutation was present in plants that were not homozygous for the double
mutation and, in some cases, in plants homozygous wild‐type for both genes, suggesting that a
novel mutation had been discovered.
In order to map the mutant locus, mutant A. thaliana col plants were crossed with wild‐type A.
thaliana Landsberg erecta (ler) ecotype plants (mut col female × wt ler male). Plants of the F1
generation were allowed to self‐fertilize, and selfed F2 seed was used to produce a mapping
population. F2 individuals were phenotyped as wild‐type‐like or mutant based on abnormal floral
organ number phenotype, where plants with at least one flower having greater than four petals
were considered mutants. These two populations were then used in two separate mapping
experiments.
A.2: Shotgun mapping using RNA‐seq
Tissue from rosette leaves of mutant and wild‐type phenotype populations was collected. RNA
was extracted from the bulked tissue for creation of two mutant phenotype Illumina RNA‐seq
libraries (technical replicates) and one wild‐type‐like phenotype library. While the initial purpose of
creating the RNA‐seq libraries from the mapping population was to look in the mutant phenotype
population libraries for candidate genes that displayed aberrant expression levels compared to the
wild‐type‐like population library, we found a novel use of the data that allowed for mapping with
a resolution on par with classical and “rough” PCR‐based mapping.
Due to linkage to the col background mutation, F2 plants selected for the mutation should be
enriched for col loci and depleted in ler loci around the mutation, with bulk incidence of col loci
increasing and ler loci decreasing, respectively, on the mutation‐containing chromosomes while
walking toward the mutant locus (Fig. A.1). RNA‐seq reads were mapped to col and ler
genomes to compile data on gene expression. Ecotype‐specific mappings, based on
polymorphisms between col and ler genomes, were used as a genotype proxy, and this data was
analyzed to discover any regions in the mutant population RNA‐seq dataset where col‐specific
reads were overrepresented relative to ler‐specific reads.
Fig. A.1: Explanation of ratios of col to ler alleles from wild-type and mutant chromosomes
for mapping.
Assuming no contamination from the other selected population, the expectation is that the mutant
chromosomes used for libraries should contain all col alleles around the mutant locus, while the
wild type-like chromosomes used for libraries should contain ler and col alleles in a 2:1 ratio
(given a 2:1 ratio of heterozygous mutant to homozygous ler individuals) around the locus.
These numbers reflect the selection of plants homozygous for the mutant locus in the mutant
mapping population and either homozygous or heterozygous for the wild-type ler locus in the
wild type-like mapping population.
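The expected ratios described for Fig. A.1 can be checked with a short calculation. The sketch below assumes ideal 1:2:1 Mendelian segregation at the mutant locus in the F2 generation and a fully recessive mutation; the function name `pooled_allele_counts` is illustrative and was not part of the original analysis.

```python
from fractions import Fraction
from math import log2

# Ideal F2 genotype frequencies at the mutant locus (assumption: 1:2:1 segregation).
F2_GENOTYPES = {
    ("col", "col"): Fraction(1, 4),
    ("col", "ler"): Fraction(1, 2),
    ("ler", "ler"): Fraction(1, 4),
}

def pooled_allele_counts(selected):
    """Return relative col and ler allele frequencies in a pool of selected genotypes."""
    counts = {"col": Fraction(0), "ler": Fraction(0)}
    for genotype in selected:
        for allele in genotype:
            counts[allele] += F2_GENOTYPES[genotype]
    return counts

# Mutant pool: only plants homozygous for the col (mutant) allele are selected.
mutant = pooled_allele_counts([("col", "col")])
# Wild-type-like pool: heterozygotes plus plants homozygous for the ler allele.
wild_type_like = pooled_allele_counts([("col", "ler"), ("ler", "ler")])

ler_to_col = wild_type_like["ler"] / wild_type_like["col"]
print(ler_to_col)        # ler:col allele ratio in the wild-type-like pool
print(log2(ler_to_col))  # the same ratio on the log2 scale used in Fig. A.2
```

Under these assumptions the mutant pool carries no ler alleles at the locus, so any ler-specific reads observed there would reflect recombination near the locus, contamination, or mis-scored plants.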
The analysis of ecotype‐specific mappings of RNA‐seq data shows a region of chromosome 5 in
which col reads were on average approximately 16 times more prevalent than ler reads, with a
minimum in the ler:col read ratio being reached around Chr5:16 Mbase (Fig. A.2E). In this same
region of chromosome 5 in wild type-phenotype plants, col reads were approximately 2 times
less prevalent than ler reads. On no other chromosome did the read ratio diverge so greatly from
unity (Figs. A.2A-D). Thus, based on this analysis, the physical location of the mutant locus
appeared to be on chromosome 5 near the 16 Mbase mark.
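The kind of scan described above can be sketched as follows. This is a minimal illustration, not the analysis code used for Fig. A.2: the read counts are hypothetical, and the pseudocount and the three-gene averaging window are assumptions introduced here to keep the example stable.

```python
from math import log2

def log2_ler_col(ler_reads, col_reads, pseudocount=1):
    """log2(ler reads / col reads) for one gene; the pseudocount avoids division by zero."""
    return log2((ler_reads + pseudocount) / (col_reads + pseudocount))

def locate_minimum(genes, window=3):
    """Return the mean position (bp) of the window of consecutive genes
    with the lowest mean log2(ler/col) ratio."""
    ratios = [log2_ler_col(ler, col) for _, ler, col in genes]
    best_start = min(
        range(len(genes) - window + 1),
        key=lambda i: sum(ratios[i:i + window]) / window,
    )
    positions = [pos for pos, _, _ in genes[best_start:best_start + window]]
    return sum(positions) / len(positions)

# Hypothetical (position in bp, ler-specific reads, col-specific reads) per gene.
example_genes = [
    (12_000_000, 40, 60),
    (14_000_000, 15, 90),
    (15_500_000, 5, 95),
    (16_000_000, 2, 110),
    (16_500_000, 4, 100),
    (18_000_000, 20, 85),
    (20_000_000, 45, 55),
]

print(locate_minimum(example_genes))
```

Averaging over a window rather than taking the single lowest gene reduces the influence of genes with very few ecotype-specific reads.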
A.3: Validation of shotgun sequencing-based mapping with PCR-based mapping techniques
To validate the shotgun sequencing-based mapping approach and further resolve the region
containing the putative mutation, DNA extracted from tissue collected from mutant phenotype
individuals was used for PCR‐based mapping. Using INDEL markers (Table A.1), “rough”
mapping of all Arabidopsis chromosomes in mutant phenotypes of the F2 mapping population
indicated linkage of a section of Chromosome 5, from approximately 9.5 to 18.8 Mbases, to the
mutation. Finer mapping of a population of ~250 mutant phenotype individuals, using INDEL,
SSLP and CAPS markers, narrowed down the region to between approximately 15.8 and 16.5
Mbases. A further round of finer mapping narrowed the region to
between approximately 16.07 and 16.26 Mbases. A final round of “fine” mapping was completed
on five individuals from the general mapping population that were heterozygous for one of the
previous markers at 16.07 and 16.26 Mbase. This round reduced the mutation‐containing
region to approximately 119 kbases (approximately Chr5:16.083‐16.202 Mbases, TAIR10
coordinates), roughly between At5g40240 and At5g40460
(Table A.1).
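The "% COL (w/o NAs)" values in Table A.1 follow from the genotype counts at each marker. The sketch below recomputes them for the four INDEL markers that flank the region, using counts copied from Table A.1; the function name `percent_col` is illustrative.

```python
def percent_col(col, ler, het, na):
    """Percentage of scored individuals homozygous for col at a marker.

    Individuals without a genotype call (na) are excluded from the denominator.
    """
    scored = col + ler + het
    return 100 * col / scored

# (marker, TAIR8 Chr. 5 position in bp, COL, LER, HET, NA) copied from Table A.1.
markers = [
    ("5-24", 9478916, 157, 4, 44, 12),
    ("5-43", 15810645, 201, 0, 6, 9),
    ("5-48", 16533089, 205, 1, 7, 4),
    ("5-57", 18803896, 160, 12, 33, 12),
]

for name, position, col, ler, het, na in markers:
    print(f"{name}\t{position}\t{percent_col(col, ler, het, na):.1f}%")
```

The percentage of col homozygotes rises toward the centre of the interval and falls again on the far side, which is the pattern expected when the markers bracket the mutant locus.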
Computational searches uncovered a section of sequence in ERA1/WIGGUM (At5g40280) that
was missing from the mutant RNA-seq dataset (Daniel Zilberman, unpublished data)59,60. The
era1 phenotype has been shown to include defects in meristem organization, which explains the
observed phenotype discussed in Section A.1. Due to the similarities in phenotype and the
missing sequence fragments in the sequencing dataset, we further investigated the mutant
population’s era1 allele.
Fig. A.2: Shotgun mapping analysis. Analysis of ecotype-specific RNA-seq read mappings (A-E) for each Arabidopsis thaliana chromosome. The vertical axis shows log2(ler reads / col reads)
for every gene along the chromosome. Red, wild-type phenotype libraries; blue, mutant
phenotype libraries.
Table A.1: Summarized, selected PCR-based mapping data
Marker Name     Marker Type   TAIR8 Chr. 5 Position (bp)   COL   LER   HET   NA   TOTAL   % COL (w/o NAs)
5-24            INDEL         9478916                      157   4     44    12   217     76.6%
EcoRI-5.1432    CAPS          14316457                     74    1     7     1    83      90.2%
EcoRI-5.1484    CAPS          14840703                     76    0     5     2    83      93.8%
Hind3-5.1535    CAPS          15350649                     80    0     1     2    83      98.8%
5-43            INDEL         15810645                     201   0     6     9    216     97.1%
Hind3-5.1609    CAPS          16081197                     82    0     1     0    83      98.8%
EcoRI-5.1610    CAPS          16097698                     3     0     2     0    5       60.0%
DraI-5.1610     CAPS          16101040                     4     0     1     0    5       80.0%
DraI-5.1611     CAPS          16114939                     5     0     0     0    5       100.0%
SspI-5.1612     CAPS          16115519                     5     0     0     0    5       100.0%
SphI-5.1616     CAPS          16157378                     5     0     0     0    5       100.0%
SspI-5.1621     CAPS          16213089                     5     0     0     0    5       100.0%
SspI-5.1622     CAPS          16216179                     4     0     1     0    5       80.0%
BamHI-5.1624    CAPS          16239636                     3     0     2     0    5       60.0%
EcoRI-5.1624    CAPS          16243726                     3     0     2     0    5       60.0%
XbaI-5.1625     CAPS          16249866                     3     0     2     0    5       60.0%
EcoRI-5.1626    CAPS          16262592                     81    0     1     1    83      98.8%
5-48            INDEL         16533089                     205   1     7     4    217     96.2%
SSLP6317        SSLP          17040972                     74    0     3     6    83      96.1%
SSLP6437        SSLP          17363749                     79    0     4     0    83      95.2%
5-57            INDEL         18803896                     160   12    33    12   217     78.0%
Table A.1 (cont.)
Marker Name     Primer, forward           Primer, reverse
5-24            TGTGGCACAGGGTTTGTAAG      AAAGCCAGCCAATGTTTCAC
EcoRI-5.1432    GGAAATCTTCAAAACTTCAA      AGAGGTGCTCTGCTTAAATA
EcoRI-5.1484    GAGGAACTTGAAAAATGAGA      GATCATGAGAATTTTCCAAC
Hind3-5.1535    AACAGCATAAGAATCAAACC      ACATATCTTTTGGACTTTCG
5-43            GAAGTGTGGCTCTCCAATCC      AAAGCACAAGCCATTTGACC
Hind3-5.1609    GTTCAACCTCTGCAAATACT      TTACCCATCTCTGATTCTGT
EcoRI-5.1610    GAGGAACATAACACCCATAG      CTTACACCTCCATCACCTAC
DraI-5.1610     CTCCATCCTTAATGAGTCAC      TTCAACACTTCTCCTTCTTC
DraI-5.1611     AACTGAATTATCACGGATGT      CAATAAATCGATTCCCTAGA
SspI-5.1612     TCTAGGGAATCGATTTATTG      ACTTGGCTGTTATATTCAGG
SphI-5.1616     ACTAACCTATTTCCCCATTT      TGTTACACAAGCGATCTAAA
SspI-5.1621     ATCTCAAAGAAGGAGAGGAT      ACAAATCACCATTCAAGATT
SspI-5.1622     TTTTGCACCCTAAAAGTATT      AATCAGTTCATACGAAAAGG
BamHI-5.1624    ATTCACACTGGAAATTTGTT      AGAGTCAAGTCAAGAACGAG
EcoRI-5.1624    TTAACTCTGGCTTTTGATTT      CTAGTTACCTGAGTCCCTGA
XbaI-5.1625     TGGGAGTGACATAGAGAGAT      ATTGCATCTTTGTTTAGACC
EcoRI-5.1626    GTTTGCATAGGAAACAAAGT      CTAATTGCTTCAAAGAAACC
5-48            TGCGTTGCAAGAAATTATCG      AACACCAAAGCTGCCAGAAT
SSLP6317        CAGACGTATCAAATGACAAATG    GACTACTGCTCAAACTATTCGG
SSLP6437        AAGGATCTCGTCTTCAATAG      GTACTTAGCGTCGCACAC
5-57            GGACAAAGAGGGCGTTGATA      TCAGGCTGCAGTAGTTTGGA
1_mut_F2                       TGACAGCTCACTCTTTATGCATTCGTATCGCTGTTAATGCCATACCTTCA
5_mut_F2                       TGACAGCTCACTCTTTATGCATTCGTATCGCTGTTAATGCCATACCTTCA
9_mut_F2                       TGACAGCTCACTCTTTATGCATTCGTATCGCTGTTAATGCCATACCTTCA
12_mut_F2                      TGACAGCTCACTCTTTATGCATTCGTATCGCTGTTAATGCCATACCTTCA
13_mut_F2                      TGACAGCTCACTCTTTATGCATTCGTATCGCTGTTAATGCCATACCTTCA
COL_WT                         TGACAGCTCACTCTTTATGCATTCGTATCGCTGTTAATGCCATACCTTCA
LER_WT                         TGACAGCGCACTCTTTATGCATTCGTATCNCTGTTAATGCCATACCTTCA
REF_SEQ_At5g40280_exon6-exon8  TGACAGCTCACTCTTTATGCATTCGTATCGCTGTTAATGCCATACCTTCA
                               ******* ********************* ********************

1_mut_F2                       GTCAT---------------------------------------------
5_mut_F2                       GTCAT---------------------------------------------
9_mut_F2                       GTCAT---------------------------------------------
12_mut_F2                      GTCAT---------------------------------------------
13_mut_F2                      GTCAT---------------------------------------------
COL_WT                         GTCATGTTGTTTTTTTAATTCTTGCTTAATTCTACTTACTCACTGATCGT
LER_WT                         GTCATGTTGTTTTTTTAATTCTTGCTTAATTCTACTTACTCACTGATCGT
REF_SEQ_At5g40280_exon6-exon8  GTCATGTTGTTTTTTTAATTCTTGCTTAATTCTACTTACTCACTGATCGT
                               *****

1_mut_F2                       --------------------------------------------------
5_mut_F2                       --------------------------------------------------
9_mut_F2                       --------------------------------------------------
12_mut_F2                      --------------------------------------------------
13_mut_F2                      --------------------------------------------------
COL_WT                         TAGGATGCATGATATGGGAGAAATGGATGTTCGTGCATGCTACACTGCAA
LER_WT                         TAGGATGCATGATATGGGAGAAATGGATGTTCGTGCATGCTACACTGCAA
REF_SEQ_At5g40280_exon6-exon8  TAGGATGCATGATATGGGAGAAATGGATGTTCGTGCATGCTACACTGCAA

1_mut_F2                       -TTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTTTGTGGA
5_mut_F2                       -TTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTTTGTGGA
9_mut_F2                       -TTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTTTGTGGA
12_mut_F2                      -TTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTNTNNNNA
13_mut_F2                      -TTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTTTGTGGA
COL_WT                         TTTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTTTGTGGA
LER_WT                         TTTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTTTGTGGA
REF_SEQ_At5g40280_exon6-exon8  TTTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTTTGTGGA
                                ****************************************** *    *
Fig. A.3: Alignments of Sanger sequencing data. A region around exon 7 of ERA1 in mutant
mapping population (“mut_F2”) and wild-type (“WT”) individuals was PCR amplified, Sanger
sequenced, and aligned against a reference sequence. Red, intron 6; blue, exon 7; orange, intron
7.
A.4: Discovery of a genetic mutation using shotgun sequencing data and Sanger sequencing
RNA-seq data alignments of the region showed that the missing sequence corresponded to exon
7 of ERA1. It was hypothesized that the missing exon in ERA1 cDNA was due to either a
misspliced pre-mRNA (arising as a consequence of a DNA point mutation) or a deletion of all or
part of the exon in the mutant genome. After PCR amplification of the region around exon 7 of
ERA1 of five mapping population mutant individuals and col and ler wild-type individuals, the
fragments were Sanger sequenced. Analysis of the Sanger sequencing data showed that a 96 bp
region of ERA1 intron 6 and exon 7 was missing in all mutant individuals but not in col or ler
wild-type individuals (Fig. A.3).
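The size of the deletion can be read directly from the gap columns of the alignment. The sketch below is illustrative only: the two aligned rows are transcribed from Fig. A.3 on the assumption that the mutant gap runs span 45, 50 and 1 columns of consecutive alignment blocks, and the helper `gap_runs` is a hypothetical name rather than part of MUSCLE or the original analysis.

```python
def gap_runs(aligned_row, gap="-"):
    """Return (start, length) for each run of gap characters in an aligned sequence."""
    runs, start = [], None
    for i, char in enumerate(aligned_row):
        if char == gap and start is None:
            start = i
        elif char != gap and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(aligned_row) - start))
    return runs

# Reference row (REF_SEQ_At5g40280_exon6-exon8), four 50-column blocks from Fig. A.3.
reference = (
    "TGACAGCTCACTCTTTATGCATTCGTATCGCTGTTAATGCCATACCTTCA"
    "GTCATGTTGTTTTTTTAATTCTTGCTTAATTCTACTTACTCACTGATCGT"
    "TAGGATGCATGATATGGGAGAAATGGATGTTCGTGCATGCTACACTGCAA"
    "TTTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTTTGTGGA"
)
# Mutant row (e.g. 1_mut_F2); the gap is written as 45 + 50 + 1 = 96 columns (assumption).
mutant = (
    "TGACAGCTCACTCTTTATGCATTCGTATCGCTGTTAATGCCATACCTTCA"
    "GTCAT" + "-" * 96 +
    "TTCGGTGAGTTTTACCAACTTCTATTTTCCTTTTCTCTGTTTTTGTGGA"
)

assert len(reference) == len(mutant)  # aligned rows must have equal length
runs = gap_runs(mutant)
print(runs)                                   # one (start, length) tuple per gap run
deleted = reference[runs[0][0]:runs[0][0] + runs[0][1]]
print(len(deleted))                           # length of the deleted reference sequence
```

The start index is relative to the amplified fragment shown in Fig. A.3, not to a genomic coordinate.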
Due to the deletion of this large section of intron 6 and exon 7, the intron 6/exon 7 splice
acceptor site is missing. To explain the lack of the remainder of exon 7 in spliced mRNA/cDNA,
it is hypothesized that the exon 6/intron 6 splice donor site uses the next available splice acceptor
site at the intron 7/exon 8 boundary, leading to a mature mRNA that is missing what remains of
exon 7 in the mutant individuals (Fig. A.3).
A.5: Summary
High-throughput sequencing data provides researchers with an ability to quickly and confidently
map genetic mutations. Using simple analytical techniques on this data, researchers pursuing
forward genetics projects can roughly map the location of a given mutation to relatively small
regions of chromosomes. In the example given here, a mutation in Arabidopsis thaliana was
mapped to approximately 1 Mbase using RNA-seq data from a mapping population. After rough
mapping, researchers can use this information to focus on finer mapping via PCR-based
techniques as well as detailed computational searches for potential mutant loci. As shown here
with molecular mapping data, computational results, and Sanger sequencing, “shotgun mapping”
can be a useful and accurate first step when mapping genetic mutations.
A.6: Methods
A.6.1: RNA-seq library construction and sequencing data analysis
Mature leaf tissue from control and mutant phenotype Arabidopsis thaliana greenhouse-grown
plants was frozen with liquid nitrogen and ground with mortar and pestle. Total RNA was
isolated and purified with a QIAGEN RNeasy Mini Plant kit. To purify mRNA from total RNA,
the QIAGEN Oligotex Direct mRNA Mini Kit (× 2) followed by the Invitrogen RiboMinus Plant
Kit was used. After isolation, mRNA was fragmented with Ambion RNA Fragmentation Reagents
(Invitrogen). Fragmented mRNA was used to synthesize double-stranded cDNA: first strand
synthesis used Invitrogen SuperScript III and random hexamer primers, followed by RNaseH
and DNA Pol I reactions for RNA digestion and second strand synthesis. All kits and enzymes
were used according to the manufacturers’ recommended protocols.
After cDNA synthesis, RNA-seq libraries were constructed using the end repair, 3’ A base
addition, adapter ligation, PCR amplification, and library analysis steps outlined in Section 2.7.7.
For RNA-seq libraries, unmethylated adapters were used and all reaction clean-up steps used the
QIAGEN MinElute PCR Purification Kit. RNA-seq libraries were sequenced at the
QB3/Genomic Sequencing Laboratory on Illumina HiSeq 2000 sequencers. Sequencing data was
processed and analyzed as outlined in Section 2.7.8.
A.6.2: PCR-based mapping
DNA was isolated from control and mutant phenotype Arabidopsis thaliana mature leaf tissue of
greenhouse-grown plants using the DNA extraction protocol outlined in Section 2.7.2, with the
exception of grinding tissue with extraction buffer directly in microcentrifuge tubes and skipping
the phenol:chloroform:isoamyl alcohol step. GoTaq Green Master Mix and reaction-appropriate
primers (Table A.1) were used for PCR amplifications according to manufacturer’s
recommended protocol. For CAPS markers, 5µl of each PCR product was digested according to
manufacturer’s recommended protocol with 0.5µl reaction-appropriate restriction enzyme, 2µl
enzyme buffer, and 12.5µl water. Products were analyzed on 1 to 4% (w/v) agarose TAE
electrophoresis gels, as appropriate for product size and needed resolution. Mapping data was
organized and analyzed in Microsoft Excel.
A.6.3: Sanger sequencing
DNA from homozygous mutants in the PCR mapping population was used for PCR-based
amplification of fragments of A. thaliana ERA1 and Sanger sequencing of amplified fragments.
After amplification, samples were cleaned using Agencourt AMPure XP magnetic beads and
prepared for Sanger sequencing at the UC-Berkeley DNA Sequencing Facility by adding
sequencing primers at the facility-suggested concentration. Fragment and TAIR reference
sequences were then aligned for analysis using MUSCLE.