Fine scale recombination variation in Drosophila melanogaster

University of Iowa
Iowa Research Online
Theses and Dissertations
Fall 2015
Andrew B. Adrian
University of Iowa
Copyright © 2015 Andrew Blake Adrian
This dissertation is available at Iowa Research Online: http://ir.uiowa.edu/etd/2175
Recommended Citation
Adrian, Andrew B. "Fine scale recombination variation in Drosophila melanogaster." PhD (Doctor of Philosophy) thesis, University of
Iowa, 2015.
http://ir.uiowa.edu/etd/2175.
Part of the Biology Commons
FINE SCALE RECOMBINATION VARIATION IN DROSOPHILA MELANOGASTER
by
Andrew B. Adrian
A thesis submitted in partial fulfillment
of the requirements for the Doctor of
Philosophy degree in Biology
in the Graduate College of
The University of Iowa
December 2015
Thesis Supervisor: Associate Professor Josep M. Comeron
Copyright by
Andrew Blake Adrian
2015
All Rights Reserved
Graduate College
The University of Iowa
Iowa City, Iowa
CERTIFICATE OF APPROVAL
PH.D. THESIS
This is to certify that the Ph.D. thesis of
Andrew Blake Adrian
has been approved by the Examining Committee
for the thesis requirement for the Doctor of Philosophy
degree in Biology at the December 2015 graduation.
Thesis Committee:
Josep Comeron, Thesis Supervisor
Ana Llopart
Robert Malone
Sarit Smolikove
Marc Wold
To my grandfather, Nathan Pearson, for instilling in me the value of curiosity.
ACKNOWLEDGEMENTS
Over the course of the last ten years, I have encountered many people who have had
an impact on my development as a scientist and as a thoughtful individual. As I approach
the final months of my doctoral education, it seems appropriate to acknowledge those
people who have meant the most to me. During my tenure as an undergraduate at the
University of Alabama in Huntsville, my mentor Bruce Stallsmith inspired me to follow my
own interests and taught me how to systematically analyze the world around me. There too I made
friends with exceptional individuals who encouraged me to pursue science. Derek and Nick,
I am eternally grateful for your friendship. At the University of Iowa, I have learned a great
deal from my adviser, Josep Comeron, who has taught me to figure things out on my own
rather than just giving me the answer and has provided thoughtful and well-reasoned
feedback throughout my journey in graduate school. To my committee members, Bob, Ana,
Marc, and Sarit: thank you for your precious time and helpful feedback. While riding the
sinusoidal wave of life, I met my soon-to-be wife, Claire, who has altered my worldview
more than any other. Thank you for always being there for me and for being you.
Furthermore, I am indebted to my family for their persistent and unwavering support. My
mother and father have always believed in me, and my sister’s pride in me is continually
flattering. Finally, I’d like to thank the proverbial “Reviewer #3” who is always able to bring
me down to reality and force me to question my own competence. Claire, I’m glad I thought
of that.
ABSTRACT
Natural variation is a principal component of biology. One process that affects levels
of natural variation is meiotic recombination, the process by which homologous
chromosomes break and exchange genetic information with one another during the
formation of gametes. Surprisingly, this factor that shapes levels of natural variation within
the genome also exhibits a great deal of variation. That variation in the distribution of
recombination rates manifests itself at many levels: within genomes, between individual
organisms, across populations, and among species. In most cases, the factors and
mechanisms responsible for the non-random patterning of recombination events across the
genome remain particularly elusive. Herein, I utilize a combination of bioinformatic and
molecular genetic approaches to better explain recombination patterning. I explore several
factors that are now known to contribute to the distribution of recombination events across
genomes. In particular, I demonstrate that transcriptional activity during meiosis is
associated with, and partially predictive of, crossing-over events in Drosophila melanogaster.
Additionally, I present a model that accounts for approximately 40% of the
variation in crossover rates in Drosophila based on the localization of several previously
identified DNA motifs. Lastly, I present preliminary data describing how recombination
patterns are altered under naturally stressful conditions, a key insight that is necessary for
uniting our findings across levels of variation in recombination rates. These findings
support a multifactorial model for crossover distribution that includes both genetic and
epigenetic factors, and they advance the field toward a comprehensive
understanding of recombination localization.
PUBLIC ABSTRACT
Meiotic recombination, the process by which parental chromosomes are broken and
repaired to generate a patchwork of genetic material, maintains genetic diversity by
allowing individual sites in the DNA to take on separate evolutionary trajectories. This
process of forming and repairing breaks in double-stranded DNA occurs non-randomly
across the genome. The reasons for the non-random distribution of breaks and the factors
that affect it are poorly understood in all organisms examined. Because recombination
shapes levels of genetic diversity, it is important to understand how and why recombination
varies, both to draw conclusions about the evolution of a species and to help
associate diseases with their causative genetic mutations.
The work presented in this thesis is an investigation of recombination rate variation
in a fruit fly model and the factors that affect its genomic distribution. In particular, I
present a study of meiotic gene expression and how recombination rates are affected by
genes that are ‘on’ during the formation of double-stranded breaks in the DNA. I also
present a study of DNA motifs and how these motifs may be predictive of recombination
rate variation. Lastly, I examine how recombination rates are altered under stressful
conditions to better understand how the many levels of recombination variation are
intertwined. My findings demonstrate that many factors—both genetic and epigenetic—are
involved in shaping variation in recombination rates, and a combination of each can account
for a large proportion of the variation we observe within genomes.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1: Introduction
  1.1 Meiotic Recombination
    1.1.1 A General Overview of Meiotic Recombination
    1.1.2 General Pathways of DSB Induction, Resolution
    1.1.3 Advantages of Recombination
    1.1.4 Recombination and Polymorphism
  1.2 Recombination Rate Variation
    1.2.1 The Many Levels of Recombination Rate Variation
    1.2.2 Differences Within & Between Species
    1.2.3 Sex-Biased Recombination
    1.2.4 Genomic and Environmental Variation in Recombination Rates
    1.2.5 Explaining Recombination Variation
  1.3 Chapter 1 References
CHAPTER 2: The Drosophila early ovarian transcriptome provides insight to the molecular causes of recombination rate variation across genomes
  2.0.1 Preface
  2.0.2 Abstract
  2.1 Background
  2.2 Results & Discussion
    2.2.1 General Patterns of the Drosophila Early Meiotic Transcriptome
    2.2.2 Differentially Expressed Genes in Early Meiotic Tissues
    2.2.3 New Genes and Isoforms
    2.2.4 Parent-of-Origin Effects in the Early Meiotic Tissue
    2.2.5 Transcription is Associated with Increased Recombination Rates
  2.3 Conclusions
  2.4 Methods
    2.4.1 Drosophila Stocks and Tissue Preparation
    2.4.2 Illumina Library Preparation and Sequencing
    2.4.3 Sequence Alignment and Expression Analyses
    2.4.4 Novel Gene Identification
    2.4.5 Parent-of-Origin Effects
    2.4.6 Genomic Distribution of Transcribed Genes
    2.4.7 Recombination vs Expression Analysis
  2.5 Supplementary Information
  2.6 Chapter 2 References
CHAPTER 3: In silico prediction of recombination rate variation across the Drosophila melanogaster genome based on multiple DNA motif analysis
  3.0.1 Preface
  3.0.2 Abstract
  3.1 Background
  3.2 Results & Discussion
    3.2.1 Generation of Genome-Wide landscapes of DNA motifs
    3.2.2 Variation in motif presence among chromosome arms
    3.2.3 Genomic co-occurrence of motifs
    3.2.4 Motif presence is correlated with crossover rates across the genome
    3.2.5 A predictive model of variation in crossover rates across the genome based on sequence motif occurrence
    3.2.6 Random Forests (RF) categorical modeling
    3.2.7 MARS modeling
  3.3 Conclusions
    3.3.1 Motif Composition
    3.3.2 Differences among Chromosome Arms
    3.3.3 Crossover Localization across the Genome
  3.4 Methods
    3.4.1 Motif Landscape Generation
    3.4.2 Model Generation and Attribute Selection
    3.4.3 Population-scaled high-resolution crossover maps
  3.5 Supplementary Information
  3.5 Chapter 3 References
CHAPTER 4: Environmental contribution to recombination variation
  4.1 Introduction
  4.2 Results & Discussion
    4.2.1 Identification of Stress-Inducing Conditions
    4.2.2 Recombination Frequency among Phenotypic Markers
  4.3 Materials and Methods
    4.3.1 Generation of Recombinants
    4.3.2 DNA Library Preparation
    4.3.3 Recombination rate estimation
  4.4 Chapter 4 References
CHAPTER 5: Conclusions & Future Directions
  5.1 Transcriptional Effects on Crossover Localization
  5.2 In silico Prediction of Recombination Rates Based on Multiple Motifs
  5.3 Environmental Impact on Recombination Rates
  5.4 Summary
  5.5 Chapter 5 References
LIST OF TABLES
Table 2.1 mRNA-seq statistics for each sample
Table 2.2 Top ten differentially expressed genes by fold-change
Table 2.3S Top Enriched GO Terms among genes with parent-of-origin effects with maternal-like transcript levels for Early- and Late-Ovarian samples
Table 2.4S Top Enriched GO Terms among genes with parent-of-origin effects with maternal-like transcript levels
Table 3.1S Statistics of motif presence in D. melanogaster
Table 3.2S Summary of MARS models of crossover distribution in D. melanogaster
LIST OF FIGURES
Figure 1.1 Double strand break repair and recombination during meiosis
Figure 2.1 Comparison of Log10 FPKM values
Figure 2.2 Transcriptional differences between autosomes and the X chromosome
Figure 2.3 Relationship between transcription and recombination rates
Figure 2.4 Relative presence of DSBs across the genome
Figure 2.5S FPKM distribution density across genes
Figure 2.6S Volcano plot of genes by significance
Figure 2.7S Early vs Late Log2 fold difference histogram
Figure 2.8S P-value distribution following normalization
Figure 3.1 Genomic Landscape of Motif 3
Figure 3.2 Probability heatmap of correlation between motif presence and crossover rates for different chromosome arms
Figure 3.3 True-positive rate generated by a Random Forests (RF) model
Figure 3.4 MARS predictive models of crossover rates
Figure 3.5S Effect of false discovery thresholds on motif presence and sequence
Figure 3.6S Motif logos and individual Spearman’s correlation with crossover rates
Figure 3.7S Correlation between motif presence and crossover rates for different chromosome arms
Figure 3.8S LASSO coefficient paths
Figure 4.1 Surviving offspring in media containing acetic acid
Figure 4.2 Map distances between select phenotypic markers under acid stress
LIST OF ABBREVIATIONS
AUC       Area under the curve
cM        Centimorgan
CO        Crossover
CoHR      Common homology region
CV        Cross-validation
DGRP      Drosophila genetic reference panel
DNA       Deoxyribonucleic acid
DSB       Double strand break
FDR       False discovery rate
FPKM      Fragments per kilobase of transcript per million mapped reads
GC        Gene conversion
GCV       Generalized cross-validation
H3K4me3   Histone 3 lysine 4 trimethylation
HR        Homologous recombination
Kb        Kilobase
LASSO     Least absolute shrinkage and selection operator
MARS      Multivariate adaptive regression splines
NHEJ      Non-homologous end joining
PCR       Polymerase chain reaction
PFM       Position frequency matrix
RAL       Raleigh, NC population
RF        Random forests
RG        Rwanda population
RNA       Ribonucleic acid
ZI        Zambia population
CHAPTER 1: Introduction
1.1 Meiotic Recombination
1.1.1 A General Overview of Meiotic Recombination
Meiosis (from the Greek meioun, meaning “to make small”) is the highly regulated process
in which diploid progenitor cells undergo reductional and equational divisions yielding
haploid gametes ready for fertilization. Meiosis, the process by which gametes such as
egg and sperm are created, ensures the proper passage of genetic material from one
generation to the next in sexually reproducing organisms. The process is tightly regulated at
many points, as failure can have disastrous consequences for
the next generation of organisms. One particular process within meiosis is of greatest
interest to me—recombination. During the process of recombination, cells sacrifice the
inherent stability of their double-stranded DNA molecules and intentionally form breaks in
the DNA. This dance by which chromosomes are duplicated, broken and repaired,
separated, and segregated into cells containing half of the parental DNA content leads to
gametes ready for fertilization in order to establish the next generation.
The process of meiosis is heavily conserved across evolutionary time, as sex
presents itself as the rule, rather than the exception, throughout the great majority of
animal life. While many organisms are capable of asexual reproduction, sex appears to be
ubiquitous in Eukarya due to the need to generate recombinant offspring. Indeed, the
machinery necessary for the interchange of homologous DNA seems to have evolved well
before the appearance of eukaryotes as a mechanism for DNA repair. Intriguingly, while the
process as a whole is heavily conserved, organisms display wide diversity in the
specifics of their meiotic programs. However, one aspect of meiosis
that appears nearly universal is the process of meiotic recombination. One
notable exception is that males of D. melanogaster do not exhibit recombination.
Very few multicellular organisms are known to lack recombination (and indeed, many, if
not most, unicellular organisms possess the capability). Outside of meiosis, homologous
recombination exists as a repair mechanism for sporadic double-stranded breaks (DSBs)
and other types of damage in somatic cells (Kuzminov 1999, Wyman and Kanaar 2006).
Within meiosis, homologous recombination is the process in which homologous
chromosomes sacrifice the inherent stability of their double-stranded DNA molecules by
programmatically forming double-stranded breaks in the DNA, and then repairing those
breaks with their homologous partner, generating recombinant chromosomes containing a
patchwork of genetic information from both parental chromosomes. These breaks and their
subsequent repair manifest themselves as either crossing-over (CO) events (Figure 1.1), in
which a double Holliday junction is formed and resolved, or as gene-conversion (GC)
events, where a strand of DNA is used as the repair template in either a Holliday junction-dependent
(as a crossover that is repaired symmetrically) or -independent (via synthesis-dependent
strand annealing) manner, yielding unequal resultant products (Mehrotra, Hawley et al.
2008). During meiosis, all DSBs are repaired through either CO or GC (Figure 1.1), and thus
the sum of both events equals the total DSBs, though this thesis is primarily concerned with
those DSBs that are repaired as CO events. While small tracts of heteroduplex DNA are in
fact formed in CO, I will exclude these regions in practice, referring to all non-crossovers as
just GC, and all crossovers as CO. The simple ability of organisms to reorganize genetic
variation within a population over generations yields extraordinary consequences for the
evolution of the species. The primary focus of this thesis will be the distribution of these
breaks and their subsequent repair as crossovers in meiosis. As we shall see, this process is
important evolutionarily for the maintenance of diversity and plays an integral role in
sculpting the genomes of species.
Figure 1.1 Double strand break repair and recombination during meiosis
The DSB repair via double Holliday junction can generate either crossover (CO) or non-crossover (gene
conversion, GC) events while the Holliday junction-independent repair (synthesis-dependent strand annealing,
SDSA) mechanism causes only GC events. (From Comeron et al. 2012)
1.1.2 General Pathways of DSB Induction, Resolution
Because organisms maintain the same number of chromosomes across
generations, chromosome number must be reduced by half before fertilization. Before this
reductional division, the process of recombination occurs, generating a chromosomal
patchwork that causes each gamete to be genetically distinct. The steps of meiosis
(replication, recombination, and reduction) are carefully controlled to produce
functional gametes and to avoid mistakes that could cause difficulties for the next
generation.
Early work on meiosis relied mainly on observations of
segregating chromosomes by light microscopy (reviewed in Zickler and Kleckner 1999).
Subsequently, researchers developed mutants and more advanced methods of detecting
meiotic defects (Cox and Game 1974, Game and Mortimer 1974, Hopper 1975). Much of our
current understanding comes from work that was performed in the budding yeast,
Saccharomyces cerevisiae. As the steps and control mechanisms of meiosis have been found
to be widely conserved, these yeast experiments have provided significant insight into the
process of meiosis in other organisms (Roeder 1995, Zickler and Kleckner 1999).
Prior to the first division in meiotic prophase I, homologous chromosomes pair,
align, and undergo homologous recombination with one another. In most organisms
recombination is mechanically necessary to mediate the pairing of homologous
chromosomes during prophase I. The details of, and relationships between, synapsis and
recombination vary among organisms (Santos 1999), but these basic processes appear
generally conserved, and the steps of each, despite organism-specific differences,
seem highly regulated (John 2005). For instance, while the initiation of
recombination is not required for synapsis in Drosophila (McKim, Green-Marroquin et al.
1998, Da Ines, Gallego et al. 2014) and C. elegans (Bhalla and Dernburg 2008, Da Ines,
Gallego et al. 2014), recombination is essential for synapsis in Saccharomyces and all
mammals thus far examined (Weiner and Kleckner 1994, Gerton and Hawley 2005, Zickler
and Kleckner 2015). Where recombination is required for proper meiosis,
disruption of any step leading to the formation or resolution of DSBs is catastrophic.
In all recombining organisms, the formation of DSBs is catalyzed by a highly
conserved topoisomerase-like protein resembling the Spo11 (Mei-w68 in Drosophila)
protein from yeast (McKim and Hayashi-Hagihara 1998). This protein has been identified as an
element necessary for DSB induction, on which recombination ultimately depends.
In yeast, at least nine other accessory proteins have been identified as required
alongside Spo11 (MEI4, MER2, REC102, REC103, REC104, REC114, MRE11, RAD50, XRS2, as
well as MER1 and MRE2) (Keeney 2001, Jiao, Salem et al. 2003, Prieler, Penkner et al. 2005).
Most of these genes were identified as required for the repair of radiation-induced DNA
damage in S. cerevisiae. Spo11 catalyzes breaks in the DNA through a trans-esterification
reaction involving a covalent protein-DNA intermediate and is subsequently removed
through endonucleolytic cleavage (Reviewed in Keeney 2008) by the Mre11-Rad50-Xrs2
(MRX) complex (Symington 2002).
In most cases, crossing over (CO) or gene conversion (GC) can be
induced with exogenous DSB-inducing agents, but Spo11 is necessary in their absence
(Keeney 2008). Though the distribution of DSBs is demonstrably non-random, the precise
mechanism by which Spo11 and its cofactors determine sites to cleave is unknown, and
will be a subject of discussion throughout this thesis (Petes 2001, de Massy 2003, Kauppi,
Jeffreys et al. 2004). Though there is some evidence in S. pombe that Spo11 may possess
a DNA binding domain, no consistent recombination-associated signal for the
recruitment of Spo11 has thus far been identified (DeWall, Davidson et al. 2005). It is
possible that epigenetic labeling of chromatin influences recombination sites, as
trimethylation of histone 3 lysine 4 (H3K4me3) is enriched near sites of DSB
formation, and has further been shown to be recruited to the chromosomal axis by Spp1,
where recombination events are carried out (Sollier, Lin et al. 2004, Borde, Robine et al.
2009, Acquaviva, Szekvolgyi et al. 2013, Sommermeyer, Beneut et al. 2013, Sun, Huang et al.
2015).
Many breaks are formed across a chromosome while only a few are repaired as
crossovers. As an example, Mus musculus generates 300-400 double-stranded breaks,
whereas only a few (20-35) are repaired as crossovers (Plug, Xu et al. 1996, Kolas,
Svetlanov et al. 2005, Baudat and de Massy 2007). However, whether sites enriched for
DSBs are always enriched for COs remains unclear, though some experiments in S. pombe
indicate that some DSB hotspots are preferentially repaired with the sister chromatid
(Hyppa and Smith 2010). It remains to be seen how precisely the decision to repair DSBs as
crossovers is made.
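As a rough illustration of the arithmetic behind the mouse example above, the following sketch computes the share of DSBs that would be repaired as crossovers from the quoted ranges (300-400 DSBs, 20-35 crossovers). The function name and the pairing of range endpoints are assumptions made for illustration only. The cited studies are not described here as reporting the two ranges jointly, so the result merely brackets the plausible fraction.

```python
def crossover_fraction(n_crossovers: int, n_dsbs: int) -> float:
    """Return the fraction of DSBs that are repaired as crossovers."""
    if n_dsbs <= 0:
        raise ValueError("n_dsbs must be positive")
    return n_crossovers / n_dsbs


# Ranges quoted in the text for Mus musculus.
dsb_range = (300, 400)  # DSBs formed per meiosis
co_range = (20, 35)     # DSBs repaired as crossovers

# Assumption: pair the extremes to bracket the fraction
# (fewest crossovers with most DSBs, and the reverse).
lowest = crossover_fraction(co_range[0], dsb_range[1])   # 20 / 400
highest = crossover_fraction(co_range[1], dsb_range[0])  # 35 / 300

print(f"Crossover share of DSBs: {lowest:.1%} to {highest:.1%}")
```

Under these assumptions, roughly one DSB in ten or fewer becomes a crossover, with the remainder repaired as non-crossover events.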
For a crossover event to occur, DSBs are exonucleolytically resected on the 5’ end of
the DNA molecule by the MRN complex, leaving a single-stranded 3’ overhang that is free to
invade a homologous chromosome sequence (Symington 2002, Yin and Smolikove 2013,
Symington 2014). The single-stranded overhang becomes bound by Rad51, which
promotes pairing and strand exchange (Sung and Robberson 1995). The invading strand is
extended and recaptured by the broken chromatid, forming a double Holliday junction
(Figure 1.1). Throughout this process, a proteinaceous scaffolding called the synaptonemal
complex is erected to stabilize the synapsed homologs. Not all of the Holliday
junctions formed will be repaired as crossovers; some instead form only small (<1000 bp) tracts
of heteroduplex DNA known as non-crossover gene conversion events (Bhalla and
Dernburg 2008, Comeron, Ratnappan et al. 2012, Padhukasahasram and Rannala 2013).
Both events, crossovers and non-crossover gene conversions, generate recombinant
products but affect nucleotide polymorphism of populations at different scales—crossovers,
while relatively rare, are large evolutionary events while gene conversion events are much
more prevalent and act on short, sub-1kb, scales (Ohta 1976, Ohta 1976, Walsh 1983,
Padhukasahasram and Rannala 2013). The question remains, however: why does such a
complicated system facilitating sexual reproduction exist?
1.1.3 Advantages of Recombination
For recombination to exist as pervasively as it does (as a product of natural
selection) the advantages of recombination must outweigh the risks—but what are those
advantages? The interchanging of bits of genetic information between homologous
chromosomes serves several important functions, both mechanistic and evolutionary.
Mechanistically, recombination is critical for ensuring proper chromosomal pairing and
disjunction in the reductional division. The formation of DSBs and their subsequent repair
can, in many organisms, often only properly occur when chromosomes are synapsed. A
notable exception to this is Drosophila melanogaster, in which males show no evidence for
meiotic recombination. Meiotic crossovers help provide tension to the chromosome as
homologous chromosomes align in preparation for separation—aiding the proper unipolar
orientation towards the spindle poles (Bhalla and Dernburg 2008).
Non-meiotic homologous recombination is an important process by which sporadic
somatic DSBs may be repaired. In the event that homologous recombination fails to repair
DSBs, the error-prone non-homologous end-joining (NHEJ) repair mechanism may be
employed in some organisms such as C. elegans (Joyce, Paul et al. 2012, Yin and Smolikove
2013), potentially leading to fragmentation, chromosomal rearrangements (e.g.
translocations), or deletion of genetic information (Bhalla and Dernburg 2008). For that
reason NHEJ, where tested, appears to be suppressed during meiosis (Joyce, Paul et al.
2012). In metazoans, failure to repair completely shunts the meiotic cell towards apoptosis
(Gumienny, Lambie et al. 1999). HR is therefore the preferred mechanism of repair, as
alternatives are suppressed and failure of HR leads to death of the developing oocyte.
1.1.4 Recombination and Polymorphism
Evolutionarily, recombination has been extensively studied for its role in genetic
variation (Begun and Aquadro 1992, Rockman, Skrovanek et al. 2010, Comeron 2014,
Wallberg, Glemin et al. 2015), limiting accumulation of deleterious mutations (Comeron,
Kreitman et al. 1999, Bachtrog 2003), and enhancing rates of adaptation (Nordborg, Hu et
al. 2005, Presgraves 2005, Larracuente, Sackton et al. 2008). Recombination acts to free
linked alleles from one another, untangling loci that would otherwise be tied to the fate of
other proximal loci. This phenomenon is colloquially known as the Hill-Robertson effect (Hill
and Robertson 1966), whereby selection at one site in the genome limits the effect of
selection at another site when recombination is reduced or absent (Felsenstein 1974). In
these arguments, naturally occurring linkage disequilibria—nonrandom associations
between alleles—form due to the sheer nature of the DNA polymer chain, and this
impairs the ability of selection to act efficiently at individual sites (Kliman and Hey 1993,
Hey and Kliman 2002, Comeron, Williford et al. 2008). Because of linkage disequilibrium,
sites that are linked to selected sites also experience reduced effective population size (Ne).
Recombination has the effect of increasing Ne and it consequently enhances a population’s
ability to respond to selection (see below). This plays out in two models of selection. In the
case of positive selection, alleles are pushed to high frequency through selective sweeps and
carry with them any linked variants. This has the net effect of reducing genetic variation in
areas proximal to the selected locus when linked alleles replace all other variants in the
region. Similarly, in models of background (negative) selection, deleterious alleles are
removed from populations, and linked variation is removed along with them.
Once again, this results in a net reduction of genetic diversity that may be ameliorated by
recombination breaking linkage between genes, allowing selection to act on smaller regions
of the genome. These effects have been well established, especially in Drosophila (Berry,
Ajioka et al. 1991, Kliman and Hey 1993, Charlesworth 1996, Barton and Charlesworth
1998, McVean and Charlesworth 2000, Marais, Mouchiroud et al. 2001, Hey and Kliman
2002, Kliman and Hey 2003). In fact, it is this phenomenon of reduced diversity that many
evolutionary biologists search for as a hallmark of selection (Betancourt and Presgraves
2002, Smith and Eyre-Walker 2002, Kim and Nielsen 2004). Thus, in order to reach accurate
conclusions regarding selection, researchers need to understand the effects of
recombination. Recombination enables selection—either positive or negative—to act at
more precise points within the genome, increasing the efficacy of selection and preserving
important nucleotide diversity that fuels the evolutionary process.
Recombination affects levels of standing variation within a population—giving
organisms the possibility of an immediate response to environmental change. Genetic
variation is of paramount importance to the overall existence of a species, as without fuel to
the fires of evolution, species are extinguished—failing to adapt in changing conditions.
Genetic variation is the major source of phenotypic and adaptive differences we observe
from organism to organism, species to species. Therefore, our understanding of life through
the lens of genetics is based upon the study of genetic variation in the genomes of living
organisms. Herein, I present an effort to further understand how rates of meiotic
recombination are distributed across genomes in an attempt to better understand how this
event shapes genomic patterns of nucleotide variation.
1.2 Recombination Rate Variation
1.2.1 The Many Levels of Recombination Rate Variation
Given how recombination rates alter the distribution, or “landscape”, of standing
polymorphisms across the genome, it is perhaps surprising that the process of
recombination itself possesses significant variation (Jensen-Seaman, Furey et al. 2004,
Coop, Wen et al. 2008, Smukowski and Noor 2011, Comeron, Ratnappan et al. 2012).
Recombination variation presents itself at several levels: between species, among
populations of the same species, among individuals of the same population, between sexes,
and within individual genomes. Using traditional mapping techniques, early researchers
quickly discovered that the rate of crossing over was “…one of the most variable
phenomena known” (Gowen 1919, Detlefsen and Clemente 1923). In fact, it was this
variability that was key to solving the Castle-Morgan debate of the linear arrangement of
genes on chromosomes (Sturtevant, Bridges et al. 1919). In some species, such as
humans, most of the genome is a veritable recombination desert (recombination is
infrequent) with a few hotspot oases (regions where recombination is hundreds to
thousands of times more frequent than the genome average) harboring >60% of all
recombination events (Petes 2001, Myers, Bottolo et al. 2005, Serrentino and Borde 2012).
As the focus of this thesis is on Drosophila, it must be noted that D. melanogaster does not
possess classical hotspots enriched thousand-fold, but rather regions with 10- to 20-fold
increases (Hey 2004, Comeron, Ratnappan et al. 2012). Why some regions are more or less
likely to recombine than others is a complex question that remains largely unanswered, and
it is the focus of this thesis. In the following sections I provide evidence for the many
levels of variation that we observe in recombination rates.
1.2.2 Differences Within & Between Species
Just as a honeybee is morphologically distinct from a human, so too do the macro-scale
recombination rates of the two differ. The genome-wide average recombination rate for
honeybees is approximately 22 centiMorgans per megabase (cM/Mb) (Gempe, Hasselmann
et al. 2009, Wallberg, Glemin et al. 2015) while in humans the rate is approximately 1.2
cM/Mb (Jensen-Seaman, Furey et al. 2004). Life histories of each organism may play a role,
too. For instance, while the budding yeast, S. cerevisiae, may not recombine every
generation in nature [perhaps only once in every 1000 generations in the related S.
paradoxus (Tsai, Bensasson et al. 2008)], when it does undergo meiosis, the genome-wide
average recombination rate reaches approximately 360 cM/Mb (Connallon and Knowles
2007, Tsai, Bensasson et al. 2008). Furthermore, the topology of recombination
variation can differ widely between species. C. elegans and D. melanogaster have similar
genome-average recombination rates (2.7 and 2.44 cM/Mb, respectively; Rockman and
Kruglyak 2009, Fiston-Lavier, Singh et al. 2010), but the distribution of COs is drastically
different: maximum CO rates occur near the edges of C. elegans chromosomes, whereas
Drosophila maxima occur centrally between centromeres and telomeres on each
chromosome arm (Hillers and Villeneuve 2003, Comeron, Ratnappan et al. 2012).
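To make the comparison of genome-wide averages concrete, the sketch below expresses each rate quoted above relative to the human rate, and shows the underlying definition of an average rate (genetic map length divided by physical length). The function names are illustrative assumptions, and the only numbers used are those cited in the text; the example call to `average_rate` uses placeholder lengths, not measurements.

```python
# Genome-wide average rates quoted in the text, in cM/Mb.
RATES_CM_PER_MB = {
    "honeybee": 22.0,
    "human": 1.2,
    "S. cerevisiae": 360.0,
    "C. elegans": 2.7,
    "D. melanogaster": 2.44,
}


def average_rate(genetic_length_cm, physical_length_mb):
    """Average recombination rate: genetic map length over physical length."""
    if physical_length_mb <= 0:
        raise ValueError("physical length must be positive")
    return genetic_length_cm / physical_length_mb


def fold_difference(species_a, species_b, rates=RATES_CM_PER_MB):
    """How many times higher the average rate is in species_a than in species_b."""
    return rates[species_a] / rates[species_b]


for name in RATES_CM_PER_MB:
    print(f"{name}: {fold_difference(name, 'human'):.1f}x the human rate")

# Placeholder lengths only: a 100 cM map spanning 50 Mb averages 2 cM/Mb.
print(average_rate(100.0, 50.0))
```

Note that two species with nearly identical averages, such as C. elegans and D. melanogaster, can still differ greatly in how crossovers are distributed along each chromosome, which a single average cannot capture.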
While large differences in recombination between species are not necessarily
surprising, the fact that large differences also exist within a species is much less easily
explained (Koehler, Cherry et al. 2002, Jeffreys and Neumann 2005, Stevison and Noor
2010, Heil and Noor 2012). In closely related populations of Mus musculus, Dumont and
Payseur (2011) identified significant variation in the number of meiotic MLH1 foci, a
marker of crossovers, with some populations differing from others by as much as 35%. Other
studies in Homo (Graffelman, Balding et al. 2007) and Drosophila (Comeron, Ratnappan et
al. 2012) have identified similar population-scale differences among members of the same
species. What remains to be determined, however, is whether the factors affecting inter-
and intra-species recombination rate variation are the same, and whether these factors are
shared at other levels of recombination variation.
1.2.3 Sex-Biased Recombination
Recombination also varies between sexes. It has been known for well over 100
years that some organisms are achiasmatic in one sex (that is, one sex lacks recombination),
a condition that has arisen independently at least 30 times throughout evolution (Charlesworth
2005). In general, certain patterns seem to be pervasive, which may suggest a constant
evolutionary trend. For example, when there is recombination in only one sex, it is almost
always the homogametic sex, and when there is recombination in both sexes, the
homogametic sex often has higher recombination rates, although there are a few exceptions
(Lenormand 2003, Hedrick 2007).
The homogametic sex possessing higher recombination rates is colloquially known
as the Haldane-Huxley rule, and is thought to be a side effect on autosomes of the
suppression of recombination between the sex chromosomes (Lenormand and Dutheil
2005). For those heterochiasmate species (those with recombination in both sexes) there
seems to be a great range in the ratio of hetero- to homogametic recombination rates—0.35
in the zebrafish, 0.62 in humans, and 0.83 in sheep (reviewed in Lenormand 2005).
Furthermore, the pattern of recombination varies with sex, with human males having
higher recombination at telomeric regions and females having higher recombination near
the centromere, as noted by Kong and others (Kong, Gudbjartsson et al. 2002), who
measured this using 1257 meiotic events detected within 5136 microsatellite markers. Why
this discrepancy in recombination between the sexes exists is a matter of considerable debate, with
several explanations offered by Haldane (1922), Trivers (1988), Burt et al. (1991) and
Lenormand (2003) (reviewed in Hedrick 2007). In particular, Lenormand and Dutheil (2005)
suggest that the sex with more intense haploid selection should have less recombination,
and indeed this explanation seems adequate for plants and most animals that have
repressed male recombination relative to their female counterparts.
1.2.4 Genomic and Environmental Variation in Recombination Rates
Intriguingly, and central to this thesis, recombination events occur non-randomly
within individual genomes. Complicating the search for causes of the distribution
of recombination rates is that recombination rates are altered under differing
environmental conditions within the same genome (Nachman 2002). These two
observations, that recombination events occur nonrandomly and are affected by
environmental conditions, suggest interplay between epigenetics (in the context of
thermodynamic alterations) and genomics in the localization of recombination events. Thus
far, it is unknown if variation within individual genomes is a result of the same factors as
variation at other levels. However, we do have many observations of the types of variables
that seem to influence recombination rates. For example, recombination rates have long
been known to be altered by temperature (Bridges 1927), food quality, maternal age
(Plough 1917, Charlesworth and Charlesworth 1985), and exposure to various substances
(Plough 1917, Bridges 1927). That recombination rates are altered under different
environmental conditions may lend hints to the underlying mechanisms of DSB formation
and/or repair with respect to localization. This idea will serve as the basis of the work
performed in chapter 4.
Whereas certain factors (radiation, age, and temperature) were found to alter
crossover frequency in Drosophila, others (food moisture content, starvation, and the
presence of ferric chloride) were found to have no significant effect (Plough 1917). It was
also apparent that certain chromosomal regions responded differentially to each factor
(Bridges 1927, Abdullah and Charlesworth 1974, Charlesworth and Charlesworth 1985), an
early suggestion of underlying chromosome structure effects that still largely
elude researchers. The use of multiple phenotypic markers and repeated crossing to
assay crossover frequency continues today. There are recent reports employing this
traditional genetic mapping technique in investigating fine scale recombination patterns
from the Noor Lab in D. pseudoobscura (Cirulli, Kliman et al. 2007, Kulathinal, Bennett et al.
2008), the Singh Lab in D. melanogaster (Singh, Stone et al. 2013, Singh, Criscoe et al. 2015),
and ours (Chapter 4), though more modern techniques involving high-throughput
sequencing have begun replacing traditional approaches in each of the above laboratories
as well. A hybrid approach incorporating both high-throughput sequencing and traditional
mapping is being undertaken by our lab, and will be further discussed in Chapter 4.
Recombination rates have been reasonably well studied at the level of entire
genomes. Whole genome distributions of recombination rates, or “Recombination
Landscapes”, reveal an incredible amount of thus far unexplained variation in
recombination rates. Recombination landscapes are different for each species; in humans,
sperm typing experiments revealed the punctate nature of recombination rates (Jensen-Seaman,
Furey et al. 2004, Graffelman, Balding et al. 2007, Coop, Wen et al. 2008). While
humans and Drosophila have similar sex-averaged recombination rates (albeit without
recombination in Drosophila males) the chromosomal distribution of recombination varies
greatly between the species (Jensen-Seaman, Furey et al. 2004). As a broad pattern, in most
eukaryotes, the rate of crossing over peaks near the center of each chromosomal arm, with a
great reduction of COs in heterochromatic telomeric and centromeric regions (Smukowski
and Noor 2011). As more information was gathered about recombination in humans
through the typing of sperm using PCR assays, researchers began to notice small regions of
the genome of 1-2kb in length with greatly enriched recombination (Jeffreys, Kauppi et al.
2001). These so-called 'hotspots' of recombination have been found in yeast and mammals,
and in humans may account for as much as 60% of recombination in as little as 6% of the
genome (Frazer and O'Keefe 2007, Paigen and Petkov 2010, Parvanov, Petkov et al. 2010).
This sporadic distribution of a few highly recombining regions can be, in part, explained by
the occurrence of the PRDM9-associated motif. This motif, originally identified by Myers
and colleagues (Myers, Freeman et al. 2008), represents a binding site for PRDM9, a histone
methyltransferase, but its role in recombination patterning was not apparent until two
groups independently linked it to mouse hotspots (Grey, Baudat et al. 2009, Parvanov,
Petkov et al. 2010) [The history of PRDM9 is reviewed in (Ségurel, Leffler et al. 2011)].
Other mammals have been shown to harbor PRDM9, though in some cases the PRDM9 gene
appears lost (Muñoz-Fuentes, Di Rienzo et al. 2011, Auton, Rui Li et al. 2013). In any case,
PRDM9 cannot explain the majority of recombination variance in any species thus far
examined, and furthermore it offers limited predictive power to identify hotspots a priori. For
example, the PRDM9 motif appears over 300,000 times in the human genome, but fewer than
30,000 hotspots have been identified, making it a poor predictor of recombination rates
(Ségurel, Leffler et al. 2011). Furthermore, there is evidence that when PRDM9 is knocked
down or absent (such as in dogs), recombination sites are shifted towards promoters and
recombination rates are more spread out, as is the case in Drosophila (Grey, Baudat
et al. 2009, Brick, Smagulova et al. 2012, Auton, Rui Li et al. 2013). In chapter 3, I present my
work on other recombination-associated DNA motifs (repeated strings of short, variable
nucleotide sequences) in an attempt to reveal if there are other associated and predictive
motifs within the genome of D. melanogaster.
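The figures quoted in this section can be turned into two simple ratios, sketched below. The first expresses how concentrated recombination is if 60% of events fall in 6% of the genome; the second gives an upper bound on how many PRDM9 motif occurrences could coincide with a hotspot, under the simplifying assumption (mine, not the cited studies') that each hotspot contains one motif copy. Function names are illustrative.

```python
def fold_enrichment(event_fraction, genome_fraction):
    """Event density inside a region relative to the genome-wide average."""
    return event_fraction / genome_fraction


def density_ratio_inside_vs_outside(event_fraction, genome_fraction):
    """Event density inside the region relative to the rest of the genome."""
    inside = event_fraction / genome_fraction
    outside = (1 - event_fraction) / (1 - genome_fraction)
    return inside / outside


# Figures quoted in the text: 60% of recombination in 6% of the genome.
hotspot_vs_average = fold_enrichment(0.60, 0.06)
hotspot_vs_rest = density_ratio_inside_vs_outside(0.60, 0.06)

# Over 300,000 motif occurrences but fewer than 30,000 hotspots: assuming one
# motif copy per hotspot, at most this share of motif copies marks a hotspot.
max_motif_share = 30_000 / 300_000

print(f"hotspots vs genome average: {hotspot_vs_average:.1f}-fold")
print(f"hotspots vs remaining genome: {hotspot_vs_rest:.1f}-fold")
print(f"upper bound on motif copies in hotspots: {max_motif_share:.0%}")
```

Because the quoted counts are themselves bounds ("over" and "fewer than"), the true share of motif copies associated with hotspots would be lower still under the same assumption.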
1.2.5 Explaining Recombination Variation
Having established the many levels at which species exhibit recombination
variation, I now turn to a discussion of how such variation has been supposed to come
about. Early speculation about the modification of crossover rates was that variation might
arise from how tightly wound chromosomes are. By this logic, winding of DNA would
alter the distance between genes and thus alter crossover frequency (Bridges 1927).
Further, while early researchers hypothesized that genic modifiers of recombination
existed, these researchers were unable to find a correlation between specific genes and
elevated or depressed crossover frequency (Gowen 1919). We now understand that
recombination rates are in fact modified by a variety of factors, which has made progress in
the field more difficult.
Rates of crossing over are known to yield to selection. Early studies by Kidwell
(1972) and Chinnici (1971) demonstrated that artificial selection for increased and (to a
much lesser degree) decreased rates of crossing over could be successful in Drosophila.
These results reveal that recombination rates, or at least crossing-over, exhibit heritable
variation, and are not a sole product of environmental conditions. Broad scale
interpretations of heritable recombination landscapes have led to attempts to quantify
recombination on a whole-chromosome scale by comparing genetic and physical maps and
fitting a seemingly appropriate function to the data that neglects most local variation (Hey
and Kliman 2002, Fiston-Lavier, Singh et al. 2010). Indeed, our present concept of genetic
maps hinges upon this idea of a species average recombination rate, but these maps are
often constructed using divergent strains or combine data from different environmental
conditions—a poor choice since both of these variables influence recombination rates.
These sorts of maps often correlate well with levels of polymorphism (Begun and Aquadro
1992, Hellmann, Ebersberger et al. 2003, Barton 2010, Comeron, Ratnappan et al. 2012).
Indeed, the best predictor of recombination rates, thus far, is the amount of nucleotide
variation. This has been observed in a vast array of species (Begun and Aquadro 1992,
Stephan and Langley 1998, Nachman 2002, Hellmann, Ebersberger et al. 2003).
Unfortunately, this appears to be an effect, rather than a cause, of increased
recombination (Maynard Smith and Haigh 1974, Charlesworth, Morgan et al. 1993). Due to
this chicken-or-egg problem, caution must be employed when interpreting genomic
signatures as causative agents of recombination.
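The map-comparison approach described above, and what it loses, can be illustrated with a minimal sketch. Here a single straight line stands in for whatever function is fitted in the cited studies (an assumption made for brevity), and the marker positions are hypothetical values chosen only to show the contrast between one fitted rate and the local, per-interval rates.

```python
def least_squares_slope(x, y):
    """Slope of the ordinary least-squares line through the points (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def interval_rates(physical_mb, genetic_cm):
    """Local rate (cM/Mb) between each pair of adjacent markers."""
    return [
        (g2 - g1) / (p2 - p1)
        for p1, p2, g1, g2 in zip(
            physical_mb, physical_mb[1:], genetic_cm, genetic_cm[1:]
        )
    ]


# Hypothetical markers along one chromosome arm (not real data).
physical_mb = [0.0, 5.0, 10.0, 15.0, 20.0]   # physical position, Mb
genetic_cm = [0.0, 5.0, 25.0, 45.0, 50.0]    # genetic position, cM

arm_average = least_squares_slope(physical_mb, genetic_cm)
local = interval_rates(physical_mb, genetic_cm)

print(f"single fitted rate: {arm_average:.2f} cM/Mb")
print("per-interval rates:", local)
```

In this toy example the single fitted rate sits between the low rates at the ends of the arm and the higher rates in the middle, which is the kind of local variation a broad-scale fit smooths away.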
Sequence motifs have also been found to be correlated with recombination rates in
some instances. Different regulatory motifs have been supposed to direct protein machinery
to specific locations within the genome to catalyze the induction of DSBs. These motifs
are typically known as hotspot motifs, as they have traditionally been identified as
enriched near punctate sites of heavy recombination. Specific motifs have been shown to
activate recombination hotspots (though they are neither necessary nor sufficient) in humans (Myers,
Freeman et al. 2008), mice (Baudet, Lemaitre et al. 2010), and S. pombe (Steiner, Steiner et
al. 2009, Steiner, Kohli et al. 2010). Less is known about the impact of motifs on other
organisms, especially in those without bona fide recombination hotspots. It is also unknown
if these motifs serve a distinctly structural function (by coding for a nuclease sensitive site,
for example), or if they represent sites of binding for transcriptional elements or nucleases.
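Locating occurrences of a degenerate sequence motif is the basic operation behind such analyses. The sketch below is one simple way to do it, translating IUPAC nucleotide codes into a regular expression; it is not the method used in the cited studies or in this thesis. The motif `CAYA` and the test sequence are hypothetical, only the forward strand is scanned, and the function names are my own.

```python
import re

# IUPAC degenerate nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Compile an IUPAC motif into a regex that also finds overlapping hits."""
    body = "".join(IUPAC[base] for base in motif.upper())
    # A zero-width lookahead lets overlapping occurrences each be reported.
    return re.compile(f"(?=({body}))")


def find_motif(sequence, motif):
    """Return 0-based start positions of every match on the forward strand."""
    pattern = motif_to_regex(motif)
    return [m.start() for m in pattern.finditer(sequence.upper())]


# Hypothetical motif and sequence, for illustration only.
positions = find_motif("ACACACAGTT", "CAYA")
print(positions)
```

A fuller analysis would also scan the reverse complement and compare motif density between high- and low-recombination regions; those steps are omitted here.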
Previous results demonstrating the plasticity of recombination in response to either
selection or environmental conditions suggest that the localization of DSB events is a
multifactorial problem that is unlikely to be resolved using classical or population genetics
techniques alone. For this reason, I have employed a variety of techniques to address this
growing question. To begin the process of fully understanding recombination rates, and in
order to accurately model the distribution of recombination rates across the genome, I
began with a highly simplistic model: $\mathrm{Rec}_i = \mathrm{Genetic}(\mathrm{Rec})_i + \mathrm{Epigenetic}(\mathrm{Rec})_i$, where the
recombination rate at some site $i$ is determined by the genetic recombination factors at site
$i$ plus the epigenetic factors affecting recombination at site $i$. In order to accurately
model this process, it is necessary to fill in these two parameters. In chapter 2, I detail my
attempt to define epigenetic factors affecting recombination and glean insight from the
meiotic transcriptional environment (study published in BMC Genomics, Adrian and
Comeron 2013). In chapter 3, I utilize simple primary DNA sequence in the search for
recombination-modifying motifs in order to better define factors on the genetics side of the
equation (study in review, Adrian, Cruz Corchado, Comeron 2015). Finally, in chapter 4, I
preliminarily examine the effects of environment on fine scale variation in recombination
rates. If we can properly explore the greatest epigenetic and genetic factors affecting
recombination, we should ideally be able to accurately model rates of recombination. With
such a model of recombination rates we will be able to better understand how nucleotide
variation is maintained in the genomes of recombining organisms, which has great
implications for the evolution of species.
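The additive model introduced above can be written down directly. The sketch below is only a literal transcription of that equation for a handful of sites; the per-site values are hypothetical placeholders rather than estimates, and how the genetic and epigenetic components would actually be measured is the subject of the following chapters.

```python
def predicted_rate(genetic, epigenetic):
    """Rec_i = Genetic(Rec)_i + Epigenetic(Rec)_i, evaluated for each site i."""
    if len(genetic) != len(epigenetic):
        raise ValueError("both components must cover the same sites")
    return [g + e for g, e in zip(genetic, epigenetic)]


# Hypothetical per-site contributions in cM/Mb (placeholders, not estimates).
genetic_part = [1.0, 2.5, 0.5]
epigenetic_part = [0.5, 1.0, 0.25]

rec = predicted_rate(genetic_part, epigenetic_part)
print(rec)
```

The additive form assumes the two components do not interact; any interaction between sequence and chromatin state would require extra terms.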
1.3 Chapter 1 References
Abdullah, N. F. and B. Charlesworth (1974). "Selection for reduced crossing over in
Drosophila melanogaster." Genetics 76(3): 447-451.
Acquaviva, L., L. Szekvolgyi, B. Dichtl, B. S. Dichtl, C. de La Roche Saint Andre, A. Nicolas and
V. Geli (2013). "The COMPASS subunit Spp1 links histone methylation to initiation of
meiotic recombination." Science 339(6116): 215-218.
Auton, A., Y. Rui Li, J. Kidd, K. Oliveira, J. Nadel, J. K. Holloway, J. J. Hayward, P. E. Cohen, J. M.
Greally, J. Wang, C. D. Bustamante and A. R. Boyko (2013). "Genetic recombination is
targeted towards gene promoter regions in dogs." PLoS Genet 9(12): e1003984.
Bachtrog, D. (2003). "Adaptation shapes patterns of genome evolution on sexual and
asexual chromosomes in Drosophila." Nat Genet 34(2): 215-219.
Barton, N. H. (2010). "Genetic linkage and natural selection." Philosophical Transactions of
the Royal Society of London. Series B: Biological Sciences 365(1552): 2559-2569.
Barton, N. H. and B. Charlesworth (1998). "Why sex and recombination?" Science
281(5385): 1986-1990.
Baudat, F. and B. de Massy (2007). "Regulating double-stranded DNA break repair towards
crossover or non-crossover during mammalian meiosis." Chromosome Res 15(5): 565-577.
Baudet, C., C. Lemaitre, Z. Dias, C. Gautier, E. Tannier and M. F. Sagot (2010). "Cassis:
detection of genomic rearrangement breakpoints." Bioinformatics 26(15): 1897-1898.
Begun, D. J. and C. F. Aquadro (1992). "Levels of naturally occurring DNA polymorphism
correlate with recombination rates in D. melanogaster." Nature 356(6369): 519-520.
Berry, A. J., J. W. Ajioka and M. Kreitman (1991). "Lack of polymorphism on the Drosophila
fourth chromosome resulting from selection." Genetics 129(4): 1111-1117.
Betancourt, A. J. and D. C. Presgraves (2002). "Linkage limits the power of natural selection
in Drosophila." Proceedings of the National Academy of Sciences, USA 99(21): 13616-13620.
Bhalla, N. and A. F. Dernburg (2008). "Prelude to a division." Annu Rev Cell Dev Biol 24:
397-424.
Borde, V., N. Robine, W. Lin, S. Bonfils, V. Geli and A. Nicolas (2009). "Histone H3 lysine 4
trimethylation marks meiotic recombination initiation sites." EMBO J 28(2): 99-111.
Brick, K., F. Smagulova, P. Khil, R. D. Camerini-Otero and G. V. Petukhova (2012). "Genetic
recombination is directed away from functional genomic elements in mice." Nature
485(7400): 642-645.
Bridges, C. B. (1927). "The Relation of the Age of the Female to Crossing over in the Third
Chromosome of Drosophila Melanogaster." J Gen Physiol 8(6): 689-700.
Burt, A., G. Bell and P. H. Harvey (1991). "Sex differences in recombination." Journal of
Evolutionary Biology 4(2): 259-277.
Charlesworth, B. (1996). "Background selection and patterns of genetic diversity in
Drosophila melanogaster." Genet. Res. 68(2): 131-149.
Charlesworth, B. and D. Charlesworth (1985). "Genetic variation in recombination in
Drosophila. I. Responses to selection and preliminary genetic analysis." Heredity 54(1): 71-83.
Charlesworth, B., M. T. Morgan and D. Charlesworth (1993). "The effect of deleterious
mutations on neutral molecular variation." Genetics 134(4): 1289-1303.
Chinnici, J. P. (1971). "Modification of recombination frequency in Drosophila. II. The
polygenic control of crossing over." Genetics 69(1): 85-96.
Cirulli, E. T., R. M. Kliman and M. A. F. Noor (2007). "Fine-scale crossover rate heterogeneity
in Drosophila pseudoobscura." J Mol Evol 64(1): 129-135.
Comeron, J. M. (2014). "Background selection as baseline for nucleotide variation across the
Drosophila genome." PLoS Genet 10(6): e1004434.
Comeron, J. M., M. Kreitman and M. Aguade (1999). "Natural selection on synonymous sites
is correlated with gene length and recombination in Drosophila." Genetics 151(1): 239-249.
Comeron, J. M., R. Ratnappan and S. Bailin (2012). "The Many Landscapes of Recombination
in Drosophila melanogaster." PLoS Genet 8(10): e1002905.
Comeron, J. M., A. Williford and R. M. Kliman (2008). "The Hill-Robertson effect:
evolutionary consequences of weak selection and linkage in finite populations." Heredity
(Edinb) 100(1): 19-31.
Connallon, T. and L. L. Knowles (2007). "Recombination rate and protein evolution in yeast."
BMC Evol Biol 7: 235.
Coop, G., X. Wen, C. Ober, J. K. Pritchard and M. Przeworski (2008). "High-resolution
mapping of crossovers reveals extensive variation in fine-scale recombination patterns
among humans." Science 319(5868): 1395-1398.
Cox, B. and J. Game (1974). "Repair systems in Saccharomyces." Mutat Res 26(4): 257-264.
Da Ines, O., M. E. Gallego and C. I. White (2014). "Recombination-independent mechanisms
and pairing of homologous chromosomes during meiosis in plants." Mol Plant 7(3): 492-501.
de Massy, B. (2003). "Distribution of meiotic recombination sites." Trends Genet 19(9): 514-522.
Detlefsen, J. A. and L. S. Clemente (1923). "Genetic Variation in Linkage Values." Proc Natl
Acad Sci U S A 9(5): 149-156.
DeWall, K. M., M. K. Davidson, W. D. Sharif, C. A. Wiley and W. P. Wahls (2005). "A DNA
binding motif of meiotic recombinase Rec12 (Spo11) defined by essential glycine-202, and
persistence of Rec12 protein after completion of recombination." Gene 356: 77-84.
Dumont, B. L. and B. A. Payseur (2011). "Genetic analysis of genome-scale recombination
rate evolution in house mice." PLoS Genet 7(6): e1002116.
Felsenstein, J. (1974). "The evolutionary advantage of recombination." Genetics 78(2): 737-756.
Fiston-Lavier, A. S., N. D. Singh, M. Lipatov and D. A. Petrov (2010). "Drosophila
melanogaster recombination rate calculator." Gene 463(1-2): 18-20.
Frazer, L. N. and R. T. O'Keefe (2007). "A new series of yeast shuttle vectors for the recovery
and identification of multiple plasmids from Saccharomyces cerevisiae." Yeast 24(9): 777-789.
Game, J. C. and R. K. Mortimer (1974). "A genetic study of x-ray sensitive mutants in yeast."
Mutat Res 24(3): 281-292.
Gempe, T., M. Hasselmann, M. Schiott, G. Hause, M. Otte and M. Beye (2009). "Sex
determination in honeybees: two separate mechanisms induce and maintain the female
pathway." PLoS Biol 7(10): e1000222.
Gerton, J. L. and R. S. Hawley (2005). "Homologous chromosome interactions in meiosis:
diversity amidst conservation." Nat Rev Genet 6(6): 477-487.
Gowen, J. W. (1919). "A Biometrical Study of Crossing Over. On the Mechanism of Crossing
Over in the Third Chromosome of Drosophila melanogaster." Genetics 4(3): 205-250.
Graffelman, J., D. J. Balding, A. Gonzalez-Neira and J. Bertranpetit (2007). "Variation in
estimated recombination rates across human populations." Hum Genet 122(3-4): 301-310.
Grey, C., F. Baudat and B. de Massy (2009). "Genome-wide control of the distribution of
meiotic recombination." PLoS Biol 7(2): e35.
Gumienny, T. L., E. Lambie, E. Hartwieg, H. R. Horvitz and M. O. Hengartner (1999). "Genetic
control of programmed cell death in the Caenorhabditis elegans hermaphrodite germline."
Development 126(5): 1011-1022.
Hedrick, P. W. (2007). "Sex: differences in mutation, recombination, selection, gene flow,
and genetic drift." Evolution 61(12): 2750-2771.
Heil, C. S. S. and M. A. F. Noor (2012). "Zinc Finger Binding Motifs Do Not Explain
Recombination Rate Variation within or between Species of Drosophila." PLoS ONE 7(9):
e45055.
Hellmann, I., I. Ebersberger, S. E. Ptak, S. Paabo and M. Przeworski (2003). "A neutral
explanation for the correlation of diversity with recombination rates in humans." Am J Hum
Genet 72(6): 1527-1535.
Hey, J. (2004). "What's So Hot about Recombination Hotspots?" PLoS Biol 2(6): e190.
Hey, J. and R. M. Kliman (2002). "Interactions between natural selection, recombination and
gene density in the genes of Drosophila." Genetics 160(2): 595-608.
Hill, W. G. and A. Robertson (1966). "The effect of linkage on limits to artificial selection."
Genetical Research 8(3): 269-294.
Hillers, K. J. and A. M. Villeneuve (2003). "Chromosome-wide control of meiotic crossing
over in C. elegans." Curr Biol 13(18): 1641-1647.
Hopper, A. K., J. Kirsch and B. D. Hall (1975). "Mating type and sporulation in yeast. II. Meiosis,
recombination, and radiation sensitivity in an aa diploid with altered sporulation control."
Genetics 80: 61-76.
Hyppa, R. W. and G. R. Smith (2010). "Crossover invariance determined by partner choice
for meiotic DNA break repair." Cell 142(2): 243-255.
Jeffreys, A. J., L. Kauppi and R. Neumann (2001). "Intensely punctate meiotic recombination
in the class II region of the major histocompatibility complex." Nat Genet 29(2): 217-222.
Jeffreys, A. J. and R. Neumann (2005). "Factors influencing recombination frequency and
distribution in a human meiotic crossover hotspot." Human Mol Genet 14(15): 2277-2287.
Jensen-Seaman, M. I., T. S. Furey, B. A. Payseur, Y. Lu, K. M. Roskin, C. F. Chen, M. A. Thomas,
D. Haussler and H. J. Jacob (2004). "Comparative recombination rates in the rat, mouse, and
human genomes." Genome Res 14(4): 528-538.
Jiao, K., L. Salem and R. Malone (2003). "Support for a Meiotic Recombination Initiation
Complex: Interactions among Rec102p, Rec104p, and Spo11p." Molecular and Cellular
Biology 23(16): 5928-5938.
John, B. (2005). Meiosis. New York, Cambridge University Press.
Joyce, E. F., A. Paul, K. E. Chen, N. Tanneti and K. S. McKim (2012). "Multiple barriers to
nonhomologous DNA end joining during meiosis in Drosophila." Genetics 191(3): 739-746.
Kauppi, L., A. J. Jeffreys and S. Keeney (2004). "Where the crossovers are: recombination
distributions in mammals." Nat Rev Genet 5(6): 413-424.
Keeney, S. (2001). "Mechanism and control of meiotic recombination initiation." Curr Topics
Dev Biol 52: 1-53.
Keeney, S. (2008). "Spo11 and the Formation of DNA Double-Strand Breaks in Meiosis."
Genome Dyn Stab 2: 81-123.
Kidwell, M. G. (1972). "Genetic change of recombination value in Drosophila melanogaster. II.
Simulated natural selection." Genetics 70(3): 433-443.
Kim, Y. and R. Nielsen (2004). "Linkage disequilibrium as a signature of selective sweeps."
Genetics 167(3): 1513-1524.
Kliman, R. M. and J. Hey (1993). "Reduced natural selection associated with low
recombination in Drosophila melanogaster." Mol Biol Evol 10(6): 1239-1258.
Kliman, R. M. and J. Hey (2003). "Hill-Robertson interference in Drosophila melanogaster:
reply to Marais, Mouchiroud and Duret." Genet Res 81(2): 89-90.
Koehler, K. E., J. P. Cherry, A. Lynn, P. A. Hunt and T. J. Hassold (2002). "Genetic control of
mammalian meiotic recombination. I. Variation in exchange frequencies among males from
inbred mouse strains." Genetics 162(1): 297-306.
Kolas, N. K., A. Svetlanov, M. L. Lenzi, F. P. Macaluso, S. M. Lipkin, R. M. Liskay, J. Greally, W.
Edelmann and P. E. Cohen (2005). "Localization of MMR proteins on meiotic chromosomes
in mice indicates distinct functions during prophase I." J Cell Biol 171(3): 447-458.
Kong, A., D. F. Gudbjartsson, J. Sainz, G. M. Jonsdottir, S. A. Gudjonsson, B. Richardsson, S.
Sigurdardottir, J. Barnard, B. Hallbeck, G. Masson, A. Shlien, S. T. Palsson, M. L. Frigge, T. E.
Thorgeirsson, J. R. Gulcher and K. Stefansson (2002). "A high-resolution recombination map
of the human genome." Nat Genet 31(3): 241-247.
Kulathinal, R. J., S. M. Bennett, C. L. Fitzpatrick and M. A. Noor (2008). "Fine-scale mapping of
recombination rate in Drosophila refines its correlation to diversity and divergence." Proc
Natl Acad Sci U S A 105(29): 10051-10056.
Kuzminov, A. (1999). "Recombinational repair of DNA damage in Escherichia coli and
bacteriophage lambda." Microbiol Mol Biol Rev 63(4): 751-813.
Larracuente, A. M., T. B. Sackton, A. J. Greenberg, A. Wong, N. D. Singh, D. Sturgill, Y. Zhang, B.
Oliver and A. G. Clark (2008). "Evolution of protein-coding genes in Drosophila." Trends
Genet 24(3): 114-123.
Lenormand, T. (2003). "The evolution of sex dimorphism in recombination." Genetics
163(2): 811-822.
Lenormand, T. and J. Dutheil (2005). "Recombination difference between sexes: a role for
haploid selection." PLoS Biol 3(3): e63.
Marais, G., D. Mouchiroud and L. Duret (2001). "Does recombination improve selection on
codon usage? Lessons from nematode and fly complete genomes." Proc Natl Acad Sci U S A
98(10): 5688-5692.
Maynard Smith, J. and J. Haigh (1974). "The hitch-hiking effect of a favorable gene." Genet
Res 23: 23-35.
McKim, K. S., B. L. Green-Marroquin, J. J. Sekelsky, G. Chin, C. Steinberg, R. Khodosh and R. S.
Hawley (1998). "Meiotic synapsis in the absence of recombination." Science 279(5352):
876-878.
McKim, K. S. and A. Hayashi-Hagihara (1998). "mei-W68 in Drosophila melanogaster
encodes a Spo11 homolog: evidence that the mechanism for initiating meiotic
recombination is conserved." Genes Dev 12(18): 2932-2942.
McVean, G. A. and B. Charlesworth (2000). "The effects of Hill-Robertson interference
between weakly selected mutations on patterns of molecular evolution and variation."
Genetics 155(2): 929-944.
Muñoz-Fuentes, V., A. Di Rienzo and C. Vilà (2011). "Prdm9, a major determinant of meiotic
recombination hotspots, is not functional in dogs and their wild relatives, wolves and
coyotes." PLoS ONE 6(11): e25498.
Myers, S., L. Bottolo, C. Freeman, G. McVean and P. Donnelly (2005). "A fine-scale map of
recombination rates and hotspots across the human genome." Science 310(5746): 321-324.
Myers, S., C. Freeman, A. Auton and P. Donnelly… (2008). A common sequence motif
associated with recombination hot spots and genome instability in humans. Nature genetics.
Nachman, M. W. (2002). "Variation in recombination rate across the genome: evidence and
implications." Curr Opin Genet Dev 12(6): 657-663.
Nordborg, M., T. T. Hu, Y. Ishino, J. Jhaveri, C. Toomajian, H. Zheng, E. Bakker, P. Calabrese, J.
Gladstone, R. Goyal, M. Jakobsson, S. Kim, Y. Morozov, B. Padhukasahasram, V. Plagnol, N. A.
Rosenberg, C. Shah, J. D. Wall, J. Wang, K. Zhao, T. Kalbfleisch, V. Schulz, M. Kreitman and J.
Bergelson (2005). "The pattern of polymorphism in Arabidopsis thaliana." PLoS Biol 3(7):
e196.
Ohta, T. (1976). "Simple model for treating evolution of multigene families." Nature
263(5572): 74-76.
Ohta, T. (1976). "Simulation studies on the evolution of amino acid sequences." J Mol Evol
8(1): 1-12.
Padhukasahasram, B. and B. Rannala (2013). "Meiotic gene-conversion rate and tract length
variation in the human genome." Eur J Hum Genet.
Paigen, K. and P. Petkov (2010). "Mammalian recombination hot spots: properties, control
and evolution." Nat Rev Genet 11(3): 221-233.
Parvanov, E. D., P. M. Petkov and K. Paigen (2010). "Prdm9 Controls Activation of
Mammalian Recombination Hotspots." Science 327(5967): 835.
Petes, T. D. (2001). "Meiotic recombination hot spots and cold spots." Nat Rev Genet 2(5):
360-369.
Plough, H. H. (1917). "The Effect of Temperature on Linkage in the Second Chromosome of
Drosophila." Proc Natl Acad Sci U S A 3(9): 553-555.
Plug, A. W., J. Xu, G. Reddy, E. I. Golub and T. Ashley (1996). "Presynaptic association of
Rad51 protein with selected sites in meiotic chromatin." Proc Natl Acad Sci U S A 93(12):
5920-5924.
Presgraves, D. C. (2005). "Recombination enhances protein adaptation in Drosophila
melanogaster." Current Biology 15(18): 1651-1656.
Prieler, S., A. Penkner, V. Borde and F. Klein (2005). "The control of Spo11's interaction with
meiotic recombination hotspots." Genes Dev 19(2): 255-269.
Rockman, M. V. and L. Kruglyak (2009). "Recombinational landscape and population
genomics of Caenorhabditis elegans." PLoS Genet 5(3): e1000419.
Rockman, M. V., S. S. Skrovanek and L. Kruglyak (2010). "Selection at linked sites shapes
heritable phenotypic variation in C. elegans." Science 330(6002): 372-376.
Roeder, G. S. (1995). "Sex and the single cell: meiosis in yeast." Proc Natl Acad Sci U S A
92(23): 10450-10456.
Santos, J. L. (1999). "The relationship between synapsis and recombination: two different
views." Heredity 82(1): 1-6.
Ségurel, L., E. M. Leffler and M. Przeworski (2011). "The Case of the Fickle Fingers: How the
PRDM9 Zinc Finger Protein Specifies Meiotic Recombination Hotspots in Humans." PLoS
Biol 9(12): e1001211.
Serrentino, M. E. and V. Borde (2012). "The spatial regulation of meiotic recombination
hotspots: are all DSB hotspots crossover hotspots?" Exp Cell Res 318(12): 1347-1352.
Singh, N. D., D. R. Criscoe, S. Skolfield, K. P. Kohl, E. S. Keebaugh and T. A. Schlenke (2015).
"Fruit flies diversify their offspring in response to parasite infection." Science
349(6249): 747-750.
Singh, N. D., E. A. Stone, C. F. Aquadro and A. G. Clark (2013). "Fine-scale heterogeneity in
crossover rate in the Garnet-Scalloped region of the Drosophila melanogaster X
Chromosome." Genetics.
Smith, N. G. and A. Eyre-Walker (2002). "Adaptive protein evolution in Drosophila." Nature
415(6875): 1022-1024.
Smukowski, C. S. and M. A. Noor (2011). "Recombination rate variation in closely related
species." Heredity (Edinb) 107(6): 496-508.
Sollier, J., W. Lin, C. Soustelle, K. Suhre, A. Nicolas, V. Geli and C. de La Roche Saint-Andre
(2004). "Set1 is required for meiotic S-phase onset, double-strand break formation and
middle gene expression." EMBO J 23(9): 1957-1967.
Sommermeyer, V., C. Beneut, E. Chaplais, M. E. Serrentino and V. Borde (2013). "Spp1, a
member of the Set1 Complex, promotes meiotic DSB formation in promoters by tethering
histone H3K4 methylation sites to chromosome axes." Mol Cell 49(1): 43-54.
Steiner, S., J. Kohli and K. Ludin (2010). "Functional interactions among members of the
meiotic initiation complex in fission yeast." Curr Genet 56(3): 237-249.
Steiner, W. W., E. M. Steiner, A. R. Girvin and L. E. Plewik (2009). "Novel Nucleotide
Sequence Motifs That Produce Hotspots of Meiotic Recombination in Schizosaccharomyces
pombe." Genetics 182(2): 459-469.
Stephan, W. and C. H. Langley (1998). "DNA polymorphism in Lycopersicon and crossing-over
per physical length." Genetics 150(4): 1585-1593.
Stevison, L. S. and M. A. Noor (2010). "Genetic and evolutionary correlates of fine-scale
recombination rate variation in Drosophila persimilis." J Mol Evol 71(5-6): 332-345.
Sturtevant, A. H., C. B. Bridges and T. H. Morgan (1919). "The Spatial Relations of Genes."
Proc Natl Acad Sci U S A 5(5): 168-173.
Sun, X., L. Huang, T. E. Markowitz, H. G. Blitzblau, D. Chen, F. Klein and A. Hochwagen (2015).
"Transcription dynamically patterns the meiotic chromosome-axis interface." Elife 4.
Sung, P. and D. L. Robberson (1995). "DNA strand exchange mediated by a RAD51-ssDNA
nucleoprotein filament with polarity opposite to that of RecA." Cell 82(3): 453-461.
Symington, L. S. (2002). "Role of RAD52 epistasis group genes in homologous
recombination and double-strand break repair." Microbiol Mol Biol Rev 66(4): 630-670.
Symington, L. S. (2014). "End resection at double-strand breaks: mechanism and
regulation." Cold Spring Harb Perspect Biol 6(8).
Tsai, I. J., D. Bensasson, A. Burt and V. Koufopanou (2008). "Population genomics of the wild
yeast Saccharomyces paradoxus: Quantifying the life cycle." Proc Natl Acad Sci U S A
105(12): 4957-4962.
Wallberg, A., S. Glemin and M. T. Webster (2015). "Extreme recombination frequencies
shape genome variation and evolution in the honeybee, Apis mellifera." PLoS Genet 11(4):
e1005189.
Walsh, J. B. (1983). "Role of biased gene conversion in one-locus neutral theory and genome
evolution." Genetics 105(2): 461-468.
Weiner, B. M. and N. Kleckner (1994). "Chromosome pairing via multiple interstitial
interactions before and during meiosis in yeast." Cell 77(7): 977-991.
Wyman, C. and R. Kanaar (2006). "DNA double-strand break repair: all's well that ends
well." Annu Rev Genet 40: 363-383.
Yin, Y. and S. Smolikove (2013). "Impaired resection of meiotic double-strand breaks
channels repair to nonhomologous end joining in Caenorhabditis elegans." Mol Cell Biol
33(14): 2732-2747.
Zickler, D. and N. Kleckner (1999). "Meiotic chromosomes: integrating structure and
function." Annu Rev Genet 33: 603-754.
Zickler, D. and N. Kleckner (2015). "Recombination, Pairing, and Synapsis of Homologs
during Meiosis." Cold Spring Harb Perspect Biol 7(6).
CHAPTER 2: The Drosophila early ovarian transcriptome
provides insight to the molecular causes of recombination rate
variation across genomes
2.0.1 Preface
Chapter 2 appears here as a reprinting of an article with the same title published within
BMC Genomics in 2013 (Volume 14, Issue 794). Formatting and minor alterations have been
made for consistency. The full original, unaltered document may be found at
http://www.biomedcentral.com/1471-2164/14/794.
2.0.2 Abstract
Background: Evidence in yeast indicates that gene expression is correlated with
recombination activity and double-strand break (DSB) formation in some hotspots and
studies of nucleosome occupancy in yeast and mice suggest that open chromatin influences
the formation of DSBs. In Drosophila melanogaster, high-resolution recombination maps
show an excess of DSBs within annotated transcripts relative to intergenic sequences. The
impact of active transcription on recombination landscapes, however, remains unexplored
in a multicellular organism. We therefore investigated the transcription profile during early
meiosis in D. melanogaster females to obtain a glimpse at the relevant transcriptional
dynamics during DSB formation, and test the specific hypothesis that DSBs preferentially
target transcriptionally active genomic regions.
Results: Our study of transcript profiles of early and late meiosis using mRNA-seq
revealed 1) significant differences in gene expression, 2) new genes and exons, 3) parent-of-origin
effects on transcription in early meiosis stages, and 4) a nonrandom genomic
distribution of transcribed genes. Importantly, genomic regions that are more actively
transcribed during early meiosis show higher rates of recombination, and we ruled out DSB
preference for genic regions that are not transcribed.
Conclusions: Our results provide evidence in a multicellular organism that transcription
during the initial phases of meiosis increases the likelihood of DSB formation, and they give
insight into the molecular determinants of recombination rate variation across the D.
melanogaster genome. We propose a model in which variation in gene expression plays a role
in altering the recombination landscape across the genome. Such a model could provide a
molecular, heritable and plastic mechanism for observed patterns of recombination variation,
from the high level of intra-specific variation to the known influence of environmental
factors and stress conditions.
2.1 Background
High-resolution transcription profiles offer insight into a wide array of biological
questions including the identification of genes involved in specific molecular processes, the
understanding of cellular fate and organ differentiation, the importance of genic and
epigenetic factors, and the complex response to environmental conditions (Wang, Gerstein
et al. 2009). With the rise of sequencing technologies such as RNA-seq and supporting
methodologies, researchers are now able to obtain gene expression profiles (sensu levels of
transcripts), potentially identifying rare or novel transcript forms that are only present in
specific cells and/or at very precise developmental times (Tang, Barbacioru et al. 2009).
One such cell population of interest lies within the anterior portions of the Drosophila ovary,
where mitotic precursor cells begin their development into functional eggs and meiotic
recombination occurs.
The Drosophila ovary has served as a model for meiosis (Lake and Hawley 2012),
embryo patterning (Roth and Lynch 2009), and stem cell differentiation (Kirilly and Xie
2007, Spradling, Fuller et al. 2011). Drosophila females have two ovaries comprised of 10 to
20 tube-like structures, called ovarioles, clustered together with a spatiotemporal
organization of progressively developing oocytes (Spradling 1993). Oogenesis in Drosophila
starts within the anterior compartment of the ovariole, the germarium, where mitotic stem
cells produce cystoblasts that undergo further cell division generating a large 16-cell cyst
with a single cystocyte becoming the oocyte. Before exiting the germarium as a stage-1 egg
chamber, the primary oocyte will have entered pachytene and undergone meiotic
recombination. These anteriormost portions of the Drosophila ovariole represent a highly
active community of cells, regulated with remarkable fidelity, and yet, constitute only a
small fraction of the entire ovary (Morris and Spradling 2011). Previous whole-genome
transcriptome analyses of whole ovaries therefore offer only an amalgamated view of the ovary's
developmental and cellular complexity, limiting our understanding of the relevant gene
expression activity of the germarium and early meiosis (Parisi, Nuttall et al. 2004, Gan,
Chepelev et al. 2010).
The process of meiotic recombination in D. melanogaster females begins with the
initiation of double-strand breaks (DSBs). At a very broad scale, crossovers in Drosophila
are distributed in bell-shaped fashion along chromosomes, with a maximal rate in the
center of a chromosomal arm that tapers off near centromeric and telomeric regions
(Lindsley and Zimm 1992). This is also the case in many other (e.g., mice, humans,
Arabidopsis, etc.) but not all (e.g., Caenorhabditis elegans and C. briggsae) eukaryotes. At
finer scales, recombination maps have revealed substantial variation across chromosomes
in all species analyzed, including Drosophila (Singer, Fan et al. 2006, Coop, Wen et al. 2008,
Kulathinal, Bennett et al. 2008, Rockman and Kruglyak 2009, Cutter and Choi 2010, Dumont
and Payseur 2011, Fledel-Alon, Leffler et al. 2011, Comeron, Ratnappan et al. 2012,
McGaugh, Heil et al. 2012). In D. melanogaster, high-resolution mapping of more than
100,000 recombination events at a scale approaching gene-level resolution showed not only
extreme heterogeneity in recombination rates across chromosomes but also that these
landscapes of recombination vary significantly among individuals of the same species
(Comeron, Ratnappan et al. 2012). Even within chromosomal regions traditionally assumed
to have non-reduced recombination rates, crossover rates vary up to 80-fold when crossing
two D. melanogaster strains, and 20-fold after combining genetic maps obtained from eight
crosses of different strains (Comeron, Ratnappan et al. 2012). Beyond the differences across
genomes, between species and within species, there is an important additional layer of
complexity: recombination rates are plastic and influenced by factors such as temperature,
food or maternal age (Stern 1926, Neel 1941, Redfield 1966, Parsons 1988, Priest, Galloway
et al. 2008).
The molecular determinants leading to DSB localization across the Drosophila
genome remain obscure but a number of patterns are beginning to emerge [see (Comeron,
Ratnappan et al. 2012) for details]. First, unlike human and mouse recombination hotspots
that are strongly influenced by the presence of the PRDM9-binding DNA motif (Myers,
Bottolo et al. 2005, Baudat, Buard et al. 2010), no PRDM9 motif is detected in Drosophila
(Comeron, Ratnappan et al. 2012, Heil and Noor 2012, Miller, Takeo et al. 2012, Singh, Stone
et al. 2013). Second, analyses in Drosophila reveal many different DNA motifs significantly
enriched in sequences surrounding recombination events, suggesting a fundamental
qualitative difference between human/mouse and Drosophila DSB localization (Comeron,
Ratnappan et al. 2012, Singh, Stone et al. 2013). Third, recombination events tend to occur
within annotated transcript regions thus suggesting a possible association between
transcription, chromatin accessibility, and DSBs that are repaired as recombination events
(Comeron, Ratnappan et al. 2012). This latter observation is in agreement with evidence in
the yeast S. cerevisiae where some, but not all, hotspots of recombination increase activity
with open promoters (Petes 2001). Mapping of chromatin accessibility and nucleosome
occupancy in yeast and mice (Kirkpatrick, Wang et al. 1999, Getun, Wu et al. 2010, Pan,
Sasaki et al. 2011) also suggests, albeit more indirectly, that the formation of DSBs could be
influenced by transcription based on the known effect of transcription on chromatin
remodeling and histone modifications. The impact of active transcription on meiotic DSB
localization and recombination landscapes in a multicellular organism remains, however,
unexplored.
Here, we employ RNA-seq to obtain and analyze the whole transcriptome of early
meiotic D. melanogaster cells. We isolated cells from the germarium through stage 3 to
substantially enrich the fraction of the sample that is actively experiencing early meiosis
and DSB formation and obtain
a first glimpse of the potential influence of transcription on recombination localization
across the Drosophila genome. Our analyses uncover genes with germarium-specific
expression patterns and novel transcripts. The study of offspring from reciprocal crosses
also reveals distinct parent-of-origin effects that create differences in gene expression
among genetically identical individuals.
Finally, we identify a positive relationship across the genome between transcription
in early meiotic cells and recombination rates. Importantly, recombination events are found
to target actively transcribed genes relative to genes with no detectable transcription thus
allowing us to rule out that the observed association is due to DSB preference for static gene
properties at the level of DNA sequence (e.g., G+C content). These results provide insight
into the molecular determinants of recombination rate variation across the D. melanogaster
genome and a clear path for future studies to assess the molecular causes of recombination
variation among individuals and its plastic nature.
2.2 Results & Discussion
2.2.1 General Patterns of the Drosophila Early Meiotic Transcriptome
We isolated mRNAs from meiotic portions of the Drosophila ovary, dissecting the
germarium and stages 1-3, and compared them to later, more developed regions of the
ovary, hereafter referred to as ‘Early’ and ‘Late’, respectively (see Materials and Methods).
We performed ultra-deep mRNA sequencing (mRNA-seq) that yielded over 467 million
(M) 120-bp reads. Approximately 80% of these reads mapped correctly to the D.
melanogaster genome reference sequence, with a total average coverage greater than 400x
when mapped to annotated transcripts (Table 2.1). Each of our eight independent samples
sequenced (see Materials and Methods) generated between 34.5 and 60.3M mapped reads,
with a median mapped read count of 49.8M. Total mapped read counts did not differ
significantly between Early and Late tissues (P = 0.59).
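The summary figures in this paragraph can be recomputed from the per-sample counts in Table 2.1. The following is a minimal sketch, not part of the original analysis; the pairing of strain labels with rows is an assumption read from the table layout, and the function name is illustrative only.

```python
# Gross and mapped read counts per sample, copied from Table 2.1.
# Assumption: rows are ordered 208, 375, 375Fx208M, 375Mx208F, each Early then Late.
SAMPLES = {
    ("208", "Early"): (44_938_695, 34_472_998),
    ("208", "Late"): (52_572_707, 44_868_154),
    ("375", "Early"): (67_255_860, 48_750_637),
    ("375", "Late"): (61_344_832, 51_546_801),
    ("375Fx208M", "Early"): (61_305_389, 50_863_571),
    ("375Fx208M", "Late"): (71_871_368, 60_296_814),
    ("375Mx208F", "Early"): (57_532_606, 45_646_995),
    ("375Mx208F", "Late"): (50_982_155, 37_072_046),
}


def percent_mapped(gross, mapped):
    """Percentage of gross reads that mapped to the reference."""
    return 100.0 * mapped / gross


total_gross = sum(gross for gross, _ in SAMPLES.values())
total_mapped = sum(mapped for _, mapped in SAMPLES.values())
overall_percent = percent_mapped(total_gross, total_mapped)
mapped_counts = sorted(mapped for _, mapped in SAMPLES.values())

print(f"total gross reads: {total_gross:,}")
print(f"overall mapped:    {overall_percent:.1f}%")
print(f"mapped range:      {mapped_counts[0]:,} to {mapped_counts[-1]:,}")
```

The totals agree with the "over 467 million" reads and "approximately 80%" mapping rate stated in the text, and with the 34.5 to 60.3M range of mapped reads per sample.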
Comparisons of Early versus Late transcript profiles show high similarity, with a
strong correlation coefficient (Spearman's R = 0.952; P < 0.001; see Figure 2.1). There are
7,914 genes expressed in Early regions compared to 7,557 genes in Late regions. These
results suggest that roughly 50% of all genes are actively transcribed, a value similar to the
typical percentage in other Drosophila tissues, based on mRNA seq (Chintapalli, Wang et al.
2007) or array-based comparisons of germarium and testes [46% of all genes expressed;
(Cash and Andrews 2012)]. The detection of 5% more genes being transcribed (i.e., above a
FPKM threshold of 1) in Early relative to Late meiosis (P = 0.015) is accompanied by a
reduced average level of transcription for active genes in early meiosis by more than 13%
(average FPKM of 89.1 and 103.1 for Early and Late stages, respectively; Wilcoxon Matched
Pairs Test, Z = 44.5, P < 1x10^-12). Similar differences are observed when defining active
genes based on FPKM greater than 0.1 (Z = 41.7, P < 1x10^-12).
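The percentages quoted in this paragraph follow directly from the reported gene counts and mean FPKM values. The sketch below reproduces that arithmetic and shows how a gene would be called active under a given FPKM threshold; the helper names and the toy FPKM list are hypothetical and serve only as illustration.

```python
def count_active(fpkm_values, threshold=1.0):
    """Number of genes whose FPKM is strictly above the threshold."""
    return sum(1 for value in fpkm_values if value > threshold)


def percent_more(a, b):
    """Percentage by which a exceeds b."""
    return 100.0 * (a - b) / b


def percent_reduction(a, b):
    """Percentage by which a falls below b."""
    return 100.0 * (b - a) / b


# Values reported in the text.
genes_early, genes_late = 7914, 7557
mean_fpkm_early, mean_fpkm_late = 89.1, 103.1

extra_genes = percent_more(genes_early, genes_late)
lower_level = percent_reduction(mean_fpkm_early, mean_fpkm_late)

print(f"genes detected in Early vs Late: +{extra_genes:.1f}%")
print(f"mean FPKM in Early vs Late:      -{lower_level:.1f}%")

# Hypothetical FPKM values, to show the effect of the two thresholds.
toy_fpkm = [0.05, 0.5, 1.2, 30.0]
print(count_active(toy_fpkm, 1.0), count_active(toy_fpkm, 0.1))
```

The first value rounds to the 5% stated in the text, and the second is consistent with a reduction of more than 13%.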
The Drosophila X chromosome is enriched in genes preferentially or uniquely
expressed in females (i.e., female-biased genes) and deficient in male-biased genes (Parisi,
Nuttall et al. 2003, Ranz, Castillo-Davis et al. 2003, Sturgill, Zhang et al. 2007, Assis, Zhou et
al. 2012, Llopart 2012, Meisel, Malone et al. 2012). Focusing only on the male germline,
however, a recent study has shown that differences between X and autosomes are not
caused by different gene content but by the lack of sex chromosome dosage compensation in
Drosophila testes, which reduces transcript levels of X-linked genes (Meiklejohn and
Presgraves 2012). In our deep-sequencing study, we see that the early ovarian
transcriptome shows the expected “female” bias with actively expressed genes unequally
distributed among chromosomes: 60.4% of genes on the X chromosome are transcribed
compared with 55.3% in autosomes (χ2 = 20.9, P = 4.8x10^-6; see Figure 2.2). A more extreme
difference is observed when defining active genes based on FPKM > 0.1 (79.5 vs 70.0% for X
and autosomes, χ2 = 89.2, P < 1x10^-12). Notably, this overrepresentation of actively
transcribed genes on the X chromosome is less apparent in Late meiotic stages (e.g., 56.5%
on the X compared to 53.4% in autosomes, χ2 = 7.23, P = 0.007).
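The test statistics reported in this paragraph (20.9, 89.2 and 7.23) and their P values are consistent with a chi-square test with one degree of freedom, as would be obtained from a 2x2 table of transcribed versus non-transcribed genes on the X and on autosomes. The sketch below shows that calculation under this assumption. The absence of a continuity correction is also an assumption, and the gene counts behind the percentages are not given in the text, so only the conversion from statistic to P value is checked against reported numbers.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_df1(chi2):
    """Upper-tail P value of a chi-square statistic with one degree of freedom."""
    # With one degree of freedom, chi2 is the square of a standard normal deviate.
    return math.erfc(math.sqrt(chi2 / 2.0))


for statistic in (20.9, 89.2, 7.23):
    print(f"chi2 = {statistic}: P = {p_value_df1(statistic):.2e}")
```

For a table of hypothetical counts, `p_value_df1(chi2_2x2(a, b, c, d))` gives the P value, where `a` and `b` would be the transcribed and non-transcribed X-linked genes and `c` and `d` the corresponding autosomal counts.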
Finally, we observe that expressed genes are not distributed randomly across
chromosomes, but are instead physically clustered (Wald–Wolfowitz runs test, P < 1x10^-8
for all levels of expression analyzed). When defining actively expressed genes as FPKM > 1,
clusters contain an average of 3.5 consecutive genes (5.8 genes when FPKM > 0.1). These
results are in agreement with data from other Drosophila tissues and conditions, with small
clusters of functionally related, highly co-expressed genes (Boutanaev, Kalmykova et al.
2002, Spellman and Rubin 2002, Weber and Hurst 2011).
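To illustrate the clustering analysis, the sketch below applies a Wald–Wolfowitz runs test to genes ordered along a chromosome and coded as expressed or not expressed. It uses the standard normal approximation for the number of runs, which is an assumption about the exact procedure, and the function names are illustrative. Fewer runs than expected (a negative z) indicate physical clustering.

```python
import math


def runs_test_z(flags):
    """Normal-approximation z score of the Wald-Wolfowitz runs test.

    flags: expressed (True) / not expressed (False), in chromosomal order.
    """
    flags = [bool(flag) for flag in flags]
    n1 = sum(flags)
    n2 = len(flags) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("need both expressed and non-expressed genes")
    # A new run starts wherever neighbouring genes differ in state.
    runs = 1 + sum(1 for a, b in zip(flags, flags[1:]) if a != b)
    n = n1 + n2
    expected = 2.0 * n1 * n2 / n + 1.0
    variance = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1.0))
    if variance <= 0:
        raise ValueError("too few genes for the normal approximation")
    return (runs - expected) / math.sqrt(variance)


def mean_cluster_length(flags):
    """Mean number of consecutive expressed genes per cluster."""
    lengths, current = [], 0
    for flag in flags:
        if flag:
            current += 1
        elif current:
            lengths.append(current)
            current = 0
    if current:
        lengths.append(current)
    return sum(lengths) / len(lengths) if lengths else 0.0


clustered = [1, 1, 1, 1, 0, 0, 0, 0]
alternating = [1, 0, 1, 0, 1, 0, 1, 0]
print(runs_test_z(clustered), runs_test_z(alternating))
print(mean_cluster_length([1, 1, 1, 0, 1, 0, 0, 1, 1]))
```

The toy sequences are hypothetical; with real data the flags would come from applying an FPKM threshold (1 or 0.1 in the text) to genes sorted by position.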
Table 2.1 mRNA-seq statistics for each sample
Strain      Condition*   Gross reads    Mapped reads   % Mapped   Avg. depth
208         Early         44,938,695     34,472,998      76.71        84.3
208         Late          52,572,707     44,868,154      85.34       119.2
375         Early         67,255,860     48,750,637      72.49       119.0
375         Late          61,344,832     51,546,801      84.03       121.5
375Fx208M   Early         61,305,389     50,863,571      82.97       107.3
375Fx208M   Late          71,871,368     60,296,814      83.9        122.9
375Mx208F   Early         57,532,606     45,646,995      79.34        86.9
375Mx208F   Late          50,982,155     37,072,046      72.72        93.5
Combined    Early        231,032,550    179,734,201      77.8        457.2
Combined    Late         236,771,062    193,783,815      81.84       397.4
*Early and Late indicate Drosophila Early- and Late-ovarian transcriptome, respectively.
Figure 2.1 Comparison of Log10 FPKM values for Drosophila Early- and
Late-ovarian transcriptome
Orange points indicate significantly differentially expressed genes based on FDR-corrected
significance level of 5% (q < 0.05). Spearman’s ρ = 0.952 (P < 1x10^-12).
Figure 2.2 Transcriptional differences between autosomes and the X chromosome
(A) Mean FPKM values for autosomes and the X chromosome. Error bars represent +/− 1 standard error. (B)
Percentage of total transcribed genes across each chromosome. Error bars represent 90% confidence intervals.
Green: Early-ovarian transcriptome, Orange: Late-ovarian transcriptome.
2.2.2 Differentially Expressed Genes in Early Meiotic Tissues
We observed 1,191 genes with differences in FPKM between Early and Late meiotic
tissues at nominal P < 0.05, with 376 genes showing a significant difference after correcting
for multiple testing (q < 0.05; see Materials and Methods). The degree of differential
expression ranges from +241-fold to -2060-fold in Early relative to Late tissues, with a
median difference of 1.66-fold among genes with significant differences. We observe a bias
towards overall down-regulation of genes in Early versus Late tissues (approximately five
times more genes are significantly down-regulated than up-regulated) that cannot be
explained by read bias in the respective samples.
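The step from nominal P values to q values can be sketched with the Benjamini-Hochberg procedure. The text states only that significance was corrected for multiple testing at q < 0.05, so the choice of Benjamini-Hochberg here is an assumption, and the P values in the example are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q values in the same order as the input P values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so each q value is the minimum
    # of p * m / rank over all ranks at or above its own.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        qvalues[index] = running_min
    return qvalues


example_p = [0.01, 0.04, 0.03, 0.005]  # hypothetical nominal P values
example_q = benjamini_hochberg(example_p)
significant = [q < 0.05 for q in example_q]
print(example_q, significant)
```

Because each q value is at least as large as its P value, the set of genes passing q < 0.05 (376 in the text) is necessarily a subset of those passing nominal P < 0.05 (1,191 in the text).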
The top ten over- and under-expressed genes in the Early sample are listed in Table
2.2. The use of DAVID (see Materials and Methods) to classify genes into GO categories
reveals that the terms ‘proteolysis’ and ‘peptidase’ are significantly enriched within the
top-ten up-regulated genes in our Early sample (FDR-corrected modified Fisher exact P =
0.0001 and 0.001, respectively). Furthermore, all of the known genes (sensu annotated in
the Drosophila Genome, r.5.47) within this group are serine-type endopeptidases involved
in proteolysis. Why there is such a bias towards genes involved in proteolysis is difficult to
explain, but a similar pattern has been noted in the apex of the testis in Drosophila (Cash
and Andrews 2012). We suggest that the overrepresentation of serine endopeptidases may
be due to the required breakdown of many meiotic proteins following their utilization in
meiosis in order to prevent erroneous aggregation of many self-assembling protein
complexes that may interact with DNA. The analysis of the 312 significantly down-regulated
genes suggests enrichment in the GO terms phosphoproteins, RNA splicing, nucleotide
binding, phosphate metabolic processes, and ribonucleotide binding. Analysis of the top
ten most extreme down-regulated genes does not indicate overrepresentation of any GO
term after correcting for multiple tests.
These results indicate an enrichment of serine proteases in early versus late ovarian
development and a concurrent down-regulation of the majority of genes in Early tissues.
Interestingly, many of the top ten up-regulated genes were shown to be down-regulated in
array-based comparisons between the whole ovary and the whole fly (Chintapalli,
Wang et al. 2007). This result emphasizes that whole-ovary experiments might have lacked
sufficient power to detect important genes involved in subregions of the developing ovary.
2.2.3 New Genes and Isoforms
We applied the Cufflinks algorithm to our combined data sets and identified up to
6,004 transcript forms (genes, exons, or noncoding RNAs) that were absent from the D.
melanogaster genome annotation (r. 5.47, September 2012). When we conservatively
restricted the list to only those items that were detected in two or more samples and further
removed items with FPKM < 1, the set still contained 220 entries. Notably, 47 high-confidence
items were identified with lengths greater than 300 bp, minimally repetitive
sequences and reads that reliably mapped to predicted splice junctions. Additionally, a
visual inspection shows that at least 13 of these new transcripts are independent of
other annotated gene entries and have clear exon-intron structures, and are thus strong
candidates for new genes, while the rest are either novel splicing forms or putative ncRNAs.
To validate some of these new transcripts, we designed transcript-specific primers,
extracted total RNA from ovaries and were able to reliably produce RT-PCR products from
seven of ten haphazardly selected novel candidates. We thus, conservatively, estimate a
contribution of ~30-35 novel items to the existing D. melanogaster genome annotation.
Notably, a number of putative novel transcribed sequences mapped uniquely to the so-called
chromosome U, which consists of unordered, unoriented scaffolds not present in the D.
melanogaster genome (euchromatic or heterochromatic) sequences. These results add to the
notion that the actual number of unannotated genes and isoforms is still high in this model
organism. Ultra-deep sequencing studies focusing on specific cell populations and variable
conditions are therefore needed to fill this annotation gap that can have important
consequences in genomic and evolutionary analyses.
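One way to arrive at an estimate in the stated range of ~30-35 novel items is to scale the number of high-confidence candidates by the RT-PCR validation rate. The sketch below shows that arithmetic; it assumes the ten tested candidates were drawn from the 47 high-confidence items, which the text does not state explicitly, and the function name is illustrative.

```python
def expected_validated(candidates, validated, tested):
    """Scale a candidate count by the observed validation rate."""
    return candidates * validated / tested


# 47 high-confidence items; 7 of 10 tested candidates gave RT-PCR products.
estimate = expected_validated(47, 7, 10)
print(f"expected number of validated novel items: {estimate:.1f}")
```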
Table 2.2 Top ten differentially expressed genes by fold-change
Transcript              Biological process       Early FPKM   Late FPKM   Fold change   q value
Over-regulated in Early*
CG17475-RA              Proteolysis                   13.95        0.06         240.9   1.02x10^-8
CG31267-RA              Proteolysis                    8.22        0.06         140.7   1.81x10^-6
CG32833-RA              Proteolysis                    2.91        0.04          65.3   9.43x10^-3
CG42704-RA              Unknown                       62.64        0.98          64.2   2.08x10^-8
CG18417-RA              Proteolysis                    1.16        0.02          48.7   7.64x10^-3
CG43074-RA              Unknown                       11.81        0.24          48.7   3.06x10^-5
CG47205-RA              Unknown                        7.12        0.15          46.3   1.40x10^-3
CG31266-RB              Proteolysis                    3.44        0.09          40.1   1.74x10^-3
CG31681-RA              Proteolysis                    2.74        0.07          39.1   4.71x10^-3
CG15254-RA              Proteolysis                    2.42        0.06          37.9   1.40x10^-3
Under-regulated in Early
Vml-RA                  d/v axis specification         0.11      221.40      −2,059.9   <1x10^-12
λTry-RA                 Proteolysis                    0.07        3.27         −46.0   4.39x10^-3
CG8997-RA               Unknown                        1.16       47.99         −41.4   4.91x10^-10
CG7916-RA               Unknown                        0.78       31.21         −39.8   1.48x10^-9
CG12057-RA              Unknown                        1.74       68.09         −39.1   6.71x10^-7
CG7953-RA               Unknown                        0.68       26.10         −38.2   2.34x10^-9
CG33306-RA              Unknown                        0.10        3.71         −37.7   1.62x10^-4
chrUextra:28564682777   Probable rRNA                 20.61      709.30         −34.4   2.00x10^-2
CG11911-RA              Proteolysis                    0.75       21.88         −29.0   7.91x10^-8
CG18585-RA              Proteolysis                    0.07        1.72         −24.9   5.45x10^-3
*Early indicates Drosophila Early-ovarian transcriptome.
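The signed fold changes in Table 2.2 can be approximated from the two FPKM columns. The sketch below assumes the convention that the larger value is divided by the smaller and the sign marks the direction in Early relative to Late; this convention is inferred from the table, not stated in the text. Because the tabulated FPKM values are rounded to two decimals, the results match the table only approximately.

```python
def signed_fold_change(early_fpkm, late_fpkm):
    """Fold change of Early relative to Late; negative when Early is lower."""
    if early_fpkm <= 0 or late_fpkm <= 0:
        raise ValueError("FPKM values must be positive to form a ratio")
    if early_fpkm >= late_fpkm:
        return early_fpkm / late_fpkm
    return -(late_fpkm / early_fpkm)


# FPKM pairs taken from Table 2.2.
print(round(signed_fold_change(62.64, 0.98), 1))    # CG42704-RA, table lists 64.2
print(round(signed_fold_change(20.61, 709.30), 1))  # chrUextra entry, table lists -34.4
```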
2.2.4 Parent-of-Origin Effects in the Early Meiotic Tissue
Differences in gene expression between genetically identical offspring from
reciprocal crosses indicate that maternal and/or paternal effects alter expression. The
molecular causes of these parent-of-origin effects include genomic imprinting (through
epigenetic modification during gametogenesis), cytoplasmic effects of the egg and sperm, or
mitochondrial contributions to nuclear transcription. To investigate parent-of-origin effects
in the Early meiotic tissue in females, we studied two homozygous D. melanogaster parental
strains (strains RAL-208 and RAL-375 from Raleigh, NC (Mackay, Richards et al. 2012)) and
the heterozygous offspring from reciprocal crosses. We identified genes with a parent-of-origin
transcription pattern as those that show differential expression between
offspring of reciprocal crosses, and focused on the subset whose
transcript levels change between offspring of reciprocal crosses in the same direction as the maternal
strains differ from each other (i.e., parent-of-origin effects with maternal-like transcript
levels).
The comparison of offspring of reciprocal crosses reveals that there are more genes
with parent-of-origin effects with maternal-like transcript levels in the Early- than in the
Late-ovarian development tissues (1041 and 554 genes, respectively; P < 1x10^-12). Interestingly,
more of these genes show transcript levels resembling the RAL-208 maternal strain than
the RAL-375 maternal strain (P < 10^-12). We
expanded this study by investigating allele-specific transcript ratios of heterozygous
offspring and observed an excess of the RAL-208 allele in both reciprocal crosses (P < 10^-12)
that is higher when the maternal strain is RAL-208 (Wilcoxon Matched Pairs Test, P =
0.004). These results reveal not only the presence of variable parent-of-origin effects acting
on transcript abundance but also an overrepresentation of dominant effects in RAL-208
relative to RAL-375.
We also identified an enrichment of a common set of GO terms associated with
genes showing parent-of-origin effects with maternal-like transcript levels, many of which
are involved in development and differentiation (Supplementary Table 2.1S). When the
Early and Late datasets are combined, we recover similar GO term hits as were obtained for
Early tissues alone (Supplementary Table 2.3S). We thus interpret this pattern as a clear
signal of parent-of-origin effects in the transcriptome of the Drosophila ovary, with
maternal-like gene expression that is mostly relevant to Early-ovary development.
2.2.5 Transcription is Associated with Increased Recombination Rates
Ultra-high resolution mapping of recombination events in Drosophila revealed that
meiotic DSBs (detected as combined non-crossover and crossover events) occur
preferentially in annotated transcriptional units (Comeron, Ratnappan et al. 2012). We thus
hypothesized that gene transcription increases the probability of DSB formation in
Drosophila and influences the recombination landscapes across chromosomes. Alternatively,
the preference of DSBs for genic units could be associated with other characteristics such as
nucleotide composition, reduced average nucleotide diversity relative to intergenic regions,
or the presence of specific DNA motifs in promoter and intronic regions. Although the
topography of recombination landscapes in S. cerevisiae and D. melanogaster is
dramatically different in terms of hotspot activity and localization, evidence based on some
hotspots in yeast suggests that promoter and/or transcriptional activity affects recombination
activity (Petes 2001). The effects of transcription on DSB formation could be either direct,
via reduced nucleosome occupancy and increased chromatin accessibility, or more indirect,
as a consequence of histone modifications.
To evaluate our hypothesis, we focused on the expression profile during early
D. melanogaster meiosis and compared the transcriptional landscape with recombination
rate variation across the D. melanogaster genome. Note that we anticipate the presence of a
fraction of cells other than those where DSB formation occurs in our Early-meiosis sample.
We argue, however, that our sample is enriched in recombining cells and therefore, even if
we may not recover the precise transcriptional profile at the time/cells where DSBs occur,
genomic regions with no evidence of transcription will be particularly informative when
defining coldspots of transcription during DSB formation.
To this end, we used estimates of recombination rates across the D. melanogaster
genome that were experimentally obtained after genotyping 139 million informative SNPs
and mapping more than 100,000 recombination events at a scale approaching gene-level
resolution (see (Comeron, Ratnappan et al. 2012) for details). We then compared measures
of transcription in Early meiosis with these high-resolution recombination landscapes. The
analysis of adjacent 100-kb regions reveals a positive association between recombination
rates and both the number of genes transcribed per interval (Spearman’s R = 0.175, P =
3.1x10^-10) and the total length of the transcribed regions per interval (R = 0.122, P = 1.2x10^-5;
see Figure 2.3). To capture the possible effect of transcription levels, we also obtained a
measure of overall transcriptional activity within a genomic interval (OTA), defined as the
Log10-transformed sum of the product FPKM x transcript length for each gene within a given
genomic interval (100 kb in our study). OTA is positively associated with
recombination rates across the genome (R = 0.137, P = 8.4x10^-7). Multiple regression
analysis shows, however, that the number (P = 0.006) and total length of transcribed
sequences (P = 0.035) in a region are more relevant than OTA (P > 0.4) in predicting
recombination rates at the 100-kb scale.
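The rank correlations reported above can be illustrated with a minimal sketch. It assumes two equal-length lists, one value per 100-kb interval (for example, the number of transcribed genes and the recombination rate in cM/Mb); the function names are hypothetical and this is not the software used in the thesis, which is not specified.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def spearman_r(x, y):
    """Spearman's R: Pearson correlation computed on tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Averaging the ranks of ties matters here because many intervals share the same small number of transcribed genes.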
Notably, the relationship between measures of transcription and recombination
(Figure 2.3) appears to be highly contingent upon regions that lack transcription relative to
regions with transcription. There is a significant difference in recombination rates between
regions with no transcription and regions with one or more transcribed genes (Mann-Whitney
test, P < 1x10^-6). This result is consistent with the idea that our study preferentially
captures the consequences of coldspots for transcription during DSB formation in our
Early-meiosis sample.
Figure 2.3 Relationship between transcription and recombination rates
(A) Mean recombination rate in cM/Mb (centimorgans per megabase) for genomic regions grouped according to
the number of genes transcribed (FPKM > 0.1) within each 100-kb region. Spearman’s R = 0.168 (P = 1.5x10^-9)
based on non-overlapping 100-kb regions. (B) Mean recombination rate in cM/Mb for regions grouped according
to the total region transcribed within each 100-kb region. R = 0.123 (P = 1.1x10^-5). Error bars represent +/− 1
standard error.
The high-resolution genetic maps of D. melanogaster (see above) also allowed the
localization of more than 5,000 DSBs delimited by 500 bp or less (Comeron, Ratnappan et
al. 2012). Here, we take advantage of these highly localized meiotic DSB events to
investigate their distribution at the scale of single genes and intergenic regions. We observe
that intergenic sequences have fewer DSBs than expected but, importantly, we detect a
difference between genes transcribed and genes not transcribed (Figure 2.4). There is a
significant excess of DSBs within transcribed genes relative to a random distribution (P =
5.1x10^-6), while no preference/avoidance is observed for genes with no evidence of being
transcribed (P > 0.4). These results show that the preference of DSBs to be located within
annotated genic regions in Drosophila is not merely a consequence of DNA properties of
genes such as higher G+C content than noncoding sequences, reduced sequence diversity,
or the presence of DNA regulatory motifs in promoter regions and introns. This result is also
in agreement with the previous analysis of recombination rates and nucleotide composition
showing that there is no positive association between recombination rates and G+C content
(P > 0.20; (Comeron, Ratnappan et al. 2012)). Instead, the detection of recombination
events targeting actively transcribed genes relative to genes with no detectable
transcription strongly suggests that gene expression during early meiosis has a causal effect
on DSB location and formation and, ultimately, recombination rates.
Finally, we investigated the effect of transcription levels on DSB presence. To this
end, we divided genes with detectable transcription into three groups for low-, medium-
and high-transcription levels. We observe that among genes with detectable transcription,
DSBs preferentially target lowly transcribed genes (Figure 2.4), with a negative
relationship between transcription levels and recombination. Again, these results may
reflect the expected heterogeneity of cell populations within our Early-meiosis tissue or,
alternatively, a more complex interplay between transcription, histone modification and
turnover, and chromatin accessibility for DSBs.
Figure 2.4 Relative presence of DSBs across the genome
Analyses based on the 5,610 DSB events delimited by 500 bp or less described in (Comeron, Ratnappan et al. 2012). The relative presence is
measured as the ratio of the number of DSBs observed within each category to the number expected based on a
random distribution of DSBs across the genome. Conservatively, we classified genes as showing no active
transcription when FPKM < 0.001 and groups of genes with low-, medium- and high-transcription represent
levels of target potential associated with transcription (FPKM x transcript length); 33, 46 and 21% of active
genes belong to the low-, medium- and high- transcription groups, respectively. Probabilities (shown above each
bar) associated with the relative presence of DSBs were obtained based on 10,000 independent replicates of the
5,610 DSBs randomly distributed across the genome.
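The ratio and randomization described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions: each DSB is placed independently and uniformly per base pair, a category (for example, transcribed genes) is summarized only by the number of base pairs it covers, and the probability shown is one-sided for an excess. The function names and arguments are hypothetical; the original randomization procedure is not detailed in the text.

```python
import random

def relative_presence(observed, n_dsbs, category_bp, genome_bp):
    """Observed DSB count in a category divided by the count expected
    if DSBs were spread uniformly along the genome."""
    expected = n_dsbs * category_bp / genome_bp
    return observed / expected

def randomization_p(observed, n_dsbs, category_bp, genome_bp,
                    replicates=10000, seed=0):
    """Fraction of random replicates with at least `observed` DSBs in the
    category (one-sided test for an excess; assumption, see lead-in)."""
    rng = random.Random(seed)
    frac = category_bp / genome_bp
    at_least = 0
    for _ in range(replicates):
        # Drop n_dsbs events at random; count those landing in the category.
        count = sum(1 for _ in range(n_dsbs) if rng.random() < frac)
        if count >= observed:
            at_least += 1
    return at_least / replicates
```

A test for a deficit (as for intergenic sequences) would count replicates with `count <= observed` instead.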
2.3 Conclusions
We obtained and compared the transcription profiles of Early- and Late-meiosis in
D. melanogaster females with mRNA-seq and ultra-deep coverage. We identified significant
differences in gene expression, new genes and exons, and a pattern of parent-of-origin
effects with maternal-like expression that is particularly evident in Early-meiosis stages. We
also showed that Early-meiosis transcription occurs more often on the X chromosome and
that there is physical clustering of actively transcribed genes across chromosomes. In terms
of gene categories, we report that many genes involved in proteolysis are highly expressed
in early meiosis, which may be a result of the rapid degradation of meiotic proteins
following their utilization in order to prevent erroneous, self-assembling aggregates
(Ringrose and Paro 2007). Our study and results underscore the limitations of using
heterogeneous cellular and tissue samples when searching for biologically relevant features
specific to particular developmental times and cell sets. In our case, searching for
transcriptional signals present only in meiotic oocytes benefits from not using the whole
ovary, as the oocyte becomes transcriptionally dormant upon entering
stage 1 in D. melanogaster, vastly increasing the influence of supportive nurse cells (Liu,
Buszczak et al. 2006).
Work in yeast has shown that chromatin accessibility and nucleosome occupancy
contribute to variation in the DSB landscape, although other factors may play a more
dominant role in determining the probability of DNA cleavage (Kirkpatrick, Wang et al.
1999, Pan, Sasaki et al. 2011). Studies of nucleosome occupancy in mouse meiotic
spermatocytes also suggest that open chromatin structure directs, at least in part, the
formation of DSBs (Getun, Wu et al. 2010). Indirectly, these studies suggest that
recombination landscapes could be influenced by gene expression, as transcription is
known to alter chromatin structure. RNA-seq has been used as a powerful method to
determine transcription patterns for specific tissues, cell populations and/or conditions, but
it has heretofore not been exploited as a measure to gather information underlying patterns
of variation in recombination rates across whole chromosomes.
Based on our previous high-resolution genetic maps in Drosophila, here we
investigated the specific hypothesis that DSBs preferentially target transcriptionally active
genomic regions in Drosophila. To our knowledge, our results represent the first evidence in
a multicellular organism that gene expression in early meiotic cells is associated with
increased likelihood of DSBs. Importantly, the preference of DSBs for annotated
transcripts seems to be related to active transcription and therefore supports the model
that gene expression in meiotic tissues plays a role (albeit clearly not the only one) in
influencing the landscapes and magnitude of recombination in a particular genomic region.
Indeed, although the observed association between transcription levels and recombination
rates is highly significant in terms of associated probability, it is weak in terms of the
variation in recombination rates that can be explained solely by transcription. As such, the
proposed influence of transcription on DSB formation and recombination landscapes should
be viewed as one of several determinants of DSB localization. The presence of specific DNA
motifs, the proximity to telomeres/centromeres and other higher-order chromatin structures
during early meiosis are all factors likely to also play a role. Transcriptome data of specific
cell types, possibly using novel transgenic methods, together with detailed genetic analyses
are needed to determine the relative role of gene expression influencing DSB formation and,
ultimately, recombination rates across the Drosophila genome.
Recombination rates are an evolving and variable trait with detectable differences
between species as well as within species. This inter-individual and inherited variation has
been documented for the total number of recombination events per meiosis or per
chromosome as well as in terms of the distribution across chromosomes in a number of
species, including D. melanogaster (Kidwell 1972, Abdullah and Charlesworth 1974, Brooks
and Marks 1986, Williams, Goodman et al. 1995, Koehler, Cherry et al. 2002, Neumann and
Jeffreys 2006, Yandeau-Nelson, Nikolau et al. 2006, Esch, Szymaniak et al. 2007, Coop, Wen
et al. 2008, Grey, Baudat et al. 2009, Rockman and Kruglyak 2009, Dumont, White et al.
2011). Further, classic Drosophila genetics studies expose clear plasticity in the distribution
of recombination rates across the genome as a result of biotic and abiotic factors (Stern
1926, Neel 1941, Redfield 1966, Priest, Galloway et al. 2008) that would also act upon
inherited inter-individual variation. We propose that our model, in which variation in gene
expression plays a role altering the likelihood of DSB formation and thus the landscape of
recombination across chromosomes, could easily reconcile many of these observations and
provide a molecular, heritable and plastic mechanism for a number of observed patterns of
recombination, from the high level of intra-specific variation to the influence of
environmental factors and stress conditions. The concept that gene expression may act as a
“plastic” and heritable modifier of recombination, directly or epigenetically, is particularly
relevant to evolutionary models on the maintenance of recombination. Our proposed model
could represent a direct link between stressful conditions and increased recombination,
either region-specific or genome-wide, the very same circumstances where recombination
may be most favorable (Parsons 1988, Hadany and Beker 2003, Hadany and Beker 2003,
Agrawal, Hadany et al. 2005).
2.4 Methods
2.4.1 Drosophila Stocks and Tissue Preparation
We generated two crosses using two highly inbred strains (RAL-208 and RAL-375) from
the Drosophila Genetic Reference Panel (DGRP) (Mackay, Richards et al. 2012) that have
been previously sequenced and recombination-mapped to high resolution. These
populations were collected in Raleigh (NC, USA) and subjected to 20 generations of full-sib
mating. Freshly eclosed virgin females were collected from both inbred lines and crosses
(males RAL-208 x females RAL-375 and its reciprocal cross) and allowed to mature for 72
hours at 23.5 °C. Ovaries from each of the four genotypes were dissected in RNAlater
Reagent (Qiagen) using forceps and a dissecting probe. Ovarioles were teased apart and
early meiotic portions (germaria to stage 3) removed using electrolytically sharpened
tungsten needles, resulting in four ‘Early’ and four ‘Late’ tissue preparations (Kalistratov
and Bashkirov 1964).
2.4.2 Illumina Library Preparation and Sequencing
Total RNA was prepared from ovaries, ovaries with early meiotic regions removed,
and early meiotic regions following an optimized protocol for the Qiagen RNeasy kit
(Qiagen, Valencia, CA). mRNA was isolated using the Invitrogen Dynabeads mRNA
Purification kit, with two additional wash steps. mRNA was fragmented with a cation
solution from the New England Biolabs NEBNext kit, ethanol precipitated, and cDNA synthesis
was performed with the NEBNext kit. End repair, dA-tailing, and ligation of custom
adapters were also performed with the NEBNext kit following an optimized version of the manufacturer’s
suggested protocol. Adapter-ligated fragments of 300-350 bp were isolated from a 2% low-melt
agarose gel and PCR enriched for 13 cycles. The PCR-enriched libraries were validated by
running an aliquot on a standard agarose gel. Products were purified and their concentration
obtained with the Quant-iT PicoGreen dsDNA Reagent and Kits (Invitrogen, CA, USA) on a
Turner BioSystems TBS-380 Fluorometer. In total, we generated eight Illumina libraries,
with two independently generated libraries per genotype to obtain adequate biological and
technical replicates that were also run in separate Illumina lanes. We ran two lanes with
four multiplexed libraries each. Single-read 120 bp fragments were sequenced on an
Illumina HiSeq 2000 at the University of Iowa DNA Core Facility.
2.4.3 Sequence Alignment and Expression Analyses
Illumina data were separated by tag using FastX Barcode Splitter, and the two lanes of data
for each tag were concatenated. All further analyses were performed within
Galaxy, an accessible bioinformatics framework capable of next-generation sequencing data
analysis (Giardine, Riemer et al. 2005, Blankenberg, Von Kuster et al. 2010, Goecks,
Nekrutenko et al. 2010). Summary statistics were gathered using FastQC. The 5’ adapter
sequence was then removed from each sample and 3’ ends trimmed until reaching a quality
score greater than ten using FastqTrimmer. The groomed data were then mapped to the D.
melanogaster reference genome (Apr. 2006 BDGP R5/dm3) using TopHat v1.4.0 (Roberts,
Pimentel et al. 2011, Trapnell, Roberts et al. 2012).
We then used the Cufflinks package 2.0.2 (Trapnell, Roberts et al. 2012) to assemble
transcripts, obtain their relative abundance and find differentially expressed genes. After
assembling transcripts, CuffMerge was used for merging and annotation analysis and
measures of expression for every transcript associated with a particular gene were obtained
in FPKM (Supplementary Figure 2.1S). Expression calculations for early and late ovary
development were based on two sets of replicates (Early samples of RAL-375 males x RAL-208
females and its reciprocal cross, and Late samples of RAL-375 males x RAL-208 females
and its reciprocal cross) with two biological replicates per genotype and condition (Early or
Late). Classic-FPKM normalization was performed with pooled estimates of dispersion
(negative binomial) following (Trapnell, Hendrickson et al. 2013). We then utilized the
Cuffdiff 2 algorithm (Trapnell, Hendrickson et al. 2013) within Cufflinks 2.0.2 to calculate
differential expression at both the gene and transcript levels. In short, differential gene
expression was calculated using FPKM values for every gene while incorporating expression
level variances during significance testing. This was performed by first deriving a
dispersion model describing variances of fragment counts across replicates, which was then
used to calculate the variances on a gene’s relative expression across replicates following
the method described in (Trapnell, Hendrickson et al. 2013) (Supplementary Figures 2.2S-2.4S).
Genes were considered to be expressed if each sample had a minimum of ten reads
mapped and an FPKM above 0.1 unless noted explicitly. Genes were considered to
be differentially expressed if the prior expression requirements were satisfied and reached
an FDR-corrected significance level of 5% (q < 0.05).
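The expression and significance criteria above can be sketched as a small filter. This is an illustration only: it assumes per-sample read counts and FPKM values are available for each gene, applies the thresholds to every sample, and uses a Benjamini-Hochberg style adjustment to stand in for the FDR correction. In the thesis these steps were carried out by Cuffdiff, and the function names here are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted q values in the same order as the input p values."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotone q values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues

def is_expressed(read_counts, fpkms, min_reads=10, min_fpkm=0.1):
    """True if every sample has at least min_reads reads and FPKM above min_fpkm."""
    return (all(c >= min_reads for c in read_counts)
            and all(f > min_fpkm for f in fpkms))

def is_differentially_expressed(read_counts, fpkms, q_value, alpha=0.05):
    """Expressed by the criteria above and significant at q < alpha."""
    return is_expressed(read_counts, fpkms) and q_value < alpha
```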
2.4.4 Novel Gene Identification
Potentially novel genes were first identified by Cufflinks as significant reads
mapping to unannotated regions of the dm3 genome that fit our expression criteria.
Cufflinks initially identified 6004 potentially novel items. Restricting this list to only those
that were detected in two or more samples reduced the number to 1308, and then filtering
for only those expressed at reliable levels above one FPKM in at least one sample reduced
the set to 220 entries. From this filtered list, we manually identified 47 items that had lengths
greater than approximately 300 bp, were minimally repetitive, and possessed reads that
reliably mapped to predicted splice junctions. We identified 13 of these items to be
candidate novel genes, based on a more stringent visual inspection and identification of
apparent intron-exon structures.
We performed RT-PCR on a subset of our identified potentially novel genes with
probable open reading frames that were missing from both tracks in an attempt to confirm
expression of the novel transcript. PCR primers were designed for regions with significant
RNA-seq reads mapped that spanned more than 300 base pairs. Primers were first tested
with genomic DNA as a template, and then with whole ovary cDNA as a template. Primers
and conditions are available upon request. A gene was considered novel if the mRNA track
contained it but it was unannotated, or if it was missing from both mRNA tracks and the
dm3 genome annotation.
2.4.5 Parent-of-Origin Effects
To study parent-of-origin effects in early and late tissues, we first identified genes
that are significantly differentially expressed between offspring of reciprocal crosses (RAL-375
males x RAL-208 females and its reciprocal cross). We then focused on those genes
with parent-of-origin effect that have levels of transcription in the offspring resembling the
levels of transcription observed in the crosses’ maternal strain (RAL-208 or -375). To
investigate allele-specific transcription in heterozygous offspring we obtained the set of
reads that uniquely map to only one of the D. melanogaster parental strains with zero
mismatches but not to the other parental strain, and vice versa, using MOSAIK assembler
(http://bioinformatics.bc.edu/marthlab/Mosaik). We also removed all reads that would
differentially map to one parental sequence and not to the other if one of the reference
sequences contained one or more ‘N’s for this read. Additionally, we only studied genes with
a minimum of 100 allele-specific reads to increase accuracy in the allelic ratios.
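As a minimal sketch of the allelic-ratio step, the function below assumes that per-gene counts of reads mapping uniquely to each parental allele have already been obtained, and that the 100-read minimum applies to the sum over both alleles (the text does not say whether it is per allele or combined). The input layout and names are hypothetical.

```python
def allelic_ratios(counts, min_reads=100):
    """Fraction of allele-specific reads carrying the RAL-208 allele per gene.

    counts maps gene name -> (reads specific to RAL-208, reads specific to RAL-375).
    Genes with fewer than min_reads allele-specific reads in total are skipped.
    """
    ratios = {}
    for gene, (n208, n375) in counts.items():
        total = n208 + n375
        if total >= min_reads:
            ratios[gene] = n208 / total
    return ratios
```

A ratio above 0.5 indicates an excess of the RAL-208 allele in the heterozygous offspring for that gene.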
Gene-term enrichment analyses were performed with DAVID (Huang da, Sherman et
al. 2009), utilizing the BP_FAT subset of gene ontology (GO) terms to identify enriched
biological themes. We then combined the early and late tissue GO data for each strain and
repeated the analysis. We report p-values from the DAVID analysis according to the EASE
score, a modified Fisher’s exact test (Huang da, Sherman et al. 2009). FDR-corrected EASE
scores are reported utilizing the Benjamini-Hochberg multiple-test correction procedure
employed by DAVID.
2.4.6 Genomic Distribution of Transcribed Genes
In order to test whether transcribed and untranscribed genes were distributed
randomly across the genome, we performed a runs test for randomness. To determine the
number of runs, we separated genes into two categories of transcriptional activity: genes
transcribed at greater levels than 1 FPKM and those that were not. Along the length of the
genome, switches from one category to another counted as a completed run. Overlapping
genes were counted separately following this scheme.
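The runs test described here can be sketched as below, assuming the standard Wald-Wolfowitz formulation with a normal approximation and a two-sided P value; the thesis does not state which variant or software was used. The input is a list of booleans in genome order, for example `[fpkm > 1.0 for fpkm in fpkm_by_position]`, where `fpkm_by_position` is a hypothetical list of per-gene FPKM values.

```python
from math import sqrt, erfc

def runs_test(flags):
    """Wald-Wolfowitz runs test on a sequence of booleans.

    Returns (number of runs, z score, two-sided P value). A strongly negative
    z means fewer runs than expected, i.e. clustering of like categories.
    """
    n1 = sum(1 for f in flags if f)
    n2 = len(flags) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("both categories must be present")
    # A new run starts at every switch between categories.
    runs = 1 + sum(1 for a, b in zip(flags, flags[1:]) if a != b)
    n = n1 + n2
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1.0))
    z = (runs - mean) / sqrt(var)
    p = erfc(abs(z) / sqrt(2.0))
    return runs, z, p
```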
2.4.7 Recombination vs Expression Analysis
To test the hypothesis that gene expression is associated with recombination rates
across the genome, we first generated landscapes of expression for each chromosome. We
calculated the number of genes expressed at a threshold greater than 1 FPKM (unless noted
otherwise) and the number of kilobases transcribed (counting overlapping transcript
regions only once) per adjacent 100-kb interval. We also obtained a measure of overall
transcriptional activity within a genomic interval (OTA), defined as the Log10-transformed
sum of the product FPKM x transcript length for each gene within a given genomic interval.
High-resolution recombination landscapes for adjacent 100-kb regions across the whole
genome were obtained from (Comeron, Ratnappan et al. 2012).
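The three per-interval measures defined above can be sketched for a single interval as follows. The data layout is an assumption: each gene is a tuple `(start, end, fpkm, transcript_length)` with half-open coordinates on one chromosome. The handling of genes that straddle an interval boundary (clipped for the transcribed length, counted in every interval they overlap) is also an assumption, since the text does not specify it.

```python
from math import log10

def interval_summary(genes, start, end, min_fpkm=1.0):
    """Summarize transcription within the interval [start, end).

    Returns (number of expressed genes, kb transcribed with overlapping
    regions counted once, OTA or None if there is no transcription).
    """
    overlapping = [g for g in genes if g[0] < end and g[1] > start]
    expressed = [g for g in overlapping if g[2] > min_fpkm]

    # Merge overlapping expressed spans, clipped to the interval.
    spans = sorted((max(g[0], start), min(g[1], end)) for g in expressed)
    covered = 0
    cur_start, cur_end = None, None
    for s, e in spans:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start

    # OTA: log10 of the summed FPKM x transcript length over genes in the interval.
    activity = sum(g[2] * g[3] for g in overlapping)
    ota = log10(activity) if activity > 0 else None
    return len(expressed), covered / 1000.0, ota
```

For example, two expressed genes spanning 0-2,000 and 1,000-3,000 contribute 3 kb of transcribed sequence, not 4 kb.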
The datasets supporting the results of this article are available in the NCBI SRA repository
[http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP032523]
2.5 Supplementary Information
Table 2.3S Top Enriched^1 GO Terms among genes with parent-of-origin effects with
maternal-like transcript levels for Early- and Late-Ovarian samples

Condition^2  Transcription bias (Strain)^3  Term                                              Count  Percent of Total  P value      FDR-corrected q value
Early        375                            structural constituent of vitelline membrane      4      6.35              3.16x10^-07  4.26x10^-05
Early        375                            vitelline memb. form. in chorion-cont. eggshell   4      6.35              6.48x10^-06  2.36x10^-03
Early        375                            vitelline membrane formation                      4      6.35              6.48x10^-06  2.36x10^-03
Early        375                            ovarian follicle cell development                 8      12.7              2.44x10^-05  4.44x10^-03
Early        375                            extracellular matrix organization                 4      6.35              3.45x10^-05  4.19x10^-03
Early        208                            cell morphogenesis                                72     9.3               1.68x10^-13  3.37x10^-10
Early        208                            cellular component morphogenesis                  79     10.21             3.18x10^-13  3.17x10^-10
Early        208                            neuron differentiation                            68     8.79              3.44x10^-13  2.29x10^-10
Early        208                            neuron development                                61     7.88              6.16x10^-13  3.08x10^-10
Early        208                            ribonucleotide binding                            120    15.5              8.10x10^-13  5.35x10^-10
Late         375                            neuron development                                6      22.22             1.09x10^-03  n.s.
Late         375                            neuron differentiation                            6      22.22             2.26x10^-03  n.s.
Late         375                            behavior                                          6      22.22             2.79x10^-03  n.s.
Late         375                            transcription regulator activity                  7      25.93             5.41x10^-03  n.s.
Late         375                            regulation of transcription                       7      25.93             8.79x10^-03  n.s.
Late         208                            contractile fiber                                 12     2.92              3.56x10^-11  4.37x10^-08
Late         208                            contractile fiber part                            11     2.68              2.55x10^-10  3.13x10^-07
Late         208                            sarcomere                                         10     2.43              8.28x10^-10  1.02x10^-06
Late         208                            myofibril                                         10     2.43              1.83x10^-09  2.24x10^-06
Late         208                            chorion                                           9      2.19              1.66x10^-07  2.03x10^-04
Late         208                            external encapsulating structure                  9      2.19              2.77x10^-07  3.40x10^-04

^1 The top 5 most enriched GO terms are shown for each category. ^2 Early and Late indicate Drosophila Early- and
Late-ovarian transcriptome. ^3 Transcription bias indicates the maternal strain towards which a gene shows
similarity while showing differential transcription levels between the offspring of reciprocal crosses. (n.s., q >
0.05).
Table 2.4S Top Enriched^1 GO Terms among genes with parent-of-origin effects with
maternal-like transcript levels

Transcription bias (Strain)^2  Term                                                         n    Percent of Total  P value      FDR-corrected q value
375                            structural constituent of vitelline membrane                 4    6.45              3.16x10^-7   4.26x10^-5
375                            vitelline membrane formation in chorion-containing eggshell  4    6.45              6.48x10^-6   2.36x10^-3
375                            vitelline membrane formation                                 4    6.45              6.48x10^-6   2.36x10^-3
375                            ovarian follicle cell development                            8    12.9              2.44x10^-5   4.44x10^-3
375                            extracellular matrix organization                            4    6.45              3.45x10^-5   4.19x10^-3
208                            cell morphogenesis                                           72   9.81              1.10x10^-14  2.20x10^-11
208                            cellular component morphogenesis                             79   10.76             1.80x10^-14  3.46x10^-11
208                            neuron differentiation                                       68   9.13              8.80x10^-14  1.72x10^-10
208                            cell morphogenesis involved in differentiation               56   7.63              9.00x10^-14  1.76x10^-10
208                            neuron development                                           60   8.17              2.07x10^-13  4.06x10^-10
208                            cell morphogenesis involved in neuron differentiation        52   7.08              2.08x10^-12  4.08x10^-09

^1 The top 5 most enriched GO terms are shown for each category. ^2 Transcription bias indicates the maternal
strain towards which a gene shows similarity while showing differential transcription levels between the
offspring of reciprocal crosses.
Figure 2.5S FPKM distribution density across genes
Blue region: Early-ovarian transcriptome;
Orange region: Late-ovarian transcriptome.
Figure 2.6S Volcano plot of genes by significance
and fold change.
Figure 2.7S Early vs Late Log2 fold difference histogram
following normalization
Figure 2.8S P-value distribution following normalization
and testing with CuffDiff v2.0.2 before correcting for
multiple tests. This histogram displays the approximate
expected distribution of significance following proper
normalization—harboring signals of true p
2.6 Chapter 2 References
Abdullah, N. F. and B. Charlesworth (1974). "Selection for reduced crossing over in
Drosophila melanogaster." Genetics 76(3): 447-451.
Agrawal, A. F., L. Hadany and S. P. Otto (2005). "The evolution of plastic recombination."
Genetics 171(2): 803-812.
Assis, R., Q. Zhou and D. Bachtrog (2012). "Sex-biased transcriptome evolution in
Drosophila." Genome Biol Evol 4(11): 1189-1200.
Baudat, F., J. Buard, C. Grey, A. Fledel-Alon, C. Ober, M. Przeworski, G. Coop and B. de Massy
(2010). "PRDM9 Is a Major Determinant of Meiotic Recombination Hotspots in Humans and
Mice." Science 327(5967): 836-840.
Blankenberg, D., G. Von Kuster, N. Coraor, G. Ananda, R. Lazarus, M. Mangan, A. Nekrutenko
and J. Taylor (2010). "Galaxy: a web-based genome analysis tool for experimentalists." Curr
Protoc Mol Biol Chapter 19: Unit 19 10 11-21.
Boutanaev, A. M., A. I. Kalmykova, Y. Y. Shevelyov and D. I. Nurminsky (2002). "Large
clusters of co-expressed genes in the Drosophila genome." Nature 420(6916): 666-669.
Brooks, L. D. and R. W. Marks (1986). "The organization of genetic variation for
recombination in Drosophila melanogaster." Genetics 114(2): 525-547.
Cash, A. C. and J. Andrews (2012). "Fine scale analysis of gene expression in Drosophila
melanogaster gonads reveals Programmed cell death 4 promotes the differentiation of
female germline stem cells." BMC Dev Biol 12: 4.
Chintapalli, V. R., J. Wang and J. A. Dow (2007). "Using FlyAtlas to identify better Drosophila
melanogaster models of human disease." Nat Genet 39(6): 715-720.
Comeron, J. M., R. Ratnappan and S. Bailin (2012). "The many landscapes of recombination
in Drosophila melanogaster." PLoS Genet 8(10): e1002905.
Coop, G., X. Wen, C. Ober, J. K. Pritchard and M. Przeworski (2008). "High-resolution
mapping of crossovers reveals extensive variation in fine-scale recombination patterns
among humans." Science 319(5868): 1395-1398.
Cutter, A. D. and J. Y. Choi (2010). "Natural selection shapes nucleotide polymorphism
across the genome of the nematode Caenorhabditis briggsae." Genome Res 20(8): 1103-1111.
Dumont, B. L. and B. A. Payseur (2011). "Genetic analysis of genome-scale recombination
rate evolution in house mice." PLoS Genet 7(6): e1002116.
Dumont, B. L., M. A. White, B. Steffy, T. Wiltshire and B. A. Payseur (2011). "Extensive
recombination rate variation in the house mouse species complex inferred from genetic
linkage maps." Genome Res 21(1): 114-125.
Esch, E., J. M. Szymaniak, H. Yates, W. P. Pawlowski and E. S. Buckler (2007). "Using
crossover breakpoints in recombinant inbred lines to identify quantitative trait loci
controlling the global recombination frequency." Genetics 177(3): 1851-1858.
Fledel-Alon, A., E. M. Leffler, Y. Guan, M. Stephens, G. Coop and M. Przeworski (2011).
"Variation in human recombination rates and its genetic determinants." PLoS ONE 6(6):
e20321.
Gan, Q., I. Chepelev, G. Wei, L. Tarayrah, K. Cui, K. Zhao and X. Chen (2010). "Dynamic
regulation of alternative splicing and chromatin structure in Drosophila gonads revealed by
RNA-seq." Cell Res 20(7): 763-783.
Getun, I. V., Z. K. Wu, A. M. Khalil and P. R. Bois (2010). "Nucleosome occupancy landscape
and dynamics at mouse recombination hotspots." EMBO Rep 11(7): 555-560.
Giardine, B., C. Riemer, R. C. Hardison, R. Burhans, L. Elnitski, P. Shah, Y. Zhang, D.
Blankenberg, I. Albert, J. Taylor, W. Miller, W. J. Kent and A. Nekrutenko (2005). "Galaxy: a
platform for interactive large-scale genome analysis." Genome Res 15(10): 1451-1455.
Goecks, J., A. Nekrutenko and J. Taylor (2010). "Galaxy: a comprehensive approach for
supporting accessible, reproducible, and transparent computational research in the life
sciences." Genome Biol 11(8): R86.
Grey, C., F. Baudat and B. de Massy (2009). "Genome-wide control of the distribution of
meiotic recombination." PLoS Biol 7(2): e35.
Hadany, L. and T. Beker (2003). "Fitness-associated recombination on rugged adaptive
landscapes." J Evol Biol 16(5): 862-870.
Hadany, L. and T. Beker (2003). "On the evolutionary advantage of fitness-associated
recombination." Genetics 165(4): 2167-2179.
Heil, C. S. and M. A. Noor (2012). "Zinc finger binding motifs do not explain recombination
rate variation within or between species of Drosophila." PLoS ONE 7(9): e45055.
Huang da, W., B. T. Sherman and R. A. Lempicki (2009). "Systematic and integrative analysis
of large gene lists using DAVID bioinformatics resources." Nat Protoc 4(1): 44-57.
Kalistratov, G. F. and A. A. Bashkirov (1964). "[Use of a Serial Portable Apparatus for
Electrolytic Sharpening of Surgical Instruments in the Preparation of Mental
Microelectrodes]." Biull Eksp Biol Med 58: 122-123.
Kidwell, M. G. (1972). "Genetic change of recombination value in Drosophila melanogaster. II.
Simulated natural selection." Genetics 70(3): 433-443.
Kirilly, D. and T. Xie (2007). "The Drosophila ovary: an active stem cell community." Cell Res
17(1): 15-25.
Kirkpatrick, D. T., Y. H. Wang, M. Dominska, J. D. Griffith and T. D. Petes (1999). "Control of
meiotic recombination and gene expression in yeast by a simple repetitive DNA sequence
that excludes nucleosomes." Mol Cell Biol 19(11): 7661-7671.
Koehler, K. E., J. P. Cherry, A. Lynn, P. A. Hunt and T. J. Hassold (2002). "Genetic control of
mammalian meiotic recombination. I. Variation in exchange frequencies among males from
inbred mouse strains." Genetics 162(1): 297-306.
Kulathinal, R. J., S. M. Bennett, C. L. Fitzpatrick and M. A. Noor (2008). "Fine-scale mapping of
recombination rate in Drosophila refines its correlation to diversity and divergence." Proc
Natl Acad Sci U S A 105(29): 10051-10056.
Lake, C. M. and R. S. Hawley (2012). "The molecular control of meiotic chromosomal
behavior: events in early meiotic prophase in Drosophila oocytes." Annu Rev Physiol 74:
425-451.
Lindsley, D. L. and G. G. Zimm (1992). The genome of Drosophila melanogaster. San Diego,
CA, Academic Press.
Liu, J. L., M. Buszczak and J. G. Gall (2006). "Nuclear bodies in the Drosophila germinal
vesicle." Chromosome Research 14(4): 465-475.
Llopart, A. (2012). "The rapid evolution of X-linked male-biased gene expression and the
large-X effect in Drosophila yakuba, D. santomea, and their hybrids." Mol Biol Evol 29(12):
3873-3886.
Mackay, T. F., S. Richards, E. A. Stone, A. Barbadilla, J. F. Ayroles, D. Zhu, S. Casillas, Y. Han, M.
M. Magwire, J. M. Cridland, M. F. Richardson, R. R. Anholt, M. Barron, C. Bess, K. P.
Blankenburg, M. A. Carbone, D. Castellano, L. Chaboub, L. Duncan, Z. Harris, M. Javaid, J. C.
Jayaseelan, S. N. Jhangiani, K. W. Jordan, F. Lara, F. Lawrence, S. L. Lee, P. Librado, R. S.
Linheiro, R. F. Lyman, A. J. Mackey, M. Munidasa, D. M. Muzny, L. Nazareth, I. Newsham, L.
Perales, L. L. Pu, C. Qu, M. Ramia, J. G. Reid, S. M. Rollmann, J. Rozas, N. Saada, L. Turlapati, K.
C. Worley, Y. Q. Wu, A. Yamamoto, Y. Zhu, C. M. Bergman, K. R. Thornton, D. Mittelman and R.
A. Gibbs (2012). "The Drosophila melanogaster Genetic Reference Panel." Nature
482(7384): 173-178.
McGaugh, S. E., C. S. Heil, B. Manzano-Winkler, L. Loewe, S. Goldstein, T. L. Himmel and M. A.
Noor (2012). "Recombination modulates how selection affects linked sites in Drosophila."
PLoS Biol 10(11): e1001422.
Meiklejohn, C. D. and D. C. Presgraves (2012). "Little evidence for demasculinization of the
Drosophila X chromosome among genes expressed in the male germline." Genome Biol Evol
4(10): 1007-1016.
Meisel, R. P., J. H. Malone and A. G. Clark (2012). "Disentangling the relationship between
sex-biased gene expression and X-linkage." Genome Res 22(7): 1255-1265.
Miller, D. E., S. Takeo, K. Nandanan, A. Paulson, M. M. Gogol, A. C. Noll, A. G. Perera, K. N.
Walton, W. D. Gilliland, H. Li, K. K. Staehling, J. P. Blumenstiel and R. S. Hawley (2012). "A
Whole-Chromosome Analysis of Meiotic Recombination in Drosophila melanogaster." G3
(Bethesda) 2(2): 249-260.
Morris, L. X. and A. C. Spradling (2011). "Long-term live imaging provides new insight into
stem cell regulation and germline-soma coordination in the Drosophila ovary."
Development 138(11): 2207-2215.
Myers, S., L. Bottolo, C. Freeman, G. McVean and P. Donnelly (2005). "A fine-scale map of
recombination rates and hotspots across the human genome." Science 310(5746): 321-324.
Neel, J. V. (1941). "A relation between larval nutrition and the frequency of crossing over in
the third chromosome of Drosophila melanogaster." Genetics 26(5): 506-516.
Neumann, R. and A. J. Jeffreys (2006). "Polymorphism in the activity of human crossover
hotspots independent of local DNA sequence variation." Hum Mol Genet 15(9): 1401-1411.
Pan, J., M. Sasaki, R. Kniewel, H. Murakami, H. G. Blitzblau, S. E. Tischfield, X. Zhu, M. J. Neale,
M. Jasin, N. D. Socci, A. Hochwagen and S. Keeney (2011). "A hierarchical combination of
factors shapes the genome-wide topography of yeast meiotic recombination initiation." Cell
144(5): 719-731.
Parisi, M., R. Nuttall, P. Edwards, J. Minor, D. Naiman, J. Lu, M. Doctolero, M. Vainer, C. Chan, J.
Malley, S. Eastman and B. Oliver (2004). "A survey of ovary-, testis-, and soma-biased gene
expression in Drosophila melanogaster adults." Genome Biol 5(6): R40.
Parisi, M., R. Nuttall, D. Naiman, G. Bouffard, J. Malley, J. Andrews, S. Eastman and B. Oliver
(2003). "Paucity of genes on the Drosophila X chromosome showing male-biased
expression." Science 299(5607): 697-700.
Parsons, P. A. (1988). "Evolutionary rates: effects of stress upon recombination." Biological
Journal of the Linnean Society 35(1): 49-68.
Petes, T. D. (2001). "Meiotic recombination hot spots and cold spots." Nat Rev Genet 2(5):
360-369.
Priest, N. K., L. F. Galloway and D. A. Roach (2008). "Mating frequency and inclusive fitness
in Drosophila melanogaster." Am Nat 171(1): 10-21.
Ranz, J. M., C. I. Castillo-Davis, C. D. Meiklejohn and D. L. Hartl (2003). "Sex-dependent gene
expression and evolution of the Drosophila transcriptome." Science 300(5626): 1742-1745.
Redfield, H. (1966). "Delayed mating and the relationship of recombination to maternal age
in Drosophila melanogaster." Genetics 53(3): 593-607.
Ringrose, L. and R. Paro (2007). "Polycomb/Trithorax response elements and epigenetic
memory of cell identity." Development 134(2): 223-232.
Roberts, A., H. Pimentel, C. Trapnell and L. Pachter (2011). "Identification of novel
transcripts in annotated genomes using RNA-Seq." Bioinformatics 27(17): 2325-2329.
Rockman, M. V. and L. Kruglyak (2009). "Recombinational landscape and population
genomics of Caenorhabditis elegans." PLoS Genet 5(3): e1000419.
Roth, S. and J. A. Lynch (2009). "Symmetry breaking during Drosophila oogenesis." Cold
Spring Harb Perspect Biol 1(2): a001891.
Singer, T., Y. Fan, H. S. Chang, T. Zhu, S. P. Hazen and S. P. Briggs (2006). "A high-resolution
map of Arabidopsis recombinant inbred lines by whole-genome exon array hybridization."
PLoS Genet 2(9): e144.
Singh, N. D., E. A. Stone, C. F. Aquadro and A. G. Clark (2013). "Fine-scale heterogeneity in
crossover rate in the Garnet-Scalloped region of the Drosophila melanogaster X
Chromosome." Genetics.
Spellman, P. T. and G. M. Rubin (2002). "Evidence for large domains of similarly expressed
genes in the Drosophila genome." J Biol 1(1): 5.
Spradling, A., M. T. Fuller, R. E. Braun and S. Yoshida (2011). "Germline stem cells." Cold
Spring Harb Perspect Biol 3(11): a002642.
Spradling, A. C. (1993). Developmental genetics of oogenesis. The Development of
Drosophila melanogaster. M. a. M.-A. Bate, A. Cold Spring Harbor, N.Y., Cold Spring Harbor
Press. 1: 1–70.
Stern, C. (1926). "An effect of temperature and age on crossing-over in the first chromosome
of Drosophila melanogaster." Proc Natl Acad Sci U S A 12(8): 530-532.
Sturgill, D., Y. Zhang, M. Parisi and B. Oliver (2007). "Demasculinization of X chromosomes
in the Drosophila genus." Nature 450(7167): 238-241.
Tang, F., C. Barbacioru, Y. Wang, E. Nordman, C. Lee, N. Xu, X. Wang, J. Bodeau, B. B. Tuch, A.
Siddiqui, K. Lao and M. A. Surani (2009). "mRNA-Seq whole-transcriptome analysis of a
single cell." Nat Methods 6(5): 377-382.
Trapnell, C., D. G. Hendrickson, M. Sauvageau, L. Goff, J. L. Rinn and L. Pachter (2013).
"Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature
Biotechnology 31(1): 46-53.
Trapnell, C., A. Roberts, L. Goff, G. Pertea, D. Kim, D. R. Kelley, H. Pimentel, S. L. Salzberg, J. L.
Rinn and L. Pachter (2012). "Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks." Nat Protoc 7(3): 562-578.
Wang, Z., M. Gerstein and M. Snyder (2009). "RNA-Seq: a revolutionary tool for
transcriptomics." Nat Rev Genet 10(1): 57-63.
Weber, C. C. and L. D. Hurst (2011). "Support for multiple classes of local expression clusters
in Drosophila melanogaster, but no evidence for gene order conservation." Genome Biol
12(3): R23.
Williams, C. G., M. M. Goodman and C. W. Stuber (1995). "Comparative recombination
distances among Zea mays L. inbreds, wide crosses and interspecific hybrids." Genetics
141(4): 1573-1581.
Yandeau-Nelson, M. D., B. J. Nikolau and P. S. Schnable (2006). "Effects of trans-acting
genetic modifiers on meiotic recombination across the a1-sh2 interval of maize." Genetics
174(1): 101-112.
CHAPTER 3: In silico prediction of recombination rate variation
across the Drosophila melanogaster genome based on multiple
DNA motif analysis
3.0.1 Preface
Chapter 3 appears here as a reprinting of an article by Adrian, Cruz-Corchado, and Comeron
with the same title submitted for review in 2015. Formatting and minor alterations have
been made for consistency.
3.0.2 Abstract
In all eukaryotic species examined, meiotic crossovers occur non-randomly along
chromosomes. The cause for this nonrandom distribution remains poorly understood but
the presence of specific DNA motifs has been shown to play a contributory role in crossover
localization in a number of species. In humans and mice, a motif targeted by the protein
PRDM9 is strongly associated with crossover hotspots but even in this paradigmatic case
motif presence alone is a poor predictor of crossover distribution when studied at a whole-genome scale. In Drosophila, contrary to the human and mouse case, no PRDM9 homolog
exists and recent studies suggest that many different motifs are enriched near
experimentally determined recombination events. Here, we present genomic and
bioinformatic analyses in D. melanogaster to investigate whether any DNA motif
distribution can be used to predict crossover variation across the whole genome using
machine-learning algorithms including Random Forests (RF) and multivariate adaptive
regression splines (MARS). MARS, in particular, generates models of crossover distribution
that expose a combinatorial, non-linear influence of motif presence able to account for more
than 40% of the variance in crossover rates genome-wide. We show that highly predictive
motifs share structural similarities that suggest secondary and tertiary DNA structures can
be important factors in crossover localization. We also show that transcriptional activity
during early meiosis and differences in motif use among chromosome arms add to the
predictive power of the models. Our work presents a more detailed picture of crossover
determination in Drosophila that includes DNA motif effects and a potential mechanistic
explanation for the known plasticity in recombination rates, thus paving the way for further
understanding of the multifactorial genetic and epigenetic nature of crossover distribution
across genomes.
3.1 Background
Meiosis is a pervasive process among eukaryotes and the meiotic machinery is
heavily conserved (Keeney 2001). Yet, the rate of meiotic recombination and crossover, in
particular, exhibits an astounding degree of variation across genomes as well as between
closely related species, populations of the same species, and even among individuals of the
same population (Neel 1941; Parsons 1988; Kim et al. 2007; Coop et al. 2008; Kulathinal et
al. 2008; Mancera et al. 2008; Kong et al. 2010; Dumont et al. 2011; Fledel-Alon et al. 2011;
Ross et al. 2011; Smukowski, Noor 2011; Comeron, Ratnappan, Bailin 2012; McGaugh et al.
2012; Miller et al. 2012; Singh et al. 2013; Gossmann et al. 2014; Liu et al. 2015). Moreover,
crossover rates are affected by other factors such as age, temperature, stressors, etc.,
indicating that a precise description of crossover distribution requires identifying both
heritable genetic variation and epigenetic factors (Brooks 1988; Kong et al. 2002; Hussin et
al. 2011). Indeed, epigenetic marks and active transcription have been implicated in
shaping the distribution of double-strand breaks (DSBs) and ultimately crossover rates in a
number of species including Drosophila melanogaster (Mirouze et al. 2012; Yelina et al.
2012; Adrian, Comeron 2013).
To gain insight into the factors involved in crossover localization much attention has
been given to short DNA sequence motifs near crossovers. Computational analyses of high-resolution
maps can identify specific motifs enriched at hotspot regions, but the transition
to biological relevance has not been clear (MacIsaac, Fraenkel 2006; Simcha, Price, Geman
2012). Even when a motif has been associated with a specific process or function, analyses
of motif presence using only DNA primary sequences are rarely predictive enough to
forecast specific patterns such as crossover distribution at a whole-genome scale. One of these
cases is the 13-mer motif recognized by the protein PRDM9 in humans and mice, with
PRDM9 promoting histone methylation and crossing over around the DNA motif (Baudat et
al. 2010). Indeed, the PRDM9-associated motif is present in approximately 40-60% of
crossover hotspots in humans (Myers et al. 2008; Hinch et al. 2011). The reverse is much
less often true: the PRDM9 motif occurs over 290,000 times in the human genome while
fewer than 50,000 recombination hotspots have been identified (Ségurel, Leffler,
Przeworski 2011). Therefore, the PRDM9 hotspot-associated motif is not a strong predictor
of crossover distribution at a genome-wide level even in species with bona fide hotspots
that have >100-fold crossover rate relative to coldspots. A recent analysis across the ape
phylogeny supports this conclusion, with enrichment of putative PRDM9 binding in
recombination hotspots when compared to coldspot regions while there is no significant
association between PRDM9 presence and recombination rates when measured broadly
across the genome (Stevison et al. 2015). An equivalent case is observed in the fission yeast
Schizosaccharomyces pombe, where motifs significantly enriched near some hotspots are,
nonetheless, very poor predictors of hotspot localization genome-wide (Fowler et al. 2014).
In Drosophila, high-resolution recombination maps have revealed that crossover
rates may vary up to 20- to 40-fold across genomic regions traditionally assumed to exhibit
limited variation in recombination, describing ‘peaks’ of crossover rates that are far less
extreme and physically discrete than in species with traditional hotspots (Kulathinal et al.
2008; Comeron, Ratnappan, Bailin 2012; McGaugh et al. 2012; Singh et al. 2013). Moreover,
Drosophila species, like other species including some placental mammals, do not have
functional PRDM9 orthologs (Oliver et al. 2009; Parvanov, Petkov, Paigen 2010; Muñoz-Fuentes, Di Rienzo, Vilà 2011; Heil, Noor 2012). The 13-mer motif associated with human
hotspots that is recognized by PRDM9 is not observed near crossover events in Drosophila
(Comeron, Ratnappan, Bailin 2012; Heil, Noor 2012). In fact, sequence analyses in D.
melanogaster have identified not one but many DNA motifs significantly enriched near
crossover events (Comeron, Ratnappan, Bailin 2012; Miller et al. 2012; Singh et al. 2013).
Current data, therefore, suggest that Drosophila lacks mammalian-like hotspots and that
crossover localization may be influenced by the combined effects of several motifs; we
hypothesize that crossover-inducing motifs may be more evolutionarily stable than in
species with bona fide hotspots (e.g., humans). This scenario would support the possibility
that studies of motif presence to some degree may be informative at a species-level
describing genome-wide recombination variation. Whether the many crossover-associated
motifs have predictive power describing variation in crossover rates across Drosophila
genomes has not been explored. Whether the potential effects of motif presence on
crossover localization are individual or synergistic, or whether the same set of motifs plays
a role in crossover localization in different genomic regions remains unknown as well.
Here we present an investigation on the potential capability of predicting crossover
variation across the genome of D. melanogaster based solely on the analysis of genomic
sequence and the distribution of specific motifs. We first generated landscapes of motif
presence that take into account the probabilistic nature of motif sequences, background
composition, and the numerous false positives expected in any large-scale genomic study.
Using machine-learning techniques we show that the presence of multiple motifs can
explain a significant fraction of the observed variation in crossover distribution at a whole-genome scale. Our quantitative models can explain more than 40% of the genome-wide
variation in crossover rates without studying meiotic products, and are particularly
accurate (more than 60% positive rate) at detecting the genomic regions with the highest
10% and lowest 10% crossover rates. We show that while the effect of each motif is small, there
is a multifactorial and non-linear influence of motif presence in crossover localization, with
the predictive power of all models increasing with additional motifs. We also show that
transcriptional activity during early meiosis adds predictive power to the models and thus
explicitly includes a potential mechanistic explanation for the known plasticity in
recombination rates. We also report for the first time that motif effects on crossover rates
vary among chromosome arms.
3.2 Results & Discussion
3.2.1 Generation of Genome-Wide landscapes of DNA motifs
The study of almost 2,000 crossover events localized with high resolution (less than
500 bp) exposed many DNA motifs enriched near/at crossover events (Comeron,
Ratnappan, Bailin 2012) and we used the reported position frequency matrix (PFM) of
these motifs to generate corresponding landscapes of motif presence across the whole
reference genome (see Methods). We avoided searching for ‘word’ matches of the most
common nucleotides (with or without including ambiguous sites using IUPAC nucleotide
code) to capture the probabilistic nature of most DNA motif composition. Instead, and for
each motif, we applied a genomic scan to assign a likelihood L_i to every k-mer sequence to
fit the PFM (with i and k indicating a genomic position and the length of the motif,
respectively). We then generated a genome-wide null distribution of L_i (RL_i) based on
random shuffling of nucleotide and dinucleotide composition and used the null distribution
of RL_i to obtain a threshold for observed L_i that would represent a desired False-Discovery
Rate (FDR) (e.g., L_FDR=0.01 when the FDR is 1%). We call a motif present at position i
only when L_i > L_FDR. This approach allows applying any arbitrary FDR and, importantly,
takes into account the number of sites under study as well as background nucleotide
composition.
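The scanning procedure described above can be sketched as follows. This is a minimal illustration and not the pipeline used in this study: the function names, the pseudocount, and the toy PFM format (one dictionary of base counts per motif position) are assumptions made for the example. The sketch also shuffles single nucleotides only, and it uses an upper quantile of the shuffled-sequence scores as a simple stand-in for the FDR-derived threshold.

```python
import math
import random

BASES = "ACGT"

def pfm_to_log_probs(pfm, pseudocount=0.5):
    """Convert a PFM (list of {base: count} columns) into log-probabilities."""
    log_probs = []
    for column in pfm:
        total = sum(column.get(b, 0) + pseudocount for b in BASES)
        log_probs.append(
            {b: math.log((column.get(b, 0) + pseudocount) / total) for b in BASES}
        )
    return log_probs

def scan(sequence, log_probs):
    """Log-likelihood of every k-mer (k = motif length); assumes only A/C/G/T."""
    k = len(log_probs)
    return [
        sum(log_probs[j][sequence[i + j]] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]

def null_threshold(sequence, log_probs, alpha=0.01, n_shuffles=20, seed=0):
    """Score exceeded by at most a fraction alpha of k-mers in shuffled sequences."""
    rng = random.Random(seed)
    null_scores = []
    for _ in range(n_shuffles):
        shuffled = list(sequence)
        rng.shuffle(shuffled)  # keeps mononucleotide composition only
        null_scores.extend(scan("".join(shuffled), log_probs))
    null_scores.sort()
    index = min(len(null_scores) - 1, math.ceil((1 - alpha) * len(null_scores)))
    return null_scores[index]

def call_motifs(sequence, pfm, alpha=0.01):
    """Positions whose k-mer score exceeds the null-derived threshold."""
    log_probs = pfm_to_log_probs(pfm)
    threshold = null_threshold(sequence, log_probs, alpha)
    return [i for i, s in enumerate(scan(sequence, log_probs)) if s > threshold]
```

Counting the positions returned by `call_motifs` within consecutive 100-kb windows would then give a landscape of motif presence of the kind analyzed below.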
FDR-corrected motifs predictably show greater variation in composition and
decreasing similarity to the initial seed motif (increased bit scores) when FDR increases
(see Figure 3.5S). Increasing FDR not only causes a fast increase in overall motif count but
can also alter the relative distribution across the genome, reinforcing the need for a
stringent FDR in genome-wide motif analyses. Unless otherwise indicated, we focused on
FDR-corrected landscapes of motif presence at 100-kb scale with a conservative FDR of 1%.
An FDR of 1% maximizes dynamic range and recovers a sequence logo nearly
indistinguishable from the one produced using an FDR < 1%, all while restricting the
fraction of false positives to an acceptable level. All
20 motifs investigated (see Methods) show a wide-range distribution across the genome
(Table 3.1S), ranging from zero to a maximum of 139 motif matches per 100-kb region with
an average presence that varies from 0.08 (motif M10) to 39 (M6). Genome-wide presence
ranges from 95 (M10) to a maximum of over 46,000 motif hits (M6) even when FDR is set to
1%, which highlights the need for caution when interpreting the biological relevance of
individual motifs.
3.2.2 Variation in motif presence among chromosome arms
We observe that many motifs are not equally distributed among chromosome arms
(Table 3.1S). All 16 motifs present more than 1,000 times genome-wide show a significant
departure from random distribution (Kruskal-Wallis test based on the comparison of motif
presence per 100-kb region among chromosome arms, P < 0.001 in all cases). This result
opens the possibility that the different chromosome arms might utilize different DNA
motifs, individually or in combination, as factors in crossover localization (see
below). Moreover, the unusual pattern of motif distribution is not associated with
differences between autosomal arms vs. chromosome X since 14 out of the 16 motifs also
show significant presence heterogeneity among the four autosomal arms (P < 0.001).
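For readers unfamiliar with the test, the Kruskal-Wallis statistic for comparing per-window motif counts among chromosome arms can be computed as in the sketch below. The helper names are hypothetical, and the tie correction is omitted for brevity, so this is an illustration of the statistic rather than the exact procedure used here. For large samples, H is compared against a chi-square distribution with (number of groups - 1) degrees of freedom.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H (no tie correction); groups = one list of counts per arm."""
    pooled = [v for group in groups for v in group]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, offset = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[offset:offset + len(group)])
        total += rank_sum ** 2 / len(group)
        offset += len(group)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
```

Here each group would hold the motif counts of all 100-kb windows from one chromosome arm.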
3.2.3 Genomic co-occurrence of motifs
We also observe that the presence of these motifs across the genome is highly
spatially associated, with more than 50% of all pairwise genome-wide motif distributions
showing Spearman’s ρ > 0.15 (P < 1 x 10^-8), a percentage that reaches 75% when the
uncommon motifs (fewer than 1,000 counts genome-wide) are removed from the study
(average pairwise ρ = 0.30). Note that spatially-randomized motifs do not produce any
association (an average ρ of 7 x 10^-6 and a maximum pairwise ρ of 0.15 out of 50 million
simulations of randomized motifs). Because our method to call motifs included background
composition and fully avoids overlapping motifs, the existence of high spatial covariance of
motif presence opens up the possibility of multifactorial regulation of crossover
localization.
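The pairwise comparison and its randomized control can be illustrated with a short sketch. It computes Spearman's ρ between two per-window count vectors and builds a null distribution by shuffling window positions of one motif. The function names and the shuffling scheme are assumptions for illustration; the randomization used in this study may differ in detail.

```python
import random

def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def shuffled_rhos(x, y, n_shuffles=1000, seed=0):
    """Null distribution of rho when the windows of y are spatially randomized."""
    rng = random.Random(seed)
    y_shuffled = list(y)
    rhos = []
    for _ in range(n_shuffles):
        rng.shuffle(y_shuffled)
        rhos.append(spearman_rho(x, y_shuffled))
    return rhos
```

The same `spearman_rho` function applies unchanged when one of the two vectors holds crossover rates instead of motif counts.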
3.2.4 Motif presence is correlated with crossover rates across the genome
At first glance the observed variation of motif presence on a chromosome level
visually follows the historical description of broad-scale crossover rate variation in D.
melanogaster (see Figure 3.1). Motif occurrence is reduced near telomeric and centromeric
regions of all chromosomes concomitant with a tendency for reduced crossover rates
(Morton, Rao, Yee 1976; Lindsley, Zimm 1992; Comeron, Ratnappan, Bailin 2012; Miller et
al. 2012). Although the fraction of the genome initially used to infer motifs from
experimentally characterized crossovers was minimal (less than 1% of the genome), we
studied potential genome-wide associations between variation in motif presence and
crossover rates using estimates of recombination completely independent of the
experimental genetic map used to find the initial motifs. To this end, we used high-resolution population estimates of crossover rate across the genome based on patterns of
linkage disequilibrium (Chan, Jenkins, Song 2012) applied to D. melanogaster African
Rwanda (RG) and Zambia (ZI) populations [(Pool et al. 2012; Lack et al. 2015); see Methods
for details]. Unless noted, analyses are based on crossover estimates of the RG population
(see Methods).
The direct comparison of crossover rates and motif presence at 100-kb scale shows
a very strong positive association for 15 motifs, with Spearman’s ρ ranging between 0.15 (P
= 3 x 10^-7) and 0.51 (P < 1 x 10^-16) (Figure 3.2 and Figure 3.6S). Because the 20 motifs
analyzed were originally identified based on a very small fraction of the genome, the
detection of a strong association for many motifs at a genome-wide scale supports the initial
approach for detecting sequence signatures of crossover localization genome-wide. Not all
motifs reported to be enriched near crossover events, however, are positively associated
with crossover rates at a genome-wide scale, with 5 motifs showing ρ < 0.05 (P > 0.1),
mostly attributable to their very limited presence across the genome once FDR-correction is
applied (see Table 3.1S).
We observe that some motifs show clear differences among chromosome arms in
terms of association with crossover rates (Figure 3.2, Figure 3.6S and Figure 3.7S). For
instance, variation in M7 presence shows a very strong association with crossover rates
across the X chromosome (ρ = 0.51, P = 3x10^-16) while it shows no association across
autosomal arms (P > 0.1 in all autosomal arms). M2, on the other hand, shows strong
positive association with crossover rates across all four autosomal arms (ρ > 0.28; P < 2x10^-5)
but not across the X chromosome (ρ = 0.085; P > 0.1). This difference in potential causal
effects of motif presence on crossover localization, however, does not seem to be simply an
autosomal vs. X effect since other motifs (e.g. M1, M6, M11, M13) show more complex
patterns.
In an effort to obtain a baseline model that considers multiple motifs to explain
crossover distribution we first performed LASSO (Least Absolute Shrinkage and Selection
Operator) regression (Tibshirani 1996; Hastie et al. 2009) (see Methods for details). LASSO
regression is a data mining technique that favors solutions with fewer parameter values
under a linear model, simultaneously performing variable selection and simplifying model
interpretation. LASSO exposes six heavily weighted motifs (in order of importance M3, M5,
M1, M7, M2, and M18), all positively associated with crossover rates. With these six motifs,
LASSO fits a linear model of motif presence that explains 30% of the variation in crossover
rates genome-wide (ρ = 0.55, P < 2.2x10^-16). Note that while motifs M5, M3 and M1 were
among the motifs with the highest association with crossover rates based on simple
individual non-parametric Spearman’s correlations (Figure 3.2 and Figure 3.6S), motifs M7,
M2 and M18 were not in the top six. Easing the constraints of LASSO (ρ + 1 S.E.; Figure 3.8S)
allows all but one motif variable to enter the model but this highly complex model exhibits
little improvement in performance (ρ = 0.59, P < 2.2x10^-16).
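To illustrate how LASSO performs variable selection, the sketch below fits a LASSO regression by cyclic coordinate descent with soft-thresholding. It is a generic textbook formulation, assuming centred predictors and the objective (1/2n)·||y - Xb||^2 + penalty·||b||_1, and is not the software used for the analyses above (see Methods). All names in the code are hypothetical.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; values within [-penalty, penalty] become 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso_coordinate_descent(X, y, penalty, n_iter=200):
    """LASSO coefficients for rows X (centred columns assumed) and response y."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - X @ coef while coef is all zeros
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_scale[j] == 0.0:
                continue  # a constant (all-zero after centring) column carries no signal
            # Fit of column j to the residual with its own contribution added back.
            partial_fit = sum(
                X[i][j] * (residual[i] + X[i][j] * coef[j]) for i in range(n)
            ) / n
            new_coef = soft_threshold(partial_fit, penalty) / col_scale[j]
            delta = new_coef - coef[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                coef[j] = new_coef
    return coef
```

Predictors with weak support are set exactly to zero by the soft-threshold step. This is why a LASSO fit can retain only a handful of motifs, and why relaxing the penalty lets more variables enter the model.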
Figure 3.1 Genomic Landscape of Motif 3
Estimates of motif presence per 100 kb across chromosome arms 2L, 2R, 3L, 3R and the X chromosome. Motif 3
(M3) presence is assigned after applying a 1% FDR (see Methods).
Figure 3.2 Probability heatmap of correlation between
motif presence and crossover rates for different
chromosome arms.
Probability of Spearman’s ρ of motif presence and crossover rates
calculated genome-wide and for each chromosome arm separately based
on non-overlapping 100kb regions. Only correlations with P < 0.01 are
shown; regions in white are above this threshold. Motifs are ordered
based on Spearman’s ρ, from stronger (M5; ρ = 0.513, P = 7x10^-81) to
weaker (M10; ρ = 0.013, n.s.) in genome-wide analyses.
3.2.5 A predictive model of variation in crossover rates across the genome based on sequence
motif occurrence.
In order to investigate how accurate a model of crossover variation based on
variation in motif presence could be, we applied two machine learning methods. We first
used Random Forests (RF) (Breiman 2001; Lee et al. 2005; Banfield et al. 2007) as a form of
supervised learning to discriminate between regions (classes) of different crossover rates,
particularly between low and high rates. We later constructed a quantitative predictive
model using Multivariate Adaptive Regression Splines (MARS) (Friedman 1991; Friedman,
Roosen 1995; Hastie et al. 2009) (see Methods for details). MARS is an approach that splits
predictive variables into several intervals, and allows potential nonlinear relationships over
these different intervals with any degree of interaction between variables. Importantly,
MARS allows obtaining a final explicit and continuous model of crossover rates based on the
combined presence of multiple motifs.
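The building blocks of a MARS model are paired "hinge" functions that are zero on one side of a knot and linear on the other; products of hinges encode interactions between variables. The sketch below shows these basis functions and evaluates a small model. The knots and coefficients are hypothetical values chosen only for illustration and are not estimates from this study.

```python
def hinge(x, knot, direction=+1):
    """MARS basis function: max(0, x - knot) for +1, max(0, knot - x) for -1."""
    return max(0.0, direction * (x - knot))

def toy_mars_prediction(motif_a, motif_b):
    """Hypothetical model: two hinge terms on motif_a plus one interaction term."""
    return (
        1.0
        + 0.5 * hinge(motif_a, 3, +1)   # slope above the knot at 3 matches
        - 0.2 * hinge(motif_a, 3, -1)   # different slope below the knot
        + 0.1 * hinge(motif_a, 3, +1) * hinge(motif_b, 10, +1)  # interaction
    )
```

Because each term is piecewise linear, the fitted model remains an explicit, continuous function of motif counts while still allowing the effect of a motif to change across its range.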
3.2.6 Random Forests (RF) categorical modeling
We split 100-kb regions into ten approximately equally sized bins, each containing a
class from the lowest to highest 10% crossover rate, and applied RF to classify crossover
classes. The correctness of the RF models is measured by true positive rate (accuracy) and
the area under the curve (AUC) that indicates the ability of the model to discriminate
between the different classes, with AUC scores ranging from 0.5 (indicating that a model has
no discriminatory ability) to 1 (indicating that the model can discriminate perfectly among
classes). Note that RF does not directly generate probability values associated with the
whole model. We, therefore, obtained the statistical significance of RF models by comparing
the accuracy and AUC generated by the model and the accuracy and AUC generated by the
application of the same RF method when crossover classes are randomized among 100-kb
regions (250,000 randomizations per model).
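The randomization scheme described above amounts to a permutation test on a performance metric. A minimal sketch follows, in which `fit_and_score` is a hypothetical callable that trains a classifier and returns its accuracy (or AUC). A simple leave-one-out nearest-neighbour scorer stands in for Random Forests so that the example stays self-contained.

```python
import random

def permutation_p_value(features, labels, fit_and_score, n_perm=1000, seed=0):
    """Observed score and the fraction of label shuffles scoring at least as well."""
    rng = random.Random(seed)
    observed = fit_and_score(features, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # randomize classes among windows
        if fit_and_score(features, shuffled) >= observed:
            at_least_as_good += 1
    # Adding 1 to numerator and denominator keeps the estimate above zero.
    return observed, (at_least_as_good + 1) / (n_perm + 1)

def loo_nearest_neighbour_accuracy(features, labels):
    """Stand-in scorer: leave-one-out 1-nearest-neighbour accuracy on one feature."""
    correct = 0
    for i, x in enumerate(features):
        nearest = min(
            (j for j in range(len(features)) if j != i),
            key=lambda j: abs(features[j] - x),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(features)
```

With this estimator, a test based on 250,000 randomizations cannot return a P value below roughly 4x10^-6, which would be consistent with the bounds reported for the RF models, although the exact estimator used is an assumption here.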
Using ten crossover classes as our class variables, we first trained a random forests model
using the presence of 20 motifs to later add chromosome arm (see above) and transcription
data (Adrian, Comeron 2013). Genome-wide, the accuracy (true positive percentage) for
models with 20 motifs is 26.1% versus a random accuracy of ~10% (P < 4x10^-6), with mean
AUC = 0.658 (P < 4x10^-6). The model performs markedly better for the top and bottom
crossover classes (see below). Accuracy for the top and bottom 10% classes is 78.6% and 60.2%,
respectively (more than six-fold enriched; P < 4x10^-6 in both cases), exposing a large deviation from
the random expectations. Enrichment based on AUC shows an equivalent pattern, with AUC
= 0.892 and 0.876 for the top and bottom classes, respectively (P < 4x10^-6 in both cases).
The study of motif presence alone classifies regions within one step of
their true class 49% of the time among all recombination classes, indicating that even when
our model fails to accurately predict a class, its prediction often falls into an adjacent bin.
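The two accuracy measures used above, exact class match and match within one adjacent class, can be made concrete with a short sketch. It assigns windows to ten rank-based classes and scores a set of predictions; the function names are hypothetical, and the treatment of ties (by sort order) is an assumption.

```python
def decile_classes(rates, n_classes=10):
    """Assign each window a class 0..n_classes-1 by the rank of its crossover rate."""
    order = sorted(range(len(rates)), key=lambda i: rates[i])
    classes = [0] * len(rates)
    for rank, index in enumerate(order):
        classes[index] = rank * n_classes // len(rates)
    return classes

def accuracy_within(true_classes, predicted_classes, tolerance=0):
    """Fraction of windows predicted within `tolerance` classes of the truth."""
    hits = sum(
        abs(t - p) <= tolerance for t, p in zip(true_classes, predicted_classes)
    )
    return hits / len(true_classes)
```

Setting `tolerance=0` gives the exact-match accuracy, and `tolerance=1` gives the "within one step" measure.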
Including information on the chromosome arm and the genomic distribution of
transcribed genes in early meiosis generates a model that increases the accuracy of the RF
model to 27.8% and generates a mean AUC of 0.725 (P < 4x10^-6 in both cases). Mean
accuracy for the top class increases to 83.3% (AUC = 0.947; P < 6.6x10^-6) with a 7% false
positive rate, which demonstrates the high predictive power of motif presence for these regions
(Figure 3.3). The addition of other genomic properties, such as the number of annotated genes
or proportion of exonic sequence in each window, does not significantly affect model
accuracy (data not shown) and these variables are never within the top ten variables in
terms of importance within the model (positions 18 and 21, respectively) ranked by
information gain. Similarly, GC content per window is neither ranked highly by information
gain criteria nor does it have a substantial impact on classification accuracy. Taken together,
the results indicate a significant influence of motif presence on crossover rate variation and
support a role for active transcription during early meiosis in crossover localization.
3.2.7 MARS modeling
We evaluated the quality of MARS models focusing on R2GCV scores (the MARS
estimate of how well this model would perform on new data) based on the conservative 10-fold
cross-validation approach (see Methods for details). The simplest model that considers
the variable presence of motifs across the genome is able to explain ~45% (R2GCV = 0.447) of
the genome-wide variation in crossover distribution (Figure 3.4A and 3.4B), and identifies
six motifs with significant contribution to the model (M7, M3, M5, M14, M18 and M6). This
predictive model improves even further when including information on transcription data
(R2GCV = 0.47) or chromosome arm (R2GCV = 0.57). The most complete model that includes
motif presence, transcription and chromosome arms explains more than 59% of the
variation in crossover genome-wide. This complex model recognizes chromosome arm,
transcription data and 8 motifs as significantly important within the model. Many of these
predictive variables exhibit clear non-linear effects on crossover rates, including
transcription data and most motifs (e.g. M2, M5, M7 and M18; Figure 3.4C). Table 3.2S
shows additional results based on less conservative MARS validation methods. Analyses
based on population estimates of crossover rates using the ZI population instead of the RG
population generate similar results (Table 3.2S).
Notably, the motifs and their ranking by order of importance differ between
models investigating only motif presence and models including transcription and
chromosome arm information. In particular, motifs such as M17, M12 and M2 become part
of the model when interactions with chromosome arms are allowed, in agreement with the
observation that these motifs showed positive association with crossover rates in some but
not all chromosomal arms (Figure 3.2). Combined, these results expose that crossover
variation across the genome is significantly influenced by the presence of multiple motifs
and by interactions between motif utilization, chromosome arm, and transcription activity
in early meiosis.
In D. melanogaster, crossover frequency is severely reduced near centromeres and,
to a lesser degree, near telomeres—the so-called “centromere effect” (Beadle 1932). We,
therefore, investigated the possible influence of motif distribution on crossover rates after
excluding these sub-centromeric and telomeric regions from the study. MARS generates
models with R2GCV that range between 0.405 (only motif presence) and 0.539 (motifs +
chromosome arms + transcription activity in early meiosis). These results show that
variation in motif distribution maintains significant power explaining variation in crossover
rates across genomic regions considered of average or high recombination.
Finally, we investigated the predictive power of motif presence along individual
chromosomal arms to eliminate any possible effects associated with differences in motif
presence (see above) and total crossover rates among arms. MARS analyses show (Figure
3.4A) a weaker but still high predictive power describing variation in crossover rates based
on motif presence alone (R2GCV ranging between 0.16 and 0.36); when transcription data is
also included, the predictive power is increased (R2GCV ranging between 0.19 and 0.46).
Figure 3.3 True-positive rate generated by a Random Forests (RF) model
True positive rate (accuracy) is given for 10 crossover classes, from class A (indicating the class with lowest
10% crossover rate) to class J (indicating the class with highest 10% crossover rate). Random accuracy
(uninformative model) per class is 10% (horizontal dashed line). Bar colors indicate significance of departing
random expectations assessed by randomized simulations. The model tested utilized all motifs, chromosome
arm, and transcription estimates to predict recombination classes (see Methods for details).
Figure 3.4 MARS predictive models of crossover rates.
A) Estimates of the predictive power of the
MARS models (R2GCV) using only motif
presence or when the models include
transcription data and/or chromosome arms
as predictive variables. Results based on
genome-wide analyses (red) and after
removing sub-telomeric and –centromeric
regions (trimmed genome) (blue). B)
Relationship between observed crossover
rates (see Methods) and predicted crossover
rates based on MARS with a model that
includes variation in motif presence genome-wide. C) Examples of MARS predictions of the
influence of motif presence on crossover rates.
3.3 Conclusions
To obtain FDR-corrected landscapes of motif presence across the genome we used a
flexible approach based on the likelihood that each k-mer sequence of nucleotides
corresponds to a given PFM, and then calculated and applied an arbitrary FDR. The use of an FDR
instead of a direct arbitrary probability is necessary to limit the extent of false positives and
this parameter should be tuned appropriately for each investigation. We set FDR at 1%
because it generates a large number of motif hits while still producing PFMs equivalent to
those obtained under the most stringent FDR of 0.1%, thus suggesting that false positives
do not seriously alter motif detection. We also confirmed that a single PFM per motif is
adequate across the whole genome by comparing the top 50 and bottom 50 recombination-correlated
regions by residual (which followed an approximately normal distribution). We
found no significant differences between PFMs recovered from these two classes of regions
indicating that a single PFM per motif can be used at genome-wide scale and that motif
count, rather than motif sequence or quality, is a more important predictor of crossover in
our data.
3.3.1 Motif Composition
Several of the motifs with strongest association with crossover rates or in terms of
importance within RF and MARS models share an enrichment of poly-A/T or [A/T]n tracts.
It has been well established in gel-mobility and X-ray crystallographic studies that repeated
instances of A/T tracts can produce a bend in the DNA helix axis, with the in-phase
repetition of these elements contributing to larger overall bends (Travers 1990; Dlakic,
Harrington 1998; Hizver et al. 2001). Furthermore, there is evidence to suggest that
polyA/polyT bending tracts serve important functions, altering protein-DNA binding
specificities and regulating transcription (Travers 1990; Perez-Martin, de Lorenzo 1997).
AA, TT, AT dinucleotides are also a factor in nucleosomal positioning (Segal et al. 2006),
which could have a direct impact on chromatin accessibility, consistent with hypotheses
where such accessibility is required for DSB induction. In fact, a poly-A tract is at the center
of a proposed common (CoHR) motif of DSBs in yeast (Blumental-Perry et al. 2000).
Moreover, the association between A/T rich motifs and crossovers is a motif-specific and
not a large-scale nucleotide composition phenomenon since there is only a nominal effect of
overall A+T content on the distribution of crossover rates at 100-kb scale (ρ = -0.057, P =
0.006) that becomes non-significant when analyzing noncoding sites only [P > 0.25;
(Comeron, Ratnappan, Bailin 2012)]. The strong signal associated with [CA]n motifs across
the D. melanogaster genome is also interesting because these motifs are markers for Z-DNA
formation (Herbert, Rich 1996) and have been shown to be significantly enriched near
hotspots of recombination in S. pombe and S. cerevisiae (Treco, Arnheim 1986; Fowler et al.
2014). Our data, therefore, support the concept that the secondary and tertiary DNA
structures are relevant to meiotic DSB fine-scale localization, acting at a very local scale
within larger genomic or chromatin features.
Intriguingly, motif M6 shows a strong influence on crossover rates within MARS
models that analyze motif presence (either genome-wide or after removing sub-telomeric
and –centromeric regions). M6 [MCO4 in (Comeron, Ratnappan, Bailin 2012)] includes the
7-mer CCTCCCT sequence first associated with hotspot determination in humans and is the
core sequence of the longer 13-mer motif recognized by the zinc finger protein PRDM9
(Jeffreys, Neumann 2002; Jeffreys, Neumann 2005; Myers et al. 2005; Myers et al. 2008;
Baudat et al. 2010). Moreover, this core CCTCCCT motif is not only observed in D.
melanogaster (this study) but was previously observed in regions with high crossover rates
in D. pseudoobscura (Kulathinal et al. 2008). Because there is no PRDM9-homolog in
Drosophila species (Oliver et al. 2009; Heil, Noor 2012) and the complete PRDM9-associated
13-mer motif is not observed near crossovers (Comeron, Ratnappan, Bailin 2012; Heil, Noor
2012), the observation that a motif including its core sequence CCTCCCT may play a
detectable role describing crossover localization is unexpected and could indicate
unappreciated commonalities in crossover-associated motifs between mammals and insects.
3.3.2 Differences among Chromosome Arms
Our study has exposed not only that the presence (per 100kb) of many motifs varies
among chromosome arms but also that the association between motif presence and
crossover rates differs depending on chromosome arm. MARS analyses reveal detectable
interactions between chromosome arm and motif presence when describing crossover
rates and, in agreement, the order of motif importance within MARS models varies when
chromosome arm is included as predictor. Notably, there is no simple split in the model
between autosomes vs. X chromosome. Our results imply that different large-scale genomic
regions or whole chromosome arms may utilize different combinations of DNA motifs as
localizing factors for crossover formation (and potentially DSBs). Outside of differential
rates of evolution (Meisel, Connallon 2013; Llopart 2015) or average rates of crossover
among arms, it is hard to conceptualize how large chromosomal domains would differ
from one another in terms of motif utilization for DSB formation and resolution. At this time,
we can only hypothesize that such a potential mechanistic link could be associated with
spatio-temporal movement and nuclear localization of the chromosome arms during early
meiosis (Parvinen, Soderstrom 1976; Koszul et al. 2008; Shibuya, Ishiguro, Watanabe
2014). At a more practical level our results also indicate that analyses of motif occurrence
based on a single large genomic region or chromosome may generate an informative view
for that region or chromosome, but one that may not be applicable genome-wide, thus
explaining differences in motif detection among studies.
3.3.3 Crossover Localization across the Genome
Our study supports the concept that multiple motifs play a detectable role in
crossover localization in Drosophila. Although high-resolution maps in a number of other
species (including S. cerevisiae, S. pombe, Plasmodium falciparum or Apis mellifera) have
exposed a similar scenario with multiple motifs significantly enriched near crossovers
(Gerton et al. 2000; Cromie et al. 2007; Steiner et al. 2009; Jiang et al. 2011; Bessoltane et al.
2012; Liu et al. 2015), we show that variation in the relative presence of these motifs is in
fact predictive and explains a significant fraction of the observed variation in crossover
distribution across the genome. While the effect of each motif is small, the predictive power
of all models increases with additional motif variables, and we have found non-linear
influences of motif presence in crossover localization in Drosophila. We also show that
variables describing transcriptional activity during early meiosis and, more unexpectedly,
chromosome arm add to the predictive power of the models.
Standard multiple linear and the more complex LASSO linear regression analyses
suggest that 22 and 35%, respectively, of the observed variance in crossover rates across
the genome could be explained by motif presence. However, the application of more
advanced machine learning techniques such as RF and MARS allows the study of multiple
variables without the constraints of linear models. Our study shows that RF modeling
allows identifying regions with high and low crossover rates with high accuracy,
particularly those regions with the highest and lowest 10% crossover rates where the
accuracy of the model is at least six-fold greater than random expectation.
MARS, on the other hand, generates quantitative predictions while allowing multiple
interactions among motifs and nonlinear effects when describing crossover rates, including
saturation/insensitivity above/below certain ranges. Indeed, MARS exposes significant
interactions among motifs and generates models able to account for more than 40% of the
variance in crossover rates across the D. melanogaster genome based on R2GCV estimates.
MARS also identified important motifs with effects on crossover localization that showed
secondary effects when analyzed using simpler approaches, with potentially large non-linear
and combined effects. Note that MARS estimates of R2GCV based on less conservative
methods (Table 3.2S) would imply a much higher influence of motif presence on variance in
crossover distribution but caution should be applied to the interpretation of these high R2
estimates (instead of those based on the more conservative 10-fold cross validation) due to
potential overfitting. Our MARS results based on R2GCV suggest that motif presence in
Drosophila can explain genome-wide patterns of crossover even more than in species such as
humans, where patterns of PRDM9 can explain ~18% of the variation in recombination
hotspots (Baudat et al. 2010).
Based on the knowledge obtained from this study and others, a general model is
emerging where crossover distribution is determined by a combination of factors acting at
different physical scales (Petes 2001; Kleckner 2006; Pan, Keeney 2007; Pan et al. 2011;
Adrian, Comeron 2013; Borde, de Massy 2013). In D. melanogaster, the centromere effect
describes variation in crossover distribution at the largest scale (hundreds of kb), with a
severe reduction in crossover rates at sub-centromeric/telomeric regions (Beadle 1932).
Our study shows that motif presence plays a significant role describing the observed
heterogeneity in crossover distribution even after removing regions near centromeres and
telomeres, and MARS models perform significantly worse when using the physical position
along a chromosome arm relative to the centromere (with 0 and 1 indicating the sub-centromeric
and –telomeric positions, respectively) as the sole predictor of crossover
variation (R2GCV = 0.046). Moreover, we observe that regions proximal to telomeric and
centromeric regions have fewer motifs positively associated with crossovers (Figure 3.1).
Because crossovers near centromeres increase the probability of non-disjunction events at
the second meiotic division (Koehler et al. 1996) it is tempting to speculate that natural
selection may have played a role in the observed paucity of recombinogenic motifs in such
genomic regions. Combined, our results indicate that the centromere-effect observed today
in D. melanogaster may then be the consequence of both direct mechanistic explanations as
well as long-term evolutionary forces that have reduced the presence of crossover-associated
motifs in sub-centromeric regions.
Beyond the large-scale centromere-effect and the presence of specific motifs, there
is evidence that transcriptionally active genomic regions are enriched for crossovers
(Adrian, Comeron 2013; Aymard et al. 2014) and that many DSB enriched regions are
nuclease sensitive and influenced by nucleosome dynamics (Ohta 1994; Wu, Lichten 1994;
Fan, Petes 1996; Aymard et al. 2014) and/or DNA secondary structures. Other sources of
data show that recombining DNA sequences map to chromatin loops that are later tethered
to the underlying synaptonemal complex and ultimately targeted by the evolutionarily
conserved Spo11 protein (MEI-W68 in Drosophila) (Blat et al. 2002; Moens et al. 2002;
Buhler, Borde, Lichten 2007). The relationship among transcription activity, nucleosome
dynamics, DNA secondary structures and chromatin loop size and distribution in early
meiosis is not yet known. Our results show that the multifactorial effect of motif presence is
an important layer of information likely acting at a very local scale within much larger
chromatin domains. This scenario would explain the abundance of recombination-associated
motifs throughout many organisms, including the PRDM9-associated motif that
has been shown to be important in many mammalian systems; it further recognizes that
motifs, transcription and chromatin structures must work together to initiate
recombination. Some of the variation in crossover distribution is still unexplained and we
propose that models of crossover localization would benefit from linking genetic and
mechanistic explanations to the observed heritable but epigenetic variability across
genomes. Analyses investigating the influence of environmental effects on transcription,
chromatin accessibility and crossover distribution using controlled crosses may provide key
information to develop such more precise models.
3.4 Methods
3.4.1 Motif Landscape Generation
Studies designed to assess the potential explanatory power of motif presence
describing recombination events at genome-wide level first require the generation of motif
landscapes, a bioinformatic exercise that is not trivial. The difficulties in generating such
landscapes include motif detection algorithms able to recognize probabilistic models of
motif sequences instead of consensus ‘words’ (Schneider 2002; Hu, Li, Kihara 2005;
D'haeseleer 2006; Das, Dai 2007; Hartmann et al. 2013), the inclusion of background
nucleotide and dinucleotide composition (Simcha, Price, Geman 2012), and the numerous
false positives expected in any large-scale genomic study. In order to form motif frequency
estimates across chromosomes, we developed a suite of custom python scripts designed to
take an input of position frequency matrix, PFM, generated by MEME (Bailey et al. 2009),
apply a sliding-window approach to estimate the likelihood that each stretch of DNA
contains the MEME-identified motif, and finally apply an FDR to classify a sequence
as a motif. We used the top 20 MEME PFMs by E-value generated in
(Comeron, Ratnappan, Bailin 2012), which identified motifs enriched in sequences
containing a crossover event within 500bp.
In more detail, motifs were first identified using a sliding window approach as
strings of k-mer nucleotides matching the input PFM at an initial probability (Li, at genomic
position i) greater than the arbitrary threshold of 1x10^-30 (allowing approximately three
complete mismatches). We then generated a null distribution of Li (RLi) from randomly
shuffled genome sequence. Shuffling was carried out both by randomly shuffling each
genomic or chromosomal nucleotide individually and shuffling in pairs to preserve
dinucleotide structure, obtaining a more realistic null approximation (Ding, Lorenz, Chuang
2012). From this list of RLi probabilities, we generated a null distribution of likelihoods that
allows estimating the highest values at any desired FDR (e.g., FDR of 1% would allow 1% of
RLi > LFDR-0.01). We chose to utilize the 1% FDR-corrected estimates because those estimates
yielded realistic approximations of frequency, low noise, and maximal correlation with rates
of crossing over. Stretches of k-mer bases with Li > LFDR-0.01 were then scored as present
motifs. Note that the set of FDR-corrected motifs generates PFMs that are similar but not
an exact match to the initial set of seed PFMs, which is not unexpected due to the limited
number and genomic distribution of the original set of sequences analyzed and the
influence of background nucleotide composition (see Figure S1). Additionally, when
comparing motif landscapes to experimentally measured recombination rates (Comeron,
Ratnappan, Bailin 2012), we masked those regions that were used in generating the motifs,
though the difference between masked and unmasked genomes was small, and the data sets
generated nearly identical results.
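The scanning and thresholding steps described in this section can be summarized with the following minimal Python sketch. It is not the custom script suite used in this study: all function names are hypothetical, the PFM is represented as a list of per-position base probabilities, the null distribution comes from a single shuffle of the same sequence, and the initial 1x10^-30 pre-filter is omitted. Shuffling non-overlapping two-base blocks is only one simple reading of shuffling "in pairs". A real PFM with zero-probability entries would also need pseudocounts, which this sketch does not add.

```python
import random


def kmer_likelihood(pfm, kmer):
    """Probability of a k-mer under a position frequency matrix.

    `pfm` is a list of dicts, one per motif position, mapping base -> probability.
    """
    likelihood = 1.0
    for column, base in zip(pfm, kmer):
        likelihood *= column[base]
    return likelihood


def scan(pfm, sequence):
    """Likelihood L_i of the k-mer starting at each position i (sliding window)."""
    k = len(pfm)
    return [kmer_likelihood(pfm, sequence[i:i + k])
            for i in range(len(sequence) - k + 1)]


def shuffle_in_pairs(sequence, rng):
    """Shuffle non-overlapping two-base blocks of the sequence."""
    pairs = [sequence[i:i + 2] for i in range(0, len(sequence), 2)]
    rng.shuffle(pairs)
    return "".join(pairs)


def fdr_threshold(null_likelihoods, fdr=0.01):
    """Likelihood exceeded by at most a fraction `fdr` of the null values."""
    ranked = sorted(null_likelihoods, reverse=True)
    allowed = int(fdr * len(ranked))  # null values allowed above the threshold
    return ranked[allowed]


def call_motifs(pfm, sequence, fdr=0.01, seed=0):
    """Start positions whose likelihood exceeds the null-derived threshold."""
    rng = random.Random(seed)
    null = scan(pfm, shuffle_in_pairs(sequence, rng))
    threshold = fdr_threshold(null, fdr)
    return [i for i, li in enumerate(scan(pfm, sequence)) if li > threshold]
```

Lowering `fdr` raises the threshold and therefore returns fewer, higher-confidence positions, which is the trade-off discussed above when choosing between 1% and 0.1%.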
Using positional information of each motif location, we generated sliding-window
estimates of FDR-corrected motif presence, or ‘motif landscapes’, for window sizes of 100 kb
across the genome. Unless noted, analyses were carried out using non-overlapping
windows across the whole genome. To remove sub-centromeric and –telomeric regions
with strongly reduced crossover rates we followed (Comeron 2014): sub-centromeric
regions were assigned by starting at the centromere and moving into the chromosome arm
until a minimum of 3 consecutive 100-kb windows showed crossover rates >1 cM/Mb, and
sub-telomeric regions were assigned in an equivalent manner. Chromosome positions and
gene annotations were based on the D. melanogaster dm3 assembly and annotation release
5.47 (http://flybase.org/). Sequence logos were generated using the R package seqLogo
(Bembom 2007). Unless noted, landscape graphics and statistical analyses were conducted
in the R programming language.
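As an illustration of the windowing and trimming steps above, the following Python sketch counts motif positions in non-overlapping windows and locates the boundary of a low-recombination region. The function names are hypothetical, positions are assumed to be 0-based, and the boundary rule reflects our reading of the criterion stated above (the region ends where the first run of 3 consecutive windows with rates >1 cM/Mb begins); the procedure in (Comeron 2014) may differ in detail.

```python
def window_counts(positions, chrom_length, window=100_000):
    """Count motif start positions in non-overlapping windows along one arm."""
    n_windows = (chrom_length + window - 1) // window
    counts = [0] * n_windows
    for pos in positions:
        counts[pos // window] += 1
    return counts


def first_recombining_window(rates, min_run=3, min_rate=1.0):
    """Index where the first run of `min_run` consecutive windows with
    crossover rate > `min_rate` (cM/Mb) begins.

    `rates` must be ordered from the centromere (or telomere) into the arm.
    Windows before the returned index are treated as the trimmed region.
    Returns len(rates) if no such run exists.
    """
    run = 0
    for i, rate in enumerate(rates):
        run = run + 1 if rate > min_rate else 0
        if run == min_run:
            return i - min_run + 1
    return len(rates)
```

Applying `first_recombining_window` to the reversed list of rates gives the equivalent boundary from the opposite end of the arm.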
3.4.2 Model Generation and Attribute Selection
Multiple linear, isotonic, and LASSO regression models were created in R or Java
programming environments using the isoreg and glmnet packages and the WEKA v3.6.1
software package [(Hall et al. 2009); http://www.cs.waikato.ac.nz/ml/weka/]. For these
models, all motif variables were included in the analysis unless otherwise removed by the
model. LASSO (Least Absolute Shrinkage and Selection Operator) (Tibshirani 1996; Hastie
et al. 2009) is a data mining technique that favors solutions with fewer parameter values
under a linear model, simultaneously performing variable selection and simplifying model
interpretation. LASSO tends to produce coefficients that are small or zero and attribute
selection is supplied by the attributes with nonzero coefficients. The intensity of
regularization (or shrinkage) within LASSO is controlled by the regularization/shrinkage
parameter (λ). Unless noted, we used WEKA software (v.3.6) to apply LASSO and used a λ
that minimizes the cross-validated mean squared error plus 0.5 standard error.
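One plausible reading of this selection rule, by analogy with the common "one standard error" rule, is to take the largest λ whose cross-validated error lies within 0.5 standard errors of the minimum. The Python sketch below illustrates that reading only; the function name and inputs are hypothetical, and it is not the WEKA routine itself.

```python
def select_lambda(lambdas, cv_mse, cv_se, se_multiple=0.5):
    """Largest lambda whose CV error is within `se_multiple` standard
    errors of the minimum CV error (assumed reading of the rule)."""
    best = min(range(len(lambdas)), key=lambda i: cv_mse[i])
    limit = cv_mse[best] + se_multiple * cv_se[best]
    eligible = [lam for lam, mse in zip(lambdas, cv_mse) if mse <= limit]
    return max(eligible)
```

Choosing the largest eligible λ favors the sparsest model that is statistically indistinguishable from the best one, in line with the variable-selection role of LASSO described above.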
We also utilized the WEKA implementation of Random Forests (RF) for
classification. RF is a nonparametric approach useful for detecting associations when there
are large numbers of predictor variables with the possibility that each variable has
relatively weak effects (Breiman 2001; Banfield et al. 2007). Briefly, RF classification
constructs a collection of many independent decision trees, sampling both the data and
attributes randomly with replacement. The remaining, unused data is classified using the
collection of trees, with the classification of each item being based upon the result mode of
the RF. Here, we generated 1000 trees of unrestricted depth with Log2(Attribute Number)
+1 random attributes in each individual tree. When calculating probabilities for RF
estimates, our simulations were limited to 250 trees in order to speed computation at the
expense of increased variance, and therefore represent conservative estimates. We
generated a Random Forests model with an increasing variable number to determine the
effect of additional motifs on the model. Each model generated was evaluated using 10-fold
cross validation and tested versus a ZeroR null model which classifies all instances solely
based on the majority (mode) class. In all cases, the Random Forest model performed
significantly better at classification (two-tailed t-test, P < 0.05) than the null model unless
otherwise noted. In order to select the best features for use in model generation, we ranked
all features by the information gain criterion implemented in WEKA. Information gain is the
measure of the contribution of a particular feature to the model. After ranking, we extracted
the top features. When referring to a number of variables, we include only that number of
topmost ranked attributes.
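For reference, the information gain criterion and the number of random attributes per tree can be written out as a short Python sketch. This is a textbook formulation with hypothetical function names, not the WEKA code; it assumes a discrete feature (continuous motif counts would first need to be binned) and assumes that Log2(Attribute Number) + 1 is truncated to an integer.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in class entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def attributes_per_tree(n_attributes):
    """log2(number of attributes) + 1, truncated to an integer (assumption)."""
    return int(math.log2(n_attributes)) + 1
```

A feature that separates the classes perfectly has a gain equal to the class entropy, while a feature unrelated to the classes has a gain of zero, which is what makes the criterion usable for ranking.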
We applied multivariate adaptive regression splines (MARS) (Friedman 1991;
Friedman, Roosen 1995; Hastie et al. 2009) using the software suite Salford Predictive
Modeler (v.7) from Salford Systems (http://www.salford-systems.com). MARS is a form of
regression analysis that splits predictive variables into several intervals, allows potential
nonlinear relationships over different intervals (basis functions) and combines individual
models as a final quantitative and predictive model. Importantly, MARS also allows for any
degree of interaction between variables. The quality of MARS models can be ascertained
using the generalized cross-validation (GCV) criterion (Craven, Wahba 1979), with GCV
evaluating the fit of the model penalizing complexity and the optimal model being the one
with the lowest GCV score. The measure R2GCV is the MARS estimate of how well this model
would perform on new data. Alternative estimates of model quality, such as the standard R2, only
consider the fit of the model to the data, can be subject to substantial overfitting and
are not reported (we obtained R2 > R2GCV in all cases). Because the number of data points is
not very large (n=1,191 independent 100-kb regions), we avoided partitioning the data into
training and test samples and, instead, we applied cross-validation and the MARS legacy
modes to estimate the optimal model and its performance based on the GCV criterion
(Friedman 1991). Because under the legacy mode MARS builds a sequence of models using
all available data (and could therefore overestimate the performance of the model), we
show MARS results based on 10-fold cross-validation to train and test the models unless
specifically noted (in all cases, a 10-fold validation mode generates smaller R2GCV than when
a legacy mode is applied). Finally, variable importance under MARS was measured by the
Gini index (Breiman et al. 1984) and it is shown in terms of relative importance
(percentage) of variables as compared to the best one.
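The building blocks of MARS described above can be illustrated with a minimal Python sketch of a hinge basis function and the GCV criterion. The function names are hypothetical. The GCV expression follows the standard form (mean squared error divided by (1 - C/n)^2, where C is the effective number of parameters), and R2GCV is taken here as 1 - GCV(model)/GCV(intercept-only model), a common definition that we assume, without having verified it, matches the one reported by the software used in this study.

```python
def hinge(x, knot, direction=1):
    """MARS basis function: max(0, x - knot) for direction +1,
    max(0, knot - x) for direction -1."""
    return max(0.0, direction * (x - knot))


def gcv(residuals, effective_params):
    """Generalized cross-validation score: mean squared error inflated
    by a penalty that grows with model complexity."""
    n = len(residuals)
    mse = sum(r * r for r in residuals) / n
    return mse / (1.0 - effective_params / n) ** 2


def r2_gcv(gcv_model, gcv_null):
    """1 - GCV(model) / GCV(intercept-only model)."""
    return 1.0 - gcv_model / gcv_null
```

Because a hinge is zero on one side of its knot, a motif can contribute to predicted crossover rate only above (or below) a given count, which is how MARS represents the saturation and insensitivity effects mentioned in the text.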
3.4.3 Population-scaled high-resolution crossover maps
We calculated the population-scaled recombination rate for two African populations of D.
melanogaster using the program LDhelmet (Chan, Jenkins, Song 2012). LDhelmet is a
statistical method that allows estimating fine-scale recombination rates across genomes
based on patterns of linkage disequilibrium, where the parameter estimated is the
population-scaled crossover rate per bp and generation (ρLD); ρLD = 2 Ne r, where Ne is the
effective population size of the population and r is the rate of crossover per bp and
generation in females. Note therefore that estimates of ρLD by LDhelmet represent historic
estimates of crossover for the population or species under analysis.
We analyzed the Rwanda (RG) and Zambia (ZI) populations because both are from
the sub-Saharan ancestral range of D. melanogaster, which minimizes the non-equilibrium
effects caused by recent expansion observed in western Africa and non-African D.
melanogaster populations and show low levels of admixture (Pool et al. 2012; Lack et al.
2015). Moreover, RG and ZI represent a relatively large sample of strains with no
chromosomal inversions (Pool et al. 2012; Lack et al. 2015). We obtained the genomic
sequences from the Drosophila Genome Nexus (Lack et al. 2015) and analyzed only strains
with no evidence of chromosomal inversions in any chromosomal arm. In total our analysis
included 19 RG sequences (RG10, RG13N, RG15, RG19, RG22, RG24, RG28, RG2, RG32N,
RG33, RG34, RG35, RG38N, RG39, RG4N, RG6N, RG7, RG8) and 20 ZI sequences (ZI184,
ZI250, ZI252, ZI271, ZI311N, ZI320, ZI324, ZI332, ZI344, ZI378, ZI386, ZI398, ZI402,
ZI418N, ZI420, ZI455N, ZI457, ZI477, ZI517, ZI85). Unless noted, we used results from
analyzing the RG population because it combines a relatively large sample of strains with no
chromosomal inversions and the lowest and best characterized levels of admixture (Pool et al. 2012),
thus allowing masking admixture regions. Following (Chan, Jenkins, Song 2012) we applied
a block penalty of 50 and an effective mutation rate of 0.006 per base pair. The data were
divided into blocks of 1000 SNPs, with 200 SNPs of overlap. For each block we ran LDhelmet for
500,000 iterations after 100,000 iterations of burn-in. Recombination maps for each
chromosomal arm were analyzed as non-overlapping adjacent windows of 100 Kb. At
100-kb scale, RG and ZI show highly correlated, albeit different, population-based
crossover maps (Spearman’s ρ = 0.76; P < 1x10^-16).
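The division of the data into overlapping SNP blocks can be sketched as follows. The function name is hypothetical, indices are 0-based with exclusive ends, and the sketch covers only the partitioning; how estimates from overlapping portions were combined is not described here and is not modeled.

```python
def snp_blocks(n_snps, block_size=1000, overlap=200):
    """Start and end (exclusive) SNP indices of consecutive blocks that
    share `overlap` SNPs with the preceding block."""
    step = block_size - overlap
    blocks = []
    start = 0
    while True:
        end = min(start + block_size, n_snps)
        blocks.append((start, end))
        if end == n_snps:
            break
        start += step
    return blocks
```

For a chromosome arm with 2,600 SNPs, this yields three blocks starting at SNPs 0, 800 and 1,600, each sharing 200 SNPs with its neighbor.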
3.5 Supplementary Information
Table 3.1S Statistics of motif presence in D. melanogaster.
Table 3.2S Summary of MARS models of crossover distribution in D. melanogaster.
Figure 3.5S Effect of false discovery thresholds on motif presence and sequence.
A) Estimates of motif 1 (M1) presence across chromosome arm 2R. Purple line indicates presence of the motif
(P < 0.05) with no FDR correction. Green and red lines indicate motif presence after applying 5% and 1% FDR
correction. Blue line indicates motif presence based on match to strict consensus sequence and no FDR
correction. B‐E) Motif logos formed from the collected motif matches from each above estimates of motif
presence.
Figure 3.6S Motif logos and individual Spearman’s correlation with crossover rates.
Motif logos numbered by descending E-value of seed PFMs in (Comeron, et al. 2012).
Motif logos based on the position frequency matrix of motifs assigned as present at a 1% FDR. Logos are
presented with positions (y-axes) weighted by information content (x-axes) per site. Spearman’s ρ of motif
presence and crossover rates based on non-overlapping 100kb regions across the whole genome. RG and ZI
indicate results using high-resolution population-estimates of crossover rates from the Rwanda (RG) and
Zambia (ZI) populations, respectively (see Methods for details). Exp.Rec. indicates results based on
experimentally measured crossover rates (Comeron, et al. 2012) after removing the sequences used to identify
crossover-enriched motifs.
Figure 3.7S Correlation between motif presence and crossover rates for different
chromosome arms.
Spearman’s ρ of motif presence and crossover rates for each chromosome arm separately based on
non-overlapping 100kb regions. Only correlations with P < 0.05 are shown.
Figure 3.8S LASSO coefficient paths.
Coefficient paths for a LASSO model across lambda (regularization) values (x‐axis). Motifs in figure legend are
shown in the order they enter the model (M3 first, M5 second, etc.)
3.5 Chapter 3 References
Adrian, A, J Comeron. 2013. The Drosophila early ovarian transcriptome provides
insight to the molecular causes of recombination rate variation across genomes.
BMC Genomics 14.
Aymard, F, B Bugler, CK Schmidt, et al. 2014. Transcriptionally active chromatin
recruits homologous recombination at DNA double-strand breaks. Nat Struct Mol
Biol 21:366-374.
Bailey, TL, M Boden, FA Buske, M Frith, CE Grant, L Clementi, J Ren, WW Li, WS
Noble. 2009. MEME SUITE: tools for motif discovery and searching. Nucleic Acids
Res 37:W202-208.
Banfield, RE, LO Hall, KW Bowyer, WP Kegelmeyer. 2007. A comparison of decision
tree ensemble creation techniques. IEEE Trans Pattern Anal Mach Intell 29:173-180.
Baudat, F, J Buard, C Grey, A Fledel-Alon, C Ober, M Przeworski, G Coop, B de Massy.
2010. PRDM9 Is a Major Determinant of Meiotic Recombination Hotspots in Humans
and Mice. Science 327:836-840.
Beadle, GW. 1932. A possible influence of the spindle fibre on crossing-over in
Drosophila. Proc Natl Acad Sci U S A 18:160-165.
Bembom, O. 2007. seqLogo: An R package for plotting DNA sequence logos. R
package version 1.30.0.
Bessoltane, N, C Toffano-Nioche, M Solignac, F Mougel. 2012. Fine scale analysis of
crossover and non-crossover and detection of recombination sequence motifs in the
honeybee (Apis mellifera). PLoS ONE 7:e36229.
Blat, Y, RU Protacio, N Hunter, N Kleckner. 2002. Physical and functional
interactions among basic chromosome organizational features govern early steps of
meiotic chiasma formation. Cell 111:791-802.
Blumental-Perry, A, D Zenvirth, S Klein, I Onn, G Simchen. 2000. DNA motif
associated with meiotic double-strand break regions in Saccharomyces cerevisiae.
EMBO Rep 1:232-238.
Borde, V, B de Massy. 2013. Programmed induction of DNA double strand breaks
during meiosis: setting up communication between DNA and the chromosome
structure. Curr Opin Genet Dev 23:147-155.
Breiman, L. 2001. Random Forests. Machine Learning 45:5-32.
Breiman, L, J Friedman, R Olshen, C Stone. 1984. Classification and regression trees.
Boca Raton: CRC Press.
Brooks, LD. 1988. The evolution of recombination rates. In: RE Michod, BR Levin,
editors. The evolution of sex. Sunderland, MA: Sinauer Associates. p. 87-105.
Buhler, C, V Borde, M Lichten. 2007. Mapping meiotic single-strand DNA reveals a
new landscape of DNA double-strand breaks in Saccharomyces cerevisiae. PLoS Biol
5:e324.
Chan, AH, PA Jenkins, YS Song. 2012. Genome-wide fine-scale recombination rate
variation in Drosophila melanogaster. PLoS Genet 8:e1003090.
Comeron, JM. 2014. Background selection as baseline for nucleotide variation across
the Drosophila genome. PLoS Genet 10:e1004434.
Comeron, JM, R Ratnappan, S Bailin. 2012. The many landscapes of recombination in
Drosophila melanogaster. PLoS Genet 8:e1002905.
Coop, G, X Wen, C Ober, JK Pritchard, M Przeworski. 2008. High-resolution mapping
of crossovers reveals extensive variation in fine-scale recombination patterns
among humans. Science 319:1395-1398.
Craven, P, G Wahba. 1979. Smoothing noisy data with spline functions. Numerische
Mathematik 31:377-403.
Cromie, GA, RW Hyppa, HP Cam, JA Farah, SI Grewal, GR Smith. 2007. A discrete
class of intergenic DNA dictates meiotic DNA break hotspots in fission yeast. PLoS
Genet 3:e141.
D'haeseleer, P. 2006. How does DNA sequence motif discovery work? Nature
Biotechnology 24.
Das, M, H-K Dai. 2007. A survey of DNA motif finding algorithms. BMC
Bioinformatics 8:S21.
Ding, Y, W Lorenz, J Chuang. 2012. CodingMotif: exact determination of
overrepresented nucleotide motifs in coding sequences. BMC Bioinformatics 13:32.
Dlakic, M, RE Harrington. 1998. Unconventional helical phasing of repetitive DNA
motifs reveals their relative bending contributions. Nucleic Acids Res 26:4274-4279.
Dumont, BL, MA White, B Steffy, T Wiltshire, B Payseur. 2011. Extensive
recombination rate variation in the house mouse species complex inferred from
genetic linkage maps. Genome Res 21:114-125.
Fan, Q-Q, TD Petes. 1996. Relationship between nuclease-hypersensitive sites and
meiotic recombination hot spot activity at the HIS4 locus of Saccharomyces
cerevisiae. Mol Cell Biol 16:16.
Fledel-Alon, A, EM Leffler, Y Guan, M Stephens, G Coop, M Przeworski. 2011.
Variation in human recombination rates and its genetic determinants. PLoS ONE
6:e20321.
Fowler, KR, M Sasaki, N Milman, S Keeney, GR Smith. 2014. Evolutionarily diverse
determinants of meiotic DNA break and recombination landscapes across the
genome. Genome Res 24:1650-1664.
Friedman, JH. 1991. Multivariate adaptive regression splines. The annals of
statistics:1-67.
Friedman, JH, CB Roosen. 1995. An introduction to multivariate adaptive regression
splines. Stat Methods Med Res 4:197-217.
Gerton, JL, J DeRisi, R Shroff, M Lichten, PO Brown, TD Petes. 2000. Global mapping
of meiotic recombination hotspots and coldspots in the yeast Saccharomyces
cerevisiae. Proc Natl Acad Sci U S A 97:11383-11390.
Gossmann, TI, AW Santure, BC Sheldon, J Slate, K Zeng. 2014. Highly variable
recombinational landscape modulates efficacy of natural selection in birds. Genome
Biol Evol 6:2061-2075.
Hall, M, E Frank, G Holmes, B Pfahringer, P Reutemann, IH Witten. 2009. The WEKA
data mining software: an update. SIGKDD Explor. Newsl. 11:10-18.
Hartmann, H, EW Guthöhrlein, M Siebert, S Luehr, J Söding. 2013. P-value based
regulatory motif discovery using positional weight matrices. Genome Res 23:181-194.
Hastie, T, R Tibshirani, J Friedman. 2009. The elements of statistical learning:
Springer.
Heil, CS, MA Noor. 2012. Zinc finger binding motifs do not explain recombination
rate variation within or between species of Drosophila. PLoS ONE 7:e45055.
Herbert, A, A Rich. 1996. The biology of left-handed Z-DNA. J Biol Chem 271:11595-11598.
Hinch, AG, A Tandon, N Patterson, et al. 2011. The landscape of recombination in
African Americans. Nature 476:170-175.
Hizver, J, H Rozenberg, F Frolow, D Rabinovich, Z Shakked. 2001. DNA bending by an
adenine--thymine tract and its role in gene regulation. Proc Natl Acad Sci U S A
98:8490-8495.
Hu, J, B Li, D Kihara. 2005. Limitations and potentials of current motif discovery
algorithms. Nucleic Acids Res 33:4899-4913.
Hussin, J, MH Roy-Gagnon, R Gendron, G Andelfinger, P Awadalla. 2011. Age-dependent recombination rates in human pedigrees. PLoS Genet 7:e1002251.
Jeffreys, AJ, R Neumann. 2002. Reciprocal crossover asymmetry and meiotic drive in
a human recombination hot spot. Nat Genet 31:267-271.
Jeffreys, AJ, R Neumann. 2005. Factors influencing recombination frequency and
distribution in a human meiotic crossover hotspot. Human Mol Genet 14:2277-2287.
Jiang, H, N Li, V Gopalan, et al. 2011. High recombination rates and hotspots in a
Plasmodium falciparum genetic cross. Genome Biol 12:R33.
Keeney, S. 2001. Mechanism and control of meiotic recombination initiation. Curr
Topics Dev Biol Volume 52:1-53.
Kim, S, V Plagnol, TT Hu, C Toomajian, RM Clark, S Ossowski, JR Ecker, D Weigel, M
Nordborg. 2007. Recombination and linkage disequilibrium in Arabidopsis thaliana.
Nat Genet 39:1151-1155.
Kleckner, N. 2006. Chiasma formation: chromatin/axis interplay and the role(s) of
the synaptonemal complex. Chromosoma 115:175-194.
Koehler, KE, CL Boulton, HE Collins, RL French, KC Herman, SM Lacefield, LD
Madden, CD Schuetz, RS Hawley. 1996. Spontaneous X chromosome MI and MII
nondisjunction events in Drosophila melanogaster oocytes have different
recombinational histories. Nat Genet 14:406-414.
Kong, A, DF Gudbjartsson, J Sainz, et al. 2002. A high-resolution recombination map
of the human genome. Nat Genet 31:241-247.
Kong, A, G Thorleifsson, DF Gudbjartsson, et al. 2010. Fine-scale recombination rate
differences between sexes, populations and individuals. Nature 467:1099-1103.
Koszul, R, KP Kim, M Prentiss, N Kleckner, S Kameoka. 2008. Meiotic chromosomes
move by linkage to dynamic actin cables with transduction of force through the
nuclear envelope. Cell 133:1188-1201.
Kulathinal, RJ, SM Bennett, CL Fitzpatrick, MA Noor. 2008. Fine-scale mapping of
recombination rate in Drosophila refines its correlation to diversity and divergence.
Proc Natl Acad Sci U S A 105:10051-10056.
Lack, JB, CM Cardeno, MW Crepeau, W Taylor, RB Corbett-Detig, KA Stevens, CH
Langley, JE Pool. 2015. The Drosophila genome nexus: a population genomic
resource of 623 Drosophila melanogaster genomes, including 197 from a single
ancestral range population. Genetics 199:1229-1241.
Lee, JW, JB Lee, M Park, SH Song. 2005. An extensive comparison of recent
classification tools applied to microarray data. Comput Stat Data Anal 48:869-885.
Lindsley, DL, GG Zimm. 1992. The genome of Drosophila melanogaster San Diego,
CA: Academic Press.
Liu, H, X Zhang, J Huang, JQ Chen, D Tian, LD Hurst, S Yang. 2015. Causes and
consequences of crossing-over evidenced via a high-resolution recombinational
landscape of the honey bee. Genome Biol 16:15.
Llopart, A. 2015. Parallel faster-x evolution of gene expression and protein
sequences in Drosophila: beyond differences in expression properties and protein
interactions. PLoS ONE 10:e0116829.
MacIsaac, KD, E Fraenkel. 2006. Practical strategies for discovering regulatory DNA
sequence motifs. PLoS Comput Biol 2:e36.
Mancera, JM, L Vargas-Chacoff, A Garcia-Lopez, A Kleszczynska, H Kalamarz, G
Martinez-Rodriguez, E Kulczykowska. 2008. High density and food deprivation
affect arginine vasotocin, isotocin and melatonin in gilthead sea bream (Sparus
auratus). Comp Biochem Physiol A Mol Integr Physiol 149:92-97.
McGaugh, SE, CS Heil, B Manzano-Winkler, L Loewe, S Goldstein, TL Himmel, MA
Noor. 2012. Recombination modulates how selection affects linked sites in
Drosophila. PLoS Biol 10:e1001422.
Meisel, RP, T Connallon. 2013. The faster-X effect: integrating theory and data.
Trends Genet 29:537-544.
Miller, DE, S Takeo, K Nandanan, et al. 2012. A Whole-Chromosome Analysis of
Meiotic Recombination in Drosophila melanogaster. G3 (Bethesda) 2:249-260.
Mirouze, M, M Lieberman-Lazarovich, R Aversano, E Bucher, J Nicolet, J Reinders, J
Paszkowski. 2012. Loss of DNA methylation affects the recombination landscape in
Arabidopsis. Proc Natl Acad Sci U S A.
Moens, PB, NK Kolas, M Tarsounas, E Marcon, PE Cohen, B Spyropoulos. 2002. The
time course and chromosomal localization of recombination-related proteins at
meiosis in the mouse are compatible with models that can resolve the early DNA-DNA interactions without reciprocal recombination. J Cell Sci 115:1611-1622.
Morton, NE, DC Rao, S Yee. 1976. An inferred chiasma map of Drosophila
melanogaster. Heredity (Edinb) 37:405-411.
Muñoz-Fuentes, V, A Di Rienzo, C Vilà. 2011. Prdm9, a major determinant of meiotic
recombination hotspots, is not functional in dogs and their wild relatives, wolves
and coyotes. PLoS ONE 6:e25498.
Myers, S, L Bottolo, C Freeman, G McVean, P Donnelly. 2005. A fine-scale map of
recombination rates and hotspots across the human genome. Science 310:321-324.
Myers, S, C Freeman, A Auton, P Donnelly, G McVean. 2008. A common sequence
motif associated with recombination hot spots and genome instability in humans.
Nat Genet 40:1124-1129.
Neel, JV. 1941. A relation between larval nutrition and the frequency of crossing
over in the third chromosome of Drosophila melanogaster. Genetics 26:506-516.
Ohta, K, T Shibata, A Nicolas. 1994. Changes in chromatin structure at recombination
initiation sites during yeast meiosis. EMBO 13:9.
Oliver, PL, L Goodstadt, JJ Bayes, Z Birtle, KC Roach, N Phadnis, SA Beatson, G Lunter,
HS Malik, CP Ponting. 2009. Accelerated Evolution of the Prdm9 Speciation Gene
across Diverse Metazoan Taxa. PLoS Genet 5:e1000753.
Pan, J, S Keeney. 2007. Molecular cartography: mapping the landscape of meiotic
recombination. PLoS Biol 5:e333.
Pan, J, M Sasaki, R Kniewel, et al. 2011. A hierarchical combination of factors shapes
the genome-wide topography of yeast meiotic recombination initiation. Cell
144:719-731.
Parsons, PA. 1988. Evolutionary rates: effects of stress upon recombination. Biol J
Linn Soc 35:49-68.
Parvanov, ED, PM Petkov, K Paigen. 2010. Prdm9 Controls Activation of Mammalian
Recombination Hotspots. Science 327:835.
Parvinen, M, KO Soderstrom. 1976. Chromosome rotation and formation of synapsis.
Nature 260:534-535.
Perez-Martin, J, V de Lorenzo. 1997. Clues and consequences of DNA bending in
transcription. Annu Rev Microbiol 51:593-628.
Petes, TD. 2001. Meiotic recombination hot spots and cold spots. Nat Rev Genet
2:360-369.
Pool, J, R Corbett-Detig, R Sugino, et al. 2012. Population Genomics of sub-saharan
Drosophila melanogaster: African diversity and non-African admixture. PLoS Genet
8:e1003080.
Ross, JA, DC Koboldt, JE Staisch, HM Chamberlin, BP Gupta, RD Miller, SE Baird, ES
Haag. 2011. Caenorhabditis briggsae recombinant inbred line genotypes reveal
inter-strain incompatibility and the evolution of recombination. PLoS Genet
7:e1002174.
Schneider, TD. 2002. Consensus Sequence Zen. Appl Bioinformatics 1.
Segal, E, Y Fondufe-Mittendorf, L Chen, A Thastrom, Y Field, IK Moore, JP Wang, J
Widom. 2006. A genomic code for nucleosome positioning. Nature 442:772-778.
Ségurel, L, EM Leffler, M Przeworski. 2011. The Case of the Fickle Fingers: How the
PRDM9 Zinc Finger Protein Specifies Meiotic Recombination Hotspots in Humans.
PLoS Biol 9:e1001211.
Shibuya, H, K Ishiguro, Y Watanabe. 2014. The TRF1-binding protein TERB1
promotes chromosome movement and telomere rigidity in meiosis. Nat Cell Biol
16:145-156.
Simcha, D, ND Price, D Geman. 2012. The Limits of De Novo DNA Motif Discovery.
PLoS ONE 7:e47836.
Singh, ND, EA Stone, CF Aquadro, AG Clark. 2013. Fine-scale heterogeneity in
crossover rate in the Garnet-Scalloped region of the Drosophila melanogaster X
Chromosome. Genetics.
Smukowski, CS, MA Noor. 2011. Recombination rate variation in closely related
species. Heredity (Edinb) 107:496-508.
Steiner, WW, EM Steiner, AR Girvin, LE Plewik. 2009. Novel nucleotide sequence
motifs that produce hotspots of meiotic recombination in Schizosaccharomyces
pombe. Genetics 182:459-469.
Stevison, LS, AE Woerner, JM Kidd, JL Kelley, KR Veeramah, KF McManus, CD
Bustamante, MF Hammer, JD Wall. 2015. The time-scale of recombination rate
evolution in great apes. bioRxiv doi:10.1101/013755.
Tibshirani, R. 1996. Regression Shrinkage and Selection via the Lasso. J. R. Stat Soc.
B-Methodological 58:267-288.
Travers, AA. 1990. Why bend DNA? Cell 60:177-180.
Treco, D, N Arnheim. 1986. The evolutionarily conserved repetitive sequence
d(TG.AC)n promotes reciprocal exchange and generates unusual recombinant
tetrads during yeast meiosis. Mol Cell Biol 6:3934-3947.
Wu, T-C, M Lichten. 1994. Meiosis-induced double-strand break sites determined by
yeast chromatin structure. Science 263:3.
Yelina, NE, K Choi, L Chelysheva, et al. 2012. Epigenetic remodeling of meiotic
crossover frequency in Arabidopsis thaliana DNA methyltransferase mutants. PLoS
Genet 8:e1002844.
CHAPTER 4: Environmental contribution to recombination
variation.
4.1 Introduction
Since the early 1900s, researchers have realized that meiotic recombination exhibits an
astounding degree of variation. This variation in recombination rates persists at varying
levels—from within a single individual’s genome, to large differences between individuals
of the same species. These differences are predicted to alter the amount of standing
polymorphism throughout the genome, and for that reason, variation in recombination
rates is an important feature that affects the evolution of a species. Given the substantial
amount of variation in crossing over rates, one might be surprised to find that the meiotic
machinery and process of meiosis are heavily conserved (Keeney 2001). In fact, the process
of meiosis is pervasive throughout all animal life on earth with relatively few exceptions.
Given the apparent paradox exhibited by the near-universal conservation of
recombination, but vast variation in its distribution, progress in explaining this variation has been slow.
Many factors have been heretofore implicated in affecting the distribution of recombination
events, such as meiotic transcription (Adrian and Comeron 2013), epigenetic marks
(Mirouze, Lieberman-Lazarovich et al. 2012, Yelina, Choi et al. 2012), nucleotide
composition (Miller, Takeo et al. 2012), and sequence motifs (Steiner, Steiner et al. 2009, Baudat, Buard et
al. 2010), among others. Despite significant effort, models predicting rates of crossover are
incomplete (Adrian, Cruz Corchado, and Comeron, in review), and require additional data to
incorporate thermodynamic considerations of the recombination pathway.
Early experiments by Kidwell and others (Detlefsen and Clemente 1923, Stern 1926,
Bridges 1927, Kidwell 1972) identified that modulators of recombination rates were
numerous. Aside from recombination rates being altered between species and among
populations of a single species, certain environmental factors were quickly appreciated to
have a strong effect on genome wide recombination rates. In particular, maternal age,
temperature, and food quality were determined to affect genome wide recombination rates
(Bridges 1915, Charlesworth & Charlesworth 1985, Plough 1917). The same regions of the
genome would display different linkage maps among populations and under differential
conditions. These observations provided researchers with an abundant assortment of
questions. Are there factors that decrease recombination rates in addition to the many
factors that appear to increase recombination rates? Are the factors shaping recombination
responses to differential environments the same as the ones controlling recombination
rates at other levels? Does the distribution of recombination events change, or does a
proportional change occur across the genome?
Recently, researchers have assumed that rearing Drosophila at different
temperatures would change overall recombination rates proportionally, but not the
distribution of the events between phenotypic markers (Singh 2014). Whether this
assumption is valid, or if recombination rates are altered in other regions differentially is
presently unknown. Preliminary work in our lab performed in 2010 suggested that overall
rates and genomic distribution of recombination events would be altered. Furthermore, due
to a collection of previous observations by our lab and others, we felt it was important to
understand how recombination rates are altered in a few specific cases where the
environment was altered. While early studies had identified a collection of environmental factors that
affect recombination, no one had described how the distribution and amount of recombination
change in response to any factor at a genome-wide, high-resolution (500kb or less) scale.
As a result, we set out to utilize next-generation sequencing to rapidly generate high-resolution
maps of crossing over under several conditions that wild Drosophila would
naturally encounter: elevated temperature, decreased temperature, and acetic acid stress.
The results reported herein are the first of a large project describing the impact of
environmental conditions on recombination rates. As a preliminary step, we have created a broad-scale
genetic map of flies exposed to 0.5% acetic acid in their growth medium at two time
frames. Thus far, I have generated more than 500 single-meiosis library preparations for
sequencing. As sequencing data are not available at the time of writing this thesis, I will
instead show the data from the crosses performed and elaborate upon the preliminary
studies and development of new protocols facilitating the high-throughput development of
fine-scale recombination maps.
4.2 Results & Discussion
4.2.1 Identification of Stress-Inducing Conditions
As we wanted to assay recombination rates during natural stress, I chose to pursue
both age and acid-related stress and how these conditions affect recombination rates of
Drosophila. Acetic acid is the final fermentation product of rotting fruits and is naturally
present in many food sources that Drosophila feed upon. Drosophila are attracted to the
odor of vinegar in low concentration, and repulsed by the excessively strong odor of more
concentrated vinegar (Ai, Min et al. 2010). Rotting bananas—a favorite of Drosophilids
cohabitating with humans—have been shown to contain 0.28-0.74% acetic acid per unit
weight while fermenting (Omura and Honda 2003). Because acetic acid occurs commonly in
the natural food of Drosophila, yet elicits clear avoidance
behavior at high concentration, we chose to assay recombination rates under acetic acid
stress. Notably, this very experiment (albeit without next-generation sequencing) was
suggested by Gowen (1919) who observed early on that many environmental factors
affected recombination rates in our favorite model fly. We first determined what
concentration of acid to use by rearing Drosophila on media containing 0, 0.5, 1, and 2%
acetic acid by volume of standard corn meal agar. Media containing 0.5 and 1% acid
produced a 20-40% reduction in the number of offspring surviving post-eclosion relative to
control (Figure 4.1). Interestingly, though there was no statistically significant difference in offspring
number between 0.5% and 1% acid (P > 0.05), bottles containing 2% acetic acid produced a
maximum of one living fly across three experiments. We, therefore, chose to continue our
experiments with 0.5% acetic acid.
Figure 4.1 Surviving offspring in media containing acetic acid.
Standard corn meal agar with 0, 0.5, 1, and 2% acetic acid by volume.
Error bars indicate +/- 1 Standard Error across three experiments.
Blue bars: strain 6036. Red bars: strain Raleigh 375
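The summary statistics behind a plot of this kind can be sketched as follows. This is a minimal illustration only: the offspring counts are hypothetical placeholders rather than the data shown in Figure 4.1, and the function names are assumptions introduced for this example.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_se(counts):
    """Return (mean, standard error of the mean) for replicate counts."""
    m = mean(counts)
    se = stdev(counts) / sqrt(len(counts))  # sample SD divided by sqrt(n)
    return m, se


def percent_reduction(treated_mean, control_mean):
    """Percent reduction in surviving offspring relative to control."""
    return 100.0 * (1.0 - treated_mean / control_mean)


# Hypothetical offspring counts from three replicate bottles per treatment.
counts_by_percent_acid = {
    0.0: [100, 110, 90],
    0.5: [70, 75, 65],
    1.0: [60, 68, 64],
    2.0: [0, 1, 0],
}

control_mean, _ = mean_and_se(counts_by_percent_acid[0.0])
for acid, counts in counts_by_percent_acid.items():
    m, se = mean_and_se(counts)
    reduction = percent_reduction(m, control_mean)
    print(f"{acid:.1f}% acid: mean={m:.1f} SE={se:.1f} reduction={reduction:.0f}%")
```

The standard error here uses the sample standard deviation (n - 1 denominator), which is the usual choice for three replicates.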
4.2.2 Recombination Frequency among Phenotypic Markers
I crossed mutant flies containing five phenotypic markers on the X chromosome
(strain 6036), with a Drosophila Genetic Reference Panel (DGRP) strain, 375. These
heterozygous flies were then placed on acid media with new purebred males to generate
recombinant offspring (importantly, Drosophila melanogaster males do not exhibit
recombination). Flies were allowed to lay eggs for three days before being moved to new
acid media for days 4-7, and then transferred to a final bottle of acid media. This scheme
allowed us to assay recombination rates of acid-stressed flies and also observe age-related
stress effects. Of 4648 male flies scored for the presence or absence of five markers
independently, we observed, on average, a significant (χ² P < 0.05 in all cases) reduction in
recombination rates across y-w, w-ct, and ct-m intervals (Figure 4.2) following correction
for interference (Kosambi 1944). The m-f interval displayed no significant reduction
following correction. To my knowledge, this is the first report of a reduction in
recombination rates following the application of an environmental ‘stress’ to an organism
during meiosis. Notably, however, our mutant-carrying strain 6036 displays reduced
fitness relative to wild type due to the phenotypic markers that affect several
developmental processes, namely eclosion (Figure 4.1). For this reason, we will carry out
recombination assays using Illumina sequencing on recombinant, but not phenotypically
affected, females. This approach will allow a decidedly un-biased assay of recombination
rates across the complete genome under stress. If this observed reduction is supported by
our sequencing results, however, the implications bode poorly for this population of flies
experiencing stress. In such a case, a genome-wide reduction in recombination rates would
be expected to reduce the capacity of the population to adapt to its present conditions. The
present assay is limited to just the X-chromosome, and it is likely that other chromosomes
display different alterations in recombination rates. Nevertheless, why these particular
regions exhibit repression of recombination is interesting, and could potentially indicate the
lack of polymorphisms affecting fitness under acetic acid stress on the X chromosome.
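As an illustration of the calculation described above, the Kosambi (1944) mapping function converts an observed recombination fraction r into a map distance of 25 ln[(1 + 2r)/(1 - 2r)] centiMorgans. The sketch below applies that function and then compares an observed recombinant count with a reference map distance using a one-degree-of-freedom χ² goodness-of-fit test. The exact test design used for Figure 4.2 is not specified in the text, so the test shown here is an assumption, and all counts and reference distances are hypothetical.

```python
import math


def kosambi_cm(r):
    """Kosambi map distance (cM) from a recombination fraction 0 <= r < 0.5."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must satisfy 0 <= r < 0.5")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def kosambi_r(d_cm):
    """Inverse Kosambi function: recombination fraction from distance in cM."""
    return 0.5 * math.tanh(2.0 * d_cm / 100.0)


def chi_square_1df(n_recombinant, n_total, expected_r):
    """Goodness-of-fit chi-square (1 df) for recombinant vs. non-recombinant counts.

    Returns (chi2, p). For 1 df the upper-tail probability is erfc(sqrt(chi2/2)).
    """
    exp_rec = n_total * expected_r
    exp_non = n_total - exp_rec
    obs_non = n_total - n_recombinant
    chi2 = ((n_recombinant - exp_rec) ** 2 / exp_rec
            + (obs_non - exp_non) ** 2 / exp_non)
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical example: 150 recombinants among 1000 scored males in an interval
# whose reference map length is taken to be 20 cM (placeholder value).
observed_r = 150 / 1000
expected_r = kosambi_r(20.0)
chi2, p = chi_square_1df(150, 1000, expected_r)
print(f"observed distance = {kosambi_cm(observed_r):.2f} cM")
print(f"chi2 = {chi2:.2f}, P = {p:.3g}")
```

Converting the reference distance back to an expected recombination fraction keeps the comparison on the scale of the counts that were actually scored.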
Figure 4.2 Map distances between select phenotypic markers under acid stress.
[Bar chart. y-axis: map distance (cM), 0.0 to 25.0. x-axis: interval (y-w, w-ct, ct-m, m-f). Series: B1 Observed, B2 Observed, Expected (Flybase).]
Map distances in centiMorgans among five phenotypic markers. Blue: Observed map distance for acid-stressed
flies from eggs laid within the first three days post-mating. Red: Observed map distance for acid-stressed flies
from eggs laid within three to seven days post-mating. Green: Expected distances obtained from Flybase.org.
4.3 Materials and Methods
4.3.1 Generation of Recombinants
Recombinant flies were generated by first crossing male RAL 375 strain D.
melanogaster with virgin females of strain 6036 which contained homozygous X-linked
markers for yellow, white, cut, miniature, and forked (y, w, ct, m, f, respectively) to allow for
manual scoring of male recombinants and to track that no errors were made during crosses.
Crosses were kept at 23.5°C unless otherwise specified. Heterozygotes were allowed to
eclose and virgin females were crossed with new male 375 flies in vials, and then placed on
cornmeal media containing 0.5% acetic acid after 12 hours. Adults were transferred to
new media (also containing 0.5% acetic acid) at 3 and 7 days post-eclosion. Offspring
males from each bottle were scored for their recombinant phenotype and frozen alongside
females at -20°C until library preparation. Acetic acid media was prepared by liquefying
prepared cornmeal media with a small amount of distilled water added to assist melting
and to replace water lost to evaporation. Milk bottles were filled with either the reliquefied cornmeal
media or the media containing an addition of 0.5% glacial acetic acid. Bottles were allowed
to cool and set for 24 hours before use.
4.3.2 DNA Library Preparation
Illumina libraries were prepared using a heavily modified protocol based on
Comeron et al. (2012). Many steps in the procedure had to be modified in order to yield
consistently successful preparations in a high-throughput setting. Single flies were disrupted
for 30 seconds at 50 cycles per second in a Qiagen TissueLyser LT. DNA was extracted from
single recombinant female flies in a 96-well format with the Qiagen DNeasy 96-well kit,
following the manufacturer’s insect protocol with additional proteinase K added to
maximize yield. DNA extracts were quantified using a Molecular Biosystems 96-well plate
reader following a modified PicoGreen assay protocol using Promega Quant-IT reagent to
detect fluorescence of dsDNA. DNA extracts were then digested with either MboII, HpyAV,
or MnlI restriction enzymes at 37°C for 50 minutes, with an inactivation at 56°C for 20
minutes. This digestion was important for two reasons. First, the use of these enzymes
leaves a 3’ adenine overhang that allows for the ligation of sequence specific adapters.
Second, each restriction enzyme leaves a specific and detectable signature that can be
bioinformatically identified, allowing the repeated use of the same adapter sequence with
different restriction enzymes in multiplex.
The digested products then had Illumina specific adapters ligated overnight at 16°C.
The ligation was terminated at 56°C for 20 minutes, and the USER enzyme (New England
Biolabs) was finally added to excise an internal uracil linker in the linked adapters. These
adapter-ligated products were grouped in sets of five by concentration, and cleaned up
using a modified Serapure magnetic bead mixture (GE Healthcare) to optimally exclude
excess unligated adapters. The cleaned-up reaction was run on a 0.75% agarose gel at 90V
for 80 minutes and a 450bp fragment was excised from each reaction and purified using the
Qiagen Gel Cleanup kit following the manufacturer’s recommended protocol. The
size-selected, adapter-ligated products were then enriched via high-fidelity polymerase chain
reaction with NEB Phusion HF master mix for 19 to 22 cycles depending on initial
concentration. An aliquot of the library was run on an agarose gel to confirm success and
correct size distribution. Completed libraries were then again quantified on a Molecular
Biosystems plate reader and adjusted to a final concentration of 15nM. Illumina sequencing
was performed on an Illumina HiSeq 2500 with a read length of 125bp.
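The final adjustment to 15nM requires converting a fluorescence-based mass concentration into molarity. A minimal sketch of that arithmetic follows, assuming the commonly used average of 660 g/mol per base pair of double-stranded DNA and taking the 450bp excised fragment size as the library length (the true post-amplification length may differ). The function names and example readings are illustrative assumptions, not part of the protocol.

```python
def dsdna_nanomolar(ng_per_ul, fragment_bp, g_per_mol_per_bp=660.0):
    """Convert a dsDNA concentration in ng/uL to nM.

    1 ng/uL equals 1e-3 g/L, so nM = ng_per_ul * 1e6 / (660 * length in bp).
    """
    return ng_per_ul * 1e6 / (g_per_mol_per_bp * fragment_bp)


def dilution_volumes(stock_nm, target_nm, final_ul):
    """Return (stock_ul, diluent_ul) to reach target_nm in final_ul (C1*V1 = C2*V2)."""
    if stock_nm < target_nm:
        raise ValueError("stock is already below the target concentration")
    stock_ul = target_nm * final_ul / stock_nm
    return stock_ul, final_ul - stock_ul


# Hypothetical plate-reader value for one library.
stock = dsdna_nanomolar(ng_per_ul=10.0, fragment_bp=450)
stock_ul, diluent_ul = dilution_volumes(stock, target_nm=15.0, final_ul=20.0)
print(f"stock = {stock:.1f} nM; mix {stock_ul:.2f} uL library + {diluent_ul:.2f} uL buffer")
```

Libraries that quantify below the target cannot be adjusted by dilution, which is why the helper raises an error rather than returning a negative volume.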
4.3.3 Recombination rate estimation
For each library, reads will be split based on adapter sequence and restriction
enzyme signature and trimmed 3’ to eliminate the barcode sequence, and 5’ until a quality
score of ten or greater is reached. Quality metrics will be performed to ensure un-biased
sequencing and to verify the absence of sequencing artifacts. Because of the heavily
multiplexed nature of our libraries and sensitivity of our approach to errors, we will
require exact read-matches to the respective genome in order to eliminate untrustworthy
data. In order to do so, we will first loosely map to the Drosophila reference genome, and
the mapped reads will then be re-mapped to the 6036 and 375 genomes. Heterozygous
sites and ambiguous bases will be removed from the analysis. For each sample, crossover
sites will be inferred by ‘switches’ to either 6036 or 375 at polymorphic sites as per
Comeron et al. 2012.
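The idea of inferring crossovers from parental ‘switches’ can be sketched as follows. This is a simplified illustration and not the pipeline of Comeron et al. (2012): it assumes each informative polymorphic site has already been assigned to one parental genome, and the `min_sites` filter for discarding short, poorly supported runs is an assumption added for this example.

```python
from itertools import groupby


def collapse_runs(calls):
    """Collapse position-sorted (position, parent) calls into runs.

    Each run is (parent, first_position, last_position, n_sites).
    """
    runs = []
    for parent, group in groupby(calls, key=lambda call: call[1]):
        positions = [pos for pos, _ in group]
        runs.append((parent, positions[0], positions[-1], len(positions)))
    return runs


def infer_crossovers(calls, min_sites=1):
    """Return crossover intervals (last site of one run, first site of the next).

    Runs supported by fewer than min_sites markers are dropped, and flanking
    runs from the same parent are then merged before switches are reported.
    """
    runs = [run for run in collapse_runs(sorted(calls)) if run[3] >= min_sites]
    merged = []
    for parent, start, end, n_sites in runs:
        if merged and merged[-1][0] == parent:
            prev_parent, prev_start, _, prev_n = merged[-1]
            merged[-1] = (prev_parent, prev_start, end, prev_n + n_sites)
        else:
            merged.append((parent, start, end, n_sites))
    return [(left[2], right[1]) for left, right in zip(merged, merged[1:])]


# Hypothetical per-site parental assignments along one chromosome arm.
calls = [(100, "6036"), (200, "6036"), (300, "375"), (400, "375"), (500, "375")]
print(infer_crossovers(calls))
```

Each reported interval brackets the crossover between the last marker from one parent and the first marker from the other, so the resolution of the map is set by local marker density.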
4.4 Chapter 4 References
Adrian, A. and J. Comeron (2013). "The Drosophila early ovarian transcriptome provides
insight to the molecular causes of recombination rate variation across genomes." BMC
Genomics 14(794).
Ai, M., S. Min, Y. Grosjean, C. Leblanc, R. Bell, R. Benton and G. S. Suh (2010). "Acid sensing by
the Drosophila olfactory system." Nature 468(7324): 691-695.
Baudat, F., J. Buard, C. Grey, A. Fledel-Alon, C. Ober, M. Przeworski, G. Coop and B. de Massy
(2010). "PRDM9 Is a Major Determinant of Meiotic Recombination Hotspots in Humans and
Mice." Science 327(5967): 836-840.
Bridges, C. B. (1927). "The Relation of the Age of the Female to Crossing over in the Third
Chromosome of Drosophila Melanogaster." J Gen Physiol 8(6): 689-700.
Comeron, J. M., R. Ratnappan and S. Bailin (2012). "The many landscapes of recombination
in Drosophila melanogaster." PLoS Genet 8(10): e1002905.
Detlefsen, J. A. and L. S. Clemente (1923). "Genetic Variation in Linkage Values." Proc Natl
Acad Sci U S A 9(5): 149-156.
Gowen, J. W. (1919). "A Biometrical Study of Crossing Over: On the Mechanism of Crossing
Over in the Third Chromosome of Drosophila melanogaster." Genetics 4(3): 205-250.
Keeney, S. (2001). Mechanism and control of meiotic recombination initiation. Current
Topics in Developmental Biology, Academic Press. Volume 52: 1-53.
Kidwell, M. G. (1972). "Genetic change of recombination value in Drosophila melanogaster.
II. Simulated natural selection." Genetics 70(3): 433-443.
Kosambi, D. D. (1944). "The estimation of map distance from recombination values." Annals
of Eugenics 12: 172-175.
Miller, D. E., S. Takeo, K. Nandanan, A. Paulson, M. M. Gogol, A. C. Noll, A. G. Perera, K. N.
Walton, W. D. Gilliland, H. Li, K. K. Staehling, J. P. Blumenstiel and R. S. Hawley (2012). "A
Whole-Chromosome Analysis of Meiotic Recombination in Drosophila melanogaster." G3
(Bethesda) 2(2): 249-260.
Mirouze, M., M. Lieberman-Lazarovich, R. Aversano, E. Bucher, J. Nicolet, J. Reinders and J.
Paszkowski (2012). "Loss of DNA methylation affects the recombination landscape in
Arabidopsis." Proceedings of the National Academy of Sciences.
Omura, H. and K. Honda (2003). "Feeding responses of adult butterflies, Nymphalis
xanthomelas, Kaniska canace and Vanessa indica, to components in tree sap and rotting
fruits: synergistic effects of ethanol and acetic acid on sugar responsiveness." J Insect
Physiol 49(11): 1031-1038.
Steiner, W. W., E. M. Steiner, A. R. Girvin and L. E. Plewik (2009). "Novel Nucleotide
Sequence Motifs That Produce Hotspots of Meiotic Recombination in Schizosaccharomyces
pombe." Genetics 182(2): 459-469.
Stern, C. (1926). "An effect of temperature and age on crossing-over in the first chromosome
of Drosophila melanogaster." Proc Natl Acad Sci U S A 12(8): 530-532.
Yelina, N. E., K. Choi, L. Chelysheva, M. Macaulay, B. de Snoo, E. Wijnker, N. Miller, J. Drouaud,
M. Grelon, G. P. Copenhaver, C. Mezard, K. A. Kelly and I. R. Henderson (2012). "Epigenetic
Remodeling of Meiotic Crossover Frequency in Arabidopsis thaliana DNA
Methyltransferase Mutants." PLoS Genet 8(8): e1002844.
CHAPTER 5: Conclusions & Future Directions
5.1 Transcriptional Effects on Crossover Localization
In Chapter 2, I sought to more fully understand the impact of the transcriptional
environment of meiotic cells on fine-scale crossover localization. We found evidence that
active transcription during early meiotic development increases the likelihood of
crossovers in D. melanogaster, and similarly that reduced transcription was correlated with
reduced levels of recombination. This evidence supports the idea that transcriptionally
active chromatin may be more susceptible than quiescent chromatin to DSBs. Our models of
crossover patterning are enriched by this study because gene expression is also
nonrandomly distributed throughout the genome, and displays high variability analogous to
the many levels of variability displayed by recombination rates. Therefore, gene expression
could provide a molecular, heritable, and plastic mechanism to explain the observed
patterns of recombination variation—from the high level of intraspecific variation, to the
known influence of environmental conditions.
I found that, while the tissues are highly similar, there are significant differences
in gene expression between early and late meiotic tissues in the Drosophila germarium.
Many of the genes that are over- or under-expressed are involved in proteolysis, likely
underscoring the requirement for rapid turnover of many stepwise developmental and
organizational processes in the developing oocytes. Our analysis also produced a wealth of
information concerning new genes and exons—many of which mapped to thus far
unassembled regions of the Drosophila genome. This is fairly remarkable given that
Drosophila is one of the most studied and best characterized organisms; these results
suggest the existence of many unknown processes ongoing in Drosophila worthy of
continued research. Furthermore, I showed that there are multiple parent-of-origin effects
on transcription even among similar fly lines (founded from females from the same
geographical location). This finding is interesting in that it speaks to the rapidity with which
transcription can be altered from generation to generation during a highly conserved and
regulated process, though more study is needed in order to explore this possibility further.
Finally, I showed that transcribed genes in meiosis are not distributed randomly along the
genome, but are instead frequently clustered. The clustering of transcribed genes during
meiosis likely underlies the openness of chromatin domains along particular regions, and
may contribute to the accessibility of DNA during recombination initiation.
Despite our exceptionally deep (400x per condition) transcriptional coverage of
meiotic tissues, our study was technically limited in several ways. I was limited in my ability
to isolate the exact cells undergoing DSB formation. In my study, I produced a
transcriptome for those cells experiencing DSB formation along with associated nurse and
epithelial cells; the cells of interest perhaps constituted only ~5% of the overall cell
population. This limitation was due to the necessity of hand-dissecting the desired regions
with electrolytically sharpened tungsten probes. While I have since successfully
microdissected out the 2a/2b regions of the germaria using laser capture microscopy, the
tools to do so were unavailable at the time the study took place. A clear advancement of my
study would be to utilize laser-capture microdissection in future experiments. Despite this
limitation, the deep read coverage allows me to be reasonably confident that
non-transcribed or lowly transcribed genes represent truly silent or ‘quiet’ genes during
meiosis, which supports our finding that regions of low recombination are associated with
regions of repressed or absent transcription.
The exact nature of the relationship between transcription and the likelihood of
recombination is as yet unknown. It remains to be seen whether transcription itself is the
causative factor, or whether one or more related factors are responsible instead. Future
studies should investigate the potential roles of chromatin accessibility and of related
factors such as histone methylation. To me, transcription represents a plastic phenomenon
that is known to be altered under varying conditions and among populations, which is the
reason that I pursued the transcriptional avenue rather than, say, DNA accessibility, for
which the intraspecies variance is less well studied. Additional work is required to parse out the causative agents
and to reveal their underlying molecular mechanisms. For now, models including
transcription appear to be reasonably suited to explaining a nontrivial amount of the
variance in rates of recombination.
5.2 In silico Prediction of Recombination Rates Based on Multiple Motifs
In Chapter 3, I conducted a computational investigation of previously identified
recombination-associated motifs in Drosophila. I found that a combination of motifs can
explain an unprecedented quarter (or more) of the overall variance in genomic
recombination rates. Through the use of careful, FDR-corrected estimates of motif
occurrences, I showed that the presence of poly-A containing or poly-[AT] containing motifs is
most significantly predictive of regions of increased recombination, and that the absence of
these motifs is predictive of depressed rates of recombination (and not a byproduct of local
A/T base composition). Widening our model to include additional factors such as
transcription further improves our predictive capabilities, though motif utilization appears
to play the dominant role in determining model accuracy. These observations support a
model where chromatin accessibility is key to specifying broad-range crossover
localization, and further indicate that multiple DNA motifs may be an important factor in
the fine scale localization of crossing-over events. Furthermore, the independent
identification of a PRDM9-like core motif suggests that there may be unappreciated
commonalities between mammalian and insect systems with respect to recombination
localization, even though no PRDM9 homolog has thus far been identified in Drosophila.
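The FDR correction of motif calls mentioned above can be illustrated with a minimal sketch. It assumes the Benjamini-Hochberg step-up procedure, which is one common way to control the false discovery rate and is not necessarily the procedure used in Chapter 3. The motif names and p-values are hypothetical and serve only to show the mechanics.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which tests pass at FDR level alpha.

    Assumption: Benjamini-Hochberg step-up procedure (illustrative choice).
    """
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank

    # Every test ranked at or below max_rank is called significant
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            significant[idx] = True
    return significant


# Hypothetical enrichment p-values for five candidate motifs (not real data)
motif_p = {
    "motif_A": 0.001,
    "motif_B": 0.008,
    "motif_C": 0.039,
    "motif_D": 0.041,
    "motif_E": 0.60,
}
calls = benjamini_hochberg(list(motif_p.values()), alpha=0.05)
for name, passed in zip(motif_p, calls):
    print(name, "retained" if passed else "rejected")
```

With many candidate motifs tested at once, a correction of this kind keeps the expected proportion of falsely enriched motifs among the retained set below the chosen level, which matters when the retained motifs are later used as predictors.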
Importantly, the use of independent estimates of recombination through maps of
linkage disequilibrium in natural populations of Drosophila supports the validity of our
results (Comeron, Ratnappan et al. 2012). Since the association of the motifs with these
historical recombination datasets is greater than their association with the recombination
data from which the motifs were generated, the most influential motifs may not be as
transient as the PRDM9 motif. Moreover, our observation that these core motifs are utilized
differentially among chromosome arms indicates that recombinational processes may be
specified differently among the chromosomes, a finding that has not yet been reported in
any species examined thus far (Adrian, Cruz Corchado, and Comeron). Lastly, the
composition of the most influential motifs speaks to a possible role for primary structure at
sites local to recombinational events and raises the possibility that certain chromatin
configurations increase the likelihood of a successful DSB or resolution as a crossover.
One barrier for this study was the resolution of crossing-over events. While
population estimates for D. melanogaster are accurate at a 100-kb scale, this scale is not
best suited for generating predictive models for recombination rates. This is somewhat
mitigated by our use of linkage-disequilibrium-based estimates of recombination.
Furthermore, our initial identification of seed motifs was based on a small number of
crossover events that were delimited by 500 base pairs or less (representing less than 5%
of the overall dataset). If recombination rates were accurate at sub-1-kb resolution, the initial
enrichment calls for seed motifs would have been more accurate and perhaps more
biologically informative. Thus far, such scales of crossover resolution do not exist for any
organism except S. cerevisiae (Mancera, Bourgon et al. 2008). Finally, our study of motifs
would have been bolstered by additional recombination data under different conditions,
allowing us to more properly parse out the effects of our motifs.
I have shown that a combination of motifs and transcriptional activity is predictive
of recombination rates in Drosophila. However, future studies should investigate to what
extent these predictive motifs alter recombination rates, and whether they vary from species
to species. Key questions remain: Are these motifs utilized in other Drosophilids? Is any
one, or any combination, of these motifs necessary for DSB induction? Do these motifs work
to alter chromatin structure for DSB induction, or are they secondary? In order to address
these questions, a variety of molecular genetics and computational approaches should be
employed in concert. Only with a broad-spectrum view of these many factors will we
formulate an accurate model of recombination localization.
5.3 Environmental Impact on Recombination Rates
In Chapter 4, I presented the start of an investigation into how various
environmental conditions affect recombination variation in Drosophila melanogaster.
While it has been well established that many conditions do affect recombination rates,
exactly how and to what extent is largely unknown. Recent studies have even shown that
bacterial infection leads to an increase in the recombination rate between markers in
Drosophila (Singh, Criscoe et al. 2015). However, little is still known about the
distribution of these additional recombinant events, or whether the increase represents a
genome-wide phenomenon. I sought to approach this problem in a systematic way by
examining how age, temperature, and food stress alter recombination landscapes. Herein I
presented preliminary data on food stress induced by acetic acid, a naturally occurring product of the
oxidation of ethanol that is often present in the food of wild Drosophila. From my
preliminary scoring of recombinants, it appears that acetic acid may repress recombination
across the X chromosome. This could, however, be an artifact of differential survival of
recombinants and thus should be treated with caution until less biased, NGS-based data
can be obtained.
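As an illustration of how scored recombinants from a control and a stressed condition might be compared, the following minimal sketch applies a pooled two-proportion z-test to the recombinant fractions for one marker interval. The test, the function name, and all counts are assumptions made for illustration; they are not the analysis or the data from Chapter 4. A test of this kind cannot distinguish a true change in recombination from differential survival of recombinants, which is the caveat raised above.

```python
import math


def two_proportion_z_test(rec_a, n_a, rec_b, n_b):
    """Compare recombinant fractions between two conditions.

    rec_a, rec_b: numbers of recombinant progeny scored in each condition.
    n_a, n_b: total numbers of progeny scored in each condition.
    Returns (z, two-sided p-value) under a pooled normal approximation.
    """
    frac_a = rec_a / n_a
    frac_b = rec_b / n_b
    # Pooled recombinant fraction under the null of no difference
    pooled = (rec_a + rec_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (frac_a - frac_b) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts for one X-linked marker interval (not real data)
control_recombinants, control_total = 300, 1000
acetic_recombinants, acetic_total = 240, 1000

z, p = two_proportion_z_test(
    control_recombinants, control_total, acetic_recombinants, acetic_total
)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

A positive z here would indicate a lower recombinant fraction in the acetic acid condition than in the control. The normal approximation is reasonable only when the counts in both conditions are large.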
This investigation should be continued in order to determine how different
stressors affect recombination landscapes, and whether there is overlap in those
differences. Additionally, I will investigate how recombination landscapes are altered when
multiple stressors are combined, such as maternal age with temperature stress or maternal
age with acetic acid stress. I am curious to find whether the conditions are additive in their
effects or whether a new pattern forms altogether. These findings can then be examined in
light of previous findings to determine whether environmental factors affect recombination
rates on top of population-level variance, or whether environmental effects are separate
altogether.
5.4 Summary
Variation in the distribution of organisms’ recombination rates has been observed since the
early 1900s. Until relatively recently, the extent of that variation was largely unknown. Our
lab and others have made substantial contributions to this field by identifying and
quantifying the many levels of recombination rate variation in Drosophila. My efforts have
been focused on the causes of this variation at the population and environmental levels by
utilizing a variety of genomics and molecular genetics techniques. Throughout my
investigations, we have learned that valuable information may exist in the primary DNA
sequence of organisms, and that the meiotic environment during DSB induction and
resolution is an important factor contributing to recombination landscapes. We have
learned that motifs may play a major role in the localization of crossovers within the
Drosophila genome, and that transcriptional effects also play a role, albeit a much smaller
one. Thus far, we can explain up to 60% of the observed variance in recombination rates;
however, a substantial portion remains unexplained. I anticipate that a deeper
understanding of how environmental conditions alter recombination landscapes will prove
most useful in understanding this remaining portion of unexplained variation.
As it stands, models incorporating chromatin accessibility and DNA-binding
directors of recombination seem to be the most promising. However, the specific proteins
and mechanisms responsible for the localization of the vast majority of recombinational
events remain elusive. Recombination-selection based experiments and manipulation of
individual recombination correlates are likely required in order to develop the overall picture
more completely. In the near future, we mustn’t become nearsighted in our adoption of
individual factors (e.g. only PRDM9), but rather be open to the likelihood that many players
are involved in this highly complex song and dance. Characterizing the unifying and
species-specific factors that alter the many levels of recombination rate variation is a key
step in our understanding. Of course, the mechanisms responsible for DSB localization
may themselves vary dramatically from organism to organism. Whatever the underlying
forces are revealed to be, it is increasingly apparent that nature has established a
remarkably plastic and striking method for altering the efficacy of selection in response to
stress.
5.5 Chapter 5 References
Adrian, A. and J. Comeron (2013). "The Drosophila early ovarian transcriptome provides
insight to the molecular causes of recombination rate variation across genomes." BMC
Genomics 14: 794.
Adrian, A., J. Cruz Corchado and J. Comeron. "In silico prediction of recombination rate
variation across the Drosophila melanogaster genome based on multiple DNA motif
analysis." (In Review).
Comeron, J. M., R. Ratnappan and S. Bailin (2012). "The many landscapes of recombination
in Drosophila melanogaster." PLoS Genet 8(10): e1002905.
Mancera, E., R. Bourgon, A. Brozzi, W. Huber and L. M. Steinmetz (2008). "High-resolution
mapping of meiotic crossovers and non-crossovers in yeast." Nature 454(7203): 479-485.
Singh, N. D., D. R. Criscoe, S. Skolfield, K. P. Kohl, E. S. Keebaugh and T. A. Schlenke (2015).
"EVOLUTION. Fruit flies diversify their offspring in response to parasite infection." Science
349(6249): 747-750.