University of Iowa
Iowa Research Online
Theses and Dissertations
Fall 2015

Fine scale recombination variation in Drosophila melanogaster
Andrew B. Adrian
University of Iowa

Copyright © 2015 Andrew Blake Adrian. All Rights Reserved.

This dissertation is available at Iowa Research Online: http://ir.uiowa.edu/etd/2175

Recommended Citation
Adrian, Andrew B. "Fine scale recombination variation in Drosophila melanogaster." PhD (Doctor of Philosophy) thesis, University of Iowa, 2015. http://ir.uiowa.edu/etd/2175.

A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Biology in the Graduate College of The University of Iowa, December 2015.
Thesis Supervisor: Associate Professor Josep M. Comeron

Graduate College, The University of Iowa, Iowa City, Iowa
CERTIFICATE OF APPROVAL
PH.D. THESIS
This is to certify that the Ph.D. thesis of Andrew Blake Adrian has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Biology at the December 2015 graduation.
Thesis Committee: Josep Comeron (Thesis Supervisor), Ana Llopart, Robert Malone, Sarit Smolikove, Marc Wold

To my grandfather, Nathan Pearson, for instilling in me the value of curiosity.

ACKNOWLEDGEMENTS
Over the course of the last ten years, I have encountered many people that have had an impact on my development as a scientist and as a thoughtful individual.
As I approach the final months of my doctoral education, it seems appropriate to acknowledge those people who have meant the most to me. During my tenure as an undergraduate at the University of Alabama in Huntsville, my mentor Bruce Stallsmith inspired me to follow my own interests and taught me how to systematically analyze the world around me. There too I made friends with exceptional individuals who encouraged me to pursue science. Derek and Nick, I am eternally grateful for your friendship. At the University of Iowa, I have learned a great deal from my adviser, Josep Comeron, who has taught me to figure things out on my own rather than just giving me the answer and has provided thoughtful and well-reasoned feedback throughout my journey in graduate school. To my committee members, Bob, Ana, Marc, and Sarit: thank you for your precious time and helpful feedback. While riding the sinusoidal wave of life, I met my soon-to-be wife, Claire, who has altered my world view more than any other. Thank you for always being there for me and for being you. Furthermore, I am indebted to my family for their persistent and unwavering support. My mother and father have always believed in me, and my sister’s pride in me is continually flattering. Finally, I’d like to thank the proverbial “Reviewer #3” who is always able to bring me down to reality and force me to question my own competence. Claire, I’m glad I thought of that.

ABSTRACT
Natural variation is a principal component of biology. One process that affects levels of natural variation is meiotic recombination, the process by which homologous chromosomes break and interchange genetic information with one another during the formation of gametes. Surprisingly, this factor that shapes levels of natural variation within the genome also exhibits a great deal of variation.
That variation in the distribution of recombination rates manifests itself at many levels: within genomes, between individual organisms, across populations, and among species. In most cases, the factors and mechanisms responsible for the non-random patterning of recombination events across the genome remain particularly elusive. Herein, I utilize a combination of bioinformatic and molecular genetic approaches to better explain recombination patterning. I explore several factors that are now known to contribute to the distribution of recombination events across genomes. In particular, I demonstrate that transcriptional activity during meiosis is associated with, and partially predictive of, crossing over events in Drosophila melanogaster. Additionally, I present a model that is capable of accounting for approximately 40% of the variation in crossover rates in Drosophila based on the localization of several previously identified DNA motifs. Lastly, I present preliminary data describing how recombination patterns are altered under naturally stressful conditions, a key insight that is necessary for uniting our findings between levels of variation in recombination rates. These findings support a multifactorial model for crossover distribution that includes both genetic and epigenetic factors and will further progress the field in developing a comprehensive understanding of recombination localization.

PUBLIC ABSTRACT
Meiotic recombination, the process by which parental chromosomes are broken and repaired to generate a patchwork of genetic material, maintains genetic diversity by allowing individual sites in the DNA to take on separate evolutionary trajectories. This process of forming and repairing breaks in double-stranded DNA occurs non-randomly across the genome. The reasons for the non-random distribution of breaks and the factors that affect it are poorly understood in all organisms examined.
Because recombination shapes levels of genetic diversity, it is important to understand how and why recombination varies in order to draw conclusions about the evolution of a species and in order to assist in the association between many diseases and their causative genetic mutations. The work presented in this thesis is an investigation of recombination rate variation in a fruit fly model and the factors that affect its genomic distribution. In particular, I present a study of meiotic gene expression and how recombination rates are affected by genes that are ‘on’ during the formation of double-stranded breaks in the DNA. I also present a study of DNA motifs and how these motifs may be predictive of recombination rate variation. Lastly, I examine how recombination rates are altered under stressful conditions to better understand how the many levels of recombination variation are intertwined. My findings demonstrate that many factors, both genetic and epigenetic, are involved in shaping variation in recombination rates, and a combination of each can account for a large proportion of the variation we observe within genomes.

TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS

CHAPTER 1: Introduction
1.1 Meiotic Recombination
1.1.1 A General Overview of Meiotic Recombination
1.1.2 General Pathways of DSB Induction, Resolution
1.1.3 Advantages of Recombination
1.1.4 Recombination and Polymorphism
1.2 Recombination Rate Variation
1.2.1 The Many Levels of Recombination Rate Variation
1.2.2 Differences Within & Between Species
1.2.3 Sex-Biased Recombination
1.2.4 Genomic and Environmental Variation in Recombination Rates
1.2.5 Explaining Recombination Variation
1.3 Chapter 1 References

CHAPTER 2: The Drosophila early ovarian transcriptome provides insight to the molecular causes of recombination rate variation across genomes
2.0.1 Preface
2.0.2 Abstract
2.1 Background
2.2 Results & Discussion
2.2.1 General Patterns of the Drosophila Early Meiotic Transcriptome
2.2.2 Differentially Expressed Genes in Early Meiotic Tissues
2.2.3 New Genes and Isoforms
2.2.4 Parent-of-Origin Effects in the Early Meiotic Tissue
2.2.5 Transcription is Associated with Increased Recombination Rates
2.3 Conclusions
2.4 Methods
2.4.1 Drosophila Stocks and Tissue Preparation
2.4.2 Illumina Library Preparation and Sequencing
2.4.3 Sequence Alignment and Expression Analyses
2.4.4 Novel Gene Identification
2.4.5 Parent-of-Origin Effects
2.4.6 Genomic Distribution of Transcribed Genes
2.4.7 Recombination vs Expression Analysis
2.5 Supplementary Information
2.6 Chapter 2 References

CHAPTER 3: In silico prediction of recombination rate variation across the Drosophila melanogaster genome based on multiple DNA motif analysis
3.0.1 Preface
3.0.2 Abstract
3.1 Background
3.2 Results & Discussion
3.2.1 Generation of Genome-Wide landscapes of DNA motifs
3.2.2 Variation in motif presence among chromosome arms
3.2.3 Genomic co-occurrence of motifs
3.2.4 Motif presence is correlated with crossover rates across the genome
3.2.5 A predictive model of variation in crossover rates across the genome based on sequence motif occurrence
3.2.6 Random Forests (RF) categorical modeling
3.2.7 MARS modeling
3.3 Conclusions
3.3.1 Motif Composition
3.3.2 Differences among Chromosome Arms
3.3.3 Crossover Localization across the Genome
3.4 Methods
3.4.1 Motif Landscape Generation
3.4.2 Model Generation and Attribute Selection
3.4.3 Population-scaled high-resolution crossover maps
3.5 Supplementary Information
3.5 Chapter 3 References

CHAPTER 4: Environmental contribution to recombination variation
4.1 Introduction
4.2 Results & Discussion
4.2.1 Identification of Stress-Inducing Conditions
4.2.2 Recombination Frequency among Phenotypic Markers
4.3 Materials and Methods
4.3.1 Generation of Recombinants
4.3.2 DNA Library Preparation
4.3.3 Recombination rate estimation
4.4 Chapter 4 References

CHAPTER 5: Conclusions & Future Directions
5.1 Transcriptional Effects on Crossover Localization
5.2 In silico Prediction of Recombination Rates Based on Multiple Motifs
5.3 Environmental Impact on Recombination Rates
5.4 Summary
5.5 Chapter 5 References

LIST OF TABLES
Table 2.1 mRNA-seq statistics for each sample
Table 2.2 Top ten differentially expressed genes by fold-change
Table 2.3S Top Enriched GO Terms among genes with parent-of origin effects with maternal-like transcript levels for Early- and Late-Ovarian samples
Table 2.4S Top Enriched GO Terms among genes with parent-of origin effects with maternal-like transcript levels
Table 3.1S Statistics of motif presence in D. melanogaster
Table 3.2S Summary of MARS models of crossover distribution in D. melanogaster

LIST OF FIGURES
Figure 1.1 Double strand break repair and recombination during meiosis
Figure 2.1 Comparison of Log10 FPKM values
Figure 2.2 Transcriptional differences between autosomes and the X chromosome
Figure 2.3 Relationship between transcription and recombination rates
Figure 2.4 Relative presence of DSBs across the genome
Figure 2.5S FPKM distribution density across genes
Figure 2.6S Volcano plot of genes by significance
Figure 2.7S Early vs Late Log2 fold difference histogram
Figure 2.8S P-value distribution following normalization
Figure 3.1 Genomic Landscape of Motif 3
Figure 3.2 Probability heatmap of correlation between motif presence and crossover rates for different chromosome arms
Figure 3.3 True-positive rate generated by a Random Forests (RF) model
Figure 3.4 MARS predictive models of crossover rates
Figure 3.5S Effect of false discovery thresholds on motif presence and sequence
Figure 3.6S Motif logos and individual Spearman’s correlation with crossover rates
Figure 3.7S Correlation between motif presence and crossover rates for different chromosome arms
Figure 3.8S LASSO coefficient paths
Figure 4.1 Surviving offspring in media containing acetic acid
Figure 4.2 Map distances between select phenotypic markers under acid stress
LIST OF ABBREVIATIONS
AUC  Area under the curve
cM  CentiMorgan
CO  Crossover
CoHR  Common homology region
CV  Cross-validation
DGRP  Drosophila genetic reference panel
DNA  Deoxyribonucleic acid
DSB  Double strand break
FDR  False discovery rate
FPKM  Fragments per kilobase of transcript per million mapped reads
GC  Gene conversion
GCV  Generalized cross-validation
H3K4me3  Histone 3 lysine 4 trimethylation
HR  Homologous recombination
Kb  Kilobase
LASSO  Least absolute shrinkage and selection operator
MARS  Multivariate adaptive regression splines
NHEJ  Non-homologous end joining
PCR  Polymerase chain reaction
PFM  Position frequency matrix
RAL  Raleigh, NC population
RF  Random forests
RG  Rwanda population
RNA  Ribonucleic acid
ZI  Zambia population

CHAPTER 1: Introduction
1.1 Meiotic Recombination
1.1.1 A General Overview of Meiotic Recombination
Meiosis (from the Greek meioun, meaning “to make small”) is the highly regulated process in which diploid progenitor cells undergo reductional and equational divisions yielding haploid gametes, ready for fertilization. The process of meiosis, by which gametes such as egg and sperm are created, ensures the proper passage of genetic material from one generation to the next in sexually reproducing organisms. The process is tightly regulated at many points, as failure at any of them can have disastrous effects for the next generation of organisms. One particular process within meiosis is of greatest interest to me: recombination. During the process of recombination, cells sacrifice the inherent stability of their double-stranded DNA molecules and intentionally form breaks in the DNA. This dance by which chromosomes are duplicated, broken and repaired, separated, and segregated into cells containing half of the parental DNA content leads to gametes ready for fertilization in order to establish the next generation.
The process of meiosis is heavily conserved across evolutionary time, as sex presents itself as the rule, rather than the exception, throughout the great majority of animal life. While many organisms are capable of asexual reproduction, sex appears to be ubiquitous in Eukarya due to the need to generate recombinant offspring. Indeed, the machinery necessary for the interchange of homologous DNA seems to have evolved well before the appearance of eukaryotes as a mechanism for DNA repair. Intriguingly, while the process as a whole is heavily conserved, organisms display a wide diversity in terms of the specifics of each particular organism’s program of meiosis. However, one aspect of meiosis that appears nearly universal is the process of meiotic recombination. One notable exception, however, is that males of D. melanogaster do not exhibit recombination. Very few multicellular organisms are known to lack recombination (and indeed, many, if not most, unicellular organisms possess the capability). Outside of meiosis, homologous recombination exists as a repair mechanism for sporadic double-stranded breaks (DSBs) and other types of damage in somatic cells (Kuzminov 1999, Wyman and Kanaar 2006). Within meiosis, homologous recombination is the process in which homologous chromosomes sacrifice the inherent stability of their double-stranded DNA molecules by programmatically forming double-stranded breaks in the DNA, and then repairing those breaks with their homologous partner, generating recombinant chromosomes containing a patchwork of genetic information from both parental chromosomes.
These breaks and their subsequent repair manifest themselves as either crossing-over (CO) events (Figure 1.1), in which a double-Holliday junction is formed and resolved, or as gene-conversion (GC) events, where a strand of DNA is used as the repair template in either a Holliday-dependent (as a crossover that is repaired symmetrically) or -independent (via synthesis dependent strand annealing) manner, yielding unequal resultant products (Mehrotra, Hawley et al. 2008). During meiosis, all DSBs are repaired through either CO or GC (Figure 1.1), and thus the sum of both events equals the total DSBs, though this thesis is primarily concerned with those DSBs that are repaired as CO events. While small tracts of heteroduplex DNA are in fact formed in CO, I will exclude these regions in practice, referring to all non-crossovers as just GC, and all crossovers as CO. The simple ability for organisms to reorganize genetic variation within a population over generations yields extraordinary consequences for the evolution of the species. The primary focus of this thesis will be the distribution of these breaks and their subsequent repair as crossovers in meiosis. As we shall see, this process is important evolutionarily for the maintenance of diversity and plays an integral role in sculpting the genomes of species.

Figure 1.1 Double strand break repair and recombination during meiosis
The DSB repair via double Holliday junction can generate either crossover (CO) or non-crossover (gene conversion, GC) events while the Holliday junction-independent repair (synthesis-dependent strand annealing, SDSA) mechanism causes only GC events. (From Comeron et al. 2012)

1.1.2 General Pathways of DSB Induction, Resolution
Because organisms maintain the same number of chromosomes throughout generations, chromosome number before fertilization must be reduced by half.
Before this reductional division, the process of recombination occurs, generating a chromosomal patchwork that causes each gamete to be genetically distinct. The steps in which meiosis occurs (replication, recombination, and reduction) are carefully controlled to produce functional gametes and to avoid mistakes which could cause difficulties for the next generation. The initial work in meiosis was mainly produced through observations of segregating chromosomes with light microscopy (reviewed in Zickler and Kleckner 1999). Subsequently, researchers developed mutants and more advanced methods of detecting meiotic defects (Cox and Game 1974, Game and Mortimer 1974, Hopper 1975). Much of our current understanding comes from work that was performed in the budding yeast, Saccharomyces cerevisiae. As the steps and control mechanisms of meiosis have been found to be widely conserved, these yeast experiments have provided significant insight into the process of meiosis in other organisms (Roeder 1995, Zickler and Kleckner 1999). Prior to the first division in meiotic prophase I, homologous chromosomes pair, align, and undergo homologous recombination with one another. In most organisms recombination is mechanically necessary to mediate the pairing of homologous chromosomes during prophase I. The details and relationships between synapsis and recombination vary among organisms (Santos 1999), but these basic processes appear generally conserved and the steps of each, despite individualities endemic to particular organisms, seem highly regulated (John 2005). For instance, while the initiation of recombination is not required for synapsis in Drosophila (McKim, Green-Marroquin et al. 1998, Da Ines, Gallego et al. 2014) and C. elegans (Bhalla and Dernburg 2008, Da Ines, Gallego et al. 2014), recombination is essential for synapsis in Saccharomyces and all mammals thus far examined (Weiner and Kleckner 1994, Gerton and Hawley 2005, Zickler and Kleckner 2015).
In any event, where recombination is required for proper meiosis, disruption of any event leading to the formation or resolution of DSBs is catastrophic. In all recombining organisms, the formation of DSBs is catalyzed by a highly conserved topoisomerase-like protein resembling the Spo11 (Mei-w68 in Drosophila) protein from yeast (McKim and Hayashi-Hagihara 1998). This protein has been identified as an element necessary for DSB induction, upon which recombination is ultimately dependent. In yeast, at least nine other accessory proteins have been identified as required alongside Spo11 (MEI4, MER2, REC102, REC103, REC104, REC114, MRE11, RAD50, XRS2, as well as MER1 and MRE2) (Keeney 2001, Jiao, Salem et al. 2003, Prieler, Penkner et al. 2005). Most of these genes were identified to be required for the repair of radiation-induced DNA damage in S. cerevisiae. Spo11 catalyzes breaks in the DNA through a trans-esterification reaction involving a covalent protein-DNA intermediate and is subsequently removed through endonucleolytic cleavage (reviewed in Keeney 2008) by the Mre11-Rad50-Xrs2 (MRX) complex (Symington 2002). In most cases crossing over (CO) or gene conversion (GC) formation can be produced with exogenous DSB-inducing agents, but Spo11 is necessary in their absence (Keeney 2008). Though the distribution of DSBs is demonstrably non-random, the precise mechanism by which Spo11 and its cofactors determine sites to cleave is unknown, and will be a subject of discussion throughout this thesis (Petes 2001, de Massy 2003, Kauppi, Jeffreys et al. 2004). Though there is some evidence in S. pombe that Spo11 may possess a DNA binding domain, no consistent recombination-associated signal for the recruitment of Spo11 has thus far been identified (DeWall, Davidson et al. 2005).
It is possible that epigenetic labeling of chromatin influences recombination sites, as trimethylation of histone 3 lysine 4 (H3K4me3) is enriched near sites of DSB formation, and has further been shown to be recruited to the chromosomal axis by Spp1, where recombination events are carried out (Sollier, Lin et al. 2004, Borde, Robine et al. 2009, Acquaviva, Szekvolgyi et al. 2013, Sommermeyer, Beneut et al. 2013, Sun, Huang et al. 2015). Many breaks are formed across a chromosome while only a few are repaired as crossovers. As an example, Mus musculus generates 300-400 double-stranded breaks whereas only a few (20-35) are repaired as crossovers (Plug, Xu et al. 1996, Kolas, Svetlanov et al. 2005, Baudat and de Massy 2007). However, whether sites enriched for DSBs are always enriched for COs remains unclear, though some experiments in S. pombe indicate that some DSB hotspots are preferentially repaired with the sister chromatid (Hyppa and Smith 2010). It remains to be seen how precisely the decision to repair DSBs as crossovers is made. For a crossover event to occur, DSBs are exonucleolytically resected on the 5’ end of the DNA molecule by the MRN complex, leaving a single-stranded 3’ overhang that is free to invade a homologous chromosome sequence (Symington 2002, Yin and Smolikove 2013, Symington 2014). The single-stranded overhang becomes bound by Rad51, which promotes pairing and strand exchange (Sung and Robberson 1995). The invading strand is extended and recaptured by the broken chromatid, forming a double Holliday junction (Figure 1.1). Throughout this process, a proteinaceous scaffolding called the synaptonemal complex is erected to stabilize the synapsed homologs. Not all of the formed Holliday junctions will be repaired as crossovers; some may instead just form small (<1000bp) tracts of heteroduplex DNA known as non-crossover gene conversion events (Bhalla and Dernburg 2008, Comeron, Ratnappan et al. 2012, Padhukasahasram and Rannala 2013).
Both events, crossovers and non-crossover gene conversions, generate recombinant products but affect nucleotide polymorphism of populations at different scales: crossovers, while relatively rare, are large evolutionary events, while gene conversion events are much more prevalent and act on short, sub-1kb, scales (Ohta 1976, Ohta 1976, Walsh 1983, Padhukasahasram and Rannala 2013). The question remains, however: why does such a complicated system facilitating sexual reproduction exist?

1.1.3 Advantages of Recombination
For recombination to exist as pervasively as it does (as a product of natural selection) the advantages of recombination must outweigh the risks, but what are those advantages? The interchanging of bits of genetic information between homologous chromosomes serves several important functions, both mechanistically and evolutionarily. Mechanistically, recombination is critical for ensuring proper chromosomal pairing and disjunction in the reductional division. The formation of DSBs and their subsequent repair can, in many organisms, often only properly occur when chromosomes are synapsed. A notable exception to this is Drosophila melanogaster, in which males show no evidence for meiotic recombination. Meiotic crossovers help provide tension to the chromosome as homologous chromosomes align in preparation for separation, aiding the proper unipolar orientation towards the spindle poles (Bhalla and Dernburg 2008). Non-meiotic homologous recombination is an important process by which sporadic somatic DSBs may be repaired. In the event that homologous recombination fails to repair DSBs, the error-prone non-homologous end-joining (NHEJ) repair mechanism may be employed in some organisms such as C. elegans (Joyce, Paul et al. 2012, Yin and Smolikove 2013), potentially leading to fragmentation, chromosomal rearrangements (e.g.
translocations), or deletion of genetic information (Bhalla and Dernburg 2008); for that reason NHEJ, where tested, appears to be suppressed during meiosis (Joyce, Paul et al. 2012). In metazoans, failure to complete repair shunts the meiotic cell towards apoptosis (Gumienny, Lambie et al. 1999). Homologous recombination (HR) is therefore the preferred mechanism of repair, as alternatives are suppressed and failure of HR leads to death of the developing oocyte.

1.1.4 Recombination and Polymorphism

Evolutionarily, recombination has been extensively studied for its role in shaping genetic variation (Begun and Aquadro 1992, Rockman, Skrovanek et al. 2010, Comeron 2014, Wallberg, Glemin et al. 2015), limiting the accumulation of deleterious mutations (Comeron, Kreitman et al. 1999, Bachtrog 2003), and enhancing rates of adaptation (Nordborg, Hu et al. 2005, Presgraves 2005, Larracuente, Sackton et al. 2008). Recombination acts to free linked alleles from one another, untangling loci that would otherwise be tied to the fate of other proximal loci. The interference that arises without it is known as the Hill-Robertson effect (Hill and Robertson 1966), whereby selection at one site in the genome limits the effect of selection at another site when recombination is reduced or absent (Felsenstein 1974). In these arguments, linkage disequilibria (nonrandom associations between alleles) arise naturally because loci are physically linked on the same DNA molecule, and these associations impair the ability of selection to act efficiently at individual sites (Kliman and Hey 1993, Hey and Kliman 2002, Comeron, Williford et al. 2008). Because of linkage disequilibrium, sites that are linked to selected sites also experience a reduced effective population size (Ne). Recombination has the effect of increasing Ne and consequently enhances a population’s ability to respond to selection (see below). This plays out in two models of selection.
In the case of positive selection, alleles are pushed to high frequency through selective sweeps and carry with them any linked variants. This has the net effect of reducing genetic variation in areas proximal to the selected locus, as linked alleles replace all other variants in the region. Similarly, in models of background (negative) selection, deleterious alleles are removed from populations, taking all linked variation with them. Once again, this results in a net reduction of genetic diversity that may be ameliorated by recombination breaking linkage between genes, allowing selection to act on smaller regions of the genome. These effects have been well established, especially in Drosophila (Berry, Ajioka et al. 1991, Kliman and Hey 1993, Charlesworth 1996, Barton and Charlesworth 1998, McVean and Charlesworth 2000, Marais, Mouchiroud et al. 2001, Hey and Kliman 2002, Kliman and Hey 2003). In fact, it is this reduction in diversity that many evolutionary biologists search for as a hallmark of selection (Betancourt and Presgraves 2002, Smith and Eyre-Walker 2002, Kim and Nielsen 2004). Thus, in order to reach accurate conclusions regarding selection, researchers need to understand the effects of recombination. Recombination enables selection, whether positive or negative, to act at more precise points within the genome, increasing the efficacy of selection and preserving the nucleotide diversity that fuels the evolutionary process. Recombination affects levels of standing variation within a population, giving organisms the possibility of an immediate response to environmental change. Genetic variation is of paramount importance to the persistence of a species: without fuel for the fires of evolution, species fail to adapt to changing conditions and are extinguished. Genetic variation is the major source of the phenotypic and adaptive differences we observe from organism to organism and from species to species.
Therefore, our understanding of life through the lens of genetics is based upon the study of genetic variation in the genomes of living organisms. Herein, I present an effort to further understand how rates of meiotic recombination are distributed across genomes, in an attempt to better understand how this process shapes genomic patterns of nucleotide variation.

1.2 Recombination Rate Variation

1.2.1 The Many Levels of Recombination Rate Variation

Given how recombination rates alter the distribution, or “landscape”, of standing polymorphisms across the genome, it is perhaps surprising that the process of recombination itself varies significantly (Jensen-Seaman, Furey et al. 2004, Coop, Wen et al. 2008, Smukowski and Noor 2011, Comeron, Ratnappan et al. 2012). Recombination variation presents itself at several levels: between species, among populations of the same species, among individuals of the same population, between sexes, and within individual genomes. Using traditional mapping techniques, early researchers quickly discovered that the rate of crossing over was “…one of the most variable phenomena known” (Gowen 1919, Detlefsen and Clemente 1923). In fact, it was this variability that was key to resolving the Castle-Morgan debate over the linear arrangement of genes on chromosomes (Sturtevant, Bridges et al. 1919). In some species, such as humans, most of the genome is a veritable recombination desert (where recombination is infrequent) with a few hotspot oases (regions where recombination is hundreds to thousands of times more frequent than the genome average) harboring >60% of all recombination events (Petes 2001, Myers, Bottolo et al. 2005, Serrentino and Borde 2012). As the focus of this thesis is on Drosophila, it must be noted that D. melanogaster does not possess classical hotspots with thousand-fold enrichment, but rather regions with 10-20-fold increases (Hey 2004, Comeron, Ratnappan et al. 2012).
Why some regions are more or less likely to recombine than others is a complex question that remains largely unanswered, and it is the focus of this thesis. In the following sections I provide evidence for the many levels of variation that we observe in recombination rates.

1.2.2 Differences Within & Between Species

Just as a honeybee is morphologically distinct from a human, so too do the macro-scale recombination rates of the two differ. The genome-wide average recombination rate for honeybees is approximately 22 centiMorgans per megabase (cM/Mb) (Gempe, Hasselmann et al. 2009, Wallberg, Glemin et al. 2015), while in humans the rate is approximately 1.2 cM/Mb (Jensen-Seaman, Furey et al. 2004). The life history of each organism may play a role, too. For instance, while the budding yeast S. cerevisiae may not recombine every generation in nature (perhaps only once in every 1000 generations in the related S. paradoxus; Tsai, Bensasson et al. 2008), when it does undergo meiosis its genome-wide average recombination rate reaches approximately 360 cM/Mb (Connallon and Knowles 2007, Tsai, Bensasson et al. 2008). Furthermore, the chromosomal distribution of recombination can vary widely between species. C. elegans and D. melanogaster have similar genome-average recombination rates (2.7 and 2.44 cM/Mb, respectively; Rockman and Kruglyak 2009, Fiston-Lavier, Singh et al. 2010), but the distribution of COs is drastically different, with maximum CO rates near the edges of C. elegans chromosomes and Drosophila maxima located centrally between centromeres and telomeres on each chromosome arm (Hillers and Villeneuve 2003, Comeron, Ratnappan et al. 2012). While large differences in recombination between species are not necessarily surprising, the fact that large differences also exist within a species is much less easily explained (Koehler, Cherry et al. 2002, Jeffreys and Neumann 2005, Stevison and Noor 2010, Heil and Noor 2012).
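As a worked illustration of the units used above, the following minimal Python sketch treats an average recombination rate as genetic map length (cM) divided by physical length (Mb) and compares the genome-wide averages quoted in the text. The function names and the 55 cM / 22 Mb chromosome arm are hypothetical and chosen only for illustration; the species averages are the values cited above.

```python
# Genome-wide average recombination rates quoted in the text (cM/Mb).
QUOTED_RATES = {
    "honeybee": 22.0,
    "human": 1.2,
    "S. cerevisiae": 360.0,
    "C. elegans": 2.7,
    "D. melanogaster": 2.44,
}


def rate_cm_per_mb(genetic_length_cm: float, physical_length_mb: float) -> float:
    """Average recombination rate: genetic length divided by physical length."""
    if physical_length_mb <= 0:
        raise ValueError("physical length must be positive")
    return genetic_length_cm / physical_length_mb


def fold_difference(species_a: str, species_b: str) -> float:
    """Ratio of the quoted average rate of species_a to that of species_b."""
    return QUOTED_RATES[species_a] / QUOTED_RATES[species_b]


# Hypothetical chromosome arm (illustrative numbers only): 55 cM over 22 Mb.
print(rate_cm_per_mb(55.0, 22.0))                         # 2.5
print(round(fold_difference("honeybee", "human"), 1))     # 18.3
print(round(fold_difference("C. elegans", "D. melanogaster"), 2))
```

On the quoted figures, the honeybee average is roughly 18 times the human average, whereas the C. elegans and D. melanogaster averages are nearly equal, which is why their contrasting chromosomal distributions are notable.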
In closely related populations of Mus musculus, Dumont and Payseur (2011) identified significant variation in the number of meiotic MLH1 foci (a marker of crossovers), with some populations differing from others by as much as 35%. Other studies in Homo (Graffelman, Balding et al. 2007) and Drosophila (Comeron, Ratnappan et al. 2012) have identified similar population-scale differences among members of the same species. What remains to be determined, however, is whether the factors affecting inter- and intra-species recombination rate variation are the same, and whether these factors are shared at other levels of recombination variation.

1.2.3 Sex-Biased Recombination

Recombination also varies between sexes. It has been known for well over 100 years that in some organisms one sex is achiasmate, that is, it lacks recombination; this condition has arisen independently at least 30 times throughout evolution (Charlesworth 2005). In general, certain patterns seem to be pervasive, which may suggest a consistent evolutionary trend. For example, when there is recombination in only one sex, it is almost always the homogametic sex, and when there is recombination in both sexes, the homogametic sex often has the higher recombination rate, though there are a few exceptions (Lenormand 2003, Hedrick 2007). The tendency of the homogametic sex to possess higher recombination rates is known as the Haldane-Huxley rule, and is thought to be a side effect on autosomes of the suppression of recombination between the sex chromosomes (Lenormand and Dutheil 2005). Among heterochiasmate species (those with recombination in both sexes) there seems to be a great range in the ratio of heterogametic to homogametic recombination rates: 0.35 in the zebrafish, 0.62 in humans, and 0.83 in sheep (reviewed in Lenormand 2005).
Furthermore, the pattern of recombination varies with sex: human males have higher recombination at telomeric regions and females have higher recombination near the centromere, as noted by Kong and others (Kong, Gudbjartsson et al. 2002), who measured this using 1257 meiotic events detected with 5136 microsatellite markers. Why this discrepancy in recombination between the sexes exists is a matter of considerable debate, with several explanations offered by Haldane (1922), Trivers (1988), Burt et al. (1991) and Lenormand (2003) (reviewed in Hedrick 2007). In particular, Lenormand and Dutheil (2005) suggest that the sex with more intense haploid selection should have less recombination, and indeed this explanation seems adequate for plants and most animals that have repressed male recombination relative to their female counterparts.

1.2.4 Genomic and Environmental Variation in Recombination Rates

Intriguingly, and central to this thesis, recombination events occur nonrandomly within individual genomes. Complicating the search for the causes of this distribution is that recombination rates within the same genome are altered under differing environmental conditions (Nachman 2002). These two observations, that recombination events occur nonrandomly and are affected by environmental conditions, suggest an interplay between epigenetics (in the context of thermodynamic alterations) and genomics in the localization of recombination events. Thus far, it is unknown whether variation within individual genomes results from the same factors as variation at other levels. However, we do have many observations of the types of variables that seem to influence recombination rates. For example, recombination rates have long been known to be altered by temperature (Bridges 1927), food quality, maternal age (Plough 1917, Charlesworth and Charlesworth 1985), and exposure to various substances (Plough 1917, Bridges 1927).
That recombination rates are altered under different environmental conditions may offer hints about the underlying mechanisms of DSB formation and/or repair with respect to localization. This idea serves as the basis of the work performed in chapter 4. Whereas certain factors (radiation, age, and temperature) were found to alter crossover frequency in Drosophila, others (food moisture content, starvation, and the presence of ferric chloride) were found to have no significant effect (Plough 1917). It was also apparent that certain chromosomal regions responded differentially to each factor (Bridges 1927, Abdullah and Charlesworth 1974, Charlesworth and Charlesworth 1985), an early suggestion of underlying chromosome structure effects that largely continue to elude researchers. The use of multiple phenotypic markers and repeated crossing to assay crossover frequency continues today. There are recent reports employing this traditional genetic mapping technique to investigate fine scale recombination patterns from the Noor Lab in D. pseudoobscura (Cirulli, Kliman et al. 2007, Kulathinal, Bennett et al. 2008), the Singh Lab in D. melanogaster (Singh, Stone et al. 2013, Singh, Criscoe et al. 2015), and ours (Chapter 4), though more modern techniques involving high-throughput sequencing have begun replacing traditional approaches in each of the above laboratories as well. A hybrid approach incorporating both high-throughput sequencing and traditional mapping is being undertaken by our lab, and is discussed further in Chapter 4. Recombination rates have been reasonably well studied at the level of entire genomes. Whole-genome distributions of recombination rates, or “recombination landscapes”, reveal an incredible amount of thus far unexplained variation in recombination rates. Recombination landscapes are different for each species; in humans, sperm typing experiments revealed the punctate nature of recombination rates (Jensen-Seaman, Furey et al.
2004, Graffelman, Balding et al. 2007, Coop, Wen et al. 2008). While humans and Drosophila have similar sex-averaged recombination rates (albeit without recombination in Drosophila males), the chromosomal distribution of recombination varies greatly between the species (Jensen-Seaman, Furey et al. 2004). As a broad pattern, in most eukaryotes the rate of crossing over peaks near the center of each chromosomal arm, with a great reduction of COs in heterochromatic telomeric and centromeric regions (Smukowski and Noor 2011). As more information was gathered about recombination in humans through the typing of sperm using PCR assays, researchers began to notice small regions of the genome, 1-2 kb in length, with greatly enriched recombination (Jeffreys, Kauppi et al. 2001). These so-called 'hotspots' of recombination have been found in yeast and mammals, and in humans may account for as much as 60% of recombination in as little as 6% of the genome (Frazer and O'Keefe 2007, Paigen and Petkov 2010, Parvanov, Petkov et al. 2010). This sporadic distribution of a few highly recombining regions can be explained, in part, by the occurrence of the PRDM9-associated motif. This motif, originally identified by Myers and colleagues (Myers, Freeman et al. 2008), represents a binding site for PRDM9, a histone methyltransferase, but its role in recombination patterning was not apparent until two groups independently linked it to mouse hotspots (Grey, Baudat et al. 2009, Parvanov, Petkov et al. 2010; the history of PRDM9 is reviewed in Ségurel, Leffler et al. 2011). Other mammals have been shown to harbor PRDM9, though in some cases the PRDM9 gene appears to have been lost (Muñoz-Fuentes, Di Rienzo et al. 2011, Auton, Rui Li et al. 2013). In any case, PRDM9 cannot explain the majority of recombination variance in any species thus far examined, and furthermore offers limited predictive power to identify hotspots a priori.
For example, the PRDM9 motif appears over 300,000 times in the human genome, but fewer than 30,000 hotspots have been identified, making it a poor predictor of recombination rates (Ségurel, Leffler et al. 2011). Furthermore, there is evidence that when PRDM9 is knocked down or absent (such as in dogs), recombination sites are shifted towards promoters and recombination rates are more evenly distributed, as is the case in Drosophila (Grey, Baudat et al. 2009, Brick, Smagulova et al. 2012, Auton, Rui Li et al. 2013). In chapter 3, I present my work on other recombination-associated DNA motifs (repeated strings of short, variable nucleotide sequences) in an attempt to reveal whether there are other associated and predictive motifs within the genome of D. melanogaster.

1.2.5 Explaining Recombination Variation

Having established the many levels at which species exhibit recombination variation, I now turn to a discussion of how such variation has been proposed to arise. Early speculation about the modification of crossing-over rates was that variation might arise from how tightly wound chromosomes are. By this logic, winding of DNA would alter the distance between genes and thus alter crossover frequency (Bridges 1927). Further, while early researchers hypothesized that genic modifiers of recombination existed, they were unable to find a correlation between specific genes and elevated or depressed crossover frequency (Gowen 1919). We now understand that recombination rates are in fact modified by a variety of factors, which has made progress in the field more difficult. Rates of crossing over are known to respond to selection. Early studies by Kidwell (1972) and Chinnici (1971) demonstrated that artificial selection for increased and (to a much lesser degree) decreased rates of crossing over could be successful in Drosophila.
These results reveal that recombination rates, or at least rates of crossing over, exhibit heritable variation and are not solely a product of environmental conditions. Broad-scale interpretations of heritable recombination landscapes have led to attempts to quantify recombination on a whole-chromosome scale by comparing genetic and physical maps and fitting a seemingly appropriate function to the data, an approach that neglects most local variation (Hey and Kliman 2002, Fiston-Lavier, Singh et al. 2010). Indeed, our present concept of genetic maps hinges upon this idea of a species-average recombination rate, but these maps are often constructed using divergent strains or by combining data from different environmental conditions, a poor choice since both of these variables influence recombination rates. These sorts of maps often correlate well with levels of polymorphism (Begun and Aquadro 1992, Hellmann, Ebersberger et al. 2003, Barton 2010, Comeron, Ratnappan et al. 2012). Indeed, the best predictor of recombination rates thus far is the amount of nucleotide variation. This has been observed in a vast array of species (Begun and Aquadro 1992, Stephan and Langley 1998, Nachman 2002, Hellmann, Ebersberger et al. 2003). Unfortunately, this appears to be an effect, rather than a cause, of increased recombination (Maynard Smith and Haigh 1974, Charlesworth, Morgan et al. 1993). Due to this chicken-or-egg problem, caution must be employed when interpreting genomic signatures as causes of recombination. Sequence motifs have also been found to be correlated with recombination rates in some instances. Different regulatory motifs have been proposed to direct protein machinery to specific locations within the genome to catalyze the induction of DSBs. These motifs are typically known as hotspot motifs, as they have traditionally been found to be enriched near punctate sites of heavy recombination.
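To make the idea of motif enrichment concrete, here is a minimal Python sketch, under stated assumptions, that compares the density of a motif in windows of high versus low recombination. The motif CACA, the toy sequences, and all function names are hypothetical placeholders rather than the motifs or methods analyzed in this thesis, and a density ratio of this kind measures association only, not causation.

```python
def count_overlapping(sequence: str, motif: str) -> int:
    """Count occurrences of motif in sequence, allowing overlapping matches."""
    if not motif:
        raise ValueError("motif must be non-empty")
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        # Advance by one base so that overlapping matches are also counted.
        start = sequence.find(motif, start + 1)
    return count


def motif_density(sequences, motif):
    """Motif occurrences per base across a collection of sequences."""
    total_bases = sum(len(s) for s in sequences)
    if total_bases == 0:
        raise ValueError("no sequence provided")
    total_hits = sum(count_overlapping(s, motif) for s in sequences)
    return total_hits / total_bases


def enrichment(high_rec, low_rec, motif):
    """Ratio of motif density in high- versus low-recombination windows."""
    background = motif_density(low_rec, motif)
    if background == 0:
        return float("inf")
    return motif_density(high_rec, motif) / background


# Toy data (hypothetical): windows labelled by recombination rate.
HIGH_REC_WINDOWS = ["ACACACAGT", "TTCACATT"]
LOW_REC_WINDOWS = ["ACGTCACAGT", "GGGGTTTTAA"]

print(round(enrichment(HIGH_REC_WINDOWS, LOW_REC_WINDOWS, "CACA"), 2))  # 3.53
```

In this toy example the motif occurs three times in 17 bases of high-recombination sequence and once in 20 bases of low-recombination sequence, giving an enrichment of about 3.5.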
Specific motifs have been shown to activate recombination hotspots (though they are neither necessary nor sufficient) in humans (Myers, Freeman et al. 2008), mice (Baudet, Lemaitre et al. 2010), and S. pombe (Steiner, Steiner et al. 2009, Steiner, Kohli et al. 2010). Less is known about the impact of motifs in other organisms, especially those without bona fide recombination hotspots. It is also unknown whether these motifs serve a distinctly structural function (by coding for a nuclease-sensitive site, for example), or whether they represent binding sites for transcriptional elements or nucleases. Previous results demonstrating the plasticity of recombination in response to either selection or environmental conditions suggest that the localization of DSB events is a multifactorial problem that is unlikely to be resolved using classical or population genetics techniques alone. For this reason, I have employed a variety of techniques to address this question. To begin the process of fully understanding recombination rates, and in order to accurately model the distribution of recombination rates across the genome, I began with a highly simplistic model:

$\mathrm{Rec}_i = \mathrm{Genetic}(\mathrm{Rec})_i + \mathrm{Epigenetic}(\mathrm{Rec})_i$

where the recombination rate at some site $i$ is determined by the genetic factors affecting recombination at site $i$ plus the epigenetic factors affecting recombination at site $i$. In order to accurately model this process, it is necessary to fill in these two terms. In chapter 2, I detail my attempt to define epigenetic factors affecting recombination and glean insight from the meiotic transcriptional environment (study published in BMC Genomics, Adrian and Comeron 2013). In chapter 3, I use primary DNA sequence alone in the search for recombination-modifying motifs in order to better define factors on the genetic side of the equation (study in review, Adrian, Cruz Corchado, Comeron 2015).
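The additive model above can be sketched as a toy calculation. This is only an illustration under stated assumptions: the per-site values are invented, the function names are hypothetical, and the real genetic and epigenetic terms would have to be estimated from data rather than supplied by hand.

```python
def predicted_rates(genetic, epigenetic):
    """Apply Rec_i = Genetic(Rec)_i + Epigenetic(Rec)_i at every site i."""
    if len(genetic) != len(epigenetic):
        raise ValueError("both terms must cover the same sites")
    return [g + e for g, e in zip(genetic, epigenetic)]


def residuals(observed, genetic, epigenetic):
    """Observed minus predicted rate per site: what the two terms leave unexplained."""
    predicted = predicted_rates(genetic, epigenetic)
    if len(observed) != len(predicted):
        raise ValueError("observed rates must cover the same sites")
    return [o - p for o, p in zip(observed, predicted)]


# Invented per-site contributions in cM/Mb (illustrative values only).
GENETIC_TERM = [1.0, 2.5, 0.5]
EPIGENETIC_TERM = [0.5, 0.5, 1.5]
OBSERVED = [1.5, 3.5, 1.0]

print(predicted_rates(GENETIC_TERM, EPIGENETIC_TERM))      # [1.5, 3.0, 2.0]
print(residuals(OBSERVED, GENETIC_TERM, EPIGENETIC_TERM))  # [0.0, 0.5, -1.0]
```

Sites with large residuals are those where the two terms, as currently defined, fail to account for the observed rate.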
Finally, in chapter 4, I preliminarily examine the effects of environment on fine scale variation in recombination rates. If we can properly characterize the major epigenetic and genetic factors affecting recombination, we should ideally be able to accurately model rates of recombination. With such a model of recombination rates we will be able to better understand how nucleotide variation is maintained in the genomes of recombining organisms, which has great implications for the evolution of species.

1.3 Chapter 1 References

Abdullah, N. F. and B. Charlesworth (1974). "Selection for reduced crossing over in Drosophila melanogaster." Genetics 76(3): 447-451. Acquaviva, L., L. Szekvolgyi, B. Dichtl, B. S. Dichtl, C. de La Roche Saint Andre, A. Nicolas and V. Geli (2013). "The COMPASS subunit Spp1 links histone methylation to initiation of meiotic recombination." Science 339(6116): 215-218. Auton, A., Y. Rui Li, J. Kidd, K. Oliveira, J. Nadel, J. K. Holloway, J. J. Hayward, P. E. Cohen, J. M. Greally, J. Wang, C. D. Bustamante and A. R. Boyko (2013). "Genetic recombination is targeted towards gene promoter regions in dogs." PLoS Genet 9(12): e1003984. Bachtrog, D. (2003). "Adaptation shapes patterns of genome evolution on sexual and asexual chromosomes in Drosophila." Nat. Genet. 34(2): 215-219. Barton, N. H. (2010). "Genetic linkage and natural selection." Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 365(1552): 2559-2569. Barton, N. H. and B. Charlesworth (1998). "Why sex and recombination?" Science 281(5385): 1986-1990. Baudat, F. and B. de Massy (2007). "Regulating double-stranded DNA break repair towards crossover or non-crossover during mammalian meiosis." Chromosome Res 15(5): 565-577. Baudet, C., C. Lemaitre, Z. Dias, C. Gautier, E. Tannier and M. F. Sagot (2010). "Cassis: detection of genomic rearrangement breakpoints." Bioinformatics 26(15): 1897-1898. Begun, D. J. and C. F. Aquadro (1992).
"Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster." Nature 356(6369): 519-520. Berry, A. J., J. W. Ajioka and M. Kreitman (1991). "Lack of polymorphism on the Drosophila fourth chromosome resulting from selection." Genetics 129(4): 1111-1117. Betancourt, A. J. and D. C. Presgraves (2002). "Linkage limits the power of natural selection in Drosophila." Proceedings of the National Academy of Sciences, USA 99(21): 13616-13620. Bhalla, N. and A. F. Dernburg (2008). "Prelude to a division." Annu Rev Cell Dev Biol 24: 397-424. Borde, V., N. Robine, W. Lin, S. Bonfils, V. Geli and A. Nicolas (2009). "Histone H3 lysine 4 trimethylation marks meiotic recombination initiation sites." EMBO J 28(2): 99-111. Brick, K., F. Smagulova, P. Khil, R. D. Camerini-Otero and G. V. Petukhova (2012). "Genetic recombination is directed away from functional genomic elements in mice." Nature 485(7400): 642-645. Bridges, C. B. (1927). "The Relation of the Age of the Female to Crossing over in the Third Chromosome of Drosophila Melanogaster." J Gen Physiol 8(6): 689-700. Burt, A., G. Bell and P. H. Harvey (1991). "Sex differences in recombination." Journal of Evolutionary Biology 4(2): 259-277. Charlesworth, B. (1996). "Background selection and patterns of genetic diversity in Drosophila melanogaster." Genet. Res. 68(2): 131-149. Charlesworth, B. and D. Charlesworth (1985). "Genetic variation in recombination in Drosophila. I. Responses to selection and preliminary genetic analysis." Heredity 54(1): 71-83. Charlesworth, B., M. T. Morgan and D. Charlesworth (1993). "The effect of deleterious mutations on neutral molecular variation." Genetics 134(4): 1289-1303. Chinnici, J. P. (1971). "Modification of recombination frequency in Drosophila. II. The polygenic control of crossing over." Genetics 69(1): 85-96. Cirulli, E. T., R. M. Kliman and M. A. F. Noor (2007). "Fine-scale crossover rate heterogeneity in Drosophila pseudoobscura."
J Mol Evol 64(1): 129-135. Comeron, J. M. (2014). "Background selection as baseline for nucleotide variation across the Drosophila genome." PLoS Genet 10(6): e1004434. Comeron, J. M., M. Kreitman and M. Aguade (1999). "Natural selection on synonymous sites is correlated with gene length and recombination in Drosophila." Genetics 151(1): 239-249. Comeron, J. M., R. Ratnappan and S. Bailin (2012). "The Many Landscapes of Recombination in Drosophila melanogaster." PLoS Genet 8(10): e1002905. Comeron, J. M., A. Williford and R. M. Kliman (2008). "The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations." Heredity (Edinb) 100(1): 19-31. Connallon, T. and L. L. Knowles (2007). "Recombination rate and protein evolution in yeast." BMC Evol Biol 7: 235. Coop, G., X. Wen, C. Ober, J. K. Pritchard and M. Przeworski (2008). "High-resolution mapping of crossovers reveals extensive variation in fine-scale recombination patterns among humans." Science 319(5868): 1395-1398. Cox, B. and J. Game (1974). "Repair systems in Saccharomyces." Mutat Res 26(4): 257-264. Da Ines, O., M. E. Gallego and C. I. White (2014). "Recombination-independent mechanisms and pairing of homologous chromosomes during meiosis in plants." Mol Plant 7(3): 492-501. de Massy, B. (2003). "Distribution of meiotic recombination sites." Trends Genet 19(9): 514-522. Detlefsen, J. A. and L. S. Clemente (1923). "Genetic Variation in Linkage Values." Proc Natl Acad Sci U S A 9(5): 149-156. DeWall, K. M., M. K. Davidson, W. D. Sharif, C. A. Wiley and W. P. Wahls (2005). "A DNA binding motif of meiotic recombinase Rec12 (Spo11) defined by essential glycine-202, and persistence of Rec12 protein after completion of recombination." Gene 356: 77-84. Dumont, B. L. and B. A. Payseur (2011).
"Genetic analysis of genome-scale recombination rate evolution in house mice." PLoS Genet 7(6): e1002116. Felsenstein, J. (1974). "The evolutionary advantage of recombination." Genetics 78(2): 737-756. Fiston-Lavier, A. S., N. D. Singh, M. Lipatov and D. A. Petrov (2010). "Drosophila melanogaster recombination rate calculator." Gene 463(1-2): 18-20. Frazer, L. N. and R. T. O'Keefe (2007). "A new series of yeast shuttle vectors for the recovery and identification of multiple plasmids from Saccharomyces cerevisiae." Yeast 24(9): 777-789. Game, J. C. and R. K. Mortimer (1974). "A genetic study of x-ray sensitive mutants in yeast." Mutat Res 24(3): 281-292. Gempe, T., M. Hasselmann, M. Schiott, G. Hause, M. Otte and M. Beye (2009). "Sex determination in honeybees: two separate mechanisms induce and maintain the female pathway." PLoS Biol 7(10): e1000222. Gerton, J. L. and R. S. Hawley (2005). "Homologous chromosome interactions in meiosis: diversity amidst conservation." Nat Rev Genet 6(6): 477-487. Gowen, J. W. (1919). "A Biometrical Study of Crossing Over. On the Mechanism of Crossing over in the Third Chromosome of Drosophila melanogaster." Genetics 4(3): 205-250. Graffelman, J., D. J. Balding, A. Gonzalez-Neira and J. Bertranpetit (2007). "Variation in estimated recombination rates across human populations." Hum Genet 122(3-4): 301-310. Grey, C., F. Baudat and B. de Massy (2009). "Genome-wide control of the distribution of meiotic recombination." PLoS Biol 7(2): e35. Gumienny, T. L., E. Lambie, E. Hartwieg, H. R. Horvitz and M. O. Hengartner (1999). "Genetic control of programmed cell death in the Caenorhabditis elegans hermaphrodite germline." Development 126(5): 1011-1022. Hedrick, P. W. (2007). "Sex: differences in mutation, recombination, selection, gene flow, and genetic drift." Evolution 61(12): 2750-2771. Heil, C. S. S. and M. A. F. Noor (2012). "Zinc Finger Binding Motifs Do Not Explain Recombination Rate Variation within or between Species of Drosophila."
PLoS ONE 7(9): e45055. Hellmann, I., I. Ebersberger, S. E. Ptak, S. Paabo and M. Przeworski (2003). "A neutral explanation for the correlation of diversity with recombination rates in humans." Am J Hum Genet 72(6): 1527-1535. Hey, J. (2004). "What's So Hot about Recombination Hotspots?" PLoS Biol 2(6): e190. Hey, J. and R. M. Kliman (2002). "Interactions between natural selection, recombination and gene density in the genes of Drosophila." Genetics 160(2): 595-608. Hill, W. G. and A. Robertson (1966). "The effect of linkage on limits to artificial selection." Genetical Research 8(3): 269-294. Hillers, K. J. and A. M. Villeneuve (2003). "Chromosome-wide control of meiotic crossing over in C. elegans." Curr Biol 13(18): 1641-1647. Hopper, A. K. K. J. H., Benjamin D, (1975). "Mating type and sporulation in yeast. II. Meiosis, recombination, and radiation sensitivity in an aa diploid with altered sporulation control." Genetics 80: 61-76. Hyppa, R. W. and G. R. Smith (2010). "Crossover invariance determined by partner choice for meiotic DNA break repair." Cell 142(2): 243-255. Jeffreys, A. J., L. Kauppi and R. Neumann (2001). "Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex." Nat Genet 29(2): 217-222. Jeffreys, A. J. and R. Neumann (2005). "Factors influencing recombination frequency and distribution in a human meiotic crossover hotspot." Human Mol Genet 14(15): 2277-2287. Jensen-Seaman, M. I., T. S. Furey, B. A. Payseur, Y. Lu, K. M. Roskin, C. F. Chen, M. A. Thomas, D. Haussler and H. J. Jacob (2004). "Comparative recombination rates in the rat, mouse, and human genomes." Genome Res 14(4): 528-538. Jiao, K., L. Salem and R. Malone (2003). "Support for a Meiotic Recombination Initiation Complex: Interactions among Rec102p, Rec104p, and Spo11p." Molecular and Cellular Biology 23(16): 5928-5938. John, B. (2005). Meiosis. New York, Cambridge University Press. Joyce, E. F., A. Paul, K. E. Chen, N. Tanneti and K. S.
McKim (2012). "Multiple barriers to nonhomologous DNA end joining during meiosis in Drosophila." Genetics 191(3): 739-746. Kauppi, L., A. J. Jeffreys and S. Keeney (2004). "Where the crossovers are: recombination distributions in mammals." Nat Rev Genet 5(6): 413-424. Keeney, S. (2001). "Mechanism and control of meiotic recombination initiation." Curr Topics Dev Biol Volume 52: 1-53. Keeney, S. (2008). "Spo11 and the Formation of DNA Double-Strand Breaks in Meiosis." Genome Dyn Stab 2: 81-123. Kidwell, M. G. (1972). "Genetic change of recombination value in Drosophila melanogaster. II. Simulated natural selection." Genetics 70(3): 433-443. Kim, Y. and R. Nielsen (2004). "Linkage disequilibrium as a signature of selective sweeps." Genetics 167(3): 1513-1524. Kliman, R. M. and J. Hey (1993). "Reduced natural selection associated with low recombination in Drosophila melanogaster." Mol Biol Evol 10(6): 1239-1258. Kliman, R. M. and J. Hey (2003). "Hill-Robertson interference in Drosophila melanogaster: reply to Marais, Mouchiroud and Duret." Genet Res 81(2): 89-90. Koehler, K. E., J. P. Cherry, A. Lynn, P. A. Hunt and T. J. Hassold (2002). "Genetic control of mammalian meiotic recombination. I. Variation in exchange frequencies among males from inbred mouse strains." Genetics 162(1): 297-306. Kolas, N. K., A. Svetlanov, M. L. Lenzi, F. P. Macaluso, S. M. Lipkin, R. M. Liskay, J. Greally, W. Edelmann and P. E. Cohen (2005). "Localization of MMR proteins on meiotic chromosomes in mice indicates distinct functions during prophase I." J Cell Biol 171(3): 447-458. Kong, A., D. F. Gudbjartsson, J. Sainz, G. M. Jonsdottir, S. A. Gudjonsson, B. Richardsson, S. Sigurdardottir, J. Barnard, B. Hallbeck, G. Masson, A. Shlien, S. T. Palsson, M. L. Frigge, T. E. Thorgeirsson, J. R. Gulcher and K. Stefansson (2002). "A high-resolution recombination map of the human genome." Nat Genet 31(3): 241-247. Kulathinal, R. J., S. M. Bennett, C. L. Fitzpatrick and M. A. Noor (2008).
"Fine-scale mapping of recombination rate in Drosophila refines its correlation to diversity and divergence." Proc Natl Acad Sci U S A 105(29): 10051-10056. Kuzminov, A. (1999). "Recombinational repair of DNA damage in Escherichia coli and bacteriophage lambda." Microbiol Mol Biol Rev 63(4): 751-813, table of contents. 25 Larracuente, A. M., T. B. Sackton, A. J. Greenberg, A. Wong, N. D. Singh, D. Sturgill, Y. Zhang, B. Oliver and A. G. Clark (2008). "Evolution of protein-coding genes in Drosophila." Trends Genet 24(3): 114-123. Lenormand, T. (2003). "The evolution of sex dimorphism in recombination." Genetics 163(2): 811-822. Lenormand, T. and J. Dutheil (2005). "Recombination difference between sexes: a role for haploid selection." PLoS Biol 3(3): e63. Marais, G., D. Mouchiroud and L. Duret (2001). "Does recombination improve selection on codon usage? Lessons from nematode and fly complete genomes." Proc Natl Acad Sci U S A 98(10): 5688-5692. Maynard Smith, J. and J. Haigh (1974). "The hitch-hiking effect of a favorable gene." Genet. Res. 23: 23-35. McKim, K. S., B. L. Green-Marroquin, J. J. Sekelsky, G. Chin, C. Steinberg, R. Khodosh and R. S. Hawley (1998). "Meiotic synapsis in the absence of recombination." Science 279(5352): 876-878. McKim, K. S. and A. Hayashi-Hagihara (1998). "mei-W68 in Drosophila melanogaster encodes a Spo11 homolog: evidence that the mechanism for initiating meiotic recombination is conserved." Genes Dev 12(18): 2932-2942. McVean, G. A. and B. Charlesworth (2000). "The effects of Hill-Robertson interference between weakly selected mutations on patterns of molecular evolution and variation." Genetics 155(2): 929-944. Muñoz-Fuentes, V., A. Di Rienzo and C. Vilà (2011). "Prdm9, a major determinant of meiotic recombination hotspots, is not functional in dogs and their wild relatives, wolves and coyotes." PLoS ONE 6(11): e25498. Myers, S., L. Bottolo, C. Freeman, G. McVean and P. Donnelly (2005). 
"A fine-scale map of recombination rates and hotspots across the human genome." Science 310(5746): 321-324. Myers, S., C. Freeman, A. Auton and P. Donnelly… (2008). A common sequence motif associated with recombination hot spots and genome instability in humans. Nature genetics. 26 Nachman, M. W. (2002). "Variation in recombination rate across the genome: evidence and implications." Curr Opin Genet Dev 12(6): 657-663. Nordborg, M., T. T. Hu, Y. Ishino, J. Jhaveri, C. Toomajian, H. Zheng, E. Bakker, P. Calabrese, J. Gladstone, R. Goyal, M. Jakobsson, S. Kim, Y. Morozov, B. Padhukasahasram, V. Plagnol, N. A. Rosenberg, C. Shah, J. D. Wall, J. Wang, K. Zhao, T. Kalbfleisch, V. Schulz, M. Kreitman and J. Bergelson (2005). "The pattern of polymorphism in Arabidopsis thaliana." PLoS Biol 3(7): e196. Ohta, T. (1976). "Simple model for treating evolution of multigene families." Nature 263(5572): 74-76. Ohta, T. (1976). "Simulation studies on the evolution of amino acid sequences." J Mol Evol 8(1): 1-12. Padhukasahasram, B. and B. Rannala (2013). "Meiotic gene-conversion rate and tract length variation in the human genome." Eur J Hum Genet. Paigen, K. and P. Petkov (2010). "Mammalian recombination hot spots: properties, control and evolution." Nat Rev Genet 11(3): 221-233. Parvanov, E. D., P. M. Petkov and K. Paigen (2010). "Prdm9 Controls Activation of Mammalian Recombination Hotspots." Science 327(5967): 835. Petes, T. D. (2001). "Meiotic recombination hot spots and cold spots." Nat Rev Genet 2(5): 360-369. Plough, H. H. (1917). "The Effect of Temperature on Linkage in the Second Chromosome of Drosophila." Proc Natl Acad Sci U S A 3(9): 553-555. Plug, A. W., J. Xu, G. Reddy, E. I. Golub and T. Ashley (1996). "Presynaptic association of Rad51 protein with selected sites in meiotic chromatin." Proc Natl Acad Sci U S A 93(12): 5920-5924. Presgraves, D. C. (2005). "Recombination enhances protein adaptation in Drosophila melanogaster." Current Biology 15(18): 1651-1656. 
Prieler, S., A. Penkner, V. Borde and F. Klein (2005). "The control of Spo11's interaction with meiotic recombination hotspots." Genes Dev 19(2): 255-269.
Rockman, M. V. and L. Kruglyak (2009). "Recombinational landscape and population genomics of Caenorhabditis elegans." PLoS Genet 5(3): e1000419.
Rockman, M. V., S. S. Skrovanek and L. Kruglyak (2010). "Selection at linked sites shapes heritable phenotypic variation in C. elegans." Science 330(6002): 372-376.
Roeder, G. S. (1995). "Sex and the single cell: meiosis in yeast." Proc Natl Acad Sci U S A 92(23): 10450-10456.
Santos, J. L. (1999). "The relationship between synapsis and recombination: two different views." Heredity 82(1): 1-6.
Ségurel, L., E. M. Leffler and M. Przeworski (2011). "The Case of the Fickle Fingers: How the PRDM9 Zinc Finger Protein Specifies Meiotic Recombination Hotspots in Humans." PLoS Biol 9(12): e1001211.
Serrentino, M. E. and V. Borde (2012). "The spatial regulation of meiotic recombination hotspots: are all DSB hotspots crossover hotspots?" Exp Cell Res 318(12): 1347-1352.
Singh, N. D., D. R. Criscoe, S. Skolfield, K. P. Kohl, E. S. Keebaugh and T. A. Schlenke (2015). "EVOLUTION. Fruit flies diversify their offspring in response to parasite infection." Science 349(6249): 747-750.
Singh, N. D., E. A. Stone, C. F. Aquadro and A. G. Clark (2013). "Fine-scale heterogeneity in crossover rate in the Garnet-Scalloped region of the Drosophila melanogaster X Chromosome." Genetics.
Smith, N. G. and A. Eyre-Walker (2002). "Adaptive protein evolution in Drosophila." Nature 415(6875): 1022-1024.
Smukowski, C. S. and M. A. Noor (2011). "Recombination rate variation in closely related species." Heredity (Edinb) 107(6): 496-508.
Sollier, J., W. Lin, C. Soustelle, K. Suhre, A. Nicolas, V. Geli and C. de La Roche Saint-Andre (2004). "Set1 is required for meiotic S-phase onset, double-strand break formation and middle gene expression." EMBO J 23(9): 1957-1967.
Sommermeyer, V., C. Beneut, E.
Chaplais, M. E. Serrentino and V. Borde (2013). "Spp1, a member of the Set1 Complex, promotes meiotic DSB formation in promoters by tethering histone H3K4 methylation sites to chromosome axes." Mol Cell 49(1): 43-54.
Steiner, S., J. Kohli and K. Ludin (2010). "Functional interactions among members of the meiotic initiation complex in fission yeast." Curr Genet 56(3): 237-249.
Steiner, W. W., E. M. Steiner, A. R. Girvin and L. E. Plewik (2009). "Novel Nucleotide Sequence Motifs That Produce Hotspots of Meiotic Recombination in Schizosaccharomyces pombe." Genetics 182(2): 459-469.
Stephan, W. and C. H. Langley (1998). "DNA polymorphism in Lycopersicon and crossing-over per physical length." Genetics 150(4): 1585-1593.
Stevison, L. S. and M. A. Noor (2010). "Genetic and evolutionary correlates of fine-scale recombination rate variation in Drosophila persimilis." J Mol Evol 71(5-6): 332-345.
Sturtevant, A. H., C. B. Bridges and T. H. Morgan (1919). "The Spatial Relations of Genes." Proc Natl Acad Sci U S A 5(5): 168-173.
Sun, X., L. Huang, T. E. Markowitz, H. G. Blitzblau, D. Chen, F. Klein and A. Hochwagen (2015). "Transcription dynamically patterns the meiotic chromosome-axis interface." Elife 4.
Sung, P. and D. L. Robberson (1995). "DNA strand exchange mediated by a RAD51-ssDNA nucleoprotein filament with polarity opposite to that of RecA." Cell 82(3): 453-461.
Symington, L. S. (2002). "Role of RAD52 epistasis group genes in homologous recombination and double-strand break repair." Microbiol Mol Biol Rev 66(4): 630-670.
Symington, L. S. (2014). "End resection at double-strand breaks: mechanism and regulation." Cold Spring Harb Perspect Biol 6(8).
Tsai, I. J., D. Bensasson, A. Burt and V. Koufopanou (2008). "Population genomics of the wild yeast Saccharomyces paradoxus: Quantifying the life cycle." Proc Natl Acad Sci U S A 105(12): 4957-4962.
Wallberg, A., S. Glemin and M. T. Webster (2015). "Extreme recombination frequencies shape genome variation and evolution in the honeybee, Apis mellifera." PLoS Genet 11(4): e1005189.
Walsh, J. B. (1983). "Role of biased gene conversion in one-locus neutral theory and genome evolution." Genetics 105(2): 461-468.
Weiner, B. M. and N. Kleckner (1994). "Chromosome pairing via multiple interstitial interactions before and during meiosis in yeast." Cell 77(7): 977-991.
Wyman, C. and R. Kanaar (2006). "DNA double-strand break repair: all's well that ends well." Annu Rev Genet 40: 363-383.
Yin, Y. and S. Smolikove (2013). "Impaired resection of meiotic double-strand breaks channels repair to nonhomologous end joining in Caenorhabditis elegans." Mol Cell Biol 33(14): 2732-2747.
Zickler, D. and N. Kleckner (1999). "Meiotic chromosomes: integrating structure and function." Annu Rev Genet 33: 603-754.
Zickler, D. and N. Kleckner (2015). "Recombination, Pairing, and Synapsis of Homologs during Meiosis." Cold Spring Harb Perspect Biol 7(6).

CHAPTER 2: The Drosophila early ovarian transcriptome provides insight to the molecular causes of recombination rate variation across genomes

2.0.1 Preface

Chapter 2 is a reprint of an article with the same title published in BMC Genomics in 2013 (Volume 14, Issue 794). Formatting changes and minor alterations have been made for consistency. The full original, unaltered document may be found at http://www.biomedcentral.com/1471-2164/14/794.

2.0.2 Abstract

Background: Evidence in yeast indicates that gene expression is correlated with recombination activity and double-strand break (DSB) formation in some hotspots, and studies of nucleosome occupancy in yeast and mice suggest that open chromatin influences the formation of DSBs. In Drosophila melanogaster, high-resolution recombination maps show an excess of DSBs within annotated transcripts relative to intergenic sequences.
The impact of active transcription on recombination landscapes, however, remains unexplored in a multicellular organism. We therefore investigated the transcription profile during early meiosis in D. melanogaster females to obtain a glimpse of the relevant transcriptional dynamics during DSB formation and to test the specific hypothesis that DSBs preferentially target transcriptionally active genomic regions.

Results: Our study of transcript profiles of early and late meiosis using mRNA-seq revealed 1) significant differences in gene expression, 2) new genes and exons, 3) parent-of-origin effects on transcription in early-meiosis stages, and 4) a nonrandom genomic distribution of transcribed genes. Importantly, genomic regions that are more actively transcribed during early meiosis show higher rates of recombination, and we ruled out DSB preference for genic regions that are not transcribed.

Conclusions: Our results provide evidence in a multicellular organism that transcription during the initial phases of meiosis increases the likelihood of DSBs and give insight into the molecular determinants of recombination rate variation across the D. melanogaster genome. We propose a model in which variation in gene expression plays a role in altering the recombination landscape across the genome, which could provide a molecular, heritable and plastic mechanism for observed patterns of recombination variation, from the high level of intra-specific variation to the known influence of environmental factors and stress conditions.

2.1 Background

High-resolution transcription profiles offer insight into a wide array of biological questions including the identification of genes involved in specific molecular processes, the understanding of cellular fate and organ differentiation, the importance of genic and epigenetic factors, and the complex response to environmental conditions (Wang, Gerstein et al. 2009).
With the rise of sequencing technologies such as RNA-seq and supporting methodologies, researchers are now able to obtain gene expression profiles (sensu levels of transcripts), potentially identifying rare or novel transcript forms that are only present in specific cells and/or at very precise developmental times (Tang, Barbacioru et al. 2009). One such cell population of interest lies within the anterior portions of the Drosophila ovary, where mitotic precursor cells begin their development into functional eggs and meiotic recombination occurs.

The Drosophila ovary has served as a model for meiosis (Lake and Hawley 2012), embryo patterning (Roth and Lynch 2009), and stem cell differentiation (Kirilly and Xie 2007, Spradling, Fuller et al. 2011). Drosophila females have two ovaries comprised of 10 to 20 tube-like structures, called ovarioles, clustered together with a spatiotemporal organization of progressively developing oocytes (Spradling 1993). Oogenesis in Drosophila starts within the anterior compartment of the ovariole, the germarium, where mitotic stem cells produce cystoblasts that undergo further cell division, generating a large 16-cell cyst with a single cystocyte becoming the oocyte. Before exiting the germarium as a stage-1 egg chamber, the primary oocyte will have entered pachytene and undergone meiotic recombination. These anteriormost portions of the Drosophila ovariole represent a highly active community of cells, regulated with remarkable fidelity, and yet constitute only a small fraction of the entire ovary (Morris and Spradling 2011). Previous whole-genome transcriptome analyses of whole ovaries therefore offer only an amalgamated view of its developmental and cellular complexity, limiting our understanding of the relevant gene expression activity of the germarium and early meiosis (Parisi, Nuttall et al. 2004, Gan, Chepelev et al. 2010).

The process of meiotic recombination in D.
melanogaster females begins with the initiation of double-strand breaks (DSBs). At a very broad scale, crossovers in Drosophila are distributed in a bell-shaped fashion along chromosomes, with a maximal rate in the center of a chromosomal arm that tapers off near centromeric and telomeric regions (Lindsley and Zimm 1992). This is also the case in many (e.g., mice, humans, Arabidopsis) but not all (e.g., Caenorhabditis elegans and C. briggsae) other eukaryotes. At finer scales, recombination maps have revealed substantial variation across chromosomes in all species analyzed, including Drosophila (Singer, Fan et al. 2006, Coop, Wen et al. 2008, Kulathinal, Bennett et al. 2008, Rockman and Kruglyak 2009, Cutter and Choi 2010, Dumont and Payseur 2011, Fledel-Alon, Leffler et al. 2011, Comeron, Ratnappan et al. 2012, McGaugh, Heil et al. 2012). In D. melanogaster, high-resolution mapping of more than 100,000 recombination events at a scale approaching gene-level resolution showed not only extreme heterogeneity in recombination rates across chromosomes but also that these landscapes of recombination vary significantly among individuals of the same species (Comeron, Ratnappan et al. 2012). Even within chromosomal regions traditionally assumed to have non-reduced recombination rates, crossover rates vary up to 80-fold when crossing two D. melanogaster strains, and 20-fold after combining genetic maps obtained from eight crosses of different strains (Comeron, Ratnappan et al. 2012). Beyond the differences across genomes, between species and within species, there is an important additional layer of complexity: recombination rates are plastic and influenced by factors such as temperature, food or maternal age (Stern 1926, Neel 1941, Redfield 1966, Parsons 1988, Priest, Galloway et al. 2008).

The molecular determinants leading to DSB localization across the Drosophila genome remain obscure, but a number of patterns are beginning to emerge [see (Comeron, Ratnappan et al.
2012) for details]. First, whereas human and mouse recombination hotspots are strongly influenced by the presence of the PRDM9-binding DNA motif (Myers, Bottolo et al. 2005, Baudat, Buard et al. 2010), no PRDM9 motif is detected in Drosophila (Comeron, Ratnappan et al. 2012, Heil and Noor 2012, Miller, Takeo et al. 2012, Singh, Stone et al. 2013). Second, analyses in Drosophila reveal many different DNA motifs significantly enriched in sequences surrounding recombination events, suggesting a fundamental qualitative difference between human/mouse and Drosophila DSB localization (Comeron, Ratnappan et al. 2012, Singh, Stone et al. 2013). Third, recombination events tend to occur within annotated transcript regions, thus suggesting a possible association between transcription, chromatin accessibility, and DSBs that are repaired as recombination events (Comeron, Ratnappan et al. 2012). This latter observation is in agreement with evidence in the yeast S. cerevisiae, where some, but not all, hotspots of recombination increase activity with open promoters (Petes 2001). Mapping of chromatin accessibility and nucleosome occupancy in yeast and mice (Kirkpatrick, Wang et al. 1999, Getun, Wu et al. 2010, Pan, Sasaki et al. 2011) also suggests, albeit more indirectly, that the formation of DSBs could be influenced by transcription, based on the known effect of transcription on chromatin remodeling and histone modifications. The impact of active transcription on meiotic DSB localization and recombination landscapes in a multicellular organism remains, however, unexplored.

Here, we employ RNA-seq to obtain and analyze the whole transcriptome of early meiotic D. melanogaster cells. We isolated cells from the germarium through stage 3 to substantially enrich the fraction of the sample that is actively experiencing early meiosis and DSB formation and to obtain a first glimpse of the potential influence of transcription on recombination localization across the Drosophila genome.
Our analyses uncover genes with germarium-specific expression patterns and novel transcripts. The study of offspring from reciprocal crosses also reveals distinct parent-of-origin effects that create differences in gene expression among genetically identical individuals. Finally, we identify a positive relationship across the genome between transcription in early meiotic cells and recombination rates. Importantly, recombination events are found to target actively transcribed genes relative to genes with no detectable transcription, thus allowing us to rule out that the observed association is due to DSB preference for static gene properties at the level of DNA sequence (e.g., G+C content). These results provide insight into the molecular determinants of recombination rate variation across the D. melanogaster genome and a clear path for future studies to assess the molecular causes of recombination variation among individuals and its plastic nature.

2.2 Results & Discussion

2.2.1 General Patterns of the Drosophila Early Meiotic Transcriptome

We isolated mRNAs from meiotic portions of the Drosophila ovary, dissecting the germarium and stages 1-3, and compared them to later, more developed regions of the ovary, hereafter referred to as ‘Early’ and ‘Late’, respectively (see Materials and Methods). We performed ultra-deep mRNA sequencing (mRNA-seq) that yielded over 467 million (M) 120-bp reads. Approximately 80% of these reads mapped correctly to the D. melanogaster genome reference sequence, with a total average coverage greater than 400x when mapped to annotated transcripts (Table 2.1). Each of our eight independently sequenced samples (see Materials and Methods) generated between 34.5 and 60.3M mapped reads, with a median mapped read count of 49.8M. There was no significant difference in total mapped read counts between Early and Late tissues (P = 0.59).
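The mapping percentages quoted above follow directly from the read counts in Table 2.1. As a minimal sketch (the function and variable names are ours and are not part of the analysis pipeline), the combined Early and Late counts give the per-condition and overall fractions of mapped reads:

```python
# Gross and mapped read counts for the combined samples, taken from Table 2.1.
COMBINED_READS = {
    "Early": {"gross": 231_032_550, "mapped": 179_734_201},
    "Late": {"gross": 236_771_062, "mapped": 193_783_815},
}


def percent_mapped(gross: int, mapped: int) -> float:
    """Return the percentage of sequenced reads that mapped to the reference."""
    return 100.0 * mapped / gross


# Totals across both conditions.
total_gross = sum(counts["gross"] for counts in COMBINED_READS.values())
total_mapped = sum(counts["mapped"] for counts in COMBINED_READS.values())
overall_percent = percent_mapped(total_gross, total_mapped)

for condition, counts in COMBINED_READS.items():
    print(condition, round(percent_mapped(**counts), 2))
print("Total reads (millions):", round(total_gross / 1e6, 1))
print("Overall % mapped:", round(overall_percent, 2))
```

The totals reproduce the figures in the text: slightly more than 467 million reads in total, of which roughly 80% mapped.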
Comparisons of Early versus Late transcript profiles show high similarity, with a strong correlation coefficient (Spearman's R = 0.952; P < 0.001; see Figure 2.1). There are 7,914 genes expressed in Early regions compared to 7,557 genes in Late regions. These results suggest that roughly 50% of all genes are actively transcribed, a value similar to the typical percentage in other Drosophila tissues, based on mRNA-seq (Chintapalli, Wang et al. 2007) or array-based comparisons of germarium and testes [46% of all genes expressed; (Cash and Andrews 2012)]. The detection of 5% more genes being transcribed (i.e., above an FPKM threshold of 1) in Early relative to Late meiosis (P = 0.015) is accompanied by an average level of transcription for active genes that is more than 13% lower in early meiosis (average FPKM of 89.1 and 103.1 for Early and Late stages, respectively; Wilcoxon Matched Pairs Test, Z = 44.5, P < 1x10^-12). Similar differences are observed when defining active genes based on FPKM greater than 0.1 (Z = 41.7, P < 1x10^-12).

The Drosophila X chromosome is enriched in genes preferentially or uniquely expressed in females (i.e., female-biased genes) and deficient in male-biased genes (Parisi, Nuttall et al. 2003, Ranz, Castillo-Davis et al. 2003, Sturgill, Zhang et al. 2007, Assis, Zhou et al. 2012, Llopart 2012, Meisel, Malone et al. 2012). Focusing only on the male germline, however, a recent study has shown that differences between X and autosomes are not caused by different gene content but by the lack of sex chromosome dosage compensation in Drosophila testes, which reduces transcript levels of X-linked genes (Meiklejohn and Presgraves 2012). In our deep-sequencing study, we see that the early ovarian transcriptome shows the expected “female” bias, with actively expressed genes unequally distributed among chromosomes: 60.4% of genes on the X chromosome are transcribed compared with 55.3% in autosomes (χ² = 20.9, P = 4.8x10^-6; see Figure 2.2).
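To illustrate the kind of test behind the X versus autosome comparison, the following minimal sketch computes a Pearson chi-square statistic for a 2x2 table of transcribed and non-transcribed genes, assuming one degree of freedom and no continuity correction. The gene counts are hypothetical and were chosen only to mimic the reported percentages; the function names are ours.

```python
import math


def chi2_sf_1df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # For 1 df the survival function reduces to erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Rows are chromosome classes (X, autosomes); columns are
    transcribed / not transcribed. Returns (statistic, p_value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column needs at least one gene")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    return statistic, chi2_sf_1df(statistic)


# Hypothetical counts: 2,000 X-linked genes (60.4% transcribed) and
# 12,000 autosomal genes (55.3% transcribed).
x_on, x_off = 1208, 792
auto_on, auto_off = 6636, 5364
stat, p = chi_square_2x2(x_on, x_off, auto_on, auto_off)
print(f"chi-square = {stat:.2f}, P = {p:.2e}")
```

Because the true gene counts are not used here, the statistic differs from the reported value; the helper `chi2_sf_1df` does, however, return probabilities consistent with the reported pairs of test statistic and P value when one degree of freedom is assumed.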
A more extreme difference is observed when defining active genes based on FPKM > 0.1 (79.5 vs 70.0% for X and autosomes, χ² = 89.2, P < 1x10^-12). Notably, this overrepresentation of actively transcribed genes on the X chromosome is less apparent in Late meiotic stages (e.g., 56.5% on the X compared to 53.4% in autosomes, χ² = 7.23, P = 0.007).

Finally, we observe that expressed genes are not distributed randomly across chromosomes, but are instead physically clustered (Wald–Wolfowitz runs test, P < 1x10^-8 for all levels of expression analyzed). When defining actively expressed genes as FPKM > 1, clusters contain an average of 3.5 consecutive genes (5.8 genes when FPKM > 0.1). These results are in agreement with data from other Drosophila tissues and conditions, with small clusters of functionally related, highly co-expressed genes (Boutanaev, Kalmykova et al. 2002, Spellman and Rubin 2002, Weber and Hurst 2011).

Table 2.1 mRNA-seq statistics for each sample

Strain      Condition*  Gross reads   Mapped reads  % Mapped  Avg. depth
208         Early       44,938,695    34,472,998    76.71     84.3
208         Late        52,572,707    44,868,154    85.34     119.2
375         Early       67,255,860    48,750,637    72.49     119.0
375         Late        61,344,832    51,546,801    84.03     121.5
375Fx208M   Early       61,305,389    50,863,571    82.97     107.3
375Fx208M   Late        71,871,368    60,296,814    83.9      122.9
375Mx208F   Early       57,532,606    45,646,995    79.34     86.9
375Mx208F   Late        50,982,155    37,072,046    72.72     93.5
Combined    Early       231,032,550   179,734,201   77.8      457.2
Combined    Late        236,771,062   193,783,815   81.84     397.4

*Early and Late indicate Drosophila Early- and Late-ovarian transcriptome, respectively.

Figure 2.1 Comparison of Log10 FPKM values for Drosophila Early- and Late-ovarian transcriptome
Orange points indicate significantly differentially expressed genes based on FDR-corrected significance level of 5% (q < 0.05). Spearman’s ρ = 0.952 (P < 1x10^-12).

Figure 2.2 Transcriptional differences between autosomes and the X chromosome
(A) Mean FPKM values for autosomes and the X chromosome.
Error bars represent +/− 1 standard error. (B) Percentage of total transcribed genes across each chromosome. Error bars represent 90% confidence intervals. Green: Early-ovarian transcriptome, Orange: Late-ovarian transcriptome.

2.2.2 Differentially Expressed Genes in Early Meiotic Tissues

We observed 1,191 genes with differences in FPKM between Early and Late meiotic tissues at nominal P < 0.05, with 376 genes showing a significant difference after correcting for multiple testing (q < 0.05; see Materials and Methods). The degree of differential expression ranges from +241-fold to -2060-fold in the Early relative to Late tissues, with a median difference of 1.66-fold among genes with significant differences. We observe a bias towards overall down-regulation of genes in Early versus Late tissues (approximately five times more genes are significantly down-regulated than up-regulated) that cannot be explained by read bias in the respective samples. The top ten over- and under-expressed genes in the Early sample are listed in Table 2.2.

The use of DAVID (see Materials and Methods) to classify genes into GO categories reveals that the terms ‘proteolysis’ and ‘peptidase’ are significantly enriched within the top ten up-regulated genes in our Early sample (FDR-corrected modified Fisher exact P = 0.0001 and 0.001, respectively). Furthermore, all of the known genes (sensu annotated in the Drosophila Genome, r.5.47) within this group are serine-type endopeptidases involved in proteolysis. Why there is such a bias towards genes involved in proteolysis is difficult to explain, but a similar pattern has been noted in the apex of the testis in Drosophila (Cash and Andrews 2012). We suggest that the overrepresentation of serine endopeptidases may be due to the required breakdown of many meiotic proteins following their utilization in meiosis in order to prevent erroneous aggregation of many self-assembling protein complexes that may interact with DNA.
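The signed fold changes discussed above (positive for genes up-regulated in Early, negative for genes down-regulated in Early) can be sketched as follows. This is an illustration of the sign convention implied by Table 2.2 rather than the exact computation performed by the differential-expression software; the function name is ours, and because the FPKM values in the table are rounded, recomputed ratios only approximate the tabulated fold changes.

```python
def signed_fold_change(early_fpkm: float, late_fpkm: float) -> float:
    """Signed fold change of Early relative to Late expression.

    Positive values mean higher expression in Early (Early / Late);
    negative values mean higher expression in Late (-(Late / Early)).
    """
    if early_fpkm <= 0 or late_fpkm <= 0:
        raise ValueError("FPKM values must be positive to form a ratio")
    if early_fpkm >= late_fpkm:
        return early_fpkm / late_fpkm
    return -(late_fpkm / early_fpkm)


# Rounded (Early, Late) FPKM values for two transcripts listed in Table 2.2.
examples = {
    "CG42704-RA": (62.64, 0.98),  # up-regulated in Early
    "CG8997-RA": (1.16, 47.99),   # down-regulated in Early
}
for transcript, (early, late) in examples.items():
    print(transcript, round(signed_fold_change(early, late), 1))
```

Using a signed ratio keeps up- and down-regulation on a symmetric scale, which is why a gene expressed 40 times more highly in Late tissue appears as roughly -40 rather than as 0.025.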
The analysis of the 312 significantly down-regulated genes suggests enrichment in the GO terms phosphoproteins, RNA splicing, nucleotide binding, phosphate metabolic processes, and ribonucleotide binding. Analysis of the top ten most extreme down-regulated genes does not indicate overrepresentation of any GO term after correcting for multiple tests. These results indicate an enrichment of serine proteases in early versus late ovarian development and a concurrent down-regulation of the majority of genes in Early tissues. Interestingly, many of the top ten up-regulated genes were shown to be down-regulated in array-based comparisons between the whole ovary and the whole fly (Chintapalli, Wang et al. 2007). This result emphasizes that whole-ovary experiments might have lacked sufficient power to detect important genes involved in subregions of the developing ovary.

2.2.3 New Genes and Isoforms

We applied the Cufflinks algorithm to our combined data sets and identified up to 6,004 transcript forms (genes, exons, or noncoding RNAs) that were absent from the D. melanogaster genome annotation (r. 5.47, September 2012). When we conservatively restricted the list to only those items that were detected in two or more samples and further removed items with FPKM < 1, the set still contained 220 entries. Notably, 47 high-confidence items were identified with lengths greater than 300 bp, minimally repetitive sequences and reads that reliably mapped to predicted splice junctions. Additionally, a visual inspection shows that a minimum of 13 of these new transcripts are independent of other annotated gene entries and have clear exon-intron structures, and are thus strong candidates for new genes, while the rest are either novel splicing forms or putative ncRNAs.
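The successive filters described above can be summarized in a short sketch. The record fields, thresholds for "minimally repetitive" and junction support, and the example entries are all hypothetical; only the criteria stated in the text (two or more samples, FPKM of at least 1, length above 300 bp) are taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A putative unannotated transcript (all field names are hypothetical)."""
    name: str
    samples_detected: int             # samples in which the item was assembled
    fpkm: float                       # expression level in the combined data
    length_bp: int
    low_repeat_content: bool          # stand-in for "minimally repetitive"
    splice_junctions_supported: bool  # reads map reliably to predicted junctions


def conservative_set(candidates):
    """Keep items seen in two or more samples, dropping those with FPKM < 1."""
    return [c for c in candidates if c.samples_detected >= 2 and c.fpkm >= 1.0]


def high_confidence_set(candidates):
    """Further require length > 300 bp, low repeat content and junction support."""
    return [
        c for c in conservative_set(candidates)
        if c.length_bp > 300 and c.low_repeat_content and c.splice_junctions_supported
    ]


# Invented example entries, for illustration only.
candidates = [
    Candidate("novel_1", 4, 12.3, 850, True, True),
    Candidate("novel_2", 1, 5.0, 900, True, True),    # single sample: dropped
    Candidate("novel_3", 3, 0.4, 1200, True, True),   # FPKM < 1: dropped
    Candidate("novel_4", 2, 2.1, 250, True, True),    # passes first filter only
]
print([c.name for c in conservative_set(candidates)])
print([c.name for c in high_confidence_set(candidates)])
```

Applying the filters in sequence mirrors the narrowing reported in the text, from all assembled items to the conservative set and then to the high-confidence subset.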
To validate some of these new transcripts, we designed transcript-specific primers, extracted total RNA from ovaries and were able to reliably produce RT-PCR products from seven of ten haphazardly selected novel candidates. We thus, conservatively, estimate a contribution of ~30-35 novel items to the existing D. melanogaster genome annotation. Notably, a number of putative novel transcribed sequences mapped uniquely to the so-called chromosome U that consists of unordered, unoriented scaffolds not present in the D. melanogaster genome (euchromatic or heterochromatic) sequences. These results add to the notion that the actual number of unannotated genes and isoforms is still high in this model organism. Ultra-deep sequencing studies focusing on specific cell populations and variable conditions are therefore needed to fill this annotation gap, which can have important consequences in genomic and evolutionary analyses.

Table 2.2 Top ten differentially expressed genes by fold-change

Transcript               Biological process       Early FPKM  Late FPKM  Fold change  q value
Over-regulated in Early*
CG17475-RA               Proteolysis              13.95       0.06       240.9        1.02x10^-8
CG31267-RA               Proteolysis              8.22        0.06       140.7        1.81x10^-6
CG32833-RA               Proteolysis              2.91        0.04       65.3         9.43x10^-3
CG42704-RA               Unknown                  62.64       0.98       64.2         2.08x10^-8
CG18417-RA               Proteolysis              1.16        0.02       48.7         7.64x10^-3
CG43074-RA               Unknown                  11.81       0.24       48.7         3.06x10^-5
CG47205-RA               Unknown                  7.12        0.15       46.3         1.40x10^-3
CG31266-RB               Proteolysis              3.44        0.09       40.1         1.74x10^-3
CG31681-RA               Proteolysis              2.74        0.07       39.1         4.71x10^-3
CG15254-RA               Proteolysis              2.42        0.06       37.9         1.40x10^-3
Under-regulated in Early
Vml-RA                   d/v axis specification   0.11        221.40     −2,059.9     <1x10^-12
λTry-RA                  Proteolysis              0.07        3.27       −46.0        4.39x10^-3
CG8997-RA                Unknown                  1.16        47.99      −41.4        4.91x10^-10
CG7916-RA                Unknown                  0.78        31.21      −39.8        1.48x10^-9
CG12057-RA               Unknown                  1.74        68.09      −39.1        6.71x10^-7
CG7953-RA                Unknown                  0.68        26.10      −38.2        2.34x10^-9
CG33306-RA               Unknown                  0.10        3.71       −37.7        1.62x10^-4
chrUextra:28564682777    Probable rRNA            20.61       709.30     −34.4        2.00x10^-2
CG11911-RA               Proteolysis              0.75        21.88      −29.0        7.91x10^-8
CG18585-RA               Proteolysis              0.07        1.72       −24.9        5.45x10^-3

*Early indicates Drosophila Early-ovarian transcriptome.

2.2.4 Parent-of-Origin Effects in the Early Meiotic Tissue

Differences in gene expression between genetically identical offspring from reciprocal crosses indicate that maternal and/or paternal effects alter expression. The molecular causes of these parent-of-origin effects include genomic imprinting (through epigenetic modification during gametogenesis), cytoplasmic effects of the egg and sperm, or mitochondrial contributions to nuclear transcription.

To investigate parent-of-origin effects in the Early meiotic tissue in females, we studied two homozygous D. melanogaster parental strains (strains RAL-208 and RAL-375 from Raleigh, NC (Mackay, Richards et al. 2012)) and the heterozygous offspring from reciprocal crosses. We identified genes with a parent-of-origin transcription pattern as those genes that show differential expression between offspring of reciprocal crosses, and focused on the subset of these genes whose transcript levels change between offspring of reciprocal crosses in the same direction as the maternal strains differ from each other (i.e., parent-of-origin effects with maternal-like transcript levels). The comparison of offspring of reciprocal crosses reveals that there are more genes with parent-of-origin effects with maternal-like transcript levels in the Early- than in the Late-ovarian development tissues (1041 and 554 genes, respectively; P < 1x10^-12). Interestingly, there is an excess of genes with parent-of-origin effects with maternal-like transcript levels that resemble transcription in the RAL-208 maternal strain relative to genes whose transcription pattern resembles that of the RAL-375 maternal strain (P < 10^-12).
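The criterion for maternal-like parent-of-origin effects can be made concrete with a small sketch. It assumes that differential expression between the reciprocal offspring has already been established by a separate statistical test (passed in as a flag); the function name, argument names and expression values are hypothetical.

```python
def maternal_like(parent_208, parent_375, f1_mother_208, f1_mother_375,
                  differentially_expressed):
    """Return True when a gene shows a maternal-like parent-of-origin pattern.

    The gene must differ between reciprocal F1 offspring
    (`differentially_expressed`, decided by a separate test) and the F1
    difference must have the same sign as the difference between the two
    maternal strains.
    """
    if not differentially_expressed:
        return False
    parental_diff = parent_208 - parent_375
    offspring_diff = f1_mother_208 - f1_mother_375
    # Same sign (and both non-zero) means offspring track their mother's strain.
    return parental_diff * offspring_diff > 0


# Hypothetical FPKM values: RAL-208 expresses the gene more highly than RAL-375,
# and offspring with a RAL-208 mother express it more highly than offspring
# with a RAL-375 mother.
print(maternal_like(10.0, 2.0, 8.0, 4.0, differentially_expressed=True))
```

A gene whose reciprocal offspring differ in the opposite direction to the maternal strains, or do not differ significantly at all, would not be counted under this criterion.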
We expanded this study by investigating allele-specific transcript ratios of heterozygous offspring and observed an excess of the RAL-208 allele in both reciprocal crosses (P < 10^-12) that is higher when the maternal strain is RAL-208 (Wilcoxon Matched Pairs Test, P = 0.004). These results not only reveal the presence of variable parent-of-origin effects acting on transcript abundance but also an overrepresentation of dominant effects in RAL-208 relative to RAL-375.

We also identified an enrichment of a common set of GO terms associated with genes showing parent-of-origin effects with maternal-like transcript levels, many of which are involved in development and differentiation (Supplementary Table 2.1S). When the Early and Late datasets are combined, we recover similar GO term hits as were obtained for Early tissues alone (Supplementary Table 2.3S). We thus interpret this pattern as a clear signal of parent-of-origin effects in the transcriptome of the Drosophila ovary, with maternal-like gene expression that is mostly relevant to Early-ovary development.

2.2.5 Transcription is Associated with Increased Recombination Rates

Ultra-high resolution mapping of recombination events in Drosophila revealed that meiotic DSBs (detected as combined non-crossover and crossover events) occur preferentially in annotated transcriptional units (Comeron, Ratnappan et al. 2012). We thus hypothesized that gene transcription increases the probability of DSB formation in Drosophila and influences the recombination landscapes across chromosomes. Alternatively, the preference of DSBs for genic units could be associated with other characteristics such as nucleotide composition, reduced average nucleotide diversity relative to intergenic regions, presence of specific DNA motifs in promoter and intronic regions, etc. Although the topography of recombination landscapes in S. cerevisiae and D.
melanogaster are dramatically different in terms of hotspot activity and localization, evidence based on some hotspots in yeast suggests that promoter and/or transcriptional activity affects recombination activity (Petes 2001). The effects of transcription on DSB formation could be either direct, via reduced nucleosome occupancy and increased chromatin accessibility, or more indirect, as a consequence of histone modifications.

To evaluate our hypothesis, we focused on the expression profile during early D. melanogaster meiosis and compared the transcriptional landscape with recombination rate variation across the D. melanogaster genome. Note that we anticipate the presence of a fraction of cells other than those where DSB formation occurs in our Early-meiosis sample. We argue, however, that our sample is enriched in recombining cells and therefore, even if we may not recover the precise transcriptional profile at the time/cells where DSBs occur, genomic regions with no evidence of transcription will be particularly informative when defining coldspots of transcription during DSB formation. To this end, we used estimates of recombination rates across the D. melanogaster genome that were experimentally obtained after genotyping 139 million informative SNPs and mapping more than 100,000 recombination events at a scale approaching gene-level resolution (see (Comeron, Ratnappan et al. 2012) for details). We then compared measures of transcription in Early meiosis with these high-resolution recombination landscapes. The analysis of adjacent 100-kb regions reveals a positive association between recombination rates and both the number of genes transcribed per interval (Spearman’s R = 0.175, P = 3.1x10-10) and the total length of the transcribed regions per interval (R = 0.122, P = 1.2x10-5; see Figure 2.3).
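A window-level association of this kind can be illustrated with a short sketch that computes Spearman's rank correlation between the number of transcribed genes and the recombination rate per 100-kb window. This is an illustration only: the function names and the toy values are hypothetical and are not taken from the analyses reported here.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman's R computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: one entry per non-overlapping 100-kb window.
genes_per_window = [0, 2, 5, 1, 8, 3]        # genes with FPKM above threshold
cm_per_mb = [0.1, 1.4, 2.0, 0.9, 3.1, 1.2]   # recombination rate (cM/Mb)

r = spearman_r(genes_per_window, cm_per_mb)
print(round(r, 3))
```

Ranking before correlating makes the measure insensitive to the skewed distributions typical of expression and recombination data, which is one reason a rank-based statistic is a natural choice here.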
To capture the possible effect of transcription levels we also obtained a measure of overall transcriptional activity within a genomic interval (OTA), defined as the Log10-transformed sum of the product FPKM x transcript length for each gene within a given genomic interval (100 kb in our study). In our study, OTA is positively associated with recombination rates across the genome (R = 0.137, P = 8.4x10-7). Multiple regression analysis shows, however, that the number (P = 0.006) and total length of transcribed sequences (P = 0.035) in a region are more relevant than OTA (P > 0.4) in predicting recombination rates at the 100-kb scale. Notably, the relationship between measures of transcription and recombination (Figure 2.3) appears to be highly contingent upon regions that lack transcription relative to regions with transcription. There is a significant difference in recombination rates between regions with no transcription and regions with one or more transcribed genes (Mann-Whitney test, P < 1x10-6). This result is consistent with the idea that our study preferentially captures the consequences of coldspots for transcription during DSB formation in our Early-meiosis sample.

Figure 2.3 Relationship between transcription and recombination rates
(A) Mean recombination rate in cM/Mb (centimorgans per megabase) for genomic regions grouped according to the number of genes transcribed (FPKM > 0.1) within each 100-kb region. Spearman’s R = 0.168 (P = 1.5x10-9) based on non-overlapping 100-kb regions. (B) Mean recombination rate in cM/Mb for regions grouped according to the total region transcribed within each 100-kb region. R = 0.123 (P = 1.1x10-5). Error bars represent +/− 1 standard error.

The high-resolution genetic maps of D. melanogaster (see above) also allowed the localization of more than 5,000 DSBs delimited by 500 bp or less (Comeron, Ratnappan et al. 2012).
Here, we take advantage of these highly localized meiotic DSB events to investigate their distribution at the scale of single genes and intergenic regions. We observe that intergenic sequences have fewer DSBs than expected but, importantly, we detect a difference between genes transcribed and genes not transcribed (Figure 2.4). There is a significant excess of DSBs within transcribed genes relative to a random distribution (P = 5.1x10-6), while no preference/avoidance is observed for genes with no evidence of being transcribed (P > 0.4). These results show that the preference of DSBs to be located within annotated genic regions in Drosophila is not merely a consequence of DNA properties of genes such as higher G+C content than noncoding sequences, reduced sequence diversity, presence of DNA regulatory motifs in promoter regions and introns, etc. This result is also in agreement with the previous analysis of recombination rates and nucleotide composition showing that there is no positive association between recombination rates and G+C content (P > 0.20; (Comeron, Ratnappan et al. 2012)). Instead, the detection of recombination events targeting actively transcribed genes relative to genes with no detectable transcription strongly suggests that gene expression during early meiosis has a causal effect on DSB location and formation and, ultimately, recombination rates.

Finally, we investigated the effect of transcription levels on DSB presence. To this end, we divided genes with detectable transcription into three groups for low-, medium- and high-transcription levels. We observe that among genes with detectable transcription, DSBs preferentially target lowly transcribed genes (Figure 2.4), with a negative relationship between transcription levels and recombination.
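The comparison of observed DSB counts against a random distribution can be sketched as a Monte Carlo resampling. The version below is a simplified assumption rather than the procedure used in this chapter: it treats the genome as a set of categories, each with a total length, and drops every simulated DSB into a category with probability proportional to that length. The category names, lengths and counts are hypothetical.

```python
import random


def relative_presence(observed, lengths, n_reps=10_000, seed=0):
    """Return {category: (observed/expected ratio, two-sided empirical P)}.

    observed: DSB counts per category.
    lengths:  total sequence length per category (any consistent unit).
    """
    rng = random.Random(seed)
    cats = list(lengths)
    weights = [lengths[c] for c in cats]
    total = sum(weights)
    n_dsb = sum(observed[c] for c in cats)
    expected = {c: n_dsb * lengths[c] / total for c in cats}

    extreme = dict.fromkeys(cats, 0)
    for _ in range(n_reps):
        sim = dict.fromkeys(cats, 0)
        # Place every DSB at random, proportionally to category length.
        for c in rng.choices(cats, weights=weights, k=n_dsb):
            sim[c] += 1
        # Count replicates at least as far from expectation as the data.
        for c in cats:
            if abs(sim[c] - expected[c]) >= abs(observed[c] - expected[c]):
                extreme[c] += 1

    return {
        c: (observed[c] / expected[c], (extreme[c] + 1) / (n_reps + 1))
        for c in cats
    }


# Hypothetical toy input (not the thesis data).
toy_observed = {"intergenic": 60, "transcribed": 200, "untranscribed": 40}
toy_lengths = {"intergenic": 40.0, "transcribed": 40.0, "untranscribed": 20.0}

for cat, (ratio, p) in relative_presence(
    toy_observed, toy_lengths, n_reps=1000
).items():
    print(cat, round(ratio, 2), round(p, 4))
```

A ratio above 1 indicates more DSBs than expected for the category, and the empirical P is the fraction of random replicates that deviate from expectation at least as much as the observed count.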
Again, these results may reflect the expected heterogeneity of cell populations within our Early-meiosis tissue or, alternatively, a more complex interplay between transcription, histone modification and turnover, and chromatin accessibility for DSBs.

Figure 2.4 Relative presence of DSBs across the genome
Analyses based on the 5,610 DSB events delimited by 500 bp or less described in (Comeron, Ratnappan et al. 2012). The relative presence is measured as the ratio of the number of DSBs observed within each category to the number expected based on a random distribution of DSBs across the genome. Conservatively, we classified genes as showing no active transcription when FPKM < 0.001, and groups of genes with low-, medium- and high-transcription represent levels of target potential associated with transcription (FPKM x transcript length); 33, 46 and 21% of active genes belong to the low-, medium- and high-transcription groups, respectively. Probabilities (shown above each bar) associated with the relative presence of DSBs were obtained based on 10,000 independent replicates of the 5,610 DSBs randomly distributed across the genome.

2.3 Conclusions

We obtained and compared the transcription profiles of Early- and Late-meiosis in D. melanogaster females with mRNA-seq and ultra-deep coverage. We identified significant differences in gene expression, new genes and exons, and a pattern of parent-of-origin effects with maternal-like expression that is particularly evident in Early-meiosis stages. We also found that Early-meiosis transcription occurs more often on the X chromosome and that there is physical clustering of actively transcribed genes across chromosomes. In terms of gene categories, we report that many genes involved in proteolysis are highly expressed in early meiosis, which may be a result of the rapid degradation of meiotic proteins following their utilization in order to prevent erroneous, self-assembling aggregates (Ringrose and Paro 2007).
Our study and results underscore the limitations of using heterogeneous cellular and tissue samples when searching for biologically relevant features specific to particular developmental times and cell sets. In our case, searching for transcriptional signals present only in meiotic oocytes benefits from not using the whole ovary, because the oocyte becomes transcriptionally dormant after entering stage 1 in D. melanogaster, vastly increasing the influence of supportive nurse cells (Liu, Buszczak et al. 2006).

Work in yeast has shown that chromatin accessibility and nucleosome occupancy contribute to variation in the DSB landscape, although other factors may play a more dominant role in determining the probability of DNA cleavage (Kirkpatrick, Wang et al. 1999, Pan, Sasaki et al. 2011). Studies of nucleosome occupancy in mouse meiotic spermatocytes also suggest that open chromatin structure directs, at least in part, the formation of DSBs (Getun, Wu et al. 2010). Indirectly, these studies suggest that recombination landscapes could be influenced by gene expression, as transcription is known to alter chromatin structure. RNA-seq has been used as a powerful method to determine transcription patterns for specific tissues, cell populations and/or conditions, but it has heretofore not been exploited to gather information about the patterns of variation in recombination rates across whole chromosomes. Based on our previous high-resolution genetic maps in Drosophila, here we investigated the specific hypothesis that DSBs preferentially target transcriptionally active genomic regions in Drosophila. To our knowledge, our results represent the first evidence in a multicellular organism that gene expression in early meiotic cells is associated with increased likelihood of DSBs.
Importantly, the preference of DSBs for annotated transcripts seems to be related to active transcription and therefore supports the model that gene expression in meiotic tissues plays a role, albeit clearly not the only one, in influencing the landscapes and magnitude of recombination in a particular genomic region. Indeed, although the observed association between transcription levels and recombination rates is highly significant in terms of associated probability, it is weak in terms of the variation in recombination rates that can be explained solely by transcription. As such, the proposed influence of transcription on DSB formation and recombination landscapes should be viewed as one of several determinants of DSB localization. The presence of specific DNA motifs, proximity to telomeres/centromeres and other high-order chromatin structures during early meiosis are all factors likely to also play a role. Transcriptome data of specific cell types, possibly using novel transgenic methods, together with detailed genetic analyses are needed to determine the relative role of gene expression in influencing DSB formation and, ultimately, recombination rates across the Drosophila genome.

Recombination rates are an evolving and variable trait with detectable differences between species as well as within species. This inter-individual and inherited variation has been documented for the total number of recombination events per meiosis or per chromosome as well as in terms of the distribution across chromosomes in a number of species, including D. melanogaster (Kidwell 1972, Abdullah and Charlesworth 1974, Brooks and Marks 1986, Williams, Goodman et al. 1995, Koehler, Cherry et al. 2002, Neumann and Jeffreys 2006, Yandeau-Nelson, Nikolau et al. 2006, Esch, Szymaniak et al. 2007, Coop, Wen et al. 2008, Grey, Baudat et al. 2009, Rockman and Kruglyak 2009, Dumont, White et al. 2011).
Further, classic Drosophila genetics studies expose clear plasticity in the distribution of recombination rates across the genome as a result of biotic and abiotic factors (Stern 1926, Neel 1941, Redfield 1966, Priest, Galloway et al. 2008) that would also act upon inherited inter-individual variation. We propose that our model, in which variation in gene expression plays a role in altering the likelihood of DSB formation and thus the landscape of recombination across chromosomes, could easily reconcile many of these observations and provide a molecular, heritable and plastic mechanism for a number of observed patterns of recombination, from the high level of intra-specific variation to the influence of environmental factors and stress conditions. The concept that gene expression may act as a “plastic” and heritable modifier of recombination, directly or epigenetically, is particularly relevant to evolutionary models on the maintenance of recombination. Our proposed model could represent a direct link between stressful conditions and increased recombination, either region-specific or genome-wide, the very same circumstances where recombination may be most favorable (Parsons 1988, Hadany and Beker 2003, Hadany and Beker 2003, Agrawal, Hadany et al. 2005).

2.4 Methods

2.4.1 Drosophila Stocks and Tissue Preparation

We generated two crosses using two highly inbred strains (RAL-208 and RAL-375) from the Drosophila Genetic Reference Panel (DGRP) (Mackay, Richards et al. 2012) that have been previously sequenced and recombination-mapped to high resolution. These populations were collected in Raleigh (NC, USA) and subjected to 20 generations of full-sib mating. Freshly eclosed virgin females were collected from both inbred lines and crosses (males RAL-208 x females RAL-375 and its reciprocal cross) and allowed to mature for 72 hours at 23.5 °C. Ovaries from each of the four genotypes were dissected in RNAlater Reagent (Qiagen) using forceps and a dissecting probe.
Ovarioles were teased apart and early meiotic portions (Germaria to Stage 3) removed using electrolytically sharpened tungsten needles, resulting in four ‘Early’ and four ‘Late’ tissue preparations (Kalistratov and Bashkirov 1964).

2.4.2 Illumina Library Preparation and Sequencing

Total RNA was prepared from ovaries, ovaries with early meiotic regions removed, and early meiotic regions following an optimized protocol for the Qiagen RNeasy kit (Qiagen, Valencia, CA). mRNA was isolated using the Invitrogen Dynabead mRNA Purification kit, with two additional wash steps. mRNA was fragmented with a cation solution from New England Biolabs’ NEBNext Kit, ethanol precipitated, and cDNA synthesis performed with the NEBNext Kit. End repair, dA-tailing, and ligation of custom adapters were also performed with the NEBNext kit following an optimized version of the manufacturer’s suggested protocol. Adapter-ligated fragments of 300-350 bp were isolated from a 2% low-melt agarose gel and PCR enriched for 13 cycles. The PCR-enriched libraries were validated by running an aliquot on a standard agarose gel. Products were purified and their concentration obtained with Quant-iT™ PicoGreen® dsDNA Reagent and Kits (Invitrogen, CA, USA) on a Turner BioSystems TBS-380 Fluorometer. In total we generated eight Illumina libraries, with two independently generated libraries per genotype to obtain adequate biological and technical replicates that were also run in separate Illumina lanes. We ran two lanes with four multiplexed libraries each. Single-read 120 bp fragments were sequenced on an Illumina HiSeq 2000 at the University of Iowa DNA Core Facility.

2.4.3 Sequence Alignment and Expression Analyses

Illumina data were separated by tag using FastX Barcode Splitter, and the two lanes of data for each tag were concatenated. All further analyses were performed within Galaxy, an accessible bioinformatics framework capable of next-generation sequencing data analysis (Giardine, Riemer et al.
2005, Blankenberg, Von Kuster et al. 2010, Goecks, Nekrutenko et al. 2010). Summary statistics were gathered using FastQC. The 5’ adapter sequence was then removed from each sample and 3’ ends trimmed until reaching a quality score greater than ten using FastqTrimmer. The groomed data were then mapped to the D. melanogaster reference genome (Apr. 2006 BDGP R5/dm3) using TopHat v1.4.0 (Roberts, Pimentel et al. 2011, Trapnell, Roberts et al. 2012). We then used the Cufflinks package 2.0.2 (Trapnell, Roberts et al. 2012) to assemble transcripts, obtain their relative abundance and find differentially expressed genes. After assembling transcripts, CuffMerge was used for merging and annotation analysis, and measures of expression for every transcript associated with a particular gene were obtained in FPKM (Supplementary Figure 2.1S). Expression calculations for early and late ovary development were based on two sets of replicates (Early samples of RAL-375 males x RAL-208 females and its reciprocal cross, and Late samples of RAL-375 males x RAL-208 females and its reciprocal cross) with two biological replicates per genotype and condition (Early or Late). Classic-FPKM normalization was performed with pooled estimates of dispersion (negative binomial) following (Trapnell, Hendrickson et al. 2013). We then utilized the Cuffdiff 2 algorithm (Trapnell, Hendrickson et al. 2013) within Cufflinks 2.0.2 to calculate differential expression at both the gene and transcript levels. In short, differential gene expression was calculated using FPKM values for every gene while incorporating expression level variances during significance testing. This was performed by first deriving a dispersion model describing variances of fragment counts across replicates, which was then used to calculate the variances on a gene’s relative expression across replicates following the method described in (Trapnell, Hendrickson et al. 2013) (Supplementary Figures 2.2S-2.4S).
Genes were considered to be expressed if each sample had a minimum of ten reads mapped and an FPKM above 0.1, unless noted explicitly. Genes were considered to be differentially expressed if the prior expression requirements were satisfied and they reached an FDR-corrected significance level of 5% (q < 0.05).

2.4.4 Novel Gene Identification

Potentially novel genes were first identified by Cufflinks as significant reads mapping to unannotated regions of the dm3 genome that fit our expression criteria. Cufflinks initially identified 6004 potentially novel items. Restricting this list to only those that were detected in two or more samples reduced the number to 1308, and then filtering for only those expressed at reliable levels above one FPKM in at least one sample reduced the set to 220 entries. From this filtered list, we manually identified 47 items that had lengths greater than approximately 300 bp, were minimally repetitive, and possessed reads that reliably mapped to predicted splice junctions. We identified 13 of these items to be candidate novel genes, based on a more stringent visual inspection and identification of apparent intron-exon structures.

We performed RT-PCR on a subset of our identified potentially novel genes with probable open reading frames that were missing from both tracks in an attempt to confirm expression of the novel transcript. PCR primers were designed for regions with significant RNA-seq reads mapped that spanned more than 300 base pairs. Primers were first tested with genomic DNA as a template, and then with whole ovary cDNA as a template. Primers and conditions are available upon request. A gene was considered novel if the mRNA track contained it but it was unannotated, or if it was missing from both mRNA tracks and the dm3 genome annotation.
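The two automated filtering steps described above (detection in two or more samples, then expression above one FPKM in at least one sample) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the toy records are hypothetical, and "detected" is taken here to mean FPKM > 0, which the text does not specify.

```python
def filter_candidates(fpkm_by_item, min_samples=2, min_fpkm=1.0):
    """Keep items detected in >= min_samples samples and with at least
    one sample above min_fpkm.

    fpkm_by_item maps an item identifier to its per-sample FPKM values.
    Assumption: an item counts as detected in a sample when FPKM > 0.
    """
    kept = {}
    for item, fpkms in fpkm_by_item.items():
        n_detected = sum(1 for value in fpkms if value > 0)
        if n_detected >= min_samples and max(fpkms) > min_fpkm:
            kept[item] = fpkms
    return kept


# Hypothetical toy records: item id -> FPKM in each of four samples.
toy_items = {
    "item_a": [0.0, 0.0, 3.0, 0.0],  # one sample only: dropped
    "item_b": [0.2, 0.5, 0.0, 0.0],  # two samples, never above 1 FPKM: dropped
    "item_c": [0.4, 2.5, 0.0, 1.1],  # three samples, above 1 FPKM: kept
}

print(sorted(filter_candidates(toy_items)))
```

The later steps (length, repetitiveness, splice-junction support and visual inspection) were manual and are not represented in this sketch.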
2.4.5 Parent-of-Origin Effects

To study parent-of-origin effects in early and late tissues, we first identified genes that are significantly differentially expressed between offspring of reciprocal crosses (RAL-375 males x RAL-208 females and its reciprocal cross). We then focused on those genes with parent-of-origin effects that have levels of transcription in the offspring resembling the levels of transcription observed in the crosses’ maternal strain (RAL-208 or RAL-375). To investigate allele-specific transcription in heterozygous offspring we obtained the set of reads that uniquely map to only one of the D. melanogaster parental strains with zero mismatches but not to the other parental strain, and vice versa, using MOSAIK assembler (http://bioinformatics.bc.edu/marthlab/Mosaik). We also removed all reads that would differentially map to one parental sequence and not to the other if one of the reference sequences contained one or more ‘N’s for this read. Additionally, we only studied genes with a minimum of 100 allele-specific reads to increase accuracy in the allelic ratios.

Gene-term enrichment analyses were performed with DAVID (Huang da, Sherman et al. 2009), utilizing the BP_FAT subset of gene ontology (GO) terms to identify enriched biological themes. We then combined the early and late tissue GO data for each strain and repeated the analysis. We report p-values from the DAVID analysis according to the EASE score, a modified Fisher’s exact test (Huang da, Sherman et al. 2009). FDR-corrected EASE scores are reported utilizing the Benjamini-Hochberg multiple-test correction procedure employed by DAVID.

2.4.6 Genomic Distribution of Transcribed Genes

In order to test whether transcribed and untranscribed genes were distributed randomly across the genome, we performed a runs test for randomness.
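As a rough illustration, a runs test of this kind can be written as a Wald-Wolfowitz test with a normal approximation, applied to genes ordered along the genome and labeled by whether they exceed the FPKM threshold. The function name and toy labels below are hypothetical, and the normal approximation is an assumption of this sketch rather than a detail stated in the text.

```python
import math


def runs_test(labels):
    """Wald-Wolfowitz runs test on a sequence of two-category labels.

    labels: booleans in genomic order (True = transcribed above threshold).
    Returns (number of runs, z score, two-sided P from the normal approx.).
    Both categories must be present.
    """
    n1 = sum(1 for x in labels if x)
    n2 = len(labels) - n1
    n = n1 + n2
    # A new run starts at every switch between categories.
    runs = 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    mu = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    z = (runs - mu) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))
    return runs, z, p


# Hypothetical toy example: strongly clustered labels give few runs.
clustered = [True] * 10 + [False] * 10
print(runs_test(clustered))
```

Fewer runs than expected (negative z) indicate physical clustering of genes with the same transcriptional state, whereas more runs than expected (positive z) indicate alternation.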
To determine the number of runs, we separated genes into two categories of transcriptional activity: genes transcribed at levels greater than 1 FPKM and those that were not. Along the length of the genome, each switch from one category to the other counted as a completed run. Overlapping genes were counted separately following this scheme.

2.4.7 Recombination vs Expression Analysis

To test the hypothesis that gene expression is associated with recombination rates across the genome, we first generated landscapes of expression for each chromosome. We calculated the number of genes expressed at a threshold greater than 1 FPKM (unless noted otherwise) and the number of kilobases transcribed (counting overlapping transcript regions only once) per adjacent 100-kb interval. We also obtained a measure of overall transcriptional activity within a genomic interval (OTA), defined as the Log10-transformed sum of the product FPKM x transcript length for each gene within a given genomic interval. High-resolution recombination landscapes for adjacent 100-kb regions across the whole genome were obtained from (Comeron, Ratnappan et al. 2012).

The datasets supporting the results of this article are available in the NCBI SRA repository [http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP032523].

2.5 Supplementary Information

Table 2.3S Top Enriched¹ GO Terms among genes with parent-of-origin effects with maternal-like transcript levels for Early- and Late-Ovarian samples

Condition² | Transcription bias (Strain)³ | Term | Count | Percent of Total | P Value | FDR-corrected q value
Early | 375 | structural constituent of vitelline membrane | 4 | 6.35 | 3.16x10-07 | 4.26x10-05
Early | 375 | vitelline memb. form. in chorion-cont. eggshell | 4 | 6.35 | 6.48x10-06 | 2.36x10-03
Early | 375 | vitelline membrane formation | 4 | 6.35 | 6.48x10-06 | 2.36x10-03
Early | 375 | ovarian follicle cell development | 8 | 12.7 | 2.44x10-05 | 4.44x10-03
Early | 375 | extracellular matrix organization | 4 | 6.35 | 3.45x10-05 | 4.19x10-03
Early | 208 | cell morphogenesis | 72 | 9.3 | 1.68x10-13 | 3.37x10-10
Early | 208 | cellular component morphogenesis | 79 | 10.21 | 3.18x10-13 | 3.17x10-10
Early | 208 | neuron differentiation | 68 | 8.79 | 3.44x10-13 | 2.29x10-10
Early | 208 | neuron development | 61 | 7.88 | 6.16x10-13 | 3.08x10-10
Early | 208 | ribonucleotide binding | 120 | 15.5 | 8.10x10-13 | 5.35x10-10
Late | 375 | neuron development | 6 | 22.22 | 1.09x10-03 | n.s.
Late | 375 | neuron differentiation | 6 | 22.22 | 2.26x10-03 | n.s.
Late | 375 | behavior | 6 | 22.22 | 2.79x10-03 | n.s.
Late | 375 | transcription regulator activity | 7 | 25.93 | 5.41x10-03 | n.s.
Late | 375 | regulation of transcription | 7 | 25.93 | 8.79x10-03 | n.s.
Late | 208 | contractile fiber | 12 | 2.92 | 3.56x10-11 | 4.37x10-08
Late | 208 | contractile fiber part | 11 | 2.68 | 2.55x10-10 | 3.13x10-07
Late | 208 | sarcomere | 10 | 2.43 | 8.28x10-10 | 1.02x10-06
Late | 208 | myofibril | 10 | 2.43 | 1.83x10-09 | 2.24x10-06
Late | 208 | chorion | 9 | 2.19 | 1.66x10-07 | 2.03x10-04
Late | 208 | external encapsulating structure | 9 | 2.19 | 2.77x10-07 | 3.40x10-04

¹ The top 5 most enriched GO terms are shown for each category.
² Early and Late indicate Drosophila Early- and Late-ovarian transcriptome.
³ Transcription bias indicates the maternal strain towards which a gene shows similarity while showing differential transcription levels between the offspring of reciprocal crosses. (n.s., q > 0.05).
Table 2.4S Top Enriched¹ GO Terms among genes with parent-of-origin effects with maternal-like transcript levels

Transcription bias (Strain)² | Term | n | Percent of Total | P value | FDR-corrected q value
375 | structural constituent of vitelline membrane | 4 | 6.45 | 3.16x10-7 | 4.26x10-5
375 | vitelline membrane formation in chorion-containing eggshell | 4 | 6.45 | 6.48x10-6 | 2.36x10-3
375 | vitelline membrane formation | 4 | 6.45 | 6.48x10-6 | 2.36x10-3
375 | ovarian follicle cell development | 8 | 12.9 | 2.44x10-5 | 4.44x10-3
375 | extracellular matrix organization | 4 | 6.45 | 3.45x10-5 | 4.19x10-3
208 | cell morphogenesis | 72 | 9.81 | 1.10x10-14 | 2.20x10-11
208 | cellular component morphogenesis | 79 | 10.76 | 1.80x10-14 | 3.46x10-11
208 | neuron differentiation | 68 | 9.13 | 8.80x10-14 | 1.72x10-10
208 | cell morphogenesis involved in differentiation | 56 | 7.63 | 9.00x10-14 | 1.76x10-10
208 | neuron development | 60 | 8.17 | 2.07x10-13 | 4.06x10-10
208 | cell morphogenesis involved in neuron differentiation | 52 | 7.08 | 2.08x10-12 | 4.08x10-09

¹ The top 5 most enriched GO terms are shown for each category.
² Transcription bias indicates the maternal strain towards which a gene shows similarity while showing differential transcription levels between the offspring of reciprocal crosses.

Figure 2.5S FPKM distribution density across genes
Blue region: Early-ovarian transcriptome; Orange region: Late-ovarian transcriptome.

Figure 2.6S Volcano plot of genes by significance and fold change.

Figure 2.7S Early vs Late Log2 fold difference histogram following normalization

Figure 2.8S P-value distribution following normalization and testing with CuffDiff v2.0.2 before correcting for multiple tests. This histogram displays the approximate expected distribution of significance following proper normalization—harboring signals of true p

2.6 Chapter 2 References

Abdullah, N. F. and B. Charlesworth (1974). "Selection for reduced crossing over in Drosophila melanogaster." Genetics 76(3): 447-451.
Agrawal, A. F., L. Hadany and S. P. Otto (2005). "The evolution of plastic recombination."
Genetics 171(2): 803-812.
Assis, R., Q. Zhou and D. Bachtrog (2012). "Sex-biased transcriptome evolution in Drosophila." Genome Biol Evol 4(11): 1189-1200.
Baudat, F., J. Buard, C. Grey, A. Fledel-Alon, C. Ober, M. Przeworski, G. Coop and B. de Massy (2010). "PRDM9 Is a Major Determinant of Meiotic Recombination Hotspots in Humans and Mice." Science 327(5967): 836-840.
Blankenberg, D., G. Von Kuster, N. Coraor, G. Ananda, R. Lazarus, M. Mangan, A. Nekrutenko and J. Taylor (2010). "Galaxy: a web-based genome analysis tool for experimentalists." Curr Protoc Mol Biol Chapter 19: Unit 19.10.1-21.
Boutanaev, A. M., A. I. Kalmykova, Y. Y. Shevelyov and D. I. Nurminsky (2002). "Large clusters of co-expressed genes in the Drosophila genome." Nature 420(6916): 666-669.
Brooks, L. D. and R. W. Marks (1986). "The organization of genetic variation for recombination in Drosophila melanogaster." Genetics 114(2): 525-547.
Cash, A. C. and J. Andrews (2012). "Fine scale analysis of gene expression in Drosophila melanogaster gonads reveals Programmed cell death 4 promotes the differentiation of female germline stem cells." BMC Dev Biol 12: 4.
Chintapalli, V. R., J. Wang and J. A. Dow (2007). "Using FlyAtlas to identify better Drosophila melanogaster models of human disease." Nat Genet 39(6): 715-720.
Comeron, J. M., R. Ratnappan and S. Bailin (2012). "The many landscapes of recombination in Drosophila melanogaster." PLoS Genet 8(10): e1002905.
Coop, G., X. Wen, C. Ober, J. K. Pritchard and M. Przeworski (2008). "High-resolution mapping of crossovers reveals extensive variation in fine-scale recombination patterns among humans." Science 319(5868): 1395-1398.
Cutter, A. D. and J. Y. Choi (2010). "Natural selection shapes nucleotide polymorphism across the genome of the nematode Caenorhabditis briggsae." Genome Res 20(8): 1103-1111.
Dumont, B. L. and B. A. Payseur (2011). "Genetic analysis of genome-scale recombination rate evolution in house mice." PLoS Genet 7(6): e1002116.
Dumont, B. L., M. A. White, B. Steffy, T. Wiltshire and B. A. Payseur (2011). "Extensive recombination rate variation in the house mouse species complex inferred from genetic linkage maps." Genome Res 21(1): 114-125.
Esch, E., J. M. Szymaniak, H. Yates, W. P. Pawlowski and E. S. Buckler (2007). "Using crossover breakpoints in recombinant inbred lines to identify quantitative trait loci controlling the global recombination frequency." Genetics 177(3): 1851-1858.
Fledel-Alon, A., E. M. Leffler, Y. Guan, M. Stephens, G. Coop and M. Przeworski (2011). "Variation in human recombination rates and its genetic determinants." PLoS ONE 6(6): e20321.
Gan, Q., I. Chepelev, G. Wei, L. Tarayrah, K. Cui, K. Zhao and X. Chen (2010). "Dynamic regulation of alternative splicing and chromatin structure in Drosophila gonads revealed by RNA-seq." Cell Res 20(7): 763-783.
Getun, I. V., Z. K. Wu, A. M. Khalil and P. R. Bois (2010). "Nucleosome occupancy landscape and dynamics at mouse recombination hotspots." EMBO Rep 11(7): 555-560.
Giardine, B., C. Riemer, R. C. Hardison, R. Burhans, L. Elnitski, P. Shah, Y. Zhang, D. Blankenberg, I. Albert, J. Taylor, W. Miller, W. J. Kent and A. Nekrutenko (2005). "Galaxy: a platform for interactive large-scale genome analysis." Genome Res 15(10): 1451-1455.
Goecks, J., A. Nekrutenko and J. Taylor (2010). "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences." Genome Biol 11(8): R86.
Grey, C., F. Baudat and B. de Massy (2009). "Genome-wide control of the distribution of meiotic recombination." PLoS Biol 7(2): e35.
Hadany, L. and T. Beker (2003). "Fitness-associated recombination on rugged adaptive landscapes." J Evol Biol 16(5): 862-870.
Hadany, L. and T. Beker (2003). "On the evolutionary advantage of fitness-associated recombination." Genetics 165(4): 2167-2179.
Heil, C. S. and M. A. Noor (2012). "Zinc finger binding motifs do not explain recombination rate variation within or between species of Drosophila." PLoS ONE 7(9): e45055.
Huang da, W., B. T. Sherman and R. A. Lempicki (2009). "Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources." Nat Protoc 4(1): 44-57.
Kalistratov, G. F. and A. A. Bashkirov (1964). "[Use of a Serial Portable Apparatus for Electrolytic Sharpening of Surgical Instruments in the Preparation of Metal Microelectrodes]." Biull Eksp Biol Med 58: 122-123.
Kidwell, M. G. (1972). "Genetic change of recombination value in Drosophila melanogaster. II. Simulated natural selection." Genetics 70(3): 433-443.
Kirilly, D. and T. Xie (2007). "The Drosophila ovary: an active stem cell community." Cell Res 17(1): 15-25.
Kirkpatrick, D. T., Y. H. Wang, M. Dominska, J. D. Griffith and T. D. Petes (1999). "Control of meiotic recombination and gene expression in yeast by a simple repetitive DNA sequence that excludes nucleosomes." Mol Cell Biol 19(11): 7661-7671.
Koehler, K. E., J. P. Cherry, A. Lynn, P. A. Hunt and T. J. Hassold (2002). "Genetic control of mammalian meiotic recombination. I. Variation in exchange frequencies among males from inbred mouse strains." Genetics 162(1): 297-306.
Kulathinal, R. J., S. M. Bennett, C. L. Fitzpatrick and M. A. Noor (2008). "Fine-scale mapping of recombination rate in Drosophila refines its correlation to diversity and divergence." Proc Natl Acad Sci U S A 105(29): 10051-10056.
Lake, C. M. and R. S. Hawley (2012). "The molecular control of meiotic chromosomal behavior: events in early meiotic prophase in Drosophila oocytes." Annu Rev Physiol 74: 425-451.
Lindsley, D. L. and G. G. Zimm (1992).
The genome of Drosophila melanogaster. San Diego, CA, Academic Press.
Liu, J. L., M. Buszczak and J. G. Gall (2006). "Nuclear bodies in the Drosophila germinal vesicle." Chromosome Research 14(4): 465-475.
Llopart, A. (2012). "The rapid evolution of X-linked male-biased gene expression and the large-X effect in Drosophila yakuba, D. santomea, and their hybrids." Mol Biol Evol 29(12): 3873-3886.
Mackay, T. F., S. Richards, E. A. Stone, A. Barbadilla, J. F. Ayroles, D. Zhu, S. Casillas, Y. Han, M. M. Magwire, J. M. Cridland, M. F. Richardson, R. R. Anholt, M. Barron, C. Bess, K. P. Blankenburg, M. A. Carbone, D. Castellano, L. Chaboub, L. Duncan, Z. Harris, M. Javaid, J. C. Jayaseelan, S. N. Jhangiani, K. W. Jordan, F. Lara, F. Lawrence, S. L. Lee, P. Librado, R. S. Linheiro, R. F. Lyman, A. J. Mackey, M. Munidasa, D. M. Muzny, L. Nazareth, I. Newsham, L. Perales, L. L. Pu, C. Qu, M. Ramia, J. G. Reid, S. M. Rollmann, J. Rozas, N. Saada, L. Turlapati, K. C. Worley, Y. Q. Wu, A. Yamamoto, Y. Zhu, C. M. Bergman, K. R. Thornton, D. Mittelman and R. A. Gibbs (2012). "The Drosophila melanogaster Genetic Reference Panel." Nature 482(7384): 173-178.
McGaugh, S. E., C. S. Heil, B. Manzano-Winkler, L. Loewe, S. Goldstein, T. L. Himmel and M. A. Noor (2012). "Recombination modulates how selection affects linked sites in Drosophila." PLoS Biol 10(11): e1001422.
Meiklejohn, C. D. and D. C. Presgraves (2012). "Little evidence for demasculinization of the Drosophila X chromosome among genes expressed in the male germline." Genome Biol Evol 4(10): 1007-1016.
Meisel, R. P., J. H. Malone and A. G. Clark (2012). "Disentangling the relationship between sex-biased gene expression and X-linkage." Genome Res 22(7): 1255-1265.
Miller, D. E., S. Takeo, K. Nandanan, A. Paulson, M. M. Gogol, A. C. Noll, A. G. Perera, K. N. Walton, W. D. Gilliland, H. Li, K. K. Staehling, J. P. Blumenstiel and R. S. Hawley (2012). "A Whole-Chromosome Analysis of Meiotic Recombination in Drosophila melanogaster." G3 (Bethesda) 2(2): 249-260.
Morris, L. X. and A. C. Spradling (2011). "Long-term live imaging provides new insight into stem cell regulation and germline-soma coordination in the Drosophila ovary." Development 138(11): 2207-2215.
Myers, S., L. Bottolo, C. Freeman, G. McVean and P. Donnelly (2005). "A fine-scale map of recombination rates and hotspots across the human genome." Science 310(5746): 321-324.
Neel, J. V. (1941). "A relation between larval nutrition and the frequency of crossing over in the third chromosome of Drosophila melanogaster." Genetics 26(5): 506-516.
Neumann, R. and A. J. Jeffreys (2006). "Polymorphism in the activity of human crossover hotspots independent of local DNA sequence variation." Hum Mol Genet 15(9): 1401-1411.
Pan, J., M. Sasaki, R. Kniewel, H. Murakami, H. G. Blitzblau, S. E. Tischfield, X. Zhu, M. J. Neale, M. Jasin, N. D. Socci, A. Hochwagen and S. Keeney (2011). "A hierarchical combination of factors shapes the genome-wide topography of yeast meiotic recombination initiation." Cell 144(5): 719-731.
Parisi, M., R. Nuttall, P. Edwards, J. Minor, D. Naiman, J. Lu, M. Doctolero, M. Vainer, C. Chan, J. Malley, S. Eastman and B. Oliver (2004). "A survey of ovary-, testis-, and soma-biased gene expression in Drosophila melanogaster adults." Genome Biol 5(6): R40.
Parisi, M., R. Nuttall, D. Naiman, G. Bouffard, J. Malley, J. Andrews, S. Eastman and B. Oliver (2003). "Paucity of genes on the Drosophila X chromosome showing male-biased expression." Science 299(5607): 697-700.
Parsons, P. A. (1988). "Evolutionary rates: effects of stress upon recombination." Biological Journal of the Linnean Society 35(1): 49-68.
Petes, T. D. (2001). "Meiotic recombination hot spots and cold spots." Nat Rev Genet 2(5): 360-369.
Priest, N. K., L. F. Galloway and D. A. Roach (2008). "Mating frequency and inclusive fitness in Drosophila melanogaster." Am Nat 171(1): 10-21.
Ranz, J. M., C. I. Castillo-Davis, C. D. Meiklejohn and D. L. Hartl (2003). "Sex-dependent gene expression and evolution of the Drosophila transcriptome." Science 300(5626): 1742-1745.
Redfield, H. (1966). "Delayed mating and the relationship of recombination to maternal age in Drosophila melanogaster." Genetics 53(3): 593-607.
Ringrose, L. and R. Paro (2007). "Polycomb/Trithorax response elements and epigenetic memory of cell identity." Development 134(2): 223-232.
Roberts, A., H. Pimentel, C. Trapnell and L. Pachter (2011). "Identification of novel transcripts in annotated genomes using RNA-Seq." Bioinformatics 27(17): 2325-2329.
Rockman, M. V. and L. Kruglyak (2009). "Recombinational landscape and population genomics of Caenorhabditis elegans." PLoS Genet 5(3): e1000419.
Roth, S. and J. A. Lynch (2009). "Symmetry breaking during Drosophila oogenesis." Cold Spring Harb Perspect Biol 1(2): a001891.
Singer, T., Y. Fan, H. S. Chang, T. Zhu, S. P. Hazen and S. P. Briggs (2006). "A high-resolution map of Arabidopsis recombinant inbred lines by whole-genome exon array hybridization." PLoS Genet 2(9): e144.
Singh, N. D., E. A. Stone, C. F. Aquadro and A. G. Clark (2013). "Fine-scale heterogeneity in crossover rate in the Garnet-Scalloped region of the Drosophila melanogaster X Chromosome." Genetics.
Spellman, P. T. and G. M. Rubin (2002). "Evidence for large domains of similarly expressed genes in the Drosophila genome." J Biol 1(1): 5.
Spradling, A., M. T. Fuller, R. E. Braun and S. Yoshida (2011). "Germline stem cells." Cold Spring Harb Perspect Biol 3(11): a002642.
Spradling, A. C. (1993). Developmental genetics of oogenesis. The Development of Drosophila melanogaster. M. a. M.-A. Bate, A. Cold Spring Harbor, N.Y., Cold Spring Harbor Press. 1: 1–70.
Stern, C. (1926). "An effect of temperature and age on crossing-over in the first chromosome of Drosophila melanogaster." Proc Natl Acad Sci U S A 12(8): 530-532.
Sturgill, D., Y. Zhang, M. Parisi and B. Oliver (2007). "Demasculinization of X chromosomes in the Drosophila genus." Nature 450(7167): 238-241.
Tang, F., C. Barbacioru, Y. Wang, E. Nordman, C. Lee, N. Xu, X. Wang, J. Bodeau, B. B. Tuch, A. Siddiqui, K. Lao and M. A. Surani (2009). "mRNA-Seq whole-transcriptome analysis of a single cell." Nat Methods 6(5): 377-382.
Trapnell, C., D. G. Hendrickson, M. Sauvageau, L. Goff, J. L. Rinn and L. Pachter (2013). "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature Biotechnology 31(1): 46-53.
Trapnell, C., A. Roberts, L. Goff, G. Pertea, D. Kim, D. R. Kelley, H. Pimentel, S. L. Salzberg, J. L. Rinn and L. Pachter (2012). "Differential gene and transcript expression analysis of RNAseq experiments with TopHat and Cufflinks." Nat Protoc 7(3): 562-578.
Wang, Z., M. Gerstein and M. Snyder (2009). "RNA-Seq: a revolutionary tool for transcriptomics." Nat Rev Genet 10(1): 57-63.
Weber, C. C. and L. D. Hurst (2011). "Support for multiple classes of local expression clusters in Drosophila melanogaster, but no evidence for gene order conservation." Genome Biol 12(3): R23.
Williams, C. G., M. M. Goodman and C. W. Stuber (1995). "Comparative recombination distances among Zea mays L. inbreds, wide crosses and interspecific hybrids." Genetics 141(4): 1573-1581.
Yandeau-Nelson, M. D., B. J. Nikolau and P. S. Schnable (2006). "Effects of trans-acting genetic modifiers on meiotic recombination across the a1-sh2 interval of maize." Genetics 174(1): 101-112.

CHAPTER 3: In silico prediction of recombination rate variation across the Drosophila melanogaster genome based on multiple DNA motif analysis

3.0.1 Preface

Chapter 3 appears here as a reprinting of an article by Adrian, Cruz-Corchado, and Comeron with the same title submitted for review in 2015. Formatting and minor alterations have been made for consistency.

3.0.2 Abstract

In all eukaryotic species examined, meiotic crossovers occur non-randomly along chromosomes.
The cause for this nonrandom distribution remains poorly understood, but the presence of specific DNA motifs has been shown to play a contributory role in crossover localization in a number of species. In humans and mice, a motif targeted by the protein PRDM9 is strongly associated with crossover hotspots, but even in this paradigmatic case motif presence alone is a poor predictor of crossover distribution when studied at a whole-genome scale. In Drosophila, contrary to the human and mouse case, no PRDM9 homolog exists and recent studies suggest that many different motifs are enriched near experimentally determined recombination events. Here, we present genomic and bioinformatic analyses in D. melanogaster to investigate whether any DNA motif distribution can be used to predict crossover variation across the whole genome using machine-learning algorithms including Random Forests (RF) and multivariate adaptive regression splines (MARS). MARS, in particular, generates models of crossover distribution that expose a combinatorial, non-linear influence of motif presence able to account for more than 40% of the variance in crossover rates genome-wide. We show that highly predictive motifs share structural similarities that suggest secondary and tertiary DNA structures can be important factors in crossover localization. We also show that transcriptional activity during early meiosis and differences in motif use among chromosome arms add to the predictive power of the models. Our work presents a more detailed picture of crossover determination in Drosophila that includes DNA motif effects and a potential mechanistic explanation for the known plasticity in recombination rates, thus paving the road for further understanding of the multifactorial genetic and epigenetic nature of crossover distribution across genomes.

3.1 Background

Meiosis is a pervasive process among eukaryotes and the meiotic machinery is heavily conserved (Keeney 2001).
Yet, the rate of meiotic recombination and crossover, in particular, exhibits an astounding degree of variation across genomes as well as between closely related species, populations of the same species, and even among individuals of the same population (Neel 1941; Parsons 1988; Kim et al. 2007; Coop et al. 2008; Kulathinal et al. 2008; Mancera et al. 2008; Kong et al. 2010; Dumont et al. 2011; Fledel-Alon et al. 2011; Ross et al. 2011; Smukowski, Noor 2011; Comeron, Ratnappan, Bailin 2012; McGaugh et al. 2012; Miller et al. 2012; Singh et al. 2013; Gossmann et al. 2014; Liu et al. 2015). Moreover, crossover rates are affected by other factors such as age, temperature, and stressors, indicating that a precise description of crossover distribution requires identifying both heritable genetic variation and epigenetic factors (Brooks 1988; Kong et al. 2002; Hussin et al. 2011). Indeed, epigenetic marks and active transcription have been implicated in shaping the distribution of double-strand breaks (DSBs) and ultimately crossover rates in a number of species including Drosophila melanogaster (Mirouze et al. 2012; Yelina et al. 2012; Adrian, Comeron 2013).

To gain insight into the factors involved in crossover localization, much attention has been given to short DNA sequence motifs near crossovers. Computational analyses of high-resolution maps can identify specific motifs enriched at hotspot regions, but the transition to biological relevance has not been clear (MacIsaac, Fraenkel 2006; Simcha, Price, Geman 2012). Even when a motif has been associated with a specific process or function, analyses of motif presence using only DNA primary sequences are rarely predictive enough to forecast specific patterns such as crossover distribution at a whole-genome scale. One of these cases is the 13-mer motif recognized by the protein PRDM9 in humans and mice, with PRDM9 promoting histone methylation and crossing over around the DNA motif (Baudat et al. 2010).
Indeed, the PRDM9-associated motif is present in approximately 40-60% of crossover hotspots in humans (Myers et al. 2008; Hinch et al. 2011). The reverse is much less often true: the PRDM9 motif occurs over 290,000 times in the human genome while fewer than 50,000 recombination hotspots have been identified (Ségurel, Leffler, Przeworski 2011). Therefore, the PRDM9 hotspot-associated motif is not a strong predictor of crossover distribution at a genome-wide level even in species with bona fide hotspots that have >100-fold crossover rate relative to coldspots. A recent analysis across the ape phylogeny supports this conclusion, with enrichment of putative PRDM9 binding in recombination hotspots when compared to coldspot regions while there is no significant association between PRDM9 presence and recombination rates when measured broadly across the genome (Stevison et al. 2015). An equivalent case is observed in the fission yeast Schizosaccharomyces pombe, where motifs significantly enriched near some hotspots are, nonetheless, very poor predictors of hotspot localization genome-wide (Fowler et al. 2014).

In Drosophila, high-resolution recombination maps have revealed that crossover rates may vary up to 20- to 40-fold across genomic regions traditionally assumed to exhibit limited variation in recombination, describing ‘peaks’ of crossover rates that are far less extreme and physically discrete than in species with traditional hotspots (Kulathinal et al. 2008; Comeron, Ratnappan, Bailin 2012; McGaugh et al. 2012; Singh et al. 2013). Moreover, Drosophila species, like other species including some placental mammals, do not have functional PRDM9 orthologs (Oliver et al. 2009; Parvanov, Petkov, Paigen 2010; Muñoz-Fuentes, Di Rienzo, Vilà 2011; Heil, Noor 2012). The 13-mer motif associated with human hotspots that is recognized by PRDM9 is not observed near crossover events in Drosophila (Comeron, Ratnappan, Bailin 2012; Heil, Noor 2012). In fact, sequence analyses in D.
melanogaster have identified not one but many DNA motifs significantly enriched near crossover events (Comeron, Ratnappan, Bailin 2012; Miller et al. 2012; Singh et al. 2013). Current data, therefore, suggest that Drosophila lacks mammalian-like hotspots and that crossover localization may be influenced by the combined effects of several motifs; we hypothesize that crossover-inducing motifs may be more evolutionarily stable than in species with bona fide hotspots (e.g., humans). This scenario would support the possibility that studies of motif presence may, to some degree, be informative at a species level for describing genome-wide recombination variation. Whether the many crossover-associated motifs have predictive power describing variation in crossover rates across Drosophila genomes has not been explored. Whether the potential effects of motif presence on crossover localization are individual or synergistic, and whether the same set of motifs plays a role in crossover localization in different genomic regions, remain unknown as well.

Here we present an investigation of the potential capability of predicting crossover variation across the genome of D. melanogaster based solely on the analysis of genomic sequence and the distribution of specific motifs. We first generated landscapes of motif presence that take into account the probabilistic nature of motif sequences, background composition, and the numerous false positives expected in any large-scale genomic study. Using machine-learning techniques we show that the presence of multiple motifs can explain a significant fraction of the observed variation in crossover distribution at a whole-genome scale. Our quantitative models can explain more than 40% of the genome-wide variation in crossover rates without studying meiotic products, and are particularly accurate (more than 60% positive rate) at detecting the genomic regions with the highest 10% and lowest 10% crossover rates.
We show that while the effect of each motif is small, there is a multifactorial and non-linear influence of motif presence on crossover localization, with the predictive power of all models increasing with additional motifs. We also show that transcriptional activity during early meiosis adds predictive power to the models and thus explicitly includes a potential mechanistic explanation for the known plasticity in recombination rates. We also report for the first time that motif effects on crossover rates vary among chromosome arms.

3.2 Results & Discussion

3.2.1 Generation of Genome-Wide landscapes of DNA motifs

The study of almost 2,000 crossover events localized with high resolution (less than 500 bp) exposed many DNA motifs enriched near/at crossover events (Comeron, Ratnappan, Bailin 2012), and we used the reported position frequency matrix (PFM) of these motifs to generate corresponding landscapes of motif presence across the whole reference genome (see Methods). We avoided searching for ‘word’ matches of the most common nucleotides (with or without including ambiguous sites using IUPAC nucleotide code) to capture the probabilistic nature of most DNA motif composition. Instead, and for each motif, we applied a genomic scan to assign a likelihood L_i to every k-mer sequence to fit the PFM (with i and k indicating a genomic position and the length of the motif, respectively). We then generated a genome-wide null distribution of L_i (RL_i) based on random shuffling of nucleotide and dinucleotide composition and used the null distribution of RL_i to obtain a threshold for observed L_i that would represent a desired False-Discovery Rate (FDR) (e.g., L_FDR=0.01 when the FDR is 1%). We call a motif present at position i only when L_i > L_FDR. This approach allows applying any arbitrary FDR and, importantly, takes into account the number of sites under study as well as background nucleotide composition.
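The scan-and-threshold procedure described above can be illustrated with a minimal sketch. This is not the pipeline used in this chapter: the function names, the pseudocount, the log-odds scoring, the mononucleotide-only shuffle, and the way the FDR is estimated (null hits divided by observed hits above a cutoff) are all assumptions made for illustration.

```python
import math
import random

BASES = "ACGT"  # assumption: sequences contain only these four symbols


def pfm_to_log_odds(pfm, background, pseudocount=0.5):
    """Turn a PFM (one {base: count} dict per motif position) into log-odds scores."""
    log_odds = []
    for column in pfm:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        log_odds.append({
            b: math.log(((column.get(b, 0) + pseudocount) / total) / background[b])
            for b in BASES
        })
    return log_odds


def scan(sequence, log_odds):
    """Score every k-mer start position i; higher means a better fit to the PFM."""
    k = len(log_odds)
    return [
        sum(log_odds[j][sequence[i + j]] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


def shuffled(sequence, rng):
    """Mononucleotide shuffle (a simplification of a composition-preserving null)."""
    bases = list(sequence)
    rng.shuffle(bases)
    return "".join(bases)


def threshold_for_fdr(observed, null, fdr):
    """Lowest cutoff t with (#null scores > t) / (#observed scores > t) <= fdr.

    Assumes observed and null were scanned over the same number of positions.
    Returns max(observed) when no cutoff qualifies, so that nothing is called.
    """
    for t in sorted(set(observed)):
        n_observed = sum(1 for s in observed if s > t)
        if n_observed == 0:
            break
        n_null = sum(1 for s in null if s > t)
        if n_null / n_observed <= fdr:
            return t
    return max(observed)


def call_motifs(observed, threshold):
    """Positions where the motif is called present (score strictly above threshold)."""
    return [i for i, s in enumerate(observed) if s > threshold]
```

In use, one would scan the real sequence, scan one or more shuffled copies to build the null, derive the cutoff with `threshold_for_fdr(observed, null, 0.01)`, and then count the called positions per 100-kb window.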
FDR-corrected motifs predictably show greater variation in composition and decreasing similarity to the initial seed motif (decreased bit scores) when FDR increases (see Figure 3.5S). Increasing FDR not only causes a fast increase in overall motif count but can also alter the relative distribution across the genome, reinforcing the need to use a stringent FDR in genome-wide motif analyses. Unless otherwise indicated, we focused on FDR-corrected landscapes of motif presence at 100-kb scale with a conservative FDR of 1%. An FDR of 1% maximizes dynamic range and recovers a sequence logo nearly indistinguishable from the one produced using an FDR < 1%, all while restricting the fraction of false positives to an acceptable threshold. All 20 motifs investigated (see Methods) show a wide-range distribution across the genome (Table 3.1S), ranging from zero to a maximum of 139 motif matches per 100-kb region with an average presence that varies from 0.08 (motif M10) to 39 (M6). Genome-wide presence ranges from 95 (M10) to a maximum of over 46,000 motif hits (M6) even when FDR is set to 1%, which highlights the need for caution when interpreting the biological relevance of individual motifs.

3.2.2 Variation in motif presence among chromosome arms

We observe that many motifs are not equally distributed among chromosome arms (Table 3.1S). All 16 motifs present more than 1,000 times genome-wide show a significant departure from random distribution (Kruskal-Wallis test based on the comparison of motif presence per 100-kb region among chromosome arms, P < 0.001 in all cases). This result opens the possibility that the different chromosome arms might utilize different DNA motifs, individually or in combination, as factors for crossover localization (see below). Moreover, the unusual pattern of motif distribution is not associated with differences between autosomal arms vs.
chromosome X since 14 out of the 16 motifs also show significant presence heterogeneity among the four autosomal arms (P < 0.001).

3.2.3 Genomic co-occurrence of motifs

We also observe that the presence of these motifs across the genome is highly spatially associated, with more than 50% of all pairwise genome-wide motif distributions showing Spearman’s ρ > 0.15 (P < 1 × 10^-8), a percentage that reaches 75% when the uncommon motifs (fewer than 1,000 counts genome-wide) are removed from the study (average pairwise ρ = 0.30). Note that spatially randomized motifs do not produce any association (an average ρ of 7 × 10^-6 and a maximum pairwise ρ of 0.15 out of 50 million simulations of randomized motifs). Because our method to call motifs included background composition and fully avoids overlapping motifs, the existence of high spatial covariance of motif presence opens up the possibility of multifactorial regulation of crossover localization.

3.2.4 Motif presence is correlated with crossover rates across the genome

At first glance the observed variation of motif presence on a chromosome level visually follows the historical description of broad-scale crossover rate variation in D. melanogaster (see Figure 3.1). Motif occurrence is reduced near telomeric and centromeric regions of all chromosomes, concomitant with a tendency for reduced crossover rates (Morton, Rao, Yee 1976; Lindsley, Zimm 1992; Comeron, Ratnappan, Bailin 2012; Miller et al. 2012). Although the fraction of the genome initially used to infer motifs from experimentally characterized crossovers was minimal (less than 1% of the genome), we studied potential genome-wide associations between variation in motif presence and crossover rates using estimates of recombination completely independent of the experimental genetic map used to find the initial motifs.
To this end, we used high-resolution population estimates of crossover rate across the genome based on patterns of linkage disequilibrium (Chan, Jenkins, Song 2012) applied to D. melanogaster African Rwanda (RG) and Zambia (ZI) populations [(Pool et al. 2012; Lack et al. 2015); see Methods for details]. Unless noted, analyses are based on crossover estimates of the RG population (see Methods). The direct comparison of crossover rates and motif presence at 100-kb scale shows a very strong positive association for 15 motifs, with Spearman’s ρ ranging between 0.15 (P = 3 × 10^-7) and 0.51 (P < 1 × 10^-16) (Figure 3.2 and Figure 3.6S). Because the 20 motifs analyzed were originally identified based on a very small fraction of the genome, the detection of a strong association for many motifs at a genome-wide scale supports the initial approach for detecting sequence signatures of crossover localization genome-wide. Not all motifs reported to be enriched near crossover events, however, are positively associated with crossover rates at a genome-wide scale, with 5 motifs showing ρ < 0.05 (P > 0.1), mostly attributable to their very limited presence across the genome once FDR-correction is applied (see Table 3.1S).

We observe that some motifs show clear differences among chromosome arms in terms of association with crossover rates (Figure 3.2, Figure 3.6S and Figure 3.7S). For instance, variation in M7 presence shows a very strong association with crossover rates across the X chromosome (ρ = 0.51, P = 3 × 10^-16) while it shows no association across autosomal arms (P > 0.1 in all autosomal arms). M2, on the other hand, shows strong positive association with crossover rates across all four autosomal arms (ρ > 0.28; P < 2 × 10^-5) but not across the X chromosome (ρ = 0.085; P > 0.1). This difference in potential causal effects of motif presence on crossover localization, however, does not seem to be simply an autosomal vs. X effect since other motifs (e.g.
M1, M6, M11, M13) show more complex patterns.

In an effort to obtain a baseline model that considers multiple motifs to explain crossover distribution we first performed LASSO (Least Absolute Shrinkage and Selection Operator) regression (Tibshirani 1996; Hastie et al. 2009) (see Methods for details). LASSO regression is a data mining technique that favors solutions with fewer parameter values under a linear model, simultaneously performing variable selection and simplifying model interpretation. LASSO exposes six heavily weighted motifs (in order of importance M3, M5, M1, M7, M2, and M18), all positively associated with crossover rates. With these six motifs, LASSO fits a linear model of motif presence that explains 30% of the variation in crossover rates genome-wide (ρ = 0.55, P < 2.2 × 10^-16). Note that while motifs M5, M3 and M1 were among the motifs with the highest association with crossover rates based on simple individual non-parametric Spearman’s correlations (Figure 3.2 and Figure 3.6S), motifs M7, M2 and M18 were not in the top six. Easing the constraints of LASSO (ρ + 1 S.E.; Figure 3.8S) allows all but one motif variable to enter the model but this highly complex model exhibits little improvement in performance (ρ = 0.59, P < 2.2 × 10^-16).

Figure 3.1 Genomic Landscape of Motif 3. Estimates of motif presence per 100 kb across chromosome arms 2L, 2R, 3L, 3R and the X chromosome. Motif 3 (M3) presence is assigned after applying a 1% FDR (see Methods).

Figure 3.2 Probability heatmap of correlation between motif presence and crossover rates for different chromosome arms. Probability of Spearman’s ρ of motif presence and crossover rates calculated genome-wide and for each chromosome arm separately based on non-overlapping 100-kb regions. Only correlations with P < 0.01 are shown; regions in white are above this threshold. Motifs are ordered based on Spearman’s ρ, from stronger (M5; ρ = 0.513, P = 7 × 10^-81) to weaker (M10; ρ = 0.013, n.s.)
in genome-wide analyses.

3.2.5 A predictive model of variation in crossover rates across the genome based on sequence motif occurrence.

In order to investigate how accurate a model of crossover variation based on variation in motif presence could be, we applied two machine learning methods. We first used Random Forests (RF) (Breiman 2001; Lee et al. 2005; Banfield et al. 2007) as a form of supervised learning to discriminate between regions (classes) of different crossover rates, particularly between low and high rates. We later constructed a quantitative predictive model using Multivariate Adaptive Regression Splines (MARS) (Friedman 1991; Friedman, Roosen 1995; Hastie et al. 2009) (see Methods for details). MARS is an approach that splits predictive variables into several intervals, and allows potential nonlinear relationships over these different intervals with any degree of interaction between variables. Importantly, MARS allows obtaining a final explicit and continuous model of crossover rates based on the combined presence of multiple motifs.

3.2.6 Random Forests (RF) categorical modeling

We split 100-kb regions into ten approximately equally sized bins, each containing a class from the lowest to highest 10% crossover rate, and applied RF to classify crossover classes. The correctness of the RF models is measured by true positive rate (accuracy) and the area under the curve (AUC) that indicates the ability of the model to discriminate between the different classes, with AUC scores ranging from 0.5 (indicating that a model has no discriminatory ability) to 1 (indicating that the model can discriminate perfectly among classes). Note that RF does not directly generate probability values associated with the whole model.
We, therefore, obtained the statistical significance of RF models by comparing the accuracy and AUC of the model with the accuracy and AUC obtained by applying the same RF method after randomizing crossover classes among 100-kb regions (250,000 randomizations per model).

Using ten crossover classes as our class variables, we first trained a random forests model using the presence of 20 motifs and later added chromosome arm (see above) and transcription data (Adrian, Comeron 2013). Genome-wide, the accuracy (true positive percentage) for models with 20 motifs is 26.1% versus a random accuracy of ~10% (P < 4 × 10^-6), with mean AUC = 0.658 (P < 4 × 10^-6). The model performs radically better for the top and bottom crossover classes (see below). Accuracy for the top and bottom 10% classes is 78.6% and 60.2%, respectively (more than six-fold enriched; P < 4 × 10^-6 in both cases), exposing a large deviation from the random expectations. Enrichment based on AUC shows an equivalent pattern, with AUC = 0.892 and 0.876 for the top and bottom classes, respectively (P < 4 × 10^-6 in both cases). The study of motif presence alone classifies regions within one step of their true crossover class 49% of the time among all recombination classes, indicating that even when our model fails to accurately predict a class, the prediction often falls into the adjacent bin.

Including information on the chromosome arm and the genomic distribution of transcribed genes in early meiosis increases the accuracy of the RF model to 27.8% and generates a mean AUC of 0.725 (P < 4 × 10^-6 in both cases). Mean accuracy for the top class increases to 83.3% (AUC = 0.947; P < 6.6 × 10^-6) with a 7% false positive rate, which demonstrates the high predictive power of motif presence for these regions (Figure 3.3).
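The randomization-based significance test used for the RF models can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in this chapter: `permutation_p_value`, `accuracy`, and the `statistic` callable are hypothetical names, and in a real analysis `statistic` would retrain and evaluate the classifier on each set of randomized class labels. The sketch also makes explicit why 250,000 randomizations bound the reported probabilities at 1/250,000 = 4 × 10^-6.

```python
import random


def accuracy(true_labels, predicted_labels):
    """Fraction of regions whose predicted class equals the true class."""
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return hits / len(true_labels)


def permutation_p_value(observed_stat, statistic, labels, n_permutations, seed=0):
    """Empirical p-value of observed_stat against label randomizations.

    statistic: callable taking a randomized label list and returning the score
               (e.g., accuracy or AUC) obtained with those labels.
    Returns (p_estimate, resolution). When p_estimate is 0, the result can only
    be reported as P < resolution, where resolution = 1 / n_permutations.
    """
    rng = random.Random(seed)
    randomized = list(labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(randomized)  # reassign classes among regions
        if statistic(randomized) >= observed_stat:
            at_least_as_good += 1
    return at_least_as_good / n_permutations, 1.0 / n_permutations
```

With ten equally sized classes the expected accuracy under randomization is about 10%, which is the baseline against which the observed accuracies are compared.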
The addition of other genomic properties, such as the number of annotated genes or the proportion of exonic sequence in each window, does not significantly affect model accuracy (data not shown) and these variables are never within the top ten variables in terms of importance within the model (positions 18 and 21, respectively) ranked by information gain. Similarly, GC content per window is neither ranked highly by information gain criteria nor does it have a substantial impact on classification accuracy. Taken together, the results indicate a significant influence of motif presence on crossover rate variation and support a role for active transcription during early meiosis in crossover localization.

3.2.7 MARS modeling

We evaluated the quality of MARS models focusing on R²GCV scores (the MARS estimate of how well this model would perform on new data) based on the conservative 10-fold cross-validation approach (see Methods for details). The simplest model that considers the variable presence of motifs across the genome is able to explain ~45% (R²GCV = 0.447) of the genome-wide variation in crossover distribution (Figure 3.4A and 3.4B), and identifies six motifs with significant contribution to the model (M7, M3, M5, M14, M18 and M6). This predictive model improves even further when including information on transcription data (R²GCV = 0.47) or chromosome arm (R²GCV = 0.57). The most complete model that includes motif presence, transcription and chromosome arms explains more than 59% of the variation in crossover genome-wide. This complex model recognizes chromosome arm, transcription data and 8 motifs as significantly important within the model. Many of these predictive variables exhibit a clear non-linear effect on crossover rates, including transcription data and most motifs (e.g. M2, M5, M7 and M18; Figure 3.4C). Table 3.2S shows additional results based on less conservative MARS validation methods.
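To make the form of a MARS model concrete, the following minimal sketch evaluates a model built from hinge functions, the piecewise-linear building blocks that let the effect of a variable change across intervals, with products of hinges representing interactions. The function names, knots, and coefficients are hypothetical and chosen only for illustration; they are not fitted values from this study.

```python
def hinge(x, knot, direction=+1):
    """Hinge basis function: max(0, x - knot) for +1, max(0, knot - x) for -1."""
    return max(0.0, direction * (x - knot))


def mars_predict(row, intercept, terms):
    """Evaluate a MARS-style model for one genomic window.

    row:   {variable name: value}, e.g., motif counts per 100-kb window.
    terms: list of (coefficient, factors), where factors is a list of
           (variable, knot, direction); several factors in one term are
           multiplied together and represent an interaction.
    """
    total = intercept
    for coefficient, factors in terms:
        value = coefficient
        for variable, knot, direction in factors:
            value *= hinge(row[variable], knot, direction)
        total += value
    return total


# Hypothetical model: two single-motif terms and one motif-by-motif interaction.
example_terms = [
    (0.5, [("M3", 10, +1)]),                  # effect only above 10 hits of M3
    (-0.2, [("M7", 5, -1)]),                  # effect only below 5 hits of M7
    (0.1, [("M3", 10, +1), ("M7", 5, +1)]),   # interaction term
]
```

For a window with 14 hits of M3 and 8 hits of M7, `mars_predict({"M3": 14, "M7": 8}, 1.0, example_terms)` combines the intercept, the first term, and the interaction, while the second term contributes nothing because M7 is above its knot.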
Analyses based on population estimates of crossover rates using the ZI population instead of the RG population generate similar results (Table 3.2S). Notably, the motifs and their ranking by order of importance differ between models investigating only motif presence and models including transcription and chromosome arm information. In particular, motifs such as M17, M12 and M2 become part of the model when interactions with chromosome arms are allowed, in agreement with the observation that these motifs showed positive association with crossover rates in some but not all chromosomal arms (Figure 3.2). Combined, these results expose that crossover variation across the genome is significantly influenced by the presence of multiple motifs and by interactions between motif utilization, chromosome arm, and transcription activity in early meiosis.

In D. melanogaster, crossover frequency is severely reduced near centromeres and, to a lesser degree, near telomeres (the so-called “centromere effect”; Beadle 1932). We, therefore, investigated the possible influence of motif distribution on crossover rates after excluding these sub-centromeric and telomeric regions from the study. MARS generates models with R²GCV that range between 0.405 (only motif presence) and 0.539 (motifs + chromosome arms + transcription activity in early meiosis). These results show that variation in motif distribution maintains significant power explaining variation in crossover rates across genomic regions considered of average or high recombination.

Finally, we investigated the predictive power of motif presence along individual chromosomal arms to eliminate any possible effects associated with differences in motif presence (see above) and total crossover rates among arms.
MARS analyses show (Figure 3.4A) a weaker but still high predictive power describing variation in crossover rates based on motif presence alone (R²GCV ranging between 0.16 and 0.36); when transcription data is also included, the predictive power is increased (R²GCV ranging between 0.19 and 0.46).

Figure 3.3 True-positive rate generated by a Random Forests (RF) model. True positive rate (accuracy) is given for 10 crossover classes, from class A (indicating the class with lowest 10% crossover rate) to class J (indicating the class with highest 10% crossover rate). Random accuracy (uninformative model) per class is 10% (horizontal dashed line). Bar colors indicate significance of departing random expectations assessed by randomized simulations. The model tested utilized all motifs, chromosome arm, and transcription estimates to predict recombination classes (see Methods for details).

Figure 3.4 MARS predictive models of crossover rates. A) Estimates of the predictive power of the MARS models (R²GCV) using only motif presence or when the models include transcription data and/or chromosome arms as predictive variables. Results based on genome-wide analyses (red) and after removing sub-telomeric and -centromeric regions (trimmed genome) (blue). B) Relationship between observed crossover rates (see Methods) and predicted crossover rates based on MARS with a model that includes variation in motif presence genome-wide. C) Examples of MARS predictions of the influence of motif presence on crossover rates.

3.3 Conclusions

To obtain FDR-corrected landscapes of motif presence across the genome we used a flexible approach based on the likelihood of each k-mer sequence of nucleotides to correspond to a given PFM to later calculate and apply an arbitrary FDR. The use of a FDR instead of a direct arbitrary probability is necessary to limit the extent of false positives and this parameter should be tuned appropriately for each investigation.
We set the FDR at 1% because it generates a large number of motif hits while still producing PFMs equivalent to those obtained under the most stringent FDR of 0.1%, suggesting that false positives do not seriously alter motif detection. We also confirmed that a single PFM per motif is adequate across the whole genome by comparing the top 50 and bottom 50 recombination-correlated regions by residual (which followed an approximately normal distribution). We found no significant differences between PFMs recovered from these two classes of regions, indicating that a single PFM per motif can be used at genome-wide scale and that motif count, rather than motif sequence or quality, is the more important predictor of crossover in our data.

3.3.1 Motif Composition

Several of the motifs with the strongest association with crossover rates, or with the highest importance within RF and MARS models, share an enrichment of poly-A/T or [A/T]n tracts. It has been well established in gel-mobility and X-ray crystallographic studies that repeated instances of A/T tracts can produce a bend in the DNA helix axis, with the in-phase repetition of these elements contributing to larger overall bends (Travers 1990; Dlakic, Harrington 1998; Hizver et al. 2001). Furthermore, there is evidence to suggest that polyA/polyT bending tracts serve important functions, altering protein-DNA binding specificities and regulating transcription (Travers 1990; Perez-Martin, de Lorenzo 1997). AA, TT, and AT dinucleotides are also a factor in nucleosomal positioning (Segal et al. 2006), which could have a direct impact on chromatin accessibility, consistent with hypotheses where such accessibility is required for DSB induction. In fact, a poly-A tract is at the center of a proposed common (CoHR) motif of DSBs in yeast (Blumental-Perry et al. 2000).
Moreover, the association between A/T-rich motifs and crossovers is motif-specific rather than a large-scale nucleotide composition phenomenon, since there is only a nominal effect of overall A+T content on the distribution of crossover rates at the 100-kb scale (ρ = -0.057, P = 0.006), which becomes non-significant when analyzing noncoding sites only [P > 0.25; (Comeron, Ratnappan, Bailin 2012)]. The strong signal associated with [CA]n motifs across the D. melanogaster genome is also interesting because these motifs are markers for Z-DNA formation (Herbert, Rich 1996) and have been shown to be significantly enriched near hotspots of recombination in S. pombe and S. cerevisiae (Treco, Arnheim 1986; Fowler et al. 2014). Our data therefore support the concept that secondary and tertiary DNA structures are relevant to the fine-scale localization of meiotic DSBs, acting at a very local scale within larger genomic or chromatin features.

Intriguingly, motif M6 shows a strong influence on crossover rates within MARS models that analyze motif presence (either genome-wide or after removing sub-telomeric and sub-centromeric regions). M6 [MCO4 in (Comeron, Ratnappan, Bailin 2012)] includes the 7-mer CCTCCCT sequence first associated with hotspot determination in humans, which is the core sequence of the longer 13-mer motif recognized by the zinc finger protein PRDM9 (Jeffreys, Neumann 2002; Jeffreys, Neumann 2005; Myers et al. 2005; Myers et al. 2008; Baudat et al. 2010). Moreover, this core CCTCCCT motif is not only observed in D. melanogaster (this study) but was previously observed in regions with high crossover rates in D. pseudoobscura (Kulathinal et al. 2008). Because there is no PRDM9 homolog in Drosophila species (Oliver et al.
2009; Heil, Noor 2012) and the complete PRDM9-associated 13-mer motif is not observed near crossovers (Comeron, Ratnappan, Bailin 2012; Heil, Noor 2012), the observation that a motif including its core sequence CCTCCCT may play a detectable role in describing crossover localization is unexpected and could indicate unappreciated commonalities in crossover-associated motifs between mammals and insects.

3.3.2 Differences among Chromosome Arms

Our study has shown not only that the presence (per 100 kb) of many motifs varies among chromosome arms but also that the association between motif presence and crossover rates differs depending on chromosome arm. MARS analyses reveal detectable interactions between chromosome arm and motif presence when describing crossover rates and, in agreement, the order of motif importance within MARS models varies when chromosome arm is included as a predictor. Notably, there is no simple split in the model between autosomes and the X chromosome. Our results imply that different large-scale genomic regions or whole chromosome arms may utilize different combinations of DNA motifs as localizing factors for crossover formation (and potentially DSBs). Outside of differential rates of evolution (Meisel, Connallon 2013; Llopart 2015) or average rates of crossover among arms, it is hard to conceptualize how large chromosomal domains would differ from one another in terms of motif utilization for DSB formation and resolution. At this time, we can only hypothesize that such a potential mechanistic link could be associated with spatio-temporal movement and nuclear localization of the chromosome arms during early meiosis (Parvinen, Soderstrom 1976; Koszul et al. 2008; Shibuya, Ishiguro, Watanabe 2014).
At a more practical level, our results also indicate that analyses of motif occurrence based on a single large genomic region or chromosome may generate an informative view for that region or chromosome, but one that may not be applicable genome-wide, thus explaining differences in motif detection among studies.

3.3.3 Crossover Localization across the Genome

Our study supports the concept that multiple motifs play a detectable role in crossover localization in Drosophila. Although high-resolution maps in a number of other species (including S. cerevisiae, S. pombe, Plasmodium falciparum and Apis mellifera) have exposed a similar scenario with multiple motifs significantly enriched near crossovers (Gerton et al. 2000; Cromie et al. 2007; Steiner et al. 2009; Jiang et al. 2011; Bessoltane et al. 2012; Liu et al. 2015), we show that variation in the relative presence of these motifs is in fact predictive and explains a significant fraction of the observed variation in crossover distribution across the genome. While the effect of each motif is small, the predictive power of all models increases with additional motif variables, and we have found non-linear influences of motif presence on crossover localization in Drosophila. We also show that variables describing transcriptional activity during early meiosis and, more unexpectedly, chromosome arm add to the predictive power of the models. Standard multiple linear regression and the more complex LASSO linear regression suggest that 22% and 35%, respectively, of the observed variance in crossover rates across the genome could be explained by motif presence. However, the application of more advanced machine learning techniques such as RF and MARS allows the study of multiple variables without the constraints of linear models.
Our study shows that RF modeling identifies regions with high and low crossover rates with high accuracy, particularly those regions with the highest and lowest 10% of crossover rates, where the accuracy of the model is at least six-fold greater than the random expectation. MARS, on the other hand, generates quantitative predictions while allowing multiple interactions among motifs and nonlinear effects when describing crossover rates, including saturation/insensitivity above/below certain ranges. Indeed, MARS exposes significant interactions among motifs and generates models able to account for more than 40% of the variance in crossover rates across the D. melanogaster genome based on R2GCV estimates. MARS also identified important motifs with effects on crossover localization that showed only secondary effects when analyzed using simpler approaches, with potentially large non-linear and combined effects. Note that MARS estimates of R2GCV based on less conservative methods (Table 3.2S) would imply a much higher influence of motif presence on variance in crossover distribution, but these high R2 estimates (as opposed to those based on the more conservative 10-fold cross validation) should be interpreted with caution due to potential overfitting. Our MARS results based on R2GCV suggest that motif presence in Drosophila can explain genome-wide patterns of crossover even more than in species such as humans, where patterns of PRDM9 can explain ~18% of the variation in recombination hotspots (Baudat et al. 2010).

Based on the knowledge obtained from this study and others, a general model is emerging where crossover distribution is determined by a combination of factors acting at different physical scales (Petes 2001; Kleckner 2006; Pan, Keeney 2007; Pan et al. 2011; Adrian, Comeron 2013; Borde, de Massy 2013). In D.
melanogaster, the centromere effect describes variation in crossover distribution at the largest scale (hundreds of kb), with a severe reduction in crossover rates at sub-centromeric/telomeric regions (Beadle 1932). Our study shows that motif presence plays a significant role in describing the observed heterogeneity in crossover distribution even after removing regions near centromeres and telomeres, and MARS models perform significantly worse when using the physical position along a chromosome arm relative to the centromere (with 0 and 1 indicating the sub-centromeric and sub-telomeric positions, respectively) as the sole predictor of crossover variation (R2GCV = 0.046). Moreover, we observe that regions proximal to telomeric and centromeric regions have fewer motifs positively associated with crossovers (Figure 3.1). Because crossovers near centromeres increase the probability of non-disjunction events at the second meiotic division (Koehler et al. 1996), it is tempting to speculate that natural selection may have played a role in the observed paucity of recombinogenic motifs in such genomic regions. Combined, our results indicate that the centromere effect observed today in D. melanogaster may be the consequence of both direct mechanistic causes and long-term evolutionary forces that have reduced the presence of crossover-associated motifs in sub-centromeric regions.

Beyond the large-scale centromere effect and the presence of specific motifs, there is evidence that transcriptionally active genomic regions are enriched for crossovers (Adrian, Comeron 2013; Aymard et al. 2014) and that many DSB-enriched regions are nuclease sensitive and influenced by nucleosome dynamics (Ohta 1994; Wu, Lichten 1994; Fan, Petes 1996; Aymard et al. 2014) and/or DNA secondary structures.
Other sources of data show that recombining DNA sequences map to chromatin loops that are later tethered to the underlying synaptonemal complex and ultimately targeted by the evolutionarily conserved Spo11 protein (MEI-W68 in Drosophila) (Blat et al. 2002; Moens et al. 2002; Buhler, Borde, Lichten 2007). The relationship among transcription activity, nucleosome dynamics, DNA secondary structures, and chromatin loop size and distribution in early meiosis is not yet known. Our results show that the multifactorial effect of motif presence is an important layer of information, likely acting at a very local scale within much larger chromatin domains. This scenario would explain the abundance of recombination-associated motifs throughout many organisms, including the PRDM9-associated motif that has been shown to be important in many mammalian systems; it further recognizes that motifs, transcription and chromatin structures must work together to initiate recombination. Some of the variation in crossover distribution is still unexplained, and we propose that models of crossover localization would benefit from linking genetic and mechanistic explanations to the observed heritable but epigenetic variability across genomes. Analyses investigating the influence of environmental effects on transcription, chromatin accessibility and crossover distribution using controlled crosses may provide key information for developing more precise models.

3.4 Methods

3.4.1 Motif Landscape Generation

Studies designed to assess the potential explanatory power of motif presence in describing recombination events at the genome-wide level first require the generation of motif landscapes, a bioinformatic exercise that is not trivial. The difficulties in generating such landscapes include the need for motif detection algorithms able to recognize probabilistic models of motif sequences instead of consensus ‘words’ (Schneider 2002; Hu, Li, Kihara 2005; D'haeseleer 2006; Das, Dai 2007; Hartmann et al.
2013), the inclusion of background nucleotide and dinucleotide composition (Simcha, Price, Geman 2012), and the numerous false positives expected in any large-scale genomic study. To estimate motif frequencies across chromosomes, we developed a suite of custom Python scripts that take as input a position frequency matrix (PFM) generated by MEME (Bailey et al. 2009), apply a sliding-window approach to estimate the likelihood that each stretch of DNA contains the MEME-identified motif, and finally apply an FDR to classify a sequence as a motif. We used the top 20 MEME PFMs by E-value generated in (Comeron, Ratnappan, Bailin 2012), which identified motifs enriched in sequences containing a crossover event within 500 bp.

In more detail, motifs were first identified using a sliding-window approach as strings of k-mer nucleotides matching the input PFM at an initial probability (L_i, at genomic position i) greater than the arbitrary threshold of 1 × 10^-30 (allowing approximately three complete mismatches). We then generated a null distribution of L_i (RL_i) from randomly shuffled genome sequence. Shuffling was carried out both by randomly shuffling each genomic or chromosomal nucleotide individually and by shuffling in pairs to preserve dinucleotide structure, the latter giving a more realistic null approximation (Ding, Lorenz, Chuang 2012). From this list of RL_i probabilities, we generated a null distribution of likelihoods that allows estimating the highest values at any desired FDR (e.g., an FDR of 1% would allow 1% of RL_i > L_{FDR-0.01}). We chose to utilize the 1% FDR-corrected estimates because those estimates yielded realistic approximations of frequency, low noise, and maximal correlation with rates of crossing over. Stretches of k-mer bases with L_i > L_{FDR-0.01} were then scored as motif occurrences.
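As an illustration only, the scanning and thresholding steps described above can be sketched in a few lines of Python. This is not the script suite used in this study: the function names, the pseudocount added to each PFM cell, the single shuffled replicate, and the quantile rule used to turn the null distribution into a cutoff are all assumptions made for the sketch, and "shuffling in pairs" is read here as shuffling consecutive non-overlapping 2-mers.

```python
import random

BASES = "ACGT"

def pfm_to_probs(pfm, pseudo=1e-10):
    """Turn each PFM column (dict base -> count) into probabilities.
    The pseudocount is an assumption; it keeps mismatches from giving zero."""
    probs = []
    for col in pfm:
        total = sum(col.get(b, 0.0) for b in BASES) + 4 * pseudo
        probs.append({b: (col.get(b, 0.0) + pseudo) / total for b in BASES})
    return probs

def kmer_likelihoods(seq, probs):
    """Likelihood of every k-mer (sliding by 1 bp) under the PFM.
    Assumes an upper-case sequence containing only A, C, G and T."""
    k = len(probs)
    likes = []
    for i in range(len(seq) - k + 1):
        like = 1.0
        for j in range(k):
            like *= probs[j][seq[i + j]]
        likes.append(like)
    return likes

def shuffled(seq, unit, rng):
    """Shuffle the sequence in chunks of `unit` bases
    (1 = single nucleotides, 2 = pairs)."""
    chunks = [seq[i:i + unit] for i in range(0, len(seq), unit)]
    rng.shuffle(chunks)
    return "".join(chunks)

def null_threshold(null_likes, fdr):
    """Cutoff such that at most a fraction `fdr` of null values exceed it."""
    ranked = sorted(null_likes)
    idx = min(len(ranked) - 1, int((1.0 - fdr) * len(ranked)))
    return ranked[idx]

def call_motifs(seq, pfm, fdr=0.01, floor=1e-30, unit=2, seed=0):
    """Start positions whose likelihood exceeds both the initial floor
    and the cutoff derived from one shuffled copy of the sequence."""
    probs = pfm_to_probs(pfm)
    rng = random.Random(seed)
    null = kmer_likelihoods(shuffled(seq, unit, rng), probs)
    cutoff = max(null_threshold(null, fdr), floor)
    likes = kmer_likelihoods(seq, probs)
    return [i for i, like in enumerate(likes) if like > cutoff]
```

Calling `call_motifs(arm_sequence, pfm)` returns the start coordinates that would feed the window counts; in practice the null would be built from many shuffled replicates rather than the single one used here.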
Note that the set of FDR-corrected motifs generates PFMs that are similar but not an exact match to the initial set of seed PFMs, which is not unexpected given the limited number and genomic distribution of the original set of sequences analyzed and the influence of background nucleotide composition (see Figure S1). Additionally, when comparing motif landscapes to experimentally measured recombination rates (Comeron, Ratnappan, Bailin 2012), we masked those regions that were used in generating the motifs, though the difference between masked and unmasked genomes was small, and the data sets generated nearly identical results.

Using the positional information of each motif location, we generated sliding-window estimates of FDR-corrected motif presence, or ‘motif landscapes’, for window sizes of 100 kb across the genome. Unless noted, analyses were carried out using non-overlapping windows across the whole genome. To remove sub-centromeric and sub-telomeric regions with strongly reduced crossover rates we followed (Comeron 2014): sub-centromeric regions were assigned by starting at the centromere and moving into the chromosome arm until a minimum of 3 consecutive 100-kb windows showed crossover rates >1 cM/Mb, and sub-telomeric regions were assigned in an equivalent manner. Chromosome positions and gene annotations were based on the D. melanogaster dm3 assembly and annotation release 5.47 (http://flybase.org/). Sequence logos were generated using the R package seqLogo (Bembom 2007). Unless noted, landscape graphics and statistical analyses were conducted in the R programming language.

3.4.2 Model Generation and Attribute Selection

Multiple linear, isotonic, and LASSO regression models were created in R or Java programming environments using the isoreg and glmnet packages and the WEKA v3.6.1 software package [(Hall et al. 2009); http://www.cs.waikato.ac.nz/ml/weka/].
For these models, all motif variables were included in the analysis unless otherwise removed by the model. LASSO (Least Absolute Shrinkage and Selection Operator) (Tibshirani 1996; Hastie et al. 2009) is a data mining technique that favors solutions with fewer parameter values under a linear model, simultaneously performing variable selection and simplifying model interpretation. LASSO tends to produce coefficients that are small or zero, and attribute selection is given by the attributes with nonzero coefficients. The intensity of regularization (or shrinkage) within LASSO is controlled by the regularization/shrinkage parameter (λ). Unless noted, we used WEKA software (v.3.6) to apply LASSO and used a λ that minimizes the cross-validated mean squared error plus 0.5 standard error.

We also utilized the WEKA implementation of Random Forests (RF) for classification. RF is a nonparametric approach useful for detecting associations when there are large numbers of predictor variables, each of which may have relatively weak effects (Breiman 2001; Banfield et al. 2007). Briefly, RF classification constructs a collection of many independent decision trees, sampling both the data and attributes randomly with replacement. The remaining, unused data are classified using the collection of trees, with the classification of each item being based upon the mode of the results across the RF. Here, we generated 1000 trees of unrestricted depth with Log2(Attribute Number) + 1 random attributes in each individual tree. When calculating probabilities for RF estimates, our simulations were limited to 250 trees in order to speed computation at the expense of increased variance, and therefore represent conservative estimates. We generated Random Forests models with an increasing number of variables to determine the effect of additional motifs on the model.
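The bookkeeping behind the RF description above (bootstrap sampling, the unused "out-of-bag" items, the number of random attributes per tree, and the majority vote) can be sketched as follows. This is a hedged illustration, not the WEKA implementation: the function names are hypothetical, no decision trees are actually grown, and truncating Log2(Attribute Number) to an integer before adding 1 is an assumption.

```python
import math
import random
from collections import Counter

def n_random_attributes(n_attributes):
    """Attributes considered per tree: Log2(attribute number) + 1.
    Truncation to an integer is an assumption of this sketch."""
    return int(math.log2(n_attributes)) + 1

def bootstrap_indices(n_items, rng):
    """Draw n_items indices with replacement (the training set of one tree)."""
    return [rng.randrange(n_items) for _ in range(n_items)]

def out_of_bag(n_items, sample):
    """Indices never drawn for this tree; these are the 'unused' items."""
    return sorted(set(range(n_items)) - set(sample))

def forest_vote(tree_predictions):
    """Final class for one item: the mode of the individual tree votes."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

For example, with 22 predictor attributes (a hypothetical count), `n_random_attributes(22)` gives 5, and `forest_vote(["A", "J", "J"])` returns the class `"J"`.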
Each model generated was evaluated using 10-fold cross validation and tested against a ZeroR null model, which classifies all instances solely based on the majority (mode) class. In all cases, the Random Forest model performed significantly better at classification (two-tailed t-test, P < 0.05) than the null model unless otherwise noted. In order to select the best features for use in model generation, we ranked all features by the information gain criterion implemented in WEKA. Information gain is a measure of the contribution of a particular feature to the model. After ranking, we extracted the top features. When referring to a number of variables, we include only that number of topmost ranked attributes.

We applied multivariate adaptive regression splines (MARS) (Friedman 1991; Friedman, Roosen 1995; Hastie et al. 2009) using the software suite Salford Predictive Modeler (v.7) from Salford Systems (http://www.salford-systems.com). MARS is a form of regression analysis that splits predictive variables into several intervals, allows potentially nonlinear relationships over different intervals (basis functions), and combines the individual models into a final quantitative and predictive model. Importantly, MARS also allows for any degree of interaction between variables. The quality of MARS models can be ascertained using the generalized cross-validation (GCV) criterion (Craven, Wahba 1979), with GCV evaluating the fit of the model while penalizing complexity, and the optimal model being the one with the lowest GCV score. The measure R2GCV is the MARS estimate of how well this model would perform on new data. Alternative estimates of model quality such as the standard R2 consider only the fit of the model to the data, can be subject to substantial overfitting, and are not reported (we obtained R2 > R2GCV in all cases).
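To make the relationship between R2 and R2GCV concrete, the sketch below computes both from observed and fitted values using the GCV criterion in its usual form, GCV = (RSS/N) / (1 - C/N)^2, where C is the effective number of parameters. The exact definition of C and of R2GCV inside Salford Predictive Modeler is not reproduced here; taking C as a user-supplied number and comparing against an intercept-only model with C = 1 are assumptions of this illustration.

```python
def gcv(residual_ss, n_obs, effective_params):
    """Generalized cross-validation score: mean squared residual
    inflated by a penalty that grows with model complexity."""
    if effective_params >= n_obs:
        raise ValueError("effective_params must be smaller than n_obs")
    return (residual_ss / n_obs) / (1.0 - effective_params / n_obs) ** 2

def sums_of_squares(y, fitted):
    """Return (residual SS of the model, SS around the mean of y)."""
    mean_y = sum(y) / len(y)
    rss_model = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))
    rss_null = sum((obs - mean_y) ** 2 for obs in y)
    return rss_model, rss_null

def r2(y, fitted):
    """Standard R2: fit to the data only, no complexity penalty."""
    rss_model, rss_null = sums_of_squares(y, fitted)
    return 1.0 - rss_model / rss_null

def r2_gcv(y, fitted, effective_params):
    """R2 analogue based on GCV, relative to an intercept-only model
    (assumed here to use one effective parameter)."""
    rss_model, rss_null = sums_of_squares(y, fitted)
    n = len(y)
    return 1.0 - gcv(rss_model, n, effective_params) / gcv(rss_null, n, 1.0)
```

Under these definitions the penalty can only lower the estimate, so `r2_gcv` never exceeds `r2` whenever the model uses at least one effective parameter, which is consistent with the ordering reported above.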
Because the number of data points is not very large (n=1,191 independent 100-kb regions), we avoided partitioning the data into training and test samples and, instead, applied cross-validation and the MARS legacy modes to estimate the optimal model and its performance based on the GCV criterion (Friedman 1991). Because under the legacy mode MARS builds a sequence of models using all available data (and could therefore overestimate the performance of the model), we show MARS results based on 10-fold cross-validation to train and test the classifiers unless specifically noted (in all cases, the 10-fold validation mode generates smaller R2GCV than the legacy mode). Finally, variable importance under MARS was measured by the Gini index (Breiman et al. 1984) and is shown as the relative importance (percentage) of each variable compared to the best one.

3.4.3 Population-scaled high-resolution crossover maps

We calculated the population-scaled recombination rate for two African populations of D. melanogaster using the program LDhelmet (Chan, Jenkins, Song 2012). LDhelmet is a statistical method that allows estimating fine-scale recombination rates across genomes based on patterns of linkage disequilibrium, where the parameter estimated is the population-scaled crossover rate per bp and generation (ρLD); ρLD = 2 Ne r, where Ne is the effective population size of the population and r is the rate of crossover per bp and generation in females. Note, therefore, that estimates of ρLD by LDhelmet represent historic estimates of crossover for the population or species under analysis. We analyzed the Rwanda (RG) and Zambia (ZI) populations because both are from the sub-Saharan ancestral range of D. melanogaster, which minimizes the non-equilibrium effects caused by the recent expansion observed in western African and non-African D. melanogaster populations, and because both show low levels of admixture (Pool et al. 2012; Lack et al. 2015).
Moreover, RG and ZI represent a relatively large sample of strains with no chromosomal inversions (Pool et al. 2012; Lack et al. 2015). We obtained the genomic sequences from the Drosophila Genome Nexus (Lack et al. 2015) and analyzed only strains with no evidence of chromosomal inversions in any chromosomal arm. In total our analysis included 19 RG sequences (RG10, RG13N, RG15, RG19, RG22, RG24, RG28, RG2, RG32N, RG33, RG34, RG35, RG38N, RG39, RG4N, RG6N, RG7, RG8) and 20 ZI sequences (ZI184, ZI250, ZI252, ZI271, ZI311N, ZI320, ZI324, ZI332, ZI344, ZI378, ZI386, ZI398, ZI402, ZI418N, ZI420, ZI455N, ZI457, ZI477, ZI517, ZI85). Unless noted, we used results from analyzing the RG population because it combines a relatively large sample of strains with no chromosomal inversions and the lowest, well-characterized levels of admixture (Pool et al. 2012), thus allowing the masking of admixture regions. Following (Chan, Jenkins, Song 2012) we applied a block penalty of 50 and an effective mutation rate of 0.006 per base pair. The data were divided into blocks of 1000 SNPs, with 200 SNPs of overlap. For each block we ran LDhelmet for 500,000 iterations after 100,000 iterations of burn-in. Recombination maps for each chromosomal arm were analyzed as non-overlapping adjacent windows of 100 kb. At the 100-kb scale, RG and ZI show highly correlated, albeit different, population-based crossover maps (Spearman’s ρ = 0.76; P < 1 × 10^-16).

3.5 Supplementary Information

Table 3.1S Statistics of motif presence in D. melanogaster.

Table 3.2S Summary of MARS models of crossover distribution in D. melanogaster.

Figure 3.5S Effect of false discovery thresholds on motif presence and sequence. A) Estimates of motif 1 (M1) presence across chromosome arm 2R. Purple line indicates presence of the motif (P < 0.05) with no FDR correction. Green and red lines indicate motif presence after applying 5% and 1% FDR correction.
Blue line indicates motif presence based on a match to the strict consensus sequence with no FDR correction. B-E) Motif logos formed from the collected motif matches from each of the above estimates of motif presence.

Figure 3.6S Motif logos and individual Spearman’s correlation with crossover rates. Motif logos are numbered by descending E-value of seed PFMs in (Comeron, et al. 2012). Motif logos are based on the position frequency matrix of motifs assigned as present at a 1% FDR. Logos are presented with positions (y-axes) weighted by information content (x-axes) per site. Spearman’s ρ of motif presence and crossover rates is based on non-overlapping 100-kb regions across the whole genome. RG and ZI indicate results using high-resolution population estimates of crossover rates from the Rwanda (RG) and Zambia (ZI) populations, respectively (see Methods for details). Exp.Rec. indicates results based on experimentally measured crossover rates (Comeron, et al. 2012) after removing the sequences used to identify crossover-enriched motifs.

Figure 3.7S Correlation between motif presence and crossover rates for different chromosome arms. Spearman’s ρ of motif presence and crossover rates for each chromosome arm separately, based on non-overlapping 100-kb regions. Only correlations with P < 0.05 are shown.

Figure 3.8S LASSO coefficient paths. Coefficient paths for a LASSO model across lambda (regularization) values (x-axis). Motifs in the figure legend are shown in the order in which they enter the model (M3 first, M5 second, etc.).

3.5 Chapter 3 References

Adrian, A, J Comeron. 2013. The Drosophila early ovarian transcriptome provides insight to the molecular causes of recombination rate variation across genomes. BMC Genomics 14.
Aymard, F, B Bugler, CK Schmidt, et al. 2014. Transcriptionally active chromatin recruits homologous recombination at DNA double-strand breaks. Nat Struct Mol Biol 21:366-374.
Bailey, TL, M Boden, FA Buske, M Frith, CE Grant, L Clementi, J Ren, WW Li, WS Noble. 2009. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 37:W202-208.
Banfield, RE, LO Hall, KW Bowyer, WP Kegelmeyer. 2007. A comparison of decision tree ensemble creation techniques. IEEE Trans Pattern Anal Mach Intell 29:173-180.
Baudat, F, J Buard, C Grey, A Fledel-Alon, C Ober, M Przeworski, G Coop, B de Massy. 2010. PRDM9 Is a Major Determinant of Meiotic Recombination Hotspots in Humans and Mice. Science 327:836-840.
Beadle, GW. 1932. A possible influence of the spindle fibre on crossing-over in Drosophila. Proc Natl Acad Sci U S A 18:160-165.
Bembom, O. 2007. seqLogo: An R package for plotting DNA sequence logos. R package version 1.30.0.
Bessoltane, N, C Toffano-Nioche, M Solignac, F Mougel. 2012. Fine scale analysis of crossover and non-crossover and detection of recombination sequence motifs in the honeybee (Apis mellifera). PLoS ONE 7:e36229.
Blat, Y, RU Protacio, N Hunter, N Kleckner. 2002. Physical and functional interactions among basic chromosome organizational features govern early steps of meiotic chiasma formation. Cell 111:791-802.
Blumental-Perry, A, D Zenvirth, S Klein, I Onn, G Simchen. 2000. DNA motif associated with meiotic double-strand break regions in Saccharomyces cerevisiae. EMBO Rep 1:232-238.
Borde, V, B de Massy. 2013. Programmed induction of DNA double strand breaks during meiosis: setting up communication between DNA and the chromosome structure. Curr Opin Genet Dev 23:147-155.
Breiman, L. 2001. Random Forests. Machine Learning 45:5-32.
Breiman, L, J Friedman, R Olshen, C Stone. 1984. Classification and regression trees. Boca Raton: CRC Press.
Brooks, LD. 1988. The evolution of recombination rates. In: RE Michod, BR Levin, editors. The evolution of sex. Sunderland, MA: Sinauer Associates. p. 87-105.
Buhler, C, V Borde, M Lichten. 2007.
Mapping meiotic single-strand DNA reveals a new landscape of DNA double-strand breaks in Saccharomyces cerevisiae. PLoS Biol 5:e324.
Chan, AH, PA Jenkins, YS Song. 2012. Genome-wide fine-scale recombination rate variation in Drosophila melanogaster. PLoS Genet 8:e1003090.
Comeron, JM. 2014. Background selection as baseline for nucleotide variation across the Drosophila genome. PLoS Genet 10:e1004434.
Comeron, JM, R Ratnappan, S Bailin. 2012. The many landscapes of recombination in Drosophila melanogaster. PLoS Genet 8:e1002905.
Coop, G, X Wen, C Ober, JK Pritchard, M Przeworski. 2008. High-resolution mapping of crossovers reveals extensive variation in fine-scale recombination patterns among humans. Science 319:1395-1398.
Craven, P, G Wahba. 1979. Smoothing noisy data with spline functions. Numerische Mathematik 31:377-403.
Cromie, GA, RW Hyppa, HP Cam, JA Farah, SI Grewal, GR Smith. 2007. A discrete class of intergenic DNA dictates meiotic DNA break hotspots in fission yeast. PLoS Genet 3:e141.
D'haeseleer, P. 2006. How does DNA sequence motif discovery work? Nature Biotechnology 24.
Das, M, H-K Dai. 2007. A survey of DNA motif finding algorithms. BMC Bioinformatics 8:S21.
Ding, Y, W Lorenz, J Chuang. 2012. CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences. BMC Bioinformatics 13:32.
Dlakic, M, RE Harrington. 1998. Unconventional helical phasing of repetitive DNA motifs reveals their relative bending contributions. Nucleic Acids Res 26:4274-4279.
Dumont, BL, MA White, B Steffy, T Wiltshire, B Payseur. 2011. Extensive recombination rate variation in the house mouse species complex inferred from genetic linkage maps. Genome Res 21:114-125.
Fan, Q-Q, TD Petes. 1996. Relationship between nuclease-hypersensitive sites and meiotic recombination hot spot activity at the HIS4 locus of Saccharomyces cerevisiae. Mol Cell Biol 16:16.
Fledel-Alon, A, EM Leffler, Y Guan, M Stephens, G Coop, M Przeworski. 2011.
Variation in human recombination rates and its genetic determinants. PLoS ONE 6:e20321.
Fowler, KR, M Sasaki, N Milman, S Keeney, GR Smith. 2014. Evolutionarily diverse determinants of meiotic DNA break and recombination landscapes across the genome. Genome Res 24:1650-1664.
Friedman, JH. 1991. Multivariate adaptive regression splines. The annals of statistics:1-67.
Friedman, JH, CB Roosen. 1995. An introduction to multivariate adaptive regression splines. Stat Methods Med Res 4:197-217.
Gerton, JL, J DeRisi, R Shroff, M Lichten, PO Brown, TD Petes. 2000. Global mapping of meiotic recombination hotspots and coldspots in the yeast Saccharomyces cerevisiae. Proc Natl Acad Sci U S A 97:11383-11390.
Gossmann, TI, AW Santure, BC Sheldon, J Slate, K Zeng. 2014. Highly variable recombinational landscape modulates efficacy of natural selection in birds. Genome Biol Evol 6:2061-2075.
Hall, M, E Frank, G Holmes, B Pfahringer, P Reutemann, IH Witten. 2009. The WEKA data mining software: an update. SIGKDD Explor. Newsl. 11:10-18.
Hartmann, H, EW Guthöhrlein, M Siebert, S Luehr, J Söding. 2013. P-value based regulatory motif discovery using positional weight matrices. Genome Res 23:181-194.
Hastie, T, R Tibshirani, J Friedman. 2009. The elements of statistical learning. Springer.
Heil, CS, MA Noor. 2012. Zinc finger binding motifs do not explain recombination rate variation within or between species of Drosophila. PLoS ONE 7:e45055.
Herbert, A, A Rich. 1996. The biology of left-handed Z-DNA. J Biol Chem 271:11595-11598.
Hinch, AG, A Tandon, N Patterson, et al. 2011. The landscape of recombination in African Americans. Nature 476:170-175.
Hizver, J, H Rozenberg, F Frolow, D Rabinovich, Z Shakked. 2001. DNA bending by an adenine-thymine tract and its role in gene regulation. Proc Natl Acad Sci U S A 98:8490-8495.
Hu, J, B Li, D Kihara. 2005. Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res 33:4899-4913.
Hussin, J, MH Roy-Gagnon, R Gendron, G Andelfinger, P Awadalla. 2011. Age-dependent recombination rates in human pedigrees. PLoS Genet 7:e1002251.
Jeffreys, AJ, R Neumann. 2002. Reciprocal crossover asymmetry and meiotic drive in a human recombination hot spot. Nat Genet 31:267-271.
Jeffreys, AJ, R Neumann. 2005. Factors influencing recombination frequency and distribution in a human meiotic crossover hotspot. Human Mol Genet 14:2277-2287.
Jiang, H, N Li, V Gopalan, et al. 2011. High recombination rates and hotspots in a Plasmodium falciparum genetic cross. Genome Biol 12:R33.
Keeney, S. 2001. Mechanism and control of meiotic recombination initiation. Curr Topics Dev Biol 52:1-53.
Kim, S, V Plagnol, TT Hu, C Toomajian, RM Clark, S Ossowski, JR Ecker, D Weigel, M Nordborg. 2007. Recombination and linkage disequilibrium in Arabidopsis thaliana. Nat Genet 39:1151-1155.
Kleckner, N. 2006. Chiasma formation: chromatin/axis interplay and the role(s) of the synaptonemal complex. Chromosoma 115:175-194.
Koehler, KE, CL Boulton, HE Collins, RL French, KC Herman, SM Lacefield, LD Madden, CD Schuetz, RS Hawley. 1996. Spontaneous X chromosome MI and MII nondisjunction events in Drosophila melanogaster oocytes have different recombinational histories. Nat Genet 14:406-414.
Kong, A, DF Gudbjartsson, J Sainz, et al. 2002. A high-resolution recombination map of the human genome. Nat Genet 31:241-247.
Kong, A, G Thorleifsson, DF Gudbjartsson, et al. 2010. Fine-scale recombination rate differences between sexes, populations and individuals. Nature 467:1099-1103.
Koszul, R, KP Kim, M Prentiss, N Kleckner, S Kameoka. 2008. Meiotic chromosomes move by linkage to dynamic actin cables with transduction of force through the nuclear envelope. Cell 133:1188-1201.
Kulathinal, RJ, SM Bennett, CL Fitzpatrick, MA Noor. 2008. Fine-scale mapping of recombination rate in Drosophila refines its correlation to diversity and divergence. Proc Natl Acad Sci U S A 105:10051-10056.
Lack, JB, CM Cardeno, MW Crepeau, W Taylor, RB Corbett-Detig, KA Stevens, CH Langley, JE Pool. 2015. The Drosophila genome nexus: a population genomic resource of 623 Drosophila melanogaster genomes, including 197 from a single ancestral range population. Genetics 199:1229-1241.
Lee, JW, JB Lee, M Park, SH Song. 2005. An extensive comparison of recent classification tools applied to microarray data. Comput Stat Data Anal 48:869-885.
Lindsley, DL, GG Zimm. 1992. The genome of Drosophila melanogaster. San Diego, CA: Academic Press.
Liu, H, X Zhang, J Huang, JQ Chen, D Tian, LD Hurst, S Yang. 2015. Causes and consequences of crossing-over evidenced via a high-resolution recombinational landscape of the honey bee. Genome Biol 16:15.
Llopart, A. 2015. Parallel faster-X evolution of gene expression and protein sequences in Drosophila: beyond differences in expression properties and protein interactions. PLoS ONE 10:e0116829.
MacIsaac, KD, E Fraenkel. 2006. Practical strategies for discovering regulatory DNA sequence motifs. PLoS Comput Biol 2:e36.
Mancera, JM, L Vargas-Chacoff, A Garcia-Lopez, A Kleszczynska, H Kalamarz, G Martinez-Rodriguez, E Kulczykowska. 2008. High density and food deprivation affect arginine vasotocin, isotocin and melatonin in gilthead sea bream (Sparus auratus). Comp Biochem Physiol A Mol Integr Physiol 149:92-97.
McGaugh, SE, CS Heil, B Manzano-Winkler, L Loewe, S Goldstein, TL Himmel, MA Noor. 2012. Recombination modulates how selection affects linked sites in Drosophila. PLoS Biol 10:e1001422.
Meisel, RP, T Connallon. 2013. The faster-X effect: integrating theory and data. Trends Genet 29:537-544.
Miller, DE, S Takeo, K Nandanan, et al. 2012. A Whole-Chromosome Analysis of Meiotic Recombination in Drosophila melanogaster. G3 (Bethesda) 2:249-260.
Mirouze, M, M Lieberman-Lazarovich, R Aversano, E Bucher, J Nicolet, J Reinders, J Paszkowski. 2012. Loss of DNA methylation affects the recombination landscape in Arabidopsis.
Proc Natl Acad Sci U S A.
Moens, PB, NK Kolas, M Tarsounas, E Marcon, PE Cohen, B Spyropoulos. 2002. The time course and chromosomal localization of recombination-related proteins at meiosis in the mouse are compatible with models that can resolve the early DNA-DNA interactions without reciprocal recombination. J Cell Sci 115:1611-1622.
Morton, NE, DC Rao, S Yee. 1976. An inferred chiasma map of Drosophila melanogaster. Heredity (Edinb) 37:405-411.
Muñoz-Fuentes, V, A Di Rienzo, C Vilà. 2011. Prdm9, a major determinant of meiotic recombination hotspots, is not functional in dogs and their wild relatives, wolves and coyotes. PLoS ONE 6:e25498.
Myers, S, L Bottolo, C Freeman, G McVean, P Donnelly. 2005. A fine-scale map of recombination rates and hotspots across the human genome. Science 310:321-324.
Myers, S, C Freeman, A Auton, P Donnelly, G McVean. 2008. A common sequence motif associated with recombination hot spots and genome instability in humans. Nat Genet 40:1124-1129.
Neel, JV. 1941. A relation between larval nutrition and the frequency of crossing over in the third chromosome of Drosophila melanogaster. Genetics 26:506-516.
Ohta, KS, T.; Nicolas, A. 1994. Changes in chromatin structure at recombination initiation sites during yeast meiosis. EMBO 13:9.
Oliver, PL, L Goodstadt, JJ Bayes, Z Birtle, KC Roach, N Phadnis, SA Beatson, G Lunter, HS Malik, CP Ponting. 2009. Accelerated Evolution of the Prdm9 Speciation Gene across Diverse Metazoan Taxa. PLoS Genet 5:e1000753.
Pan, J, S Keeney. 2007. Molecular cartography: mapping the landscape of meiotic recombination. PLoS Biol 5:e333.
Pan, J, M Sasaki, R Kniewel, et al. 2011. A hierarchical combination of factors shapes the genome-wide topography of yeast meiotic recombination initiation. Cell 144:719-731.
Parsons, PA. 1988. Evolutionary rates: effects of stress upon recombination. Biol J Linn Soc 35:49-68.
Parvanov, ED, PM Petkov, K Paigen. 2010. Prdm9 Controls Activation of Mammalian Recombination Hotspots.
Science 327:835.
Parvinen, M, KO Soderstrom. 1976. Chromosome rotation and formation of synapsis. Nature 260:534-535.
Perez-Martin, J, V de Lorenzo. 1997. Clues and consequences of DNA bending in transcription. Annu Rev Microbiol 51:593-628.
Petes, TD. 2001. Meiotic recombination hot spots and cold spots. Nat Rev Genet 2:360-369.
Pool, J, R Corbett-Detig, R Sugino, et al. 2012. Population Genomics of sub-Saharan Drosophila melanogaster: African diversity and non-African admixture. PLoS Genet 8:e1003080.
Ross, JA, DC Koboldt, JE Staisch, HM Chamberlin, BP Gupta, RD Miller, SE Baird, ES Haag. 2011. Caenorhabditis briggsae recombinant inbred line genotypes reveal inter-strain incompatibility and the evolution of recombination. PLoS Genet 7:e1002174.
Schneider, TD. 2002. Consensus Sequence Zen. Appl Bioinformatics 1.
Segal, E, Y Fondufe-Mittendorf, L Chen, A Thastrom, Y Field, IK Moore, JP Wang, J Widom. 2006. A genomic code for nucleosome positioning. Nature 442:772-778.
Ségurel, L, EM Leffler, M Przeworski. 2011. The Case of the Fickle Fingers: How the PRDM9 Zinc Finger Protein Specifies Meiotic Recombination Hotspots in Humans. PLoS Biol 9:e1001211.
Shibuya, H, K Ishiguro, Y Watanabe. 2014. The TRF1-binding protein TERB1 promotes chromosome movement and telomere rigidity in meiosis. Nat Cell Biol 16:145-156.
Simcha, D, ND Price, D Geman. 2012. The Limits of De Novo DNA Motif Discovery. PLoS ONE 7:e47836.
Singh, ND, EA Stone, CF Aquadro, AG Clark. 2013. Fine-scale heterogeneity in crossover rate in the Garnet-Scalloped region of the Drosophila melanogaster X Chromosome. Genetics.
Smukowski, CS, MA Noor. 2011. Recombination rate variation in closely related species. Heredity (Edinb) 107:496-508.
Steiner, WW, EM Steiner, AR Girvin, LE Plewik. 2009. Novel nucleotide sequence motifs that produce hotspots of meiotic recombination in Schizosaccharomyces pombe. Genetics 182:459-469.
Stevison, LS, AE Woerner, JM Kidd, JL Kelley, KR Veeramah, KF McManus, CD Bustamante, MF Hammer, JD Wall. 2015. The time-scale of recombination rate evolution in great apes. bioRxiv doi:10.1101/013755.
Tibshirani, R. 1996. Regression Shrinkage and Selection via the Lasso. J. R. Stat Soc. B-Methodological 58:267-288.
Travers, AA. 1990. Why bend DNA? Cell 60:177-180.
Treco, D, N Arnheim. 1986. The evolutionarily conserved repetitive sequence d(TG.AC)n promotes reciprocal exchange and generates unusual recombinant tetrads during yeast meiosis. Mol Cell Biol 6:3934-3947.
Wu, T-C, M Lichten. 1994. Meiosis-induced double-strand break sites determined by yeast chromatin structure. Science 263:3.
Yelina, NE, K Choi, L Chelysheva, et al. 2012. Epigenetic remodeling of meiotic crossover frequency in Arabidopsis thaliana DNA methyltransferase mutants. PLoS Genet 8:e1002844.

CHAPTER 4: Environmental contribution to recombination variation

4.1 Introduction

Since the early 1900s, researchers have recognized that meiotic recombination exhibits an astounding degree of variation. This variation occurs at several levels, from differences within a single individual’s genome to large differences between individuals of the same species. These differences are predicted to alter the amount of standing polymorphism throughout the genome, and for that reason, variation in recombination rates is an important feature that affects the evolution of a species. Given the substantial amount of variation in crossing over rates, one might be surprised to find that the meiotic machinery and the process of meiosis are highly conserved (Keeney 2001). In fact, the process of meiosis is pervasive throughout all animal life on earth, with relatively few exceptions. Given the apparent paradox of near-universal conservation of the recombination process alongside vast variation in the distribution of recombination events, progress in explaining this variation has been slow.
Many factors have been implicated in shaping the distribution of recombination events, including meiotic transcription (Adrian and Comeron 2013), epigenetic marks (Mirouze, Lieberman-Lazarovich et al. 2012, Yelina, Choi et al. 2012), nucleotide composition (Miller, Takeo et al. 2012), and sequence motifs (Steiner, Steiner et al. 2009, Baudat, Buard et al. 2010), among others. Despite significant effort, models predicting rates of crossing over remain incomplete (Adrian, Cruz Corchado, and Comeron, in review) and require additional data to incorporate thermodynamic considerations of the recombination pathway.

Early experiments by Kidwell and others (Detlefsen and Clemente 1923, Stern 1926, Bridges 1927, Kidwell 1972) identified numerous modulators of recombination rates. Aside from differences in recombination rates between species and among populations of a single species, certain environmental factors were quickly recognized to have a strong effect on genome-wide recombination rates. In particular, maternal age, temperature, and food quality were found to affect genome-wide recombination rates (Bridges 1915, Charlesworth & Charlesworth 1985, Plough 1917). The same regions of the genome would display different linkage maps among populations and under different conditions. These observations raised an abundant assortment of questions. Are there factors that decrease recombination rates, in addition to the many factors that appear to increase them? Are the factors shaping recombination responses to different environments the same as the ones controlling recombination rates at other levels? Does the distribution of recombination events change, or does a proportional change occur across the genome?

Recently, researchers have assumed that rearing Drosophila at different temperatures would change overall recombination rates proportionally, but not the distribution of the events between phenotypic markers (Singh 2014).
Whether this assumption is valid, or whether recombination rates are altered differentially among regions, is presently unknown. Preliminary work performed in our lab in 2010 suggested that both overall rates and the genomic distribution of recombination events would be altered. Furthermore, given a collection of previous observations by our lab and others, we felt it was important to understand how recombination rates change in a few specific cases of environmental alteration. While early studies identified a collection of environmental factors that affect recombination, no one had described how the distribution and amount of recombination change in response to any factor at a genome-wide, high-resolution (500 kb or less) scale. As a result, we set out to use next-generation sequencing to rapidly generate high-resolution maps of crossing over under several conditions that wild Drosophila would naturally encounter: elevated temperature, decreased temperature, and acetic acid stress.

The results reported herein are the first of a large project describing the impact of environmental conditions on recombination rates. As a preliminary step, we have created a broad-scale genetic map of flies exposed to 0.5% acetic acid in their growth medium at two time frames. Thus far, I have generated more than 500 single-meiosis library preparations for sequencing. As sequencing data are not available at the time of writing this thesis, I will instead show the data from the crosses performed and elaborate upon the preliminary studies and the development of new protocols facilitating the high-throughput generation of fine-scale recombination maps.

4.2 Results & Discussion

4.2.1 Identification of Stress-Inducing Conditions

As we wanted to assay recombination rates during natural stress, I chose to examine how both age-related and acid-related stress affect recombination rates in Drosophila.
Acetic acid is the final fermentation product of rotting fruits and is naturally present in many food sources that Drosophila feed upon. Drosophila are attracted to the odor of vinegar at low concentration and repelled by the excessively strong odor of more concentrated vinegar (Ai, Min et al. 2010). Rotting bananas, a favorite of Drosophilids cohabitating with humans, have been shown to contain 0.28-0.74% acetic acid per unit weight while fermenting (Omura and Honda 2003). Because acetic acid occurs commonly in the natural food of Drosophila and yet elicits clear avoidance behavior at high concentration, we chose to assay recombination rates under acetic acid stress. Notably, this very experiment (albeit without next-generation sequencing) was suggested by Gowen (1919), who observed early on that many environmental factors affected recombination rates in our favorite model fly.

We first determined what concentration of acid to use by rearing Drosophila on media containing 0, 0.5, 1, and 2% acetic acid by volume of standard corn meal agar. Concentrations of 0.5 and 1% acid produced a 20-40% reduction in the number of offspring surviving post-eclosion relative to control (Figure 4.1). Interestingly, although there was no statistically significant difference in offspring number between 0.5% and 1% acid (P > 0.05), bottles containing 2% acetic acid produced a maximum of one living fly across three experiments. We therefore chose to continue our experiments with 0.5% acetic acid.

Figure 4.1 Surviving offspring in media containing acetic acid. Standard corn meal agar with 0, 0.5, 1, and 2% acetic acid by volume. Error bars indicate +/- 1 standard error across three experiments. Blue bars: strain 6036. Red bars: strain Raleigh 375.

4.2.2 Recombination Frequency among Phenotypic Markers

I crossed mutant flies containing five phenotypic markers on the X chromosome (strain 6036), with a Drosophila Genetic Reference Panel (DGRP) strain, 375.
These heterozygous flies were then placed on acid media with new purebred males to generate recombinant offspring (importantly, Drosophila melanogaster males do not exhibit recombination). Flies were allowed to lay eggs for three days before being moved to new acid media for days 4-7, and then transferred to a final bottle of acid media. This scheme allowed us to assay recombination rates of acid-stressed flies and also to observe age-related stress effects.

Among 4648 male flies scored independently for the presence or absence of five markers, we observed, on average, a significant (χ² P < 0.05 in all cases) reduction in recombination rates across the y-w, w-ct, and ct-m intervals (Figure 4.2) following correction for interference (Kosambi 1944). The m-f interval displayed no significant reduction following correction. To my knowledge, this is the first report of a reduction in recombination rates following the application of an environmental ‘stress’ to an organism during meiosis. Notably, however, our mutant-carrying strain 6036 displays reduced fitness relative to wild type because its phenotypic markers affect several developmental processes, namely eclosion (Figure 4.1). For this reason, we will carry out recombination assays using Illumina sequencing on recombinant, but not phenotypically affected, females. This approach will allow an unbiased assay of recombination rates across the complete genome under stress.

If the observed reduction is supported by our sequencing results, however, the implications bode poorly for a population of flies experiencing this stress: a genome-wide reduction in recombination rates would be expected to reduce the capacity of the population to adapt to its present conditions. The present assay is limited to the X chromosome, and it is likely that other chromosomes display different alterations in recombination rates.
Nevertheless, why these particular regions exhibit repression of recombination is interesting, and could potentially indicate a lack of polymorphisms affecting fitness under acetic acid stress on the X chromosome.

Figure 4.2 Map distances between select phenotypic markers under acid stress. [Bar chart. Series: B1 Observed, B2 Observed, Expected (Flybase). Vertical axis: map distance (cM), 0.0 to 25.0. Horizontal axis: interval (y-w, w-ct, ct-m, m-f).] Map distances in centiMorgans among five phenotypic markers. Blue: Observed map distance for acid-stressed flies from eggs laid within the first three days post-mating. Red: Observed map distance for acid-stressed flies from eggs laid within three to seven days post-mating. Green: Expected distances obtained from Flybase.org.

4.3 Materials and Methods

4.3.1 Generation of Recombinants

Recombinant flies were generated by first crossing male RAL 375 strain D. melanogaster with virgin females of strain 6036, which carried homozygous X-linked markers for yellow, white, cut, miniature, and forked (y, w, ct, m, and f, respectively), to allow manual scoring of male recombinants and to verify that no errors were made during crosses. Crosses were kept at 23.5°C unless otherwise specified. Heterozygotes were allowed to eclose, and virgin females were crossed with new male 375 flies in vials and then placed on cornstarch media containing 0.5% acetic acid after 12 hours. Adults were transferred to new media (also containing 0.5% acetic acid) at 3 and 7 days post-eclosion. Male offspring from each bottle were scored for their recombinant phenotype and frozen alongside females at -20°C until library preparation.

Acetic acid media was prepared by liquefying prepared cornmeal media, with a small amount of distilled water added to assist melting and to replace water lost to evaporation. Milk bottles were filled with either the reliquefied cornmeal media or the media with 0.5% glacial acetic acid added. Bottles were allowed to cool and set for 24 hours before use.
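
As an illustration of how the scoring described above relates to map distance, the following minimal Python sketch converts recombinant counts for one marker interval into a recombination fraction r and a Kosambi-corrected map distance (Kosambi 1944), d = 25 ln[(1 + 2r)/(1 - 2r)] centiMorgans. The function names and the example counts are hypothetical and are not data from this study; the sketch is not the analysis code used here.

```python
import math


def recombination_fraction(recombinants: int, total: int) -> float:
    """Fraction of scored males that are recombinant for one interval."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= recombinants <= total:
        raise ValueError("recombinants must lie between 0 and total")
    return recombinants / total


def kosambi_cm(r: float) -> float:
    """Kosambi map distance in centiMorgans for recombination fraction r."""
    if not 0 <= r < 0.5:
        raise ValueError("r must satisfy 0 <= r < 0.5")
    return 25.0 * math.log((1 + 2 * r) / (1 - 2 * r))


# Hypothetical counts: (recombinant males, total males scored) per interval.
example_counts = {"y-w": (12, 1000), "w-ct": (150, 1000)}

for interval, (rec, total) in example_counts.items():
    r = recombination_fraction(rec, total)
    print(f"{interval}: r = {r:.3f}, Kosambi distance = {kosambi_cm(r):.2f} cM")
```

For small r the corrected distance is close to 100r cM; the correction grows as r approaches 0.5, where the function is undefined.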
4.3.2 DNA Library Preparation

Illumina libraries were prepared using a heavily modified protocol based on Comeron et al. (2012). Many steps in the procedure had to be modified in order to yield consistently successful preparations in a high-throughput setting. Single flies were disrupted for 30 seconds at 50 cycles per second in a Qiagen TissueLyser LT. DNA was extracted from single recombinant female flies in a 96-well format with the Qiagen DNeasy 96-well kit, following the manufacturer’s insect protocol with additional proteinase K added to maximize yield. DNA extracts were quantified using a Molecular Biosystems 96-well plate reader following a modified PicoGreen-assay protocol using Promega Quant-IT reagent to detect fluorescence of dsDNA.

DNA extracts were then digested with either MboII, HpyAV, or MnlI restriction enzymes at 37°C for 50 minutes, with an inactivation at 56°C for 20 minutes. This digestion was important for two reasons. First, the use of these enzymes leaves a 3’ adenine overhang that allows for the ligation of sequence-specific adapters. Second, each restriction enzyme leaves a specific and detectable signature that can be bioinformatically identified, allowing the repeated use of the same adapter sequence with different restriction enzymes in multiplex. Illumina-specific adapters were then ligated to the digested products overnight at 16°C. The ligation was terminated at 56°C for 20 minutes, and the USER enzyme (New England Biolabs) was finally added to excise an internal uracil linker in the linked adapters.

These adapter-ligated products were grouped in sets of five by concentration, and cleaned up using a modified Serapure magnetic bead mixture (GE Healthcare) to optimally exclude excess unligated adapters. The cleaned-up reaction was run on a 0.75% agarose gel at 90 V for 80 minutes, and a 450 bp fragment was excised from each reaction and purified using the Qiagen Gel Cleanup kit following the manufacturer’s recommended protocol.
The size-selected, adapter-ligated products were then enriched via high-fidelity polymerase chain reaction with NEB Phusion HF master mix for 19 to 22 cycles, depending on initial concentration. An aliquot of the library was run on an agarose gel to confirm success and correct size distribution. Completed libraries were then quantified again on a Molecular Biosystems plate reader and adjusted to a final concentration of 15 nM. Illumina sequencing was performed on an Illumina HiSeq 2500 with a read length of 125 bp.

4.3.3 Recombination rate estimation

For each library, reads will be split based on adapter sequence and restriction enzyme signature and trimmed 3’ to eliminate the barcode sequence, and 5’ until a quality score of ten or greater is reached. Quality checks will be performed to ensure unbiased sequencing and to verify the absence of sequencing artifacts. Because of the heavily multiplexed nature of our libraries and the sensitivity of our approach to errors, we will require exact read matches to the respective genome in order to eliminate untrustworthy data. To do so, we will first loosely map reads to the Drosophila reference genome, and the mapped reads will then be re-mapped to the 6036 and 375 genomes. Heterozygous sites and ambiguous bases will be removed from the analysis. For each sample, crossover sites will be inferred by ‘switches’ to either 6036 or 375 at polymorphic sites, as per Comeron et al. (2012).

4.4 Chapter 4 References

Adrian, A. and J. Comeron (2013). "The Drosophila early ovarian transcriptome provides insight to the molecular causes of recombination rate variation across genomes." BMC Genomics 14(794).
Ai, M., S. Min, Y. Grosjean, C. Leblanc, R. Bell, R. Benton and G. S. Suh (2010). "Acid sensing by the Drosophila olfactory system." Nature 468(7324): 691-695.
Baudat, F., J. Buard, C. Grey, A. Fledel-Alon, C. Ober, M. Przeworski, G. Coop and B. de Massy (2010). "PRDM9 Is a Major Determinant of Meiotic Recombination Hotspots in Humans and Mice." Science 327(5967): 836-840.
Bridges, C. B. (1927). "The Relation of the Age of the Female to Crossing over in the Third Chromosome of Drosophila Melanogaster." J Gen Physiol 8(6): 689-700.
Comeron, J. M., R. Ratnappan and S. Bailin (2012). "The many landscapes of recombination in Drosophila melanogaster." PLoS Genet 8(10): e1002905.
Detlefsen, J. A. and L. S. Clemente (1923). "Genetic Variation in Linkage Values." Proc Natl Acad Sci U S A 9(5): 149-156.
Gowen, J. W. (1919). "A Biometrical Study of Crossing Over. On the Mechanism of Crossing Over in the Third Chromosome of Drosophila melanogaster." Genetics 4(3): 205-250.
Keeney, S. (2001). Mechanism and control of meiotic recombination initiation. Current Topics in Developmental Biology, Academic Press. Volume 52: 1-53.
Kidwell, M. G. (1972). "Genetic change of recombination value in Drosophila melanogaster. II. Simulated natural selection." Genetics 70(3): 433-443.
Kosambi, D. D. (1944). "The estimation of map distance from recombination values." Annals of Eugenics 12: 172-175.
Miller, D. E., S. Takeo, K. Nandanan, A. Paulson, M. M. Gogol, A. C. Noll, A. G. Perera, K. N. Walton, W. D. Gilliland, H. Li, K. K. Staehling, J. P. Blumenstiel and R. S. Hawley (2012). "A Whole-Chromosome Analysis of Meiotic Recombination in Drosophila melanogaster." G3 (Bethesda) 2(2): 249-260.
Mirouze, M., M. Lieberman-Lazarovich, R. Aversano, E. Bucher, J. Nicolet, J. Reinders and J. Paszkowski (2012). "Loss of DNA methylation affects the recombination landscape in Arabidopsis." Proceedings of the National Academy of Sciences.
Omura, H. and K. Honda (2003). "Feeding responses of adult butterflies, Nymphalis xanthomelas, Kaniska canace and Vanessa indica, to components in tree sap and rotting fruits: synergistic effects of ethanol and acetic acid on sugar responsiveness." J Insect Physiol 49(11): 1031-1038.
Steiner, W. W., E. M. Steiner, A. R.
Girvin and L. E. Plewik (2009). "Novel Nucleotide Sequence Motifs That Produce Hotspots of Meiotic Recombination in Schizosaccharomyces pombe." Genetics 182(2): 459-469.
Stern, C. (1926). "An effect of temperature and age on crossing-over in the first chromosome of Drosophila melanogaster." Proc Natl Acad Sci U S A 12(8): 530-532.
Yelina, N. E., K. Choi, L. Chelysheva, M. Macaulay, B. de Snoo, E. Wijnker, N. Miller, J. Drouaud, M. Grelon, G. P. Copenhaver, C. Mezard, K. A. Kelly and I. R. Henderson (2012). "Epigenetic Remodeling of Meiotic Crossover Frequency in Arabidopsis thaliana DNA Methyltransferase Mutants." PLoS Genet 8(8): e1002844.

CHAPTER 5: Conclusions & Future Directions

5.1 Transcriptional Effects on Crossover Localization

In Chapter 2, I sought to more fully understand the impact of the transcriptional environment of meiotic cells on fine-scale crossover localization. We found evidence that active transcription during early meiotic development increases the likelihood of crossovers in D. melanogaster, and similarly that reduced transcription was correlated with reduced levels of recombination. This evidence supports the idea that transcriptionally active chromatin may be more susceptible than quiescent chromatin to DSBs. Our models of crossover patterning are enriched by this study because gene expression is also nonrandomly distributed throughout the genome and displays high variability, analogous to the many levels of variability displayed by recombination rates. Therefore, gene expression could provide a molecular, heritable, and plastic mechanism to explain the observed patterns of recombination variation, from the high level of intraspecific variation to the known influence of environmental conditions.

I found that, while the tissues are highly similar, there are significant differences in gene expression between early and late meiotic tissues in the Drosophila germarium.
Many of the genes that are over- or under-expressed are involved in proteolysis, likely underscoring the requirement for rapid turnover in many stepwise developmental and organizational processes in the developing oocytes. Our analysis also produced a wealth of information concerning new genes and exons, many of which mapped to thus far unassembled regions of the Drosophila genome. This is fairly remarkable given that Drosophila is one of the most studied and best characterized organisms; these results suggest the existence of many unknown processes in Drosophila worthy of continued research. Furthermore, I showed that there are multiple parent-of-origin effects on transcription even among similar fly lines (founded from females from the same geographical location). This finding is interesting in that it speaks to the rapidity with which transcription can be altered from generation to generation during a highly conserved and regulated process, though more study is needed to explore this possibility further. Finally, I showed that transcribed genes in meiosis are not distributed randomly along the genome, but are instead frequently clustered. This clustering likely reflects the openness of chromatin domains along particular regions of the chromosome, and may contribute to the accessibility of DNA during recombination initiation.

Despite our exceptionally deep (400x per condition) transcriptional coverage of meiotic tissues, our study was technically limited in several ways. I was limited in my ability to isolate the exact cells undergoing DSB formation. In my study, I produced a transcriptome for those cells experiencing DSB formation along with associated nurse and epithelial cells; the cells of interest perhaps constitute only ~5% of the overall cell population. This limitation was due to the necessity of hand-dissecting the desired regions with electrolytically sharpened tungsten probes.
While I have since successfully microdissected out the 2a/2b regions of the germaria using laser capture microscopy, the tools to do so were unavailable at the time the study took place. A clear advancement of my study would be to utilize laser-capture microdissection in future experiments. Despite this limitation in the performed study, given the deep coverage of reads, I can be reasonably confident that non-transcribed or lowly transcribed genes represent truly silent or ‘quiet’ genes during meiosis, which supports our finding that regions of low recombination are associated with regions of repressed or absent transcription.

The exact nature of the relationship between transcription and the likelihood of recombination is as yet unknown. It remains to be seen whether transcription itself is the causative factor, or whether one or more related factors are responsible instead. Future studies should investigate the potential roles of chromatin accessibility and related factors, such as histone methylation. To me, transcription represents a plastic phenomenon that is known to be altered under varying conditions and among populations, which is the reason that I pursued the transcriptional avenue rather than, say, DNA accessibility, for which the intraspecies variance is less well studied. Additional work is required to parse out the causative agents and to reveal their underlying molecular mechanisms. For now, models including transcription appear to be reasonably suited to explaining a nontrivial amount of the variance in rates of recombination.

5.2 In silico Prediction of Recombination Rates Based on Multiple Motifs

In Chapter 3, I conducted a computational investigation of previously identified recombination-associated motifs in Drosophila. I found that a combination of motifs can explain an unprecedented one-fourth (or more) of the overall variance in genomic recombination rates.
Through the use of careful, FDR-corrected estimates of motif occurrences, I show that the presence of poly-A containing or poly-[AT] containing motifs is most significantly predictive of regions of increased recombination, that the absence of these motifs is predictive of depressed rates of recombination, and that this association is not a byproduct of local A/T base composition. Widening our model to include additional factors such as transcription further improves our predictive capabilities, though motif utilization appears to play the dominant role in determining model accuracy. These observations support a model where chromatin accessibility is key to specifying broad-range crossover localization, and further indicate that multiple DNA motifs may be an important factor in the fine-scale localization of crossing-over events. Furthermore, the independent identification of a PRDM9-like core motif suggests that there may be unappreciated commonalities between mammalian and insect systems with respect to recombination localization, a notable result given that no PRDM9 homolog has thus far been identified in Drosophila.

Importantly, the use of independent estimates of recombination through maps of linkage disequilibrium in natural populations of Drosophila supports the validity of our results (Comeron, Ratnappan et al. 2012). Since the association with these historical recombination datasets is greater than the association with the recombination data from which the motifs were generated, the most influential motifs may not be as transient as the PRDM9 motif. Moreover, our observation that these core motifs are utilized differentially among chromosome arms indicates that recombinational processes may be specified differently among the chromosomes, a finding that has not yet been reported in any species examined thus far (Adrian, Cruz Corchado, and Comeron).
Lastly, the composition of the most influential motifs speaks to a possible role in the primary structure at sites local to recombinational events and raises the possibility that certain chromatin configurations increase the likelihood of a successful DSB or resolution as a crossover.

One barrier for this study was the resolution of crossing-over events. While population estimates for D. melanogaster are accurate at a 100-kb scale, this scale is not best suited for generating predictive models for recombination rates. This is somewhat mitigated by our use of linkage disequilibrium based estimates of recombination. Furthermore, our initial identification of seed motifs was based on a small number of crossover events that were delimited by 500 base pairs or less (representing less than 5% of the overall dataset). Had recombination rates been accurate at sub-1-kb resolution, the initial enrichment calls for seed motifs would have been more accurate and perhaps more biologically informative. Thus far, such scales of crossover resolution do not exist for any organism except S. cerevisiae (Mancera, Bourgon et al. 2008). Finally, our study of motifs would have been bolstered by additional recombination data under different conditions, allowing us to more properly parse out the effects of our motifs.

I have shown that a combination of motifs and transcriptional activity is predictive of recombination rates in Drosophila. However, future studies should investigate to what extent these predictive motifs alter recombination rates, and whether they vary from species to species. Key questions remain: Are these motifs utilized in other Drosophilids? Is any one of these motifs, or any combination of them, necessary for DSB induction? Do these motifs work to alter chromatin structure for DSB induction, or are they secondary? In order to address these questions, a variety of molecular genetics and computational approaches should be employed in concert.
Only with a broad spectrum view of these many factors will we formulate an accurate model of recombination localization.

5.3 Environmental Impact on Recombination Rates

In Chapter 4, I presented the start of an investigation into how various environmental conditions affect recombination variation in Drosophila melanogaster. While it has been well established that many conditions do affect recombination rates, exactly how and to what extent is largely unknown. Recent studies have even shown that bacterial infection leads to an increase in the recombination rate between markers in Drosophila (Singh, Criscoe et al. 2015). However, little is yet known about the distribution of these additional recombinant events or whether the increase represents a genome-wide phenomenon. I sought to approach this problem systematically by examining how age, temperature, and food stress alter recombination landscapes. Herein I presented preliminary data on food stress from acetic acid, a naturally occurring product of the oxidation of ethanol that is often present in the food of wild Drosophila. From my preliminary scoring of recombinants, it appears that acetic acid may repress recombination across the X chromosome. This could, however, be an artifact of differential survival of recombinants and thus should be taken with caution until less biased NGS-based data can be obtained.

This investigation should be continued in order to determine how different stressors affect recombination landscapes, and whether there is overlap in those differences. Additionally, I will investigate how recombination landscapes are altered when multiple stressors are combined: maternal age with temperature stress, and maternal age with acetic acid stress. I am curious to find whether the conditions are additive in their effects or whether a new pattern forms altogether.
These findings can then be examined in light of previous findings to determine whether environmental factors affect recombination rates on top of population-level variance, or whether environmental effects are separate altogether.

5.4 Summary

Variation in organisms’ recombination rate distribution has been observed since the early 1900s. Until relatively recently, the extent of that variation was largely unknown. Our lab and others have made substantial contributions to this field by identifying and quantifying the many levels of recombination rate variation in Drosophila. My efforts have been focused on the causes of this variation at the population and environmental levels by utilizing a variety of genomics and molecular genetics techniques. Throughout my investigations, we have learned that valuable information may exist in the primary DNA sequence of organisms, and that the meiotic environment during DSB induction and resolution is an important factor contributing to recombination landscapes. We have learned that motifs may play a major role in the localization of crossovers within the Drosophila genome, and that transcriptional effects also play a role, albeit a much smaller one. Thus far we can explain up to 60% of the observed variance in recombination rates; however, nearly half remains unexplained. I anticipate that a deeper understanding of how environmental conditions alter recombination landscapes will prove most useful in understanding this remaining portion of unexplained variation.

As it stands, models incorporating chromatin accessibility and DNA-binding directors of recombination seem to be the most promising. However, the specific proteins and mechanisms responsible for the localization of the vast majority of recombinational events remain elusive. Recombination-selection based experiments and manipulation of individual recombination correlates are likely required in order to develop the overall picture more completely.
In the near future, we mustn’t become nearsighted in our adoption of individual factors (e.g. only PRDM9), but rather be open to the likelihood that many players are involved in this highly complex song and dance. Characterizing the unifying and species-specific factors that alter the many levels of recombination rate variation is a key step in our understanding. Of course, the mechanisms responsible for DSB localization may themselves vary dramatically from organism to organism. Whatever the underlying forces are revealed to be, it is increasingly apparent that nature has established a remarkably plastic and striking method for altering the efficacy of selection in response to stress.

5.5 Chapter 5 References

Adrian, A. and J. Comeron (2013). "The Drosophila early ovarian transcriptome provides insight to the molecular causes of recombination rate variation across genomes." BMC Genomics 14(794).

Adrian, A., J. Cruz Corchado and J. Comeron. "In silico prediction of recombination rate variation across the Drosophila melanogaster genome based on multiple DNA motif analysis." (In review).

Comeron, J. M., R. Ratnappan and S. Bailin (2012). "The many landscapes of recombination in Drosophila melanogaster." PLoS Genet 8(10): e1002905.

Mancera, E., R. Bourgon, A. Brozzi, W. Huber and L. M. Steinmetz (2008). "High-resolution mapping of meiotic crossovers and non-crossovers in yeast." Nature 454(7203): 479-485.

Singh, N. D., D. R. Criscoe, S. Skolfield, K. P. Kohl, E. S. Keebaugh and T. A. Schlenke (2015). "EVOLUTION. Fruit flies diversify their offspring in response to parasite infection." Science 349(6249): 747-750.