Ancient Transposable Elements Discovery and Annotation

Ayrin Ahia-Tabibi
Department of Computer Science
Center for Bioinformatics
McGill University, Montreal
November 2014

A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of Master of Computer Science

© Ayrin Ahia-Tabibi, November 2014

Abstract

Transposable elements (TEs), the largest class of repetitive DNA fragments, are the single most abundant component of the genetic material of most eukaryotes. Their sheer number, their mechanisms of transposition and their repetitive nature are responsible for several challenges in genomics, and are also what makes them particularly interesting entities to study. Recent advances in sequencing technologies and the availability of genomic sequences have made genome-wide analysis of TEs possible. The impact of TEs on the structure, evolution and size of the genome, as well as on genome sequencing and annotation, has created growing interest in and demand for new bioinformatics approaches for their identification. These approaches all aim to computationally discover, detect and analyze both known and novel families of TEs. After their insertion in the genome, most TE copies degrade relatively quickly, making the recognition of old insertion events challenging. In this thesis, we develop a new pipeline to improve the annotation of the ancient transposable elements that have shaped the dynamic component of the human genome. We make use of an inferred ancestral mammalian genome to detect these ancient TE copies using RepeatMasker. Using LiftOver, these TEs are lifted to the human genome and then fed to our TEMapper program, which aligns them to their corresponding consensus sequences and corrects their percent divergence. Applying the ancient TE annotation pipeline, we revised the annotation of TEs and gained 115 Mb of coverage, corresponding to a ~7.28% improvement in the human genome.
This gain corresponds to a significant 3.5% increase in the TE composition of the human genome. In addition, we discover novel TE families and investigate their association with genes and regulatory elements.

Résumé

Transposable elements (TEs), the largest class of repetitive DNA fragments, are the most abundant elements in the genome of most eukaryotes. Their sheer number, their mechanism of transposition and their repetitive nature are responsible for some important challenges in genomics, and this is partly what makes them particularly interesting to study. The recent advent of sequencing technologies and the availability of genomic sequences have made the analysis of TEs in whole genomes possible. The impact of TEs on the structure, evolution and size of the genome, as well as on genome sequencing and annotation, has created interest in and demand for new bioinformatics approaches for their identification. After their insertion into the genome, most TE copies degrade fairly quickly, which makes the recognition of old insertion events difficult. In this thesis, we develop new algorithms aimed at improving the annotation of the ancient transposable elements that have shaped the dynamic component of the human genome. We make use of the availability of inferred ancestral mammalian genomes to detect these ancient TE copies using the RepeatMasker program. The TEs identified in the ancestral sequences are then transferred to the human genome using the LiftOver tool and fed to our TEMapper program to be aligned to their consensus sequences. With this new TE annotation mechanism, we revised the TE annotation of the human genome and increased by 115 Mb the portion of the genome annotated as being of TE origin. In addition, we discover new TE families and show that some of them are associated with genes with conserved functions or expression profiles and with regulatory elements.

Table of Contents

Abstract
Résumé
Table of Contents
Table of Figures
Table of Tables
Acknowledgments
Chapter 1: Introduction
  1 Biology of Transposable Elements
    1.1 Structure and Systematics of Transposable Elements
      1.1.1 Retrotransposons
      1.1.2 DNA Transposons
      1.1.3 Contribution of Transposable Elements in Shaping the Human Genome and Association to Diseases
  2 Transposable Element Discovery Methods
    2.1 De Novo Methods
    2.2 Homology-Based Methods
    2.3 Structure-Based Methods
    2.4 Comparative Genomic Methods
    2.5 Integrated Methods
  3 Ancestral Mammalian Genome Inference
Chapter 2: Annotation of Ancient Transposable Elements in the Human Genome
  1 Introduction
  2 Methods
    2.1 Ancestral Genome Reconstruction
    2.2 Identification of Transposable Elements in Boreoeutherian Ancestor
    2.3 Divergence Calculation
  3 Results and Discussion
    3.1 Human and Boreoeutherian Ancestor Divergence Profiles
    3.2 Mapping and Divergence Correction
    3.3 De-novo Transposable Element Discovery
    3.4 Conclusion
Chapter 3: Conclusion and Future Directions
References

Table of Figures

Figure 1.1: Human genome composition and percentage shares
Figure 1.2: Schematic representation of major classes of Transposable Elements
Figure 1.3: Components of the human genome
Figure 1.4: Workflow of the 4-step de novo TE detection pipeline
Figure 1.5: Workflow of the 4-step homology-based TE annotation pipeline
Figure 2.1: Transposable Elements degrading over time
Figure 2.2: Ancient Transposable Element Annotation Pipeline
Figure 2.3: Mapping TEs from the boreoeutherian ancestor genome to human genome
Figure 2.4: Vertebrate Phylogenetic Tree
Figure 2.5: RepeatMasker .cat output file example
Figure 2.6: LiftOver .mapped output file example
Figure 2.7: Ancestor program .maf output file example
Figure 2.8: Hypothetical example of nested TEs inserted into the genome
Figure 2.9: Divergence profile of annotated major TEs in the human genome
Figure 2.10: Divergence profile of annotated major TEs in the boreoeutherian ancestor genome
Figure 2.11: LINE/L2 elements divergence profile comparison
Figure 2.12: LINE/CR1 elements divergence profile comparison
Figure 2.13: SINE/Alu elements divergence profile comparison
Figure 2.14: Coverage gained by revised annotation of LINE/CR1 elements
Figure 2.15: Coverage gained by revised annotation of LINE/L2 elements
Figure 2.16: Coverage gained by revised annotation of SINE/MIR elements

Table of Tables

Table 1.1: Coding regions modified by TE insertions in human
Table 2.1: Revised TE annotation by ancient transposable element annotation pipeline
Table 2.2: Summary of de novo TE family discovery

Acknowledgments

This dissertation would not have been possible without the help of so many people in so many ways.
My deepest gratitude goes to my supervisor, Prof. Mathieu Blanchette, who expertly guided me through my graduate education. His enthusiasm kept me constantly engaged with my research and his personal generosity inspired me to become a better person. I am forever thankful for his understanding, wisdom, patience and encouragement, and for pushing me farther than I thought I could go. My appreciation extends to all my talented current and former laboratory colleagues, particularly Rola Dali and David Becerra, for their assistance and suggestions throughout my project. They have all truly helped make my time in the lab enjoyable. I thank McGill University for the opportunity and the Natural Sciences and Engineering Research Council of Canada (NSERC) for the funding throughout these two years. I am grateful to all my friends for helping me survive all the stress and for always listening and giving me words of advice. Above all, I am indebted to my family, whose value to me only grows with age. In particular, I thank my wonderful parents, Atoosa and Kian, my beautiful sister, Atra, and my beloved grandmother, who always have faith in me, for their unconditional love and endless support. Last but not least, I acknowledge my husband, Babak, who has been by my side since the beginning, given me strength and inspiration, and blessed my life in the hours when the lab lights were off.

Chapter 1: Introduction

A Transposable Element (TE) is a DNA sequence, 200 to 5000 base pairs (bp) long, that can change its position within the genome, create mutations and alter the cell's genome size. TEs are abundant yet poorly understood components of almost all eukaryotic genomes. They are important biological entities to study because of their role in genome structure, size and rearrangement, and their contribution to the evolution of genes and regulatory regions.
This research is motivated by the fundamental challenges in genome sequencing, genome assembly, annotation and alignment, which are rooted in the mobile and repetitive nature of this dynamic component. In fact, the evolutionary implications and the presence of coding regions in some TEs can complicate the process of gene annotation and genome assembly. Therefore, accurate TE identification and classification is essential for many applications in genomics. Yet existing algorithms remain incapable of annotating old TE insertions, which impedes a number of downstream analyses. Several challenges in TE identification stem from the nature of TEs themselves [Lerat 2009]:
• They do not follow a universal structure.
• They insert themselves in different regions of the genome or within one another, leading to nested copies.
• They mutate and diverge from the original copy over time.
• They mostly lose their replication abilities once mutated.

Computationally reconstructed mammalian ancestral sequences, which will be discussed later in the chapter, may contain remnants of very old copies from known and unknown families of TEs that have not been found in the human genome to date. By applying repeat discovery techniques, it should be feasible to identify these ancient TE copies and thereby shed light on the evolution of the human genome.

In this chapter, we review the biology of TEs, introduce the different families of TEs, and discuss the main types of TE discovery methods, including de novo, homology-based, structure-based, comparative and integrated approaches. In addition, we explain the methods for ancestral mammalian genome reconstruction upon which the proposed TE annotation pipeline relies.

1 Biology of Transposable Elements

The term “repetitive sequence” refers to homologous DNA fragments that are present in multiple copies in the genome.
First discovered and analyzed by McClintock in maize in 1948 [McClintock 1948], TEs are a widespread class of repetitive genomic regions (200-5000 bp long) that can change position, create mutations, and copy and paste themselves into the genome of the host, thereby altering the genome size. TEs are repetitive because they transpose via an RNA or DNA intermediate, which increases their copy number until they eventually constitute a large fraction of the genome sequence. In humans, certain TE families are present in up to a million copies, and all together they account for approximately 50% of human DNA [Gregory 2005]. Long believed to be ‘selfish’ intragenomic parasites contributing no functional elements to the host, they are now recognized as a major source of new functional genes [Volff 2006] and regulatory elements [Feschotte 2008]. However, TE activity is also associated with a number of genetic diseases, in particular cancer [Solyom & Kazazian 2012], all of which will be discussed later in this chapter. The composition of the human genome and the percentage shares of its various functional and nonfunctional sequences are summarized in Figure 1.1 and elaborated on in the next section.

Figure 1.1: Human genome composition and percentage shares [Jasinska et al. 2004]

1.1 Structure and Systematics of Transposable Elements

Although our focus in this thesis is on transposable elements, they are only one of several types of mobile genetic elements. Repetitive DNA was originally classified into “highly”, “middle” and “low copy” repetitive sequences, roughly corresponding to interspersed repeats, tandem repeats, and segmental duplications [Britten & Kohne 1968]. Tandem repeats are arrays of copies of DNA fragments immediately adjacent to each other in head-to-tail orientation. In contrast, interspersed repeats are DNA fragments up to 20-30 kilobases (Kb) in length, inserted randomly into the host DNA.
Interspersed repeats are mostly inactive and incomplete copies of inserted TEs. Low copy repeats (LCRs), also known as segmental duplications (SDs), are highly homologous sequence elements within the eukaryotic genome. They are typically 10-300 Kb in length and share more than 95% sequence identity [Sharp et al. 2006]. An SD is caused by an error in chromosomal splicing during genetic recombination.

According to their mechanism of transposition (the process by which TEs move about a genome), eukaryotic TEs can be categorized into two major types, retrotransposons and DNA transposons. A schematic of the two types is shown in Figure 1.2. Both retrotransposons and DNA transposons can be further classified as either "autonomous" or "non-autonomous". An autonomous TE encodes the complete set of enzymes characteristic of its family and is self-sufficient in terms of transposition, while a non-autonomous TE requires the presence of other TEs to move. Non-autonomous TEs transpose by borrowing the protein machinery encoded by their autonomous relatives, often because they lack transposase (for DNA transposons) or reverse transcriptase (for retrotransposons). In humans, the majority of TEs are non-autonomous [Jurka et al. 2007].

Figure 1.2: Schematic representation of major classes of Transposable Elements [Slotkin & Martienssen 2007]

1.1.1 Retrotransposons

Retrotransposons are described as copy-and-paste TEs and are the most common type of TE. They are first transcribed from DNA into an RNA intermediate (mRNA) and then reverse transcribed from RNA to DNA (cDNA), which is inserted back into the genome at a new position [Craig 1995]. Reverse transcriptase (RT) and endonuclease/integrase (EN/INT) enzymes, which are encoded by autonomous elements, catalyze reverse transcription and integration [Jurka et al. 2007].
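The copy-and-paste cycle described above can be illustrated with a toy sketch. It is only a minimal illustration under strong assumptions, not a biological model: the genome is treated as a plain string, the function name `retrotranspose` and its parameters are hypothetical, strand complementarity is ignored, and the target position is supplied by the caller rather than selected by any enzyme.

```python
def retrotranspose(genome: str, start: int, end: int, target: int) -> str:
    """Toy copy-and-paste: copy genome[start:end] and insert the copy at `target`.

    The source element stays where it is, so each call adds one copy.
    """
    element = genome[start:end]
    # Transcription (toy version): same-strand DNA -> RNA, T becomes U.
    rna = element.replace("T", "U")
    # Reverse transcription (toy version): RNA -> cDNA, U becomes T.
    cdna = rna.replace("U", "T")
    # Integration: paste the cDNA at the target position.
    return genome[:target] + cdna + genome[target:]


# Hypothetical 14 bp genome with a 6 bp element at positions 4..9.
genome = "AAAATTGCCGAAAA"
new_genome = retrotranspose(genome, start=4, end=10, target=len(genome))
print(new_genome)  # AAAATTGCCGAAAATTGCCG
```

Because the original element is left in place, the genome grows by the length of the element with every event, which is how retrotransposons can reach very high copy numbers.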
1.1.1.1 Non-Long Terminal Repeat and Long Terminal Repeat Retrotransposons

The presence or absence of long terminal repeats (LTRs) further classifies retrotransposons into non-LTR and LTR elements. Non-LTR TEs are best known for their enormous reproductive success in the human genome and have persisted in eukaryotic genomes for hundreds of millions of years. These ancient genetic elements, as their name implies, lack LTRs. A typical non-LTR retrotransposon contains one or two open reading frames (ORFs) and includes an internal promoter in the 5' terminal region that governs transcription of the retrotransposon [Jurka et al. 2007]. The majority of human TEs result from current and past non-LTR activity.

LTR retrotransposons, which account for ~8% of the human genome, are retrovirus-like in structure and mechanism. Although the mechanism of retrotransposition is not yet completely understood, the structure of LTR retrotransposons is known. They have direct LTRs that range from ~100 bp to over 5 Kb in size. The two LTR regions, the 5' LTR and the 3' LTR, are very similar: they are identical when the element inserts into the host genome and, once inserted, begin to evolve independently [Jurka et al. 2007]. An LTR retrotransposon carries two ORFs, for the gag and pol proteins, and sometimes a third one downstream, for the env protein. Gag (group-specific antigen) is a polyprotein that forms the core structural proteins of retroviruses. Pol (DNA polymerase) is the reverse transcriptase, the essential enzyme that carries out reverse transcription. Env (envelope protein) is a viral protein that forms the viral envelope.

For the sake of completeness, it is worth mentioning that although the two classes of non-LTR and LTR elements are well established and studied, there are also the more recently discovered Penelope and Dictyostelium intermediate repeat sequence (DIRS) retrotransposons [Arkhipova et al. 
2003; Badyaev 2005; Lorenzi et al. 2006; Poulter & Goodwin 2005]. Penelope elements encode a 2.5 Kb long ORF and are characterized by an unusual LTR that is not typical of standard non-LTR retrotransposons. DIRS retrotransposons consist of ~4.1 Kb of unique internal sequence flanked by inverted terminal repeats of unequal lengths. Members of all four classes of retrotransposons are present in the genomes of all eukaryotic kingdoms (Protista, Plantae, Fungi and Animalia), with the exception of Penelope, which has not yet been identified in plants. However, since existing tools are not advanced enough to recognize Penelope and DIRS as distinct classes in the human genome, we do not focus on them here.

Retrotransposons are commonly grouped into three main families:
• Endogenous Retrovirus-Like Elements (ERVs) [Benit et al. 1999]: TEs with LTRs that encode a reverse transcriptase protein, similar to retroviruses. In humans, these elements have been named HERVs (for human endogenous retroviruses), and several families of such elements have been characterized. Analysis of a large series of mammalian genomic DNAs shows that ERVs are present in all placental mammals, suggesting that these elements were already present at least 70 million years ago [Cordaox et al. 2009]. However, their activity is presently very limited in humans, if it occurs at all [Mills et al. 2007].
• Long Interspersed Elements (LINEs) [Ostertag & Kazazian 2001]: Mainly L1s, L2s, and CR1s, which encode reverse transcriptase (autonomous) but lack LTRs, and are transcribed by RNA polymerase II.
• Short Interspersed Elements (SINEs) [Kramerov & Vassetzky 2011]: Mainly Alus (for Arthrobacter luteus), SVAs (for SINE-VNTR-Alu) and MIRs (for mammalian-wide interspersed repeat). They are non-LTR retrotransposons that do not encode reverse transcriptase (non-autonomous) and are transcribed by RNA polymerase III.
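The three-family grouping above amounts to a small decision rule on two features: whether the element has LTRs and whether it encodes reverse transcriptase. The sketch below is one possible way to write that rule down. The function name, the feature flags and the "unclassified" fallback are assumptions made for illustration; real annotation tools rely on sequence similarity to consensus libraries rather than on two boolean flags.

```python
def classify_retrotransposon(has_ltr: bool, encodes_rt: bool) -> str:
    """Assign a retrotransposon to ERV, LINE or SINE using the two features
    named in the list above. Combinations the list does not describe
    (LTRs without reverse transcriptase) are reported as unclassified.
    """
    if has_ltr and encodes_rt:
        return "ERV"           # LTRs plus reverse transcriptase
    if not has_ltr and encodes_rt:
        return "LINE"          # autonomous, no LTRs
    if not has_ltr and not encodes_rt:
        return "SINE"          # non-autonomous, no LTRs
    return "unclassified"      # assumption: case not covered by the text


# Transcribing polymerase as stated in the list (not stated for ERVs).
POLYMERASE = {"LINE": "RNA polymerase II", "SINE": "RNA polymerase III"}

for flags in [(True, True), (False, True), (False, False)]:
    family = classify_retrotransposon(*flags)
    print(flags, family, POLYMERASE.get(family, "not stated"))
```

Keeping the fallback explicit avoids forcing elements such as Penelope or DIRS, which the text sets aside, into one of the three families.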
1.1.1.2 Long Interspersed Element (LINE)

LINE retrotransposons, which have been present in the human genome for at least 70 million years, are 5-9 Kb long and comprise an astounding ~21% of human DNA. An average human genome contains ~80-100 active LINE elements that can retrotranspose to new genomic locations; the rest are inactive copies. They are currently the only known retrotransposons in the human genome that code for the protein machinery required for their own transposition. In addition, LINE1 (L1) is the only element from this family that is still active in the human genome today, and it is found in all mammals [Cordaox et al. 2009]. However, remnants of L2 and CR1 elements are also found in the human genome.

1.1.1.3 Short Interspersed Element (SINE)

Alu elements are short stretches of DNA, ~300 bp long on average, and are therefore classified as SINEs. They do not encode protein products and depend on LINE retrotransposons for their replication [Cordaox et al. 2009]. Alu elements of different kinds occur in large numbers in primate genomes only [Cordaox et al. 2009]. There are over one million Alu elements interspersed throughout the human genome, and it is estimated that about 11% of the human genome consists of Alu sequences. In fact, Alus are the most abundant transposable elements in the human genome. Most human Alu insertions can be found at the corresponding positions in the genomes of other primates [Nekrutenko & Li 2001]; however, about 7,000 Alu insertions are unique to humans. Another member of the SINE class is the SVA element [Ostertag 2003], which, like Alu, is primate-lineage specific. The SVA family is the youngest TE family identified in humans; it is hominid-specific and has ~2700 copies. SVA elements are composed of three parts, SINE, VNTR and Alu, and vary in size as a result of polymorphisms in the VNTR (variable number tandem repeat) region.
Like other retroelements, Alu and SVA insertions can have both negative effects on the human genome, by contributing to genetic disorders, and potentially positive effects, by creating new gene families. Alus, SVAs and L1s, which together account for one third of the human genome, are the only TEs currently active in humans. They have been shown to be responsible for genetic disorders [Kazazian et al. 1988; Deininger & Batzer 1999; Chen et al. 2005; Callinan & Batzer 2006; Belancio et al. 2008; Solyom & Kazazian 2012].

1.1.2 DNA Transposons

The second type of TE is the DNA transposon [Jurka et al. 2007; Cordaox et al. 2009], mainly described as a cut-and-paste TE [Craig 1995]. Having been active until ~37 million years ago, DNA transposons are each ~5 Kb long and all together make up less than 3% of the human genome. In contrast to retrotransposons, the replication of DNA transposons does not involve an RNA intermediate. Transposition is catalyzed by various types of transposase enzymes, which cut the TE and paste it into a target site. DNA transposons are typically bounded by terminal inverted repeats (TIRs), which serve as the recognition sequence for the transposase. Some transposases bind non-specifically to any target site in DNA, whereas others bind to specific DNA sequence targets. The transposase cuts at the target site, producing single-stranded 5' or 3' DNA overhangs called sticky ends. The DNA transposon that has been cut out by the transposase is then pasted into the new target site. Insertion sites may be identified by short direct repeats followed by a series of TIRs that are important for excision of the TE by the transposase. After the insertion, a DNA polymerase fills in the gaps and a DNA ligase closes the sugar-phosphate backbone.

One question that may arise is: if DNA transposons transpose by a cut-and-paste mechanism, how do they accumulate over time? They move through a non-replicative mechanism.
Relying on the host machinery, DNA transposons increase their copy numbers through indirect mechanisms [Feschotte & Pritham 2007]. The first mechanism operates during DNA replication, when the transposon moves from a newly replicated chromatid to an unreplicated site. The transposon is therefore replicated twice, which means a net gain of one copy. The second mechanism results from the repair of double-stranded DNA breaks. DNA breaks can be mutagenic to a cell and are repaired in several ways, including homologous recombination, a process by which the cell copies the missing DNA sequence from the homologous chromosome. TEs, among other sequences found on the homologous chromosome, will be copied to the damaged chromosome, which can introduce TE copies that were not originally present.

Figure 1.3: Components of the human genome [Gregory 2005]

Although our focus here is on the TEs identified in the human genome, it is worth mentioning that besides the cut-and-paste transposons there are two other classes of DNA transposons, Helitrons and Polintons. Helitrons transpose via replicative rolling-circle transposition [Kapitonov & Jurka 2001] and are present in the genomes of plants, fungi, insects, nematodes and vertebrates. Like cut-and-paste transposons, Helitrons cannot synthesize their own DNA and instead duplicate using the host replication machinery. Polintons, on the other hand, are self-synthesizing transposons [Kapitonov & Jurka 2006] that are 15-20 Kb long and have been identified in protists, fungi and animals. The human genome composition discussed in this section is summarized in Figure 1.3.

1.1.3 Contribution of Transposable Elements in Shaping the Human Genome and Association to Diseases

In this section, we highlight some of the significant roles of TEs in mutating protein-coding regions, rewiring regulatory networks, and causing cancers and genetic diseases.
1.1.3.1 Transposable Elements’ Role in Protein-Coding Regions

Highly mutagenic active TEs frequently insert into protein-coding genes and are therefore found in a large number of human protein-coding genes. Studies show that approximately 4% of human genes contain TEs or TE fragments within their coding regions [Nekrutenko & Li 2001]. Consequently, they cause chromosome breakage, illegitimate recombination, and genome rearrangement. In addition, TE insertions influence gene splicing patterns through alternative splicing. Table 1.1 provides some examples of TE insertions in human genes and their effects on the coding regions.

There are two ways by which TEs could have integrated into coding regions. In the first, a TE is inserted directly into a protein-coding exon. In the second and more common path, a TE is inserted into a noncoding intron and subsequently recruited as a new exon. About 90% of TE insertions are into introns, and this high rate is possible because many TEs carry potential splice sites [Nekrutenko & Li 2001]. Thus, TE insertion might be an important cause of the high frequency of alternative splicing in human protein-coding genes. For example, the ~1.4 million Alu elements interspersed throughout the human genome, each carrying several potential splice sites, provide numerous possibilities for the formation of alternative transcripts [Brosius & Gould 1992].

Orthologous genes can encode functionally different proteins or differ in terms of expression. For example, an Alu insertion occurs in roughly 1 of every 200 human births [Deininger & Batzer 1999], which means that the chimpanzee genome does not contain many of the Alu elements found in the human genome. Since Alus have not been discovered in non-primates, they might have contributed significantly to the divergence between primates and other mammals.
Overall, there are about 400 human genes containing Alu inserts in their coding regions that are not found outside of the primate lineage [Nekrutenko & Li 2001].

Gene                                                TE Type     Effect on Coding Region
Human hematopoietic progenitor kinase (HPK1)        Alu         Extension
Plakophilin 2a and b                                Alu         Extension
Adenosine deaminase                                 Alu         Extension
Proteasome subunit p27                              L1          Changing stop codon
Hepatocyte nuclear factor-3/fork head protein       L1          Extension
Methyl-CpG binding protein                          L2          Changing stop codon
Down syndrome critical region gene 5                L2          Changing start codon
8-oxo-dGTPase                                       LTR         Changing start codon
Myelin-associated oligodendrocytic basic protein    LTR50 LTR   Changing stop codon

Table 1.1: Coding regions modified by TE insertions in human [Nekrutenko & Li 2001]

1.1.3.2 Transposable Elements’ Roles in Regulatory Networks

TEs have been a rich source of material for the assembly and tinkering of regulatory systems and have had a key role in the evolution of human gene regulation. They can influence neighboring genes by functioning as enhancers or promoters. There are many ways by which TEs can directly influence the expression of a nearby gene, at both the transcriptional and post-transcriptional levels. To give some examples, studies [Feschotte 2008] reveal that at least 16% of eutherian-specific conserved non-coding elements were derived from TEs. One study [Jordan et al. 2003] reports that nearly 25% of experimentally characterized human promoters contain TE fragments, including cis-regulatory elements. Another study [Ramirez et al. 2006] shows that one quarter of the DNase I hypersensitive sites identified in human T cells overlap with annotated TEs.

There are three main reasons why TEs are a source of regulatory elements [Feschotte 2008]. First, they tend to cluster around genes involved in development and transcriptional regulation. Second, they are overrepresented in genomic segments containing transcription factor binding sites (TFBSs).
Finally, there is a growing number of highly conserved TEs that act as transcriptional enhancers.

TEs wire genetic networks. The scattering of TE families throughout the genome allows the same motifs to be engaged at many chromosomal locations and therefore brings multiple genes into the same regulatory networks. For instance, a human genome study [Wang et al. 2007] suggests that a set of closely related families of LTRs has dispersed more than 1500 binding sites for the master regulatory factor p53. These sites encompass 30% of all p53 binding sites that have been mapped. TEs also show far more overlap than expected with non-coding RNAs such as microRNAs (miRNAs), which are important players in the regulation of gene expression. This suggests that certain TE families, such as Miniature Inverted Repeat Transposable Elements (MITEs), possess characteristics that make them prone to give rise to miRNAs [Feschotte et al. 2002].

1.1.3.3 Transposable Elements’ Role in Cancer and Genetic Diseases

The mobility and sheer number of both retrotransposons and DNA transposons allow them to shape our genotype and phenotype, both on an evolutionary scale and at the individual level. The fact that TEs cause variation within and between individuals is now evident from the ever-increasing number of genome-scale studies. These studies have expanded the pool of human disorders known to result from TE insertions [Chen et al. 2005]. The non-LTR retrotransposons, Alu and L1 in particular, can cause diseases through insertional mutagenesis, recombination and structural variation, providing enzymatic activities for other mobile DNA, transcriptional over-activation and epigenetic effects. Since DNA transposons are considered immobile in the human genome, no human disease is known to arise as a result of their activity.

It is now likely that some types of cancer, neurological disorders and genetic diseases arise as a result of retrotransposon mutagenesis [Mills et al.
2007]. There have been 96 known retrotransposon insertions in disease cases, of which 25 are caused by L1s, 60 are attributable to Alus and 7 to SVAs [Hancks & Kazazian 2012]. Overall, retrotransposon insertions account for about 1 in 250 (0.4%) of disease-causing mutations [Wimmer et al. 2011].

Alus predominantly cause diseases by homologous recombination between two Alu sequences, although insertion into or near exons and Alu splicing from introns are also possible. Hemophilia B [Vidaud et al. 1993], Huntington disease [Hutchinson et al. 1993], and neurofibromatosis [Wallace et al. 1991] are among the well-known diseases associated with Alu insertions.

The predominant mechanism by which L1 causes diseases is insertional mutagenesis into or near genes. Highly active L1 elements account for most disease-causing insertions [Brouha et al. 2003]. Studies show that retrotransposon insertions, L1 insertions in particular, are over-represented in protein-coding genes. If, for example, an L1 element inserts into a gene that functions in neurological development, it might lead to neurological diseases such as Rett syndrome [Tomas et al. 2012]. L1 insertions have also been found in brain and lung tumors [Iskow et al. 2010] and are attractive candidates for both somatic drivers and hereditary factors in germ cell tumors and in other cancer types. Genetic instability [Symer 2002], which is a hallmark of cancer [Negrini et al. 2010], is associated with L1 elements. Thus, over-activation of L1s could potentially contribute to tumorigenesis. In addition to cancer, L1 has been associated with well-known genetic diseases such as hemophilia A [Kazazian et al. 1988], beta thalassemia [Divoky et al. 1996], and muscular dystrophy [Kondo-Iida et al. 1999]. Also, several novel L1 insertions on the X chromosome were discovered in males with presumptively X-linked disorders [Huang et al. 1993].
2 Transposable Element Discovery Methods

The following section categorizes and summarizes the TE discovery methods used in computational genomics, including de novo, homology-based, structure-based, comparative and integrated approaches.

2.1 De Novo Methods

De novo TE discovery approaches, reviewed in Bergman & Quesneville 2007, look for repetition of mobile DNA at multiple positions within a genome without using any prior knowledge about the structure of TEs or similarity to known sequences. These methods aim to derive the consensus sequence of each TE family from sets of similar sequences. Once identified, the sequences are typically clustered, filtered, and characterized. However, because de novo methods are not specific to TEs, they also tend to find other repeats such as tandem repeats or segmental duplications. In addition, they are not able to detect TE families with low copy number or non-overlapping fragments. The biological complexity of TEs, including the fragmented nature of TE instances and the sequence similarity of related TE families, increases the computational complexity of the problem. Discovering several co-linear repeats instead of a single repeat, aggregating nested TEs into one large family, and over-splitting a family into multiple distinct subfamilies are among the major problems for de novo TE discovery methods.

Figure 1.4: Workflow of the 4-step de novo TE detection pipeline [Flutre et al. 2011]

Although de novo techniques typically struggle with identifying degraded fragments, they are the most effective, albeit computationally expensive, approaches for identifying novel TEs. P-CLOUDS [Gu et al. 2008], RECON [Bao & Eddy 2002], and RepeatModeler [Smit & Hubley 2008] are among the best-known de novo tools. Classical computational strategies, like suffix trees or pairwise similarity searches, were initially used for repeat detection by tools such as RepeatFinder [Volfovsky et al. 2001].
More recent software such as P-CLOUDS has been designed to rapidly find repeats in genome sequences by counting highly frequent words of a given length k, called k-mers. Other methods such as ReAs [Li et al. 2005] also count frequent k-mers but additionally try to define consensus sequences. For each frequent k-mer, a multiple alignment of all its occurrences is built and extended iteratively. These programs are very useful for quickly providing a view of the repeated fraction in a given set of genomic sequences. However, they do not provide much detail about the TEs present in these sequences. Their output only identifies highly repeated regions, without indicating precise TE fragment boundaries or TE family assignments.

Repeats can also be identified by self-alignment of genomic sequences, starting with an all-by-all alignment of the assembled sequences. The BLAST [Altschul et al. 1990] and BLASTER [Quesneville et al. 2003] heuristic algorithms are among the tools used for this alignment. Note that the aim of this step is not to recover all TE copies of a family but to use those that are well conserved to build a robust consensus. Stringent alignment parameters are crucial for successful reconstruction of a valid consensus. Tools like RECON then cluster the matches corresponding to repeats into groups of similar sequences. The aim is for each cluster to correspond to copies of a single TE family. Once clusters are defined, a filter is applied to eliminate the vast majority of segmental duplications. Finally, only the clusters with at least three members are kept; for each of them, a multiple alignment is built and a consensus sequence is derived. These four steps of the de novo TE detection pipeline are summarized in Figure 1.4.

2.2 Homology-Based Methods

Homology-based approaches, reviewed in Bergman & Quesneville 2007, are the most commonly used methods to detect TE families based on homology to known TE protein-coding sequences or to DNA consensus sequences.
In fact, these approaches use known TEs as a means to discover new copies of TEs in genomes. They employ fast seed-based heuristic alignment algorithms such as BLAST and FASTA [Pearson 2000] with known TEs used as queries, followed by post-processing that includes merging and/or extending individual genomic hits. The main advantage over the de novo approach is that homology-based methods are more likely to find bona fide TEs, as they are based on knowledge of known TE sequences. Although only a few homology-based tools exist, they are normally the most accurate in identifying known TEs as well as in detecting degraded TEs. However, they are unable to identify TEs unrelated to known elements.

From this category of approaches, RepeatMasker [Smit, Hubley & Green 1996], a popular tool to identify, classify, and mask repetitive elements, including low-complexity sequences and interspersed repeats, is widely used in computational genomics. RepeatMasker searches for repetitive sequences by aligning the input genome sequence against a library of known repeats, such as Repbase [Jurka et al. 2005]. Sequence comparisons in RepeatMasker are performed by one of several popular search engines, including nhmmer [Wheeler & Eddy, 2013], cross_match [Smit, Hubley & Green 1996], ABBlast/WUBlast, RMBlast [Altschul et al. 1990] and Decypher. It makes use of curated libraries of repeats and currently supports Dfam [Travis et al. 2013] (a profile HMM library derived from Repbase sequences) and Repbase, a service of the Genetic Information Research Institute and the most commonly used database of repetitive DNA elements.

In order to detect common TE protein domains, some homology-based approaches use hidden Markov models (HMMs) from the PFAM database [Bateman et al. 2002] to scan predicted ORFs, as an alternative to fast heuristic alignment algorithms [Berezikov et al. 2007; Rho et al. 2007].
Although these approaches are effective for genomes that are closely related to the genomes used to build the database, they have difficulties with distantly related species. This is because HMMs tend to capture more irrelevant data when searching for diverse sequences. Overall, obtaining a full-length reference sequence with a homology-based TE detection method requires further analysis of the structural features of TE families.

Figure 1.5: Workflow of the 4-step homology-based TE annotation pipeline [Quesneville et al. 2005]

Starting from the library of TE sequences obtained by the de novo TE detection pipeline, tools like RepeatMasker, BLASTER or CENSOR [Kohany et al. 2006; Jurka et al. 1996] use pairwise alignment algorithms to detect TE fragments in the genome. If these tools are used in conjunction, the MATCHER program assesses the results and keeps the best match for each location. Short simple repeats (SSRs), which are short motifs repeated in tandem, are present not only in TEs but also independently in the genome. Therefore, TE matches that are restricted to an SSR contained in the TE consensus must be filtered out in the second step. The last two steps consist of discarding false-positive matches and connecting distant fragments by applying the long join procedure [Flutre et al 2011], before the final annotation is reported. These four steps of the homology-based TE annotation pipeline are summarized in Figure 1.5.

2.3 Structure-Based Methods

This class of approaches takes advantage of prior knowledge of the common structural features of TEs, such as LTRs, target site duplications (TSDs), primer-binding sites (PBSs), polypurine tracts (PPTs) and ORFs for the gag, pol (containing the RT domain) and/or env genes [Bergman & Quesneville 2007]. Unlike homology-based methods, structure-based methods are less dependent on similarity to known TE families and instead rely on detecting specific models of TE architecture.
In structure-based approaches, specific models must be developed for each TE family. Strongly structured TE families such as LTRs are easier to detect using these methods. However, these methods are generally less useful when searching for degraded TEs or for TEs without a conserved structural characteristic.

LTR_STRUC [McCarthy & McDonald 2003], for example, is a structure-based tool that works well for identifying complete TEs that have a conserved LTR structure at each end of the element. Using a heuristic seed-and-extend algorithm, this tool finds and aligns local repeats located within a user-specified distance of each other, which are used as an initial set of candidate LTRs. The boundaries of the LTRs on the original sequence are determined by the pairwise alignment of the putative LTRs. A critical limitation of LTR_STRUC is that only TEs within the same contig can be detected.

Another structure-based tool is TE-HMM [Andrieu et al. 2004]. This method takes advantage of the fact that the nucleotide composition of TE ORFs often differs from that of host genes. The core of this approach is building HMMs with three states representing coding regions and at least one state for noncoding regions. The three coding states correspond to the three codon positions, while the one or more noncoding states allow for the frameshift mutations that are common in decaying TEs. TE-HMM is able to accurately identify the coding regions in known TEs and to differentiate TEs from genes in a given dataset. This assumes that separate HMMs for RNA-based TEs, DNA-based TEs and host genes have already been trained. Therefore, as with all HMMs, TE-HMM depends on good training data sets from the same species group. Moreover, since TE-HMM only attempts to predict coding sequences, other structural features of unidentified TEs, such as UTRs and LTRs, cannot be discovered using this approach.
2.4 Comparative Genomic Methods

Comparative genomic discovery methods, reviewed in Bergman & Quesneville 2007, rely on neither homology nor structural features. They are based on the fact that transpositions create large insertions and variations between pairs of closely related genomes, which can be detected in a multiple sequence alignment. Such differences are analyzed and classified by the comparative genomic approach. This method searches for insertion regions (IRs) where multiple alignments of orthologous genome sequences are disrupted by a large (>200 bp) insertion in one or more species [Caspi & Pachter 2006]. The effectiveness of this approach depends on the quality of the whole-genome alignments, and it is useful for identifying new families of TEs when related genomes are available. Its main limitation is its inability to detect TEs that are shared through common ancestry. It also performs poorly in TE-rich regions, where nested TE insertions may complicate the detection of distinct TE families.

2.5 Integrated Methods

Due to the nature of TEs, there might never be a single best detection approach; thus, employing the existing methods in conjunction can be both realistic and effective. The REPET package [Flutre et al, 2011] integrates bioinformatics programs to detect, annotate and analyze TEs. Its two main pipelines are TEdenovo, for de novo TE detection, and TEannot, for homology-based TE annotation.

The TEdenovo pipeline compares the genome with itself using BLASTER and clusters the matches with the GROUPER [Quesneville et al. 2003], RECON [Bao & Eddy 2002] and PILER [Edgar & Myers 2005] clustering programs. It then builds a multiple alignment for every cluster to derive a consensus sequence. After filtering for redundancy, these consensus sequences are classified to finally obtain a library of classified, non-redundant consensus sequences.
The TEannot pipeline searches the genome with the library of consensus sequences built by TEdenovo, using BLASTER, RepeatMasker and CENSOR. After false positives are removed by an empirical statistical filter, MATCHER [Quesneville et al. 2003, 2005] groups the TE fragments into families and outputs the annotations.

3 Ancestral Mammalian Genome Inference

In this section, we explain why and how the ancestral mammalian genome, which is used as an input to the method we have developed, is reconstructed.

The possibility of computationally inferring ancestral genomes, given a known phylogeny, is among the interesting prospects of having a large number of extant genomes available. The scientific community has inferred ancestral gene orders and the history of rearrangements leading to a given set of extant genomes [Bourque & Pevzner 2002; Mu 2011]. However, there are challenges associated with inference at this level, namely the computational complexity, the limited accounting for all possible evolutionary events, and the identification of orthologs. In this study, as is conventional in this field of research, we have focused on DNA sequence evolution at the level of substitutions, insertions and deletions. The aim is to infer a set of most likely ancestral sequences based on a known set of extant, collinear, non-rearranged orthologous sequences.

Correctly aligning highly divergent sequences and inferring maximum likelihood indels (insertions/deletions), which is computationally complex, are among the difficulties of this inference problem. Nonetheless, significant investments in whole-genome multiple sequence alignment [Blanchette et al. 2004; Paten et al. 2009] have resulted in a set of heuristic algorithms for inferring ancestral DNA sequences [Diallo et al. 2010, Paten 2008; Westesson et al. 2012]. These algorithms have made it possible to infer large sections of syntenic regions with good accuracy.
The first step toward reconstructing an ancestral mammalian genomic sequence is to build an accurate whole-genome multiple sequence alignment of the complete genomes of mammals [Miller et al. 2007]. To this end, the sequences are repeat-masked using RepeatMasker [Smit and Green 1999] and then aligned using the Multiz multiple-alignment program [Blanchette et al. 2004]. It is assumed that two bases are aligned if and only if they derive from a common ancestral base. The alignment is then divided into several syntenic alignment blocks. Within every block, ancestral sequences at each internal node of the mammalian phylogenetic tree [Miller et al. 2007] are inferred. It is worth mentioning that rearrangements, duplications, or large insertions are not expected within each block. These sequences are inferred using Ancestors 1.1 [Diallo et al. 2010], a program that uses an evolutionary model involving context-dependent substitutions as well as indels to infer the maximum likelihood ancestral sequences. Finally, heuristics are used to reduce errors due to incorrect alignment.

For several reasons, the eutherian mammal clade is a particularly interesting target for ancestral genome inference. The fact that it includes the human genome is one of the key motivations, as the study of ancestral genomes may shed some light on the function of various parts of our own genome. We also argue that a good target species for a genomic reconstruction is one that has generated a large number of independent, successful descendant lineages through a rapid series of ancestral speciations [Murphy et al. 2001]. Therefore, due to the mammalian radiation, certain early eutherian mammal genomes can be inferred to a surprisingly high degree of accuracy.

The boreoeutherian ancestor is the ancestor of all eutherian mammals except Afrotherians (e.g. elephants) and Xenarthrans (e.g. sloths and armadillos).
Using the reconstruction pipeline mentioned above, most of the euchromatic genome of the boreoeutherian ancestor can be inferred from the extant genomes of each main lineage with 98-99% base-by-base accuracy [Blanchette et al. 2004]. In the next chapter, we will discuss how the inferred genome of the boreoeutherian ancestor can be used to identify ancient TEs in the human genome.

Chapter 2: Annotation of Ancient Transposable Elements in the Human Genome

1 Introduction

After their insertion in the genome, most TE copies get degraded relatively quickly, making the recognition of old insertion events challenging. Although these ancient TEs may no longer be recognizable in the human genome, inferred mammalian ancestral genomes may contain younger versions of them, which can be detected by existing TE annotation tools such as RepeatMasker [Smit, Hubley, & Green 1996].

To elaborate, Figure 2.1 represents the aging of TE copies (symbolized by squares) in a genome (symbolized by lines) over time. Assume the red square on the first line is a TE fragment belonging to family A, inserted into our genome 100 million years ago. After 20 million years, it replicates itself and makes an identical copy at another position in the genome. Then, 20 million years later, it again replicates itself and makes a third copy, but as the genome evolves, the second copy, along with other parts of the genome, gets mutated (pink). After another 20 million years, a TE fragment belonging to a different family, family B (dark blue), is inserted into the genome. The family B copy starts to transpose, while the TE fragments belonging to family A have been mutated and have lost their ability to transpose. Over time, TE copies belonging to both families mutate and diverge further from their corresponding original copies (their colors become lighter and lighter).
Therefore, on the last line of Figure 2.1, which represents our genome today, we see TE fragments of different colors belonging to families A and B. All the copies from family B, which expanded more recently than family A, are recognizable, although some have become lighter. However, TE copies from family A have lost their colors to the point where one of them is no longer recognizable as a copy belonging to this family.

Figure 2.1: Transposable elements degrading over time. Lines represent a genome evolving over time. Red squares represent family A and blue squares represent family B of TEs.

Nevertheless, if we had the genome from 50 million years ago, we would be able to identify all the copies belonging to family A that are still present in our genome today. Thus, having the inferred genome of the boreoeutherian ancestor may help us classify the ancient transposable element insertions in the human genome.

In this chapter, we develop an approach to identify ancient TE copies, complete the annotation of TEs in the human genome, classify new families of TEs and identify their associations with functional elements in the human genome. We propose the use of ancestral genomes to improve the detection of ancient TE copies and to discover new TE families. We expand on our pipeline in the methods section and report the results obtained with this approach in the results and discussion sections.

2 Methods

In this section, we explain the pipeline summarized in Figure 2.2, which we developed to complete the annotation of TEs in the human genome. We first give an outline of the method and then provide details on each step.

Figure 2.2: Ancient Transposable Element Annotation Pipeline

Given a phylogenetic tree of mammals, a mammalian ancestral genome is reconstructed. Running RepeatMasker on the ancestral genome, we obtain TE copies whose descendants have possibly not been identified in the human genome.
These are the TEs that RepeatMasker does not recognize as TE copies in the human genome. Once we have identified these ancient TE copies, we map them to the human genome using LiftOver [Kuhn et al. 2012], a tool that converts genome positions between different genome assemblies based on an alignment (Figure 2.3). Then, we calculate the percentage of divergence of these mapped TEs in human from the consensus sequences.

Figure 2.3: Mapping TEs from the boreoeutherian ancestor genome to the human genome. Red squares represent TEs identified in the boreoeutherian ancestor genome, blue squares represent TEs currently identified in the human genome, and purple squares represent the overlapping regions in the revised human TE annotation.

We also run RepeatMasker on the human genome and mask the identifiable TE copies, which have already been recorded as TEs in the current annotation of the human genome. The overlapping regions (shown as purple squares in Figure 2.3) are identified using FeatureBits [Kuhn et al. 2012], a bioinformatics tool that reports the intersection of two files of genomic annotations, before the newly annotated ancient TE copies in the human genome are reported. This pipeline is explained in detail in the following sub-sections.

2.1 Ancestral Genome Reconstruction

Previous work [Blanchette et al. 2004] has shown that ancestral mammalian genomes can be probabilistically reconstructed with up to 99% accuracy, given a phylogenetic tree. We used the Multiz whole-genome multiple alignment of actual genomic sequences from 36 mammals, obtained from the UCSC genome browser [Miller et al. 2007], and applied the Ancestor 1.1 pipeline [Diallo et al. 2010] to reconstruct ~1.8 gigabases (Gb) of ancient genome sequence from the boreoeutherian ancestor. The Ancestor program output file contains the alignment of the 36 mammals, the inferred boreoeutherian ancestor sequence, and the confidence score attached to each base, all subdivided into blocks.

Figure 2.4: Vertebrate Phylogenetic Tree. The boreoeutherian ancestor is noted with a red box. Adapted from [Miller et al. 2007]

As indicated on the phylogenetic tree in Figure 2.4, the boreoeutherian ancestor is a species that lived approximately 75 million years ago. The reason for choosing it is that the accuracy of the reconstruction method depends crucially on the length of the early branches of the phylogenetic tree. For example, the star tree with no internal shared node is the most favorable tree for reconstruction. Thus, ancestral sequences at the center of a rapid radiation can be reconstructed more accurately than those of more recent ancestors. The boreoeutherian ancestor meets these criteria and is the most accurate ancestral sequence obtained by the method mentioned above.

2.2 Identification of Transposable Elements in Boreoeutherian Ancestor

RepeatMasker is a popular program in computational genomics that screens DNA sequences to identify and classify interspersed repeats and low-complexity DNA sequences. To do so, it aligns the query sequence against a library of known repeats, such as Repbase (the most commonly used database of repetitive DNA elements) [Jurka et al. 2005]. The output of the program is a detailed annotation of the repeats identified in the query sequence, as well as a modified version of the query sequence in which all the annotated repeats have been masked.

TE families, in contrast to multigene families, are usually defined based on their active ancestor and generation mechanism, although over time individual elements may acquire diverse biological roles. Existing tools such as RepeatMasker, however, struggle to identify TEs that have diverged by more than 40% from the original consensus sequence. Most of these TEs were inserted more than 70 million years ago and have diverged to the point that they are no longer recognizable as TEs in the human genome.
Nonetheless, these highly diverged TEs look younger (less diverged from the originally inserted consensus sequence) in the ancestral sequences. Therefore, in principle, we should be able to identify these old insertions in an ancestral sequence using RepeatMasker and map them to the human genome.

Having the reconstructed genome of the boreoeutherian ancestor (available at: http://cs.mcgill.ca/~aahiat1/Ancestor/), which was explained in section 3 of chapter 1, the first step in our pipeline is to identify the TE copies present in that genome. To this end, we run the command-line version of RepeatMasker on the boreoeutherian ancestor sequence FASTA file [Pearson 2000]. We set the species of the input sequence (-species parameter) to mammal and the search engine (-e parameter) to hmmer. The default RepeatMasker version uses the WU-BLAST search engine, which was initially used for our annotations. However, the first version of a transposable element profile HMM database [Eddy 1998] was released in 2013 and notably improved the characterization of TE sequences. The use of profile HMMs not only improves sensitivity over single-sequence search but also captures the additional information contained in position-specific nucleotide distributions and indel variability. Genome-scale searches with profile HMMs have only recently become feasible; therefore, the new version of RepeatMasker uses Dfam (a profile HMM library derived from Repbase sequences) [Travis et al. 2013] and nhmmer [Wheeler & Eddy, 2013] (see: http://www.repeatmasker.org/).

RepeatMasker produces four types of output files, .mask, .tbl, .cat, and .out, which together contain the TE annotation information (available at: http://cs.mcgill.ca/~aahiat1/RepeatMasker/). The .mask file contains the masked sequence, which is the same as the query sequence except that the repetitive elements are masked using N or X letters. The .tbl file states the percentage of genome coverage of each annotated TE family.
The .cat file contains details on the alignment between the identified TE sequences and their consensus sequences in the repeat library and profile HMM library. It also includes information such as the position of the identified TEs in the query sequence, the divergence from the consensus sequence, and the classified TE family. The .out file is a self-explanatory annotation file containing a summary of all the information in the .cat file except the alignment details.

After running RepeatMasker on the boreoeutherian ancestor genome, we extract the chromosome number, start and end positions, divergence percentage, and annotated family from the .cat output file. We store them in a Browser Extensible Data (BED) file, which is used as input by the UCSC Batch Coordinate Conversion tool (LiftOver). LiftOver converts genome positions from one genome assembly to another, which in our case is from the boreoeutherian ancestor to human. Running LiftOver on the boreoeutherian ancestor BED file, we get the human genome coordinates (UCSC hg19 assembly) corresponding to the TE sequences identified in the boreoeutherian ancestor genome. Here we use the LiftOver command-line version, where the chain file that holds the instructions for the conversion is set to boreoToHg19.chain (available at: http://cs.mcgill.ca/~aahiat1/LiftOver/). For more information on the LiftOver tool, see: https://genome.ucsc.edu/cgi-bin/hgLiftOver.

2.3 Divergence Calculation

After following the pipeline described above, we obtain all the TE positions in the human genome that were present in the boreoeutherian ancestor genome. However, these TE copies, which have been reported by RepeatMasker and then mapped to the human genome, are not annotated with the correct percentage of divergence from the consensus sequences in the library of TEs. This is because RepeatMasker calculates the divergences between the boreoeutherian ancestor and the consensus sequences.
LiftOver only lifts the identified regions to the human genome and is not meant to change the divergences. Correcting this is important because the divergence is an indication of the age of TE fragments. Therefore, our goal is to correct the percentage of divergence of each TE copy in human from its consensus sequence. This is achieved using a Java program we have developed called TEMapper (available at: http://cs.mcgill.ca/~aahiat1/TEMapper/).

Figure 2.5 shows a sample snapshot of a .cat file output by RepeatMasker, which includes details of the alignment between the consensus sequences and the query sequence. Figure 2.6 shows a sample LiftOver .mapped output file in which every row corresponds to one TE copy lifted from one genome assembly to another. Figure 2.7 shows an alignment block from a sample Ancestor program .maf output file.

Figure 2.5: RepeatMasker .cat output file example. Each block corresponds to one masked TE copy. Highlights are the information extracted from this file: divergence percentage, chromosome number, start and end positions, family, and the alignment between the boreoeutherian ancestor and consensus sequences.

Figure 2.6: LiftOver .mapped output file example. Each row contains the information of a TE fragment in the boreoeutherian ancestor genome (column 4), lifted to the human genome (columns 1-3).

Figure 2.7: Ancestor program .maf output file example. Highlights are the human and boreoeutherian ancestor sequences that are aligned in one alignment block. Sequences of other species and details of the reconstruction are ignored.

The first step is to extract, for each TE copy, the alignment between the boreoeutherian ancestor and consensus sequences reported by RepeatMasker in the .cat file. Then we search the reconstructed file given by Ancestor and extract the alignment between the human and boreoeutherian ancestor sequences corresponding to the same region.
Having these two alignments, we can align the human and consensus sequences through the boreoeutherian ancestor sequence and produce a third alignment from which we can calculate the correct divergence. In principle, the divergence between the consensus sequence and human must be less than or equal to the divergence between the consensus and the boreoeutherian ancestor plus the divergence between the boreoeutherian ancestor and human:

$$\mathrm{Div\%}(H,C) \le \mathrm{Div\%}(C,B) + \mathrm{Div\%}(B,H)$$

Our alignment algorithm works as follows. Consider:

Consensus sequence: $c = c_1 \dots c_n$
Boreoeutherian ancestor sequence: $b = b_1 \dots b_m$
Human sequence: $h = h_1 \dots h_l$

• If $c_i$ is aligned with $b_j$ and $b_j$ is aligned with $h_k$, then $c_i$ is aligned with $h_k$.
• If $c_i$ is aligned with a gap in $b$, then $c_i$ is aligned with a gap in human.
• If $c_i$ is aligned with $b_j$ and $b_j$ is aligned with a gap in human, then $c_i$ is aligned with a gap in human.

For example, given the following two hypothetical alignments:

Consensus:                … A G C T G G C T G T C A C - T C …
Boreoeutherian ancestor:  … A G C - - G C T G T T C C A T C …

Boreoeutherian ancestor:  … A G C - G - C T G T T C C A T C …
Human:                    … A G C A A A C T G T G - C A T C …

the alignment between the consensus and human sequences would be:

Consensus:                … A G C T G - G - C T G T C A C - T C …
Human:                    … A G C - - A A A C T G T G - C A T C …

We calculate the percentage of divergence between the consensus and the human sequences by dividing the number of mismatched positions, including gaps, by the alignment length. It is worth mentioning that other types of mutation events, such as duplications and rearrangements, might occur within the identified sequences. However, since they are rare and difficult to identify, we did not account for them in our model. Thus, our alignment algorithm only considers substitutions and indels. Note that the four sequences from the first two alignments must have the same length, including the gaps.
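The three rules above amount to threading the two pairwise alignments through their shared ancestor sequence. The following Python sketch is one way to do this for the example alignments; it is not the TEMapper code itself (which is written in Java), and the function names `compose_alignments` and `divergence_percent` are illustrative. It assumes that gaps are written as `-`, that both alignments cover exactly the same ancestral bases, and, as in the example, that consensus-only columns are emitted before human-only columns when both fall between the same two ancestral bases.

```python
GAP = "-"

def compose_alignments(cons, anc1, anc2, human):
    """Combine consensus/ancestor and ancestor/human alignments into a
    consensus/human alignment (illustrative sketch, not TEMapper)."""
    if len(cons) != len(anc1) or len(anc2) != len(human):
        raise ValueError("rows of a pairwise alignment must have equal length")
    if anc1.replace(GAP, "") != anc2.replace(GAP, ""):
        raise ValueError("the two ancestor rows must spell the same sequence")
    columns = []
    i = j = 0
    while i < len(anc1) or j < len(anc2):
        if i < len(anc1) and anc1[i] == GAP:
            # consensus base facing a gap in the ancestor: gap in human too
            columns.append((cons[i], GAP))
            i += 1
        elif j < len(anc2) and anc2[j] == GAP:
            # human base inserted relative to the ancestor: gap in consensus
            columns.append((GAP, human[j]))
            j += 1
        else:
            # same ancestral base in both alignments: pair consensus with human
            columns.append((cons[i], human[j]))
            i += 1
            j += 1
    # drop columns in which both rows hold a gap
    columns = [(c, h) for c, h in columns if not (c == GAP and h == GAP)]
    return "".join(c for c, _ in columns), "".join(h for _, h in columns)

def divergence_percent(cons, human):
    """Percentage of columns that differ, counting gaps as mismatches."""
    mismatches = sum(1 for c, h in zip(cons, human) if c != h)
    return 100.0 * mismatches / len(cons)

# The hypothetical example from the text
cons_row, human_row = compose_alignments(
    "AGCTGGCTGTCAC-TC",   # consensus
    "AGC--GCTGTTCCATC",   # ancestor, aligned to the consensus
    "AGC-G-CTGTTCCATC",   # ancestor, aligned to human
    "AGCAAACTGTG-CATC",   # human
)
divergence = divergence_percent(cons_row, human_row)
print(cons_row)              # AGCTG-G-CTGTCAC-TC
print(human_row)             # AGC--AAACTGTG-CATC
print(round(divergence, 1))  # 44.4
```

For this example, 8 of the 18 columns differ once gaps are counted.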
$$\mathrm{Div\%}(H,C) = \frac{\text{number of mismatches, including gaps}}{\text{alignment length}} \times 100$$

Therefore, in our example, with 8 mismatched columns out of 18, the divergence is about 44%.

Beyond the ordinary implementation difficulties and errors, implementing TEMapper involved three main complications. The first complication originates from the enormous size of the data. Approximately 1.8 Gb of the boreoeutherian ancestral sequence has been reconstructed, in which more than three million TE segments were masked by RepeatMasker and subsequently mapped to the human genome by LiftOver. Since every disk access is time consuming, we used a hash map to speed up access to these files. The hash map uses the TE coordinates in the boreoeutherian ancestor as keys and the human TE coordinates as values. The downside of this approach is the large amount of memory required; however, this remained manageable and resulted in significant speed-ups. In addition, we need to extract the alignment between the boreoeutherian ancestor and human sequences from the Ancestor program output file, but the mapped TE segments are dispersed throughout this massive alignment, and navigating the whole alignment multiple times is highly time consuming. Therefore, we partitioned the sequence alignment into short blocks and saved the position of the first nucleotide of each block. Once we read the coordinates of a TE from the LiftOver hash map, we only search around the closest alignment block. This approach drastically reduces our search space.

The second problem is that, although the two boreoeutherian ancestor sequences, one obtained from the Ancestor output and the other from the RepeatMasker output, are supposed to be identical, sometimes they do not align perfectly. That is, there are cases in which the two extracted boreoeutherian ancestor sequences (corresponding to a single TE fragment) are shifted by a few bases at one or both ends of one of the sequences.
We believe that the issue is caused by a small bug in the LiftOver program, which converts the genome positions between the two genome sequences. We noticed this issue when our results showed a significant increase in divergence and did not match our test cases. Trimming the ends would not easily solve the problem, because the shift disturbs the whole alignment. Therefore, we had to observe many examples of each case, identify the shifting pattern and write a program to trim both sequences accordingly. Since there was no fixed pattern for the shifting, we wrote a program that tests all possible shifts (taking the longest common sequence) and reports the one that reconciles the two sequences.

The third problem arose when we faced missing alignment blocks between the boreoeutherian ancestor and human. These are due to insertions of DNA fragments (of TE origin or not) within an ancestral TE that occurred in the human lineage after the boreoeutherian ancestor. For instance, an Alu could have been inserted inside an ancient L1 fragment, creating a nested TE (Figure 2.8). If we were to consider those missing blocks as gaps, the calculated divergence would be incorrectly inflated. By eliminating those cases, however, we would lose many identified ancient TEs. In addition, considering the inserted fragments as part of the ancient TE would count the overlapping regions more than once and categorize them under different TE families. Thus, in our implementation, we recognize those cases, excise the missing blocks and consider each of the remaining blocks as a separate TE fragment with the same divergence and family.

Figure 2.8: Hypothetical example of nested TEs inserted into the genome. L1 (ATT) is represented by a purple square and Alu (CG) by a blue square.

Recognizing the last two issues consumed a considerable amount of time because, given the size of the data, the errors were not evident in the early stages. We were only able to trace the problems in the final steps, by visualizing the results in the UCSC genome browser.

After applying the correction algorithm to the set of TEs identified in the boreoeutherian ancestor and mapped to human, we obtain a BED file containing the TE coordinates in the human genome, the corrected percentage of divergence, and the family they belong to. The results, however, do not include those TE copies that were inserted into the human genome after the boreoeutherian ancestor, such as Alus. We thus run RepeatMasker on the human genome (UCSC hg19 assembly) and, from the .out output file, produce a second BED file with the same format as the first one, containing all TEs identified in the human genome. There are, of course, many overlapping segments corresponding to TEs recognizable by RepeatMasker in both the boreoeutherian ancestor and human genomes. To eliminate the redundancy and obtain the intersection of our two BED files, we run FeatureBits, another UCSC tool, which reports the intersection of two given BED files. The result is a single BED file containing the human genome coordinates, divergence percentage, and family of the transposable elements identified only in the human genome, in both the boreoeutherian ancestor and human genomes, and only in the boreoeutherian ancestor genome (BED files are available at: http://cs.mcgill.ca/~aahiat1/FeatureBits/). In the following section, we discuss the results achieved by applying this pipeline.
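The three subsets named above (human only, shared, and ancestor only) are ordinary interval set operations. The sketch below is a pure-Python stand-in for that step, not FeatureBits itself. It assumes half-open (start, end) intervals on a single chromosome and ignores the family and divergence fields; the function names and the toy coordinates are illustrative.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

def intersect(a, b):
    """Regions covered by both interval sets."""
    a, b = merge(a), merge(b)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def subtract(a, b):
    """Regions covered by a but not by b."""
    b = merge(b)
    out = []
    for start, end in merge(a):
        cur = start
        for b_start, b_end in b:
            if b_end <= cur or b_start >= end:
                continue  # no overlap with the remaining part of this interval
            if b_start > cur:
                out.append((cur, b_start))
            cur = max(cur, b_end)
            if cur >= end:
                break
        if cur < end:
            out.append((cur, end))
    return out

# Toy coordinates (hypothetical), all on one chromosome
human_tes = [(100, 200), (300, 400)]
ancestor_tes = [(150, 250), (500, 600)]   # already lifted to human coordinates

both = intersect(human_tes, ancestor_tes)
human_only = subtract(human_tes, ancestor_tes)
ancestor_only = subtract(ancestor_tes, human_tes)
print(both, human_only, ancestor_only)
```

A real run would apply the same operations per chromosome and carry the family and divergence annotations along with each interval.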
3 Results and Discussion

In this section, we report the results obtained by applying the pipeline shown in Figure 2.2 and explained in the previous section. The results include the divergence profiles of TEs in the human and boreoeutherian ancestor genomes, the revised annotation of TEs identified in the human genome and the share gained by each TE family, as well as an analysis of some novel TE families obtained by de novo TE discovery techniques.

3.1 Human and Boreoeutherian Ancestor Divergence Profiles

Using RepeatMasker, we annotated the recognizable TEs in both the boreoeutherian ancestor and human genomes. For each major TE family reported, we plotted the percentage of divergence from the consensus versus the genome coverage, where each point on a curve represents the genome coverage of a TE family at a certain divergence level. In the human genome (Figure 2.9), Alus and L1s are the youngest and most abundant elements identified, since they are currently the only active ones. L2 and CR1 elements, on the contrary, are older, and therefore fewer copies of them are annotated. In the boreoeutherian ancestor genome (Figure 2.10), however, LINE L2 and CR1 elements, for example, are detected in significantly larger fractions than in human, because these elements are much less decayed in the boreoeutherian ancestor sequence. Alus, on the other hand, are mostly absent from the boreoeutherian ancestor, as they are primate-specific TEs. We discuss each of these cases in more detail later. Note that the rate of TE insertion has changed over time. Therefore, the bumps on some of the curves (most noticeable in the human L1 curve) reflect sudden increases in TE insertions at certain times.
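A divergence profile of this kind can be tabulated directly from BED-like records. The sketch below assumes records of the form (chromosome, start, end, divergence percentage, family), groups divergence into whole-percent levels, and sums the bases covered per family and level. The record layout, the toy values and the function name `divergence_profile` are assumptions made for illustration, not the format of the thesis files.

```python
from collections import defaultdict

def divergence_profile(records):
    """Return {family: {divergence level in whole percent: bases covered}}."""
    profile = defaultdict(lambda: defaultdict(int))
    for _chrom, start, end, divergence, family in records:
        level = int(divergence)              # e.g. 12.4% and 12.9% share level 12
        profile[family][level] += end - start
    # plain dicts with the levels sorted, ready for plotting
    return {fam: dict(sorted(levels.items())) for fam, levels in profile.items()}

# Toy records (hypothetical coordinates and divergences)
records = [
    ("chr1", 0, 300, 12.4, "LINE/L2"),
    ("chr1", 500, 600, 12.9, "LINE/L2"),
    ("chr2", 0, 50, 30.1, "LINE/L2"),
    ("chr1", 700, 1000, 8.0, "SINE/Alu"),
]
profile = divergence_profile(records)
print(profile)
```

Plotting each family's levels against its summed coverage gives one curve per family, which is the shape of plot described above.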
If we take a closer look at the L2 (Figure 2.11) and CR1 (Figure 2.12) subfamilies of the LINE family, we see that in the comparison plots the blue curves, which correspond to the coverage of TEs identified in the boreoeutherian ancestor, are higher than those of human, shown in red. This means that RepeatMasker detected more copies from these families in the boreoeutherian ancestor genome. These copies are also younger, as indicated by the shift of the blue curves to the left. These changes are expected because the elements of the LINE L2 and CR1 subfamilies were inserted into mammalian genomes long before the boreoeutherian ancestor. Therefore, RepeatMasker was able to find a larger number of copies with less divergence.

On the other hand, Alu elements, which belong to the SINE family and are abundant in the human genome, are not expected in the boreoeutherian ancestor genome, because they were inserted long after the boreoeutherian ancestor and are specific to the primate lineage. However, due to RepeatMasker detection errors and the fact that the boreoeutherian ancestor sequence inferred by Ancestor is not 100% accurate, a small number of Alu elements (the dark blue curve in Figure 2.13) are detected in the boreoeutherian ancestor.

Figure 2.9: Divergence profile of annotated major TEs in the human genome

Figure 2.10: Divergence profile of annotated major TEs in the boreoeutherian ancestor genome

Figure 2.11: LINE/L2 elements divergence profile comparison between the human and boreoeutherian ancestor genomes

Figure 2.12: LINE/CR1 elements divergence profile comparison between the human and boreoeutherian ancestor genomes

Figure 2.13: SINE/Alu elements divergence profile comparison between the human and boreoeutherian ancestor genomes

3.2 Mapping and Divergence Correction

By running RepeatMasker on the boreoeutherian ancestor, we obtained a .cat file (Figure 2.5), from which we extracted all the identified TE regions to be lifted to the human genome by LiftOver (Figure 2.6). Our TEMapper program combines the results obtained from the RepeatMasker, LiftOver, and Ancestor (Figure 2.7) programs and produces a BED file containing the revised annotation of human TEs. The details of the TEMapper program and the ancient TE identification pipeline are explained fully in the methods section.

We calculated the genome coverage of every TE family at each divergence level for the TEs identified in the boreoeutherian ancestor only, in both the boreoeutherian ancestor and human, and in human only. This was done by running FeatureBits on the files containing the boreoeutherian ancestor and human annotated TE regions to obtain the intersection, which corresponds to the TEs existing in both genomes. Subtracting the intersected regions from both files, we separately obtained the TEs existing only in the human genome and only in the boreoeutherian ancestor genome. For each major family of TEs, we plotted the three subsets of our data; here we report the most interesting ones. Figures 2.14-2.16 illustrate the coverage gained by LINE/CR1, LINE/L2, and SINE/MIR, respectively, in our revised TE annotation, where the stacked area under the blue curve corresponds to the human-only portion, purple to the portion shared by human and the boreoeutherian ancestor, and red to the boreoeutherian-only portion. Each point on a curve represents the genome coverage at a certain divergence level.
In the CR1 (Figure 2.14), L2 (Figure 2.15), and MIR (Figure 2.16) plots, the red curves, which correspond to the TEs found only through the boreoeutherian ancestor, are shifted to the right, meaning that these TE copies are more diverged from their consensus sequences. In addition, the peaks of these curves are higher than the others, which shows the genome coverage gained by each of those families. These results are consistent with our expectations, since the LINE family and SINE/MIR are the oldest known TE families, which were inserted into mammalian genomes more than 70 million years ago, before the boreoeutherian ancestor. Therefore, we had expected to identify more copies in the human genome with higher percentages of divergence.

Figure 2.14: Coverage gained by revised annotation of LINE/CR1 elements

Figure 2.15: Coverage gained by revised annotation of LINE/L2 elements

Figure 2.16: Coverage gained by revised annotation of SINE/MIR elements

In Table 2.1, we report the breakdown of the coverage gained by each TE family identified using RepeatMasker. The coverage gained refers to the portion of the identified TEs in human that was not originally annotated as TE by RepeatMasker when run on human only, but was labeled as such after mapping to human the TEs identified in the boreoeutherian ancestor. To clarify, considering a genomic position p and a specific TE family X, we categorized our identified TEs as follows.

A genomic position p is labeled as belonging to TE family X of type “human only” if p is labeled as:
A. family X in human but not identified as a TE in the boreoeutherian ancestor.
B. family X in human but family Y in the boreoeutherian ancestor (to account for copies that are also identified in the boreoeutherian ancestor but categorized under a different family).

A genomic position p is labeled as belonging to TE family X of type “boreoeutherian ancestor and human” if p is labeled as:
C. family X in both human and the boreoeutherian ancestor.
D. family X in the boreoeutherian ancestor and family Y in human (to account for copies that are already identified in human but categorized under a different family).

A genomic position p is labeled as belonging to TE family X of type “boreoeutherian ancestor only” if p is labeled as:
E. family X in the boreoeutherian ancestor but not identified as a TE in human.

The coverage gained is calculated as follows:
F. Total coverage of family X in the previous annotation of the human genome = A + B + C
G. Total coverage of family X in the revised annotation of the human genome = A + C + D + E
Coverage gained by family X in the revised annotation of the human genome = G - F

Class/Family                   A (Mb)   B (Mb)   C (Mb)   D (Mb)   E (Mb)   F (Mb)   G (Mb)   Gained (Mb)  Gained (%)
LINE/L1                        398.93   4.29     159.16   11.07    16.1     562.37   585.2    22.88        4.1
LINE/L2                        26.34    0.68     99.9     4.78     34.09    126.91   156.1    38.20        30.1
LINE/CR1                       2.08     0.08     11.69    0.58     5.10     13.9     19.50    5.60         40.47
LINE/Others                    0.97     0.04     5.73     0.32     1.66     6.74     8.68     1.94         28.79
SINE/Alu                       273.36   14.05    18.17    1.75     0.75     305.58   294.2    -11.55       -3.78
SINE/MIR                       14.96    0.53     72.50    3.14     19.03    87.99    109.6    21.64        24.59
SINE/Others                    0.31     0.01     1.34     0.03     1.17     1.65     2.84     1.19         72.05
LTR/ERVL-MALR                  67.97    2.00     56.14    2.41     4.62     126.11   131.1    5.02         3.98
LTR/ERV1                       78.91    1.55     10.25    0.56     1.79     90.72    91.51    0.79         0.87
LTR/ERVL                       32.66    0.84     34.07    1.64     2.70     67.58    71.08    3.50         5.18
LTR/ERVK                       8.83     0.12     0.01     0.0      0.01     8.96     8.86     -0.1         -1.16
LTR/Others                     5.6      0.13     10.0     0.4      3.65     15.66    19.63    3.97         25.34
DNA Transposon/TCMAR-TIGGER    27.55    0.86     11.98    0.60     3.38     40.39    43.51    3.12         7.73
DNA Transposon/HAT-CHARLIE     14.02    0.50     40.22    2.69     5.71     54.74    62.64    7.90         14.43
DNA Transposon/HAT-TIP100      2.43     0.10     8.20     0.51     2.43     10.72    13.55    2.85         26.58
DNA Transposon/HAT-BLACKJACK   0.93     0.04     2.87     0.15     0.34     3.8      4.29     0.45         11.64
DNA Transposon/TCMAR-MARINER   2.05     0.12     1.12     0.07     0.31     3.29     3.55     0.26         7.96
DNA Transposon/Others          5.02     0.25     1.71     0.17     0.25     6.99     7.16     0.17         2.43
RNAs                           1.54     0.26     0.21     0.05     0.12     2.01     1.92     -0.09        -4.51
Retroposon                     9.92     0.55     0.01     0.40     0.01     10.48    10.51    0.03         0.29
Helitron                       0.23     0.0      0.40     0.01     0.05     0.63     0.70     0.06         10.01
Total                          ~988     ~27      ~564     ~32      ~110     ~1580    ~1695    ~115         ~7.28

Table 2.1: Revised TE annotation by the ancient transposable element annotation pipeline. Columns A-G are explained in the text; the percentage of coverage gained is relative to the previous coverage F.

As expected, the most substantial gain belongs to the elements of the old LINE family, which has resided in mammalian genomes for more than 70 million years; there were therefore more ancient copies of this family to identify. CR1 and L2 elements in particular show a significant increase in genome coverage in the revised annotation (approximately 40% and 30%, respectively). The other family that accounts for a significant gain in coverage is the SINE family (excluding Alus), namely the ancient MIR elements (~25% for MIR and ~72% for other SINE elements). On the contrary, elements belonging to the SINE/Alu, LTR/ERVK, and RNA families show slightly negative changes in coverage, which means that some regions originally labeled as belonging to these families are now assigned to other families. For example, a fragment in the human genome may align best with the LTR/ERVK consensus sequence, while the less diverged version of the same fragment in the ancestor in fact aligns better with the LTR/ERVL consensus sequence. In the revised annotation, we therefore lose that coverage under the LTR/ERVK family and instead annotate that region as LTR/ERVL.

Overall, according to our results (Table 2.1), ~115 Mb of genome coverage is gained across all TE families, which corresponds to a ~7.28% increase over the ~1580 Mb of TE coverage in the previous annotation. This result is significant, since it adds approximately 3.5% to the total fraction of the human genome derived from TEs.

3.3 De-novo Transposable Element Discovery

To investigate the boreoeutherian ancestral genome for the possible presence of unknown TE families, we used the RepeatModeler [Smit & Hubler 2008] program to analyze the boreoeutherian ancestor sequence.
RepeatModeler is a de novo repeat family identification and modeling package that employs two de novo repeat finding programs, RECON [Bao & Eddy 2002] and RepeatScout [Price et al. 2005], for TE detection. Most of the consensus sequences produced were classified as belonging to known TE families or to gene families (olfactory receptors and zinc fingers) and were disregarded. Nonetheless, we discovered 31 novel TE families, summarized in Table 2.2 (available at: http://cs.mcgill.ca/~aahiat1/RepeatModeler/). Their size varies from ~130 to 3800 bp, their copy number from ~100 to 1580, and their level of divergence from the consensus from ~7 to 39%.

Family Name          Copy # in      Human Genome     Range of Divergence    Consensus Sequence   TEClass
                     human genome   Coverage (bp)    from Consensus (%)     Length (bp)          Classification
rnd-5_family-38      226            107598           10-33                  376                  LTR
rnd-5_family-120     601            79268            10-37                  205                  LTR
rnd-5_family-170     872            445434           10-35                  1939                 LTR
rnd-5_family-394     373            151834           15-35                  723                  LINE
rnd-5_family-482     719            138338           11-35                  232                  DNA Transposon
rnd-5_family-836     249            104599           14-35                  551                  LTR
rnd-5_family-843     459            333348           9-30                   300                  LTR
rnd-5_family-1234    963            108281           9-36                   247                  LINE
rnd-5_family-1704    1581           252427           9-35                   348                  Unknown
rnd-6_family-130     313            45540            14-34                  217                  DNA Transposon
rnd-6_family-131     368            62472            14-34                  359                  Unknown
rnd-6_family-133     619            883438           7-33                   214                  Unknown
rnd-6_family-324     921            337607           7-39                   823                  LTR
rnd-6_family-389     598            181359           12-37                  534                  LTR
rnd-6_family-525     351            81445            12-37                  332                  DNA Transposon
rnd-6_family-817     857            113735           10-34                  132                  DNA Transposon
rnd-6_family-860     698            572181           7-36                   1205                 Unknown
rnd-6_family-916     354            177062           9-33                   658                  LTR
rnd-6_family-923     212            98551            10-27                  714                  LINE
rnd-6_family-967     762            338536           10-34                  735                  LTR
rnd-6_family-1522    206            22256            10-32                  141                  DNA Transposon
rnd-6_family-1657    933            122342           7-35                   262                  DNA Transposon
rnd-6_family-1695    478            79659            7-33                   246                  DNA Transposon
rnd-6_family-2070    687            190661           11-37                  2375                 Unknown
rnd-6_family-2131    307            37485            11-36                  165                  LTR
rnd-6_family-2444    431            58933            10-31                  151                  DNA Transposon
rnd-6_family-2885    372            144161           8-36                   613                  LTR
rnd-6_family-4223    546            141788           10-36                  307                  DNA Transposon
rnd-6_family-5085    194            67383            8-29                   339                  Unknown
rnd-6_family-5663    100            190737           9-34                   3802                 LTR
rnd-6_family-5936    278            105884           10-27                  460                  LINE

Table 2.2: Summary of de novo TE family discovery

Using TEClass [Abrusan et al. 2009], we classified these 31 unknown TE families. TEClass is an automated tool for the classification of unknown eukaryotic TEs. According to their mechanism of transposition, TEs are classified into four categories: DNA transposons, LTRs, LINEs and SINEs. TEClass employs different classifiers to sort repeats into DNA transposons versus retrotransposons, LTRs versus non-LTRs for retrotransposons, LINEs versus SINEs for non-LTR repeats, and forward versus reverse sequence orientation. In cases where most of the classifiers are not in agreement, it reports the conflicting results as unknown. TEClass was reported to achieve 90-97% accuracy in the classification of novel DNA and LTR repeats, and 75% for LINEs and SINEs. The limitation of this tool is that it cannot distinguish between TEs and non-TEs, meaning that every given sequence will be classified into one of the four categories even if it is not a TE. Therefore, it is essential to filter out non-TE sequences before performing this classification. Assuming that the repeat families identified by RepeatModeler are TE consensus sequences, TEClass categorizes them into 9 DNA transposons, 12 LTRs, 4 LINEs, 0 SINEs and 6 unknowns.

We performed a homology-based search using RepeatMasker on the boreoeutherian ancestor sequence to annotate all the TE copies belonging to each of the novel families. Once we had identified these copies, we mapped them to the human genome, which essentially means applying the ancient TE annotation pipeline described in section 2.
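As a consistency check on the TEClass counts quoted above (9 DNA transposons, 12 LTRs, 4 LINEs, 0 SINEs and 6 unknowns), the classification column of Table 2.2 can be tallied. The snippet below hard-codes that column in row order; the abbreviations DNA (DNA Transposon) and UNK (Unknown) are shorthand introduced here.

```python
from collections import Counter

# TEClass classification column of Table 2.2, in row order
rnd5 = "LTR LTR LTR LINE DNA LTR LTR LINE UNK"
rnd6 = ("DNA UNK UNK LTR LTR DNA DNA UNK LTR LINE LTR "
        "DNA DNA DNA UNK LTR DNA LTR DNA UNK LTR LINE")
labels = (rnd5 + " " + rnd6).split()

counts = Counter(labels)
print(len(labels), dict(counts))
```

The tally covers all 31 families and reproduces the counts given in the text.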
After identifying the copies of the 31 novel TE families in the human genome, our goal was to determine whether some of them may have contributed particular types of functional elements to the genome. Although many protein-coding genes are well annotated with their biological functions, non-coding regions typically lack such annotation. The Genomic Regions Enrichment of Annotations Tool (GREAT) [McLean et al. 2010] is a function prediction program that predicts the biological role of sets of non-coding genomic regions by analyzing the functions of nearby genes. Using this tool, we investigated the genes and functional elements near the annotated elements in the human genome that belong to the 31 novel ‘unknown’ TE families reported by RepeatModeler.

Members of each novel TE family were also analyzed to determine whether they overlap different types of functional elements in the human genome. These elements were: (i) highly conserved regions identified by the PhastCons program [Siepel et al. 2005]; (ii) transcription factor binding sites identified by the ENCODE project [Birney et al. 2007]; (iii) DNAseI hypersensitive regions, also identified by the ENCODE project, which correspond to regions of open chromatin [Birney et al. 2007]. We used the FeatureBits program to count the number of bases that overlap between each TE family and each of these types of annotation. We then computed a fold-enrichment measure, which is the ratio of the observed number of bases in the intersection to the amount of overlap expected if the two sets of regions were selected randomly in the genome. A p-value was associated with that fold-enrichment using a simple z-score calculation. Among our results, the following are significant:

• Family ‘rnd-5_family-120’ is 3.9-fold enriched for phastCons elements (p-value = 3×10!!! ), 1.6-fold enriched for DNAseI hypersensitive regions (p-value = 3×10!!" ) and 4.1-fold enriched for CTCF binding sites (p-value = 8×10!!! ), as compared to what would be expected by chance. This family is close to genes involved in cation homeostasis, dilated cardiomyopathy ( 3×10!! ), and abnormal renin activity ( 7×10!! ).
• Family ‘rnd-6_family-2131’ has a 3.6-fold enrichment for the transcription factor binding sites STAT3, ERalpha, NF-E2 and GATA3. This family is enriched close to genes expressed in the cerebral cortex (p-value = 4×10!! ).
• Family ‘rnd-6_family-916’ is enriched 4-fold near genes involved in DNA packaging and chromatin assembly.
• Family ‘rnd-6_family-967’ is enriched 3-fold near genes expressed in the cerebral cortex.

3.4 Conclusion

We believe that these results are very promising and that further analysis of the patterns, nearby gene enrichment and functional element overlaps of these new TE families would be valuable. As reviewed in chapter 1, TEs contribute to the mutation of protein-coding genes, the rewiring of regulatory networks, genetic diseases and cancer. However, their diverse biological roles are still under investigation and many remain to be discovered. Further analysis of these unknown TE families can potentially provide some insight into the association of TEs with specific TFBS and with protein-coding genes involved in biological pathways, and ultimately into their impact on medicine and human health.

Chapter 3: Conclusion and Future Directions

Transposable elements, which are probably the most abundant class of repetitive sequences in the genomes of all eukaryotes, are homologous DNA fragments present in multiple copies in a genome. Since their discovery about 65 years ago, they have drawn increasing attention from the scientific community. Their unique replication mechanisms and sheer abundance make them interesting biological entities to study. Despite the long-held belief that they are selfish parasitic sequences, they have contributed to genome size, structure and arrangement.
In addition, the involvement of TEs in host functional genes, regulatory network evolution, genetic disorders and cancer is now undeniable. With the rapidly increasing number of sequenced genomes, accurate genome annotation is more essential than ever. The growing awareness of the challenges in understanding the dynamic component of the genome is clearly reflected in the increasing number of advanced methods for TE discovery. While there are mature automated systems for identifying coding regions, there is no equally robust approach for TEs. Although, due to the nature of TEs, there might never be a single detection approach, improving the existing methods (de novo, homology-based, structure-based, comparative and integrated) and using them in conjunction would be effective and realistic. This indicates that TE bioinformatics is still in a growing phase.

This research is motivated by the fact that existing annotation tools are not capable of detecting TEs that are more than 40% diverged from their original consensus sequence. Therefore, the TEs annotated in the human genome are the relatively younger copies detectable by current techniques, which does not mean that all the TE fragments in the human genome are annotated. Having an ancestral genome works to our advantage, since it contains younger versions of the ancient TEs in the human genome.

In this thesis, we designed a complete pipeline for the annotation of ancient TEs in the human genome, using both pre-existing methods and our new algorithms. The genome of the boreoeutherian ancestor, the ancestor of almost all mammals, which lived about 75 million years ago, has been reconstructed with 98-99% accuracy using the multiz alignment and Ancestor programs. Our pipeline employs RepeatMasker, a classic homology-based TE detection tool, to identify TEs in the boreoeutherian ancestor and human genomes.
Using LiftOver, the annotated TEs in the boreoeutherian sequence are lifted to the human genome. Then the TEMapper program, which we have developed, aligns these new TE copies in human to the consensus sequences in the RepeatMasker TE library and calculates the correct divergence percentages. Finally, FeatureBits is fed with the TEs annotated in human and the TEs mapped to human to remove the duplications and provide a single output file containing the revised TE annotation.

Based on our analysis, the total coverage gained by the revised TE annotation is 115 Mb, corresponding to ~3.5% of the human genome. Because the LINE family is one of the oldest known TE families in human, inserted before the boreoeutherian ancestor, the elements of this family, in particular CR1 and L2, account for most of the gain. After that, elements belonging to the ancient SINE family, excluding Alus, account for a substantial amount of the coverage gained.

In addition, we discovered and classified 31 novel TE families and analyzed their enrichment around genes and functional and regulatory elements. Associations with (i) CTCF, STAT3, GATA3, ERalpha and NF-E2 binding sites; (ii) phastCons elements; (iii) DNAseI hypersensitive regions; (iv) genes involved in cation homeostasis, dilated cardiomyopathy, abnormal renin activity, cerebral cortex, DNA packaging and chromatin assembly are among the significant ones recognized. The analysis of the identity, structure and biological roles of the newly discovered TE families, and of their associations with genes, regulatory elements and diseases, is still at a preliminary stage and needs to be investigated further in the future. Aside from that, other human ancestors can be inferred and fed to the ancient TE annotation pipeline in order to identify more TE copies in human or other mammals. The same principle can be applied to other eukaryotes, particularly plants, since TEs are even more widespread and more active in their genomes compared to mammals [Feschotte et al.
2002].

Another interesting direction of research that can be taken using the ancestral sequence is investigating other genomic elements, such as transcription factor binding sites (TFBS), regulatory elements, and pseudogenes. Pseudogenes [Pink & Carter 2013], in particular, are interesting biological entities to study. They are non-functional copies of normal functional protein-coding genes that have either lost their function at the transcription level, the translation level or both, or are no longer expressed in the cell. They arise either during DNA replication, from duplication in the DNA sequence, or during reverse transcription of mRNA with subsequent reintegration of the cDNA into the genome. They can provide records of how genomic DNA has changed to ensure the survival of the organism. In addition, they can be used as a model of evolutionary events such as the rates of nucleotide substitution, insertion and deletion in the genome. Despite the loss of their protein-coding function, pseudogenes are similar to TEs in that they can have regulatory roles [Mura et al. 2011], contribute to diseases [Pink et al. 2011] and more. An approach similar to the one presented by Svensson et al. 2006 can be used for pseudogene prediction in an ancestral genome. Our pipeline presented in Figure 2.2 can be adapted to locate pseudogenes using BLAST instead of RepeatMasker. Applying the adapted pipeline to the human and boreoeutherian ancestor genomes, we expect to find ancient pseudogenes that have been lost in the human genome and are not identifiable in human by current methods. All of this suggests that there is much to be gained from studying ancestral DNA, and that this work has only scratched the surface of this new field.

References

Abrusan G, Grundmann N, DeMeester L, Makalowski W: TEclass: a tool for automated classification of unknown eukaryotic transposable elements. Bioinformatics 2009, 25:1329-1330.

Altschul SF, Gish W, Miller W, Myers W, Lipman J: Basic local alignment search tool.
J Mol Biol 1990, 215:403-10.
Andrieu O, Fiston AS, Anxolabehere D, Quesneville H: Detection of transposable elements by their compositional bias. BMC Bioinformatics 2004, 5:94.
Arkhipova I, Pyatkov K, Meselson M, Evgen’ev M: Retroelements containing introns in diverse invertebrate taxa. Nat. Genet. 2003, 33:123-124.
Badyaev A: Stress-induced variation in evolution: from behavioural plasticity to genetic assimilation. Proc. Biol. Sci. 2005, 272:877-886.
Bao Z, Eddy S: Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res 2002, 12:1269-76.
Bateman A, Birney E, Cerruti L, Durin R, Etwiller L, Eddy R, Griffiths-Jones S, Howe L, Marshall M, Sonnhammer L: The Pfam protein families database. Nucleic Acids Res 2002, 30:276-80.
Belancio P, Hedges J, Deininger P: Mammalian non-LTR retrotransposons: for better or worse, in sickness and in health. Genome Res. 2008, 18:343-358.
Bénit L, Lallemand JB, Casella JF, Philippe H, Heidmann T: ERV-L elements: a family of endogenous retrovirus-like elements active throughout the evolution of mammals. J Virol 1999, 73:3301-3308.
Bennett EA, Coleman LE, Tsui C, Pittard W, Devine S: Natural genetic variation caused by transposable elements in humans. Genetics 2004, 168:933-51.
Berezikov E, Bucheton A, Busseau I: A search for reverse transcriptase-coding sequences reveals new non-LTR retrotransposons in the genome of Drosophila melanogaster. Genome Biol 2000, 1:RESEARCH0012.
Bergman C, Quesneville H: Discovering and detecting transposable elements in genome sequences. Brief Bioinform. 2007, 8(6):382-392.
Birney E, Stamatoyannopoulos J, et al.: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007, 447(7146):799-816.
Blanchette M, Green E, Miller W, Haussler D: Reconstructing large regions of an ancestral mammalian genome in silico. Genome Res 2004, 14(12):2412-2423.
Blanchette M, Kent W, Riemer C, Elnitski L, Smit A, Roskin K, Baertsch R, Rosenbloom K, Clawson H, Green E, Haussler D, Miller W: Aligning multiple genomic sequences with the threaded blockset aligner. Genome Research 2004, 14(4):708-715.
Bourque G, Pevzner P: Genome-Scale Evolution: Reconstructing Gene Orders in the Ancestral Species. Genome Res 2002, 12:26-36.
Britten R, Kohne D: Repeated sequences in DNA. Hundreds of thousands of copies of DNA sequences have been incorporated into the genomes of higher organisms. Science 1968, 161:529-540.
Brosius J, Gould S: On ‘genomenclature’: a comprehensive (and respectful) taxonomy for pseudogenes and other ‘junk DNA’. Proc Natl Acad Sci USA 1992, 89:10706-10710.
Brouha B, Schustak J, Badge RM, Lutz-Prigge S, Farley AH, Moran JV, Kazazian HH Jr: Hot L1s account for the bulk of retrotransposition in the human population. Proc Natl Acad Sci USA 2003, 100:5280-5285.
Callinan A, Batzer A: Retrotransposable elements and human disease. Genome Dyn. 2006, 1:104-115.
Caspi A, Pachter L: Identification of transposable elements using multiple alignments of related genomes. Genome Res 2006, 16:260-70.
Chen M, Chuzhanova N, Stenson PD, Férec C, Cooper N: Meta-analysis of gross insertions causing human genetic disease: novel mutational mechanisms and the role of replication slippage. Hum. Mutat. 2005, 25:207-221.
Chen M, Stenson D, Cooper N, Ferec C: A systematic analysis of LINE-1 endonuclease-dependent retrotranspositional events causing human genetic disease. Hum. Genet. 2005, 117:411-427.
Cordaux R, Batzer M: The impact of retrotransposons on human genome evolution. Nature Reviews Genetics 2009, 10:691-703.
Craig N: Unity in transposition reactions. Science 1995, 270:253-254.
Deininger L, Batzer A: Alu repeats and human disease. Mol. Genet. Metab. 1999, 67:183-193.
Diallo A, Makarenkov V, Blanchette M: Ancestors 1.0: a web server for ancestral sequence reconstruction. Bioinformatics 2010, 26(1):130-1.
Divoky V, Indrak K, Murg M, Brabec V, Huisman J, Prchal T: A novel mechanism of beta thalassemia: the insertion of L1 retrotransposable element into beta globin IVSII. Blood 1996, 88:148a.
Eddy S: Profile hidden Markov models. Bioinformatics 1998, 14:755-763.
Edgar R, Myers E: PILER: identification and classification of genomic repeats. Bioinformatics 2005, 21(suppl. 1):52-58.
Feschotte C, Jiang N, Wessler SR: Plant transposable elements: where genetics meets genomics. Nature Reviews Genetics 2002, 3(5):329-341.
Feschotte C, Pritham E: DNA transposons and the evolution of eukaryotic genomes. Annu Rev Genet. 2007, 41:331-368.
Feschotte C: Transposable elements and the evolution of regulatory networks. Nat Rev Genet 2008, 9:397-405.
Flutre T, Duprat E, Feuillet C, Quesneville H: Considering Transposable Element Diversification in De Novo Annotation Approaches. PLoS One 2011, 6:e16526.
Gentles A, Wakefield M, Kohany O, Gu W, Batzer M, Pollock D, Jurka J: Evolutionary dynamics of transposable elements in the short-tailed opossum Monodelphis domestica. Genome Res. 2009, 17:992-1004.
Gregory T: Synergy between sequence and size in large-scale genomics. Nature Reviews Genetics 2005, 6:699-708.
Gu W, Castoe T, Hedges D, Batzer M, Pollock D: Identification of repeat structure in large genomes using repeat probability clouds. Anal Biochem 2008, 380(1):77-83.
Hancks C, Kazazian H: Active human retrotransposons: variation and disease. Curr Opin Genet Dev 2012, 22(3):191-203.
Huang R, Schneider M, Lu Y, Niranjan T, Shen P, Robinson A, Steranka P, Valle D, Civin I, Wang T, Wheelan J, Ji H, Boeke D, Burns H: Mobile interspersed repeats are major structural variants in the human genome. Eur J Hum Genet. 1993, 1(1):30-6.
Huang X: Global Sequence Alignment. Computer Applications in the Biosciences 1994, 10:227-235.
Hutchinson B, Andrew E, McDonald H, Goldberg P, Graham R, Rommens M, Hayden R: An Alu element retroposition in two families with Huntington disease defines a new active Alu subfamily. Nucleic Acids Research 1993, 21(15):3379-3383.
Jasinska A, Krzyzosiak W: Repetitive sequences that shape the human transcriptome. FEBS Letters 2004, 567(1):136-141.
Jordan I, Rogozin I, Glazko G, Koonin E: Origin of a substantial fraction of human regulatory sequences from transposable elements. Trends Genet. 2003, 19:68-72.
Jurka J, Kapitonov V, Kohany O, Jurka M: Repetitive Sequences in Complex Genomes: Structure and Evolution. Annu. Rev. Genomics Hum. Genet. 2007, 8:241-59.
Jurka J, Kapitonov V, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 2005, 110:462-467.
Jurka J, Klonowski P, Dagman V, Pelton P: CENSOR - a program for identification and elimination of repetitive elements from DNA sequences. Comput Chem 1996, 20:119-21.
Kapitonov V, Jurka J: Rolling-circle transposons in eukaryotes. Proc Natl Acad Sci USA 2001, 98:8714-19.
Kapitonov V, Jurka J: Self-synthesizing DNA transposons in eukaryotes. Proc Natl Acad Sci USA 2006, 103:4540-45.
Karamerov A, Vassetzky S: SINEs. Wiley Interdiscip Rev RNA 2011, 2(6):772-86.
Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002, 30(14):3059-66.
Kazazian H, Wong C, Youssoufian H, Scott A, Phillips D, Antonarakis S: Hemophilia A resulting from de novo insertion of L1 sequences represents a novel mechanism for mutation in man. Nature 1988, 332:164-166.
Kent W, Sugnet C, Furey T, Roskin K, Pringle T, Zahler A, Haussler D: The human genome browser at UCSC.
Genome Res. 2002, 12(6):996-1006.
Kohany O, Gentles AJ, Hankus L, Jurka J: Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor. BMC Bioinformatics 2006, 7:474.
Kolpakov R, Kucherov G: Finding maximal repetitions in a word in linear time. Symp Foundations of Computer Science 1999, 40:596-604.
Kondo-Iida E, Kobayashi K, Watanabe M, Sasaki J, Kumagai T, Koide H, Saito K, Osawa M, Nakamura Y, Toda T: Novel mutations and genotype phenotype relationships in 107 families with Fukuyama-type congenital muscular dystrophy (FCMD). Hum. Mol. Genet. 1999, 8:2303-2309.
Kolpakov R, Bana G, Kucherov G: MREPS: effective and flexible detection of tandem repeats in DNA. Nucleic Acids Res 2003, 31(13):3672-3678.
Kuhn M, Haussler D, Kent J: The UCSC genome browser and associated tools. Brief Bioinform 2012, 14(2):144-161.
Lerat E: Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity 2010, 104:520-533.
Li R, Ye J, Li S, Wang J, Han Y, Ye C, Wang J, Yang H, Yu J, Wong G, Wang J: ReAS: Recovery of ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun. PLoS Comput Biol 2005, 1:e43.
Lorenzi H, Robledo G, Levin M: The VIPER elements of trypanosomes constitute a novel group of tyrosine recombinase-encoding retrotransposons. Mol. Biochem. Parasitol. 2006, 145:184-94.
Ma J: Reconstructing the history of large-scale genomic changes: biological questions and computational challenges. J Comput Biol 2011, 18(7):879-93.
Marino-Ramirez L, Jordan I: Transposable element derived DNaseI-hypersensitive sites in the human genome. Biol. Direct 2006, 1:20.
McCarthy E, McDonald J: LTR_STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics 2003, 19:362-367.
McLean C, Bristor D, Hiller M, Clarke S, Schaar B, Lowe C, Wenger A, Bejerano G: GREAT improves functional interpretation of cis regulatory regions. Nat. Biotechnol 2010, 28(5):495-501.
Miller W, Rosenbloom K, Hardison RC, Hou M, Taylor J, Raney B, Burhans R, King DC, Baertsch R, Blankenberg D, Kosakovsky Pond SL, Nekrutenko A, Giardine B, Harris RS, Tyekucheva S, Diekhans M, Pringle TH, Murphy WJ, Lesk A, Weinstock GM, Lindblad-Toh K, Gibbs RA, Lander ES, Siepel A, Haussler D, Kent WJ: 28-way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome Res 2007, 17(12):1797-808.
Mills R, Bennett E, Iskow R, Devine S: Which transposable elements are active in the human genome? Trends Genet. 2007, 23:183-191.
Muro M, Mah N, Andrade-Navarro A: Functional evidence of post-transcriptional regulation by pseudogenes. Biochimie 2011, 93:1916-1921.
Murphy W, Eizirik E, Johnson W, Zhang Y, Ryder O, O'Brien S: Molecular phylogenetics and the origins of placental mammals. Nature 2001, 409:614-618.
Negrini S, Gorgoulis G, Halazonetis TD: Genomic instability - an evolving hallmark of cancer. Nat Rev Mol Cell Biol 2010, 11:220-228.
Nekrutenko A, Li H: Transposable elements are found in a large number of human protein-coding genes. Trends in Genetics 2001, 17:619-621.
Ostertag E, Kazazian J: Biology of mammalian L1 retrotransposons. Annu. Rev Genet. 2001, 35:501-538.
Ostertag M, Goodier L, Zhang Y, Kazazian H: SVA Elements Are Nonautonomous Retrotransposons that Cause Disease in Humans. Am J Hum Genet. 2003, 73(6):1444-1451.
Paten B, Herrero J, Beal K, Birney E: Sequence progressive alignment, a framework for practical large-scale probabilistic consistency alignment. Bioinformatics 2009, 25(3):295-301.
Paten B, Herrero J, Fitzgerald S, Beal K, Flicek P, Holmes I, Birney E: Genome-wide nucleotide-level mammalian ancestor reconstruction. Genome Res 2008, 18(11):1829-43.
Pearson W: Flexible sequence similarity searching with the FASTA3 program package.
Methods Mol Biol 2000, 132:185-219.
Pink C, Carter R: Pseudogenes as regulators of biological function. Essays Biochem 2013, 54:103-112.
Pink RC, Wicks K, Caley DP, Punch EK, Jacobs L, Carter DR: Pseudogenes: pseudofunctional or key regulators in health and disease? RNA 2011, 17:792-798.
Poulter R, Goodwin T: DIRS-1 and the other tyrosine recombinase retrotransposons. Cytogenet. Genome Res. 2005, 110:575-588.
Price AL, Jones NC, Pevzner PA: De novo identification of repeat families in large genomes. Bioinformatics 2005, 21(suppl. 1):i351-8.
Quesneville H, Bergman CM, Andrieu O, Autard D, Nouaud D, Ashburner M, Anxolabehere D: Combined evidence annotation of transposable elements in genome sequences. PLoS Comput Biol 2005, 1:e22.
Quesneville H, Nouaud D, Anxolabehere D: Detection of new transposable element families in Drosophila melanogaster and Anopheles gambiae genomes. J Mol Evol 2003, 57(suppl. 1):S50-9.
Rho M, Choi JH, Kim S, Lynch M, Tang H: De novo identification of LTR retrotransposons in eukaryotic genomes. BMC Genomics 2007, 8:90.
Cordaux R, Hedges DJ, Herke SW, Batzer MA: Estimating the retrotransposition rate of human Alu elements. Gene 2006, 371:134-137.
Sharp A, Cheng Z, Eichler E: Structural variation of the human genome. Annu. Rev. Genom. Hum. Genet. 2006, 7:407-442.
Siepel A, Bejerano G, Pedersen J, Hinrichs A, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richard S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005, 15(8):1034-1050.
Skow C, McCabe T, Mills E, Torene S, Pittard S, Neuwald F, Van Meir G, Vertino M, Devine E: Natural mutagenesis of human genomes by endogenous retrotransposons. Cell 2010, 141:1253-1261.
Slotkin K, Martienssen R: Transposable Element and epigenetic regulation of the genes. Nature Reviews Genetics 2007, 8:272-285.
Smit A, Hubley R, Green P: RepeatMasker.
Institute for Systems Biology 1996-2010, Open-3.0. http://www.repeatmasker.org.
Smit AFA, Hubley R: RepeatModeler Open-1.0. 2008-2010. http://www.repeatmasker.org.
Solyom S, Kazazian H: Mobile elements in the human genome: implications for disease. Genome Med. 2012, 4:12.
Svensson O, Arvestad L, Lagergren J: Genome-wide Survey for Biologically Functional Pseudogenes. PLoS Comput. Biol. 2006, 2(5):e46.
Symer E, Connelly C, Szak T, Caputo M, Cost J, Parmigiani G, Boeke D: Human L1 retrotransposition is associated with genetic instability in vivo. Cell 2002, 110:327-338.
Thomas A, Paquola C, Muotri R: LINE-1 retrotransposition in the nervous system. Annu Rev Cell Dev Biol. 2012, 28:555-73.
Vidaud D, Vidaud M, Bahnak R, Siguret V, Gispert Sanchez S, Laurian Y, Meyer D, Goossens M, Lavergne M: Haemophilia B due to a de novo insertion of a human specific Alu subfamily member within the coding region of the factor IX gene. Eur J Hum Genet. 1993, 1(1):30-6.
Volff J: Turning junk into gold: domestication of transposable elements and the creation of new genes in eukaryotes. BioEssays 2006, 28:913-922.
Wallace R, Andersen L, Saulino A, Gregory P, Glover T, Collins F: A de novo Alu insertion results in neurofibromatosis type 1. Nature 1991, 353:864-866.
Wang T, Zeng J, Lower C, Sellers R, Salama S, Yang M, Burgess S, Brachmann R, Haussler D: Species-specific endogenous retroviruses shape the transcriptional network of the human tumor suppressor protein p53. Proc Natl Acad Sci USA 2007, 104:18613-618.
Westesson O, Lunter G, Paten B, Holmes I: Accurate reconstruction of insertion-deletion histories by statistical phylogenetics. PLoS One 2012, 7(4):e34572.
Wheeler T, Clements J, Eddy S, Hubley R, Jones T, Jurka J, Smit A, Finn R: Dfam: a database of repetitive DNA based on profile Hidden Markov Models. Nucleic Acids Research 2013, 41:D70-82.
Wheeler T, Eddy S: nhmmer: DNA homology search with profile HMMs. Bioinformatics 2013, 29:2487-2489.
Wimmer K, Callens T, Wernstedt A, Messiaen L: The NF1 gene contains hotspots for L1 endonuclease-dependent de novo insertion. PLoS Genet 2011, 7:e1002371.