Ancient Transposable Elements
Discovery and Annotation
Ayrin Ahia-Tabibi
Department of Computer Science
Center for Bioinformatics
McGill University, Montreal
November 2014
A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of
Master of Computer Science
© Ayrin Ahia-Tabibi, November 2014
Abstract
Transposable elements (TEs), the largest class of repetitive DNA fragments, are the single most
abundant component of the genetic material of most eukaryotes. Their sheer number, mechanisms
of transposition and repetitive nature are responsible for a number of challenges in genomics,
yet these same properties make them particularly interesting entities to study. Recent
advances in sequencing technologies and the availability of genomic sequences have made
genome-wide analysis of TEs possible. The impact of TEs on the structure, evolution and size of
the genome, as well as on genome sequencing and annotation, has created growing interest in and
demand for new bioinformatics approaches for their identification. These approaches all aim to
computationally discover, detect and analyze both known and novel families of TEs.
After their insertion in the genome, most TE copies degrade relatively quickly, making the
recognition of old insertion events challenging. In this thesis, we develop a new pipeline to
improve the annotation of ancient transposable elements that have shaped the dynamic
component of the human genome. We make use of an inferred ancestral mammalian genome to
detect these ancient TE copies using RepeatMasker. Using LiftOver, these TEs are lifted to the
human genome and then fed to our TEMapper program to be aligned to their corresponding
consensus sequences and corrected for percentage divergence. Applying the ancient TE
annotation pipeline, we revised the annotation of TEs and gained 115 Mb of coverage,
corresponding to a ~7.28% improvement in the annotation of the human genome. This number
corresponds to a significant 3.5% increase in the TE composition of the human genome. In
addition, we discover novel TE families and investigate their association with genes and
regulatory elements.
Résumé
Transposable elements (TEs), the largest class of repetitive DNA fragments, are the most
abundant elements of the genome of most eukaryotes. Their sheer number, their mechanisms of
transposition and their repetitive nature are responsible for some important challenges in
genomics, and this is partly what makes them particularly interesting to study. The recent
advent of sequencing technologies and the availability of genomic sequences have made the
analysis of TEs in whole genomes possible. The impact of TEs on the structure, evolution and
size of the genome, as well as on genome sequencing and annotation, has created interest in and
demand for new bioinformatics approaches for their identification.
After their insertion in the genome, most TE copies degrade fairly quickly, which makes the
recognition of old insertion events difficult. In this thesis, we develop new algorithms to
improve the annotation of ancient transposable elements that have shaped the dynamic component
of the human genome. We make use of the availability of inferred ancestral mammalian genomes to
detect these ancient TE copies using the RepeatMasker program. The TEs identified in the
ancestral sequences are then transferred to the human genome using the LiftOver tool and fed to
our TEMapper program to be aligned to their consensus sequences. With this new TE annotation
pipeline, we revised the TE annotation of the human genome and increased by 115 Mb the portion
of the genome annotated as being of TE origin. In addition, we discover new TE families and
show that some of them are associated with genes with conserved functions or expression
profiles and with regulatory elements.
Table of Contents
Abstract
Résumé
Table of Contents
Table of Figures
Table of Tables
Acknowledgments
Chapter 1: Introduction
1 Biology of Transposable Elements
1.1 Structure and Systematics of Transposable Elements
1.1.1 Retrotransposons
1.1.2 DNA Transposons
1.1.3 Contribution of Transposable Elements in Shaping the Human Genome and Association to Diseases
2 Transposable Element Discovery Methods
2.1 De Novo Methods
2.2 Homology-Based Methods
2.3 Structure-Based Methods
2.4 Comparative Genomic Methods
2.5 Integrated Methods
3 Ancestral Mammalian Genome Inference
Chapter 2: Annotation of Ancient Transposable Elements in the Human Genome
1 Introduction
2 Methods
2.1 Ancestral Genome Reconstruction
2.2 Identification of Transposable Elements in Boreoeutherian Ancestor
2.3 Divergence Calculation
3 Results and Discussion
3.1 Human and Boreoeutherian Ancestor Divergence Profiles
3.2 Mapping and Divergence Correction
3.3 De-novo Transposable Element Discovery
3.4 Conclusion
Chapter 3: Conclusion and Future Directions
References
Table of Figures
Figure 1.1: Human genome composition and percentage shares
Figure 1.2: Schematic representation of major classes of Transposable Elements
Figure 1.3: Components of the human genome
Figure 1.4: Workflow of the 4-step de novo TE detection pipeline
Figure 1.5: Workflow of the 4-step homology-based TE annotation pipeline
Figure 2.1: Transposable Elements degrading over time
Figure 2.2: Ancient Transposable Element Annotation Pipeline
Figure 2.3: Mapping TEs from the boreoeutherian ancestor genome to human genome
Figure 2.4: Vertebrate Phylogenetic Tree
Figure 2.5: RepeatMasker .cat output file example
Figure 2.6: LiftOver .mapped output file example
Figure 2.7: Ancestor program .maf output file example
Figure 2.8: Hypothetical example of nested TEs inserted into the genome
Figure 2.9: Divergence profile of annotated major TEs in the human genome
Figure 2.10: Divergence profile of annotated major TEs in the boreoeutherian ancestor genome
Figure 2.11: LINE/L2 elements divergence profile comparison
Figure 2.12: LINE/CR1 elements divergence profile comparison
Figure 2.13: SINE/Alu elements divergence profile comparison
Figure 2.14: Coverage gained by revised annotation of LINE/CR1 elements
Figure 2.15: Coverage gained by revised annotation of LINE/L2 elements
Figure 2.16: Coverage gained by revised annotation of SINE/MIR elements
Table of Tables
Table 1.1: Coding regions modified by TE insertions in human
Table 2.1: Revised TE annotation by ancient transposable element annotation pipeline
Table 2.2: Summary of de novo TE family discovery
Acknowledgments
This dissertation would not have been possible without the help of so many people in so many
ways. My deepest gratitude goes to my supervisor, Prof. Mathieu Blanchette, who expertly guided
me through my graduate education. His enthusiasm kept me constantly engaged with my research
and his personal generosity inspired me to become a better person. I am forever thankful for his
understanding, wisdom, patience and encouragement, and for pushing me farther than I thought I
could go.
My appreciation extends to all my talented current and former laboratory colleagues, particularly
Rola Dali and David Becerra, for their assistance and suggestions throughout my project. They
have all truly helped make my time in the lab enjoyable. I also thank McGill University for the
opportunity and the Natural Sciences and Engineering Research Council of Canada (NSERC) for
the funding throughout these two years.
I am grateful to all my friends for helping me survive all the stress and for always listening
and giving me words of advice. Above all, I am indebted to my family, whose value to me only
grows with age. In particular, I thank my wonderful parents, Atoosa and Kian, my beautiful
sister, Atra, and my beloved grandmother, who have always had faith in me, for their
unconditional love and endless support. Last but not least, I acknowledge my husband, Babak,
who has been by my side since the beginning, given me strength and inspiration, and blessed my
life in the hours when the lab lights were off.
Chapter 1: Introduction
A Transposable Element (TE) is a DNA sequence, 200 to 5000 base pairs (bp) long, that can
change its position within the genome, create mutations and alter the cell's genome size. TEs are
abundant yet poorly understood components of almost all eukaryotic genomes. They are
important biological entities to study because of their role in genome structure, size and
rearrangement, and their contribution to the evolution of genes and regulatory regions.
This research is motivated by the fundamental challenges in genome sequencing, genome
assembly, annotation and alignment, which are rooted in the mobile and repetitive nature of this
dynamic component. In fact, the evolutionary implications of TEs and the presence of coding
regions in some of them can complicate the process of gene annotation and genome assembly.
Accurate TE identification and classification is therefore essential for many applications in
genomics. Yet existing algorithms remain incapable of annotating old TE insertions, which
impedes a number of downstream analyses. Several challenges associated with TE identification
are caused by the nature of TEs [Lerat 2009]:
• They do not follow a universal structure.
• They insert themselves in different regions of the genome or within one another, leading to
nested copies.
• They mutate and diverge from the original copy over time.
• They mostly lose their replication abilities once mutated.
Computationally reconstructed mammalian ancestral sequences, which will be discussed later in
the chapter, may contain remnants of very old copies from known and unknown families of
TEs that have not been found in the human genome to date. By applying repeat discovery
techniques, it should be feasible to identify these ancient TE copies and thereby shed light on
the evolution of the human genome.
In this chapter, we review the biology of TEs, introduce different families of TEs, and discuss the
main types of TE discovery methods, including de novo, homology-based, structure-based,
comparative and integrated approaches. In addition, we explain methods for ancestral mammalian
genome reconstruction, upon which the proposed TE annotation pipeline relies.
1 Biology of Transposable Elements
The term “repetitive sequence” refers to homologous DNA fragments that are present in multiple
copies in the genome. First discovered and analyzed by McClintock in maize in 1948
[McClintock 1948], TEs are a widespread class of repetitive genomic regions (200-5000 bp long)
that have the ability to change position, create mutations, and copy-paste themselves into the
genome of the host, thereby altering the genome size. The repetitive nature of TEs stems from
their ability to transpose via an RNA or DNA intermediate, which increases their copy number
until they eventually constitute a large fraction of the genome sequence.
In humans, certain TE families are present in up to a million copies and altogether they account
for approximately 50% of human DNA [Gregory 2005]. Long believed to be ‘selfish’
intragenomic parasitic regions contributing no functional elements to the host, they are now
recognized as a major source of new functional genes [Volff 2006] and regulatory elements
[Feschotte 2008]. However, TE activities are also associated with a number of genetic diseases,
in particular cancer [Solyom & Kazazian 2012], all of which will be discussed later in this
chapter. The human genome composition and percentage shares of various functional and
non-functional sequences are summarized in Figure 1.1 and are elaborated in the next section.
Figure 1.1: Human genome composition and percentage shares
[Jasinska et al. 2004]
1.1 Structure and Systematics of Transposable Elements
Although in this thesis our focus is on transposable elements, they are only one of several
types of mobile genetic elements. Repetitive DNA was originally classified into “highly”,
“middle” and “low copy” repetitive sequences, roughly corresponding to interspersed repeats,
tandem repeats, and segmental duplications [Britten & Kohne 1968]. Tandem repeats are arrays of
copies of a DNA fragment immediately adjacent to each other in head-to-tail orientation. In
contrast, interspersed repeats are DNA fragments up to 20-30 kilobases (Kb) in length, inserted
randomly into the host DNA. Interspersed repeats are mostly inactive and incomplete copies of
inserted TEs. Low copy repeats (LCRs), also known as segmental duplications (SDs), are highly
homologous sequence elements within the eukaryotic genome. They are typically 10-300 Kb in
length and share more than 95% sequence identity [Sharp et al. 2006]. An SD is caused by an error
in chromosomal splicing during genetic recombination.
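To make the head-to-tail definition of a tandem repeat concrete, the following is a minimal sketch, not part of the pipeline developed in this thesis. The function name `longest_tandem_run` and the toy sequence are hypothetical, and the repeat unit is assumed to be known in advance and matched exactly, which real tandem-repeat finders do not require.

```python
def longest_tandem_run(seq: str, unit: str) -> tuple[int, int]:
    """Return (start, copies) of the longest head-to-tail run of `unit` in `seq`.

    Returns (-1, 0) when `unit` does not occur. Matching is exact, so this is
    only an illustration of the definition, not a practical repeat finder.
    """
    if not unit:
        raise ValueError("unit must be a non-empty string")
    k = len(unit)
    best_start, best_copies = -1, 0
    i = 0
    while i <= len(seq) - k:
        if seq.startswith(unit, i):
            # Extend the run while copies stay immediately adjacent.
            j, copies = i, 0
            while seq.startswith(unit, j):
                copies += 1
                j += k
            if copies > best_copies:
                best_start, best_copies = i, copies
            i = j
        else:
            i += 1
    return best_start, best_copies


if __name__ == "__main__":
    # Three adjacent copies of CAT form a tandem array; the last CAT is isolated.
    print(longest_tandem_run("GGCATCATCATTTCAT", "CAT"))  # (2, 3)
```

Copies of the same unit that are separated by unrelated sequence, such as the final CAT above, are the toy analogue of interspersed rather than tandem repeats.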
According to their mechanism of transposition (the process by which TEs move about a genome),
eukaryotic TEs can be categorized into two major types, retrotransposons and DNA transposons.
A schematic of the two types is shown in Figure 1.2.
Both retrotransposons and DNA transposons can be classified as either "autonomous" or
"non-autonomous". An autonomous TE encodes the complete set of enzymes characteristic of its
family and is self-sufficient in terms of transposition, while a non-autonomous TE requires the
presence of other TEs to move: it transposes by borrowing the protein machinery encoded by its
autonomous relatives. This is often because non-autonomous TEs lack
transposase (for DNA transposons) or reverse transcriptase (for retrotransposons). In humans, the
majority of TEs are non-autonomous [Jurka et al. 2007].
Figure 1.2: Schematic representation of major classes of Transposable Elements
[Slotkin & Martienssen 2007]
1.1.1 Retrotransposons
Retrotransposons are described as copy-and-paste TEs and are the most common type of TE.
They are first transcribed from DNA into an RNA intermediate (mRNA), which is then
reverse transcribed into DNA (cDNA) and inserted at a new position
in the genome [Craig 1995]. The reverse transcriptase (RT) and endonuclease/integrase (EN/INT)
enzymes, which are encoded by autonomous elements, catalyze reverse
transcription and integration [Jurka et al. 2007].
1.1.1.1 Non-Long Terminal Repeat and Long Terminal Repeat Retrotransposons
The presence or absence of long terminal repeats (LTRs) further classifies retrotransposons into
non-LTR and LTR elements. Non-LTR TEs are best known for their enormous success in
reproducing in the human genome and have persisted in eukaryotic genomes for hundreds of
millions of years. These ancient genetic elements, as their name implies, lack LTRs. A typical
non-LTR retrotransposon contains one or two open reading frames (ORFs) and includes an
internal promoter in the 5' terminal region that governs transcription of the retrotransposon
[Jurka et al. 2007]. The majority of human TEs result from current and past non-LTR
activity.
LTR retrotransposons, which account for ~8% of the human genome, are retroviral-like in
structure and mechanism. Although the mechanism of retrotransposition is not yet completely
understood, the structure of LTR retrotransposons is as follows. They have direct LTRs that
range from ~100 bp to over 5 Kb in size. The two LTRs, the 5' LTR and the 3' LTR, are very
similar: they are identical when the element inserts into the host genome and, once inserted,
begin to evolve independently [Jurka et al. 2007]. An LTR retrotransposon carries two ORFs, for
the gag and pol proteins, and sometimes a third one downstream, for the env protein. Gag
(group-specific antigen) is a polyprotein that forms the core structural proteins of
retroviruses. Pol (DNA polymerase) is the reverse transcriptase, the essential enzyme that
carries out the reverse transcription process. Env (envelope protein) is a viral protein that
forms the viral envelope.
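Because the two LTRs start out identical and then accumulate mutations independently, the fraction of positions at which they differ gives a rough indication of how long ago the element inserted. The sketch below illustrates that idea only; it is not the divergence calculation used later in this thesis. The function name `ltr_divergence` is hypothetical, and the two sequences are assumed to be already aligned, with `-` marking gaps.

```python
def ltr_divergence(ltr5: str, ltr3: str) -> float:
    """Fraction of mismatching columns between two pre-aligned LTR sequences.

    Columns containing a gap ('-') in either sequence are skipped, so only
    substitutions contribute to the divergence.
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable (ungapped) columns")
    return mismatches / compared


if __name__ == "__main__":
    # A freshly inserted element: both LTRs are identical.
    print(ltr_divergence("ACGTACGTAC", "ACGTACGTAC"))  # 0.0
    # One substitution in ten aligned positions.
    print(ltr_divergence("ACGTACGTAC", "ACGAACGTAC"))  # 0.1
```

Converting such a divergence into an age would additionally require a substitution rate, which is outside the scope of this sketch.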
For the sake of completeness, it is worth mentioning that although the two classes of non-LTR
and LTR elements are well established and studied, there are also Penelope and Dictyostelium
intermediate repeat sequence (DIRS) retrotransposons, which were discovered more recently
[Arkhipova et al. 2003; Badyaev 2005; Lorenzi et al. 2006; Poulter & Goodwin 2005].
Penelope elements encode a 2.5 Kb long ORF and are characterized by an unusual LTR not typical
of standard non-LTR retrotransposons. DIRS retrotransposons consist of ~4.1 Kb of unique
internal sequence flanked by inverted terminal repeats of unequal lengths. Members of all
four classes of retrotransposons are present in the genomes of all eukaryotic kingdoms
(protists, plants, fungi and animals), with the exception of Penelope, which has not yet been
identified in plants. However, since existing tools are not advanced enough to recognize
Penelope and DIRS as distinct classes in the human genome, we do not focus on them here.
Retrotransposons are commonly grouped into three main families:
• Endogenous Retrovirus-Like Elements (ERVs) [Benit et al. 1999]: TEs with LTRs
that encode a reverse transcriptase protein, similar to retroviruses. In humans,
these elements have been named HERVs (for human endogenous retroviruses),
and several families of such elements have been characterized. Analysis of a large
series of mammalian genomic DNAs shows that ERVs are present among all
placental mammals, suggesting that these elements were already present at least
70 million years ago [Cordaox et al. 2009]. However, their activity is presently
very limited in humans, if it occurs at all [Mills et al. 2007].
• Long Interspersed Elements (LINEs) [Ostertag & Kazazian 2001]: Mainly L1s,
L2s, and CR1s, which encode reverse transcriptase (autonomous) but lack LTRs,
and are transcribed by RNA polymerase II.
• Short Interspersed Elements (SINEs) [Kramerov & Vassetzky 2011]: Mainly Alus
(for Arthrobacter luteus), SVAs (for SINE-VNTR-Alu) and MIRs (for mammalian-wide
interspersed repeat). They are non-LTR retrotransposons that do not encode
reverse transcriptase (non-autonomous) and are transcribed by RNA polymerase
III.
1.1.1.2 Long Interspersed Element (LINE)
LINE retrotransposons, which have been present in the human genome for at least 70 million
years, are 5-9 Kb long and comprise an astounding ~21% of human DNA. An
average human genome contains ~80-100 active LINE elements that can retrotranspose to new
genomic locations; the rest are inactive copies. They are currently the only known
retrotransposons in the human genome that code for the protein machinery required for their
own transposition. In addition, LINE1 (L1), which is found in all mammals, is the only element
from this family that is still active in the human genome today [Cordaox et al. 2009]. However,
remnants of L2 and CR1 elements are also found in the human genome.
1.1.1.3 Short Interspersed Element (SINE)
Alu elements are short stretches of DNA, ~300 bp long on average, and are therefore classified as
SINEs. They do not encode protein products and depend on LINE retrotransposons for their
replication [Cordaox et al. 2009]. Alu elements of different kinds occur in large numbers in
primate genomes only [Cordaox et al. 2009]. There are over one million Alu elements
interspersed throughout the human genome, and it is estimated that about 11% of the human
genome consists of Alu sequences. In fact, Alus are the most abundant transposable elements in
the human genome. Most human Alu insertions can be found in the corresponding positions in
the genomes of other primates [Nekrutenko & Li 2001]; however, about 7,000 Alu insertions are
unique to humans.
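The copy number and average length quoted above can be checked against the stated genome share with simple arithmetic. The sketch below is a back-of-the-envelope calculation only: the copy number and mean length come from the text, while the haploid genome size of about 3.1 billion bp is an assumption that the text does not state.

```python
ALU_COPIES = 1_000_000          # "over one million" copies (from the text)
ALU_MEAN_LENGTH_BP = 300        # "~300 bp long on average" (from the text)
GENOME_SIZE_BP = 3_100_000_000  # ASSUMPTION: approximate haploid human genome size


def genome_fraction(copies: int, mean_length_bp: int, genome_size_bp: int) -> float:
    """Fraction of the genome covered by `copies` elements of the given mean length."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return copies * mean_length_bp / genome_size_bp


alu_fraction = genome_fraction(ALU_COPIES, ALU_MEAN_LENGTH_BP, GENOME_SIZE_BP)

if __name__ == "__main__":
    print(f"Alu share of the genome: {alu_fraction:.1%}")  # Alu share of the genome: 9.7%
```

Since one million is a lower bound on the copy number, a result just under 10% is in line with the estimate of about 11% quoted above.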
Another member of the SINE class is the SVA element [Ostertag 2003], which, like Alu, is
primate-lineage specific. The SVA family is hominid-specific and, with ~2700 copies, is the
youngest TE family identified in humans. SVA elements are composed of three parts, SINE, VNTR
and Alu, and vary in size as a result of polymorphisms in the VNTR (variable number tandem
repeat) region. Like other retroelements, Alu and SVA insertions can have both negative effects
on the human genome, by causing genetic disorders, and potentially positive effects, by
creating new gene families.
Alus, SVAs and L1s, which together account for one third of the human genome, are the only TEs
currently active in humans. They have been shown to be responsible for genetic
disorders [Kazazian et al. 1988; Deininger & Batzer 1999; Chen et al. 2005; Callinan & Batzer
2006; Belancio et al. 2008; Solyom & Kazazian 2012].
1.1.2 DNA Transposons
The second type of TE is the DNA transposon [Jurka et al. 2007; Cordaox et al 2009],
mainly described as a cut-and-paste TE [Craig 1995]. Having been active until ~37 million years
ago, DNA transposons are each ~5 Kb long and together make up less than 3% of the human
genome. Contrary to retrotransposons, the replication of DNA transposons does not involve an
RNA intermediate. Transposition is catalyzed by various types of transposase enzymes that cut
the TE and paste it into a target site. DNA transposons are typically bounded by terminal
inverted repeats (TIRs), which serve as the recognition sequence for the transposase. Some
transposases bind non-specifically to any target site in DNA, whereas others bind to specific
DNA sequence targets. The transposase cuts at the target site, producing single-stranded 5' or
3' DNA overhangs, called sticky ends. The DNA transposon that has been cut out by the
transposase is then pasted into the new target site. The insertion sites may be identified by
short direct repeats followed by a series of TIRs important for TE excision by the transposase.
After the insertion, a DNA polymerase fills in the gaps and a DNA ligase closes the
sugar-phosphate backbone.
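The defining feature of a TIR is that one end of the element is the reverse complement of the other. The following sketch illustrates that property on a toy sequence; it is not a structure-based TE finder. The function names, the fixed TIR length and the mismatch tolerance are all hypothetical choices made for this example.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_terminal_inverted_repeats(element: str, tir_len: int, max_mismatches: int = 0) -> bool:
    """Check whether the first `tir_len` bases match the reverse complement of the last `tir_len`.

    Up to `max_mismatches` differing positions are tolerated, since real TIRs
    are often imperfect.
    """
    if tir_len <= 0 or 2 * tir_len > len(element):
        raise ValueError("tir_len must be positive and at most half the element length")
    left = element[:tir_len].upper()
    right_rc = reverse_complement(element[-tir_len:]).upper()
    mismatches = sum(a != b for a, b in zip(left, right_rc))
    return mismatches <= max_mismatches


if __name__ == "__main__":
    # CAGTG ... CACTG: the right end is the reverse complement of the left end.
    print(has_terminal_inverted_repeats("CAGTGAAAACCCCCACTG", tir_len=5))  # True
    print(has_terminal_inverted_repeats("CAGTGAAAACCCCGGGGG", tir_len=5))  # False
```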
One question that may arise is: if DNA transposons transpose by a cut-and-paste mechanism,
how do they accumulate over time? They move through a non-replicative mechanism but, relying on
the host machinery, DNA transposons increase their copy numbers through indirect mechanisms
[Feschotte & Pritham 2007]. The first mechanism operates during DNA replication, when
a transposon moves from a newly replicated chromatid to an unreplicated site. The transposon
is therefore replicated twice, which means a net gain of one copy. The second mechanism
results from the repair of double-stranded DNA breaks. DNA breaks can be mutagenic to a cell
and are repaired in several ways, including homologous recombination, a process by which the
cell copies the missing DNA sequence from the homologous chromosome. TEs, among other
sequences, found on the homologous chromosome will be copied to the damaged chromosome,
potentially introducing TE copies that were not originally present.
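The first mechanism can be followed step by step in a toy model. The sketch below is a deliberately simplified illustration under stated assumptions: positions are integers along one chromosome, everything below a hypothetical replication fork position is already duplicated, and a single element jumps once. The function names are invented for this example.

```python
def cut_and_paste(sites: set[int], source: int, target: int) -> set[int]:
    """Non-replicative move: the element leaves `source` and lands at `target`."""
    if source not in sites:
        raise ValueError("no element at source position")
    return (sites - {source}) | {target}


def transpose_during_replication(source: int, target: int, fork: int) -> tuple[set[int], set[int]]:
    """Return element positions on the two sister chromatids after replication finishes.

    Toy model: positions below `fork` are already replicated, positions at or
    above it are not. The element is cut from one replicated copy of `source`
    and pasted into the unreplicated `target`; the fork then duplicates `target`.
    """
    if not source < fork <= target:
        raise ValueError("source must be replicated and target unreplicated")
    chromatid_a = {source}  # the source region already exists on both chromatids
    chromatid_b = {source}
    # Excision happens on one chromatid only; the other keeps its copy.
    chromatid_a = cut_and_paste(chromatid_a, source, target)
    # The target was unreplicated, so once the fork passes both chromatids carry it.
    chromatid_b = chromatid_b | {target}
    return chromatid_a, chromatid_b


if __name__ == "__main__":
    a, b = transpose_during_replication(source=10, target=50, fork=30)
    print(sorted(a), sorted(b))  # [50] [10, 50]
```

Without the jump, the two chromatids would carry one copy each (two in total); with it they carry three, which is the net gain of one copy described above.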
Figure 1.3: Components of the human genome
[Gregory 2005]
Although our focus here is on the TEs identified in the human genome, it is worth mentioning
that, other than the cut-and-paste transposons, there are two other classes of DNA transposons,
Helitrons and Polintons. Helitrons transpose via replicative rolling-circle transposition
[Kapitonov & Jurka 2001] and are present in the genomes of plants, fungi, insects, nematodes and
vertebrates. Like cut-and-paste transposons, Helitrons cannot synthesize their own DNA and
instead duplicate using the host replication machinery. Polintons, on the other hand, are
self-synthesizing transposons [Kapitonov & Jurka 2006] that are 15-20 Kb long and have been
identified in protists, fungi and animals.
A summary of the human genome composition discussed in this section is
shown in Figure 1.3.
1.1.3 Contribution of Transposable Elements in Shaping the Human Genome and
Association to Diseases
In this section, we highlight some of the significant roles of TEs in mutating protein coding
regions, rewiring regulatory networks, cancers and genetic diseases.
1.1.3.1 Transposable Elements’ Role in Protein-Coding Regions
Highly mutagenic active TEs frequently insert into protein-coding genes and are therefore found
in a large number of human protein-coding genes. Studies show that approximately 4% of human
genes contain TEs or TE fragments within their coding regions [Nekrutenko & Li 2001].
Consequently, they cause chromosome breakage, illegitimate recombination, and genome
rearrangement. In addition, TE insertions influence gene splicing patterns through alternative
splicing. Table 1.1 provides some examples of TE insertions in human genes and their effects on
the coding regions.
There are two ways by which TEs could have integrated into coding regions. In the first, a TE
is inserted directly into a protein-coding exon. In the second and most common path,
a TE is inserted into a noncoding intron and subsequently recruited as a new
exon. About 90% of TE insertions are into introns, and this high rate is possible because many
TEs carry potential splice sites [Nekrutenko & Li 2001]. Thus, TE insertion might be an
important cause of the high frequency of alternative splicing in human protein-coding genes. For
example, the fact that ~1.4 million Alu elements are interspersed throughout the human genome,
each carrying several potential splice sites, provides numerous possibilities for the
formation of alternative transcripts [Brosius & Gould 1992].
Orthologous genes can encode functionally different proteins or differ in terms of expression. For
example, a new Alu insertion occurs in about 1 in every 200 human births [Deininger & Batzer
1999], which means the chimpanzee genome does not contain many of the Alu elements found in the
human genome. Since Alus have not been found in non-primates, they might have contributed
significantly to the divergence between primates and other mammals. Overall, there are about 400
human genes containing Alu inserts in their coding regions that are not found outside of the
primate lineage [Nekrutenko & Li 2001].
Gene                                               TE Type     Effect on Coding Region
Human hematopoietic progenitor kinase (HPK1)       Alu         Extension
Plakophilin 2a and b                               Alu         Extension
Adenosine deaminase                                Alu         Extension
Proteasome subunit p27                             L1          Changing stop codon
Hepatocyte nuclear factor-3/fork head protein      L1          Extension
Methyl-CpG binding protein                         L2          Changing stop codon
Down syndrome critical region gene 5               L2          Changing start codon
8-oxo-dGTPase                                      LTR         Changing start codon
Myelin-associated oligodendrocytic basic protein   LTR50 LTR   Changing stop codon
Table 1.1: Coding regions modified by TE insertions in human
[Nekrutenko & Li 2001]
1.1.3.2 Transposable Elements’ Roles in Regulatory Networks
TEs have been a rich source of material for the assembly and tinkering of regulatory systems and
have had a key role in the evolution of human gene regulation. They can influence neighboring
genes by functioning as enhancers or promoters. There are many ways by which TEs can directly
influence the expression of a nearby gene, both at the transcriptional and post-transcriptional
levels. To give some examples, studies [Feschotte 2008] reveal that at least 16% of
eutherian-specific conserved non-coding elements were derived from TEs. A study [Jordan et al.
2003] reports that nearly 25% of experimentally characterized human promoters contain TE
fragments, including cis-regulatory elements. Another study [Ramirez et al 2006] shows that one
quarter of the DNase I hypersensitive sites identified in human T cells overlap with annotated
TEs.
There are three main reasons for TEs being a source of regulatory elements [Feschotte 2008].
First, they tend to cluster around genes involved in development and transcriptional regulation.
Second, they are overrepresented in genomic segments containing transcription factor binding
sites (TFBSs). Finally, there is a growing number of highly conserved TEs that act as
transcriptional enhancers.
TEs wire genetic networks. The scattering of TE families throughout the genome allows the same
motifs to be engaged at many chromosomal locations and therefore brings multiple
genes into the same regulatory networks. For instance, a human genome study [Wang et al. 2007]
suggests that a set of closely related families of LTRs have dispersed more than 1500 binding
sites for the master regulatory factor p53. These sites encompass 30% of all p53 binding sites that
have been mapped.
TEs also show far more overlap than expected with non-coding RNAs such as microRNAs
(miRNAs), which are important players in the regulation of gene expression. This suggests that certain
TE families, such as Miniature Inverted Repeat Transposable Elements (MITEs), possess
characteristics that make them prone to give rise to miRNAs [Feschotte et al. 2002].
1.1.3.3 Transposable Elements’ Role in Cancer and Genetic Diseases
The mobility and sheer number of both retrotransposons and DNA transposons allow them to
shape our genotype and phenotype on both the evolutionary scale and the individual level. The fact
that TEs cause variation, within or between individuals, is now evident from the ever-increasing number of genome-scale studies. These studies have expanded the pool of human
disorders resulting from TE insertion [Chen et al. 2005]. The non-LTR retrotransposons, Alu and
L1 in particular, can undoubtedly cause diseases through insertional mutagenesis, recombination
and structural variation, providing enzymatic activities for other mobile DNA, transcriptional
overactivation and epigenetic effects. Since DNA transposons are considered immobile in the
human genome, no human disease is known to arise as a result of their activities.
It is now likely that some types of cancer, neurological disorders and genetic diseases arise as a
result of retrotransposon mutagenesis [Mills et al. 2007]. There have been 96 known
retrotransposon insertions in disease cases, of which 25 are caused by L1s, 60 are attributable to
Alus and 7 to SVAs [Hancks & Kazazian 2012]. Overall, retrotransposon insertions account for
about 1 in 250 (0.4%) of disease-causing mutations [Wimmer et al. 2011]. Alus predominantly
cause diseases by homologous recombination between two Alu sequences, although insertion into
or near exons and Alu splicing from introns are also possible. Hemophilia B [Vidaud et al.
1993], Huntington disease [Hutchinson et al. 1993], and neurofibromatosis [Wallace et al. 1991]
are among the well-known diseases associated with Alu insertions. The predominant mechanism by
which L1 causes diseases is insertional mutagenesis into or near genes. Highly active L1
elements account for most disease-causing insertions [Brouha et al. 2003].
Studies show that retrotransposon insertions, L1 insertions in particular, are over-represented in protein-coding genes. If, for example, an L1 element inserts into a gene that functions in neurological
development, it might lead to neurological diseases such as Rett syndrome [Tomas et al. 2012].
L1 insertions have also been found in brain and lung tumors [Iskow et al. 2010] and are attractive
candidates for both somatic drivers and hereditary factors in germ cell tumors and in other cancer
types. Genetic instability [Symer 2002], which is a hallmark of cancer [Negrini et al. 2010], is
associated with L1 elements. Thus, over-activation of L1s could potentially contribute
to tumorigenesis. In addition to cancer, L1 has been associated with well-known genetic diseases
such as hemophilia A [Kazazian et al. 1988], beta thalassemia [Divoky et al. 1996], and muscular
dystrophy [Kondo-Iida et al. 1999]. Also, several novel L1 insertions on the X chromosome were
discovered in males with presumptively X-linked disorders [Huang et al. 1993].
2
Transposable Element Discovery Methods
The following section categorizes and summarizes the TE discovery methods, including de novo,
homology-based, structure-based, comparative and integrated approaches, in the field of
computational genomics.
2.1 De Novo Methods
De novo TE discovery approaches, reviewed in Bergman & Quesneville 2007, look for repetition
of mobile DNA at multiple positions within a genome without using any prior knowledge about
the structure of TEs or similarity to known sequences. These methods aim to derive
consensus sequences of TE families from similar sequences. Once identified, the sequences are
typically clustered, filtered, and characterized. However, because de novo methods are not specific to
TEs, they also tend to find other repeats such as tandem repeats or segmental duplications.
In addition, they are not able to detect TE families with low copy number or non-overlapping fragments. What increases computational complexity is the biological complexity of
TEs, including the fragmented nature of the TE instances and the sequence similarity of related TE
families. Discovering several co-linear repeats instead of a single repeat, aggregating
nested TEs into one large family, and over-splitting a family into multiple
distinct subfamilies are among the major problems for de novo TE discovery methods.
Figure 1.4: Workflow of the 4-step de novo TE detection pipeline
[Flutre et al. 2011]
Although de novo techniques typically struggle with identifying degraded fragments, they are the
most effective, albeit computationally expensive, approaches for identifying novel TEs. P-CLOUDS
[Gu et al. 2008], RECON [Bao & Eddy 2002], and RepeatModeler [Smit & Hubler 2008] are
among the best-known de novo tools. Classical computational strategies, like suffix trees or pairwise
similarity searches, were initially used for repeat detection by tools such as RepeatFinder
[Volfovsky et al. 2001]. Nonetheless, software such as P-CLOUDS has been designed to rapidly
find repeats in genome sequences by counting highly frequent words of a given length k, called
k-mers. Other methods such as ReAs [Li et al. 2005] also count frequent k-mers but try to define
consensus sequences. For each frequent k-mer, a multiple alignment of all the k-mers is built and
extended iteratively. These programs are very useful for quickly providing a view of the repeated
fraction in a given set of genomic sequences. However, they do not provide much detail about the
TEs present in these sequences. Their output only identifies highly repeated regions without
indicating precise TE fragment boundaries or TE family assignments.
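As a rough illustration of the word-counting idea described above, the following minimal sketch counts overlapping k-mers and keeps those that exceed a frequency cutoff. It is not the P-CLOUDS or ReAs implementation; the function names, the toy sequence and the cutoff value are assumptions made for this example only.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a DNA sequence (case-insensitive)."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:  # skip windows that contain unknown bases
            counts[kmer] += 1
    return counts


def frequent_kmers(sequence, k, min_count):
    """Return the k-mers observed at least min_count times."""
    counts = count_kmers(sequence, k)
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# Toy sequence (hypothetical): the motif ACGTAC occurs three times.
toy = "ACGTACTTTTACGTACGGGGACGTAC"
repeated = frequent_kmers(toy, k=6, min_count=3)
print(repeated)
```

Such a table of frequent words only marks repeated regions; as noted above, it carries no information about TE boundaries or family membership.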
Repeats can be identified by self-alignment of genomic sequences, starting with an all-by-all
alignment of the assembled sequences. BLAST [Altschul et al. 1990] and BLASTER
[Quesneville et al. 2003] heuristic algorithms are among the tools used for the alignment. Note
that the aim of this step is not to recover all TE copies of a family but to use those that are well
conserved to build a robust consensus. Stringent alignment parameters are crucial for successful
reconstruction of a valid consensus. Tools like RECON cluster the obtained matches
corresponding to repeats, into groups of similar sequences. The aim is for each cluster to
correspond to copies of a single TE family. Once clusters are defined, applying a filter eliminates
the vast majority of segmental duplications. Finally, what remains are only the clusters with at
least three members, from each of which a multiple alignment is built and a consensus sequence is
derived. These four steps of the de novo TE detection pipeline are summarized in Figure 1.4.
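The last step, deriving a consensus from the multiple alignment of a cluster, can be sketched as a simple per-column majority vote. This is a simplified stand-in for what tools such as RECON or REPET actually do; the function name, the gap handling and the toy cluster are assumptions for illustration.

```python
from collections import Counter


def majority_consensus(aligned_rows, min_members=3):
    """Majority-vote consensus of equal-length aligned rows ('-' is a gap).

    Returns None when the cluster has fewer than min_members rows,
    mirroring the three-member cutoff described in the text.
    """
    if len(aligned_rows) < min_members:
        return None
    length = len(aligned_rows[0])
    if any(len(row) != length for row in aligned_rows):
        raise ValueError("all aligned rows must have the same length")
    consensus = []
    for column in zip(*aligned_rows):
        # Ties are broken by first occurrence in the column.
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != "-":  # drop columns where a gap is the most common symbol
            consensus.append(symbol)
    return "".join(consensus)


# Hypothetical cluster of four aligned copies of one family.
cluster = [
    "ACGT-ACGA",
    "ACGTTACGA",
    "ACCT-ACGA",
    "ACGT-ATGA",
]
print(majority_consensus(cluster))
```

A majority vote ignores the evolutionary model and the quality of individual copies, which is one reason why stringent alignment parameters matter for obtaining a valid consensus.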
2.2 Homology-Based Methods
Homology-based approaches, reviewed in Bergman & Quesneville 2007, are the most commonly
used methods to detect TE families based on homology to known TE protein-coding
sequences or to DNA consensus sequences. In effect, these approaches use known TEs as a
means to discover new copies of TEs in genomes. They employ fast seed-based heuristic
alignment algorithms such as BLAST and FASTA [Pearson 2000] with known TEs used as
queries, followed by post-processing including merging and/or extending of individual genomic
hits.
The main advantage over the de novo approach is that homology-based methods are more likely to
find bona fide TEs, as they are based on knowledge of known TE sequences. Although
only a few homology-based tools exist, they are normally the most accurate in identifying
known TEs as well as detecting degraded TEs. However, they are unable to identify TEs
unrelated to known elements.
From this category of approaches, RepeatMasker [Smit, Hubler & Green 1996], a popular tool to
identify, classify, and mask repetitive elements, including low-complexity sequences and
interspersed repeats, is widely used in computational genomics. RepeatMasker searches for
repetitive sequences by aligning the input genome sequence against a library of known repeats,
such as Repbase [Jurka et al. 2005]. Sequence comparisons in RepeatMasker are performed by
one of several popular search engines including nhmmer [Wheeler & Eddy, 2013], cross_match
[Smit, Hubler & Green 1996], ABBlast/WUBlast, RMBlast [Altschul et al. 1990] and Decypher.
It makes use of curated libraries of repeats and currently supports Dfam [Travis et al. 2013]
(profile HMM library derived from Repbase sequences) and Repbase, a service of the Genetic
Information Research Institute which is the most commonly used database of repetitive DNA
elements.
In order to detect common TE protein domains, some homology-based approaches use hidden
Markov models (HMMs) from the PFAM database [Bateman et al. 2002] to scan predicted ORFs,
as an alternative to fast heuristic alignment algorithms [Berezikov et al. 2007; Rho et al.
2007]. Although these approaches are effective for genomes that are closely related to those
genomes used to build the database, they have difficulties with distantly related species. This is
due to the fact that HMMs tend to capture more irrelevant data when searching for diverse
sequences. Overall, obtaining a full-length reference sequence with a homology-based TE detection
method requires further analysis of structural features of TE families.
Figure 1.5: Workflow of the 4-step homology-based TE annotation pipeline
[Quesneville et al. 2005]
Mining the library of TE sequences obtained by the de novo TE detection pipeline, tools like
RepeatMasker, BLASTER or CENSOR [Kohany et al. 2006; Jurka et al. 1996] use pairwise
alignment algorithms to detect TE fragments. Note that if these tools are used in conjunction,
the MATCHER program assesses the results and keeps the best match for each location. Short
simple repeats (SSRs), short motifs repeated in tandem, are present not only in TEs but also
independently in the genome. Therefore, TE matches that are restricted to an SSR contained in the TE
consensus must be filtered out in the second step. The last two steps
are discarding false-positive matches and connecting distant fragments by applying the
long join procedure [Flutre et al. 2011], before the final annotation is reported. These four steps of
the homology-based TE annotation pipeline are summarized in Figure 1.5.
2.3 Structure-Based Methods
This class of approaches takes advantage of prior knowledge of common structural features of TEs,
such as LTRs, target site duplications (TSDs), primer-binding sites (PBSs), polypurine tracts (PPTs)
and ORFs for the gag, pol (containing the RT domain) and/or env genes
[Bergman & Quesneville 2007]. Unlike homology-based methods, structure-based
methods are less dependent on similarity to known TE families; instead, they rely on
detecting specific models of TE architecture. In structure-based approaches, specific models must
be developed for each TE family. The strongly structured TE families such as LTRs are easier to
detect using these methods. However, they are generally less useful when searching for degraded
TEs or for TEs without a conserved structural characteristic.
LTR_STRUC [McCarthy & McDonald 2003], for example, is a structure-based tool that works
well to identify complete TEs that comply with a conserved LTR structure at each end of the
element. Using a heuristic seed-and-extend algorithm, this tool aims to find and align local
repeats located within a user-specified distance that are used as an initial set of candidate LTRs.
The boundaries of the LTRs on the original sequence are determined by the pairwise alignment of
putative LTRs. A critical limitation of LTR_STRUC is that only TEs within the same contig can
be detected.
Another unique structure-based tool is TE-HMM [Andrieu et al. 2004]. This method takes
advantage of the fact that the nucleotide composition of TE ORFs and that of the host genes are
often different. The core of this approach is building HMMs with three states representing coding
regions and at least one state for noncoding regions. One coding state is allocated to each
codon position, and the one or more noncoding states allow for the frameshift
mutations that are common in decaying TEs. TE-HMM is able to accurately identify the coding
regions in known TEs and differentiate TEs from genes in a given dataset. This is assuming that
separate HMMs for RNA-based TEs, DNA-based TEs and host genes have already been trained.
Therefore, as with all HMMs, TE-HMM is dependent on good training data sets from the same
species group. However, other structural features of unidentified TEs such as UTRs and LTRs
cannot be discovered using this approach since TE-HMM attempts to only predict coding
sequences.
2.4 Comparative Genomic Methods
Comparative genomic discovery methods, reviewed in Bergman & Quesneville 2007, rely neither
on homology nor on structural features. They are based on the fact that transpositions create large
insertions and variations between pairs of closely related genomes that can be detected in
multiple sequence alignments. Such differences are analyzed and classified by the comparative
genomic approach. This method searches for insertion regions (IRs) where multiple alignments
of orthologous genome sequences are disrupted by a large (>200 bp) insertion in one or more
species [Caspi & Pachter 2006].
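The search for insertion regions can be sketched as a scan for long runs of alignment columns in which one species carries bases while the others carry only gaps. The sketch below is a simplification that considers insertions specific to a single target species and represents the alignment as equal-length strings; it is not the method of [Caspi & Pachter 2006], and the function name and toy data are hypothetical.

```python
def find_insertion_regions(target_row, other_rows, min_length=200):
    """Return half-open (start, end) column intervals where the target has
    bases, every other species has gaps, and the run is longer than
    min_length columns."""
    regions = []
    run_start = None
    for col, base in enumerate(target_row):
        is_insertion = base != "-" and all(row[col] == "-" for row in other_rows)
        if is_insertion and run_start is None:
            run_start = col
        elif not is_insertion and run_start is not None:
            if col - run_start > min_length:
                regions.append((run_start, col))
            run_start = None
    # Close a run that extends to the end of the alignment.
    if run_start is not None and len(target_row) - run_start > min_length:
        regions.append((run_start, len(target_row)))
    return regions


# Toy alignment (hypothetical): a 250-column insertion in the target only.
target = "ACGT" + "T" * 250 + "ACGT"
other = "ACGT" + "-" * 250 + "ACGT"
print(find_insertion_regions(target, [other]))
```

An element inserted before the compared species diverged is present in every row and therefore produces no such run in the alignment.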
The effectiveness of this approach depends on the quality of the whole-genome alignments,
and it is useful for identifying new families of TEs when related genomes are available.
A limitation of this approach is its inability to detect TEs that were already present in the common
ancestor of the compared genomes. It also
performs poorly in TE-rich regions, where nested TE insertions may complicate the
detection of distinct TE families.
2.5 Integrated Methods
Due to the nature of TEs, there may never be a single best detection approach; thus,
employing the existing methods in conjunction can be realistic and effective. The REPET
package [Flutre et al. 2011] integrates bioinformatics programs to detect, annotate and analyze
TEs. Its two main pipelines are TEdenovo, for de novo TE detection, and TEannot, for homology-based
TE annotation.
The TEdenovo pipeline compares the genome with itself using BLASTER and clusters matches
with GROUPER [Quesneville et al. 2003], RECON [Bao & Eddy 2002] and PILER [Edgar &
Myers 2005] clustering programs. It then builds a multiple alignment for every cluster to derive a
consensus sequence. After filtering for redundancy, these consensus sequences are classified to
finally obtain a library of classified, non-redundant consensus sequences.
The TEannot pipeline searches the library of consensus sequences built by TEdenovo, using
BLASTER, RepeatMasker and CENSOR. After removing the false positives by an empirical
statistical filter, MATCHER [Quesneville et al. 2003, 2005] groups the TE fragments into
families and outputs the annotations.
3
Ancestral Mammalian Genome Inference
In this section, we explain why and how the ancestral mammalian genome, which is used as an
input to the method we have developed, is reconstructed.
The possibility of computationally inferring ancestral genomes, given a known phylogenetic
relationship, is among the interesting prospects of having a large number of extant genomes
available. The scientific community has inferred ancestral gene orders and the history of
rearrangements leading to a given set of extant genomes [Bourque & Pevzner 2002; Mu 2011].
However, there are challenges associated with inference at this level, namely the computational complexity,
the limited accounting for all possible evolutionary events, and ortholog identification.
We have focused in this study, as is conventional in this field of research, on DNA sequence
evolution at the level of substitutions, insertions and deletions. The aim is to infer a set of most
likely ancestral sequences based on a known set of extant, collinear, non-rearranged orthologous
sequences. Correctly aligning highly divergent sequences and inferring
maximum likelihood indels (insertions/deletions), which is computationally complex, are among the difficulties of this
inference problem. Nonetheless, significant investments in whole-genome multiple sequence
alignment [Blanchette et al. 2004; Paten et al. 2009] have resulted in a set of heuristic algorithms
developed to infer ancestral DNA sequences [Diallo et al. 2010; Paten 2008; Westesson et al.
2012]. These algorithms have made it possible to infer large sections of syntenic regions with
good accuracy.
The first step toward reconstructing an ancestral mammalian genomic sequence is to build an
accurate whole-genome multiple sequence alignment of the complete genomes of mammals
[Miller et al. 2007]. To this end, the sequences are repeat masked using RepeatMasker [Smith
and Green 1999] and then aligned using the Multiz multiple-alignment program [Blanchette et al.
2004]. It is assumed that two bases are aligned if and only if they derive from a common
ancestral base. The alignment then gets divided into several syntenic alignment blocks. Within
every block, ancestral sequences at each internal node of the mammalian phylogenetic tree
[Miller et al. 2007] are inferred. It is worth mentioning that rearrangements, duplications, or large
insertions are not expected within each block. These sequences are inferred using Ancestors
1.1 [Diallo et al. 2010], a program that uses an evolutionary model involving context-dependent substitutions as well as indels to infer the maximum likelihood ancestral sequences.
Finally, heuristics are used to reduce errors due to incorrect alignment.
For several reasons, the eutherian mammal clade is a particularly interesting target for
ancestral genome inference. The fact that it includes the human genome is one of the key
motivations, as the study of ancestral genomes may shed some light on the function of various
parts of our own genome. We also argue that a good target species for a genomic reconstruction
is one that has generated a large number of independent, successful descendant lineages through a
rapid series of speciation events [Murphy et al. 2001]. Therefore, due to the mammalian
radiation, certain early-eutherian mammal genomes can be inferred to a surprisingly high degree
of accuracy.
The boreoeutherian ancestor is the ancestor of all eutherian mammals except Afrotherians
(e.g. elephants) and Xenarthrans (e.g. sloths and armadillos). Using the reconstruction pipeline
mentioned above, most of the euchromatic genome of the boreoeutherian ancestor can be inferred
from the extant genomes of the main lineages with 98-99% base-by-base accuracy [Blanchette et
al. 2004]. In the next chapter, we will discuss how the inferred genome of the boreoeutherian
ancestor can be used to identify ancient TEs in the human genome.
Chapter 2: Annotation of Ancient
Transposable Elements in the Human
Genome
1
Introduction
After their insertion in the genome, most TE copies get relatively quickly degraded, making the
recognition of old insertion events challenging. Although these ancient TEs may no longer be
recognizable in the human genome, inferred mammalian ancestral genomes may contain younger
versions of them, which can be detected by existing TE annotation tools such as RepeatMasker
[Smit, Hubley, & Green 1996].
To elaborate, Figure 2.1 represents the aging of TE copies (symbolized by squares) in a
genome (symbolized by lines) over time. Assume the red square on the first line is a TE fragment
belonging to family A, inserted into our genome 100 million years ago. After 20 million years, it
replicates itself and makes an identical copy at another position in the genome. Then, 20 million
years later, it again replicates itself and makes a third copy, but as the genome evolves, the second
copy along with other parts of the genome gets mutated (pink). After another 20 million years,
another TE fragment belonging to a different family, family B (dark blue), is inserted into the
genome. The family B copy starts to transpose while the TE fragments belonging to family A
have been mutated and have lost their transposition ability. Over time, TE copies belonging to both
families mutate and diverge further from the corresponding original copies (they become
lighter and lighter in color). Therefore, on the last line of Figure 2.1, which represents our
genome today, we see TE fragments of different colors belonging to families A and B. All the
copies from family B, which has expanded more recently than family A, are recognizable,
although some have become lighter. However, TE copies from family A have lost their color to
the point where one of them is no longer recognizable as a copy belonging to this family.
Figure 2.1: Transposable elements degrading over time
Lines represent a genome evolving over time.
Red squares represent family A and blue squares represent family B of TEs.
Nevertheless, if we had the genome from 50 million years ago, we would have been able to
identify all the copies belonging to family A that are still present in our genome today. Thus,
having the inferred ancestral genome of the boreoeutherian ancestor may help us in classifying
the ancient transposable element insertions in the human genome.
In this chapter, we develop an approach to identify ancient TE copies, complete the annotation
of TEs in the human genome, classify new families of TEs and identify their associations with
functional elements in the human genome. We propose the use of ancestral genomes to improve
the detection of ancient TE copies and discover new TE families. We expand on our pipeline in
the methods section and report our results using this approach in the results and discussion
sections.
2
Methods
In this section, we explain the pipeline summarized in Figure 2.2, which we developed to
complete the annotation of TEs in the human genome. We first give an outline of the method and
then provide details on each step.
Figure 2.2: Ancient Transposable Element Annotation Pipeline
Given a phylogenetic tree of mammals, a mammalian ancestral genome is reconstructed. Running
RepeatMasker on the ancestral genome, we obtain TE copies whose descendants have possibly
not been identified in the human genome. These are the TEs not recognizable by RepeatMasker
as TE copies in the human genome. Once we have identified these ancient TE copies, we map them to
the human genome using LiftOver [Kuhn et al. 2012], which is a tool to convert genome
positions between different genome assemblies based on an alignment (Figure 2.3). Then, we
calculate the divergence percentage of these mapped TEs in humans from the consensus
sequences.
Figure 2.3: Mapping TEs from the boreoeutherian ancestor genome to the human genome
Red squares represent TEs identified in the boreoeutherian ancestor genome, blue squares
represent TEs currently identified in the human genome, and purple squares represent the
overlapping regions in the revised human TE annotation.
We also run RepeatMasker on the human genome and mask the identifiable TE copies that in fact
have already been recorded as TEs in the current annotation of the human genome. The
overlapping regions (shown as purple squares in Figure 2.3) are identified using FeatureBits
[Kuhn et al. 2012], a bioinformatics tool to report the intersection of two files of genomic
annotations, before the newly annotated ancient TE copies in the human genome are reported.
This pipeline is explained in detail in the following sub-sections.
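The overlap step can be illustrated with a small interval computation: given lifted ancestral TE intervals and existing annotation intervals on one chromosome, the bases in the intersection are already annotated and the remainder is newly covered. This is only a sketch of the kind of computation FeatureBits performs, not its implementation; the function names and coordinates are hypothetical, and intervals are assumed to be half-open as in BED files.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(intervals):
    """Number of distinct bases covered by a set of intervals."""
    return sum(end - start for start, end in merge_intervals(intervals))


def intersect(a, b):
    """Intersect two interval sets located on the same chromosome."""
    a, b = merge_intervals(a), merge_intervals(b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


lifted = [(100, 200), (300, 450)]   # hypothetical TEs lifted from the ancestor
current = [(150, 250), (400, 420)]  # hypothetical existing human annotation
overlap = intersect(lifted, current)
gained = covered_bases(lifted) - covered_bases(overlap)
print(overlap, gained)
```

In this toy case, 70 of the 250 lifted bases were already annotated, so 180 bases would count as newly gained coverage.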
2.1 Ancestral Genome Reconstruction
Previous work [Blanchette et al. 2004] has shown that ancestral mammalian genomes can be
probabilistically reconstructed with up to 99% accuracy, given a phylogenetic tree. We used the
Multiz whole-genome multiple alignment of actual genomic sequences from 36 mammals,
obtained from the UCSC Genome Browser [Miller et al. 2007], and applied the Ancestor 1.1 pipeline
[Diallo et al. 2010] to reconstruct ~1.8 gigabases (Gb) of ancient genome sequence from the
boreoeutherian ancestor. The Ancestor program output file contains the alignment of the 36
mammals, the inferred boreoeutherian ancestor sequence, and the confidence score attached to
each base, all subdivided into blocks.
Figure 2.4: Vertebrate Phylogenetic Tree
The boreoeutherian ancestor is noted with a red box.
Adapted from [Miller et al. 2007]
As indicated on the phylogenetic tree in Figure 2.4, the boreoeutherian ancestor is a species that
lived approximately 75 million years ago. The reason for this choice is that the accuracy of
the reconstruction method depends crucially on the length of early branches of the phylogenetic
tree. For example, the star tree with no internal shared node is the most favorable tree for
reconstruction. Thus, ancestral sequences at the center of a rapid radiation can be reconstructed
more accurately than those of the more recent ancestors. The boreoeutherian ancestor meets all
the criteria and is the most accurate ancestral sequence that is obtained by the method mentioned
above.
2.2 Identification of Transposable Elements in Boreoeutherian Ancestor
RepeatMasker is a popular program in computational genomics that screens DNA sequences to
identify and classify interspersed repeats and low complexity DNA sequences. To do so, it aligns
the query sequence against a library of known repeats, such as Repbase (the most commonly used
database of repetitive DNA elements) [Jurka et al. 2005]. The output of the program is a detailed
annotation of the repeats identified in the query sequence as well as a modified version of the
query sequence in which all the annotated repeats have been masked.
TE families, in contrast to multigene families, are usually defined based on their active ancestor
and generation mechanism, although over time, individual elements may acquire diverse
biological roles. Existing tools such as RepeatMasker, however, struggle to identify TEs that
have diverged by more than 40% from the original consensus sequence. Most of these TEs were
inserted more than 70 million years ago and have diverged to the point that they are no longer
recognizable as TEs in the human genome. Nonetheless, these highly diverged TEs look younger
(less diverged from the original consensus sequence inserted) in the ancestral sequences.
Therefore, in principle, we should be able to identify these old insertions in an ancestral sequence
using RepeatMasker and map them to the human genome.
Having the reconstructed genome of the boreoeutherian ancestor (available at: http://cs.mcgill.ca/~aahiat1/Ancestor/), which was explained in section 3 of chapter 1, the first
step in our pipeline is to identify TE copies present in that genome. To this end, we run
RepeatMasker from the command line on the boreoeutherian ancestor sequence FASTA file
[Pearson 2000]. We set the species of the input sequence (-species parameter) to mammal and the
search engine (-e parameter) to hmmer. The default RepeatMasker version uses the WU-BLAST
search engine, which was in fact initially used for our annotations. However, the first version of
a transposable element profile HMM database [Eddy 1998] was released in 2013, which notably
improved the characterization of TE sequences. The use of profile HMMs not only improves
sensitivity over single-sequence search but also captures the additional information contained in
position-specific nucleotide distributions and indel variability. Genome-scale
searches with profile HMMs have only recently become feasible; therefore, the new version of RepeatMasker
uses Dfam (profile HMM library derived from Repbase sequences) [Travis et al. 2013] and
nhmmer [Wheeler & Eddy, 2013] (see: http://www.repeatmasker.org/).
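For concreteness, the invocation described above can be assembled as follows. The sketch only builds and prints the command line from the two parameters named in the text (-species and -e); the input file name is a placeholder assumed for this example, and any further options used in the actual run are not shown.

```python
import shlex


def repeatmasker_command(fasta_path, species="mammal", engine="hmmer"):
    """Build the RepeatMasker argument list using the two parameters
    named in the text: -species and -e (search engine)."""
    return ["RepeatMasker", "-species", species, "-e", engine, fasta_path]


# "boreoeutherian.fa" is a hypothetical file name for the ancestral genome.
cmd = repeatmasker_command("boreoeutherian.fa")
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run on a machine where RepeatMasker and its HMMER engine are installed.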
RepeatMasker produces four types of output files, .mask, .tbl, .cat, and .out that essentially
contain the TE annotation information (available at: http://cs.mcgill.ca/~aahiat1/RepeatMasker/).
The .mask file contains the masked sequence annotation, which is the same as the query sequence
except that the repetitive elements are masked using N or X letters. The .tbl file states the
percentage of genome coverage of each annotated TE family. The .cat file contains details on the
alignment between the identified TE sequences and their consensus sequences in the repeat
library and profile HMM library. It also includes information such as the position of the
identified TEs in the query sequence, the divergence from the consensus sequence, and the
classified TE family. The .out file is a self-explanatory annotation file that summarizes
all the information in the .cat file except the alignment details.
After running RepeatMasker on the boreoeutherian ancestor genome, we extract the chromosome
number, start and end positions, divergence percentage, and annotated family from the .cat output
file. We store them in the Browser Extensible Data (BED) file format to be used as input by
another program, the UCSC Batch Coordinate Conversion tool (LiftOver), which converts
genome positions from one genome assembly to another, in our case
from the boreoeutherian ancestor to human. Running LiftOver on the boreoeutherian ancestor
BED file, we get the human genome coordinates (UCSC hg19 assembly) corresponding to the TE
sequences identified in the boreoeutherian ancestor genome. Here we use the LiftOver command line
tool, where the chain file that holds the instructions for the conversion is set to
boreoToHg19.chain (available at: http://cs.mcgill.ca/~aahiat1/LiftOver/). For more information
on the LiftOver tool, see: https://genome.ucsc.edu/cgi-bin/hgLiftOver.
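The extraction of TE records into BED lines can be sketched as below. For simplicity, the sketch parses the whitespace-separated layout of the RepeatMasker .out file rather than the .cat file used in our pipeline, which is an assumption made for illustration; the function name, the way the family and divergence are packed into the BED name column, and the sample record are likewise hypothetical.

```python
def out_line_to_bed(line):
    """Convert one RepeatMasker .out record to a BED line, or return None
    for header and blank lines."""
    fields = line.split()
    if len(fields) < 11 or not fields[0].isdigit():
        return None
    divergence = fields[1]              # percent divergence from the consensus
    chrom = fields[4]                   # query sequence name
    start = int(fields[5]) - 1          # .out is 1-based, BED is 0-based half-open
    end = int(fields[6])
    strand = "-" if fields[8] == "C" else "+"
    # Pack repeat name, class/family and divergence into the BED name column.
    name = "|".join([fields[9], fields[10], divergence])
    return "\t".join([chrom, str(start), str(end), name, "0", strand])


sample = """\
   SW  perc perc perc  query  position in query  matching  repeat  position in repeat
score  div. del. ins.  sequence  begin  end  (left)  repeat  class/family  begin  end (left)  ID

 1320 15.6  6.2  0.0  chr1  6563  6781  (22462) C  MER7A  DNA/MER2_type  (0)  336  103  1
"""
bed_lines = [b for b in map(out_line_to_bed, sample.splitlines()) if b]
print(bed_lines[0])
```

Carrying the family and divergence in the name column keeps them attached to each interval after the coordinates are converted, since LiftOver passes the remaining BED columns through unchanged.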
2.3 Divergence Calculation
After following the pipeline mentioned above, we obtained all the TE positions in the human
genome that had been present in the boreoeutherian ancestor genome. However, these TE copies,
which have been reported by RepeatMasker and then mapped to the human genome, are not
annotated with the correct percentage of divergence from the consensus sequences in the library
of TEs. This is because RepeatMasker calculates the divergences between the boreoeutherian
ancestor and the consensus sequences. LiftOver only lifts the identified regions to the human
genome and is not meant to change the divergences. Correcting this is important because the
divergence is an indication of the age of TE fragments. Therefore, our goal is to correct the
percentage of divergence for each TE copy in human from its consensus sequence. This is
achieved using a Java program we have developed called TEMapper (available at:
http://cs.mcgill.ca/~aahiat1/TEMapper/).
Figure 2.5 shows a sample snapshot of a .cat file produced by RepeatMasker, which
includes details of the alignment between the consensus sequences and the query sequence.
Figure 2.6 shows a sample LiftOver .mapped output file in which every row corresponds to one
TE copy lifted from one genome assembly to another. Figure 2.7 shows an alignment block from
a sample .maf output file of the Ancestor program.
Figure 2.5: RepeatMasker .cat output file example
Each block corresponds to one TE copy masked. Highlights are the information extracted from
this file: divergence percentage, chromosome number, start and end positions, family, and the
alignment between the boreoeutherian ancestor and consensus sequences.
Figure 2.6: LiftOver .mapped output file example
Each row contains the information of a TE fragment in the boreoeutherian ancestor genome
(column 4), lifted to the human genome (columns 1-3).
Figure 2.7: Ancestor program .maf output file example
Highlights are the human and boreoeutherian ancestor sequences that are aligned in one
alignment block. Sequences of other species and details of the reconstruction are ignored.
The first step is to extract, for each TE copy, the alignment between the boreoeutherian
ancestor and consensus sequences reported by RepeatMasker in the .cat file. Then we search the
reconstructed file given by Ancestor and extract the alignment between the human and
boreoeutherian ancestor corresponding to the same region. Having these two alignments, we
can align the human and consensus sequences through the boreoeutherian ancestor
sequences and produce a third alignment from which we can calculate the correct divergence. In
principle, the divergence between the consensus sequence and human must be less than or equal
to the divergence between the consensus and boreoeutherian ancestor plus the divergence
between the boreoeutherian ancestor and human:
Div % (H,C) <= Div % (C,B) + Div % (B,H)
Our alignment algorithm works as follows:
Consider:
Consensus Sequence: $c = c_1 \dots c_n$
Boreoeutherian Ancestor Sequence: $b = b_1 \dots b_m$
Human Sequence: $h = h_1 \dots h_l$
• If $c_i$ is aligned with $b_j$ and $b_j$ is aligned with $h_k$, then $c_i$ is aligned with $h_k$.
• If $c_i$ is aligned with a gap in $b$, then $c_i$ is aligned with a gap in human.
• If $c_i$ is aligned with $b_j$ and $b_j$ is aligned with a gap in human, then $c_i$ is aligned with a gap
in human.
For example, given the following two hypothetical alignments:
Consensus:
… A G C T G G C T G T C A C – T C …
Boreoeutherian ancestor:
… A G C – – G C T G T T C C A T C …
Boreoeutherian ancestor:
… A G C – G – C T G T T C C A T C …
Human:
… A G C A A A C T G T G – C A T C …
The alignment between the consensus and human sequences would be:
Consensus:
… A G C T G – G – C T G T C A C – T C …
Human:
… A G C – – A A A C T G T G – C A T C …
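The rules above can be sketched in Python as a merge of the two pairwise alignments through their shared ancestral sequence. This is an illustrative sketch rather than the TEMapper implementation (which is written in Java); the function name `compose`, the use of `-` as the gap character, and the choice to emit consensus-only columns before human-only columns when both occur at the same ancestral position are assumptions. Human bases with no ancestral counterpart, a case not listed in the rules, are placed opposite a gap in the consensus, as in the example.

```python
def compose(cons_b: str, anc_b: str, anc_h: str, hum_h: str, gap: str = "-"):
    """Align consensus to human through the ancestor.

    (cons_b, anc_b) is the consensus/ancestor alignment and
    (anc_h, hum_h) is the ancestor/human alignment.
    """
    if len(cons_b) != len(anc_b) or len(anc_h) != len(hum_h):
        raise ValueError("rows of a pairwise alignment must have equal length")
    if anc_b.replace(gap, "") != anc_h.replace(gap, ""):
        raise ValueError("the two ancestral sequences differ")

    cons_out, hum_out = [], []
    i = j = 0
    while i < len(anc_b) or j < len(anc_h):
        if i < len(anc_b) and anc_b[i] == gap:
            # Consensus base aligned with a gap in the ancestor: gap in human.
            cons_out.append(cons_b[i])
            hum_out.append(gap)
            i += 1
        elif j < len(anc_h) and anc_h[j] == gap:
            # Human base absent from the ancestor: gap in the consensus.
            cons_out.append(gap)
            hum_out.append(hum_h[j])
            j += 1
        else:
            # Same ancestral base in both alignments: pair its two partners,
            # skipping the column if both partners are gaps.
            if not (cons_b[i] == gap and hum_h[j] == gap):
                cons_out.append(cons_b[i])
                hum_out.append(hum_h[j])
            i += 1
            j += 1
    return "".join(cons_out), "".join(hum_out)


cons, hum = compose("AGCTGGCTGTCAC-TC", "AGC--GCTGTTCCATC",
                    "AGC-G-CTGTTCCATC", "AGCAAACTGTG-CATC")
print(cons)
print(hum)
```

Applied to the example, the sketch reproduces the consensus and human rows shown above.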
We calculate the percentage of divergence between the consensus and the human sequences by
dividing the number of mismatched alignment positions, including those with a gap, by the
alignment length. Other types of mutation events, such as duplications and rearrangements,
may also occur within the identified sequences. However, since they are rare and difficult to
identify, we did not account for them in our model. Thus, our alignment algorithm only considers
substitutions and indels. Note that the four sequences from the first two alignments must have
the same length, including the gaps.
Div% (H,C) = (# mismatches including gaps / alignment length) * 100
Therefore, in our example, the divergence is about 44% (8 mismatched positions out of 18).
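The calculation can be checked with a few lines of Python. This is a minimal sketch assuming the two aligned rows are given as equal-length strings with `-` as the gap character; the function name is hypothetical.

```python
def divergence_percent(cons: str, hum: str) -> float:
    """Percent of alignment columns where the two rows differ (gaps count)."""
    if len(cons) != len(hum) or not cons:
        raise ValueError("aligned rows must be non-empty and of equal length")
    mismatches = sum(1 for c, h in zip(cons.upper(), hum.upper()) if c != h)
    return 100.0 * mismatches / len(cons)


# The consensus/human alignment from the example above.
div = divergence_percent("AGCTG-G-CTGTCAC-TC", "AGC--AAACTGTG-CATC")
print(f"{div:.1f}%")
```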
Beyond routine implementation difficulties and errors, implementing TEMapper raised three
main complications. The first originates from the enormous size of the data. Approximately
1.8 Gb of the boreoeutherian ancestral sequence has been reconstructed, in which more than
three million TE segments have been masked by RepeatMasker and subsequently mapped to the
human genome by LiftOver. Since every disk access is time consuming, we used a hash-map data
structure to speed up access to these files. The hash map uses the TE coordinates in the
boreoeutherian ancestor as keys to access the values, which are the TE coordinates in human.
The downside to this approach is the large amount of memory required. However, this remained
manageable and resulted in significant speed-ups. In addition, we need to extract the alignment
between the boreoeutherian ancestor and human sequences from the Ancestor program output file,
but the mapped TE segments are dispersed throughout this massive alignment, and navigating the
whole alignment multiple times is highly time consuming. Therefore, we partitioned the sequence
alignment into short blocks and saved the position of the first nucleotide of each block. Once
we read the coordinates of a TE from the LiftOver hash map, we only search around the closest
alignment block. This approach drastically reduces our search space.
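The two lookups described above can be sketched as follows. This is a Python illustration under stated assumptions, not the Java data structures of TEMapper: the key and value layouts of the dictionary, the `BlockIndex` class, and the use of binary search over sorted block start positions are all hypothetical choices.

```python
import bisect

# Hash map from ancestral TE coordinates to lifted human coordinates.
# Keys and values are (sequence name, start, end); the layout is assumed.
lifted = {
    ("anc_chr1", 1200, 1500): ("chr3", 50210, 50498),
}


class BlockIndex:
    """Index of alignment blocks by the position of their first nucleotide."""

    def __init__(self, block_starts):
        self.starts = sorted(block_starts)

    def locate(self, pos: int) -> int:
        """Return the index of the last block starting at or before pos."""
        k = bisect.bisect_right(self.starts, pos) - 1
        if k < 0:
            raise KeyError(f"position {pos} precedes the first block")
        return k


index = BlockIndex([0, 1000, 2000, 3000])
anc_key = ("anc_chr1", 1200, 1500)
human_coords = lifted[anc_key]         # constant-time lookup, no disk access
block = index.locate(anc_key[1])       # start the search at this block only
```

With the block starts kept sorted, each lookup costs a logarithmic number of comparisons instead of a scan of the whole alignment.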
The second problem is that although the two boreoeutherian ancestor sequences, one
obtained from the Ancestor output and the other from the RepeatMasker output, are supposed
to be identical, sometimes they do not align perfectly. That is, there are cases in which
the two extracted boreoeutherian ancestor sequences (corresponding to a single TE fragment) are
shifted by a few bases at one or both ends of one of the sequences. We believe that the issue is
caused by a small bug in the LiftOver program, which converts the genome positions between the
two genome sequences. We noticed this issue when our results showed a significant increase in
the divergence and did not match our test cases. Trimming the ends would not easily
solve the problem because the shift disturbs the whole alignment. Therefore, we had to
observe many examples of each case, identify the shifting pattern and write a program to trim
both sequences accordingly. Since there was no fixed pattern for the shifting, we wrote a
program to test all possible shifts (taking the longest common sequence) and report the one that
reconciles the two sequences.
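One way to realize such a reconciliation is sketched below in Python. It is an assumption-laden illustration, not the program written for the thesis: it interprets "longest common sequence" as the longest common contiguous substring of the two ungapped ancestral sequences and uses the standard-library `difflib.SequenceMatcher` to find it.

```python
from difflib import SequenceMatcher


def reconcile(seq_rm: str, seq_anc: str):
    """Return (offset in seq_rm, offset in seq_anc, length) of the longest
    common contiguous stretch of the two ungapped ancestral sequences."""
    matcher = SequenceMatcher(None, seq_rm, seq_anc, autojunk=False)
    m = matcher.find_longest_match(0, len(seq_rm), 0, len(seq_anc))
    return m.a, m.b, m.size


# The RepeatMasker copy carries two extra leading bases and the Ancestor
# copy two extra trailing bases (hypothetical sequences).
off_rm, off_anc, size = reconcile("TTACGTACGG", "ACGTACGGCA")
trimmed_rm = "TTACGTACGG"[off_rm:off_rm + size]
trimmed_anc = "ACGTACGGCA"[off_anc:off_anc + size]
```

The two offsets give the number of bases to trim from the start of each sequence so that the remaining stretches coincide.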
The third problem arose when we faced missing alignment blocks between the boreoeutherian
ancestor and human. Those are due to the insertions of DNA fragments (either of TE origin or
not) within an ancestral TE but occurring in the human lineage after the boreoeutherian ancestor.
For instance, an Alu could have been inserted inside an ancient L1 fragment, creating a nested TE
(Figure 2.8). If we were to consider those missing blocks as gaps, then the calculated divergence
would have been incorrectly increased. By eliminating those cases, however, we would have lost
many identified ancient TEs. In addition, considering the inserted fragments as part of the ancient
TE would have been counting the overlapping regions more than once and categorizing them
under different TE families. Thus, in our implementation, we recognize those cases, extract the
missing blocks and consider each of the remaining blocks as a separate TE fragment with the
same divergence and family.
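The splitting step can be sketched as an interval subtraction. The Python below is a hypothetical illustration: the function name, the half-open coordinate convention and the example coordinates are assumptions, not taken from TEMapper.

```python
def split_around_insertions(te_start, te_end, insertions):
    """Return the parts of [te_start, te_end) not covered by any insertion.

    insertions is a list of (start, end) half-open intervals, for example
    younger elements nested inside an ancient TE.
    """
    pieces = []
    cursor = te_start
    for ins_start, ins_end in sorted(insertions):
        if ins_end <= cursor or ins_start >= te_end:
            continue  # insertion lies outside the part still to be covered
        if ins_start > cursor:
            pieces.append((cursor, ins_start))
        cursor = max(cursor, ins_end)
    if cursor < te_end:
        pieces.append((cursor, te_end))
    return pieces


# An ancient L1 fragment interrupted by one younger insertion; each piece
# would then be reported with the family and divergence of the original TE.
fragments = [(start, end, "LINE/L1", 31.5)
             for start, end in split_around_insertions(100, 1000, [(400, 700)])]
```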
Figure 2.8: Hypothetical example of nested TEs inserted into the genome
L1 (ATT) is represented by the purple square and Alu (CG) by the blue square.
Recognizing the last two issues mentioned above consumed a considerable amount of time because,
given the size of the data, the errors were not evident in the early stages.
We were only able to trace the problems in the final steps, by visualizing the results
in the UCSC genome browser.
After applying the correction algorithm to the set of TEs identified in the boreoeutherian ancestor
and mapped to human, we obtain a BED file containing the TE coordinates in the human
genome, the corrected percentage of divergence, and the family they belong to. These results,
however, do not include the TE copies that were inserted into the human genome after
the boreoeutherian ancestor, such as Alus. We thus run RepeatMasker on the human genome
(UCSC hg19 assembly) and from the .out output file produce a second BED file with the same
format as the first one, containing all TEs identified in the human genome.
Many segments overlap between the two files, corresponding to the TEs that RepeatMasker
recognizes in both the boreoeutherian ancestor and human genomes. In order to eliminate this
redundancy and obtain the intersection of our two BED files, we run FeatureBits, another UCSC
tool, which reports the intersection of two given BED files. The result is a single BED
file containing the human genome coordinates, divergence percentage, and family of the
transposable elements identified only in the human genome, in both the boreoeutherian ancestor
and human genomes, and only in the boreoeutherian ancestor genome (BED files are available at:
http://cs.mcgill.ca/~aahiat1/FeatureBits/). In the following section, we discuss the results
achieved by applying this pipeline.
3 Results and Discussion
In this section, we report the results obtained by applying the pipeline in Figure 6,
which is explained in the previous section. The results include the divergence profile of TEs in the
human and boreoeutherian ancestor genomes, the revised annotation of TEs identified in the
human genome and the share gained by each TE family, as well as an analysis of some novel TE
families obtained by de novo TE discovery techniques.
3.1 Human and Boreoeutherian Ancestor Divergence Profiles
Using RepeatMasker, we annotated the recognizable TEs in both the boreoeutherian ancestor and
human genomes. For each major TE family reported, we plotted the percentage of divergence
from the consensus versus the genome coverage, where each point on a curve represents the
genome coverage of a TE family at a certain divergence level.
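A divergence profile of this kind can be tabulated as sketched below. The Python is illustrative only: the record layout, the binning of divergence to whole percentage points, and the expression of coverage as a percentage of the genome length are assumptions rather than details given in the text.

```python
from collections import defaultdict


def divergence_profile(records, genome_bp):
    """Map (family, divergence bin) to genome coverage in percent.

    records is an iterable of (family, start, end, divergence_percent)
    with half-open coordinates; divergence is binned by truncation.
    """
    covered = defaultdict(int)
    for family, start, end, divergence in records:
        covered[(family, int(divergence))] += end - start
    return {key: 100.0 * bp / genome_bp for key, bp in covered.items()}


# Toy annotation on a hypothetical 10 kb genome.
profile = divergence_profile(
    [("L2", 0, 100, 25.3), ("L2", 200, 250, 25.9), ("Alu", 0, 300, 5.0)],
    genome_bp=10_000,
)
```

Plotting, for one family, the coverage values against their divergence bins gives one curve of the profile.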
In the human genome (Figure 2.9), Alus and L1s respectively are the youngest and most
abundant elements identified since they are currently the only active ones. L2 and CR1 elements,
on the contrary, are older, and therefore fewer of their copies are annotated. In the
boreoeutherian ancestor genome (Figure 2.10), however, LINE L2 and CR1 elements, for
example, are detected in significantly larger fractions than in human, because
these elements are much less decayed in the boreoeutherian ancestor sequence. Alus, on the
other hand, are mostly absent in the boreoeutherian ancestor as they are primate-specific TEs. We
will discuss each of these cases in more detail later. Note that the rate of TE insertion has
changed over time. Therefore, the bumps on some of the curves (most noticeable in the human
L1 curve) depict sudden increases in TE insertions at certain times.
If we take a closer look at the L2 (Figure 2.11) and CR1 (Figure 2.12) subfamilies belonging to
the LINE family, we see that in the comparison plots, the blue curves, which correspond to the
coverage of TEs identified in the boreoeutherian ancestor, are higher than those of human
shown in red. This means that RepeatMasker has detected more copies from these families in the
boreoeutherian ancestor genome. These copies also appear younger, as denoted by the shift of the
blue curve to the left. These changes are expected because the elements from the LINE L2 and
CR1 subfamilies were inserted into mammalian genomes long before the boreoeutherian
ancestor. Therefore, RepeatMasker was able to find a larger number of copies with less divergence.
On the other hand, Alu elements belonging to the SINE family, which are abundant in the human
genome, are not expected in the boreoeutherian ancestor genome, because the Alu elements were
inserted long after the boreoeutherian ancestor and are specific to the primate lineage. However,
due to RepeatMasker detection errors and the fact that the boreoeutherian ancestor sequence
inferred by Ancestor is not 100% accurate, a small number of Alu elements (denoted by the dark
blue curve in Figure 2.13) are detected in the boreoeutherian ancestor.
Figure 2.9: Divergence profile of annotated major TEs in the human genome
Figure 2.10: Divergence profile of annotated major TEs in the boreoeutherian ancestor genome
Figure 2.11: LINE/L2 elements divergence profile comparison
between the human and boreoeutherian ancestor genomes
Figure 2.12: LINE/CR1 elements divergence profile comparison
between the human and boreoeutherian ancestor genomes
Figure 2.13: SINE/Alu elements divergence profile comparison
between the human and boreoeutherian ancestor genomes
3.2 Mapping and Divergence Correction
By running RepeatMasker on the boreoeutherian ancestor, we obtained a .cat file (Figure 2.5)
from which we extract all the identified TE regions to be lifted by LiftOver to the human genome
(Figure 2.6). Our TEMapper program combines the results obtained from RepeatMasker,
LiftOver, and Ancestor (Figure 2.7) programs and produces a BED file containing the revised
annotation of human TEs. The details of the TEMapper program and the ancient TE identification
pipeline are explained fully in the methods section.
We calculated the genome coverage of every TE family at each divergence level for the TEs
identified in the boreoeutherian ancestor only, in both the boreoeutherian ancestor and human,
and in human only. This was done by running FeatureBits on the files containing the
boreoeutherian ancestor and human annotated TE regions to obtain the intersection, which
corresponds to the TEs existing in both the boreoeutherian and human genomes. Subtracting the
intersected regions from both files, we separately obtained the TEs existing in only the human
and only the boreoeutherian ancestor genomes.
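The set operations involved can be illustrated with plain interval arithmetic. The Python below is a self-contained sketch of intersection and subtraction on sorted, non-overlapping, half-open intervals of a single chromosome; it does not call FeatureBits, and the function names and toy coordinates are assumptions.

```python
def intersect(a, b):
    """Intersection of two sorted lists of non-overlapping intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def subtract(a, b):
    """Parts of the intervals in a that are not covered by b."""
    out, j = [], 0
    for start, end in a:
        cur = start
        while j < len(b) and b[j][1] <= cur:
            j += 1  # skip intervals of b that end before this one starts
        k = j
        while k < len(b) and b[k][0] < end:
            if b[k][0] > cur:
                out.append((cur, b[k][0]))
            cur = max(cur, b[k][1])
            k += 1
        if cur < end:
            out.append((cur, end))
    return out


human = [(0, 100), (200, 300)]    # TEs annotated directly in human
boreo = [(50, 250), (400, 500)]   # TEs lifted from the ancestor

both = intersect(human, boreo)
human_only = subtract(human, boreo)
boreo_only = subtract(boreo, human)
```

The three resulting lists partition the annotated bases into the "human only", "boreoeutherian ancestor and human" and "boreoeutherian ancestor only" subsets used in the plots.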
For each major family of TEs, we plotted the three subsets of our data; here we report the
most interesting ones. Figures 2.14-2.16 illustrate the coverage gained by LINE/CR1,
LINE/L2, and SINE/MIR respectively in our revised TE annotation, where the stacked area under
the blue curve corresponds to the human-only portion, purple to the portion shared by human and
the boreoeutherian ancestor, and red to the boreoeutherian-only portion. Each point on a curve
represents the genome coverage at a certain divergence level.
In the CR1 (Figure 2.14), L2 (Figure 2.15), and MIR (Figure 2.16) plots, the red curves, which
correspond to the TEs found only through the boreoeutherian ancestor, are shifted to the right,
meaning that these TE copies are more diverged from their consensus sequences. In addition, the
peaks of these curves are higher than the others, which shows the genome coverage gained by each
of those families. These results are compatible with our expectation since the LINE family and
SINE/MIR are the oldest TE families known, having been inserted into mammalian genomes
more than 70 million years ago, before the boreoeutherian ancestor. Therefore, we had expected
to identify more copies in the human genome with higher percentages of divergence.
Figure 2.14: Coverage gained by revised annotation of LINE/CR1 elements
Figure 2.15: Coverage gained by revised annotation of LINE/L2 elements
Figure 2.16: Coverage gained by revised annotation of SINE/MIR elements
In Table 2.1, we report the breakdown of coverage gained by each TE family identified using
RepeatMasker. The coverage gained refers to the portion of the identified TEs in human that
was not originally annotated as TE by RepeatMasker when executed on human only, but was
labeled as such after mapping to human the TEs identified in the boreoeutherian ancestor. To
clarify, considering a genomic position p and a specific TE family X, we categorized our
identified TEs as follows:
A genomic position p is labeled as belonging to TE family X of type “human only” if p
is labeled as:
A. family X in human but not identified as a TE in the boreoeutherian ancestor.
B. family X in human but family Y in the boreoeutherian ancestor (to account for
copies that are identified in the boreoeutherian ancestor as well but categorized under a
different family).
A genomic position p is labeled as belonging to TE family X of type “boreoeutherian
ancestor and human” if p is labeled as:
C. family X in both human and the boreoeutherian ancestor.
D. family X in the boreoeutherian ancestor and family Y in human (to account for
copies that are already identified in human but categorized under a different family).
A genomic position p is labeled as belonging to TE family X of type “boreoeutherian
ancestor only” if p is labeled as:
E. family X in the boreoeutherian ancestor but not identified as a TE in human.
The coverage gained is calculated as follows:
F. Total coverage of family X in the previous annotation of the human genome = A + B + C
G. Total coverage of family X in the revised annotation of the human genome = A + C + D + E
Coverage gained from family X in the revised annotation of the human genome = G - F = D + E - B
A family therefore loses coverage whenever B exceeds D + E.
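The bookkeeping can be verified with a short Python sketch, shown here for the L1 row of Table 2.1. The function name is hypothetical, and expressing the percentage gained relative to F is an assumption, although it is consistent with the values reported in the table. Small differences in the last digit with respect to the table come from rounding of the published values.

```python
def coverage_gained(a, b, c, d, e):
    """Return (F, G, gain in Mb, gain in percent of F) for one TE family."""
    previous = a + b + c          # F: coverage in the previous annotation
    revised = a + c + d + e       # G: coverage in the revised annotation
    gained = revised - previous   # algebraically equal to d + e - b
    return previous, revised, gained, 100.0 * gained / previous


# L1 row of Table 2.1 (all values in Mb).
f_l1, g_l1, gain_l1, pct_l1 = coverage_gained(398.93, 4.29, 159.16, 11.07, 16.1)
print(f"F={f_l1:.2f} G={g_l1:.2f} gained={gain_l1:.2f} Mb ({pct_l1:.1f}%)")
```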
Class           Family        A (Mb)  B (Mb)  C (Mb)  D (Mb)  E (Mb)  F (Mb)  G (Mb)  Coverage Gained (Mb)  Coverage Gained (%)
LINE            L1            398.93  4.29    159.16  11.07   16.1    562.37  585.2   22.88                 4.1
LINE            L2            26.34   0.68    99.9    4.78    34.09   126.91  156.1   38.20                 30.1
LINE            CR1           2.08    0.08    11.69   0.58    5.10    13.9    19.50   5.60                  40.47
LINE            Others        0.97    0.04    5.73    0.32    1.66    6.74    8.68    1.94                  28.79
SINE            Alu           273.36  14.05   18.17   1.75    0.75    305.58  294.2   -11.55                -3.78
SINE            MIR           14.96   0.53    72.50   3.14    19.03   87.99   109.6   21.64                 24.59
SINE            Others        0.31    0.01    1.34    0.03    1.17    1.65    2.84    1.19                  72.05
LTR             ERVL-MALR     67.97   2.00    56.14   2.41    4.62    126.11  131.1   5.02                  3.98
LTR             ERV1          78.91   1.55    10.25   0.56    1.79    90.72   91.51   0.79                  0.87
LTR             ERVL          32.66   0.84    34.07   1.64    2.70    67.58   71.08   3.50                  5.18
LTR             ERVK          8.83    0.12    0.01    0.0     0.01    8.96    8.86    -0.1                  -1.16
LTR             Others        5.6     0.13    10.0    0.4     3.65    15.66   19.63   3.97                  25.34
DNA Transposon  TCMAR-TIGGER  27.55   0.86    11.98   0.60    3.38    40.39   43.51   3.12                  7.73
DNA Transposon  HATCHARLIE    14.02   0.50    40.22   2.69    5.71    54.74   62.64   7.90                  14.43
DNA Transposon  HATTIP100     2.43    0.10    8.20    0.51    2.43    10.72   13.55   2.85                  26.58
DNA Transposon  HATBLACKJACK  0.93    0.04    2.87    0.15    0.34    3.8     4.29    0.45                  11.64
DNA Transposon  TCMARMARINER  2.05    0.12    1.12    0.07    0.31    3.29    3.55    0.26                  7.96
DNA Transposon  Others        5.02    0.25    1.71    0.17    0.25    6.99    7.16    0.17                  2.43
RNAs            -             1.54    0.26    0.21    0.05    0.12    2.01    1.92    -0.09                 -4.51
Retroposon      -             9.92    0.55    0.01    0.40    0.01    10.48   10.51   0.03                  0.29
Helitron        -             0.23    0.0     0.40    0.01    0.05    0.63    0.70    0.06                  10.01
Total           -             ~988    ~27     ~564    ~32     ~110    ~1580   ~1695   ~115                  ~7.28
Table 2.1: Revised TE annotation by ancient transposable element annotation pipeline
Columns are explained in the text above.
As we expected, the most substantial gain belongs to the elements of the old LINE subfamilies
that have resided in mammalian genomes for more than 70 million years. Therefore, there have
been more ancient copies to identify from this family. CR1 and L2 elements in particular have
shown a significant increase (approximately 40% and 30% respectively) in genome coverage in the
revised annotation. The other family that accounts for a significant coverage gain is the SINE
family (excluding Alus), namely ancient MIR elements (~25% for MIR and 72% for other SINE
elements). On the contrary, elements belonging to the SINE/Alu, LTR/ERVK, and RNA families
show slightly negative changes in coverage, which means that there are regions that were
originally labeled as belonging to some of these families but are now assigned to other
families. For example, some fragment in the human genome may align best with the LTR/ERVK
consensus sequence, while the less diverged ancestral version of the same fragment aligns
better with the LTR/ERVL consensus sequence. Therefore, in the revised annotation, we
lose that coverage under the LTR/ERVK family and instead annotate that region as
LTR/ERVL. Overall, according to our results (Table 2.1), ~115 Mb of genome coverage is
gained across all the TE families, which corresponds to a ~7.28% increase over the previously
annotated TE coverage (~1580 Mb). This result is significant since it adds approximately 3.5%
of the human genome to the total fraction derived from TEs.
3.3 De-novo Transposable Element Discovery
In order to investigate the boreoeutherian ancestral genome for the possible presence of unknown
TE families, we used the RepeatModeler [Smit & Hubler 2008] program to analyze the
boreoeutherian ancestor sequence. RepeatModeler is a de novo repeat family identification and
modeling package that employs two de novo repeat finding programs, RECON [Bao & Eddy
2002] and RepeatScout [Price et al. 2005], for TE detection. Most of the consensus sequences
produced were classified as belonging to known TE families or to gene families (olfactory
receptors and zinc fingers) and were disregarded. Nonetheless, we discovered 31 novel TE families,
summarized in Table 2.2 (available at: http://cs.mcgill.ca/~aahiat1/RepeatModeler/). Their size
varies from ~130 to 3800 bp, their copy number from ~100 to 1580, and their level of divergence
from the consensus from ~7 to 39%.
Family Name        Copy # in Human Genome  Human Genome Coverage (bp)  Range of Divergence from Consensus (%)  Consensus Sequence Length (bp)  TEClass Classification
rnd-5_family-38    226                     107598                      10-33                                   376                             LTR
rnd-5_family-120   601                     79268                       10-37                                   205                             LTR
rnd-5_family-170   872                     445434                      10-35                                   1939                            LTR
rnd-5_family-394   373                     151834                      15-35                                   723                             LINE
rnd-5_family-482   719                     138338                      11-35                                   232                             DNA Transposon
rnd-5_family-836   249                     104599                      14-35                                   551                             LTR
rnd-5_family-843   459                     333348                      9-30                                    300                             LTR
rnd-5_family-1234  963                     108281                      9-36                                    247                             LINE
rnd-5_family-1704  1581                    252427                      9-35                                    348                             Unknown
rnd-6_family-130   313                     45540                       14-34                                   217                             DNA Transposon
rnd-6_family-131   368                     62472                       14-34                                   359                             Unknown
rnd-6_family-133   619                     883438                      7-33                                    214                             Unknown
rnd-6_family-324   921                     337607                      7-39                                    823                             LTR
rnd-6_family-389   598                     181359                      12-37                                   534                             LTR
rnd-6_family-525   351                     81445                       12-37                                   332                             DNA Transposon
rnd-6_family-817   857                     113735                      10-34                                   132                             DNA Transposon
rnd-6_family-860   698                     572181                      7-36                                    1205                            Unknown
rnd-6_family-916   354                     177062                      9-33                                    658                             LTR
rnd-6_family-923   212                     98551                       10-27                                   714                             LINE
rnd-6_family-967   762                     338536                      10-34                                   735                             LTR
rnd-6_family-1522  206                     22256                       10-32                                   141                             DNA Transposon
rnd-6_family-1657  933                     122342                      7-35                                    262                             DNA Transposon
rnd-6_family-1695  478                     79659                       7-33                                    246                             DNA Transposon
rnd-6_family-2070  687                     190661                      11-37                                   2375                            Unknown
rnd-6_family-2131  307                     37485                       11-36                                   165                             LTR
rnd-6_family-2444  431                     58933                       10-31                                   151                             DNA Transposon
rnd-6_family-2885  372                     144161                      8-36                                    613                             LTR
rnd-6_family-4223  546                     141788                      10-36                                   307                             DNA Transposon
rnd-6_family-5085  194                     67383                       8-29                                    339                             Unknown
rnd-6_family5663   100                     190737                      9-34                                    3802                            LTR
rnd-6_family5936   278                     105884                      10-27                                   460                             LINE
Table 2.2: Summary of de novo TE family discovery
Using TEClass [Abrusan et al. 2009] we classified these 31 unknown TE families. TEClass is
an automated tool for the classification of unknown eukaryotic TEs. According to their mechanism
of transposition, TEs are classified into four categories: DNA transposons, LTRs, LINEs and SINEs.
TEClass employs different classifiers to categorize repeats: DNA transposon versus
retrotransposon, LTRs versus non-LTRs for retrotransposons, LINEs versus SINEs for non-LTR
repeats, and forward versus reverse sequence orientation. In cases where most of the classifiers
are not in agreement, it reports the conflicting results as unknown. TEClass was reported to
achieve 90–97% accuracy in the classification of novel DNA and LTR repeats, and 75% for LINEs and
SINEs. The limitation of this tool is its inability to distinguish between TEs and non-TEs,
meaning that every given sequence will be classified into one of the four categories even if it is
not a TE. Therefore, it is essential to identify the non-TE sequences before performing this
classification. Assuming that the repeat families identified by RepeatModeler are TE consensus
sequences, TEClass categorizes them into 9 DNA transposons, 12 LTRs, 4 LINEs, 0 SINEs and 6
unknowns.
We performed a homology-based search using RepeatMasker on the boreoeutherian ancestor
sequence to annotate all the TE copies belonging to each of the novel families. Once we
identified these copies, we mapped them to the human genome, which essentially means applying
the ancient TE annotation pipeline described in section 2.
After identifying the copies of the 31 novel TE families in the human genome, our goal was to
determine whether some of them may have contributed particular types of functional elements to
the genome. Although many protein-coding genes are well annotated with their biological
functions, non-coding regions typically lack such annotation. The Genomic Regions Enrichment
of Annotations Tool (GREAT) [McLean et al. 2010] is a function prediction program that predicts
the biological role of sets of non-coding genomic regions by analyzing the function of nearby
genes. Using this tool, we investigated the genes and functional elements near the annotated
elements in the human genome belonging to the 31 novel ‘unknown’ TE families reported by
RepeatModeler.
Members of each novel TE family were also analyzed to determine if they overlapped different
types of functional elements in the human genome. These elements were: (i) highly conserved
regions identified by the PhastCons program [Siepel et al. 2005]; (ii) transcription factor binding
sites identified by the ENCODE project [Birney et al. 2007]; (iii) DNAseI hypersensitive regions
also identified by the ENCODE project, which correspond to regions of open chromatin [Birney et
al. 2007]. We used the FeatureBits program to count the number of bases that overlap between
each TE family and each of these types of annotation. We then computed a fold-enrichment
measure, which is the ratio of the observed number of bases in the intersection to the expected
amount of overlap if the two sets of regions were selected randomly in the genome. A p-value
was associated with that fold-enrichment using a simple z-score calculation.
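The fold-enrichment measure can be written out as in the Python sketch below. The text does not specify the null model behind the z-score, so the one used here is an assumption: each TE base is treated as falling inside the feature independently with probability equal to the fraction of the genome covered by the feature (a binomial model). Because neighbouring bases are not independent, such a model tends to overstate significance and is shown only to make the calculation concrete. All function names and numbers are hypothetical.

```python
import math


def fold_enrichment(te_bp, feature_bp, overlap_bp, genome_bp):
    """Observed overlap divided by the overlap expected at random."""
    expected = te_bp * feature_bp / genome_bp
    return overlap_bp / expected


def z_score(te_bp, feature_bp, overlap_bp, genome_bp):
    """z-score of the overlap under the assumed binomial null model."""
    p = feature_bp / genome_bp
    expected = te_bp * p
    sd = math.sqrt(te_bp * p * (1.0 - p))
    return (overlap_bp - expected) / sd


def one_sided_p(z):
    """Upper-tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Toy numbers: 1 kb of TE sequence, a feature covering 10% of a 1 Mb
# genome, and 390 overlapping bases.
fold = fold_enrichment(1_000, 100_000, 390, 1_000_000)
z = z_score(1_000, 100_000, 390, 1_000_000)
p_value = one_sided_p(z)
```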
Among our results, the following are significant:
• Family ‘rnd-5_family-120’ is 3.9-fold enriched for phastCons elements (p-value
= 3×10!!! ), 1.6 fold enriched for DNAseI hypersensitive regions (p-value = 3×10!!" )
and 4.1 fold enriched for CTCF binding sites (p-value = 8×10!!! ), as compared to what
would be expected by chance. This family is close to genes involved in cation
homeostasis, dilated cardiomyopathy ( 3×10!! ), and abnormal renin activity ( 7×10!! ).
• Family ‘rnd-6_family-2131’ has a 3.6-fold enrichment for transcription factor binding
sites STAT3, ERalpha, NF-E2, GATA3. This family is enriched close to genes expressed
in the cerebral cortex (p-value = 4×10!! ).
• Family ‘rnd-6_family-916’ is enriched 4-fold near genes involved in DNA packaging and
chromatin assembly.
• Family ‘rnd-6_family-967’ is enriched 3-fold near genes expressed in the cerebral cortex.
3.4 Conclusion
We believe that these results are very promising and that further analysis of the patterns,
nearby gene enrichment and functional element overlaps of these new TE families is valuable. As
reviewed in chapter 1, TEs contribute to the mutation of protein-coding genes, the rewiring of
regulatory networks, genetic diseases and cancer. However, their diverse biological roles are
still under investigation and many remain to be discovered. Further analysis of these unknown
TE families can potentially provide some insight into the association of TEs with specific TFBS
and with protein-coding genes involved in biological pathways, and ultimately into their impact
on medicine and human health.
Chapter 3: Conclusion and Future Directions
Transposable elements, which are probably the most abundant class of repetitive sequences in
the genomes of all eukaryotes, are homologous DNA fragments present in multiple copies in a
genome. Since their discovery about 65 years ago, they have drawn increasing attention from the
scientific community. Their unique replication mechanisms and sheer abundance make them
interesting biological entities to study. Despite the long-held belief that they are selfish
parasitic sequences, they have contributed to genome size, structure and arrangement. In
addition, the involvement of TEs in host functional genes, regulatory network evolution, genetic
disorders and cancer is now undeniable.
With the rapidly increasing number of sequenced genomes, accurate genome annotation is
more essential than ever. The growing awareness of the challenges in understanding the
dynamic component of the genome is clearly reflected in the increasing number of advanced
methods for TE discovery. While there are mature automated systems for identifying coding
regions, there is no equally robust approach for TEs. Although, due to the nature of TEs, there
might never be a single approach to detection, improving the existing methods (de novo,
homology-based, structure-based, comparative and integrated) and using them in
conjunction would be effective and realistic. This indicates that TE bioinformatics is still in a
growing phase.
This research is motivated by the fact that the existing annotation tools are not capable of
detecting TEs more than 40% diverged from their original consensus sequence. Therefore, the
annotated TEs in the human genome are relatively younger copies detectable by current
techniques. However, this does not mean that all the TE fragments in the human genome are
annotated. Having an ancestral genome can work to our advantage since it contains younger
versions of the ancient TEs in the human genome.
In this thesis, we designed a complete pipeline for the annotation of ancient TEs in the human
genome using both pre-existing methods and our new algorithms. The genome of the
boreoeutherian ancestor, the ancestor of almost all mammals that lived about 75 million years
ago, has been reconstructed with 98-99% accuracy using the multiz alignment and Ancestor
programs. Our pipeline employs RepeatMasker as a classic homology-based TE detection tool to
identify TEs in the boreoeutherian ancestor and human genomes. Using LiftOver, the annotated
TEs in the boreoeutherian sequence are lifted to the human genome. Then, the TEMapper
program, which we have developed, aligns these new TE copies in human to the consensus
sequences in the RepeatMasker TE library and calculates the correct divergence percentages.
Finally, FeatureBits is fed with the TEs annotated in human and the TEs mapped to human to
remove the duplications and provide a single output file containing the revised TE annotation.
Based on our analysis, the total coverage gained by the revised TE annotation is 115 Mb,
corresponding to ~3.5% of the human genome. Because the LINE family is one of the oldest TE
families known in human, inserted before the boreoeutherian ancestor, the elements of this family,
in particular CR1 and L2, account for most of the gain. After that, elements belonging to the
ancient SINE family, excluding Alus, account for a substantial amount of the coverage gained. In
addition, we discovered and classified 31 novel TE families and analyzed their enrichment
around genes and functional and regulatory elements. Associations with (i) CTCF, STAT3,
GATA3, ERalpha and NF-E2 binding sites; (ii) phastCons elements; (iii) DNAseI hypersensitive
regions; (iv) genes involved in cation homeostasis, dilated cardiomyopathy, abnormal renin
activity, cerebral cortex, DNA packaging and chromatin assembly are among the significant ones
recognized.
The analysis of the identity, structure, biological roles, and associations with genes,
regulatory elements, and diseases of the newly discovered TE families is still at a preliminary
stage and needs to be investigated further in the future. Aside from that, other human ancestors
can be inferred and fed to the ancient TE annotation pipeline to identify more TE copies in
human or other mammals. The same principle can be applied to other eukaryotes, particularly
plants, since TEs are even more widespread and more active in their genomes compared to mammals
[Feschotte et al. 2002].
Another interesting direction of research that the ancestral sequence makes possible is
investigating other genomic elements such as transcription factor binding sites (TFBS), regulatory
elements, and pseudogenes. Pseudogenes [Pink & Carter 2013], in particular, are another
interesting class of biological entities to study. They are non-functional copies of normal
functional protein-coding genes that either have lost their function at the transcription level,
the translation level, or both, or are no longer expressed in the cell. They arise either during
DNA replication from duplication in the DNA sequence or during reverse transcription of mRNA
with subsequent reintegration of the cDNA into the genome. They can provide records of how
genomic DNA has changed to ensure the survival of the organism. In addition, they can be used as
a model of evolutionary events, for example to estimate the rates of nucleotide substitution,
insertion and deletion in the genome. Despite the loss of protein-coding function, pseudogenes are
similar to TEs in that they can have regulatory roles [Muro et al. 2011] and contribute to
diseases [Pink et al. 2011].
An approach similar to the one presented in [Svensson et al. 2006] can be used for pseudogene
prediction in an ancestral genome. Our pipeline presented in Figure 2.2 can be adapted to locate
pseudogenes using BLAST instead of RepeatMasker. Applying the adapted pipeline to the
human and boreoeutherian ancestor genomes, we expect to find ancient pseudogenes that have
been lost in the human genome and are not identifiable in human by current methods. All of this
suggests that there is much to be gained from studying ancestral DNA and that this work has only
scratched the surface of this new field.
References
Abrusan G, Grundmann N, DeMeester L, Makalowski W: TEclass: a tool for automated
classification of unknown eukaryotic transposable elements. Bioinformatics 2009,
25:1329-1330.
Altschul SF, Gish W, Miller W, Myers W, Lipman J: Basic local alignment search tool.
J Mol Biol 1990, 215:403–10.
Andrieu O, Fiston AS, Anxolabehere D, Quesneville H: Detection of transposable elements
by their compositional bias. BMC Bioinformatics 2004, 5:94.
Arkhipova I, Pyatkov K, Meselson M, Evgen’ev M: Retroelements containing introns in
diverse invertebrate taxa. Nat. Genet. 2003, 33:123-124.
Badyaev A: Stress-induced variation in evolution: from behavioural plasticity to genetic
assimilation. Proc. Biol. Sci. 2005, 272:877-886.
Bao Z, Eddy S: Automated de novo identification of repeat sequence families in
sequenced genomes. Genome Res 2002, 12:1269-76.
Bateman A, Birney E, Cerruti L, Durin R, Etwiller L, Eddy R, Griffiths-Jones S, Howe L,
Marshall M, Sonnhammer L: The Pfam protein families database. Nucleic Acids Res 2002,
30:276–80.
Belancio P, Hedges J, Deininger P: Mammalian non-LTR retrotransposons: for better or
worse, in sickness and in health. Genome Res. 2008, 18: 343-358.
Bénit L, Lallemand JB, Casella JF, Philippe H, Heidmann T: ERV-L elements: a family of
endogenous retrovirus-like elements active throughout the evolution of mammals. J
Virol 1999, 73:3301-3308.
Bennett EA, Coleman LE, Tsui C, Pittard W, Devine S: Natural genetic variation caused by
transposable elements in humans. Genetics 2004, 168:933-51.
Berezikov E, Bucheton A, Busseau I: A search for reverse transcriptase-coding sequences
reveals new non-LTR retrotransposons in the genome of Drosophila melanogaster.
Genome Biol 2000, 1:RESEARCH0012.
Bergman C, Quesneville H: Discovering and detecting transposable elements in genome
sequences. Brief Bioinform. 2007, 8(6):382-392.
Birney E, Stamatoyannopoulos J, et al.: Identification and analysis of functional elements
in 1% of the human genome by the ENCODE pilot project. Nature 2007, 447(7146): 799-816.
Blanchette M, Green E, Miller W, Haussler D: Reconstructing large regions of an
ancestral mammalian genome in silico. Genome Res 2004, 14(12):2412-2423.
Blanchette M, Kent W, Riemer C, Elnitski L, Smit A, Roskin K, Baertsch R, Rosenbloom K,
Clawson H, Green E, Haussler D, Miller W: Aligning multiple genomic sequences with the
threaded blockset aligner. Genome Research 2004, 14(4): 708-715.
Bourque G, Pevzner P: Genome-Scale Evolution: Reconstructing Gene Orders in the
Ancestral Species. Genome Res 2002, 12:26-36.
Britten R, Kohne D: Repeated sequences in DNA. Hundreds of thousands of copies of
DNA sequences have been incorporated into the genomes of higher organisms. Science
1968, 161:529–540.
Brosius J, Gould S: On ‘genomenclature’: a comprehensive (and respectful) taxonomy
for pseudogenes and other ‘junk DNA’. Proc Natl Acad Sci USA
1992, 89: 10706-10710.
Brouha B, Schustak J, Badge RM, Lutz-Prigge S, Farley AH, Moran JV, Kazazian HH Jr:
Hot L1s account for the bulk of retrotransposition in the human population. Proc Natl
Acad Sci USA 2003, 100:5280-5285.
Callinan A, Batzer A: Retrotransposable elements and human disease. Genome Dyn.
2006, 1: 104-115.
Caspi A, Pachter L: Identification of transposable elements using multiple alignments of
related genomes. Genome Res 2006, 16:260-70.
Chen M, Chuzhanova N, Stenson PD, Férec C, Cooper N: Meta-analysis of gross insertions
causing human genetic disease, novel mutational mechanisms and the role of replication
slippage. Hum. Mutat. 2005, 25: 207-221.
Chen M, Stenson D, Cooper N, Ferec C: A systematic analysis of LINE-1 endonuclease-dependent retrotranspositional events causing human genetic disease. Hum. Genet. 2005,
117:411-427.
Cordaux R, Batzer M: The impact of retrotransposons on human genome evolution.
Nature Reviews Genetics 2009, 10:691-703.
Craig N: Unity in transposition reactions. Science 1995, 270:253–254.
Deininger L, Batzer A: Alu repeats and human disease. Mol. Genet. Metab. 1999, 67: 183-193.
Diallo A, Makarenkov V, Blanchette M: Ancestors 1.0: a web server for ancestral
sequence reconstruction. Bioinformatics 2010, 26(1):130-1.
Divoky V, Indrak K, Murg M, Brabec V, Huisman J, Prchal T: A novel mechanism of beta
thalassemia: the insertion of L1 retrotransposable element into beta globin IVSII.
Blood 1996, 88: 148a.
Eddy S: Profile hidden Markov models. Bioinformatics 1998, 14: 755–763.
Edgar R, Myers E: PILER: identification and classification of genomic repeats.
Bioinformatics 2005, 21(suppl. 1): 52–58.
Feschotte C, Jiang N, Wessler SR: Plant transposable elements: where genetics meets
genomics. Nature Reviews Genetics 2002, 3(5): 329-341.
Feschotte C, Pritham E: DNA transposons and the evolution of eukaryotic genomes. Annu
Rev Genet. 2007, 41:331–368.
Feschotte C: Transposable elements and the evolution of regulatory networks. Nat Rev
Genet 2008, 9:397-405.
Flutre T, Duprat E, Feuillet C, Quesneville H: Considering Transposable Element
Diversification in De Novo Annotation Approaches. PLoS One 2011, 6: e16526.
Gentles A, Wakefield M, Kohany O, Gu W, Batzer M, Pollock D, Jurka J: Evolutionary
dynamics of transposable elements in the short-tailed opossum Monodelphis domestica.
Genome Res. 2009, 17: 992-1004.
Gregory T: Synergy between sequence and size in Large-scale genomics. Nature Reviews
Genetics 2005, 6:699-708.
Gu W, Castoe T, Hedges D, Batzer M, Pollock D: Identification of repeat structure in
large genomes using repeat probability clouds. Anal Biochem 2008, 380(1):77-83.
Hancks C, Kazazian H: Active human retrotransposons: variation and disease. Curr Opin
Genet Dev 2012, 22(3):191-203.
Huang R, Schneider M, Lu Y, Niranjan T, Shen P, Robinson A, Steranka P, Valle D, Civin I,
Wang T, Wheelan J, Ji H, Boeke D, Burns H: Mobile interspersed repeats are major
structural variants in the human genome. Cell 2010, 141(7):1171-1182.
Huang X: Global Sequence Alignment. Computer Applications in the Biosciences 1994, 10:
227-235.
Hutchinson B, Andrew E, McDonald H, Goldberg P, Graham R, Rommens M, Hayden R: An
Alu element retroposition in two families with Huntington disease defines a new active
Alu subfamily. Nucleic Acids Research 1993, 21(15): 3379-3383.
Jasinska A, Krzyzosiak W: Repetitive sequences that shape the human transcriptome.
FEBS Letters 2004, 567(1):136–141.
Jordan I, Rogozin I, Glazko G, Koonin E: Origin of a substantial fraction of human
regulatory sequences from transposable elements. Trends Genet. 2003, 19, 68–72.
Jurka J, Kapitonov V, Kohany O, Jurka M: Repetitive Sequences in Complex Genomes:
Structure and Evolution. Annu. Rev. Genomics Hum. Genet. 2007, 8:241-59.
Jurka J, Kapitonov V, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase
Update, a database of eukaryotic repetitive elements. Cytogenetic and Genome Research
2005, 110:462-467.
Jurka J, Klonowski P, Dagman V, Pelton P: CENSOR—a program for identification and
elimination of repetitive elements from DNA sequences. Comput Chem 1996, 20:119–21.
Kapitonov V, Jurka J: Rolling-circle transposons in eukaryotes. Proc Natl Acad Sci USA
2001, 98:8714–19.
Kapitonov V, Jurka J: Self-synthesizing DNA transposons in eukaryotes. Proc. Natl. Acad.
Sci. USA 2006, 103:4540–45.
Karamerov A, Vassetzky S: SINEs. Wiley Interdiscip Rev RNA 2011, 2(6): 772-86.
Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple
sequence alignment based on fast Fourier transform. Nucl. Acid Res 2002, 30(14):3059-66.
Kazazian H, Wong C, Youssoufian H, Scott A, Phillips D, Antonarakis S: Hemophilia A
resulting from de novo insertion of L1 sequences represents a novel mechanism for
mutation in man. Nature 1988, 332: 164-166.
Kent W, Sugnet C, Furey T, Roskin K, Pringle T, Zahler A, Haussler D: The human genome
browser at UCSC. Genome Res. 2002, 12(6):996-1006.
Kohany O, Gentles AJ, Hankus L, Jurka J: Annotation, submission and screening of
repetitive elements in Repbase: RepbaseSubmitter and Censor. BMC Bioinformatics
2006, 7:474.
Kolpakov R, Kucherov G: Finding maximal repetitions in a word in linear time. Symp
Foundations of Computer Science 1999, 40: 596-604.
Kondo-Iida E, Kobayashi K, Watanabe M, Sasaki J, Kumagai T, Koide H, Saito K, Osawa
M, Nakamura Y, Toda T: Novel mutations and genotype phenotype relationships in 107
families with Fukuyama-type congenital muscular dystrophy (FCMD). Hum. Mol. Genet.
1999, 8:2303-2309.
Kolpakov R, Bana G, Kucherov G: MREPS: efficient and flexible detection of tandem
repeats in DNA. Nucl. Acid Res 2003, 31(13): 3672-3678.
Kuhn M, Haussler D, Kent J: The UCSC genome browser and associated tools. Brief
Bioinform 2012, 14(2):144-161.
Lerat E: Identifying repeats and transposable elements in sequenced genomes: how to
find your way through the dense forest of programs. Heredity 2010, 104: 520–533.
Li R, Ye J, Li S, Wang J, Han Y, Ye C, Wang J, Yang H, Yu J, Wong G, Wang J: ReAS:
Recovery of ancestral sequences for transposable elements from the unassembled reads
of a whole genome shotgun. PLoS Comput Biol 2005, 1:e43.
Lorenzi H, Robledo G, Levin M: The VIPER elements of trypanosomes constitute a novel
group of tyrosine recombinase-enconding retrotransposons. Mol. Biochem. Parasitol.
2006, 145:184-94.
Ma J: Reconstructing the history of large-scale genomic changes: biological questions
and computational challenges. J Comput Biol 2011, 18(7): 879-93.
Marino-Ramirez L, Jordan I: Transposable element derived DNaseI-hypersensitive sites in
the human genome. Biol. Direct 2006, 1, 20.
McCarthy E, McDonald J: LTR_STRUC: a novel search and identification program for
LTR retrotransposons. Bioinformatics 2003, 19:362-367.
McLean C, Bristor D, Hiller M, Clarke S, Schaar B, Lowe C, Wenger A, Bejerano G:
GREAT improves functional interpretation of cis regulatory regions. Nat. Biotechnol
2010, 28(5): 495-501.
Miller W, Rosenbloom K, Hardison RC, Hou M, Taylor J, Raney B, Burhans R, King DC,
Baertsch R, Blankenberg D, Kosakovsky Pond SL, Nekrutenko A, Giardine B, Harris RS,
Tyekucheva S, Diekhans M, Pringle TH, Murphy WJ, Lesk A, Weinstock GM, Lindblad-Toh
K, Gibbs RA, Lander ES, Siepel A, Haussler D, Kent WJ: 28-way vertebrate alignment
and conservation track in the UCSC Genome Browser. Genome Res 2007, 17(12): 1797-808.
Mills R, Bennett E, Iskow R, Devine S: Which transposable elements are active in the
human genome? Trends Genet. 2007, 23:183–191.
Muro M, Mah N, Andrade-Navarro A: Functional evidence of post-transcriptional
regulation by pseudogenes. Biochimie 2011, 93: 1916-1921.
Murphy W, Eizirik E, Johnson W, Zhang Y, Ryder O, O'Brien S: Molecular phylogenetics
and the origins of placental mammals. Nature 2001, 409:614-618.
Negrini S, Gorgoulis G, Halazonetis TD: Genomic instability - an evolving hallmark of
cancer. Nat Rev Mol Cell Biol 2010, 11:220-228.
Nekrutenko A, Li H: Transposable elements are found in a large number of human
protein-coding genes. Trends in Genetics 2001, 17: 619–621.
Ostertag E, Kazazian J: Biology of mammalian L1 retrotransposons. Annu. Rev Genet.
2001, 35:501–538.
Ostertag M, Goodier L, Zhang Y, Kazazian H: SVA Elements Are Nonautonomous
Retrotransposons that Cause Disease in Humans. Am J Hum Genet. 2003, 73(6): 1444-1451.
Paten B, Herrero J, Beal K, Birney E: Sequence progressive alignment, a framework for
practical large-scale probabilistic consistency alignment. Bioinformatics 2009, 25(3):295-301.
Paten B, Herrero J, Fitzgerald S, Beal K, Flicek P, Holmes I, Birney E: Genome-wide
nucleotide-level mammalian ancestor reconstruction. Genome Res 2008, 18(11):1829-43.
Pearson W: Flexible sequence similarity searching with the FASTA3 program package.
Methods Mol Biol 2000, 132:185-219.
Pink C, Carter R: Pseudogenes as regulators of biological function. Essays Biochem 2013,
54: 103-112.
Pink RC, Wicks K, Caley DP, Punch EK, Jacobs L, Carter DR: Pseudogenes: pseudo-functional or key regulators in health and disease? RNA 2011, 17: 792-798.
Poulter R, Goodwin T: DIRS-1 and the other tyrosine recombinase retrotransposons.
Cytogenet. Genome Res. 2005, 110:575–588.
Price AL, Jones NC, Pevzner PA: De novo identification of repeat families in large
genomes. Bioinformatics 2005, 21 suppl. 1:i351-8.
Quesneville H, Bergman CM, Andrieu O, Autard D, Nouaud D, Ashburner M, Anxolabehere
D: Combined evidence annotation of transposable elements in genome sequences.
PLoS Comput Biol 2005, 1:e22.
Quesneville H, Nouaud D, Anxolabehere D: Detection of new transposable element
families in Drosophila melanogaster and Anopheles gambiae genomes. J Mol Evol 2003,
57(suppl. 1): S50–9.
Rho M, Choi JH, Kim S, Lynch M, Tang H: De novo identification of LTR
retrotransposons in eukaryotic genomes. BMC Genomics 2007, 8:90.
Cordaux R, Hedges DJ, Herke SW, Batzer MA: Estimating the
retrotransposition rate of human Alu elements. Gene 2006, 371: 134-137.
Sharp A, Cheng Z, Eichler E: Structural variation of the human genome. Annu. Rev.
Genom. Hum. Genet. 2006, 7:407–442.
Siepel A, Bejerano G, Pedersen J, Hinrichs A, Hou M, Rosenbloom K, Clawson H, Spieth J,
Hillier LW, Richard S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler
D: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes.
Genome Res 2005, 15(8): 1034-1050.
Skow C, McCabe T, Mills E, Torene S, Pittard S, Neuwald F, Van Meir G, Vertino M,
Devine E: Natural mutagenesis of human genomes by endogenous retrotransposons. Cell
2010, 141:1253-1261.
Slotkin K, Martienssen R: Transposable elements and the epigenetic regulation of the genome.
Nature Reviews Genetics 2007, 8: 272-285.
Smit A, Hubley R, Green P: RepeatMasker Open-3.0. Institute for Systems Biology 1996-2010. http://www.repeatmasker.org.
Smit AFA, Hubley R: RepeatModeler Open-1.0. 2008-2010 http://www.repeatmasker.org.
Solyom S, Kazazian H: Mobile elements in the human genome: implications for disease.
Genome Med. 2012, 4:12.
Svensson O, Arvestad L, Lagergren J: Genome-wide Survey for Biologically Functional
Pseudogenes. PLoS Comput. Biol. 2006, 2(5): e46.
Symer E, Connelly C, Szak T, Caputo M, Cost J, Parmigiani G, Boeke D: Human L1
retrotransposition is associated with genetic instability in vivo. Cell 2002, 110:327-338.
Thomas A, Paquola C, Muotri R: LINE-1 retrotransposition in the nervous system. Annu
Rev Cell Dev Biol. 2012, 28:555-73.
Vidaud D, Vidaud M, Bahnak R, Siguret V, Gispert Sanchez S, Laurian Y, Meyer D,
Goossens M, Lavergne M: Haemophilia B due to a de novo insertion of a human specific
Alu subfamily member within the coding region of the factor IX gene. Eur J Hum Genet.
1993, 1(1):30-6.
Volff J: Turning junk into gold: domestication of transposable elements and the creation
of new genes in eukaryotes. BioEssays 2006, 28:913-922.
Wallace R, Andersen L, Saulino A, Gregory P, Glover T, Collins F: A de novo Alu insertion
results in neurofibromatosis type 1. Nature 1991, 353: 864-866.
Wang T, Zeng J, Lower C, Sellers R, Salama S, Yang M, Burgess S, Brachmann R, Haussler
D: Species-specific endogenous retroviruses shape the transcriptional network of the
human tumor suppressor protein p53. Proc. Natl Acad. Sci. USA 2007, 104:18613-18618.
Westesson O, Lunter G, Paten B, Holmes I: Accurate reconstruction of insertion-deletion
histories by statistical phylogenetics. PLoS One 2012, 7(4):e34572.
Wheeler T, Clements J, Eddy S, Hubley R, Jones T, Jurka J, Smit A, Finn R: Dfam: a
database of repetitive DNA based on profile Hidden Markov Models. Nucleic Acids
Research 2013, 41:D70-82.
Wheeler T, Eddy S: nhmmer: DNA homology search with profile HMMs. Bioinformatics
2013, 29:2487-2489.
Wimmer K, Callens T, Wernstedt A, Messiaen L: The NF1 gene contains hotspots for L1
endonuclease-dependent de novo insertion. PLoS Genet 2011, 7:e1002371.