Faculty of Bioscience Engineering
Academic year 2014-2015
Methodological contributions for the detection
of (differential) imprinting from omics data
Tine Goovaerts
Promoters:
Prof. dr. ir. Tim De Meyer
Prof. dr. ir. Wim Van Criekinge
Tutor: ir. Sandra Steyaert
Master’s dissertation submitted in partial fulfilment of the requirements for
the degree of
Master in Bioscience Engineering: Cell and Gene Biotechnology
The author and promoters give permission to make this thesis available for consultation and to copy parts of it for personal use. Any other use is subject to copyright law; in particular, the source must be explicitly acknowledged when citing results from this thesis.
Ghent, June 2015
Dankwoord (Acknowledgements)
I would like to begin this master’s thesis with a word of thanks. First and foremost, I want to thank my promoter, Prof. dr. ir. Tim De Meyer. He gave me the chance to find my own way in the complex world of bioinformatics, and he further kindled my interest in bioinformatics and in statistical thinking in particular. Above all, however, I want to thank him for always being ready with tips and solutions whenever I was stuck and could no longer see a way forward. Tim, thank you for always being available for my questions; thanks to you I learned a great deal over the past year. You also read my thesis thoroughly several times and always gave very good suggestions. Without your excellent guidance this master’s thesis would not have succeeded.
I equally want to thank ir. Sandra Steyaert. I could always put my questions to her as well, and each time she took the time to explain everything thoroughly. She also read and corrected my thesis, which helped me greatly.
I am also grateful to Prof. dr. ir. Wim Van Criekinge, who gave me the opportunity and the freedom to work on this topic. I would likewise like to thank the BioBix group. They gladly shared their knowledge with me and were always ready to help.
Finally, I am grateful to my parents for the upbringing and support they have given me. Without them I would not be where I am now, and I could not have written this thesis. I also thank them, my brothers and my sister for proofreading this thesis (with some reluctance) and for supporting me throughout the year.
Contents
Dankwoord (Acknowledgements)
List of Abbreviations
Summary
Samenvatting (Dutch summary)
Introduction
1 Literature study
1.1 Epigenetics
1.1.1 DNA methylation
1.1.2 Histone modifications
1.1.3 RNA
1.1.4 Epigenetics and disease
1.2 Imprinting
1.3 Next-generation sequencing
1.3.1 Illumina sequencing
1.4 Methods to detect DNA methylation
1.5 RNA-Seq
1.6 Analysing next-generation sequencing data
1.6.1 SNP and genotype calling
1.6.2 SeqEM
1.7 The MAM-pipeline
1.7.1 Rationale
1.7.2 The pipeline
1.7.3 Disadvantages
1.8 TCGA: breast cancer
2 Materials and Methods
2.1 Hardware
2.2 Software
2.2.1 R
2.2.2 SeqEM
2.3 Statistical methods relevant to this master’s thesis
2.3.1 The Kolmogorov-Smirnov test
2.3.2 The Wilcoxon Rank Sum test
2.3.3 The likelihood principle
2.4 Detection of (differential) imprinting
2.4.1 Generation of random data and estimation of the allele frequency and error rate
2.4.2 Kolmogorov-Smirnov and Wilcoxon Rank Sum test on imprinted SNP data
2.4.3 The likelihood ratio test to detect (differential) imprinting
2.4.4 TCGA data
3 Results
3.1 Parameter estimation in random SNP data using SeqEM
3.1.1 SeqEM on sequences with only two alleles
3.1.2 SeqEM on sequences with four alleles
3.1.3 SeqEM on imprinted sequences
3.2 Detection of imprinting
3.2.1 Drawbacks of previous methodology
3.2.2 The likelihood ratio test to detect imprinting in simulated data
3.2.3 Application on TCGA data
3.3 Detection of differential imprinting
3.3.1 Comparison of two subsets of simulated data with KS and WRS test
3.3.2 Simulation studies of the likelihood ratio test
4 Discussion
4.1 Detection of imprinting
4.2 Detection of differential imprinting
4.3 Computational efficiency
4.4 Statistical framework
5 Conclusion
List of Abbreviations
5-hmC       5-hydroxymethylcytosine
A           adenine
ASE         allele-specific expression
bp          base pairs
C           cytosine
CpG         cytosine-phosphate-guanine
CRT         cyclic reversible termination
DMR         differentially methylated region
DNA         deoxyribonucleic acid
DNMT        DNA methyltransferase
dsRNA       double-stranded RNA
EM          Expectation-Maximisation
FDA         Food and Drug Administration
fdr         false discovery rate
G           guanine
HDAC        histone deacetylase
ICE         imprinting control element
ICR         imprinting control region
KS test     Kolmogorov-Smirnov test
lncRNA      long non-coding RNA
LOI         loss of imprinting
LRT         likelihood ratio test
MAM         monoallelic DNA methylation
MBD         methyl binding domain
miRNA       microRNA
MLE         maximum likelihood estimation
mRNA        messenger RNA
NCI         National Cancer Institute
ncRNA       non-coding RNA
NGS         next-generation sequencing
NHGRI       National Human Genome Research Institute
PMF         probability mass function
PTM         post-translational modification
RNA         ribonucleic acid
RNA-Seq     RNA sequencing
RNAi        RNA interference
RRBS        reduced representation bisulfite sequencing
SE          sequencing error rate
siRNA       small interfering RNA
SNP         single-nucleotide polymorphism
ssDNA       single-stranded DNA
T           thymine
TCGA        The Cancer Genome Atlas
TET         ten-eleven translocation
WGBS        whole genome bisulfite sequencing
WRS test    Wilcoxon Rank Sum test
Summary
In this master’s thesis, different methodologies for the detection of (differential) imprinting were developed. Besides genetics, epigenetic phenomena are also important regulators in cells. Epigenetic modifications are, among other things, crucial in monoallelic gene expression, in which only one of the two alleles is expressed. When this is determined by the parental origin, i.e. when only the paternal or only the maternal copy is active, it is called imprinting. Loss of imprinting can lead to many different diseases, such as cancer. It is hence of major interest to study imprinting and to search for loci that are either imprinted or have lost their imprinting pattern.
A method that allows detection of imprinted loci in MethylCap-seq data was developed at the BioBix lab by Steyaert et al. Though the method’s efficacy was already proven, several aspects of it, such as its computational load, could still be improved. Hence, the aim of this master’s thesis was to improve that methodology, more specifically the SNP calling and the statistical framework, and to enable detection of differential imprinting.
Simulation studies were performed to test the efficiency of a methodology in the discovery of imprinted loci. Furthermore, methods based on the comparison of control and tumour samples were developed to detect loss of imprinting in cancer. The efficiency of these methodologies was ascertained by evaluating them on randomly generated data. It was demonstrated that detection of imprinting was very effective, but further optimisation of the methods to screen for loss of imprinting is still necessary.
The developed method for the detection of imprinting was also tested on RNA-seq control samples from the breast cancer data obtained from TCGA. Some generally known imprinted genes were discovered, which supports the validity of the method, though additional filtering steps are required to remove aberrant SNPs. A next step is to apply the other methodologies as well. However, as these still have to be optimised, this was not yet done.
Although the methodologies developed in this master’s thesis are not yet perfect and have to be further optimised, a proof of concept for the efficient detection of (loss of) imprinting is provided here. When the development is complete, the whole TCGA dataset can be studied and this can support cancer research. Extension to other diseases and detection of loss of heterozygosity will then be possible as well.
Samenvatting (Dutch summary)
In this master’s thesis, different methods for the detection of (differential) imprinting were developed. Besides genetics, epigenetics is also important in the development of cells. Epigenetic modifications are, among other things, important in monoallelic expression, in which only one of the two alleles is expressed. When this is determined by the parental origin, it is called imprinting. Loss of imprinting patterns can lead to various diseases, such as cancer. It is therefore of great interest to study imprinting and to search for loci that are imprinted or that have lost their imprinting pattern.
A methodology for the detection of imprinting in MethylCap-seq data was developed in the BioBix group by Steyaert et al. Although the efficacy of this method has already been demonstrated, some of its drawbacks, such as its computational intensity, still need to be addressed. The aim of this thesis was to optimise this method, namely the SNP calling and the statistical framework, and to extend it to also detect loss of imprinting.
Simulation studies were performed to test the efficiency of a method in the discovery of imprinted loci. Methodologies were also developed to detect loss of imprinting in cancer by comparing control samples with tumour samples. By evaluating the methods on randomly generated data, it was established that they are suitable for detecting (loss of) imprinting. Detection of imprinting was very efficient with the developed method. However, further optimisation is still needed for the detection of loss of imprinting.
The developed method for the detection of imprinting was tested not only on simulated data, but also on RNA-seq control samples from breast cancer data obtained from TCGA. Known imprinted genes were recovered, which demonstrates the efficacy of the developed methodology. Additional data filtering is, however, still needed to remove aberrant SNPs from the data. A next step is to test the other methods as well. As these still need to be improved, this has not yet been done here.
Although the methodologies developed in this master’s thesis are not yet perfect and certainly need further optimisation, a proof of concept for the efficient detection of (loss of) imprinting is provided here. Once development is complete, the full TCGA dataset can be tested, which will greatly advance cancer research. Extension to other diseases and detection of loss of heterozygosity will then be possible as well.
Introduction
In addition to classical genetics, which controls among other things the development and regulation of cells, epigenetic phenomena have important regulatory functions in cells. Epigenetics is defined as the study of heritable changes in gene expression that are not due to changes in DNA sequence. One of the key players in epigenetics is DNA methylation, which often leads to silencing of genes. Furthermore, DNA methylation is involved in allele-specific expression (ASE). In diploid cells, all genes are present in two copies, a maternal and a paternal one. Most of these genes are expressed from both copies to an equal extent, which is called biallelic expression. However, some genes are monoallelically expressed: typically due to differential DNA methylation, only one copy is active and expressed. The best-known ASE examples are X-inactivation and imprinting. The latter is the silencing or expression of genes depending on their parental origin.
Dysregulation of ASE, and more specifically of imprinting, is associated with many diseases. Loss of imprinting is, for example, a common feature of many cancer types. In previous years, a data-analytical framework to screen for imprinted loci using MBD-seq data was developed at the BioBix lab. In this thesis that methodology is further improved, more specifically its SNP calling and statistical framework, and tested on a different data type (RNA-Seq). Furthermore, the methodology is extended to also detect loss of imprinting in cases versus controls. Chapter 1 briefly describes the literature on, among other topics, epigenetics and imprinting. Chapter 2 discusses the software, hardware and statistics used in this thesis, and describes the ideas behind the developed methodologies in detail. Results of simulation studies and of the detection of imprinting in TCGA data are presented in chapter 3. Chapter 4 then follows with a discussion of the developed methods and the obtained results. Finally, chapter 5 summarises this thesis in a general conclusion.
Chapter 1
Literature study
1.1 Epigenetics
Epigenetics was first described by Conrad Waddington in 1942 as “the branch of biology which studies the causal interactions between genes and their products, which bring the phenotype into being”. 1 Research in this field has since grown massively. Literally, epigenetics means “on top of genetics”, that is, on top of the genetic information. It describes the heritable changes (both mitotic and meiotic) in gene expression which are not caused by changes in the DNA sequence. 2 Waddington illustrated this with figures representing “the epigenetic landscape” (Fig. 1.1). The ball rolling down represents a stem cell and the landscape the fertilised egg. Depending on which groove the ball ends up in, it will become a different type of cell, tissue or organ. The shape of this landscape is determined by the interaction of genes and the environment. The genes are represented as the pegs and the ropes are the “chemical tendencies” of the genes. The landscape is shaped by how these genes are placed, how the ropes are pulled and how they all interact. 3
Figure 1.1: The epigenetic landscape as depicted by Waddington. A stem cell is rolling down a landscape determining what type of cell it will become (left). The bottom of the landscape
shows the different mechanisms and interactions shaping this landscape (right). 4
Accordingly, cellular differentiation came to be understood as an epigenetic phenomenon, caused by alterations in the epigenetic landscape instead of changes in genetic information. Therefore, epigenetics is seen as the bridge between genotypes and phenotypes. 2 This can easily be understood by looking at monozygotic twins. They are regarded as genetically identical, yet their phenotypes can differ. As genetic changes cannot explain these differences, it is believed that they are caused by variation in the epigenetic landscape. 5,6
The main players in epigenetic phenomena are DNA methylation and histone modifications. 7 However, RNA can also alter gene expression and has a role in epigenetic control. 8 It is important to note that these three mechanisms interact and that the outcome is the result of all three combined.
1.1.1 DNA methylation
One of the key players in epigenetics is DNA methylation, in which a methyl group is added to the carbon 5-position of a cytosine by methyltransferases (Fig. 1.2). In mammals, methyl additions mainly take place on cytosines in CpG dinucleotides. 9 CpGs are often found clustered in regions with a high GC content (CpG islands), which generally remain unmethylated. Outside of these regions the genome is depleted of CpGs, yet it is on these scattered dinucleotides that most of the methylation events occur. 10
Figure 1.2: Schematic overview of the methylation process. The addition of a methyl group to the
5-position of a cytosine is mediated by methyltransferases. The methyl group donor
used by methyltransferases is S-adenosylmethionine. The process can also be reversed
by active or passive demethylation. 11
Generally, CpG-rich regions are located in promoters (close to transcription start sites) and first exons, indicating their association with genes. 12 When gene promoters are methylated, the expression of the gene is typically blocked and the gene becomes silent. 13 However, this is slightly over-simplified. Three possible mechanisms explain the silencing of genes due to promoter methylation. 9,14 Firstly, the addition of a methyl group prevents certain transcription factors from binding to the DNA, thus hindering normal transcription. Secondly, the chromatin structure of methylated and unmethylated DNA is different, i.e. the appearance of the major groove, to which DNA-binding proteins bind, changes. Depending on the position of the methylated region relative to the transcription start site, the rate of transcription can be increased or decreased by this structural modification. 15 Lastly, adding methyl groups can directly result in binding of specific transcriptional repressors. For example, methyl-CpG-binding proteins bound to methyl groups attract histone deacetylases (HDACs). These enzymes remove acetyl groups and change the chromatin structure to repress transcription.
These three mechanisms do not exclude each other, but can rather be correlated. 9,14 DNA methylation patterns are acquired through a balance between methylation (by methyltransferases) and demethylation (active or passive). 16 Most genes, such as housekeeping genes, have promoters that are always free from methylation to ensure their constant transcriptional activity. Other regions, however, such as repetitive and transposable sequences, are constitutively and heavily methylated to prevent their transposition, thereby increasing genomic stability. 7
Gene bodies can be methylated as well. Here, however, methylation is correlated with active transcription of genes instead of silencing. 10 Furthermore, as exons contain more methylation than introns and the transition occurs at the boundaries between the two, gene body methylation is most likely involved in alternative splicing. 17 Additionally, enhancers and insulators can be methylated; however, the exact function of these events is only beginning to be unravelled. 10
Cellular methylation patterns need to be stably transmitted to daughter cells. Hence, the patterns have to be reproduced after replication. This is done by DNA methyltransferase 1 (DNMT1), the maintenance methyltransferase, which adds methyl groups to hemimethylated DNA strands. 18 New methylation patterns (de novo methylation) are established by DNMT3a and DNMT3b, which are especially important during embryogenesis. 19 However, DNMT3b has some maintenance activity as well. 14 Another methyltransferase, DNMT2, has been identified, but its activity is not yet certain; it has been reported to be involved in tRNA methylation. 20
DNA demethylation can be either passive or active. The former occurs when the maintenance of DNA methylation is not working properly; the amount of methylation is then greatly reduced after DNA replication. 21 Active DNA demethylation is the use of enzymes to directly modify the methyl group. For example, the ten-eleven translocation proteins 1-3 (TET1-3) can convert 5-methylcytosine into 5-hydroxymethylcytosine (5-hmC), which leads to demethylation either passively or actively, for example through DNA repair. 22
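The interplay between maintenance methylation and passive demethylation described above can be illustrated with a small toy simulation. The sketch below is not taken from the cited literature: the function names and the single per-site probability `maintenance_efficiency` are assumptions made for illustration, and de novo methylation and active demethylation are deliberately left out. Each replication gives every daughter duplex one parental strand and one new, unmethylated strand; the maintenance step then copies each parental mark onto the new strand with the given probability.

```python
import random


def replicate(duplex, maintenance_efficiency, rng):
    """Replicate one duplex and return its two daughter duplexes.

    A duplex is a pair of strands; each strand is a list of booleans, one
    per CpG site (True = methylated). Every daughter keeps one parental
    strand and gets a new strand on which a parental mark is copied with
    probability `maintenance_efficiency` (hypothetical parameter).
    """
    daughters = []
    for parental in duplex:
        new_strand = [m and rng.random() < maintenance_efficiency for m in parental]
        daughters.append((list(parental), new_strand))
    return daughters


def methylated_fraction(duplexes):
    """Fraction of methylated CpG positions over all strands."""
    marks = [m for duplex in duplexes for strand in duplex for m in strand]
    return sum(marks) / len(marks)


def simulate(n_sites, n_divisions, maintenance_efficiency, seed=0):
    """Follow the methylated fraction, starting from one fully methylated duplex."""
    rng = random.Random(seed)
    population = [([True] * n_sites, [True] * n_sites)]
    fractions = [methylated_fraction(population)]
    for _ in range(n_divisions):
        population = [
            daughter
            for duplex in population
            for daughter in replicate(duplex, maintenance_efficiency, rng)
        ]
        fractions.append(methylated_fraction(population))
    return fractions


if __name__ == "__main__":
    print("perfect maintenance:", simulate(10, 3, 1.0))
    print("no maintenance:     ", simulate(10, 3, 0.0))
```

In this toy model, perfect maintenance preserves the pattern indefinitely, whereas a complete loss of maintenance halves the methylated fraction at every division, which is the dilution effect referred to as passive demethylation. The population doubles at each division, so the number of divisions should be kept small.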
As described above, 5-hmC, in which a hydroxymethyl group is present at the carbon 5-position of a cytosine, is an important intermediate in the DNA demethylation pathway. However, the levels of 5-hmC are higher than expected for an intermediate, and 5-hmC is hence thought to be an epigenetic mark as well. 23 Hydroxymethylated cytosines are present in all tissues. The highest levels of 5-hmC were found in the brain, which points to its importance in development and disease. Hydroxymethylation is thought to contribute to the control of gene expression, since gene bodies, promoters and transcription factor binding sites are enriched with 5-hmC. 24 Furthermore, studies have shown that loss of 5-hmC is correlated with many cancer types. 23,25 However, not much is known yet about 5-hmC, and further research is necessary to increase our knowledge of hydroxymethylation.
In addition to its role in the silencing of genes, methylation is important in the regulation of monoallelic gene expression, e.g. in X-chromosome inactivation and imprinting. The phenomenon of imprinting is discussed extensively in section 1.2.
1.1.2 Histone modifications
In cells, DNA is found in a complex of DNA and proteins called chromatin. The basic units of chromatin are nucleosomes. Nucleosomes consist of 147 bp of DNA wrapped around a histone octamer in 1.7 left-handed superhelical turns. 14 Approximately 50 bp of free DNA (called “linker DNA”), to which the linker histone H1 binds, separate successive nucleosomes. The histone octamer is made of four core histones (H2A, H2B, H3, H4), with H2A and H2B forming two dimers and H3 and H4 one tetramer. 9 These core histones are globular except for their N-terminal tails, which are unstructured (Fig. 1.3). It is on these histone tails that epigenetic post-translational modifications (PTMs) occur. 9,14,26 A wide variety of PTMs has already been found, such as acetylation, methylation, phosphorylation, ubiquitination, sumoylation and ADP-ribosylation. 27–30 Many enzymes, such as histone-modifying enzymes and ATP-dependent chromatin remodelling enzymes, regulate these PTMs. Most of these modifications are dynamic, but the specificity of the enzymes involved can differ greatly. 26 For example, histone deacetylases and histone acetyltransferases can act on many sites, while methyltransferases and kinases are rather specific. PTMs take place on specific amino acids in the histone tails, and the effect of a PTM depends on the specific amino acid that is modified. For example, methylation of H3 at lysine 9 results in silencing, while methylation of H3 at lysine 4 leads to activation. 31 Histones can be modified simultaneously by diverse PTMs and at different sites, hence cross-talk can occur. This interaction can happen between modifications on the same site, on the same histone tail and between different tails. 32
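As a small worked example of the nucleosome dimensions given above, the sketch below adds the 147 bp of wrapped DNA to the approximately 50 bp of linker DNA to obtain a repeat length, and uses it to estimate how many nucleosomes fit on a stretch of DNA. The function names and the assumption of perfectly regular spacing are illustrative simplifications; real linker lengths vary.

```python
CORE_DNA_BP = 147    # DNA wrapped around one histone octamer (value from the text)
LINKER_DNA_BP = 50   # approximate linker DNA length (value from the text)


def nucleosome_repeat_length(core_bp=CORE_DNA_BP, linker_bp=LINKER_DNA_BP):
    """Length of one nucleosome plus its linker, in base pairs."""
    return core_bp + linker_bp


def approx_nucleosome_count(region_bp, core_bp=CORE_DNA_BP, linker_bp=LINKER_DNA_BP):
    """Number of complete repeats fitting in a region, assuming regular spacing."""
    return region_bp // nucleosome_repeat_length(core_bp, linker_bp)


if __name__ == "__main__":
    print(nucleosome_repeat_length())          # 197
    print(approx_nucleosome_count(1_000_000))  # 5076
```

Under these assumptions, a region of one million base pairs holds roughly five thousand nucleosomes.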
Two possible phenomena explain the function of the PTMs. 26 The first is the unravelling of the chromatin through separation of neighbouring nucleosomes. This divides the chromatin into different environments: euchromatin and heterochromatin. The former is the actively transcribed state of the chromatin, while the latter is the compact, non-active environment. Heterochromatin can be further subdivided into parts which are always repressed, called constitutive heterochromatin, and parts which can be active depending on, among other things, tissue type, called facultative heterochromatin. 9,14 Each of these environments is identified by a specific set of PTMs (for example, euchromatin is characterised by more histone acetylation, less histone and DNA methylation and less histone H1 binding than heterochromatin). All of these different environments along the DNA form distinct regions that will be transcribed or silenced. 14 Secondly, the modifications attract other proteins, which can bind or are blocked from binding depending on the specific PTMs and their composition. The enzyme activity of the recruited proteins then further modifies the chromatin. The functions of these proteins are very diverse and include activity in the transcription process. 33
Figure 1.3: Representation of the nucleosome complex. The four core histones (H2A, H2B, H3 and H4) together form histone octamers. DNA is wrapped around these octamers in 1.7 left-handed superhelical turns. Histone H1 binds to the linker DNA which separates successive nucleosomes. 9
1.1.3 RNA
A last important mechanism in epigenetics is RNA interference (RNAi). The mechanisms used by RNA to silence genes are linked to defence systems that protect cells from pathogenic DNA or RNA. 8 RNAi leads to recognition and repression of messenger RNA (mRNA). The most common small RNAs that execute this post-transcriptional gene silencing are small interfering RNA (siRNA) and microRNA (miRNA). 2,34 siRNAs are obtained after cleavage of double-stranded RNA (dsRNA) which is formed by complementary RNAs, while miRNAs are small, self-complementary RNAs. Both target specific genes through a homology-based post-transcriptional system. 8 After recognition of a specific mRNA, miRNA inhibits the elongation of translation, while siRNA degrades the mRNA. However, siRNA can also cause silencing at the transcriptional level: it then binds DNA, leading to methylation of the DNA and conversion of the chromatin structure to the inactive state. 14
1.1.4 Epigenetics and disease
The growing amount of research in the field of epigenetics has confirmed that epigenetic control plays an important role in the development of diseases. The Prader-Willi and Angelman syndromes are two well-known examples that illustrate the relevance of epigenetics. Both diseases occur due to abnormal imprinting of a specific locus; depending on the parental origin of the affected locus, the outcome is either Prader-Willi or Angelman syndrome. 35 Epigenetic deregulation was also demonstrated to be involved in neurodegenerative and neurological diseases (such as Alzheimer’s disease, Parkinson’s disease and amyotrophic lateral sclerosis (ALS)), in autoimmune diseases (for example rheumatoid arthritis, lupus erythematosus and Immunodeficiency, Centromere instability and Facial anomalies syndrome (ICF syndrome)), as well as in cancer. 32
In cancer cells the epigenetic landscape is seriously disrupted, with among other things methylation profiles specific to a particular tumour type. 32,36 Studying these alterations is interesting because, in contrast to genetic changes, epigenetic changes are reversible and can thus be restored. 37 This reversibility has made epigenetic phenomena a possible target for therapeutics. For example, demethylating agents can be used to re-establish correct expression of genes. 7 Some epigenetic inhibitors, such as azacitidine and decitabine, have already been approved by the Food and Drug Administration (FDA) and have shown their applicability. 38 Furthermore, the specific epigenetic profiles found in tumour types provide an opportunity to diagnose cancer and to predict the tumour’s response to a specific treatment. 36
Globally, a decrease in the number of methylated CpG dinucleotides is found in cancer. This results in increased genomic instability. However, some CpG loci, mainly those associated with tumour suppressor gene promoters, are hypermethylated, which blocks the corresponding gene’s expression. 7 These events promote cancer development through chromosomal instability, reactivation of transposable elements and loss of imprinting (discussed in section 1.2). Hypomethylation mainly occurs in repetitive sequences, coding regions and introns. 32 By demethylation of coding regions and introns, silenced genes can become active and different versions of mRNAs can be made. 36,39 In contrast, tumour suppressor genes become hypermethylated. These genes are involved in processes which are often deregulated in cancer cells. For example, genes associated with the cell cycle, DNA repair mechanisms or apoptosis are frequently inactivated through hypermethylation. 32,36 Aberrant DNA methylation often co-occurs with specific histone modifications. Thus, not only are DNA methylation patterns disturbed in cancer cells, aberrant histone modifications are found as well. For example, loss of acetylation and changes in histone methylation have been described. In addition, tumours express different miRNAs than normal cells. Generally, the expression of miRNAs is downregulated, but some specific oncogenic miRNAs are overexpressed. 37 All of these differences in the epigenetic landscape allow cancer cells to develop and even give them a growth advantage over normal cells.
1.2 Imprinting
In diploid organisms each cell has two sets of chromosomes. Two copies of all genes are thus present in cells, one inherited from the father and one from the mother. 40 For most genes, both alleles are expressed to a similar extent, though this appears not to be the case for some genes associated with non-Mendelian inherited features. Examples of this allele-specific expression (ASE) are random ASE (such as X-inactivation) and imprinting. 41–44 The latter is the parent-of-origin-specific expression of genes. Imprinted genes are often associated with growth of the embryo (such as Igf2 and Igf2r), which sheds some light on the evolutionary origin of imprinting. Imprinting in placental mammals is hypothesised to be the evolutionary result of a parental conflict. 45 More specifically, this hypothesis states that the interests of the maternal and paternal genomes are opposed. 46 The maternal genome aims at dividing resources over all of the mother’s offspring (potentially of different fathers) and at increasing overall fitness by limiting embryonic growth. The paternal genome, however, aims for immediate success and the highest possible fitness, and thus expresses genes that increase growth. In line with this hypothesis, maternally expressed genes are often growth repressing (among others Igf2r), while paternally expressed ones are growth promoting (among others Igf2). 40,47
Research on imprinting has shown that imprinted genes can be genetically identical as well as
different. Hence, imprinting is not a genetic phenomenon. In order to persist in a population,
genomic imprints need to be stable and stably inheritable as well as erasable. All these
features together led to the understanding of imprinting as an epigenetic phenomenon, more
specifically one mainly due to monoallelic DNA methylation (MAM). 45 However, histone
modifications play a minor role as well. 48
During gametogenesis DNA methylation patterns, including the imprinted ones, are globally
erased. When all patterns are removed, the imprints are remade in the gametes and are stably
preserved until the two sets of parental chromosomes divide again. So, maternal imprints are
established during egg formation, while the paternal ones are made during sperm production.
Subsequently, after fertilisation all DNA methylation marks, except for imprints, are removed
to acquire totipotent cells. 40,49
Imprinted genes are often clustered and controlled by cis-acting long non-coding RNAs
(lncRNA). The clusters contain an imprinting control region (ICR) (also called imprinting
control element (ICE)) which has a differentially methylated region (DMR). De novo DNMTs
methylate DMRs in the germline. 50 The ICR controls the expression of lncRNAs which regulate the expression of the corresponding genes. 51 For example, if the maternal DMR is
methylated, the lncRNA will not be expressed. This leads to activity of the maternal genes in
the cluster. Meanwhile, the paternal genes will be silenced due to expression of the lncRNA
because the DMR is unmethylated (Fig. 1.4). 40 However, more complex models of imprinting were discovered as well, of which the regulation of the H19 and Igf2 genes is an example.
These genes are reciprocally imprinted and have shared enhancers. H19 is expressed maternally. By binding of CTCF on the ICR, a CTCF-dependent insulator is generated preventing
Igf2 from being expressed due to enhancers being unable to access it. On the other hand, Igf2
is expressed paternally due to methylation of the ICR and hence prevention of CTCF from
binding to it. As the ICR is methylated, the H19 promoter will be methylated as well. Here
H19 cannot be expressed and thus the enhancers will promote Igf2 expression (Fig. 1.4). 50,52
Figure 1.4: Imprinting mechanisms of Igf2r and Igf2. Methylation of the differentially methylated
region (DMR) in the imprinting control element (ICE) of the maternal strand leads to
repression of the lncRNA and hence transcription of Igf2r. However, expression of the
lncRNA in the paternal strand silences the paternal copy (top). The reciprocal regulation
of Igf2 and H19 is slightly more complex (below). CTCF binds to the maternal ICE not
allowing Igf2 to access the enhancer. Hence, Igf2 is not expressed and maternally, only
H19 is expressed. On the paternal strand, on the other hand, Igf2 is expressed due to a
methylated ICE and H19 promoter. Methylation of the ICE prevents CTCF from binding.
Thus, here the enhancer interacts with the Igf2 promoters allowing Igf2 expression. 40
Deregulation of imprinting has already been associated with many diseases, such as Beckwith-Wiedemann syndrome and Turner’s syndrome. 49,53 As many imprinted genes are involved in
development, diseases caused by aberrant imprinting are often developmental or neurological
in nature. 40 Furthermore, imprinting has an important role in cancer. Loss of imprinting
(LOI) has been found in high frequency in various tumours. 54 LOI can either result in activation of the silent allele or repression of the active allele. The former will result in overgrowth
due to the normal cells being adapted to a dosage from one single allele. This helps tumours
in their over-proliferation, which is often seen when LOI of Igf2 occurs. Silencing of the active
allele, on the other hand, leads to loss of the gene product. LOI can also result in an increased
oncogene expression. 55 Studying LOI events can thus help cancer treatment and diagnosis.
It is hence of great importance to improve our knowledge of LOI and discover new LOI loci.
1.3 Next-generation sequencing
DNA sequencing is the technique in which the order of nucleotides in a specific DNA sequence
is determined. 56 In 1977, Sanger et al. introduced a sequencing method based on chain
termination. 57 It remained the gold standard for DNA sequencing for several
decades. A major achievement of Sanger sequencing was the determination of the first
human genome sequence. 56 However, the increasing demand for faster and cheaper sequencing
methods led to the development of next-generation sequencing (NGS) techniques. Hence,
many new methods replaced the previously dominant Sanger sequencing. 58
The NGS technique currently most widely applied was created by Illumina, whereas the use of
once major competitors (e.g. Roche 454) has become marginal over time. They are all based
on the same process, but their differences lie in the specific manner in which the separate steps
are performed. Each method involves the development of a DNA library, which is amplified
and subsequently sequenced. Analysis of the images capturing the sequencing process and
further data processing finally yield the sequencing information. 59 The great advantage of
NGS techniques is their ability to process and sequence millions of reads in parallel leading to
a massive increase in throughput. 60 Furthermore, no bacterial cloning is needed and instead
of relying on electrophoresis, direct detection of the output is possible. 58 The specific steps
in the process differ in varying NGS techniques. For example, clusters of templates are
created in Illumina sequencing by bridge amplification which will then be sequenced using
sequencing by synthesis with a fluorescent detection step (called cyclic reversible termination
(CRT) sequencing). Roche 454 uses a sequencing by synthesis method as well, however here
pyrosequencing is used. The templates are created by emulsion PCR. 56,61 Illumina sequencing
will be discussed in more detail in section 1.3.1.
More advanced methods, such as Ion Torrent, have been developed as well. These have
led to third-generation sequencing, which enables the detection of single molecules in real
time leading to longer but typically more error prone sequences. 62 The best known examples
of third-generation sequencing are single-molecule real-time sequencing developed by Pacific
Biosciences and Oxford Nanopore Technologies’ nanopore sequencing. The field of genomics
will evolve even further with these techniques by providing information that current NGS
techniques are unable to give.
1.3.1 Illumina sequencing
The first step in Illumina sequencing is library preparation. DNA is fragmented into smaller
parts and adapters are subsequently ligated (Fig. 1.5). 61 Afterwards the fragments are loaded
onto a flow cell where amplification and sequencing will take place. On the surface of this
flow cell forward and reverse oligos, complementary to the adapter sequences, are attached.
These oligos will act as primers during amplification and sequencing of the DNA template. 63
Single-stranded DNA (ssDNA), which can hybridise to the oligos on the surface, is created
by denaturation of dsDNA. The ssDNA fragments are then copied with the oligos as primers.
After another denaturation step the original fragments are removed. Thus, immobilised copies
of the initial library DNA fragments are then present on the flow cell. 61 Bridge amplification of these copies is subsequently performed: the surface-attached copies hybridise to the
complementary oligos on the surface by creating a bridge structure and can thus be amplified. After denaturation of the created dsDNA, two ssDNA fragments are formed which
are attached to the surface of the flow cell. Repeating this many times results in clusters of
surface-immobilised copies of the DNA template. 59,63
Finally, these clusters are sequenced using sequencing by synthesis, more specifically cyclic
reversible termination (CRT) sequencing. Nucleotides are incorporated into a growing DNA
strand which is complementary to the template DNA. This is achieved through extension of a
primer, which is hybridised to the template sequence, by DNA polymerase. 64 The nucleotides
are chemically modified to enable detection right after incorporation. In Illumina sequencing
reversible terminators are used. They contain a fluorescent label and a protecting group that
prevents incorporation of several nucleotides at once, both of which can be chemically cleaved
off. Each type of nucleotide has its own fluorescent label, so by detection of the colour of the
fluorophore, the type of nucleotide is recognised. Hence, after incorporation and detection
of the nucleotide, the two moieties can be chemically cleaved and the cycle of incorporation,
imaging, and deprotection can start over. 65,66 In this way, the template DNA is sequenced.
Figure 1.5: Overview of the Illumina sequencing process. Adapters are ligated to the template DNA
for library preparation. Afterwards the DNA is immobilised to the surface of the flow
cell where clusters of copied DNA are created by bridge amplification. These clusters
are finally sequenced using cyclic reversible termination (CRT). 67
1.4 Methods to detect DNA methylation
As described before, epigenetics (including imprinting) plays an important role in development and disease. It is hence of major interest to discover epigenetic marks in the genome
to broaden our knowledge and understanding of epigenetic phenomena and their role in diseases. Therefore, many techniques to detect epigenetic profiles have been developed. The
most common DNA methylation profiling methods are either bisulfite-based or enrichment-based. 68 The gold standard is whole genome bisulfite sequencing (WGBS). Treatment of
DNA with sodium bisulfite deaminates unmethylated cytosines to uracils. During subsequent
PCR, uracil is read as thymine. The conversion of methylated cytosines, on the other hand, is considerably slower. Thus, under adequate conditions, the bulk
of unmethylated cytosines will be converted while the methylated ones remain unchanged.
Upon sequencing, methylated cytosines can thus be detected in a straightforward manner. 69,70
WGBS has a single-base resolution and allows quantification of the methylation levels. However, the technique is very expensive. Furthermore, the bisulfite treatment can complicate
sequencing and sequence alignment leading to biased results. 68,71
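The principle of bisulfite-based methylation calling can be illustrated with a minimal sketch. It assumes an idealised experiment (complete conversion of unmethylated cytosines, no conversion of methylated ones, a single strand, and no sequencing errors or SNPs); the function names are hypothetical and do not correspond to any existing tool.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate ideal bisulfite treatment followed by PCR on one strand.

    Unmethylated C is read as T; methylated C stays C.
    """
    converted = []
    for position, base in enumerate(sequence):
        if base == "C" and position not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylation(reference, read):
    """Return {position: is_methylated} for each reference cytosine in the read."""
    calls = {}
    for position, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[position] = True   # protected from conversion: methylated
        elif read_base == "T":
            calls[position] = False  # converted: unmethylated
    return calls


reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated_positions={1})
methylation_calls = call_methylation(reference, read)
print(read, methylation_calls)
```

In real data a T at a reference cytosine can also be a C>T polymorphism or a sequencing error, which is one reason why bisulfite reads are harder to align and interpret.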
New techniques that avoid these high costs are therefore very interesting. Reduced representation bisulfite sequencing (RRBS) is one of them. The DNA is first enzymatically digested
to obtain CpG enriched regions, which are more likely to be relevant for the analysis. Upon
bisulfite sequencing of these regions, RRBS provides a similarly high-quality profile as WGBS,
yet for a vastly reduced portion of the genome. RRBS has the same advantages as WGBS,
but the cost is considerably decreased. 72
Illumina developed another technique based on bisulfite conversion. The Infinium HumanMethylation BeadChip combines bisulfite conversion of DNA with whole-genome amplification. In the Infinium I assay, used on the HumanMethylation27K BeadChip, the converted
and amplified DNA is hybridised against two different types of beads for each specific CpG
site: one type contains a probe that represents an unmethylated CpG site (matching thymine),
while the second type mirrors a methylated CpG site (matching cytosine). If a site is unmethylated it will match the unmethylated probe perfectly and hence single-base extension
is possible. This will be detected as a fluorescent signal. On the other probe, hybridisation
leads to a mismatch. No extension will thus be performed here and the detected signal will
be very low. For methylated CpG sites the same process occurs on the opposite beads. Here,
a strong signal will be seen from the methylated probe and only a low signal is detected from
the unmethylated probe. 73 The Infinium II assay consists of only one probe per site. Depending on the single-base extension with a labelled adenosine (complementary to thymine)
or guanine (complementary to cytosine) the methylation status is determined. The newest
chip developed by Illumina, the HumanMethylation450K BeadChip, combines the two assays
for analysis of more than 480000 CpG sites on only one chip. 74
Alternative methods that avoid the disadvantages of bisulfite-based techniques were developed as well. These methods are based on enrichment of methylated DNA fragments followed
by sequencing (Fig. 1.6). 71 First, DNA is fragmented and methylated fragments are precipitated. This can be done with either methyl binding domain (MBD) containing proteins or
antibodies. 72,75 After elution, the enriched DNA is analysed by sequencing. 76 MethylCap-seq
or MBD-seq is the sequencing of DNA enriched by MBD containing proteins, while MeDIP-seq is sequencing after precipitation with antibodies. These methods still offer a genome-wide approach, but have a strongly reduced cost. On the downside, no base pair resolution
can be obtained. Furthermore, the CpG density and GC content can create a bias. 68,71,77
Enrichment-based techniques can also be used to analyse histone modifications or nucleosome
remodelling. 78,79 Due to its cost-efficiency and sensitivity, MethylCap-seq can be considered
as one of the best techniques to establish genome-wide DNA methylation profiles. 80 However,
new methods are still being developed. 81,82
Figure 1.6: Overview of enrichment-based method for DNA methylation profiling. The DNA is first
fragmented (a,b). Afterwards, the methylated regions are captured and enriched by
specific proteins (methyl binding domain (MBD) containing proteins) or antibodies (c).
The unbound regions are removed through a washing step and the enriched fragments
are obtained through elution (d). Finally, fragments are sequenced and analysed (e,f). 71
1.5 RNA-Seq
Parts of the genetic code in cells are transcribed into RNA. The full set of these RNA molecules
is called the transcriptome, including for example mRNA and non-coding RNAs (ncRNAs). 83
The transcriptome is often studied by detection and quantification of the different kinds of
RNA molecules to broaden our understanding of developmental and other processes. Furthermore, it can help us understand how transcriptional dysregulation can lead to diseases.
Techniques for studying RNA molecules based on hybridisation (e.g. microarrays) and PCR
have been developed, but state-of-the-art transcriptional profiling is performed by RNA sequencing (RNA-seq). 84,85 This method gives information on billions of bases in a very limited
time. 86 Any type of deep-sequencing platform can be used to analyse the transcripts. Firstly,
the total or messenger RNA fractions (intact or fragmented) are converted into cDNA libraries followed by sequencing. For ncRNAs a ribosomal RNA depletion step is usually
required, whereas polyA-tail capturing of mRNAs automatically removes the major part of
ribosomal noise. The sequencing procedure is very similar to DNA sequencing, but the library preparation and analysis are different. After sequencing, reads are aligned and mapped
(Fig. 1.7). Information on transcripts (their assembly and structure) and on levels of specific
transcripts can be obtained. 84,86 Even though the resolution achieved by RNA-Seq is already
remarkable, improvements to the methods and new variants are still being developed. 87
1.6 Analysing next-generation sequencing data
As described in sections 1.3, 1.4 and 1.5, when studying (epi)genetic phenomena, raw sequencing
data have to be analysed. A very important step in this analysis is SNP and genotype calling,
but, first, several other steps need to be conducted. 88 Next-generation sequencing experiments
result in images of emitted signals, such as fluorescent images. These signals are created
during the production of complementary strands of the DNA templates (see section 1.3). The
process of analysing these images goes as follows: 1) base calling by analysis of the signals,
2) alignment and assembly of the sequences, 3) potential recalibration or filtering and 4)
genotype and SNP calling. 88,89
In the first step, emitted signals are converted into an actual nucleotide. Together with this
base call, a quality score is determined. The standard for this is the Phred score, which encodes the
probability that the base call is wrong. 90 The Phred score is calculated as: 91
$$Q_{\mathrm{phred}} = -10 \log_{10} P(\mathrm{error}) \tag{1.1}$$
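As a worked illustration of equation 1.1, the conversion between error probability and Phred score can be sketched as follows. The function names are hypothetical; only the formula itself is taken from the text.

```python
import math


def phred_from_error_probability(p_error):
    """Equation 1.1: Q = -10 * log10(P(error))."""
    return -10.0 * math.log10(p_error)


def error_probability_from_phred(q_phred):
    """Inverse of equation 1.1: P(error) = 10^(-Q/10)."""
    return 10.0 ** (-q_phred / 10.0)


# An error probability of 1 in 1000 corresponds to a Phred score of about 30,
# and a Phred score of 20 to an error probability of about 1 in 100.
q_for_one_in_thousand = phred_from_error_probability(0.001)
p_for_q20 = error_probability_from_phred(20)
print(q_for_one_in_thousand, p_for_q20)
```

Every increase of 10 in the Phred score thus corresponds to a tenfold decrease in the estimated error probability.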
Figure 1.7: Overview of RNA-Seq. In a first step, the RNA is converted into cDNA. These cDNA
libraries are made through reverse transcription. Afterwards, adapters are attached on
both sides. The libraries are then ready for sequencing and analysis (including mapping,
classification of the reads and production of expression profiles). 84
The different reads are subsequently aligned to a reference genome. Sequencing errors and genomic variation make a proper alignment very difficult. Many different
mapping algorithms have thus been developed to tackle this problem. Some commonly used
read mappers are BOWTIE, BWA, and SOAP. 92 In BOWTIE the reference genome is transformed using the Burrows-Wheeler transform to create a memory-efficient representation of
the genome. Character by character the reads are then aligned to this transformation. If
a perfect alignment cannot be found, backtracking allows mismatches in the alignment. 93
Mapping cDNA reads from RNA-seq data to a reference genome is even more challenging due
to potential splicing. RNA-seq alignment algorithms thus need to allow large gaps to include
introns. An example of such a mapper is STAR. Maximum mappable seeds are sequentially
searched for in the unmapped parts of a read. The seeds are afterwards clustered and stitched
together to create the complete alignment. 94 If no reference genome is available, the sequence
can be assembled de novo by assessing the overlaps of reads. 95
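The idea of matching a read character by character against a Burrows-Wheeler transformed reference can be sketched as follows. This is a naive teaching example, not the implementation used by BOWTIE: the transform is built by sorting all rotations (feasible only for short strings), only exact matches are counted, and the backtracking that allows mismatches is omitted. The function names are hypothetical.

```python
def burrows_wheeler_transform(text):
    """Return the BWT of text, using '$' as end marker (sorts before A, C, G, T)."""
    text = text + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_exact_matches(bwt, pattern):
    """Count occurrences of pattern in the original text by backward search."""
    # first_occurrence[c]: index of the first row starting with c in the sorted rotations
    first_occurrence = {}
    for index, char in enumerate(sorted(bwt)):
        first_occurrence.setdefault(char, index)

    def rank(char, end):
        # Number of occurrences of char in bwt[:end] (naive, linear time)
        return bwt[:end].count(char)

    low, high = 0, len(bwt)  # half-open interval of matching rows
    for char in reversed(pattern):  # the read is processed from its last character
        if char not in first_occurrence:
            return 0
        low = first_occurrence[char] + rank(char, low)
        high = first_occurrence[char] + rank(char, high)
        if low >= high:
            return 0
    return high - low


genome = "ACGTACGACG"
genome_bwt = burrows_wheeler_transform(genome)
print(genome_bwt, count_exact_matches(genome_bwt, "ACG"))
```

Practical mappers replace the naive `rank` function with precomputed occurrence tables and store a sampled suffix array to recover the genomic positions of the hits.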
Subsequently, additional filtering can be done to exclude low-quality alignments. This step
is based on the Phred quality score. 96 Finally, the base calls, quality scores and alignments
need to be converted into genotypes. This is done in the SNP and genotype calling step. 88
1.6.1 SNP and genotype calling
So far SNP and genotype calling were treated as a single step, yet they are somewhat
different. SNP calling is the recognition of variable sites (polymorphisms) in the sequence.
Genotype calling, on the other hand, refers to attributing a genotype to a specific (typically
polymorphic) site. Both steps are however difficult due to sequencing errors occurring in
the data. Errors arise due to many different causes, such as incorrect base calling or wrong
alignments. 88,97 Therefore, the Phred scores are often included in the SNP and genotype
calling processes. Genotype calls can either be heuristic (a.o. based on cut-off rules) or they
can be determined by probabilistic models. 98
In the heuristic approaches, a filtering step is often done first. Based on the Phred score only
high-quality sites are kept. Genotypes are then determined using fixed filters. For example:
if the allele count of the variant allele is higher than a specific number, the sample is called
heterozygous for that locus instead of the reference homozygous status. Even higher counts
are required to call the sample homozygous for the variant allele. 88,99 The downside of this
approach is that read depths and population-wide allele frequencies are not (fully) taken into
account. This possibly leads to biased genotype calling (typically biased towards a consensus
variant). 98
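A heuristic caller of this kind can be sketched in a few lines. The thresholds below (minimum depth and the two variant fractions) are purely illustrative assumptions and are not taken from any of the cited methods, which may use counts rather than fractions.

```python
def call_genotype_heuristic(variant_count, read_depth,
                            min_depth=8, het_fraction=0.2, hom_fraction=0.8):
    """Call a genotype from fixed cut-offs on the variant read fraction.

    Returns "RR", "RV" or "VV", or None when the depth is too low to call.
    All thresholds are hypothetical example values.
    """
    if read_depth < min_depth:
        return None
    variant_fraction = variant_count / read_depth
    if variant_fraction >= hom_fraction:
        return "VV"
    if variant_fraction >= het_fraction:
        return "RV"
    return "RR"


example_calls = [call_genotype_heuristic(x, n)
                 for x, n in [(1, 20), (9, 20), (19, 20), (2, 4)]]
print(example_calls)
```

The fixed cut-offs make the weakness described above visible: the same thresholds are applied regardless of read depth or of how common the variant is in the population.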
Therefore, probabilistic model-based approaches were developed. In these methods, the probabilities of the genotypes given the data are calculated and the one with the highest posterior
probability is kept. The probabilities P(G|Xi) (in which Xi is a set of variant counts for locus
i and G is a specific genotype) can be calculated using Bayes’ theorem in which the error
estimates are included. 88,98,100 This leads to:
$$P(G \mid X_i) = \frac{P(X_i \mid G)\,P(G)}{P(X_i)} \tag{1.2}$$
with P(G) the prior probability of the specific genotype and P(Xi) the marginal probability of the
variant counts. P(Xi|G) is the probability of the variant counts Xi given genotype G, which can be easily calculated.
Division by P(Xi) (the probability of the observed data) is necessary for normalisation to
ensure a total probability of 1. Calculation of this Bayes’ formula thus requires a prior
genotype probability. This prior can be calculated using the Hardy-Weinberg theorem, can be
found in the population if reference genomes are used or can be a constant value. 88 Typically,
the prior probability is derived from the data (empirical Bayes), yet their assessment is exactly
the goal of the procedure leading to a circular problem.
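Equation 1.2 can be made concrete with a small sketch. Two assumptions are made here that the text does not prescribe: the likelihood P(Xi|G) is modelled as a binomial with a fixed per-base error rate, and the prior P(G) is taken from the Hardy-Weinberg theorem for a given variant allele frequency. All names and default values are hypothetical.

```python
from math import comb


def genotype_posteriors(variant_count, read_depth, variant_allele_freq,
                        error_rate=0.01):
    """Posterior P(G | X) for G in {RR, RV, VV} via Bayes' theorem."""
    q = variant_allele_freq
    # Prior P(G): Hardy-Weinberg genotype frequencies (assumption of this sketch)
    priors = {"RR": (1 - q) ** 2, "RV": 2 * q * (1 - q), "VV": q ** 2}
    # Probability that a single read shows the variant allele, per genotype
    p_variant_read = {"RR": error_rate, "RV": 0.5, "VV": 1 - error_rate}

    joint = {}
    for genotype, prior in priors.items():
        p = p_variant_read[genotype]
        likelihood = (comb(read_depth, variant_count)
                      * p ** variant_count
                      * (1 - p) ** (read_depth - variant_count))
        joint[genotype] = likelihood * prior  # numerator of equation 1.2

    evidence = sum(joint.values())  # P(X), the normalising constant
    return {genotype: value / evidence for genotype, value in joint.items()}


posteriors_balanced = genotype_posteriors(5, 10, variant_allele_freq=0.3)
posteriors_reference = genotype_posteriors(0, 10, variant_allele_freq=0.3)
print(max(posteriors_balanced, key=posteriors_balanced.get),
      max(posteriors_reference, key=posteriors_reference.get))
```

The genotype with the highest posterior is reported as the call; with 5 variant reads out of 10 this is the heterozygote, with 0 out of 10 the reference homozygote.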
Because the purpose of this master’s thesis is the detection of (loss of) imprinting, a model-based approach is needed. This is to ensure that the SNP calling is unbiased since the
filtering and cut-offs would disfavour samples without heterozygotes, while exactly those are
of interest.
1.6.2 SeqEM
As described above, probabilistic methods for genotype calling are based on the highest
posterior probability which is calculated using Bayes’ formula. Several probabilistic-based
methods have been developed such as MAQ and SOAPsnp. 90,96 However, none of these algorithms use information on allele frequencies and error rates available in the data itself.
Hence, in 2010 Martin et al. developed SeqEM, a genotype-calling algorithm based on the
Expectation-Maximisation (EM) algorithm. 98 EM iteratively computes maximum-likelihood estimates of unknown parameters in a model. 101 The joint probability of observing variant
count Xi together with a specific genotype Gi can be calculated as:
$$P(X_i, G_i = VV \mid N_i, \theta) = \binom{N_i}{X_i} (1-\alpha)^{X_i}\, \alpha^{N_i - X_i}\, p_{VV} \tag{1.3}$$
$$P(X_i, G_i = RV \mid N_i, \theta) = \binom{N_i}{X_i} (1/2)^{N_i}\, p_{RV} \tag{1.4}$$
$$P(X_i, G_i = RR \mid N_i, \theta) = \binom{N_i}{X_i} \alpha^{X_i}\, (1-\alpha)^{N_i - X_i}\, (1 - p_{VV} - p_{RV}) \tag{1.5}$$
In these equations V is the variant allele, R the reference allele, Ni the read depth and Xi the
number of variant nucleotides for a specific sample i, α the error rate and pVV and pRV the genotype frequencies
of VV and RV, respectively. θ = {α, pVV, pRV} is a vector of unknown
parameters which require prior values. In SeqEM the prior probabilities pVV and pRV and the
error rate are estimated by the EM algorithm. 98 Two steps are conducted in EM estimation:
an expectation step and a maximisation step. In the first step, a probability distribution over the genotypes is constructed with
the current estimates of the parameters. In the second step, the parameters are re-estimated
by maximising the likelihood under this distribution. 102 These two steps are
repeated until good estimates of the parameters are found. Hence, the prior probabilities and
error rate are estimated using information included in all the samples (S) in the dataset.
This is done by maximising the following log-likelihood:
$$L(\theta; X, G, N) = \sum_{i=1}^{S} \ln P(X_i, G_i \mid N_i, \theta) \tag{1.6}$$
Through iteration, the parameters are estimated until the estimates of two successive rounds
differ by only a small amount. These estimated parameters are subsequently used in formulas
1.3-1.5. The genotype with the highest a posteriori probability is considered the correct
genotype call.
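The iteration described above can be sketched for a single locus as follows. This is an illustrative reading of equations 1.3-1.6, not the SeqEM source code: the update rule for α is the one obtained by maximising the expected log-likelihood of equations 1.3-1.5, and the starting values, tolerance and the clamping of α are assumptions of this sketch.

```python
from math import comb


def posterior_given_parameters(x, n, alpha, p_vv, p_rv):
    """Normalise the joint probabilities of equations 1.3-1.5 for one sample."""
    p_rr = max(0.0, 1.0 - p_vv - p_rv)  # guard against rounding below zero
    binom = comb(n, x)
    joint = {
        "VV": binom * (1 - alpha) ** x * alpha ** (n - x) * p_vv,
        "RV": binom * 0.5 ** n * p_rv,
        "RR": binom * alpha ** x * (1 - alpha) ** (n - x) * p_rr,
    }
    total = sum(joint.values())
    return {genotype: value / total for genotype, value in joint.items()}


def em_genotype_calling(samples, alpha=0.05, p_vv=0.3, p_rv=0.3,
                        tolerance=1e-8, max_iterations=500):
    """samples: list of (variant_count, read_depth) pairs for one locus."""
    n_samples = len(samples)
    for _ in range(max_iterations):
        # E-step: genotype posteriors under the current parameter estimates
        posteriors = [posterior_given_parameters(x, n, alpha, p_vv, p_rv)
                      for x, n in samples]
        # M-step: genotype frequencies are the mean posteriors
        new_p_vv = sum(p["VV"] for p in posteriors) / n_samples
        new_p_rv = sum(p["RV"] for p in posteriors) / n_samples
        # M-step: error rate = expected erroneous reads / reads in homozygotes
        error_reads = sum(p["RR"] * x + p["VV"] * (n - x)
                          for p, (x, n) in zip(posteriors, samples))
        homozygous_reads = sum((p["RR"] + p["VV"]) * n
                               for p, (x, n) in zip(posteriors, samples))
        new_alpha = error_reads / homozygous_reads if homozygous_reads > 0 else alpha
        new_alpha = min(max(new_alpha, 1e-6), 0.5)  # keep alpha in a valid range

        change = max(abs(new_alpha - alpha), abs(new_p_vv - p_vv),
                     abs(new_p_rv - p_rv))
        alpha, p_vv, p_rv = new_alpha, new_p_vv, new_p_rv
        if change < tolerance:  # successive estimates barely differ: stop
            break

    # Final calls: genotype with the highest posterior under the final estimates
    final = [posterior_given_parameters(x, n, alpha, p_vv, p_rv)
             for x, n in samples]
    calls = [max(p, key=p.get) for p in final]
    return calls, {"alpha": alpha, "p_VV": p_vv, "p_RV": p_rv}


example_samples = [(0, 20), (1, 25), (10, 20), (12, 22), (20, 20), (29, 30)]
em_calls, em_parameters = em_genotype_calling(example_samples)
print(em_calls, em_parameters)
```

In this toy dataset the two discordant reads in the homozygous samples (1 of 25 and 1 of 30) drive the estimated error rate, which shows how the error rate is learned from the population rather than from quality scores.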
SeqEM does not use the information on error rates available in the quality scores, but obtains
it by looking at the whole population. This is an interesting approach, because here the
genotype calls do not depend on the accuracy of Phred. However, by not using the quality
score, lots of information on the quality of base calls is lost. 88
The genotype frequencies in equations 1.3-1.5 can also be calculated following the Hardy-Weinberg equilibrium. However, as deviations from this equilibrium are searched for in the
MAM-pipeline, this option will not be used in this thesis.
1.7 The MAM-pipeline
The important role attributed to monoallelic DNA methylation in cancer makes it very
interesting to search for loci featuring MAM. This is however not an easy aim. Due to the
cost of WGBS, the technique can often not be used. 71 On the other hand, no information on
monoallelic methylation can be directly obtained from enrichment-based techniques. Indeed,
only methylated regions are precipitated and sequenced. Hence, only knowledge about those
alleles is acquired, while nothing is known about the non-methylated alleles. Thus, no MAM
can be concluded. However, in 2014 a solution to this problem was found in the Biobix
lab. Steyaert et al. created a data-analytical framework that screens for regions featuring
MAM, using enrichment-based sequencing data. 103 It is important to note that RNA-seq can
be used as input in the pipeline as well. Furthermore, other epigenetic modifications, such as
monoallelic histone modifications, can also be detected with the developed methodology. For
simplicity, however, only MAM will be discussed here.
1.7.1 Rationale
The rationale behind the methodology is based on a population genetics theorem, the Hardy-Weinberg principle. This states that, under certain conditions, the allele and genotype frequencies will remain constant from generation to generation in a randomly mating population.
Furthermore, it states that the allele and genotype frequencies are linked by a very simple
set of formulas. If two alleles A and a are present with allele frequencies p and q, respectively, the genotype frequencies of AA, Aa and aa will be p², 2pq and q², respectively. 104
This equilibrium holds if the population is panmictic with no effects of migration, inbreeding
or genetic drift. 105 In contrast to biallelic methylation, enrichment data for monoallelically
methylated loci will exhibit apparent deviation from Hardy-Weinberg equilibrium. When a
locus exhibits MAM, in theory only single alleles will be picked up and sequenced, implying
that all samples will be observed as homozygous. Hence, testing for MAM translates into
screening for regions that show a decrease in the observed fraction of heterozygotes. 103
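The comparison between observed and Hardy-Weinberg expected heterozygote fractions can be sketched as follows. The genotype coding ("RR", "RV", "VV") and the function name are assumptions made for this illustration; the allele frequency is estimated from the called genotypes themselves.

```python
def heterozygote_fractions(genotypes):
    """Return (observed, expected) heterozygote fractions for one locus.

    genotypes: list of calls coded as "RR", "RV" or "VV".
    The expected fraction is 2pq under Hardy-Weinberg equilibrium.
    """
    n_samples = len(genotypes)
    variant_alleles = sum(genotype.count("V") for genotype in genotypes)
    q = variant_alleles / (2 * n_samples)  # variant allele frequency
    p = 1 - q                              # reference allele frequency
    observed = sum(1 for genotype in genotypes if genotype == "RV") / n_samples
    expected = 2 * p * q
    return observed, expected


# Biallelically methylated locus: genotypes as expected under Hardy-Weinberg
biallelic_locus = ["RR"] * 25 + ["RV"] * 50 + ["VV"] * 25
# Monoallelically methylated locus: every sample appears homozygous
mam_locus = ["RR"] * 50 + ["VV"] * 50

biallelic_fractions = heterozygote_fractions(biallelic_locus)
mam_fractions = heterozygote_fractions(mam_locus)
print(biallelic_fractions, mam_fractions)
```

Both loci have the same allele frequencies and hence the same expected heterozygote fraction of 0.5, but only the second shows the complete lack of observed heterozygotes that points to MAM.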
1.7.2 The pipeline
The original pipeline can be seen in figure 1.8. First, the MethylCap-seq data are mapped and
screened for SNPs. These SNPs are filtered to reduce computational load. Then, two rounds
of error correction are performed using a model-based approach (see section 1.6.1). In samples
with three alleles, the allele with the lowest coverage was deleted as it was probably an error.
In the first round, the unfiltered data, which includes sequencing errors, are corrected. When
a sequencing error is detected, it is removed from the sequence. The output of this first round
is then used in a second round. Next, two iteration rounds, 1000 and 1 000 000 iterations
respectively, using the data-analytical framework are carried out. The observed heterozygous
fraction and expected Hardy-Weinberg fraction are calculated for each SNP. Random data
are used to generate null distributions which follow the Hardy-Weinberg equilibrium. The
coverages and allele frequencies of this random data are the same as in the original data.
Comparing the fraction of heterozygotes in the original data with the null distributions gives
p-values for each SNP position. Significant MAM is called when that p-value is smaller than
the one corresponding to a false discovery rate of 0.1.
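The use of simulated null distributions can be illustrated with a strongly simplified sketch. Unlike the actual pipeline, it ignores read coverage and sequencing errors and draws genotypes directly under Hardy-Weinberg equilibrium; the add-one correction of the empirical p-value, the seed and all function names are assumptions of this illustration, and the multiple testing correction is not shown.

```python
import random


def simulate_null_het_fractions(n_samples, variant_allele_freq, n_iterations, rng):
    """Heterozygote fractions of random datasets that follow Hardy-Weinberg."""
    q = variant_allele_freq
    p_het = 2 * q * (1 - q)  # expected heterozygote frequency 2pq
    fractions = []
    for _ in range(n_iterations):
        het_count = sum(1 for _ in range(n_samples) if rng.random() < p_het)
        fractions.append(het_count / n_samples)
    return fractions


def empirical_p_value(observed_fraction, null_fractions):
    """One-sided p-value for a deficit of heterozygotes."""
    as_extreme = sum(1 for fraction in null_fractions
                     if fraction <= observed_fraction)
    # Add-one correction so that the p-value is never exactly zero
    return (as_extreme + 1) / (len(null_fractions) + 1)


rng = random.Random(2015)  # fixed seed for reproducibility
null_fractions = simulate_null_het_fractions(
    n_samples=100, variant_allele_freq=0.5, n_iterations=1000, rng=rng)

p_value_mam_like = empirical_p_value(0.05, null_fractions)   # strong deficit
p_value_hwe_like = empirical_p_value(0.50, null_fractions)   # no deficit
print(p_value_mam_like, p_value_hwe_like)
```

The smallest attainable p-value is bounded by the number of iterations, which is consistent with running a first round with few iterations and a second round with many more.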
Figure 1.8: Overview of the MAM-pipeline. First, the MethylCap-Seq data are mapped, screened
for SNPs and filtered. Two rounds of sequencing error corrections are then performed
based on a Bayesian approach. The data-analytical framework then follows twice with
1000 and 1 000 000 iterations. The observed fractions of heterozygotes are compared to null
distributions made in accordance to the Hardy-Weinberg expected fractions. Monoallelic
methylation is concluded if the observed heterozygous fractions deviate significantly from
the expected ones. Lastly, the found loci are annotated and validated. 103
1.7.3 Disadvantages
Although the developed framework has many advantages, it also has some downsides. The
biggest disadvantage is the computational load. Filtering of the data was carried out to
reduce it partly. However, through filtering some data will not be analysed, hence making it
more challenging to detect MAM. So, a trade-off between the sensitivity and computational
load had to be searched for. As more and more samples have to be analysed in the future,
the computational intensity and time will increase even further. So improving for example
the SNP and genotype calling step could potentially decrease the load and enable analysis
of more loci. An error correction step was also added in order to reduce the computational
intensity. However, the approach used here was too conservative as it disfavoured detection
of MAM. An improvement of the correction steps is thus certainly beneficial. Lastly, because
loci featuring MAM are picked up less efficiently, the method is too conservative. However,
this does not mean that the loci found are incorrect. Even though the methodology
is too conservative, it was able to accurately detect imprinted loci. Identification of known
imprinted loci, such as Igf2/H19, as well as new loci that could be independently validated,
proved the accuracy of the method.
1.8 TCGA: breast cancer
The Cancer Genome Atlas (TCGA) is the largest available cancer genetics database. 106 It is a
joint effort of the National Cancer Institute (NCI) and the National Human Genome Research
Institute (NHGRI). The goal of TCGA is to improve our knowledge of (epi)genetic regulation
in cancer and how changes therein influence cancer development. 107 The project started in
2006 on a much smaller scale. Only a few cancer types were studied then, namely glioblastoma
multiforme, serous cystadenocarcinoma of the ovary and lung squamous carcinoma. However,
that list is greatly extended now, including cancers such as breast cancer, thyroid cancer,
pancreatic cancer and many more. 108,109 Different sequencing centers provide TCGA with
enormous amounts of next-generation sequencing and microarray (SNP and Infinium arrays)
data of many different types (SNP, mRNA...). 110,111 These data are publicly available on the
TCGA data portal except for raw and SNP data due to ethical reasons. Access to these data
can however be granted for research purposes only upon adequate motivation. 111,112
In this master’s thesis data from TCGA will be used, more specifically the breast cancer
data. Breast cancer is the most common cancer and the second leading cause of death due to cancer
in women. 113 In men breast cancer can occur as well, although it is rare. The two types of
breast cancer studied here are ductal carcinoma and lobular carcinoma. The former starts in
the milk ducts of the breast, while the latter originates in the lobules. 114 Breast cancer was
chosen in this study for varying reasons. First of all, the TCGA data portal contains over
a thousand samples, including tumour as well as control samples, from breast cancer only.
The developed methodology to detect loss of imprinting is based on comparison of control
and tumour samples. Hence, the availability of both of them was a very important aspect.
Furthermore, breast cancer has the most samples of all cancers. Secondly, previous research
was checked to find cancers wherein LOI loci were already found. Loss of imprinting of PEG1
in breast cancer was discovered using reverse transcriptase PCR by Pedersen et al. 115 Shetty et
al. found that Igf2 is often biallelically expressed in breast cancer due to loss of imprinting. 116
Thus, the known loci can be used as a validation of the new method. All these aspects of
the breast cancer data and the fact that it is a very common and deadly disease, led to the
choice of using breast cancer in this study.
Chapter 2
Materials and Methods
2.1 Hardware
A laptop with a 1.8 GHz Intel® Core™ i3-3217U processor, 8 GB DDR3 RAM and Windows
8.1 (64-bit) as operating system was used to create the scripts. Required programs (SeqEM
and R) were installed locally on this laptop. As the amount of TCGA data was very large, it
was saved on the athos server of the BioBix lab. The analysis of the TCGA data was performed
on this server as well. Athos is a Linux server with 128 GB RAM and 16 processors.
2.2 Software
2.2.1 R
R is a software environment and programming language developed by Ihaka and Gentleman. 117 This open source environment offers many different statistical and graphical techniques and is hence very practical in data analysis and statistical software development. Furthermore, supplementary user-created packages can be accessed through the Comprehensive
R Archive Network (CRAN). This gives users the opportunity to extend the R environment
and to add specific functions. In addition to the fact that R is a simple programming language, one of the main advantages is its ability to create high-quality graphs. These features
make R one of the most widely used environments for statistical computing. In this thesis the
different methodologies for detection of (differential) imprinting were created using R v3.1.1.
2.2.2 SeqEM
SeqEM is a genotype-calling algorithm based on the Expectation-Maximisation algorithm in
which the allele frequencies and error rate are calculated in an iterative way. For more detailed
information see section 1.6.2. In this thesis SeqEM v1.0 was used to estimate
the necessary parameters.
2.3 Statistical methods relevant to this master’s thesis
In this thesis several statistical methods will be used to detect (differential) imprinting. The
most important ones are the Kolmogorov-Smirnov test (KS test), the Wilcoxon Rank Sum
test (WRS test) and the likelihood ratio test (LRT). How these tests are implemented will be
discussed later. However, here the general idea behind the tests is explained.
2.3.1 The Kolmogorov-Smirnov test
The two-sample Kolmogorov-Smirnov test is a non-parametric test that compares the cumulative distribution functions of two independent samples. 118,119 By looking at the largest
difference in cumulative distributions, the test determines if the two distributions are equal or
not. 120 The null hypothesis states that the two sample distributions are equal. This hypothesis is rejected when the largest difference between the two cumulative distribution functions is
greater than a critical value, based on the Kolmogorov distribution of this difference under the
null hypothesis. 121 The Kolmogorov-Smirnov test is sensitive to varying differences between
distributions, such as differences in location and shape. 119
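The test statistic described above can be illustrated with a minimal Python sketch (the thesis itself relied on R, so this is only an illustration). The function name `ks_statistic` is hypothetical, and only the statistic is computed, not the p-value from the Kolmogorov distribution.

```python
def ks_statistic(sample1, sample2):
    """Largest absolute difference between the two empirical CDFs."""
    n1, n2 = len(sample1), len(sample2)
    largest = 0.0
    # The empirical CDFs only change at observed values, so checking
    # the pooled set of observed values is sufficient.
    for v in sorted(set(sample1) | set(sample2)):
        cdf1 = sum(x <= v for x in sample1) / n1
        cdf2 = sum(x <= v for x in sample2) / n2
        largest = max(largest, abs(cdf1 - cdf2))
    return largest
```

Identical samples give a statistic of 0, whereas two samples without any overlap give the maximum value of 1.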
2.3.2 The Wilcoxon Rank Sum test
Another non-parametric test to assess differences between two samples is the Wilcoxon Rank Sum
test (also called the Mann-Whitney U test). More specifically, assuming similarly shaped
distributions, it tests whether the median of one of the independent populations is larger than that of the
other one. 122 This is done by ranking all sample values jointly and then comparing the
rank sums of the two samples. 123 Under the null hypothesis, the two samples come from identical distributions. This means that
the ranks will be randomly distributed over the two samples. However, the null hypothesis will be rejected if those ranks are not randomly
distributed and one sample tends to have larger values, in which case the two samples do not
have the same distribution. 124 Because it assesses differences between two samples in a different way than the
Kolmogorov-Smirnov test, the Wilcoxon Rank Sum test was implemented as well.
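As a hedged illustration of the ranking step, the sketch below computes the rank sum W of the first sample and the equivalent Mann-Whitney U statistic. The function name is hypothetical, tied values receive the mean of their ranks (a common convention that is assumed here), and the p-value calculation is left out.

```python
def rank_sum_statistics(sample1, sample2):
    """Return (W, U): rank sum of sample1 and the Mann-Whitney U statistic."""
    pooled = sorted(list(sample1) + list(sample2))
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i+1 .. j (1-based) hold the same value: assign their mean.
        ranks[pooled[i]] = (i + 1 + j) / 2
        i = j
    w = sum(ranks[x] for x in sample1)
    n1 = len(sample1)
    u = w - n1 * (n1 + 1) / 2
    return w, u
```

U ranges from 0 (all values of the first sample are smaller) to n1 × n2 (all values of the first sample are larger).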
2.3.3 The likelihood principle
2.3.3.1 The likelihood function
The likelihood function was first described by Fisher around 1920. 125 Nowadays likelihood
and probability are often used as synonyms; in statistics, however, they are distinct concepts.
Probability is the chance that a specific event occurs. To understand the principle of likelihood, on the other hand, the likelihood function has to be defined. Consider a sample
X = (x1, x2, ..., xn) depending upon a set of parameters θ1, θ2, ..., θm in the parameter space
Θ, with f(x|θ) the probability function. The likelihood function (often simply likelihood) of
θ given the outcome X = x is then defined as: 126
$$L(\theta \mid x) = f(x \mid \theta) \tag{2.1}$$
Further distinction has to be made between X being a discrete or continuous vector. In the
former case X is described by a discrete probability function (a probability mass function)
and the likelihood becomes:
$$L(\theta \mid x) = P_\theta(X = x) \tag{2.2}$$
If, however, X is a continuous vector, it is described by a probability density function for
which P(X = x) = 0. The likelihood function of θ is however not equal to zero but:
$$L(\theta \mid x) = f_\theta(x) \tag{2.3}$$
Hence, likelihood can be seen as the likeliness to observe a specific sample as a function of
the different parameter values. 127
2.3.3.2 Maximum likelihood estimation
In parameter estimation, an often used technique is maximum likelihood estimation (MLE). It
selects, out of a range of possible parameter values, the estimates that maximise the
likelihood function given a data set. 128,129 Let us take another look at the set of variables
X = (x1, x2, ..., xn). Suppose they are all independent and identically distributed. The joint
distribution is then described as: 130
$$f(x_1, x_2, \ldots, x_n \mid \theta) = f(x_1 \mid \theta) \times f(x_2 \mid \theta) \times \cdots \times f(x_n \mid \theta) \tag{2.4}$$
The likelihood then becomes:
$$L(\theta \mid X) = \ell(\theta \mid x_1) \times \ell(\theta \mid x_2) \times \cdots \times \ell(\theta \mid x_n) \tag{2.5}$$
To estimate θ̂ out of the set of possible parameters Θ, the maximum likelihood can be determined. This is done by retaining the θ which maximises the likelihood function described
above:
$$L(\hat{\theta} \mid X) = \sup_{\theta \in \Theta} L(\theta \mid X) = \sup_{\theta \in \Theta} \prod_{i=1}^{n} \ell(\theta \mid x_i) \tag{2.6}$$
However, often it is easier to use the logarithmic likelihood function. Rewriting function 2.6
using the equivalent log-likelihood function gives:
$$\ln L(\hat{\theta} \mid X) = \sup_{\theta \in \Theta} \ln L(\theta \mid X) = \sup_{\theta \in \Theta} \sum_{i=1}^{n} \ln \ell(\theta \mid x_i) \tag{2.7}$$
This logarithmic likelihood function is typically preferred in MLE because calculations are easier
for summations than for products. Furthermore, a product of many small likelihoods becomes so
small that it approximates zero. This problem is avoided if the sum of their logarithmic
values is determined instead.
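The numerical argument can be made concrete with a small sketch. It assumes simple 0/1 (Bernoulli) observations rather than the sequencing model of this thesis, and the function names are hypothetical. The plain product of the individual likelihoods underflows to zero in double precision, whereas the log-likelihood remains finite and can still be maximised over a grid of parameter values.

```python
import math

def log_likelihood(p, data):
    """Log-likelihood of success probability p for 0/1 observations."""
    return sum(math.log(p if x == 1 else 1.0 - p) for x in data)

def mle_grid(data, steps=100):
    """Grid search for the p in (0, 1) that maximises the log-likelihood."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda p: log_likelihood(p, data))

data = [1] * 1500 + [0] * 3500          # 30% successes
p_hat = mle_grid(data)

# Multiplying 5000 probabilities directly underflows to exactly 0.0.
plain_product = math.prod(p_hat if x == 1 else 1.0 - p_hat for x in data)
```

Here `p_hat` equals the observed fraction of successes (0.3), which is the analytical maximum likelihood estimate for this model.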
2.3.3.3 The likelihood ratio test
Now consider the same sample as before. Suppose θ̂0 is the MLE of θ in parameter space
Θ0 (with Θ0 ⊂ Θ) and θ̂1 is the MLE of θ in parameter space Θ. 131 The null and alternative
hypotheses for obtaining the best model are:
$$H_0: \theta \in \Theta_0 \qquad H_1: \theta \in \Theta \tag{2.8}$$
Testing H0 against H1 can be done using the likelihood ratio test (LRT): 132
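$$\Lambda(X) = \frac{L(\hat{\theta}_0 \mid X)}{L(\hat{\theta}_1 \mid X)} = \frac{f(X \mid \hat{\theta}_0)}{f(X \mid \hat{\theta}_1)} = \frac{\sup_{\Theta_0} L(\theta_0 \mid X)}{\sup_{\Theta} L(\theta \mid X)} \tag{2.9}$$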
Because Θ0 ⊂ Θ, it follows that 0 ≤ Λ(X) ≤ 1. H0 is rejected when Λ(X) < c with c a specific constant
(0 < c < 1). If the two models are nested, the test statistic −2 ln Λ(X) can be used. 129,131
Nested models are models in which the null hypothesis is a special case of the alternative
hypothesis. −2 ln Λ(X) is then asymptotically χ² distributed with q degrees of freedom, where q is
the difference between the number of free parameters under H1 and H0. The null hypothesis
will then be rejected at significance level α if −2 ln Λ(X) exceeds the (1 − α) quantile of the χ² distribution with q degrees of freedom.
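A minimal worked example of a nested likelihood ratio test is sketched below. It assumes a simple binomial model (H0: p = 0.5 against H1: p unrestricted, hence q = 1), which is not the model developed later in this thesis; the function names are hypothetical. The χ² tail probability for one degree of freedom is obtained from the complementary error function.

```python
import math

def binom_loglik(p, k, n):
    """Binomial log-likelihood; the binomial coefficient cancels in the ratio."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def lrt_fixed_p(k, n, p0=0.5):
    """Return (-2 ln Lambda, p-value) for H0: p = p0; requires 0 < k < n."""
    p_hat = k / n                                    # MLE under H1
    stat = -2.0 * (binom_loglik(p0, k, n) - binom_loglik(p_hat, k, n))
    stat = max(stat, 0.0)                            # guard against rounding
    # Chi-square with 1 df: P(X > s) = erfc(sqrt(s / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))
```

For 70 successes in 100 trials the statistic is about 16.46, far above the 5% critical value of 3.84, so H0 would be rejected.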
2.4 Detection of (differential) imprinting
In this master’s thesis a method to detect imprinting and loss of imprinting in next-generation
sequencing data was developed. First, random sequencing data were generated to test different
methodologies. These simulated data were used to check the correctness and effectiveness of
different methods. Upon technical optimisation, these methods were used for studying real
data obtained from TCGA. The scripts implementing the described methodologies can be found
in the appendix. However, as the idea behind the methodologies is more important, the actual
scripts will not be discussed further.
2.4.1 Generation of random data and estimation of the allele frequency and error rate
The first step in this thesis was the generation of random SNP data. For simplicity only two
alleles, namely A and T, were considered here. A random allele frequency between 0 and 1
was generated for allele A (PA) using a uniform distribution, though often a predetermined
allele frequency was used as well (indicated in the relevant sections). The frequency of T
(PT) was calculated as 1 − PA. These two frequencies were subsequently used to create the
random SNP data. First, for each sample the genotype of the SNP locus was determined by
sampling two alleles out of a pool of alleles A and T with chances of taking a specific allele
equal to their frequency. In addition, for some tests, a random coverage was chosen for all
samples. This value was obtained by again using a uniform distribution with a predetermined
minimum and maximum value. By sampling a number of alleles equal to this coverage out
of the alleles in the genotype (with a probability of taking a specific allele equal to 0.5) the
SNP sequences were generated for all samples.
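The generation procedure described above can be sketched as follows. This is an illustrative Python version, not the R script used in the thesis; `simulate_sample` and its arguments are hypothetical names, and the coverage is assumed to be a uniformly drawn integer.

```python
import random

def simulate_sample(p_a, min_cov, max_cov, rng):
    """Draw a genotype and read-level alleles for one sample at one SNP."""
    # Genotype: two alleles drawn with probabilities P_A and P_T = 1 - P_A.
    genotype = [("A" if rng.random() < p_a else "T") for _ in range(2)]
    # Coverage: uniform integer between the chosen minimum and maximum.
    coverage = rng.randint(min_cov, max_cov)
    # Reads: each read copies one of the two genotype alleles (probability 0.5).
    reads = [rng.choice(genotype) for _ in range(coverage)]
    return "".join(sorted(genotype)), reads

rng = random.Random(1)
samples = [simulate_sample(0.3, 10, 100, rng) for _ in range(100)]
```

Each element of `samples` is a pair of a genotype ("AA", "AT" or "TT") and the list of sequenced alleles for that sample.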
After generation of the random SNP data, imprinting was included in a subset of the samples. The remaining sample sequences were left as generated above, representing the biallelically
expressed samples.
Imprinting means that one allele is not expressed, which is called monoallelic expression. Thus, for
heterozygotes the coverage of one random allele was deleted. If, on the other hand, the sample
was homozygous, half of the coverage was randomly removed. Note that as samples are often
heterogeneous, less than 100% imprinting had to be possible as well. For example, suppose a
sample is 50% imprinted. In homozygous samples only 25% instead of 50% of the coverage
was then deleted. For heterozygotes half of the coverage of one random allele was removed.
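A possible implementation of this read-removal step is sketched below. The function name and signature are hypothetical, and the number of removed reads is rounded to the nearest integer, which is an assumption not stated in the text.

```python
import random

def apply_imprinting(reads, genotype, degree, rng):
    """Remove reads to mimic imprinting of the given degree (0 to 1)."""
    if genotype[0] != genotype[1]:
        # Heterozygote: silence a fraction `degree` of one random allele.
        silenced = rng.choice(genotype)
        candidates = [k for k, r in enumerate(reads) if r == silenced]
        n_remove = round(degree * len(candidates))
    else:
        # Homozygote: half of the reads stem from the silenced copy.
        candidates = list(range(len(reads)))
        n_remove = round(degree * len(reads) / 2)
    dropped = set(rng.sample(candidates, n_remove))
    return [r for k, r in enumerate(reads) if k not in dropped]
```

With `degree=0.5`, a homozygous sample of 40 reads keeps 30 reads (25% removed), in line with the example in the text.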
As current DNA sequencing techniques are still error-prone, sequencing errors
were included in the generated data to approach reality as closely as possible.
For each random sequence, the number of erroneous alleles was sampled
from a binomial distribution with a probability of success (i.e. the probability of an error)
equal to a predetermined error rate and a number of trials equal to the sequence length. That many alleles
were then replaced by the other allele (A or T) in the random SNP data. As this procedure makes
use of only two possible alleles, it does not yet reflect a realistic situation: when a sequencing
error is made, any of three other alleles can replace the correct one. Where relevant
in the manuscript, an allele was therefore replaced by one randomly chosen allele out of the other three
to account for the possible occurrence of multiple alleles in case of a sequencing error.
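The error-insertion step can be sketched as below, again as a hypothetical Python illustration. The binomial draw is implemented as a sum of independent Bernoulli trials, and the `alphabet` argument switches between the two-allele case ("AT") and the four-allele case ("ACGT").

```python
import random

def add_sequencing_errors(reads, error_rate, rng, alphabet="AT"):
    """Replace a binomially distributed number of reads by another allele."""
    # Binomial draw: every read is one trial with P(error) = error_rate.
    n_errors = sum(rng.random() < error_rate for _ in reads)
    noisy = list(reads)
    for k in rng.sample(range(len(reads)), n_errors):
        # Pick a replacement among the alleles that differ from the true one.
        others = [a for a in alphabet if a != noisy[k]]
        noisy[k] = rng.choice(others)
    return noisy
```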
Afterwards this random dataset was used in SeqEM. To enable prediction of the genotypes of
a SNP locus, the allele frequencies and error rate are estimated. The obtained values of these
parameters were then compared to the real values used to create the random SNP data.
However, SeqEM can only use sequences with two alleles (a reference allele and a variant allele), hence a prior correction of the sequences was necessary for those parts of the manuscript
where four alleles are considered. In the simulation studies, this correction was based on the
highest population allele frequency. (The filtering procedure used for real data is described in
section 2.4.4.2.) If only one or two alleles were present in the sequencing data of a sample, no
prior correction was done. If, however, there were more than two alleles in the sequence, the
coverage of one or more alleles was deleted from the sample. When the count of the second
most frequent allele differed from that of the third, all third (and fourth) alleles were removed.
If the counts of the second and third (and in some cases the fourth) alleles were the same,
the allele among them with the highest population allele frequency, i.e. over all samples, was retained.
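A compact sketch of this correction is given below. The function name and the `population_freq` dictionary are hypothetical. Ties in the sample counts are broken by population allele frequency; the text does not specify what happens when the most frequent allele is itself tied, so applying the same tie-break there is an assumption.

```python
from collections import Counter

def keep_two_alleles(reads, population_freq):
    """Reduce the reads of one sample to at most two alleles."""
    counts = Counter(reads)
    if len(counts) <= 2:
        return list(reads)                 # nothing to correct
    # Rank by count in the sample; ties are broken by population frequency.
    ranked = sorted(counts,
                    key=lambda a: (counts[a], population_freq[a]),
                    reverse=True)
    kept = set(ranked[:2])
    return [r for r in reads if r in kept]
```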
2.4.2 Kolmogorov-Smirnov and Wilcoxon Rank Sum test on imprinted SNP data
Differential imprinting can be tested for by comparing the sequencing data of two
subsets of samples (e.g. tumour cases and controls) in two different ways. A first approach was
to perform a binomial test on each sample. This test calculates the probability that the observed
sequence (the sequencing data of the locus for one sample) was generated with an allele frequency
equal to 0.5. For heterozygous samples the obtained p-value will be (approximately) 1 and
for the homozygous ones the values will be close to 0. The binomial test can be performed
for both subsets of samples, resulting in two sets of p-values for each SNP locus. These two
sets are subsequently compared with a Kolmogorov-Smirnov (KS) test or a Wilcoxon Rank
Sum (WRS) test. If a subset is imprinted all p-values can be expected to be around 0 as the
samples become apparently homozygous. If loss of imprinting occurs in the other subset, the
p-values will be close to 1 due to biallelic expression. Homozygotes are, however, also possible
which will still result in p-values of approximately 0. Hence, the distributions of both sets of
p-values will be different which can be tested by the KS and WRS test, and loss of imprinting
can be concluded if the null hypothesis of equal distributions is rejected. On the other hand,
if both subsets are imprinted (and assuming that there are no underlying genetic differences
between both subsets), the distributions of the two sets of p-values will be similar and the
KS and WRS test will not detect any significant differences between the two subsets.
Secondly, instead of calculating binomial p-values, the ratio of the count of the least frequent allele to that of the most frequent allele in a sequence can be calculated.
Homozygous samples will lead to a ratio of approximately 0, while the ratios for heterozygous samples will be around 1. As before, the ratios
obtained from the two subsets can be compared with a KS or WRS test.
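The two per-sample summaries can be sketched as follows. The binomial test is implemented here as an exact two-sided test against an allele probability of 0.5; whether the thesis scripts used exactly this variant is an assumption, and the function names are hypothetical. Both functions expect the allele counts of one sample with a coverage larger than zero.

```python
import math

def binomial_p_value(n_a, n_t):
    """Exact two-sided binomial test of H0: both alleles have probability 0.5."""
    n = n_a + n_t
    observed = math.comb(n, n_a)
    # Under p = 0.5 every outcome k has weight C(n, k) / 2^n; sum the
    # outcomes that are at most as likely as the observed one.
    tail = sum(math.comb(n, k) for k in range(n + 1)
               if math.comb(n, k) <= observed)
    return tail / 2 ** n

def allele_ratio(n_a, n_t):
    """Count of the least frequent allele over that of the most frequent one."""
    return min(n_a, n_t) / max(n_a, n_t)
```

The resulting values per subset (e.g. controls and tumours) are then the two inputs of the KS or WRS test.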
2.4.3 The likelihood ratio test to detect (differential) imprinting
A methodology, based on MLE and LRT, to detect (differential) imprinting was developed.
In order to use likelihood ratio tests, the probability mass function (PMF) describing the
probability of observing specific coverages for each allele for the locus under study has to be
established. These probabilities depend on the underlying genotypes, making it straightforward to establish the PMF as a mixture of genotype dependent probability mass functions.
For each specific genotype, data can be modelled using the probability mass function of the
multinomial distribution, with probabilities for each allele depending on genotype, sequencing
error rate and degree of imprinting. The PMF was developed based on the Hardy-Weinberg
theorem, which provides the expected weights for each subset of samples (homozygous and
heterozygous subsets) in the mixture. For simplicity, and as SeqEM can only handle two
alleles per locus, here we only consider two alleles, denoted as A and T, yet the PMF can
be easily extended towards four alleles by considering a mixture of multinomial distributions
instead of binomial distributions. The PMF has been constructed as follows:
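$$
\begin{aligned}
PMF(x) ={} & P_A^2 \, B(x;\ p_A = 1 - SE,\ p_T = SE) + P_T^2 \, B(x;\ p_A = SE,\ p_T = 1 - SE) \\
& + P_A P_T \, B\!\left(x;\ p_A = \frac{0.5 - i/2}{1 - i/2}(1 - SE) + \frac{0.5}{1 - i/2} SE,\ p_T = \frac{0.5}{1 - i/2}(1 - SE) + \frac{0.5 - i/2}{1 - i/2} SE\right) \\
& + P_A P_T \, B\!\left(x;\ p_A = \frac{0.5}{1 - i/2}(1 - SE) + \frac{0.5 - i/2}{1 - i/2} SE,\ p_T = \frac{0.5 - i/2}{1 - i/2}(1 - SE) + \frac{0.5}{1 - i/2} SE\right)
\end{aligned}
\tag{2.10}
$$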
The rationale for each individual component of this PMF is given below.
2.4.3.1 PMF calculation: population allele frequency, binomial test and sequencing error rate
In this formula, x represents the coverages for alleles A and T, i.e. x = (nA , nT ). PA and
PT are the population allele frequencies for a specific locus (over all samples), which can
be obtained from SeqEM. B(x; pA , pT ) represents the binomial probability for x given the
probabilities for each allele, pA and pT (note that these are not the same as the population
allele frequencies PA and PT ), which depend on genotype, sequencing error rate (SE) and
degree of imprinting (i). Note that this is a slightly different representation than typically
used for a binomial distribution, though it is straightforward to see that, here, the chance of
"success" equals pA (indeed, pA + pT = 1) and the total number of "trials" equals nA + nT.
From a practical point of view, the binomial coefficient is the same for each binomial distribution in the mixture and here equal to:
$$b = \frac{(n_A + n_T)!}{n_A! \, n_T!} \tag{2.11}$$
Hence, this value will only be calculated once for each sample.
For homozygotes, potential imprinting cannot be observed in the allele coverages and can
therefore not be taken into account. Thus, the binomial probability will depend on SE only,
which can be obtained from SeqEM. As this error rate is assumed to be equal for all
loci but may be ill-estimated when imprinting is present, the median SE over all loci is used.
For the homozygote AA, for example, the chance of observing allele A is equal to 1-SE. The
probability of T (pT ) on the other hand is equal to SE because this allele can only be obtained
due to a sequencing error. These probabilities are then used in the following formula:
$$P(n_A, n_T) = b \, p_A^{n_A} \, p_T^{n_T} \tag{2.12}$$
Thus for homozygous samples AA this becomes:
$$P(n_A, n_T) = b \, (1 - SE)^{n_A} \, SE^{n_T} \tag{2.13}$$
In the PMF this value will be multiplied by the probability of being homozygous AA, i.e. the
square of the respective population allele frequency (in the example PA², assuming Hardy-Weinberg equilibrium).
For heterozygotes (here AT), potential imprinting has to be considered. Therefore, an imprinting factor, i, is included in the formula. This value can vary from 0 (no imprinting)
to 1 (fully imprinted sample) and can be estimated using maximum likelihood estimation
(see below). Without imprinting, pA and pT are both theoretically equal to 0.5, as both
alleles are expressed to the same extent. Yet, when imprinting is present, the probability of
observing the imprinted allele is reduced by i/2, from 0.5 to 0.5 − i/2. As the probabilities for both
alleles need to sum to one, both probabilities are normalised by division by 1 − i/2 (i.e.
0.5 + 0.5 − i/2). Finally, sequencing error rates also need to be taken into account, implying
that a fraction SE of the normalised probability for one allele will be observed as the other allele and vice versa. If allele A is imprinted, the probability of observing allele A (pA) then equals
((0.5 − i/2)/(1 − i/2))(1 − SE) + (0.5/(1 − i/2))SE, while for probability pT this becomes
(0.5/(1 − i/2))(1 − SE) + ((0.5 − i/2)/(1 − i/2))SE, yielding the following probability:
$$
P(n_A, n_T \mid A\ \text{imprinted}) = b \left[\frac{0.5 - i/2}{1 - i/2}(1 - SE) + \frac{0.5}{1 - i/2} SE\right]^{n_A} \left[\frac{0.5}{1 - i/2}(1 - SE) + \frac{0.5 - i/2}{1 - i/2} SE\right]^{n_T} \tag{2.14}
$$
However, for heterozygous samples the other imprinted allele has to be considered as well,
yielding:
$$
P(n_A, n_T \mid T\ \text{imprinted}) = b \left[\frac{0.5}{1 - i/2}(1 - SE) + \frac{0.5 - i/2}{1 - i/2} SE\right]^{n_A} \left[\frac{0.5 - i/2}{1 - i/2}(1 - SE) + \frac{0.5}{1 - i/2} SE\right]^{n_T} \tag{2.15}
$$
As for homozygotes, the binomial probability for the heterozygous fraction has to be multiplied by the genotype frequency, 2PA PT. As, based on the underlying biology, both alleles
can be assumed to have an equal chance of being imprinted (50%), each of the two heterozygous components receives a weight of PA PT, which leads to the mixture PMF
in Formula 2.10.
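Formula 2.10 translates almost directly into code. The sketch below is an illustrative Python version (the thesis scripts were written in R); `het_probs` and `pmf` are hypothetical names, `p_a` stands for PA, `se` for SE and `i` for the degree of imprinting.

```python
import math

def het_probs(i, se):
    """P(observing allele A) in a heterozygote: (A imprinted, T imprinted)."""
    low = (0.5 - i / 2) / (1 - i / 2)      # imprinted allele, normalised
    high = 0.5 / (1 - i / 2)               # expressed allele, normalised
    return low * (1 - se) + high * se, high * (1 - se) + low * se

def pmf(n_a, n_t, p_a, se, i):
    """Mixture PMF of Formula 2.10 for the coverages (n_a, n_t)."""
    p_t = 1 - p_a
    b = math.comb(n_a + n_t, n_a)          # shared binomial coefficient

    def binom(prob_a):
        return b * prob_a ** n_a * (1 - prob_a) ** n_t

    a_imprinted, t_imprinted = het_probs(i, se)
    return (p_a ** 2 * binom(1 - se)             # genotype AA
            + p_t ** 2 * binom(se)               # genotype TT
            + p_a * p_t * binom(a_imprinted)     # AT, allele A imprinted
            + p_a * p_t * binom(t_imprinted))    # AT, allele T imprinted
```

Because the genotype weights sum to one, the PMF summed over all possible splits of a fixed coverage also equals one, which is a convenient sanity check.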
When the more complicated case of four alleles is considered, the multinomial distribution
has to be included in the PMF:
$$
\begin{aligned}
PMF(x) ={} & P_A^2 \, M(x;\ p_A = 1 - SE,\ p_T = SE/3,\ p_C = SE/3,\ p_G = SE/3) \\
& + P_T^2 \, M(x;\ p_A = SE/3,\ p_T = 1 - SE,\ p_C = SE/3,\ p_G = SE/3) \\
& + P_C^2 \, M(x;\ p_A = SE/3,\ p_T = SE/3,\ p_C = 1 - SE,\ p_G = SE/3) \\
& + P_G^2 \, M(x;\ p_A = SE/3,\ p_T = SE/3,\ p_C = SE/3,\ p_G = 1 - SE) \\
& + P_A P_T \, M(x;\ p_A = c_1,\ p_T = c_2,\ p_C = SE/3,\ p_G = SE/3) + P_A P_T \, M(x;\ p_A = c_2,\ p_T = c_1,\ p_C = SE/3,\ p_G = SE/3) \\
& + P_A P_C \, M(x;\ p_A = c_1,\ p_T = SE/3,\ p_C = c_2,\ p_G = SE/3) + P_A P_C \, M(x;\ p_A = c_2,\ p_T = SE/3,\ p_C = c_1,\ p_G = SE/3) \\
& + P_A P_G \, M(x;\ p_A = c_1,\ p_T = SE/3,\ p_C = SE/3,\ p_G = c_2) + P_A P_G \, M(x;\ p_A = c_2,\ p_T = SE/3,\ p_C = SE/3,\ p_G = c_1) \\
& + P_T P_C \, M(x;\ p_A = SE/3,\ p_T = c_1,\ p_C = c_2,\ p_G = SE/3) + P_T P_C \, M(x;\ p_A = SE/3,\ p_T = c_2,\ p_C = c_1,\ p_G = SE/3) \\
& + P_T P_G \, M(x;\ p_A = SE/3,\ p_T = c_1,\ p_C = SE/3,\ p_G = c_2) + P_T P_G \, M(x;\ p_A = SE/3,\ p_T = c_2,\ p_C = SE/3,\ p_G = c_1) \\
& + P_C P_G \, M(x;\ p_A = SE/3,\ p_T = SE/3,\ p_C = c_1,\ p_G = c_2) + P_C P_G \, M(x;\ p_A = SE/3,\ p_T = SE/3,\ p_C = c_2,\ p_G = c_1)
\end{aligned}
$$
With
$$
c_1 = \frac{0.5 - i/2}{1 - i/2}(1 - SE) + \frac{0.5}{1 - i/2} \cdot \frac{SE}{3}, \qquad
c_2 = \frac{0.5}{1 - i/2}(1 - SE) + \frac{0.5 - i/2}{1 - i/2} \cdot \frac{SE}{3} \tag{2.16}
$$
The binomial distribution now becomes a multinomial distribution, with the multinomial
coefficient calculated as:
$$m = \frac{(n_A + n_T + n_C + n_G)!}{n_A! \, n_T! \, n_C! \, n_G!} \tag{2.17}$$
After calculation of the multinomial coefficient, the multinomial probability is calculated as:
$$P(n_A, n_T, n_C, n_G) = m \, p_A^{n_A} \, p_T^{n_T} \, p_C^{n_C} \, p_G^{n_G} \tag{2.18}$$
For example, for the AA homozygote this becomes:
$$P(n_A, n_T, n_C, n_G) = m \, (1 - SE)^{n_A} \, (SE/3)^{n_T} \, (SE/3)^{n_C} \, (SE/3)^{n_G} \tag{2.19}$$
The multinomial probabilities for the heterozygotes are (example for genotype AT, with allele A
imprinted):
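$$
P(n_A, n_T, n_C, n_G \mid A\ \text{imprinted}) = m \left[\frac{0.5 - i/2}{1 - i/2}(1 - SE) + \frac{0.5}{1 - i/2} \cdot \frac{SE}{3}\right]^{n_A} \left[\frac{0.5}{1 - i/2}(1 - SE) + \frac{0.5 - i/2}{1 - i/2} \cdot \frac{SE}{3}\right]^{n_T} \left(\frac{SE}{3}\right)^{n_C} \left(\frac{SE}{3}\right)^{n_G} \tag{2.20}
$$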
However, for heterozygous samples the other imprinted allele has to be considered as well,
yielding:
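$$
P(n_A, n_T, n_C, n_G \mid T\ \text{imprinted}) = m \left[\frac{0.5}{1 - i/2}(1 - SE) + \frac{0.5 - i/2}{1 - i/2} \cdot \frac{SE}{3}\right]^{n_A} \left[\frac{0.5 - i/2}{1 - i/2}(1 - SE) + \frac{0.5}{1 - i/2} \cdot \frac{SE}{3}\right]^{n_T} \left(\frac{SE}{3}\right)^{n_C} \left(\frac{SE}{3}\right)^{n_G} \tag{2.21}
$$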
As in the two-allele situation, these values will be multiplied by their corresponding genotype frequency (assuming Hardy-Weinberg equilibrium).
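As a small consistency check of the four-allele case, the sketch below computes the heterozygote allele probabilities c1 and c2 of Formula 2.16 together with the two error-only probabilities SE/3. The function name is hypothetical. For every degree of imprinting and error rate the four probabilities should sum to one.

```python
def four_allele_het_probs(i, se):
    """(c1, c2, p_third, p_fourth) for a heterozygote; first allele imprinted."""
    low = (0.5 - i / 2) / (1 - i / 2)      # imprinted allele, normalised
    high = 0.5 / (1 - i / 2)               # expressed allele, normalised
    c1 = low * (1 - se) + high * se / 3    # imprinted allele incl. errors
    c2 = high * (1 - se) + low * se / 3    # expressed allele incl. errors
    return c1, c2, se / 3, se / 3
```

For a fully imprinted sample (i = 1) this gives c1 = SE/3 and c2 = 1 − SE, i.e. the heterozygote becomes indistinguishable from a homozygote for the expressed allele.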
2.4.3.2 Imprinting factor
To determine the amount of imprinting (varying from not to fully imprinted) maximum
likelihood estimation is used:
$$\hat{i} = \arg\max_{i} \prod_{a=1}^{n} PMF(x_a) \tag{2.22}$$
This is done in R via a line search. In summary, for each locus i is varied from 0 to 1 (in
steps of 0.01) and the i corresponding to the highest likelihood is retained. The likelihood
is calculated as the product of the PMF derived probabilities for every sample per locus.
However, because of the potentially small probabilities the sum of the logarithmic values is
preferred here:
$$\hat{i} = \arg\max_{i} \sum_{a=1}^{n} \log\big(PMF(x_a)\big) \tag{2.23}$$
Hence, for every locus an imprinting value is obtained. Depending on whether imprinting or differential imprinting is tested for, control and tumour samples are taken together or separately
to estimate i.
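The line search can be sketched as follows. This is a hypothetical Python rendering of the procedure (the thesis used R): `pmf` restates the two-allele mixture of Formula 2.10 so that the example is self-contained, and `samples` is a list of (nA, nT) coverage pairs for one locus.

```python
import math

def pmf(n_a, n_t, p_a, se, i):
    """Two-allele mixture PMF (Formula 2.10)."""
    low, high = (0.5 - i / 2) / (1 - i / 2), 0.5 / (1 - i / 2)
    a_imprinted = low * (1 - se) + high * se     # P(read A), A imprinted
    t_imprinted = high * (1 - se) + low * se     # P(read A), T imprinted
    b = math.comb(n_a + n_t, n_a)
    weights = (p_a ** 2, (1 - p_a) ** 2, p_a * (1 - p_a), p_a * (1 - p_a))
    probs = (1 - se, se, a_imprinted, t_imprinted)
    return sum(w * b * p ** n_a * (1 - p) ** n_t
               for w, p in zip(weights, probs))

def log_likelihood(samples, p_a, se, i):
    """Sum of log PMF values over all samples of a locus (Formula 2.23)."""
    return sum(math.log(pmf(n_a, n_t, p_a, se, i)) for n_a, n_t in samples)

def estimate_i(samples, p_a, se, steps=100):
    """Line search over i = 0, 0.01, ..., 1 for the highest log-likelihood."""
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda i: log_likelihood(samples, p_a, se, i))
```

As a usage example, a locus where all samples look homozygous although PA = 0.5 (e.g. 50 samples with coverages (30, 0) and 50 with (0, 30)) yields an estimate of 1, whereas a locus with the expected share of balanced heterozygotes yields 0.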
2.4.3.3 Likelihood ratio test
In order to test for (differential) imprinting a likelihood ratio test is performed. Two different
tests are needed to check for either (a) presence of imprinting over all samples or (b) differential
imprinting between two subsets of samples.
a) For detection of imprinting the null and alternative hypotheses are:
H0 : the locus is not imprinted
H1 : the locus is imprinted
This can also be written as:
H0 : i = 0
H1 : i > 0
In a first step the PMF for a locus is calculated with i equal to 0. Next, i is estimated for
the same locus as explained above and the corresponding PMF is used in the LRT:
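$$
\begin{aligned}
\Lambda(X) &= \frac{L(H_0 \mid X)}{L(H_1 \mid X)} = \frac{f(X \mid H_0)}{f(X \mid H_1)} \\
&= \frac{PMF(x_1, i = 0) \, PMF(x_2, i = 0) \cdots PMF(x_n, i = 0)}{PMF(x_1, i = \hat{i}) \, PMF(x_2, i = \hat{i}) \cdots PMF(x_n, i = \hat{i})} \\
&= \frac{\prod_{a=1}^{n} PMF(x_a, i = 0)}{\prod_{a=1}^{n} PMF(x_a, i = \hat{i})}
\end{aligned}
\tag{2.24}
$$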
As can be seen in the formula, the null hypothesis is a special case of the alternative
hypothesis and thus this is a nested model. Hence, the test statistic for nested models,
−2ln(Λ), can be used:
$$-2 \ln \Lambda(X) = -2 \ln \frac{\prod_{a=1}^{n} PMF(x_a, i = 0)}{\prod_{a=1}^{n} PMF(x_a, i = \hat{i})} \tag{2.25}$$
Under standard conditions this test statistic is asymptotically χ² distributed, with degrees of freedom equal to the difference in number of
estimated parameters between H1 and H0 (here 1), and H0 is rejected if the value exceeds the
corresponding critical value at significance level α. However, as we are testing at the border (at i equal to
0) of a constrained parameter space, a mixture of two null distributions is necessary. For this situation
it has been demonstrated that, under the null hypothesis, the test statistic is distributed
as a mixture of 50% χ² with 0 degrees of freedom and 50% χ² with 1 degree of freedom. 133
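The p-value under this 50:50 mixture can be computed as sketched below. The function name is hypothetical and the two log-likelihoods (under i = 0 and under the estimated i) are assumed to have been calculated beforehand. Since the χ² distribution with 0 degrees of freedom is a point mass at zero, only the component with 1 degree of freedom contributes to the tail for a positive statistic.

```python
import math

def mixture_p_value(loglik_h0, loglik_h1):
    """Return (-2 ln Lambda, p-value) under 0.5 * chi2_0 + 0.5 * chi2_1."""
    stat = max(0.0, -2.0 * (loglik_h0 - loglik_h1))
    if stat == 0.0:
        return stat, 1.0          # the chi2_0 component is a point mass at 0
    # Chi-square with 1 df: P(X > s) = erfc(sqrt(s / 2)).
    return stat, 0.5 * math.erfc(math.sqrt(stat / 2.0))
```

For example, a log-likelihood gain of 2 units gives a statistic of 4 and a p-value of about 0.023, half the value obtained from a plain χ² distribution with 1 degree of freedom.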
b) For testing differential imprinting between two groups, the null and alternative hypotheses
are:
H0 : i1 = i2
H1 : i1 ≠ i2
Thus, first a single value for i is estimated from the samples of both groups taken together and used to calculate the PMF. Then two i's are estimated separately for the same
locus, one for the samples of group 1 (= i1) and one for the samples of group 2 (= i2).
The two estimated i's are used to calculate the respective PMF. Finally, the likelihood
ratio test is performed:
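$$
\begin{aligned}
\Lambda(X) &= \frac{L(H_0 \mid X)}{L(H_1 \mid X)} = \frac{f(X \mid H_0)}{f(X \mid H_1)} \\
&= \frac{PMF(x_1, i = \hat{i}) \, PMF(x_2, i = \hat{i}) \cdots PMF(x_{n_1+n_2}, i = \hat{i})}{PMF(x_1, i = \hat{i}_1) \cdots PMF(x_{n_1}, i = \hat{i}_1) \, PMF(x_{n_1+1}, i = \hat{i}_2) \cdots PMF(x_{n_1+n_2}, i = \hat{i}_2)} \\
&= \frac{\prod_{a=1}^{n_1+n_2} PMF(x_a, i = \hat{i})}{\prod_{a=1}^{n_1} PMF(x_a, i = \hat{i}_1) \, \prod_{a=n_1+1}^{n_1+n_2} PMF(x_a, i = \hat{i}_2)}
\end{aligned}
\tag{2.26}
$$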
This is again a nested model. Hence, the test statistic for nested models, −2ln(Λ), can be
used:
$$-2 \ln \Lambda(X) = -2 \ln \frac{\prod_{a=1}^{n_1+n_2} PMF(x_a, i = \hat{i})}{\prod_{a=1}^{n_1} PMF(x_a, i = \hat{i}_1) \, \prod_{a=n_1+1}^{n_1+n_2} PMF(x_a, i = \hat{i}_2)} \tag{2.27}$$
As in the test for detection of imprinting, p-values were calculated based on a 50%/50%
mixture of χ2 distributions with 0 degrees of freedom and 1 degree of freedom. However,
as will be seen below, this does not appear to be a suitable approach.
For both likelihood ratio tests the corresponding p-values are calculated. The null hypothesis
is rejected if this p-value is smaller than 0.05 for a single locus, though in most cases, where many loci are tested, additional
false discovery rate estimation will be required.
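The structure of the differential test can be sketched generically. In the hypothetical Python code below, `differential_lrt_statistic` accepts any log-likelihood function of the form `loglik(samples, i)`; `toy_loglik` is a deliberately simplified stand-in (heterozygotes only, no sequencing errors, known imprinted allele) and is not Formula 2.10. Only the test statistic is returned, as the choice of null distribution is discussed in the text.

```python
import math

def differential_lrt_statistic(loglik, group1, group2, steps=100):
    """-2 ln Lambda for H0: i1 = i2, with i estimated by line search."""
    grid = [k / steps for k in range(steps + 1)]

    def best(samples):
        return max(loglik(samples, i) for i in grid)

    ll_h0 = best(list(group1) + list(group2))    # one shared i
    ll_h1 = best(group1) + best(group2)          # separate i1 and i2
    return max(0.0, -2.0 * (ll_h0 - ll_h1))

def toy_loglik(samples, i):
    """Stand-in log-likelihood; samples are (imprinted, expressed) read counts."""
    p = (0.5 - i / 2) / (1 - i / 2)        # P(read from the imprinted allele)
    p = min(max(p, 1e-9), 1 - 1e-9)        # keep the logarithms finite
    return sum(lo * math.log(p) + hi * math.log(1 - p) for lo, hi in samples)
```

Two groups with the same degree of imprinting give a statistic close to zero, while a fully imprinted group compared with a non-imprinted group gives a very large statistic.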
2.4.3.4 Assumptions
It is important to note that several assumptions were made for these models:
i The assumption is made that the data originate from a population in Hardy-Weinberg
equilibrium and are hence derived from a panmictic population. For samples originating
from a non-panmictic population, with for example inbreeding, the proposed model is not
correct and conclusions should be drawn with caution.
ii In the model to detect imprinting the degree of imprinting is assumed to be equal in all
samples. So, variations in the amount of imprinting between the samples are not possible.
This translates to the assumption that the cell type composition of each tissue is similar.
iii The same assumptions are made in the model for detection of differential imprinting.
Again, the degree of imprinting has to be equal in all samples per subset, with as particular
consequence that the degree of loss of imprinting should be identical in each tumour sample.
As tumours are known to be heterogeneous, it should be evaluated whether this assumption
holds.
2.4.4 TCGA data
2.4.4.1 dbSNP
A total of 113 control samples were downloaded from the TCGA data portal. These data had already
been mapped and pre-processed. In all samples, variants were called in the non-duplicate,
uniquely mapped reads with Samtools mpileup/bcftools (v0.1.19) by ir. Sandra Steyaert. As
a first quality control, only SNP positions that had a read depth higher than 10 in at least
one sample and that were present in dbSNP were kept. Afterwards, the SNP positions for all
samples were merged and the corresponding nucleotide sequences were determined.
2.4.4.2 Prior filtering of the data
To account for sequences that SeqEM cannot handle (sequences with more than two alleles)
a prior filtering step was implemented. As dbSNP was used for SNP calling, the two alleles
considered in the analysis, the standard alleles, were chosen as the dbSNP alleles. When,
however, three dbSNP alleles were present, the standard alleles were chosen as the two alleles
from the dbSNP alleles with the highest mean allele frequencies over all samples.
Subsequently, samples were filtered to retain only those samples featuring one (homozygous)
or both (heterozygous) of the standard alleles. Sequences with only two alleles equal to the
standard alleles were retained. When, however, the most frequent allele in a sample was not a
standard allele, the sample was filtered out. This was to ensure that only high quality samples
were analysed. Lastly, if the allele with the second highest frequency in the sample was not
a standard allele, an empirical Bayes approach was applied to identify and filter out putative
heterozygous samples, i.e. those samples that feature one standard allele but also a
non-standard allele. This procedure is outlined in the following paragraphs.
The probability of obtaining the data of a specific sample given that it is heterozygous (i.e. the likelihood) was
calculated using a multinomial distribution as:
$$P(\text{data} \mid \text{heterozygous}) = M(\text{data};\ p_1, p_2, p_3, p_4) \tag{2.28}$$
with both p1 and p2 equal to the sum of the two highest sample allele frequencies (in percent)
divided by two. Allele three and four could only be obtained due to sequencing errors, p3 and
p4 were hence calculated as the sum of the two lowest sample allele frequencies (in percent)
divided by two. The probability that the sample was a heterozygote was calculated as:
$$P(\text{heterozygote}) = 2pq \tag{2.29}$$
with p the highest mean allele frequency over all samples of the allele which was one of the
standard alleles and q the mean allele frequency of the other allele.
Next, the probability of the sequence given homozygosity was determined based on
a multinomial distribution as well:
$$P(\text{data} \mid \text{homozygous}) = M(\text{data};\ p_1, p_2, p_3, p_4) \tag{2.30}$$
Here, however, p1 was the highest sample allele frequency. Because all other alleles were due
to sequencing errors, p2 , p3 and p4 were calculated as the sum of the three lowest frequencies
divided by three. The probability of the sample being homozygous was determined by:
$$P(\text{homozygote}) = p^2 \tag{2.31}$$
with p the same as before.
Finally, in the case when the second most frequent allele in the sample was a non-standard
allele, putative heterozygous samples were identified and filtered out. This identification was
based on the criterion:
$$P(\text{heterozygote} \mid \text{data}) \geq P(\text{homozygote} \mid \text{data}) \tag{2.32}$$
Using Bayes' theorem and knowing that P(data) is equal for both probabilities, this translates to
the calculation of:
$$P(\text{data} \mid \text{heterozygous}) \times P(\text{heterozygote}) \geq P(\text{data} \mid \text{homozygous}) \times P(\text{homozygote}) \tag{2.33}$$
Furthermore, mutations from the Human Gene Mutation Database or loci with only one
reference allele (a deletion or insertion) in dbSNP were filtered out as well. 134
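The heterozygote criterion of Formulas 2.28 to 2.33 can be sketched as follows. The function name and argument layout are hypothetical, sample allele frequencies are used as fractions rather than percentages (an assumption), and the multinomial coefficient is omitted because it is identical on both sides of the inequality.

```python
def looks_heterozygous(counts, p, q):
    """Evaluate criterion 2.33 for one sample.

    counts: read counts of the four alleles in the sample.
    p, q:   mean population frequencies used in 2pq and p^2.
    """
    c = sorted(counts, reverse=True)
    n = sum(c)
    f = [x / n for x in c]                         # sample allele frequencies
    het = [(f[0] + f[1]) / 2] * 2 + [(f[2] + f[3]) / 2] * 2
    hom = [f[0]] + [(f[1] + f[2] + f[3]) / 3] * 3

    def likelihood(probs):
        # The multinomial coefficient cancels in the comparison.
        result = 1.0
        for prob, count in zip(probs, c):
            result *= prob ** count
        return result

    return likelihood(het) * 2 * p * q >= likelihood(hom) * p ** 2
```

A sample with counts (25, 24, 1, 0) is flagged as a putative heterozygote, whereas counts of (48, 1, 1, 0) are better explained by a homozygote with sequencing errors.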
Afterwards, for the retained samples, all non-standard alleles were removed from the sequences to obtain at most two alleles for analysis. SeqEM was then used to estimate the
allele frequencies and error rates to be used in MLE and LRT. Due to the filtering
of the data, the estimated error rates will not reflect the true sequencing error rates. This is, however, not a problem as
it is the filtered data that will be used in the model.
Chapter 3
Results
3.1 Parameter estimation in random SNP data using SeqEM
As SeqEM will be used to estimate the parameters (allele frequencies and error rate) used
in the final models to detect (differential) imprinting, the first step in this thesis was the
evaluation of these estimates. Therefore, random data, generated as described in section
2.4.1, were used as input for SeqEM.
3.1.1 SeqEM on sequences with only two alleles
Although the situation in which only two alleles are present in a sequence is not realistic,
it was assessed first to determine the accuracy of SeqEM. Data (100 samples, 1 locus and
a coverage between 10 and 100) with a predetermined PA and error rate were generated
1000 times and used as input for SeqEM. The mean and standard deviation of these 1000
estimates were calculated and compared to the real values used to generate the data. Results
for different allele frequencies and error rates are given in tables 3.1 and 3.2. The parameters
were estimated very close to the real values, even for sequences with higher error rates.
Furthermore, the 95% confidence intervals of the mean always
included the real values of PA and the error rate, respectively. Hence, the estimates approximate
the real values very well and can be used in the different models.
Table 3.1: Mean estimated allele frequencies with their corresponding 95% confidence intervals of
the mean for different PA values and error rates for 1000 iterations. Data for 1 locus with 100
samples were created with a coverage between 10 and 100.

PA     Error rate   Estimated PA (mean ± standard deviation)   95% confidence interval of the mean
0.5    0.02         0.4995 ± 0.03447                           0.4973 - 0.5016
0.2    0.02         0.1997 ± 0.02845                           0.1979 - 0.2014
0.5    0.1          0.5001 ± 0.03584                           0.4979 - 0.5024
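The tabulated intervals are consistent with a normal-approximation interval of the mean, i.e. mean ± 1.96 × SD / √1000. The short Python sketch below recomputes the first row of Table 3.1 under that assumption. The function name and the use of z = 1.96 are illustrative choices, not taken from the original analysis script.

```python
import math

def ci_of_mean(mean, sd, n, z=1.96):
    """Normal-approximation confidence interval for the mean of n estimates."""
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# First row of Table 3.1: PA = 0.5, error rate = 0.02, 1000 iterations.
low, high = ci_of_mean(0.4995, 0.03447, 1000)
print(f"{low:.4f} - {high:.4f}")
```

The result agrees with the tabulated interval up to rounding of the reported mean and standard deviation.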
Table 3.2: Mean estimated error rates with their corresponding 95% confidence intervals of the mean
for different PA values and error rates for 1000 iterations. Data for 1 locus with 100 samples
were created with a coverage between 10 and 100.

PA     Error rate   Estimated error rate (mean ± standard deviation)   95% confidence interval of the mean
0.5    0.02         0.02003 ± 0.0028                                   0.01986 - 0.02021
0.2    0.02         0.02004 ± 0.0023                                   0.01990 - 0.02018
0.5    0.1          0.1002 ± 0.0059                                    0.09988 - 0.1006
3.1.2 SeqEM on sequences with four alleles
As the procedure described above makes use of only two possible alleles, it does not reflect a
realistic situation: when sequencing errors are present, the data can contain all four alleles.
The two-allele analysis was a first indication that SeqEM is sufficiently
reliable for our purpose. Next, data were simulated in which individual sequences could contain four
alleles due to sequencing errors. Since SeqEM can only handle sequences with two alleles,
a reference and a variant allele, a prior error correction was done as described in section 2.4.1.
SeqEM was tested on this corrected SNP dataset. Because the errors were partially filtered
in R before using SeqEM, it was expected that the error rate would be underestimated. Data
(100 samples, 1 locus and a coverage between 10 and 100) were again created 1000 times.
For example, for simulated data with a PA of 0.5 and an error rate of 0.02 the mean estimate
of the error rate (0.0105046 ± 0.00182887) was, as predicted, too low. However, with a mean
PA of 0.5001724 ± 0.03391529 the estimated PA was still very close to the real value. Table
3.3 shows more estimates of the allele frequency, together with the 95% confidence interval of the mean,
which always included the real value. Hence, very good estimates of
the allele frequency were again obtained. The error rate, on the other hand, was underestimated
due to the prior error correction. Note, however, that this is not a problem: this parameter
will only be used for data on which the prior error correction has been performed, so the actual
value is irrelevant here.
3.1.3 SeqEM on imprinted sequences
Finally, SeqEM was evaluated on imprinted data to check whether the allele frequency and error rate
were still estimated correctly. Note that imprinting affects the allelic distributions, and might
therefore have an impact on these estimates. Data for 100 samples and 1 locus with coverage
between 10 and 100 were generated. Again, a prior error correction step was done in R with
the correction principle described in section 2.4.1. Data generation and SeqEM were repeated
1000 times, after which the mean estimate and standard deviation over these iterations were
calculated. Again the error rate was underestimated due to the prior error correction. An
error rate of 0.01160587 ± 0.001896982, for example, was obtained when the real value was
0.02. Table 3.4 lists the mean estimates of the allele frequency and the corresponding confidence
intervals of the mean. The table shows that the estimated allele frequencies
were still close to the real values when imprinted data were used.
Table 3.3: Mean estimated allele frequencies and 95% confidence intervals of the mean for PA ranging
from 0.1 to 0.9 for 1000 iterations. The error rate was 0.02 for data with 100 samples, 1
locus and a coverage between 10 and 100.

PA     Estimated allele frequency (mean ± standard deviation)   95% confidence interval of the mean
0.1    0.0989 ± 0.0208                                          0.0976 - 0.1002
0.2    0.2001 ± 0.0288                                          0.1983 - 0.2019
0.3    0.3001 ± 0.0332                                          0.2981 - 0.3022
0.4    0.4017 ± 0.0358                                          0.3995 - 0.4039
0.5    0.4986 ± 0.0334                                          0.4965 - 0.5006
0.6    0.5999 ± 0.0346                                          0.5977 - 0.6020
0.7    0.6996 ± 0.0325                                          0.6976 - 0.7016
0.8    0.7999 ± 0.0265                                          0.7983 - 0.8016
0.9    0.9002 ± 0.0210                                          0.8989 - 0.9015
Table 3.4: Mean estimated allele frequencies of imprinted data and 95% confidence intervals of the
mean for PA ranging from 0.1 to 0.9. The data were generated 1000 times with an error
rate of 0.02, 100 samples, 1 locus and coverage between 10 and 100.

PA     Estimated allele frequency (mean ± standard deviation)   95% confidence interval of the mean
0.1    0.0996 ± 0.0209                                          0.0983 - 0.1009
0.2    0.1993 ± 0.0287                                          0.1975 - 0.2011
0.3    0.3003 ± 0.0346                                          0.2981 - 0.3024
0.4    0.3982 ± 0.0341                                          0.3961 - 0.4003
0.5    0.5012 ± 0.0348                                          0.4990 - 0.5033
0.6    0.5997 ± 0.0349                                          0.5975 - 0.6019
0.7    0.6994 ± 0.0318                                          0.6974 - 0.7013
0.8    0.8017 ± 0.0288                                          0.7999 - 0.8034
0.9    0.8995 ± 0.0208                                          0.8982 - 0.9007
SeqEM also has the option to predict the genotypes and estimate the parameters based on the
Hardy-Weinberg principle. In contrast to the actual underlying genotypes, the observed
expression data of imprinted loci do not adhere to this principle, so using SeqEM in this way
is expected to give wrong results. Indeed, although the estimates of the allele frequency were still very
close to the real values, the predicted genotypes were, as expected, not correct for the
imprinted data (data not shown).
Since the estimated parameters are very similar to the real values (for normal and imprinted
data), SeqEM can be used in the models to screen for loci featuring (loss of) imprinting.
3.2 Detection of imprinting
3.2.1 Drawbacks of previous methodology
The methodology developed by Steyaert et al. to detect imprinted loci in MethylCap-seq data
was able to identify well-known as well as new imprinted regions. 103 However, the approach
had some downsides that were improved here. To reduce computational load, which was
one of the biggest bottlenecks, a more efficient genotype-calling step was included in the
methodology. SeqEM was employed for estimation of allele frequencies and error rate, which
are required to detect imprinting. Moreover, the statistical framework was further improved.
RNA-seq data were used to demonstrate the efficacy of the improved method and its applicability to
other data types. Also, the method was adapted to enable the detection of partial imprinting,
which might for example arise when a tissue consists of a mixture of imprinted and non-imprinted cell types. Lastly, the methodology was extended to enable detection of differential
imprinting. This will be discussed in section 3.3.
3.2.2 The likelihood ratio test to detect imprinting in simulated data
To refine the model to detect imprinting described in section 2.4.3, simulation studies were
conducted. The model was finalised using data that were created as described in section 2.4.1,
but with more noise included in the generated data. For example, if a sample was 50%
imprinted, not exactly half of the coverage (of one allele) was deleted, but more variation was
allowed. Afterwards, power analyses were performed to define the method’s limitations. To
get a better understanding of where possible problems were located, the simpler model
with only 2 alleles was considered here. Furthermore, in the simulation studies, SeqEM was
not yet included in the model. Finally, the methodology was used to study the control
samples of the TCGA breast cancer data. Discovery of some well-known imprinted loci further
supported that the developed methodology is able to detect imprinting.
3.2.2.1 Simulation studies to evaluate the mixture model
To ensure that the developed PMF (see section 2.4.3) models the data as anticipated, simulated data were generated and compared with the fitted model. Random data for 1 locus with
1000 samples and a coverage of 100 were generated and different parameter values were tested.
The number of samples with a specific fraction of allele A (varying from 0 to 1) was determined and normalised (divided by the total number of samples). These fractions represent
the simulated data. The fitted mixture model was created by calculating the PMF for those
different fractions of A, using the allele frequency and error rate of the simulated data.
The imprinting factor necessary for the PMF was estimated from the simulated data with
MLE. The simulated data and mixture model were then plotted together for comparison.
This procedure was done for two different allele frequencies (0.5 and 0.75), two error rates
(0.02 and 0.1) and varying amounts of imprinting (0, 50 and 100%), see figures 3.1 and 3.2.
The graphs show that the PMF models the simulated data very closely. As expected, the graphs
for 0, 50 and 100% imprinting have three, four and two peaks respectively. In the latter case,
no heterozygotes are observed and thus only two homozygous peaks are seen. One peak
represents the A homozygote at a fraction of 1, while the other represents the T homozygote
with a fraction of A equal to 0. When the locus was not imprinted, however, heterozygous
samples were still observed, resulting in three peaks. The middle peak here represents the
heterozygotes with a fraction of A around 0.5. The other two peaks are equivalent to the
homozygous ones in fully imprinted data. The intermediate case of 50% imprinting resulted
in four peaks: the heterozygotes are separated into
two peaks, as part of the coverage of either A or T is removed by imprinting. These peaks were
also closely modelled by the developed PMF.
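The peak pattern described above can be reproduced with a small Python sketch of a four-component binomial mixture. The exact PMF of section 2.4.3 is not repeated here, so the parameterisation below is an assumption. Genotypes follow Hardy-Weinberg proportions. In heterozygotes a fraction i of the coverage of one allele is removed, giving expected A fractions of 1/(2 - i) and (1 - i)/(2 - i). Sequencing errors swap the two alleles with probability equal to the error rate. All function names are hypothetical.

```python
import math

def binom_pmf(k, n, prob):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * prob**k * (1.0 - prob) ** (n - k)

def observed_prob(fraction_a, error_rate):
    """Probability of reading allele A given the true A fraction (two-allele error model)."""
    return fraction_a * (1.0 - error_rate) + (1.0 - fraction_a) * error_rate

def mixture_pmf(k, n, p_a, error_rate, i):
    """Assumed mixture PMF for observing k reads of allele A out of n reads."""
    q = 1.0 - p_a
    components = [
        (p_a * p_a, 1.0),                # AA homozygotes
        (p_a * q, 1.0 / (2.0 - i)),      # heterozygotes with T partly silenced
        (p_a * q, (1.0 - i) / (2.0 - i)),  # heterozygotes with A partly silenced
        (q * q, 0.0),                    # TT homozygotes
    ]
    return sum(w * binom_pmf(k, n, observed_prob(f, error_rate)) for w, f in components)

def count_peaks(values):
    """Number of interior local maxima in a sequence of PMF values."""
    return sum(1 for k in range(1, len(values) - 1)
               if values[k - 1] < values[k] > values[k + 1])

for degree in (0.0, 0.5, 1.0):
    pmf = [mixture_pmf(k, 100, 0.5, 0.02, degree) for k in range(101)]
    print(degree, count_peaks(pmf))
```

With a coverage of 100, an allele frequency of 0.5 and an error rate of 0.02, this sketch yields three, four and two peaks for 0, 50 and 100% imprinting, matching the pattern in figure 3.1.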
Figure 3.2 shows that for an error rate of 0.1 the fit was still very good, though real error
rates will probably be lower. The graphs also show that MLE still estimates the amount of
imprinting very well, as the mixture model has the same number of peaks as the simulated
data. Moreover, the estimated i is always very close to the real degree of imprinting (i.e. the degree
used for simulating the data). Hence, even for higher error rates the developed mixture PMF
models the data as anticipated and can be used to detect imprinting in control samples.
3.2.2.2 Simulation studies for the detection of imprinting under the null hypothesis
To evaluate whether the LRT for the detection of imprinting (see equation 2.24) was valid, a set
of p-values was determined under the null hypothesis. For a test to be statistically
valid, the p-values have to follow a uniform distribution under the null hypothesis, though
some deviations are possible for discrete data (see below). Hence, non-imprinted data were
created and the LRT p-values were determined and depicted as histograms. Since we test for
a degree of imprinting different from 0, which is on the boundary of the parameter space, a
50:50 mixture of χ² distributions with 0 and 1 degrees of freedom was needed as null distribution
to obtain correct p-values (see section 2.4.3.3). This mixture is equivalent to
dividing the p-value obtained with a χ² distribution with 1 degree of freedom by two,
except when that p-value was equal to 1 (corresponding to an estimated i of 0). Figure 3.3 shows
the histogram for 10000 loci with 100 samples each. The area of the histogram for p-values
corresponding to estimated imprinting values different from 0 was approximately 0.5. The peak at a p-value of 1 had an area of
about 0.5 as well. So, no exact uniform distribution was obtained under the null hypothesis,
but this was expected for a PMF. Using a mixture of χ² distributions gave
the best feasible approximation of the uniform distribution here.
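The halving rule for the boundary case can be written down in a few lines. The Python sketch below is illustrative only and its function names are hypothetical. It uses the identity that the upper tail probability of a χ² distribution with 1 degree of freedom equals erfc(√(x/2)), so no external statistics library is needed.

```python
import math

def chi2_sf_1df(x):
    """Upper tail probability P(X > x) for a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def boundary_lrt_pvalue(lr_statistic):
    """P-value under a 50:50 mixture of chi-square distributions with 0 and 1 degrees of freedom."""
    if lr_statistic <= 0.0:
        # Estimated i equals 0: the statistic falls on the point mass at 0.
        return 1.0
    return 0.5 * chi2_sf_1df(lr_statistic)

p_at_3_84 = boundary_lrt_pvalue(3.841459)  # about 0.025 instead of 0.05
p_at_zero = boundary_lrt_pvalue(0.0)       # exactly 1
```

Under this rule a likelihood ratio statistic of about 2.71, rather than 3.84, corresponds to a p-value of 0.05.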
(a) PA = 0.5; estimated i = 0.02
(b) PA = 0.75; estimated i = 0
(c) PA = 0.5; estimated i = 0.5
(d) PA = 0.75; estimated i = 0.51
(e) PA = 0.5; estimated i = 1
(f ) PA = 0.75; estimated i = 1
Figure 3.1: Plots of simulated data (red) and mixture model (green) with an error rate of 0.02 and
an allele frequency of 0.5 (left) or 0.75 (right). Imprinting varies in between 0 (top), 50
(middle) or 100% (bottom). The graphs show data from 1000 samples with a coverage
of 100. Only 1 locus was considered here.
(a) PA = 0.5; estimated i = 0.03
(b) PA = 0.75; estimated i = 0
(c) PA = 0.5; estimated i = 0.51
(d) PA = 0.75; estimated i = 0.5
(e) PA = 0.5; estimated i = 1
(f ) PA = 0.75; estimated i = 1
Figure 3.2: Plots of simulated data (red) and mixture model (green) with an error rate of 0.1 and
an allele frequency of 0.5 (left) or 0.75 (right). Imprinting varies in between 0 (top), 50
(middle) or 100% (bottom). The graphs show data from 1000 samples with a coverage
of 100. Only 1 locus was considered here.
Figure 3.3: Histogram of the p-values under the null hypothesis. The null distribution was calculated
as a mixture of χ2 distributions with 0 and 1 degrees of freedom. The data used in this
graph had a PA of 0.5, an error rate of 0.02 and a coverage of 100. It shows the results
of 10000 loci all with 100 samples. The imprinting factor was 0 to ensure that the null
hypothesis was true.
The p-value distributions under the null hypothesis were also determined for different sample
sizes. Furthermore, cumulative distributions were created. The graphs are shown in figure
3.4. As before, p-values did not follow a uniform distribution but approximated it very well,
even for small sample sizes.
When p-values follow a uniform distribution, 1% of the p-values are smaller than
0.01, 5% smaller than 0.05 and so on. The results of testing this for 100 samples (corresponding
to the middle panels in figure 3.4) are shown in table 3.5. As predicted, around 1% (0.9%)
of the p-values were smaller than 0.01, approximately 2% (1.4%) were smaller than 0.02, 5%
(4.1%) were smaller than 0.05 and so forth. When the p-value threshold was increased in steps of 0.05 from 0 to 0.5,
the fraction between two consecutive thresholds was around 5% (see column 2 in table 3.5). The
total fraction of p-values smaller than 0.5 was approximately 50% (49.4%). Numerically, this shows that the
p-values were approximately uniformly distributed for p smaller than 0.5. As the mixture of χ² distributions
resulted in division of the p-values by two (except when they were 1), p-values between 0.5
and 1 were absent. A peak again appeared for p equal to 1.
So, in conclusion, the distribution under the null hypothesis was not uniform, because the
data follow a discrete distribution. However, a good approximation of uniformity was obtained
with a mixture of χ2 distributions with 0 and 1 degrees of freedom for the null distribution.
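The shape of this null distribution can be illustrated with a short simulation. The Python sketch below does not use the mixture PMF itself. Instead it draws the likelihood ratio statistic directly from the assumed asymptotic null distribution (0 with probability 0.5, otherwise a χ² variate with 1 degree of freedom) and converts it to a p-value with the halving rule. The function name, seed and number of tests are arbitrary illustrative choices.

```python
import math
import random

def simulate_null_pvalues(n_tests, seed=1):
    """Draw p-values from the assumed asymptotic null of the boundary LRT."""
    rng = random.Random(seed)
    pvalues = []
    for _ in range(n_tests):
        if rng.random() < 0.5:
            pvalues.append(1.0)  # estimate on the boundary, statistic equal to 0
        else:
            lr = rng.gauss(0.0, 1.0) ** 2  # chi-square variate with 1 degree of freedom
            pvalues.append(0.5 * math.erfc(math.sqrt(lr / 2.0)))
    return pvalues

pvalues = simulate_null_pvalues(20000)
n = len(pvalues)
frac_equal_one = sum(p == 1.0 for p in pvalues) / n
frac_below_005 = sum(p <= 0.05 for p in pvalues) / n
frac_between = sum(0.5 < p < 1.0 for p in pvalues) / n
```

In this idealised setting about half of the p-values equal 1, none fall between 0.5 and 1, and the remainder is uniform on the interval from 0 to 0.5, which is the pattern reported in table 3.5.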
3.2.2.3 Power analysis on simulated data
Afterwards, a power analysis was performed on randomly generated data with different levels
of imprinting. The power as a function of the sample size was determined to ascertain which
sample size enables detection of imprinting. This was done for varying minor allele frequencies, as
it was expected that smaller frequencies would lead to a lower power. Moreover, different
percentages of imprinting were tested. Figures 3.5-3.7 show graphs of the fraction of significant
p-values (p ≤ 0.05) of 100 loci. Next to the graphs, tables listing the corresponding
fractions are displayed. Data were created the same way as before with a coverage of 100 and
an error rate of 0.02. Sample sizes ranged from 5 to 100 in steps of 5. Different imprinting
factors, namely 0.5 and 1, and minor allele frequencies of 0.1, 0.25 and 0.5 were tested.
Figure 3.4: Distribution (left) and cumulative distribution (right) of the mixture of p-values if H0
(no imprinting) was true for 1000 loci with variable sample sizes of 10 (top), 100 (middle)
and 1000 (bottom). Data were generated with an allele frequency of 0.5, an error rate
of 0.02 and a coverage of 100.
Table 3.5: The fraction between two consecutive p-values and the cumulative fractions of those p-values.
The p-values were calculated under the null hypothesis as a mixture of χ² distributions.
Data of 1000 loci with 100 samples, an allele frequency of 0.5, an error rate of 0.02 and a
coverage of 100 were used in the calculations.

p-value      Fraction   Cumulative fraction
0.01         0.009      0.009
0.02         0.005      0.014
0.05         0.027      0.041
0.10         0.050      0.091
0.15         0.054      0.145
0.20         0.056      0.201
0.25         0.041      0.242
0.30         0.047      0.289
0.35         0.056      0.345
0.40         0.046      0.391
0.45         0.053      0.444
0.50         0.050      0.494
0.55-0.95    0.000      0.494
1.00         0.506      1.000

As expected, the power was lower for samples with a low minor allele frequency or with less
imprinting. Detection of imprinting was most challenging for an allele frequency of 0.1 (Fig.
3.5). When samples were 100% imprinted, imprinting was detected in all 100 loci for a sample
size higher than 10. Around 50%-75% of the loci were detected for lower sample sizes. Hence, a
minimal sample size of 15 is preferred here. For samples that were 50% imprinted, on the other
hand, the power was slightly lower and the minimal sample size should be 35. As the allele
frequency increased, the power increased as well. For fully imprinted samples with a minor
allele frequency of 0.25 or 0.5, detection was always possible independently of the sample size
(Fig. 3.6 and 3.7). Samples that were 50% imprinted required a minimal sample size of 15 or
10 for an allele frequency of 0.25 or 0.5, respectively. This analysis showed that, especially in
fully imprinted samples with a high minor allele frequency, the detection was very efficient.
Furthermore, detection of imprinting was possible for small sample sizes and
low minor allele frequencies, though more samples are required here. One particular remark
that should, however, be made for these analyses is that for likelihood ratio tests the test
statistic only asymptotically follows a chi-square distribution. This implies that, particularly
for lower sample sizes, absolute results may be less reliable. However, conclusions for relative
comparisons between conditions (level of imprinting, minor allele frequency, sample size) are
anticipated to be less problematic.
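A power analysis of this kind can be sketched in Python as follows. The sketch rests on several assumptions that are not taken from the thesis scripts: the four-component binomial mixture with heterozygote A fractions 1/(2 - i) and (1 - i)/(2 - i), data generated without the additional noise mentioned in section 3.2.2, known allele frequency and error rate, a grid search instead of a numerical optimiser for the MLE of i, and the halved χ² p-value for the boundary test. All names are hypothetical.

```python
import math
import random

def binom_pmf(k, n, prob):
    return math.comb(n, k) * prob**k * (1.0 - prob) ** (n - k)

def het_fractions(i):
    """Assumed expected A fractions of the two heterozygote classes for imprinting degree i."""
    return 1.0 / (2.0 - i), (1.0 - i) / (2.0 - i)

def log_likelihood(a_counts, coverage, p_a, err, i):
    q = 1.0 - p_a
    high, low = het_fractions(i)
    comps = [(p_a * p_a, 1.0), (p_a * q, high), (p_a * q, low), (q * q, 0.0)]
    total = 0.0
    for k in a_counts:
        lik = sum(w * binom_pmf(k, coverage, f * (1 - err) + (1 - f) * err)
                  for w, f in comps)
        total += math.log(lik)
    return total

def simulate_locus(rng, n_samples, coverage, p_a, err, i):
    """Simulate the number of A reads per sample for one locus."""
    high, low = het_fractions(i)
    counts = []
    for _ in range(n_samples):
        a1, a2 = rng.random() < p_a, rng.random() < p_a  # two alleles under Hardy-Weinberg
        if a1 and a2:
            f = 1.0
        elif not a1 and not a2:
            f = 0.0
        else:
            f = high if rng.random() < 0.5 else low  # either allele may be silenced
        prob = f * (1 - err) + (1 - f) * err
        counts.append(sum(rng.random() < prob for _ in range(coverage)))
    return counts

def lrt_pvalue(a_counts, coverage, p_a, err, grid=21):
    ll0 = log_likelihood(a_counts, coverage, p_a, err, 0.0)
    ll1 = max(log_likelihood(a_counts, coverage, p_a, err, g / (grid - 1))
              for g in range(grid))  # grid search over i in [0, 1]
    lr = 2.0 * (ll1 - ll0)
    return 1.0 if lr <= 0.0 else 0.5 * math.erfc(math.sqrt(lr / 2.0))

def power(n_loci, n_samples, coverage, p_a, err, i, alpha=0.05, seed=1):
    """Fraction of simulated loci with a significant LRT p-value."""
    rng = random.Random(seed)
    hits = sum(
        lrt_pvalue(simulate_locus(rng, n_samples, coverage, p_a, err, i),
                   coverage, p_a, err) <= alpha
        for _ in range(n_loci))
    return hits / n_loci

power_full = power(20, 20, 100, 0.5, 0.02, 1.0)
power_null = power(20, 20, 100, 0.5, 0.02, 0.0)
```

Calling `power` for a range of sample sizes, allele frequencies and imprinting degrees gives curves comparable in spirit to figures 3.5-3.7, although the exact values depend on the assumptions listed above.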
Sample size   Significant fraction (i = 50%)   Significant fraction (i = 100%)
5             0.63                             0.54
10            0.86                             0.71
15            0.97                             1.00
20            0.97                             1.00
25            0.98                             1.00
30            0.99                             1.00
35-100        1.00                             1.00
Figure 3.5: Graph representing the fraction of significant p-values (p ≤ 0.05) for sample sizes varying
from 5 to 100 in steps of 5 (right) and the corresponding fractions (left). The allele
frequency was 0.1, the error rate 0.02, the coverage 100 and the amount of imprinting
(i) was 100% (red) or 50% (blue). Data of 100 loci are shown here.
Sample size   Significant fraction (i = 50%)   Significant fraction (i = 100%)
5             0.93                             1.00
10            0.98                             1.00
15-100        1.00                             1.00
Figure 3.6: Graph representing the fraction of significant p-values (p ≤ 0.05) for sample sizes varying
from 5 to 100 in steps of 5 (right) and the corresponding fractions (left). The allele
frequency was 0.25, the error rate 0.02, the coverage 100 and the amount of imprinting
(i) was 100% (red) or 50% (blue). Data of 100 loci are shown here.
Sample size   Significant fraction (i = 50%)   Significant fraction (i = 100%)
5             0.97                             1.00
10-100        1.00                             1.00
Figure 3.7: Graph representing the fraction of significant p-values (p ≤ 0.05) for sample sizes varying
from 5 to 100 in steps of 5 (right) and the corresponding fractions (left). The allele
frequency was 0.5, the error rate 0.02, the coverage 100 and the amount of imprinting
(i) was 100% (red) or 50% (blue). Data of 100 loci are shown here.
3.2.3 Application on TCGA data
Chromosome 11 from the TCGA data was tested first because many known imprinted loci
are located on this chromosome. After determining the imprinting levels and p-values for the
SNPs in the data with the developed script, the p-values were corrected for multiple testing
with the Benjamini-Hochberg false discovery rate (FDR) procedure. Of the 19790 SNPs studied on
chromosome 11, 5432 were significant after correction. For biological relevance, however, an
extra filtering step was applied: only SNPs with an estimated i higher than 0.5 were retained. After
filtering, 2841 significant SNPs were left for further analysis. Chromosome 21 was studied as
well. Here, 948 of the 5858 SNPs remained after correction and filtering. As this was
a first attempt to assess the quality of the model and to identify possible aberrations,
only key examples will be presented here, with unadjusted p-values. The sequencing error rate was
estimated for each chromosome separately to evaluate whether the estimates were indeed similar.
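The correction and filtering steps can be sketched in Python as follows. This is an illustration, not the script used for the TCGA analysis. The significance threshold of 0.05 on the adjusted p-values is an assumption, and the SNP identifiers and values in the example are invented placeholders.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def select_snps(records, alpha=0.05, min_i=0.5):
    """Keep SNPs that are significant after correction and have an estimated i above min_i.

    records is a list of (snp_id, p_value, estimated_i) tuples.
    """
    adjusted = benjamini_hochberg([r[1] for r in records])
    return [r[0] for r, adj in zip(records, adjusted) if adj <= alpha and r[2] > min_i]

# Invented placeholder records: (identifier, unadjusted p-value, estimated i).
example = [("snp_a", 0.01, 0.9), ("snp_b", 0.04, 0.3),
           ("snp_c", 0.03, 0.7), ("snp_d", 0.5, 0.95)]
selected = select_snps(example)
```

In the placeholder example only the first SNP passes both the corrected significance threshold and the filter on the estimated i.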
First, the SNPs in regions of known imprinted loci, such as Igf2, were analysed. The same
plots as for the simulation studies were created to enable comparison of the observed data
and the mixture model. Here, significant SNPs were discovered that confirmed imprinting in
already known loci. Figure 3.8 shows the graphs for two SNPs in Igf2, namely rs2585 and
rs7873. The figures indicate that the mixture model approximates the observed data very
well, with estimates for i very close to 1, namely 0.97 for rs2585 and 0.98 for rs7873.
Interestingly, the notion that imprinting is somewhat less than 100% is supported by the
bimodal distributions, particularly for rs2585, in both the observed data and the fitted model. This
underscores the benefit of allowing for partial imprinting. For SNP rs2585 the allele frequency
of the depicted standard allele (T) was estimated as 0.2832 by SeqEM and for SNP rs7873
the depicted standard allele (T) frequency was estimated as 0.8938. For chromosome 11,
the median error rate over all loci was estimated as 0.00088. This is very small compared
with typically encountered error rates of roughly 1% (a factor 10 difference), even when one
takes into account the fact that a prior error correction was performed (causing a factor 2
difference in the simulation studies). 88 However, the narrow width of the peaks indicates that the
error rate was indeed fairly small. Igf2 is a well-known imprinted gene which is expressed paternally
(see section 1.2). This result hence shows that the developed methodology is at least capable of
accurately detecting known imprinted loci.
Figure 3.8: Plots representing the observed TCGA data (red) and the mixture model (green) of
SNPs rs2585 (left) and rs7873 (right), both found in the Igf2 gene, which is a well-known
imprinted gene. SeqEM estimated the allele frequencies of the standard allele (T) as
0.2832 (left) and 0.8938 (right) and the median error rate as 0.00088. The amount of
imprinting was estimated as 0.97 (left) and 0.98 (right), and the obtained p-values were
virtually equal to 0 for both SNPs.
Other known imprinted genes were discovered as well with the developed methodology, providing more evidence of the effectiveness of the method. H19 and Igf2 are reciprocally imprinted
and thus H19 is expressed maternally. A SNP located in this gene, rs2839698, was found to be
significantly imprinted with a p-value of 3.32e-92. Figure 3.9 (left) shows the observed data
and fitted model for this SNP. Again the figure as well as an i estimated as 0.99 indicate that
the locus was indeed imprinted. The allele frequency of the standard allele shown in the plot
(G) was estimated as 0.4732. A SNP located in KCNQ1, namely rs463337, was significantly
imprinted as well (Fig. 3.9 right), with a p-value of 2.04e-10 and an estimated i of 1.
This provides further evidence that the developed methodology works as intended.
Figure 3.9: Plots representing the observed TCGA data (red) and the mixture model (green) of the
significant SNPs rs2839698 (left) and rs463337 (right), which are located in the H19 gene
and the KCNQ1 gene, respectively. rs2839698 had a p-value of 3.32e-92, an estimated i of
0.99 and an allele frequency of the depicted standard allele (G) of 0.4732. rs463337 had a
p-value of 2.04e-10, an estimated i of 1 and an allele frequency of the depicted standard
allele (A) of 0.8462. The error rate for chromosome 11 was 0.00088.
Chromosome 21 was studied as well. The median error rate over all loci was estimated as
0.00118, which is acceptably close to the error rate for chromosome 11. As for chromosome 11,
several significant loci could be identified (Fig. 3.10). These show that SNPs were often
significant due to low data quality. As the coverage or the number of samples was frequently
very small, it was hard to accurately fit the mixture PMF (Fig. 3.10 a and b). Hence, incorrect
i-values gave the best fit, whereas the graphical representation shows too little evidence that
the SNP was indeed imprinted. SNPs with a very small minor allele frequency were also called
imprinted, but again the plots showed that this was not the case (Fig. 3.10 c). Filtering on
minimal coverage and minor allele frequency is necessary in the future to reduce this noise.
Furthermore, other aberrations in the data, possibly associated with mapping problems and allele-specific
methylation, were present and will have to be filtered out as well (Fig. 3.10 d).
Although some obstacles are still present, it was shown here that the developed methodology
can discover imprinted loci. Discovery of known imprinted loci, such as Igf2 and H19,
in real data is the best evidence of the effectiveness of the method, and this was obtained here.
However, further filtering of the data and reduction of the noise is necessary in the future.
Due to the aberrations in the data, no new imprinted loci could be reported so far.
Figure 3.10: Plots representing observed TCGA data (red) and the fitted mixture model (green) of
some significant SNPs on chromosome 21. In (a) rs4956885 is shown with a p-value
of 0.007, an estimated i of 0.56 and a minor allele frequency of 0.1325. (b) represents
rs9974322 with a p-value of 0.005, i estimated as 0.79 and a minor allele frequency as
0.3921. Graph (c) shows rs4023131 with a p-value of 3.01e-122, an i of 0.81 and a minor
allele frequency of 0.0345. In (d) rs3167757 is shown with a p-value of 0, i estimated as
0.58 and a minor allele frequency as 0.3894. The median error rate for chromosome 21
was 0.00118.
3.3 Detection of differential imprinting
Different methodologies to test for differential imprinting between two subsets of data were
assessed in this thesis. As the main focus was to discover loss of imprinting in cancer, the
two subsets were defined as a control subset which was imprinted and a tumour subset which
had possibly lost its imprinting.
3.3.1 Comparison of two subsets of simulated data with KS and WRS test
The first two methodologies that were tested were based on the binomial test followed by the
Kolmogorov-Smirnov (KS) or Wilcoxon Rank Sum (WRS) test (as described in section 2.4.2). The results of the KS and WRS tests were analysed and compared to determine which test was
most sensitive. Data were generated as described in section 2.4.1. Control samples were
imprinted and thus all heterozygous samples were made homozygous in this subset by removing data for one of the alleles, thereby also reducing the coverage. Also for homozygous
samples, approximately half of the coverage was removed (see Methods, section 2.4.1). In
tumour samples, on the other hand, 100% loss of imprinting was assumed. This procedure
was repeated 1000 times. Table 3.6 lists the mean and median p-values of the KS and WRS
tests for a minor allele frequency, again denoted as PA, ranging from 0.1 to 0.5
with an error rate of 0.02, a coverage between 10 and 100, 200 samples (100 tumour samples
vs 100 control samples) and 1 locus.
When PA was around 0.2 the p-value obtained from the WRS test was not significant (p > 0.05), but
the median p-value decreased for higher PA values. The median p-value became significant
for PA equal to 0.4 and higher. As more heterozygotes are present when PA is closer to 0.5,
it was expected that loss of imprinting would be easier to detect in these cases. The only
exception was found when PA was around 0.1. The p-value of the WRS test then became
very small and a significant difference was found. This was not as anticipated, since there are
fewer heterozygotes when the allele frequencies deviate more from 0.5.
The WRS test mainly assesses differences between the medians of both p-value populations,
so the median binomial p-values of the tumour and control data were evaluated. When PA
was around 0.1 the median binomial p-value of the tumour data became very small compared
to the median binomial p-values for higher PA values or for the control samples. This was due to
almost all samples being homozygous and the fact that their coverage was still relatively large
(larger than for control samples, as no alleles are removed for imprinting). These binomial
test p-values explain why the p-value of the WRS test was smaller than anticipated. In other
words, the observation of significant results for extreme PA values reflects a difference in
power of the binomial test rather than a difference in imprinting between cases and controls.
With the KS test, on the other hand, significant p-values were always obtained. However, the
same bias for smaller allele frequencies was present here. Although the KS p-value for PA around
0.1 was smaller than for PA values of 0.2 and 0.3, loss of imprinting was detected for all allele
frequencies. Note that many ties were present in the dataset, making it difficult for the KS
and WRS tests to calculate an exact p-value. In conclusion, the tests enable detection of
differential imprinting, yet there is a problem for low allele frequencies and the presence
of ties in the data may also complicate the interpretation. As the KS test was always significant, it is
regarded as the more sensitive test for the outlined goal.
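The two-step procedure (binomial test per sample, followed by a two-sample comparison of the resulting p-values) can be sketched in Python for the KS test. The sketch makes several assumptions: the binomial test is taken to be a two-sided test against an allele fraction of 0.5, the KS p-value uses a standard asymptotic approximation that is not exact in the presence of ties, and the read counts are invented placeholders rather than simulated thesis data. The function names are hypothetical.

```python
import math

def binomial_test_two_sided(k, n, prob=0.5):
    """Exact two-sided binomial p-value: total probability of outcomes no more likely than k."""
    pmf = [math.comb(n, j) * prob**j * (1.0 - prob) ** (n - j) for j in range(n + 1)]
    cutoff = pmf[k] * (1.0 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(v for v in pmf if v <= cutoff))

def ks_two_sample(x, y):
    """Two-sample KS statistic D with an asymptotic p-value (approximate with ties)."""
    xs, ys = sorted(x), sorted(y)
    n1, n2 = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n1 and j < n2:
        v = min(xs[i], ys[j])
        while i < n1 and xs[i] <= v:
            i += 1
        while j < n2 and ys[j] <= v:
            j += 1
        d = max(d, abs(i / n1 - j / n2))  # distance between the two empirical CDFs at v
    ne = n1 * n2 / (n1 + n2)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:
        return d, 1.0  # the series below does not converge for very small lam
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(1.0, max(0.0, p))

# Placeholder (A reads, coverage) pairs. Controls are imprinted, so every sample
# looks homozygous; tumours have lost imprinting, so heterozygotes are visible.
control = [(49, 50)] * 10 + [(1, 50)] * 10
tumour = [(98, 100)] * 5 + [(2, 100)] * 5 + [(52, 100)] * 5 + [(47, 100)] * 5

control_p = [binomial_test_two_sided(k, n) for k, n in control]
tumour_p = [binomial_test_two_sided(k, n) for k, n in tumour]
d_stat, p_ks = ks_two_sample(control_p, tumour_p)
```

In this placeholder example the homozygous tumour samples have even smaller binomial p-values than the controls because of their higher coverage, which illustrates the coverage dependence discussed above.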
Table 3.6: Mean and median p-value of the Kolmogorov-Smirnov test (KS) and the Wilcoxon Rank
Sum test (WRS) on two sets of binomial p-values for 1000 iterations with PA ranging
from 0.1 to 0.5. The error rate was 0.02 for 100 samples per subset of 1 locus and coverage between
10 and 100. Significant p-values (p ≤ 0.05) are marked with an asterisk.

PA     Test   p-value (mean ± standard deviation)   p-value (median)
0.1    KS     4.58e-06 ± 2.98e-05 *                 1.87e-08 *
0.1    WRS    0.0065 ± 0.0387 *                     0.0001 *
0.2    KS     3.32e-05 ± 2.98e-05 *                 4.71e-06 *
0.2    WRS    0.35 ± 0.30                           0.28
0.3    KS     7.31e-06 ± 4.89e-05 *                 2.25e-07 *
0.3    WRS    0.39 ± 0.31                           0.34
0.4    KS     9.19e-07 ± 1.44e-05 *                 1.29e-09 *
0.4    WRS    0.15 ± 0.23                           0.04 *
0.5    KS     6.13e-07 ± 9.04e-06 *                 5.10e-10 *
0.5    WRS    0.10 ± 0.19                           0.01 *
Afterwards, more detailed examples were also evaluated. Data for 1 locus with 100 samples
for each subset and a coverage between 10 and 100 were created with a PA of 0.5 and an error
rate of 0.02. The KS test resulted in a p-value of 7.82e-09, while p was 0.12 in the WRS test.
According to the KS test there was a significant difference between the distributions of the two
populations (p < 0.05, H0 is rejected), hence, loss of imprinting can be concluded. However,
the null hypothesis could not be rejected by the WRS test (p > 0.05). Taking a closer look into
the two sets of p-values showed that the control p-values were indeed smaller than the ones of
the tumour data (Fig. 3.11). With around 50% observed heterozygotes in the tumour data
and none in the control data (heterozygotes were made homozygous), a significant difference
between the two datasets was expected, underscoring that the Kolmogorov-Smirnov test is
more sensitive here.
Figure 3.11: Distribution of the p-values of the binomial test (left) and cumulative distribution of
those p-values (right) for control (red) and tumour (green) data with a PA of 0.5, error
rate 0.02, 100 samples each, 1 locus and coverage between 10 and 100. The KS test
resulted in a p-value of 7.82e-09 whereas this was 0.12 for the WRS test.
Subsequently, the same analysis was performed for an allele frequency of 0.1 for control as
well as tumour data. P-values of 9.57e-06 and 0.004 were obtained for the KS and WRS test,
respectively. Both tests produced significant results, thus loss of imprinting in the tumour
data was detected here. Looking in more detail at the p-values showed that the
binomial p-values of the control data were indeed smaller than the tumour p-values (Fig. 3.12).
Because the p-value given by the KS test was smaller, this suggests that the KS test is again
the most sensitive test.
Figure 3.12: Distribution of the p-values of the binomial test (left) and cumulative distribution of
those p-values (right) for control (red) and tumour (green) data with a PA of 0.1, error
rate 0.02, 100 samples each, 1 locus and coverage between 10 and 100. The KS test
resulted in a p-value of 9.57e-06 and the WRS test in a p-value of 0.004.
However, as comparing the binomial p-values for detection of loss of imprinting strongly
depends on the coverage of the samples, another approach was tested as well. Instead of
determining binomial p-values, the ratio of the lowest to the highest allele count was
calculated. Afterwards these ratios were compared with a KS or WRS test. This ratio does not
depend on the coverage, and thus should avoid the problem of significant results for low
minor allele frequencies. Results can be seen in table 3.7. As predicted, the WRS p-value at
a PA of about 0.1 better reflects our goals than when a binomial test was used. The WRS
p-values became smaller when the allele frequency increased (becoming significant at around
0.4). The p-values of the KS test were larger than before, but still significant. Also here, ties
were found in the data, implying that the obtained p-values are not exact.
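The ratio-based comparison can be illustrated with a minimal Python sketch. It is not the R implementation used in this thesis: the read counts are invented, the function names are hypothetical, and the p-value uses the standard asymptotic Kolmogorov series, which is only approximate and, as noted above, not exact when ties are present.

```python
import math
from bisect import bisect_right


def allelic_ratio(count_a, count_b):
    """Ratio of the lowest to the highest allele count of one sample."""
    low, high = sorted((count_a, count_b))
    return low / high if high > 0 else 0.0


def ks_statistic(x, y):
    """Largest vertical distance between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in set(xs) | set(ys):
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d


def ks_pvalue_asymptotic(d, n, m):
    """Two-sided asymptotic p-value from the Kolmogorov series (approximate)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.3:  # the series converges poorly here; the p-value is ~1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


# Invented (allele A, allele B) read counts per sample, for illustration only.
control = [(40, 1), (30, 0), (55, 2), (20, 0), (80, 1), (60, 0)]
tumour = [(20, 22), (30, 28), (15, 0), (40, 38), (45, 41), (25, 1)]

control_ratios = [allelic_ratio(a, b) for a, b in control]
tumour_ratios = [allelic_ratio(a, b) for a, b in tumour]
d = ks_statistic(control_ratios, tumour_ratios)
p = ks_pvalue_asymptotic(d, len(control_ratios), len(tumour_ratios))
print(f"D = {d:.3f}, asymptotic p = {p:.4f}")
```

Because the ratio is computed within each sample, a sample with 10 reads and one with 100 reads contribute on the same scale, which is the property the approach relies on.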
Table 3.7: Mean and median p-value of the Kolmogorov-Smirnov test (KS) and the Wilcoxon Rank
Sum test (WRS) for 1000 iterations of 2 populations of ratios. Data with PA ranging
from 0.1 to 0.5, an error rate of 0.02, 100 samples each, 1 locus and coverage between 10
and 100 were used. Significant p-values (p ≤ 0.05) are marked with an asterisk (*).

PA     Test   p-value (mean ± standard deviation)   p-value (median)
0.1    KS     0.0078 ± 0.0133 *                     0.0028 *
       WRS    0.31 ± 0.29                           0.201
0.2    KS     0.00093 ± 0.0021 *                    0.00015 *
       WRS    0.38 ± 0.30                           0.31
0.3    KS     3.12e-05 ± 2.11e-04 *                 4.82e-07 *
       WRS    0.066 ± 0.14                          0.011 *
0.4    KS     1.41e-06 ± 1.25e-05 *                 1.26e-08 *
       WRS    0.013 ± 0.043 *                       0.00081 *
0.5    KS     7.43e-07 ± 8.35e-06 *                 3.68e-09 *
       WRS    0.0070 ± 0.0263 *                     0.00026 *
Looking into the ties showed that most of the ties were found in the control dataset itself.
Since the samples are made homozygous by deleting around 50% of the alleles, the coverage
became much smaller and the sequences became more alike. Thus, if a larger coverage range
is chosen, more different sequences can be obtained (also when data are again removed to
simulate imprinting), resulting in fewer ties in the data. For example, for a minimum and
maximum coverage of 10 and 400, respectively, only 10% ties were present. It should also
be noted that ties are not a big problem except when the obtained p-value is close to the
significance level (the p-value is often calculated slightly too small, yielding anti-conservative
results), which will only be a problem for a limited number of cases and can be taken into
consideration when assessing the results. [135] If, however, ties still cause a problem for real
data, a bootstrap version of the KS test can solve this problem. Using this repeatedly on
simulated data consistently gave a p-value of approximately 0. Hence, a significant difference
between the tumour and control data, and thus loss of imprinting in the tumour, was detected.
However, because bootstrapping is very time-intensive, this will not be implemented unless
necessary.
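The idea behind a resampling-based KS test can be sketched as follows. This is a minimal Python illustration and not necessarily the routine used for the results above: it assumes that both groups are resampled with replacement from the pooled data, and that the p-value is the fraction of resampled statistics at least as large as the observed one. The function names and the default number of resamples are hypothetical.

```python
import random
from bisect import bisect_right


def ks_statistic(x, y):
    """Largest vertical distance between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in set(xs) | set(ys):
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d


def ks_bootstrap_pvalue(x, y, n_boot=1000, seed=0):
    """Resampling p-value for the two-sample KS statistic.

    Under H0 both samples come from one distribution, so new samples of the
    original sizes are drawn with replacement from the pooled data. Ties are
    handled naturally because no continuous null distribution is assumed.
    """
    rng = random.Random(seed)
    observed = ks_statistic(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_boot):
        boot_x = [rng.choice(pooled) for _ in x]
        boot_y = [rng.choice(pooled) for _ in y]
        if ks_statistic(boot_x, boot_y) >= observed:
            hits += 1
    return hits / n_boot
```

The cost grows linearly with the number of resamples for every locus tested, which illustrates why this variant is time-intensive when many loci are screened.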
3.3.2 Simulation studies of the likelihood ratio test
The last method to enable detection of differential imprinting in two subsets of data was based
on MLE and LRT, as outlined in section 2.4.3. Again simulation studies were first conducted
to check the method’s validity. Random SNP data were created using the procedure described
in section 2.4.1. As for LRT-based imprinting detection, more variation was included in the
generated data by removal of not exactly half of the alleles for the homozygous samples
(see section 3.2.2). First, the correctness of the PMF and estimated imprinting factors was
analysed. Subsequently, validity of the test was checked under the null hypothesis. As for
detection of imprinting, the model with only two alleles was used and SeqEM was not yet
included in the methodology.
3.3.2.1 Evaluation of the mixture model and estimation of the imprinting factor
To validate the correctness of the created model, simulated data and the mixture model were
generated and compared. In figure 3.13, data for 1 locus with 1000 samples, a PA of 0.5, an
error rate of 0.02 and a coverage of 100 were used. The control samples were always fully
imprinted (i=1), while the tumour samples were either not (i=0) or fully imprinted (i=1).
For 100% imprinting of the tumour samples i was estimated as 1 under the null hypothesis
(for both subsets together) as well as under the alternative hypothesis (for the two subsets
separately). The right panel in figure 3.13 shows that for the alternative hypothesis with i
estimated as 1 for the two subsets, the mixture model fits the simulated data perfectly. When
tumour samples were not imprinted i was estimated as 0.04 under the null hypothesis (for
the whole dataset) and as 1 and 0.04 for control and tumour samples, respectively, under the
alternative hypothesis. The plot of i estimated under the alternative hypothesis (for the two
subsets separately) shows again that the mixture model fits the simulated data (Fig. 3.13 left
panel).
Figure 3.13: Plots of simulated data (red) and model fitted under the alternative hypothesis (green)
with an error rate of 0.02 and a PA of 0.5. Imprinting of the tumour samples is either 0% (left) or
100% (right). In the left figure i was estimated as 1 for control samples and as 0.04 for
tumour samples. In the right figure i was estimated as 1 in both the control and tumour
samples. Data of 1 locus with 1000 samples and a coverage of 100 are shown here.
As described above, under the null hypothesis, i was estimated as the lowest amount of
imprinting present in one of the two subsets, whereas one could naively assume that an
average value would be more appropriate. To assess whether this estimate can be considered
reliable, simulated data for the scenario described above (i.e. 100% imprinting in controls and
0% in cases) were plotted together with models assuming 0%, 50%, 75% and 100% imprinting for the
full population (Fig. 3.14). These plots show that indeed an amount of imprinting of around
0% fits the simulated data best, as the likelihoods of observing the heterozygous samples in the
cases would otherwise be virtually equal to zero, with major impact on the joint likelihood.
This explains why under the null hypothesis i was estimated as 0.04.
Figure 3.14: Plots of simulated data (red) and fitted model under H0 (green) with an error rate of
0.02 and a PA of 0.5. The simulated data were not imprinted. The fitted model under
H0 was created with one predetermined i for all samples instead of estimating it. In
(a) the PMF was developed with an i of 0, in (b) i was 0.5, in (c) i was 0.75 and in (d)
i was 1. Data of 1 locus with 1000 samples and a coverage of 100 are shown.
3.3.2.2 Detection of differential imprinting under the null hypothesis
As for detection of imprinting, the distribution of p-values under the null hypothesis was
evaluated to see whether it follows a uniform distribution. Also here, the null hypothesis lies
on the boundary of the parameter space (cf. previous chapter), and therefore a mixture
of χ² distributions with 0 and 1 degrees of freedom (50% each) was again used to obtain
p-values. The result of 10000 loci with 100 samples each can be seen in figure 3.15. Data were
generated as described above with a PA of 0.5, an error rate of 0.02 and a coverage of 100.
Because the null hypothesis (no differential imprinting) had to be valid, tumour as well as
control samples were imprinted (i equal to 1). Figure 3.15 shows that again a peak at p equal
to 1 can be seen. Other p-values were smaller than 0.5 due to division by two. However,
not unexpectedly, the p-values do not approximate a uniform distribution, even for those
p-values < 0.5.
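How such a mixture translates a test statistic into a p-value can be shown with a short Python sketch. It assumes the usual 50:50 mixture: the χ² component with 0 degrees of freedom is a point mass at zero, so a statistic of zero gives p = 1, and any positive statistic gives half of the upper tail probability of a χ² distribution with 1 degree of freedom. The function names are hypothetical and the code is not taken from the thesis implementation.

```python
import math


def chi2_1df_upper_tail(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def mixture_pvalue(lrt_statistic):
    """p-value under a 50:50 mixture of chi-square(0) and chi-square(1)."""
    if lrt_statistic <= 0.0:
        return 1.0  # all mass of the chi-square(0) component sits at zero
    return 0.5 * chi2_1df_upper_tail(lrt_statistic)


for stat in (0.0, 0.5, 2.71, 3.84, 6.63):
    print(f"LRT statistic {stat:5.2f} -> p = {mixture_pvalue(stat):.4f}")
```

This makes explicit why every p-value other than 1 lies below 0.5, and why, if the mixture were correct, about half of the loci under the null hypothesis would be expected to have p = 1.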
Figure 3.15: Histogram (right) and cumulative distribution (left) of the p-values under the null
hypothesis. The null distribution is calculated as a mixture of χ² distributions with 0
and 1 degrees of freedom. The data used in this graph had a PA of 0.5, an error rate
of 0.02 and a coverage of 100. It shows the results of 10000 loci with 100 samples each.
The imprinting factor was 1 for tumour as well as control samples to ensure that the
null hypothesis was true.
The different fractions were also calculated to get a more detailed overview of the distribution
(Table 3.8). Moreover, the cumulative distribution of the p-values was plotted (Fig. 3.15
left). Table 3.8 shows that a correct distribution of p-values under the null hypothesis
was not yet obtained, as less than 50% of the data is smaller than 0.5. Around 20% is smaller than
0.5 and the peak at a p-value of 1 contains almost 80% of the data. Furthermore, less than
1% or 5% of the data is smaller than 0.01 or 0.05, respectively, and so on. In conclusion, the
assumed null distribution of the test statistic is not yet accurately modelled, and additional
work will be required to elucidate how the constrained parameter space problem affects the
outcome.
Table 3.8: The fraction of p-values between two consecutive p-value thresholds and the cumulative
fractions of those p-values. Data of 10000 loci with 100 control samples, 100 tumour samples, an
allele frequency of 0.5, an error rate of 0.02, a coverage of 100 and 0% imprinting were used in
the calculations.

p-value      Fraction   Cumulative fraction
0.01         0.0089     0.0089
0.02         0.0133     0.0222
0.05         0.0335     0.0557
0.10         0.0439     0.0996
0.15         0.0361     0.1357
0.20         0.0252     0.1609
0.25         0.0172     0.1781
0.30         0.0150     0.1931
0.35         0.0110     0.2041
0.40         0.0079     0.2120
0.45         0.0046     0.2166
0.50         0.0018     0.2184
0.55-0.95    0.000      0.2184
1.00         0.7816     1.000
Chapter 4
Discussion
Monoallelic methylation (MAM) is the epigenetic event in which one allele is methylated and hence
only one allele is expressed. Imprinting occurs when monoallelic expression is based on the
parental origin. Research has shown the importance of imprinting in cells as well as in
diseases. Loss of imprinting often occurs in cancer and thus a lot of effort is put into studying
these imprinting patterns. However, this is very difficult as enrichment-based sequencing
techniques do not directly offer information on MAM, and hence on imprinting. Although
this problem was overcome by the methodology developed by Steyaert et al. at the BioBix lab
in 2014, further improvement was still necessary. In this master’s thesis a novel methodology
was developed based on this previous work. Furthermore, another data type (RNA-seq) was
studied and the method was extended to enable detection of loss of imprinting.
4.1 Detection of imprinting
A new methodology to discover loci featuring imprinting was developed based on MLE and
LRT. Detection of imprinting was possible in simulated data as well as RNA-seq data from
TCGA. The method was first shown to be valid (Fig. 3.1, 3.2, 3.3 and 3.4) and efficient (Fig.
3.5, 3.6 and 3.7) on simulated data. Afterwards, the method was used to screen TCGA data
for imprinted loci. More specifically, chromosomes 11 and 21 of 113 control samples from the
breast cancer RNA-seq data were studied. Although several problems were present, detection
of imprinting was possible. This was demonstrated by the detection of known imprinted loci on
chromosome 11, namely IGF2, H19 and KCNQ1 (Fig. 3.8 and 3.9).
The biggest problem was that no coverage or allele frequency filtering had yet been implemented.
Because samples with small coverages do not represent the SNP well, fitting a good PMF
was very difficult. Furthermore, some SNPs were only present in a small number of samples,
causing problems as well. However, filtering of the data will at least partially solve these
problems and this will definitely be implemented in the future. Note that this will also have
a positive impact on FDR calculation, as filtering reduces the number of statistical tests
performed.
Other obstacles were present as well. Firstly, clearly aberrant SNPs caused major difficulties.
As can be seen in figure 3.10, no good fit of the PMF can be found in these examples and hence
wrong estimates of i are obtained, leading to false positives. Caution is thus necessary when
significant SNPs are found. However, stricter filtering of the data beforehand can again
remove many of these difficult SNPs. One additional solution may be to test whether
the unconstrained model fits the data well, e.g. using a goodness of fit test, before performing
statistical inference regarding significant imprinting. Note that this is conceptually similar to
testing for normality before performing a t-test.
Another problem could be an underestimation of the variance for the heterozygous fraction. [136]
As sequencing includes a PCR step, amplification errors made in the beginning of the procedure
are exponentially magnified in the data. Furthermore, the studied samples may be
heterogeneous, with variable subpopulations of imprinted and non-imprinted cell types. As
the binomial distribution cannot capture this extra variance, a beta-binomial distribution
may be more appropriate to model the data for heterozygous samples. The beta-binomial
distribution can be considered as a binomial distribution where the probability of success
(thus of obtaining for example allele A) is not fixed but follows a beta distribution. As the
model can capture extra variance, it is expected to provide a better fit to the data.
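The extra variance of the beta-binomial distribution can be made concrete with a small Python sketch. The coverage of 100 and the shape parameters α = β = 10 are arbitrary illustrative choices, not values estimated in this thesis; both distributions have the same mean of 50 reads for allele A, so only the spread differs.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_pmf(k, n, alpha, beta):
    """Beta-binomial probability: binomial with p ~ Beta(alpha, beta)."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return math.exp(log_choose + log_beta(k + alpha, n - k + beta)
                    - log_beta(alpha, beta))


def mean_and_variance(pmf):
    """Mean and variance of a distribution given as a list pmf[k]."""
    mean = sum(k * prob for k, prob in enumerate(pmf))
    var = sum((k - mean) ** 2 * prob for k, prob in enumerate(pmf))
    return mean, var


n = 100  # coverage of one heterozygous sample (illustrative)
binom = [binom_pmf(k, n, 0.5) for k in range(n + 1)]
betabinom = [betabinom_pmf(k, n, 10.0, 10.0) for k in range(n + 1)]

print("binomial      mean, variance:", mean_and_variance(binom))
print("beta-binomial mean, variance:", mean_and_variance(betabinom))
```

With these settings the binomial variance is n·p·(1−p) = 25, whereas the beta-binomial variance is several times larger, which is the kind of overdispersion that PCR amplification and sample heterogeneity are expected to introduce.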
Lastly, the assumption of Hardy-Weinberg equilibrium can also cause a problem. If
the data are not in equilibrium, but for example inbreeding occurs, the weights used in the
mixture distribution are not valid. This may lead to false positives, as e.g. a situation of
100% inbreeding and 100% imprinting both lead to the (apparent) lack of heterozygotes and
significant detection of imprinting. However, this can be solved by estimation of the amount
of inbreeding and correction for this in the weights of the mixture model. This is rather
straightforward to implement, as we can assume that most loci are not imprinted and that
the alleles present in the expression data thus accurately represent the underlying genotypes
in most cases. Note that alternative weights can also be used if information on the underlying
genotypes is available (e.g. from SNP arrays). In this scenario, one can focus on solely the
putative heterozygous samples, and use weights based on the genotyping error rate.
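How inbreeding changes the mixture weights can be sketched in Python with the standard population-genetic genotype frequencies for an inbreeding coefficient F. The symbol F, the function names and the simple estimator from the observed heterozygote fraction are assumptions made for this illustration; they are not part of the implemented methodology.

```python
def genotype_weights(p_a, f=0.0):
    """Expected genotype frequencies for allele frequency p_a.

    f is the inbreeding coefficient: f = 0 gives Hardy-Weinberg proportions,
    f = 1 removes all heterozygotes.
    """
    q = 1.0 - p_a
    return {
        "AA": p_a ** 2 + f * p_a * q,
        "AB": 2.0 * p_a * q * (1.0 - f),
        "BB": q ** 2 + f * p_a * q,
    }


def estimate_inbreeding(observed_het_fraction, p_a):
    """F = 1 - observed / expected heterozygosity (assumes no imprinting)."""
    expected = 2.0 * p_a * (1.0 - p_a)
    return 1.0 - observed_het_fraction / expected


print(genotype_weights(0.5))          # Hardy-Weinberg weights
print(genotype_weights(0.5, f=1.0))   # complete inbreeding: no heterozygotes
print(estimate_inbreeding(0.25, 0.5))
```

The second call shows why complete inbreeding is indistinguishable from complete imprinting when only the apparent genotypes are observed: in both cases the heterozygous weight is zero.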
Thus, though solutions are presented for the different encountered problems, further research
is necessary to ascertain that extra filtering of the data indeed improves overall results. After
further optimisation, the methodology will be used for detection of new imprinted loci. The
functional annotations of the significant SNP positions should then be determined as well.
This will allow better understanding of the process and importance of imprinting and may
shed further light on the correctness of the “parental conflict hypothesis” as the evolutionary
origin of imprinting.
4.2 Detection of differential imprinting
In this master’s thesis, different methodologies for detection of differential imprinting were
developed as well. Simulation studies were conducted to evaluate these methods. Detection
of loss of imprinting was, among others, based on comparing two populations of binomial
p-values from control and tumour samples with the Kolmogorov-Smirnov test or the Wilcoxon
Rank Sum test. Table 3.6 shows that detection of differential imprinting was indeed possible
for diverse allele frequencies. However, a bias was detected for small minor allele frequencies.
This bias is most likely also present for more balanced allele frequencies, but with far lower
impact due to the higher fraction of heterozygous samples. Hence, binomial p-values were
replaced by the ratios of lowest to highest allele counts (Table 3.7). For both approaches,
the KS test was the most sensitive test, which is probably because the two tests assess
differences between two populations in different ways. The KS test is able to detect differences
in shape. [137] In figures 3.11 and 3.12 it is clear that the shapes of the distributions are not
equal. Because the power of the KS test is higher than that of the WRS test for differences in
shape, it is not surprising that the KS test was better for the detection of loss of imprinting.
Moreover, as the KS test looks at the maximal difference of the cumulative distributions for two
populations, it may be better suited to observe a local difference in the distributions. A varying
amount of heterozygotes is a local difference, which may explain the increased sensitivity of the
KS test. However, as ties were present in the data, caution remains necessary, especially for
real data.
A methodology for detection of differential imprinting was also developed based on the
likelihood ratio test. Simulation studies again showed promising results with good estimates of
the degree of imprinting and a good fit of the PMF (Fig. 3.13 and 3.14). However, some
problems are still present in the methodology. Figure 3.15 shows that the distribution of
the test statistic under the null hypothesis cannot yet be accurately modelled. Hence, the
test is not yet valid. Nevertheless, with the results so far it is expected that small
adjustments will lead to a valid and effective test for the detection of differential imprinting.
Further research is thus still necessary to optimise and improve the method. In the meantime, the
KS test for detection of LOI was already very efficient, particularly for allelic ratios, so this can
already be applied. It should however be noted that this approach does not take into account
differences in sequencing depth, which affect the reliability of the obtained allelic ratios.
The TCGA data still have to be studied for loci featuring loss of imprinting with these
methodologies. New insights in tumour development and regulation could be found. However,
it is expected that for the LRT-based approach, models using the beta-binomial distribution
for the heterozygous fractions will be required. As tumour tissue is very heterogeneous,
especially compared to control samples, models allowing for more variance are expected to
result in significantly better fits.
4.3 Computational efficiency
In order to reduce computational load, SeqEM was included in the different methodologies for
estimation of the allele frequency and error rate. Tables 3.1 and 3.2 show that SeqEM was
indeed able to precisely estimate these parameters. However, sequences with more than two
alleles caused a problem as SeqEM cannot cope with these sequences. Upon performing a
prior correction step, estimates of the allele frequency were still correct, but the error rate was
underestimated, though this has no major consequences for our methods. More importantly
however, the filtering step necessary to remove all third and fourth alleles increased the
computational intensity. The benefit of including SeqEM in the models was thus nullified.
Even though promising results were found with SeqEM, another genotype caller able to handle
four alleles, and thus without need of a prior correction, is preferred in the future. An example
is ANGSD (Analyses of Next-Generation Sequencing Data) in which the allele frequency
spectrum is estimated using maximum likelihood. [138,139] Other methods based on MLE are
available as well. [140] Nevertheless, here SeqEM proved efficient for the developed methodology,
albeit with some obstacles.
Another computationally intensive step is the estimation of the imprinting factor in the
LRT-based methods. In R this was implemented with a line search and this was the most demanding
step in the algorithm. As more and more data will need to be analysed in the future,
improvement of this step is desirable. A faster estimation could be obtained using the method of
moments. [141] Here, the sample moments are equated with the theoretical equations for the
population moments (using the models). As i is the only unknown factor in the model, this
can be easily transformed into an equation of i as a function of the expected value. So, a first
estimation of the amount of imprinting for a locus can be obtained without the use of a line
search. Further optimisation of i can then be done with the Newton-Raphson algorithm. [142]
Starting from an initial estimate (here the value obtained from the method of moments), the
tangent at the current estimate is determined and the point where this tangent crosses zero is
used as the start of the next iteration, until i converges. Implementing these methods could
significantly reduce the computational load, which is definitely preferred as an increasing
amount of data will have to be analysed. Furthermore, if the beta-binomial distribution (see
section 4.4) is employed in the future, a line search will become too complicated and inefficient.
Here, the method of moments (for both mean and variance) would definitely be beneficial.
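The combination of a method-of-moments starting value and Newton-Raphson refinement can be illustrated with a deliberately simplified Python sketch. It does not use the mixture model of this thesis. Instead it assumes a hypothetical one-parameter stand-in in which the read count k of the silenced allele in a heterozygous sample with coverage n is binomial with success probability (1 − i)/2, so that i = 0 gives balanced expression and i = 1 gives no reads of the silenced allele. All names and numbers are illustrative.

```python
def clamp(i):
    """Keep i inside [0, 1) so that p = (1 - i) / 2 stays positive."""
    return min(max(i, 0.0), 1.0 - 1e-9)


def method_of_moments(k, n):
    """First estimate: E[k / n] = (1 - i) / 2, hence i = 1 - 2 * mean(k / n)."""
    mean_fraction = sum(ki / ni for ki, ni in zip(k, n)) / len(k)
    return clamp(1.0 - 2.0 * mean_fraction)


def newton_raphson(k, n, i_start, tol=1e-12, max_iter=50):
    """Refine i by finding the root of the score (log-likelihood derivative)."""
    i = clamp(i_start)
    for _ in range(max_iter):
        p = (1.0 - i) / 2.0
        # First and second derivative of the binomial log-likelihood w.r.t. i
        # (chain rule with dp/di = -1/2).
        score = -0.5 * sum(ki / p - (ni - ki) / (1.0 - p)
                           for ki, ni in zip(k, n))
        hessian = -0.25 * sum(ki / p ** 2 + (ni - ki) / (1.0 - p) ** 2
                              for ki, ni in zip(k, n))
        step = score / hessian
        i = clamp(i - step)  # root of the tangent to the score function
        if abs(step) < tol:
            break
    return i


k = [10, 5, 20]     # reads of the silenced allele (invented)
n = [100, 40, 200]  # coverage per sample (invented)
i_start = method_of_moments(k, n)
i_hat = newton_raphson(k, n, i_start)
print(f"method of moments: {i_start:.4f}, after Newton-Raphson: {i_hat:.4f}")
```

In this toy model the maximum likelihood estimate has a closed form, 1 − 2·Σk/Σn, which makes it easy to check that the iteration converges to the right value. The point of the sketch is only that a good starting value lets Newton-Raphson converge in a handful of iterations, where a line search would evaluate the likelihood many more times.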
4.4 Statistical framework
Although the statistical framework of the LRT described above already enables detection of
imprinting, further improvement is still possible. The assumption here was that for a locus the
degree of imprinting was equal in all samples. This will, however, not always be true (especially
in tumour data). The heterogeneity of samples can differ, leading to unequal amounts of
imprinting for a SNP. By allowing the imprinting factor to vary between samples an even
better method could be obtained. Modelling the data with a beta-binomial distribution would allow
variation in the degree of imprinting between samples. As this models reality better than the
methodology proposed so far, better detection is expected.
Detection of loss of heterozygosity (LOH) can be very interesting as well. A locus can seem
homozygous due to only one parental copy being present. This often occurs in cancer in
which one copy of a tumour suppressor gene is lost. If the other copy is silenced by e.g.
mutation or DNA methylation, the gene becomes non-functional, leading to a lack of tumour
suppressor activity. This allows tumours to develop and grow. The developed methodology
for the detection of imprinting can also be employed for the detection of LOH. In essence,
LOH detection is equivalent to detection of 100% LOI but then in tumour samples instead of
controls.
Chapter 5
Conclusion
Imprinting, typically associated with DNA methylation, is the phenomenon where only one allele
is expressed in a parent-of-origin specific manner. Often it is deregulated in cancer and
non-Mendelian inherited diseases. In 2014 a pipeline to screen for loci featuring monoallelic DNA
methylation in MethylCap-seq data was developed at the BioBix lab by Steyaert et al. This
methodology enabled detection of imprinted genes, and both known and novel loci could
be identified. In this master’s thesis an improved methodology was developed by solving
several drawbacks associated with the previous method. Firstly, the computational load was
reduced by using SeqEM for estimation of the allele frequencies and error rates. Secondly, a
novel statistical framework was developed, based on a mixture distribution model of the data.
Thirdly, the methodology was tested on RNA-seq data to show the efficiency on other data
types. Lastly, methods for the detection of loss of imprinting in cases compared to controls
were implemented. Efficiency and validity of the developed methodologies were shown using
simulated data, though some problems remain to be resolved.
Detection of imprinting was also studied in RNA-seq data from TCGA. Here, some well-known
imprinted loci were detected, which proves the efficacy of the method. However, additional
filtering of the data is necessary in the future to enhance detection of new imprinted genes.
Detection of LOI was not yet tested on real data, because optimisation of the methodologies
is still necessary. However, studying the TCGA data for LOI will definitely be performed
in the future. The research in this thesis thus serves as a proof-of-concept for the efficient
detection of (differential) imprinting. However, further optimisation is required: besides
filtering of aberrant loci, additional effort is needed to decrease the computational load and to
create a further extended statistical framework.
When a complete pipeline is developed after optimisation, the full TCGA dataset but also
other resources (RNA, DNA methylation, histone marks, ...) can be studied. This will provide
fundamental information on the process and origin of imprinting and how cancer cells benefit
from its dysregulation. The methodology can then also be used to study other diseases and
even allow detection of loss of heterozygosity in cancer.
References
1. Conrad Hal Waddington. The epigenotype. Endeavour, 1:18–
20, 1942.
2. Aaron D Goldberg, C David Allis, and Emily Bernstein. Epigenetics: a landscape takes shape. Cell, 128(4):635–638, 2007.
3. Eva Jablonka and Marion J Lamb. The changing concept of
epigenetics. Annals of the New York Academy of Sciences, 981
(1):82–96, 2002.
4. Conrad Hal Waddington. The strategy of the genes. Routledge,
2014.
5. Jordana T Bell and Tim D Spector. A twin approach to unraveling epigenetics. Trends in Genetics, 27(3):116–125, 2011.
6. Adrian Bird. Perceptions of epigenetics. Nature, 447(7143):
396–398, 2007.
7. Geneviève P Delcuve, Mojgan Rastegar, and James R Davie.
Epigenetic control. Journal of cellular physiology, 219(2):243–
250, 2009.
8. Alan P Wolffe and Marjori A Matzke. Epigenetics: regulation
through repression. Science, 286(5439):481–486, 1999.
9. Lyle Armstrong. Epigenetics. Garland Science, 2013.
10. Peter A Jones. Functions of DNA methylation: islands, start
sites, gene bodies and beyond. Nature Reviews Genetics, 13
(7):484–492, 2012.
11. Mehrdad Ghavifekr Fakhr, Majid Farshdousti Hagh, Dariush
Shanehbandi, and Behzad Baradaran. DNA methylation pattern as important epigenetic criterion in cancer. Genetics research international, 2013, 2013.
12. Frank Larsen, Glenn Gundersen, Rodrigo Lopez, and Hans
Prydz. CpG islands as gene markers in the human genome.
Genomics, 13(4):1095–1107, 1992.
13. John Newell-Price, Adrian JL Clark, and Peter King. DNA
methylation and silencing of gene expression. Trends in Endocrinology & Metabolism, 11(4):142–148, 2000.
14. Godelieve Gheysen. Cursus epigenetica, 2012-2013.
15. Peter A Jones and Daiya Takai. The role of DNA methylation in mammalian epigenetics. Science, 293(5532):1068–1070,
2001.
16. Moshe Szyf. The dynamic epigenome and its implications in
toxicology. Toxicological Sciences, 100(1):7–23, 2007.
17. Louise Laurent, Eleanor Wong, Guoliang Li, Tien Huynh,
Aristotelis Tsirigos, Chin Thing Ong, Hwee Meng Low, Ken
Wing Kin Sung, Isidore Rigoutsos, Jeanne Loring, et al. Dynamic changes in the human methylome during differentiation.
Genome research, 20(3):320–331, 2010.
18. Tina Branscombe Miranda and Peter A Jones. DNA methylation: the nuts and bolts of repression. Journal of cellular
physiology, 213(2):384–390, 2007.
19. Masaki Okano, Daphne W Bell, Daniel A Haber, and En Li.
DNA methyltransferases DNMT3a and DNMT3b are essential
for de novo methylation and mammalian development. Cell,
99(3):247–257, 1999.
20. Mary Grace Goll, Finn Kirpekar, Keith A Maggert, Jeffrey A
Yoder, Chih-Lin Hsieh, Xiaoyu Zhang, Kent G Golic, Steven E
Jacobsen, and Timothy H Bestor. Methylation of tRNAAsp
by the DNA methyltransferase homolog DNMT2. Science, 311
(5759):395–398, 2006.
21. Rahul M Kohli and Yi Zhang. TET enzymes, TDG and the
dynamics of DNA demethylation. Nature, 502(7472):472–479,
2013.
22. Hao Wu and Yi Zhang. Mechanisms and functions of TET
protein-mediated 5-methylcytosine oxidation. Genes & development, 25(23):2436–2452, 2011.
23. Gerd P Pfeifer, Swati Kadam, and Seung-Gi Jin. 5-hydroxymethylcytosine
and its potential roles in development
and cancer. Epigenetics Chromatin, 6(10):1–9, 2013.
24. Rajneesh Richa and Rajeshwar P Sinha. Hydroxymethylation
of DNA: An epigenetic marker. EXCLI JOURNAL, 13:592–
610, 2014.
25. Silvia Udali, Patrizia Guarini, Sara Moruzzi, Andrea
Ruzzenente, Stephanie A Tammen, Alfredo Guglielmi, Simone
Conci, Patrizia Pattini, Oliviero Olivieri, Roberto Corrocher,
et al. Global DNA methylation and hydroxymethylation differ in hepatocellular carcinoma and cholangiocarcinoma and
relate to survival rate. Hepatology, 2015.
26. Tony Kouzarides. Chromatin modifications and their function.
Cell, 128(4):693–705, 2007.
27. David E Sterner and Shelley L Berger. Acetylation of histones
and transcription-related factors. Microbiology and Molecular
Biology Reviews, 64(2):435–459, 2000.
28. Ali Shilatifard. Chromatin modifications by methylation and
ubiquitination: implications in the regulation of gene expression. Annu. Rev. Biochem., 75:243–269, 2006.
29. Dafna Nathan, Kristin Ingvarsdottir, David E Sterner, Gwendolyn R Bylebyl, Milos Dokmanovic, Jean A Dorsey, Kelly A
Whelan, Mihajlo Krsmanovic, William S Lane, Pamela B
Meluh, et al. Histone sumoylation is a negative regulator in
Saccharomyces cerevisiae and shows dynamic interplay with
positive-acting histone modifications. Genes & development,
20(8):966–976, 2006.
30. Paul O Hassa, Sandra S Haenni, Michael Elser, and Michael O
Hottiger. Nuclear ADP-ribosylation reactions in mammalian
cells: where are we today and where are we going? Microbiology and Molecular Biology Reviews, 70(3):789–829, 2006.
31. Ken-ichi Noma, C David Allis, and Shiv IS Grewal. Transitions in distinct histone H3 methylation patterns at the heterochromatin domain boundaries. Science, 293(5532):1150–
1155, 2001.
32. Anna Portela and Manel Esteller. Epigenetic modifications
and human disease. Nature biotechnology, 28(10):1057–1068,
2010.
33. Shelley L Berger. Histone modifications in transcriptional regulation. Current opinion in genetics & development, 12(2):142–
148, 2002.
34. André Verdel, Aurélia Vavasseur, Madalen Le Gorrec, Leila
Touat-Todeschini, et al. Common themes in siRNA-mediated
epigenetic silencing pathways. International Journal of Developmental Biology, 53(2):245, 2009.
35. Helen E White, Victoria J Durston, John F Harvey, and
Nicholas CP Cross. Quantitative analysis of SNRPN gene
methylation by pyrosequencing as a diagnostic test for Prader–
Willi syndrome and Angelman syndrome. Clinical chemistry,
52(6):1005–1013, 2006.
36. Manel Esteller. Epigenetics in cancer. New England Journal
of Medicine, 358(11):1148–1159, 2008.
37. Shikhar Sharma, Theresa K Kelly, and Peter A Jones. Epigenetics in cancer. Carcinogenesis, 31(1):27–36, 2010.
38. Mark A Dawson and Tony Kouzarides. Cancer epigenetics:
from mechanism to therapy. Cell, 150(1):12–27, 2012.
39. Andrew P Feinberg and Benjamin Tycko. The history of cancer epigenetics. Nature Reviews Cancer, 4(2):143–153, 2004.
40. Renata Z Jurkowska and Albert Jeltsch. Genomic imprinting:
The struggle of the genders at the molecular level. Angewandte
Chemie International Edition, 52(51):13524–13536, 2013.
50. Lara K Abramowitz and Marisa S Bartolomei. Genomic imprinting: recognition and marking of imprinted loci. Current
opinion in genetics & development, 22(2):72–78, 2012.
51. CW Hanna and G Kelsey. The specification of imprints in
mammals. Heredity, 113(2):176–183, 2014.
52. Anne C Ferguson-Smith and M Azim Surani. Imprinting and
the epigenetic asymmetry between parental genomes. Science,
293(5532):1086–1089, 2001.
53. J Greg Falls, David J Pulford, Andrew A Wylie, and Randy L
Jirtle. Genomic imprinting: implications for human disease.
The American journal of pathology, 154(3):635–647, 1999.
54. P Jelinic and P Shaw. Loss of imprinting and cancer. The
Journal of pathology, 211(3):261–268, 2007.
55. Randy L Jirtle. Genomic imprinting and cancer. Experimental
cell research, 248(1):18–24, 1999.
56. Ayman Grada and Kate Weinbrecht. Next-generation sequencing: methodology and application. Journal of Investigative
Dermatology, 133(8):e11, 2013.
57. Frederick Sanger, Steven Nicklen, and Alan R Coulson. DNA
sequencing with chain-terminating inhibitors. Proceedings of
the National Academy of Sciences, 74(12):5463–5467, 1977.
58. Erwin L van Dijk, Hélène Auger, Yan Jaszczyszyn, and Claude
Thermes. Ten years of next-generation sequencing technology.
Trends in genetics, 30(9):418–426, 2014.
59. Michael L Metzker. Sequencing technologies-the next generation. Nature Reviews Genetics, 11(1):31–46, 2010.
60. Elaine R Mardis. The impact of next-generation sequencing technology on genetics. Trends in genetics, 24(3):133–141,
2008.
41. Yoshiaki Tarutani and Seiji Takayama. Monoallelic gene expression and its mechanisms. Current opinion in plant biology,
14(5):608–613, 2011.
61. HPJ Buermans and JT den Dunnen. Next generation sequencing technology: advances and applications. Biochimica et Biophysica Acta (BBA)-Molecular Basis of Disease, 1842(10):1932–
1941, 2014.
42. Rolf Ohlsson, Benjamin Tycko, and Carmen Sapienza.
Monoallelic expression: ‘there can only be one’. Trends in Genetics, 14(11):435–438, 1998.
62. Eric E Schadt, Steve Turner, and Andrew Kasarskis. A window into third-generation sequencing. Human molecular genetics, 19(R2):R227–R240, 2010.
43. Hiroshi Shiba and Seiji Takayama. Epigenetic regulation of
monoallelic gene expression. Development, growth & differentiation, 54(1):120–128, 2012.
44. Andrew Chess. Mechanisms and consequences of widespread
random monoallelic expression. Nature Reviews Genetics, 13
(6):421–428, 2012.
63. Jay Shendure and Hanlee Ji. Next-generation DNA sequencing. Nature biotechnology, 26(10):1135–1145, 2008.
64. Gerardo Turcatti, Anthony Romieu, Milan Fedurco, and Ana-Paula Tairi. A new class of cleavable fluorescent nucleotides:
synthesis and optimization as reversible terminators for DNA
sequencing by synthesis. Nucleic acids research, 36(4):e25–e25,
2008.
45. Denise P Barlow and Marisa S Bartolomei. Genomic imprinting in mammals. Cold Spring Harbor Perspectives in Biology, 6
(2), 2014.
65. Michael L Metzker. Emerging technologies in DNA sequencing. Genome research, 15(12):1767–1776, 2005.
46. Tom Moore and David Haig. Genomic imprinting in mammalian development: a parental tug-of-war. Trends in Genetics, 7(2):45–49, 1991.
66. Fei Chen, Mengxing Dong, Meng Ge, Lingxiang Zhu, Lufeng
Ren, Guocheng Liu, and Rong Mu. The history and advances
of reversible terminators used in new generations of sequencing technology. Genomics, proteomics & bioinformatics, 11(1):
34–40, 2013.
47. FM Smith, AS Garfield, and A Ward. Regulation of growth
and metabolism by imprinted genes. Cytogenetic and genome
research, 113(1-4):279–291, 2006.
48. En Li. Chromatin modification and epigenetic reprogramming
in mammalian development. Nature Reviews Genetics, 3(9):
662–673, 2002.
49. Marisa S Bartolomei. Genomic imprinting: employing and
avoiding epigenetic processes. Genes & development, 23(18):
2124–2133, 2009.
67. Ermanno Rizzi, Martina Lari, Elena Gigli, Gianluca De Bellis,
and David Caramelli. Ancient DNA studies: new perspectives
on old samples. Genet Sel Evol, 44(1):21–29, 2012.
68. R Alan Harris, Ting Wang, Cristian Coarfa, Raman P Nagarajan, Chibo Hong, Sara L Downey, Brett E Johnson, Shaun D
Fouse, Allen Delaney, Yongjun Zhao, et al. Comparison of
sequencing-based methods to profile DNA methylation and
identification of monoallelic epigenetic modifications. Nature
biotechnology, 28(10):1097–1105, 2010.
69. Marianne Frommer, Louise E McDonald, Douglas S Millar,
Christina M Collis, Fujiko Watt, Geoffrey W Grigg, Peter L
Molloy, and Cheryl L Paul. A genomic sequencing protocol
that yields a positive display of 5-methylcytosine residues in
individual DNA strands. Proceedings of the National Academy
of Sciences, 89(5):1827–1831, 1992.
70. Russell P Darst, Carolina E Pardo, Lingbao Ai, Kevin D
Brown, and Michael P Kladde. Bisulfite sequencing of DNA.
Current Protocols in Molecular Biology, pages 7–9, 2010.
71. Klaas Mensaert, Simon Denil, Geert Trooskens, Wim
Van Criekinge, Olivier Thas, and Tim De Meyer. Next-generation technologies and data analytical approaches for
epigenomics. Environmental and molecular mutagenesis, 55(3):
155–170, 2014.
72. Ning Li, Mingzhi Ye, Yingrui Li, Zhixiang Yan, Lee M
Butcher, Jihua Sun, Xu Han, Quan Chen, Jun Wang, et al.
Whole genome DNA methylation analysis based on high
throughput sequencing technology. Methods, 52(3):203–212,
2010.
73. Marina Bibikova, Jennie Le, Bret Barnes, Shadi Saedinia-Melnyk, Lixin Zhou, Richard Shen, and Kevin L Gunderson.
Genome-wide DNA methylation profiling using Infinium®
assay. Epigenomics, 1(1):177–200, 2009.
74. Nizar Touleimat and Jörg Tost. Complete pipeline for Infinium®
human methylation 450K BeadChip data processing
using subset quantile normalization for accurate DNA methylation estimation. Epigenomics, 4(3):325–341, 2012.
83. J Adams. Transcriptome: connecting the genome to gene function. Nature Education, 1(1):195, 2008.
84. Zhong Wang, Mark Gerstein, and Michael Snyder. RNA-Seq:
a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10(1):57–63, 2009.
85. Ugrappa Nagalakshmi, Zhong Wang, Karl Waern, Chong
Shou, Debasish Raha, Mark Gerstein, and Michael Snyder.
The transcriptional landscape of the yeast genome defined by
RNA sequencing. Science, 320(5881):1344–1349, 2008.
86. Yongjun Chu and David R Corey. RNA sequencing: platform
selection, experimental design, and data interpretation. Nucleic acid therapeutics, 22(4):271–274, 2012.
87. Fatih Ozsolak and Patrice M Milos. RNA sequencing: advances, challenges and opportunities. Nature reviews genetics,
12(2):87–98, 2010.
88. Rasmus Nielsen, Joshua S Paul, Anders Albrechtsen, and
Yun S Song. Genotype and SNP calling from next-generation
sequencing data. Nature Reviews Genetics, 12(6):443–451,
2011.
89. Stephan Pabinger, Andreas Dander, Maria Fischer, Rene Snajder, Michael Sperk, Mirjana Efremova, Birgit Krabichler,
Michael R Speicher, Johannes Zschocke, and Zlatko Trajanoski.
A survey of tools for variant analysis of next-generation genome sequencing data. Briefings in bioinformatics, 15(2):256–278, 2014.
90. Heng Li, Jue Ruan, and Richard Durbin. Mapping short DNA
sequencing reads and calling variants using mapping quality
scores. Genome research, 18(11):1851–1858, 2008.
75. Arie B Brinkman, Femke Simmer, Kelong Ma, Anita Kaan,
Jingde Zhu, and Hendrik G Stunnenberg. Whole-genome DNA
methylation profiling using MethylCap-seq. Methods, 52(3):
232–236, 2010.
91. Brent Ewing and Phil Green. Base-calling of automated sequencer traces using phred. II. error probabilities. Genome
research, 8(3):186–194, 1998.
76. Filipe V Jacinto, Esteban Ballestar, and Manel Esteller.
Methyl-DNA immunoprecipitation (MeDIP): hunting down
the DNA methylome. Biotechniques, 44(1):35, 2008.
92. Cole Trapnell and Steven L Salzberg. How to map billions
of short reads onto genomes. Nature biotechnology, 27(5):455–
457, 2009.
77. Mark D Robinson, Clare Stirzaker, Aaron L Statham, Marcel W Coolen, Jenny Z Song, Shalima S Nair, Dario Strbenac, Terence P Speed, and Susan J Clark. Evaluation of
affinity-based genome-wide DNA methylation data: effects of
CpG density, amplification bias, and copy number variation.
Genome research, 20(12):1719–1729, 2010.
78. Peter J Park. ChIP–seq: advantages and challenges of a maturing technology. Nature Reviews Genetics, 10(10):669–680,
2009.
79. Partha M Das, Kavitha Ramachandran, Jane vanWert,
and Rakesh Singal. Chromatin immunoprecipitation assay.
Biotechniques, 37(6):961–969, 2004.
80. Tim De Meyer, Evi Mampaey, Michaël Vlemmix, Simon Denil,
Geert Trooskens, Jean-Pierre Renard, Sarah De Keulenaer,
Pierre Dehan, Gerben Menschaert, and Wim Van Criekinge.
Quality evaluation of methyl binding domain based kits for enrichment DNA-methylation sequencing. PloS one, 8(3):e59068,
2013.
81. Benjamin A Flusberg, Dale R Webster, Jessica H Lee, Kevin J
Travers, Eric C Olivares, Tyson A Clark, Jonas Korlach, and
Stephen W Turner. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nature methods, 7
(6):461–465, 2010.
82. David Serre, Byron H Lee, and Angela H Ting. MBD-isolated
genome sequencing provides a high-throughput and comprehensive survey of DNA methylation in the human genome.
Nucleic acids research, 38(2):391–399, 2010.
93. Ben Langmead, Cole Trapnell, Mihai Pop, Steven L Salzberg,
et al. Ultrafast and memory-efficient alignment of short DNA
sequences to the human genome. Genome Biol, 10(3):R25,
2009.
94. Alexander Dobin, Carrie A Davis, Felix Schlesinger, Jorg
Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark
Chaisson, and Thomas R Gingeras. STAR: ultrafast universal
RNA-seq aligner. Bioinformatics, 29(1):15–21, 2013.
95. Paul Flicek and Ewan Birney. Sense from sequence reads:
methods for alignment and assembly. Nature methods, 6:S6–
S12, 2009.
96. Ruiqiang Li, Yingrui Li, Xiaodong Fang, Huanming Yang,
Jian Wang, Karsten Kristiansen, and Jun Wang. SNP detection for massively parallel whole-genome resequencing.
Genome research, 19(6):1124–1132, 2009.
97. François Pompanon, Aurélie Bonin, Eva Bellemain, and Pierre
Taberlet. Genotyping errors: causes, consequences and solutions. Nature Reviews Genetics, 6(11):847–859, 2005.
98. Eden R Martin, DD Kinnamon, Michael A Schmidt, EH Powell, S Zuchner, and RW Morris.
SeqEM: an adaptive
genotype-calling approach for next-generation sequencing
studies. Bioinformatics, 26(22):2803–2810, 2010.
99. Daniel C Koboldt, Qunyuan Zhang, David E Larson, Dong
Shen, Michael D McLellan, Ling Lin, Christopher A Miller,
Elaine R Mardis, Li Ding, and Richard K Wilson. VarScan
2: somatic mutation and copy number alteration discovery in
cancer by exome sequencing. Genome research, 22(3):568–576,
2012.
100. Tom M Mitchell. Bayesian learning. Machine learning. New
York: McGraw-Hill Education, 1997.
101. Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm.
Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38, 1977.
102. Chuong B Do and Serafim Batzoglou. What is the expectation
maximization algorithm? Nature biotechnology, 26(8):897–899,
2008.
103. Sandra Steyaert, Wim Van Criekinge, Ayla De Paepe, Simon Denil, Klaas Mensaert, Katrien Vandepitte, Wim Vanden
Berghe, Geert Trooskens, and Tim De Meyer. SNP-guided
identification of monoallelic DNA-methylation events from
enrichment-based sequencing data. Nucleic acids research, 42
(20):e157–e157, 2014.
104. Oliver Mayo. A century of Hardy–Weinberg equilibrium. Twin
Research and Human Genetics, 11(03):249–256, 2008.
105. AWF Edwards. GH Hardy (1908) and Hardy–Weinberg equilibrium. Genetics, 179(3):1143–1150, 2008.
106. Xing Fan, Yin-yan Wang, Chuan-bao Zhang, Gan You, Mingyang Li, Lei Wang, and Tao Jiang. Expression of RINT1 predicts seizure occurrence and outcomes in patients with low-grade gliomas. Journal of cancer research and clinical oncology,
pages 1–6, 2014.
107. The Cancer Genome Atlas. Program overview, Accessed January 3, 2014. URL http://cancergenome.nih.gov/abouttcga/overview.
108. Thomas J Hudson, Warwick Anderson, Axel Aretz, Anna D
Barker, Cindy Bell, Rosa R Bernabé, MK Bhan, Fabien Calvo,
Iiro Eerola, Daniela S Gerhard, et al. International network
of cancer genome projects. Nature, 464(7291):993–998, 2010.
117. The R project for statistical computing. What is R? Introduction to R, Accessed April 7, 2015. URL http://www.r-project.org.
118. Frank J Massey Jr. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American statistical Association, 46
(253):68–78, 1951.
119. David J Sheskin. Handbook of parametric and nonparametric
statistical procedures. CRC Press, 2003.
120. T.W. Kirkman. Statistics to Use: Kolmogorov-Smirnov test,
Accessed January 3, 2014. URL http://www.physics.csbsju.edu/stats/KS-test.html.
121. Peter A Lachenbruch. Comparisons of two-part models with
competitors. Statistics in Medicine, 20(8):1215–1234, 2001.
122. Henry B Mann and Donald R Whitney. On a test of whether
one of two random variables is stochastically larger than the
other. The annals of mathematical statistics, pages 50–60, 1947.
123. Sherri Jackson. Research methods and statistics: A critical
thinking approach. Cengage Learning, 2009.
124. R Ott and Michael Longnecker. An introduction to statistical
methods and data analysis. Cengage Learning, 2008.
125. Anthony WF Edwards. The history of likelihood. International
Statistical Review/Revue Internationale de Statistique, pages 9–
15, 1974.
126. George Casella and Roger L Berger. Statistical inference, volume 2. Duxbury Pacific Grove, CA, 2002.
127. Jay L Devore and Kenneth N Berk. Modern mathematical
statistics with applications. Cengage Learning, 2007.
109. The Cancer Genome Atlas. Cancers selected for study, Accessed January 3, 2014. URL http://cancergenome.nih.gov/cancersselected.
128. Anne Catherine Black. Maximum likelihood estimation and multiple imputation: A Monte Carlo comparison of modern missing
data techniques for multilevel data. ProQuest, 2008.
110. Thomas J Giordano. The cancer genome atlas research network: A sight to behold. Endocrine pathology, 25(4):362–365,
2014.
129. John P Huelsenbeck and Keith A Crandall. Phylogeny estimation and hypothesis testing using maximum likelihood. Annual
Review of Ecology and Systematics, pages 437–466, 1997.
111. The Cancer Genome Atlas. Data portal, Accessed January 3,
2014. URL https://tcga-data.nci.nih.gov/tcga/tcgaHome2.jsp.
130. William Gould, Jeffrey Pitblado, and William Sribney. Maximum likelihood estimation with Stata. Stata Press, 2006.
112. Roger McLendon, Allan Friedman, Darrell Bigner, Erwin G
Van Meir, Daniel J Brat, Gena M Mastrogianakis, Jeffrey J Olson, Tom Mikkelsen, Norman Lehman, Ken Aldape,
et al. Comprehensive genomic characterization defines human
glioblastoma genes and core pathways. Nature, 455(7216):
1061–1068, 2008.
131. Vijay K Rohatgi. Statistical inference. Courier Corporation,
2003.
113. The Cancer Genome Atlas. Breast ductal carcinoma, Accessed January 3, 2014. URL http://cancergenome.nih.gov/cancersselected/breastductal.
114. NCI. Breast cancer, Accessed January 3, 2014. URL http://www.cancer.gov/cancertopics/types/breast.
115. Inge S Pedersen, Peter A Dervan, Dennise Broderick, Michèle
Harrison, Nicola Miller, Emma Delany, Donal O’Shea, Paul
Costello, Alo McGoldrick, George Keating, et al. Frequent
loss of imprinting of PEG1/MEST in invasive breast cancer.
Cancer research, 59(21):5449–5451, 1999.
116. Preetha J Shetty, Sireesha Movva, Nagarjuna Pasupuleti, Bhavani Vedicherlla, Kiran K Vattam, Sambasivan Venkatasubramanian, Yog R Ahuja, and Qurratulain Hasan. Regulation of
IGF2 transcript and protein expression by altered methylation in breast cancer. Journal of cancer research and clinical
oncology, 137(2):339–345, 2011.
132. David J Olive. A Course in Statistical Theory. 2013.
133. Geert Molenberghs and Geert Verbeke. Likelihood ratio, score,
and Wald tests in a constrained parameter space. The American Statistician, 61(1):22–27, 2007.
134. Peter D Stenson, Matthew Mort, Edward V Ball, Katy Shaw,
Andrew D Phillips, and David N Cooper.
The Human
Gene Mutation Database: building a comprehensive mutation
repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Human genetics, 133
(1):1–9, 2014.
135. Calvin Dytham. Choosing and using statistics: a biologist’s
guide. John Wiley & Sons, 2011.
136. Verena Heinrich, Jens Stange, Thorsten Dickhaus, Peter
Imkeller, Ulrike Krüger, Sebastian Bauer, Stefan Mundlos, Peter N Robinson, Jochen Hecht, and Peter M Krawitz. The
allele distribution in next-generation sequencing data sets is
accurately described as the result of a stochastic branching
process. Nucleic acids research, 40(6):2426–2431, 2012.
137. Gerald Van Belle, Lloyd D Fisher, Patrick J Heagerty, and
Thomas Lumley. Biostatistics: a methodology for the health
sciences, volume 519. John Wiley & Sons, 2004.
138. Rasmus Nielsen, Thorfinn Korneliussen, Anders Albrechtsen,
Yingrui Li, and Jun Wang. SNP calling, genotype calling,
and sample allele frequency estimation from new-generation
sequencing data. PloS one, 7(7):e37558, 2012.
139. Thorfinn S Korneliussen, Anders Albrechtsen, and Rasmus
Nielsen. ANGSD: analysis of next generation sequencing data.
BMC bioinformatics, 15(1):356, 2014.
140. Su Y Kim, Kirk E Lohmueller, Anders Albrechtsen, Yingrui
Li, Thorfinn Korneliussen, Geng Tian, Niels Grarup, Tao
Jiang, Gitte Andersen, Daniel Witte, et al. Estimation of allele frequency and association mapping using next-generation
sequencing data. BMC bioinformatics, 12(1):231, 2011.
141. László Mátyás. Generalized method of moments estimation, volume 5. Cambridge University Press, 1999.
142. HM Antia. Numerical methods for scientists and engineers, volume 1. Springer Science & Business Media, 2002.