Comparison of Statistical models for
genotype calling algorithms
Honours Thesis
Cynthia Ruijie Liu
Department of Mathematics and Statistics,
University of Melbourne
Bioinformatics Division,
The Walter and Eliza Hall Institute of Medical Research
Supervisors:
Dr Guoqi Qian and Dr Matthew E Ritchie
Contents

1 Introduction to Genetics  6
  1.1 Human Genetics  6
  1.2 Single Nucleotide Polymorphism  10
  1.3 Microarray Technology  11
  1.4 Hardy-Weinberg Equilibrium  12
  1.5 Genotyping  14
  1.6 Datasets  16

2 Statistical Models Applied To Genotype Calling Algorithms  19
  2.1 An Introduction of the EM Algorithm  19
  2.2 Normalisation  26
  2.3 GenoSNP  28
  2.4 Illuminus  34
  2.5 CRLMM  37
  2.6 GenCall  41

3 Comparison of Genotyping Algorithms  44
  3.1 Comparison by call confidence  44
  3.2 Comparison by SNP quality measures for HapMap data  50
  3.3 Comparison by sample quality measures for MS-GWAS data  51
  3.4 Comparison by Minor Allele Frequency for HapMap data  55

4 Discussion  58

5 Conclusion  62

A Appendix 1  63
B Appendix 2  66
C Appendix 3  75
List of Figures

1.1 The structure of DNA with double helix  7
1.2 The Central Dogma of biology  9
1.3 Illustration of SNPs  10
1.4 Beads in beadArray technology  12
1.5 The 1mDuo and 1m BeadChip  15
1.6 Sample SNP raw signal and genotype call plots  16
2.1 Five stages of the normalisation process  27
2.2 Log intensities plot for all SNPs in given beadpool  29
2.3 Cluster plot of a given SNP on Illumina array for Illuminus method  35
2.4 (A) boxplot of intensities for allele B by strip; (B) smooth-scatter plot for a given array  38
2.5 An example network diagram for ANN  43
3.1 Accuracy versus drop rate plot from omni1exp12 chip (autosomes only)  45
3.2 Accuracy versus drop rate from 370kDuo and 1mDuo chips  47
3.3 Average no call rate versus number of samples for omni1exp12 chip  48
3.4 Concordance versus drop rate from the omni1exp12 chip (X chromosome)  49
3.5 Confidence score versus drop rate from the omni1exp12 chip (autosomal SNPs only)  51
3.6 Sample quality measures for MS-GWAS data  52
3.7 Sample quality measures for MS-GWAS data  53
3.8 Agreement between 20 lowest quality samples from four genotyping algorithms  54
3.9 Accuracy versus minor allele frequency from 610Quad training and test datasets  56
3.10 Accuracy versus minor allele frequency with fewer samples (Illuminus only)  57
Abstract
Several statistical computing algorithms, involving the EM algorithm, Bayesian modelling and classification, are reviewed in this thesis with regard to their application to genotype calling of single nucleotide polymorphisms using Illumina's Infinium SNP BeadChips. These algorithms are GenoSNP, Illuminus, CRLMM and GenCall. We describe the different statistical models underlying the four algorithms. In addition, we use both HapMap data and association study data to evaluate the performance of the algorithms.

Keywords: Genotyping, Illumina, HapMap, MS-GWAS, Microarray data, EM algorithm, Artificial Neural Networks, SNP
Acknowledgement
I would like to thank my two supervisors: Guoqi and Matthew. In particular to
Guoqi for his continuous patience and guidance over the years of my undergraduate
and honours studies in Statistics. Many thanks to Matthew for his enthusiasm,
patience and support for my thesis topic. Without their help, this thesis would not
have been completed. I would also like to express my gratitude towards everyone
in the Smyth lab at WEHI for their feedback and help. Most of all I would like to
thank my parents for their never-ending support and Peter for his encouragement
throughout this year.
Chapter 1
Introduction to Genetics
A brief review of genetics relevant to microarray genotyping is provided in this chapter. It includes an overview of human genetics, Hardy-Weinberg equilibrium and SNP genotyping using microarrays.
1.1 Human Genetics

Human genetics concerns itself with a particularly interesting organism: human beings. One of the most important reasons for interest in human genetics is health. Researchers want to know how human genetics relates to health, and to learn about the genetic contribution to diseases such as infections, high blood pressure and heart attacks. The study of genetics allows us to explore genetically inherited disease. Several technologies have been developed to study genetics; the fundamental ones include DNA sequencing, BeadArray and VeraCode technology.

The Human Genome Project is an international research project which began formally in 1990. To achieve the project's goals, researchers identified and mapped approximately 20,000-25,000 human genes, and determined the DNA sequence of around three billion base pairs. The project was completed in 2003.
1. DNA
Every cell in the human body except the red blood cells has a nucleus, which contains that individual's genetic information in chromosomes. The genetic information is required for the development and functioning of the entire organism. Chromosomes are composed of deoxyribonucleic acid (DNA) and proteins. DNA is the carrier of the genetic information, whereas the protein components provide different functions (Ziegler and Konig, 2006). DNA is a long polymer, with each molecule consisting of two strands coiled around each other to form a double helix, a structure like a spiral ladder. Each strand has a backbone of alternating sugar and phosphate groups, and attached to the backbone is a sequence of bases. The four types of bases found in DNA are classified into two groups: adenine (A) and guanine (G) are purines, while cytosine (C) and thymine (T) are pyrimidines (Ziegler and Konig, 2006). Thymine on one strand always pairs with the opposing adenine, connected by two hydrogen bonds, while cytosine always pairs with the opposing guanine on the other strand, connected by three hydrogen bonds; these pairs constitute the rungs of the double helix ladder. Shown in Figure 1.1 are double-stranded DNA and the DNA double helix. Each bonded pair in DNA is known as a base pair (bp).
Source of figure: http://jm3436.k12.sd.us/Event/intro.htm
Figure 1.1: The structure of DNA with double helix
Chromosome. The complete genetic information of a human is distributed over 23 pairs of chromosomes, comprising 22 pairs of autosomes and one pair of sex chromosomes (XX for females, XY for males). The autosomes have different lengths of DNA sequence and are numbered from the longest chromosome (number 1) to the shortest (number 22). Males and females both have 22 pairs of autosomes; the only difference is in the sex chromosomes.
2. RNA and Proteins

Another type of nucleic acid is called RNA. Both DNA and RNA are nucleic acids, but unlike DNA, RNA consists of only a single-stranded molecule with a much shorter chain of nucleotides. RNA contains the base uracil instead of thymine.

A gene is a segment of DNA which codes for a protein or an RNA chain. Genes can vary in length from 75 to over 2.3 million base pairs. A set of three base pairs is known as a codon. Amino acids are the building blocks of proteins, and each codon corresponds to one amino acid; twenty different amino acids occur in humans. Proteins are organic compounds made up of long chains of amino acids. They consist of polypeptides, which are linear sequences of repeating amino acid units.
Source of figure:
http://www.phschool.com/science/biologyplace/biocoach/images/transcription/centdog.gif
Figure 1.2: The Central Dogma of biology
3. Gene expression

Most functions in the human organism are carried out by proteins, so it is important to know how a base sequence translates into a protein structure. Figure 1.2 shows the central dogma, the process of forming a protein from DNA. Gene expression can be partitioned into two steps. The first step is the transcription of DNA into messenger RNA (mRNA): the DNA double helix is unzipped into two separate strands, the noncoding (template) strand and the coding (non-template) strand, and the RNA molecule is synthesised by complementary base pairing of nucleotides to the template strand. As a result, the RNA has the same direction and base sequence as the coding strand. After transcription, the RNA is further edited for the process of translation, in which the protein is synthesised: the mRNA produced by transcription is decoded to produce an amino acid chain, which determines the type of protein formed.
1.2 Single Nucleotide Polymorphism

Only about 0.1% of the DNA sequence differs between individuals (Ziegler and Konig, 2006). Among this 0.1% difference, over 80% are single nucleotide polymorphisms, known as SNPs. A SNP is a single-base substitution of one nucleotide for another; it represents a DNA sequence variation that occurs when a single nucleotide in the genome sequence is altered (Giannoulatou et al, 2007). SNPs can have a major impact on how humans respond to disease. SNPs are also stable, and hence biomedical researchers have been studying them extensively for developing new medical diagnostics. A SNP is bi-allelic: although there are four possible nucleotides, only two alleles are observed at a given SNP, as shown in Figure 1.3.
Source of figure: http://www.mygenetree.com/images/img-snps.jpg
Figure 1.3: Illustration of SNPs
Suppose, for example, that at some position in a DNA sequence one individual carries the strand AGCCGGT where another carries GGCCGGT. Then the polymorphism is A/G. The possible combinations are denoted AA, AB and BB, which are called genotypes. If the polymorphism is A/A or T/T, the genotype is defined as AA; a polymorphism with C/C or G/G is defined as genotype BB; the remaining combinations of the four possible alleles are defined as AB. For a given sample, a SNP with genotype AA or BB is called homozygous, while a SNP with genotype AB is called heterozygous.
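The A/B labelling just described can be sketched as a small helper function. This is purely illustrative: the mapping (A and T to allele A, C and G to allele B) follows the convention described above rather than any particular software package, and all names are our own.

```python
# Illustrative genotype labelling: nucleotides A and T map to allele A,
# C and G map to allele B, following the convention described in the text.
ALLELE = {"A": "A", "T": "A", "C": "B", "G": "B"}

def genotype(base1, base2):
    """Return the A/B genotype label (AA, AB or BB) for a base pair."""
    return "".join(sorted(ALLELE[base1] + ALLELE[base2]))

genotype("A", "A")  # homozygous AA
genotype("A", "G")  # heterozygous AB
genotype("C", "C")  # homozygous BB
```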
1.3 Microarray Technology

DNA microarray technology is a multiplex technology used in molecular biology research which enables researchers to investigate questions that were once thought to be intractable. Using traditional methods it is not possible to analyse large numbers of genes in a short time (Brown et al, 1999). Microarray technology makes it possible to measure the expression levels of many genes in a single reaction quickly and efficiently. The technology relies on hybridisation, which is used to identify a particular genetic sequence when the complementary sequence is available; this complementary sequence is known as a probe. Thousands of spotted probes lie on a microarray chip made of glass, and the amount of sample material (mRNA, measured via cDNA) bound to each spot is quantified. The spots can be cDNA probes or oligonucleotide probes. We will consider only gDNA microarrays, as we do not use cDNA data for our comparison.
Illumina has created a microarray technology called BeadArray technology, based on randomly arranged 3-micron silica beads. Figure 1.4 shows what the beads look like. An oligonucleotide sequence is assigned to each bead, and each bead carries a large number of copies of its oligonucleotide sequence (Gunderson et al., 2004). BeadArray technology is used for a variety of DNA and RNA analysis applications, including SNP genotyping. It is deployed on two multi-sample array formats: arrays are processed in parallel as a BeadChip or as a Sentrix Array Matrix. The BeadChip format enables simultaneous processing of two samples at a time, and experimental variability decreases as sample throughput increases. Each BeadChip comprises a series of strips, and each strip contains over 20,000 bead types; for example, each 1mDuo BeadChip has six pairs of strips. The BeadChip normally produces two channels of data.
Source of figure:
http://www.nature.com/nmeth/journal/v2/n12/images/nmeth1205-989-I2.jpg
Figure 1.4: Beads in beadArray technology
1.4 Hardy-Weinberg Equilibrium

Hardy-Weinberg equilibrium refers to the principle that the relative frequencies of both alleles and genotypes remain constant from generation to generation in a large, ideal population. Such a population must satisfy certain conditions for Hardy-Weinberg equilibrium to hold. The conditions include random mating, no mutation (the alleles do not change), no immigration or emigration, infinitely large population size, and no selective pressure for or against any trait (Salanti et al, 2005). As these conditions are hardly ever all satisfied in reality, exact Hardy-Weinberg equilibrium is almost never attained, but it can serve as an approximation in many cases or be used to test for violation of any of the conditions. The principle is named after Godfrey Hardy and Wilhelm Weinberg, who discovered it independently in the early twentieth century. To illustrate Hardy-Weinberg equilibrium, suppose all
the conditions required hold and consider two different alleles A and B from a gene
locus. Now take a population of 100 beans of certain species, with 81 of the beans
having (red,red) allele pairs (R/R) , 18 having (red,green) or (green red) allele pairs
(R/G), and 1 having (green, green) allele pair (G/G). The beans are thoroughly
mixed into a closed environment so that random mating will occur to produce the
next generation. There are 6 possible matings in terms of combinations of allele pairs: R/R and R/R, R/G and R/R, G/G and R/R, R/G and R/G, R/G and G/G, or G/G and G/G. The probabilities of these 6 combinations are calculated in the following table:
Table 1.1: Example of Hardy-Weinberg equilibrium

Combination    Individual probabilities      Pair probability
R/R and R/R    (81/100) · (81/100)           6561/10000
R/G and R/R    (18/100) · (81/100) · 2       2916/10000
G/G and R/R    (1/100) · (81/100) · 2        162/10000
R/G and R/G    (18/100) · (18/100)           324/10000
R/G and G/G    (18/100) · (1/100) · 2        36/10000
G/G and G/G    (1/100) · (1/100)             1/10000
If we now regard the new generation of beans as having genotype AA, AB or BB, the frequencies of the genotypes will be

AA: 6561/10000 + 1458/10000 + 81/10000 = 8100/10000 = 0.81
AB: 1458/10000 + 162/10000 + 162/10000 + 18/10000 = 1800/10000 = 0.18
BB: 81/10000 + 18/10000 + 1/10000 = 100/10000 = 0.01
The frequencies of the three genotypes sum to 1. These frequencies can be fully generalised using gene symbols: let p = frequency of allele A and q = frequency of allele B. The Hardy-Weinberg equilibrium is then expressed mathematically by the equation

p² + 2pq + q² = 1.

It is important to note that p added to q always equals one. The next generation's genotypes will occur with

frequency of genotype AA = p²
frequency of genotype AB = 2pq
frequency of genotype BB = q²

The Hardy-Weinberg equilibrium for the X chromosome is more subtle than for the autosomes. This principle is applied in most genotyping algorithms: it is the basic concept behind the mixture-proportion settings in the mixture models, which we discuss in Chapter 2.
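The genotype-frequency calculation above is easy to sketch in code; the following few lines (an illustration only, with a function name of our own choosing) reproduce the bean example with p = 0.9:

```python
def hwe_genotype_freqs(p):
    """Expected Hardy-Weinberg genotype frequencies from allele
    frequencies p and q = 1 - p: AA = p^2, AB = 2pq, BB = q^2."""
    q = 1.0 - p
    return {"AA": p ** 2, "AB": 2 * p * q, "BB": q ** 2}

freqs = hwe_genotype_freqs(0.9)
# freqs is {AA: 0.81, AB: 0.18, BB: 0.01} (up to floating point),
# matching the offspring frequencies computed from Table 1.1,
# and the three frequencies always sum to 1.
```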
1.5 Genotyping

SNP genotyping is a technology by which genomes are sampled at specific locations to provide a measurement of genetic variation. Two variants of the same genetic locus are referred to as alleles. SNPs are the most common type of genetic variation in humans. The distribution of SNPs from HapMap, with large numbers of samples, can be assayed and genotyped using different SNP chips (Brown et al, 1999). There are now more than 10 types of chips available, which can be used to assay over 1 million SNPs per individual (sample). Two main companies provide SNP technology: Illumina and Affymetrix. We will focus on the comparison of chips from Illumina. Some genotyping algorithms, such as CRLMM, are created for both the Illumina and Affymetrix platforms; the genotyping algorithm for Illumina is simpler, as Illumina data do not have fragment-length effects. For each individual, a DNA sample is required to produce DNA of similar quality for genotyping. Each chip has a different number of SNP probes to assay genotypes. The genomic locations of these SNP markers are known from the HapMap project.
Source of figure:
http://www.servicexs.com/plaatjes/Illumina/rtColumnProdServDNA1M-Duo.gif
Figure 1.5: The 1mDuo and 1m BeadChip.
Illumina BeadChips have a number of strips, each containing many randomly replicated beads. Alleles A and B are discriminated using either a red or a green dye (Steemers et al, 2006). A scanning device scans the data strip by strip, and a summary of the intensities for each channel is reported in proprietary idat files (Ritchie et al, 2009). Single, Duo and Quad BeadChips are the three sample formats available for human genotyping. Figure 1.5 shows images of the 1m and 1mDuo BeadChips; these are two different types of chip with different numbers of strips and SNPs. The genotypes of the samples are called by genotyping algorithms using signals extracted from the idat files. Most genotyping algorithms use normalised intensities for allele A and allele B as their basic input data.
Figure 1.6: Sample SNP raw signal and genotype call plots.
Figure 1.6 shows a set of plots for a typical sample. Figure 1.6A is a smooth-scatter plot of the raw signal, which separates into three clusters; the raw intensities are non-normalised data. Figure 1.6B is a genotype call plot of the same sample: each SNP with genotype AA is coloured black, red points are SNPs with genotype AB, and green points are SNPs with genotype BB. SNPs with different genotypes are well separated into three clusters. Clustering is the basic idea behind most genotype calling algorithms, and more details are given in Chapter 2.
1.6 Datasets

Four genotyping algorithms will be reviewed in this thesis. They are then applied to various data sets: HapMap data from different chip types, and association study data. This section gives a short introduction to these two data sets.
1. HapMap data

The international HapMap project is a collaboration among researchers from six countries. The goal of the project is to develop a haplotype map of the human genome and to genotype well defined SNPs. Several million SNPs have now been genotyped in four populations: 90 samples including 30 trios of Caucasians with European ancestry living in Utah, USA; 90 samples including 30 trios of Yoruba people in Ibadan, Nigeria; 45 unrelated samples from Japan; and 45 unrelated samples from China. Illumina is a major producer of SNP chip technology, and current SNP chips allow the assay of over one million SNPs per sample. In this comparison, we use data generated at Illumina on HapMap samples. We use ten different chip types, each with a different number of SNPs. Table 1.2 gives details of the number of samples and SNPs for each chip type. In the data there are ten different chip types with a total of 1982 training samples and 250 test samples (International HapMap Consortium, 2007).
Table 1.2: Summary of the HapMap samples

Chip type     Training samples  Test samples  SNPs per sample
370kDuo       115               45            370,404
370kQuad      225               0             373,397
550k          112               0             561,466
610kQuad      225               31            620,901
650k          112               15            660,918
660kQuad      268               47            657,366
1m            118               12            1,072,820
1mDuo         269               45            1,199,187
omni1Quad     267               67            1,140,419
omni1exp12    270               0             733,202
2. Association study data

A genome-wide association study can be understood as a case-control study in statistics. The aim of a genome-wide association study (GWAS) is to examine the associations between genotype and phenotype. People are selected based on the presence or absence of a particular phenotype, often a disease state of interest. Each genetic variant has only a small effect on the overall phenotypic variation, which requires a large number of samples; most GWAS data sets therefore contain thousands of samples for detecting phenotypic variation. The other data set used in our comparison is from a recent genome-wide association study of multiple sclerosis. The samples in this data set were collected from Australia and New Zealand. The total number of samples is 1943, from 6 batches, and each sample has more than 300,000 observations obtained from Illumina's 370kDuo BeadChip platform. The table below shows the actual numbers of samples for the 6 batches, which broadly correspond to the centres where samples were recruited.
Table 1.3: Summary of the association study samples

Batch  Number of samples  Number of SNPs per sample
MS1    115                370,404
MS2    225                370,404
MS3    112                370,404
MS4    225                370,404
MS5    112                370,404
MS6    268                370,404
Chapter 2
Statistical Models Applied To
Genotype Calling Algorithms
In this chapter we give a short introduction to the EM algorithm, which is fundamental to three of the genotyping algorithms considered. We then discuss the statistical models underlying the four genotyping algorithms: GenoSNP, Illuminus, CRLMM and GenCall.
2.1 An Introduction of the EM Algorithm

The ideas behind the Expectation-Maximisation (EM) algorithm were developed across a number of statistical problems published up until the 1970s. The algorithm was named by Dempster, Laird and Rubin (1977), and it is now widely known as the EM algorithm (Dempster et al, 1977). The EM algorithm is an efficient iterative procedure for computing maximum likelihood estimates when direct maximisation of the observed-data log-likelihood is not feasible, due to the absence of some part of the data that would be expected in a more familiar and simpler data structure. Each iteration of the EM algorithm consists of two steps: the E-step and the M-step. The E-step, or expectation step, can be considered as filling in the missing data. Once the missing data are reconstructed, the parameters are estimated in the M-step, or maximisation step, which can be solved either analytically or numerically. Nowadays, the EM algorithm is a popular tool in statistical estimation with incomplete data, as well as with random effects and mixture estimation. The EM algorithm has been successfully applied to computing maximum likelihood estimates when the complete data contain variables which are never observed. The EM algorithm can also avoid wildly overshooting or undershooting the maximum of the likelihood along its current direction of search (Ziegler and Konig, 2006).
1. The derivation of the EM algorithm

Suppose X is a set of observed data and θ is the set of parameters of interest (Borman, 2004). We want to find the value of θ at which the observed-data likelihood function achieves its maximum. The first thing we need to do is write down the observed-data log-likelihood function, which is defined as

L(θ) = ln f(X|θ).
Since ln(·) is an increasing function, the value of θ maximising L(θ) will also maximise the observed-data likelihood function. Sometimes it is not easy to maximise L(θ) directly. Then it is possible to use the EM algorithm, which iteratively increases L(θ) until the maximum is achieved. Let θn be the estimate of θ at the nth iteration, and let Z be the latent data which we wish had been observed but were not. We assume Z to be discrete, because that is what is relevant to the genotyping algorithms. Now maximising L(θ) is equivalent to maximising the difference

ln f(X|θ) − ln f(X|θn),

where

f(X|θ) = Σz f(X|z, θ) f(z|θ).

Thus

L(θ) − L(θn) = ln( Σz f(X|z, θ) f(z|θ) ) − ln f(X|θn)
             = ln( Σz f(z|X, θn) · f(X|z, θ) f(z|θ) / f(z|X, θn) ) − ln f(X|θn).   (2.1.1)

By Jensen's inequality,

ln Σi λi xi ≥ Σi λi ln(xi)

for constants λi ≥ 0 satisfying Σi λi = 1. Applying Jensen's inequality to equation (2.1.1), the constants can be identified with the terms f(z|X, θn). Since f(z|X, θn) ≥ 0 and Σz f(z|X, θn) = 1, we can rewrite equation (2.1.1) with constants f(z|X, θn):

L(θ) − L(θn) ≥ Σz f(z|X, θn) ln( f(X|z, θ) f(z|θ) / f(z|X, θn) ) − ln f(X|θn)
             = Σz f(z|X, θn) ln( f(X|z, θ) f(z|θ) / [f(z|X, θn) f(X|θn)] )
             ≜ ∆(θ, θn).   (2.1.2)
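As a quick sanity check on the step above, Jensen's inequality for the (concave) logarithm can be verified numerically for any convex combination; the snippet below is an illustration only and plays no role in the derivation:

```python
import math
import random

# Numeric check of ln(sum λ_i x_i) >= sum λ_i ln(x_i)
# for random positive x_i and random weights λ_i summing to 1.
random.seed(0)
xs = [random.uniform(0.1, 10.0) for _ in range(5)]
lam = [random.random() for _ in range(5)]
s = sum(lam)
lam = [w / s for w in lam]  # normalise so the weights sum to 1

lhs = math.log(sum(w * x for w, x in zip(lam, xs)))
rhs = sum(w * math.log(x) for w, x in zip(lam, xs))
assert lhs >= rhs  # Jensen's inequality holds
```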
We continue by writing

L(θ) ≥ L(θn) + ∆(θ, θn).

If we set

l(θ|θn) ≜ L(θn) + ∆(θ, θn),

the relationship in equation (2.1.2) can be written as

L(θ) ≥ l(θ|θn).   (2.1.3)

The function l(θ|θn) is bounded above by the likelihood function L(θ). The objective of the EM algorithm is to choose θn+1 as the value of θ at which l(θ|θn) is maximised. We can observe that

l(θn|θn) = L(θn) + ∆(θn, θn)
         = L(θn) + Σz f(z|X, θn) ln( f(X|z, θn) f(z|θn) / [f(z|X, θn) f(X|θn)] )
         = L(θn) + Σz f(z|X, θn) ln( f(X, z|θn) / f(X, z|θn) )
         = L(θn) + Σz f(z|X, θn) ln 1
         = L(θn).   (2.1.4)

That is, the value of the function l(θ|θn) equals L(θn) at the current estimate θ = θn. Combining equations (2.1.3) and (2.1.4), we can conclude that

L(θn+1) ≥ l(θn+1|θn) ≥ l(θn|θn) = L(θn).

As a result, any value of θ which increases l(θ|θn) compared with l(θn|θn) will also increase L(θ) compared with L(θn). Finding the θn+1 which maximises l(θ|θn) can be realised as follows:

θn+1 = arg maxθ l(θ|θn)
     = arg maxθ { L(θn) + ∆(θ, θn) }
     = arg maxθ { ln f(X|θn) + Σz f(z|X, θn) ln( f(X|z, θ) f(z|θ) / [f(z|X, θn) f(X|θn)] ) }
     = arg maxθ { Σz f(z|X, θn) ln( f(X|z, θ) f(z|θ) ) }
     = arg maxθ { Σz f(z|X, θn) ln f(X, z|θ) }
     = arg maxθ Ez|X,θn [ ln f(X, z|θ) ],

where in the fourth line the terms constant in θ have been dropped.
2. The general procedure of the EM algorithm

The general procedure of the EM algorithm for finding the maximum likelihood estimate of θ is detailed below.

(a) Initialise θ0 according to prior knowledge of what the optimal parameter value should be, or by randomly selecting a value.

(b) Given the current iterate θn, where n = 1, 2, ...:
E-step: compute the conditional expectation Q(θ, θn) = Ez|X,θn ln f(X, z|θ).
M-step: maximise Q(θ, θn) over θ to obtain θn+1, so that Q(θn+1, θn) ≥ Q(θ, θn) for all θ.

The E-step and M-step are alternately repeated until the difference |L(θn+1) − L(θn)| is smaller than a prescribed tolerance, say 10⁻⁶.
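As a concrete sketch of this procedure, the following minimal implementation (our own illustration, not code from any of the genotyping packages) alternates the E- and M-steps for a four-component normal mixture with a common variance, the model treated in the next example; all function and variable names are ours:

```python
import numpy as np

def em_four_normal(x, n_iter=500, tol=1e-8):
    """EM for a four-component normal mixture with common variance,
    using the closed-form M-step updates for p_k, mu_k and sigma."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # step (a): initialise -- means at spread-out quantiles of the data,
    # equal mixture weights, pooled standard deviation
    mu = np.quantile(x, [0.125, 0.375, 0.625, 0.875])
    p = np.full(4, 0.25)
    sigma = x.std()
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities z_ik = P(z_ik = 1 | x_i, theta)
        dens = p * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) \
               / (sigma * np.sqrt(2 * np.pi))
        z = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates
        nk = z.sum(axis=0)
        p = nk / n                               # p_k update
        mu = (z * x[:, None]).sum(axis=0) / nk   # mu_k update
        sigma = np.sqrt((z * (x[:, None] - mu) ** 2).sum() / n)  # sigma update
        # stop once the observed-data log-likelihood has converged
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return p, mu, sigma
```

On data simulated from four well-separated components, the recovered means and mixing proportions closely match the generating values.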
3. An example of a four-component normal mixture model using the EM algorithm
The EM algorithm is the basic algorithm applied in the statistical models of three of the genotyping methods (GenoSNP, Illuminus and CRLMM). In this section, we illustrate the EM algorithm for a four-component mixture model. Suppose we have a set of independently observed data xn = (x1, ..., xn) from a four-component normal mixture model (Nityasuddhi and Bohning, 2003). Then each xi has the probability density function

f(xi|θ) = p1/(σ√2π) exp{−(xi − µ1)²/(2σ²)} + p2/(σ√2π) exp{−(xi − µ2)²/(2σ²)}
        + p3/(σ√2π) exp{−(xi − µ3)²/(2σ²)} + p4/(σ√2π) exp{−(xi − µ4)²/(2σ²)},

where θ = {(pk, µk), k = 1, ..., 4; σ} is the set of unknown parameters, with Σk pk = 1 and σ > 0 (here and below, sums over i run over i = 1, ..., n and sums over k or j run over 1, ..., 4). In order to find the maximum likelihood estimate of θ, we introduce a new unobserved indicator variable Zi = (zi1, zi2, zi3, zi4), with zik ∈ {0, 1} and Σk zik = 1. Conditionally on zik = 1, Xi is distributed as N(µk, σ²). We first work out the log-likelihood function for the observed data:

ln L(θ|xn) = ln Πi f(xi|θ) = Σi ln f(xi|θ)
= Σi ln[ p1/(σ√2π) exp{−(xi − µ1)²/(2σ²)} + p2/(σ√2π) exp{−(xi − µ2)²/(2σ²)}
       + p3/(σ√2π) exp{−(xi − µ3)²/(2σ²)} + p4/(σ√2π) exp{−(xi − µ4)²/(2σ²)} ].
Now, adding the missing variable Zi into the log-likelihood function to make the data complete, the new log-likelihood function ln L(θ|xn, Zn) will be

ln L(θ|xn, Zn) = ln Πi g(xi, zi|θ) = Σi ln g(xi, zi|θ) = Σi ln{ f(xi|zi, θ) f(zi) }
= Σi ln( [p1/(σ√2π) exp{−(xi − µ1)²/(2σ²)}]^zi1 [p2/(σ√2π) exp{−(xi − µ2)²/(2σ²)}]^zi2
         [p3/(σ√2π) exp{−(xi − µ3)²/(2σ²)}]^zi3 [p4/(σ√2π) exp{−(xi − µ4)²/(2σ²)}]^zi4 )
= −n ln√(2π) − n ln σ + Σi [zi1 ln p1 + zi2 ln p2 + zi3 ln p3 + zi4 ln p4]
  − (1/(2σ²)) Σi [zi1(xi − µ1)² + zi2(xi − µ2)² + zi3(xi − µ3)² + zi4(xi − µ4)²].   (2.1.5)
The conditional pdf of Zi given Xi = xi and θ is p(Zi|xi, θ). Since Zi indicates which of the four components xi belongs to, we write down the conditional probability for the first component; the expressions for P(zik = 1|xi, θ), k = 2, 3, 4, are entirely similar:

P(zi1 = 1|xi, θ) = p1 exp{−(xi − µ1)²/(2σ²)} / Σj pj exp{−(xi − µj)²/(2σ²)},   (2.1.6)

and P(zi1 = 1|xi, θ) + P(zi2 = 1|xi, θ) + P(zi3 = 1|xi, θ) + P(zi4 = 1|xi, θ) = 1.
From the previous section, the E-step of the EM algorithm requires computing Q(θ, θ(j)), where θ(j) denotes the jth update of θ obtained from the EM algorithm. Using equation (2.1.5) and equation (2.1.6), and writing zik(j) = P(zik = 1|xi, θ(j)) for the conditional expectation of zik, we derive and simplify Q(θ, θ(j)) for this four-component normal mixture model. Augmenting it with a Lagrange-multiplier term for the constraint p1 + p2 + p3 + p4 = 1 gives

Q̃(θ, θ(j)) = E[ln L(θ|xn, Zn)|xn, θ(j)] + λ(p1 + p2 + p3 + p4 − 1)
= −n ln√(2π) − n ln σ + Σi [zi1(j) ln p1 + zi2(j) ln p2 + zi3(j) ln p3 + zi4(j) ln p4]
  − (1/(2σ²)) Σi [zi1(j)(xi − µ1)² + zi2(j)(xi − µ2)² + zi3(j)(xi − µ3)² + zi4(j)(xi − µ4)²]
  + λ(p1 + p2 + p3 + p4 − 1).
In order to work out the M-step of the EM algorithm, the (j+1)th update of θ is required. It is obtained by setting the partial derivatives with respect to p1, ..., p4, µ1, ..., µ4, σ and λ to zero and solving the resulting equations. The derivatives are

∂Q̃(θ, θ(j))/∂pk = (Σi zik(j))/pk + λ = 0,   k = 1, 2, 3, 4;
∂Q̃(θ, θ(j))/∂µk = (1/σ²) Σi zik(j) (xi − µk) = 0,   k = 1, 2, 3, 4;
∂Q̃(θ, θ(j))/∂σ = (1/σ³) Σi Σk zik(j) (xi − µk)² − n/σ = 0;
∂Q̃(θ, θ(j))/∂λ = p1 + p2 + p3 + p4 − 1 = 0.
Solving the above equations implements the M-step; the (j+1)th updates for (p1, p2, p3, p4, µ1, µ2, µ3, µ4, σ) are

pk(j+1) = (1/n) Σi zik(j),   k = 1, 2, 3, 4;
µk(j+1) = (Σi zik(j) xi) / (Σi zik(j)),   k = 1, 2, 3, 4;
σ(j+1) = √{ (1/n) Σi Σk zik(j) (xi − µk(j+1))² }.

2.2 Normalisation
Normalisation is a process for reducing variation between arrays that is of non-biological origin. Normalisation is almost always needed when analysing microarray data, and it is involved in most genotype calling algorithms. Illuminus, GenCall and CRLMM use normalised X and Y intensities as input; Illuminus uses the normalised X and Y produced by GenCall, so Illuminus and GenCall share the same normalisation method. Table 2.1 summarises the features of each algorithm, including its normalisation method, underlying model and supported computer operating systems.
Table 2.1: Summary of features of each algorithm

Algorithm    Normalisation             Model                   Computer operating system
GenoSNP      None                      within-sample           Linux/Windows
Illuminus    Affine transformation     between-sample          Linux
CRLMM        Quantile normalisation    between/within-sample   Linux/Windows/Mac
GenCall      Affine transformation     between-sample          Windows
The six-degrees-of-freedom affine transformation was given by Kermani in 2005
(Kermani, 2005). This normalisation algorithm has five steps; Figure 2.1 shows
plots of the data transformation resulting from the five stages of the
normalisation process.
Source of figure: Kermani (2005)
Figure 2.1: Five stages of the normalisation process.
Firstly, SNPs with allelic intensities smaller than the first percentile or larger
than the 99th percentile are defined as outliers and removed within each BeadPool;
SNPs with missing values are also removed. The second step is background estimation:
the scatter points closest to the X-axis sweep points are defined as candidate
homozygote A control points, and candidate homozygote B control points are defined
by a similar sweep along the Y-axis (Figure 2.1B). Two lines are fitted through the
A and B control points, and their intersection establishes the parameters for the
translation. All signal scatter points are translated using these parameters and
then rotated and sheared. The final step scales the mean via the control points to
determine the normalised intensities (Kermani, 2005). This normalisation occurs
automatically within the Illumina software, and both Illuminus and GenCall use the
same normalised data as input.
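The geometric steps (translate, rotate, shear, scale) can be illustrated with a toy two-dimensional affine correction. This is a schematic of the kind of transformation described, with made-up parameter values; Kermani's algorithm estimates these parameters from the homozygote control points:

```python
import math

def affine_normalise(points, dx, dy, theta, shear, sx, sy):
    """Apply the translate -> rotate -> shear -> scale sequence to raw
    (x, y) intensity pairs.  The parameters are assumed to have been
    estimated from the A/B homozygote control points."""
    out = []
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    for x, y in points:
        # translation: move the estimated background offset to the origin
        x, y = x - dx, y - dy
        # rotation by -theta so the A-homozygote axis lies along x
        x, y = cos_t * x + sin_t * y, -sin_t * x + cos_t * y
        # shear so the B-homozygote axis lies along y
        x = x - shear * y
        # scaling via the control-point means
        out.append((x / sx, y / sy))
    return out

pts = affine_normalise([(110.0, 60.0)], dx=10, dy=10, theta=0.0,
                       shear=0.0, sx=2.0, sy=2.0)
```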
CRLMM uses a different normalisation method, quantile normalisation (Bolstad et
al., 2003). An across-array normalisation is needed because the differences between
array intensity distributions are large across batches and labs; quantile
normalisation corrects these batch and lab effects. The raw X and Y intensities are
normalised between channels and samples by each strip, and the between-channel
normalisation removes any dye-bias effect (Ritchie et al., 2009). The CRLMM
algorithm is applied after normalisation.
2.3  GenoSNP
GenoSNP is a genotyping algorithm developed for Illumina's Infinium SNP
genotyping assay. The idea behind GenoSNP is to fit a within-sample model to the
data, without requiring a population of control samples or parameters derived from
such a population (Giannoulatou et al, 2008). The SNPs are separated by BeadPool,
and the original method quantile normalised the data for the SNPs in each BeadPool;
the latest version of GenoSNP, however, uses non-normalised data, since for the
larger arrays the quantile normalisation between alleles is unnecessary. GenoSNP is
the only method that fits a within-sample model and uses the non-normalised
intensities from GenCall. Figure 2.2 plots all SNPs in a given BeadPool for one
HapMap sample, with X and Y on the log2 intensity scale. The red points denote
genotype AA, green AB, blue BB and black no call. The three genotype clusters are
easily discernible, which shows that the data are well separated and why GenoSNP
has good genotyping performance.
Source of figure: Giannoulatou et al (2008)
Figure 2.2: Log intensities plot for all SNPs in given beadpool.
Two methods of posterior inference for the model are examined in the paper: the
first is based on the standard Expectation-Maximisation (EM) algorithm and the
second on a Variational Bayesian EM algorithm.
1. First approach: the statistical model behind GenoSNP using the standard EM algorithm

Let x_i' denote the pair of log2-scale intensities {log2(x_i + 1), log2(y_i + 1)}
for the i-th SNP. A four-component Student-t mixture model is formed, with mixture
proportions π, an unobserved indicator variable z_i for the latent genotype class,
a latent scale variable µ_i, and parameters θ = {π, µ, λ}. The component mixture
can be described in hierarchical form (Giannoulatou et al, 2008):
\[
f(z_i \mid \theta) = \prod_{m=1}^{4} \pi_m^{I(z_i = m)} \tag{2.3.7}
\]
\[
f(\mu_i \mid z_i, \theta) = \prod_{m=1}^{4} G(\mu_i \mid v_m/2, v_m/2)^{I(z_i = m)} \tag{2.3.8}
\]
\[
f(x_i \mid \mu_i, z_i, \theta) = \prod_{m=1}^{4} N(x_i \mid \mu_m, \mu_i \lambda_m)^{I(z_i = m)} \tag{2.3.9}
\]
where N denotes the bivariate normal distribution and G the Gamma distribution.
Before going through the steps of the standard EM algorithm, it is necessary to
explain this four-component Student-t mixture model in more detail. As is well
known, the Student-t distribution has a complicated formula which makes the EM
computations with four components troublesome. From the previous equations, the
posterior with four components is proportional to
\[
f(x_i, \mu_i, z_i, \theta) = f(x_i \mid \mu_i, z_i, \theta)\, f(\mu_i \mid z_i, \theta)\, f(z_i \mid \theta)\, f(\theta) \tag{2.3.10}
\]
where f(θ) is the prior for {π, µ, λ}. The GenoSNP authors made some assumptions
to simplify the model, introducing four distributions: the Dirichlet, Gamma,
bivariate normal and Wishart distributions. A Dirichlet prior is placed on the
mixture weights: suppose the mixture proportions π = {π_1, π_2, π_3, π_4} have
distribution π ~ D(κ_1, κ_2, κ_3, κ_4), with probability density function
\[
f(\pi \mid \kappa) = \frac{\Gamma(\kappa_1 + \kappa_2 + \kappa_3 + \kappa_4)}{\Gamma(\kappa_1)\,\Gamma(\kappa_2)\,\Gamma(\kappa_3)\,\Gamma(\kappa_4)}\; \pi_1^{\kappa_1 - 1} \pi_2^{\kappa_2 - 1} \pi_3^{\kappa_3 - 1} \pi_4^{\kappa_4 - 1} \tag{2.3.11}
\]
where
\[
\pi_1 + \pi_2 + \pi_3 + \pi_4 = 1; \quad 0 < \pi_1, \pi_2, \pi_3, \pi_4 < 1; \quad \kappa = (\kappa_1, \kappa_2, \kappa_3, \kappa_4).
\]
By the definition of the Gamma distribution, u ~ Gamma(θ_0, α) where
θ_0 = α = v_m/2, so the prior conditional probability density function for µ_i is
\[
G(\mu_i \mid v_m/2, v_m/2) = \frac{(v_m/2)^{v_m/2}}{\Gamma(v_m/2)}\, \mu_i^{(v_m/2) - 1} \exp\{- v_m \mu_i / 2\}, \qquad \mu_i > 0 \tag{2.3.12}
\]
and the Student-t density arises as the scale mixture
\[
S(x; \mu, \lambda, v) = \int_0^{\infty} N(x \mid \mu, u\lambda)\, G(u \mid v/2, v/2)\, du. \tag{2.3.13}
\]
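Equation (2.3.13) can be checked numerically in the univariate case: integrating the normal density against the Gamma mixing density reproduces the closed-form Student-t density. The sketch below treats λ as a scalar squared scale with variance λ/u (in GenoSNP λ_m is a matrix) and uses a simple Riemann quadrature:

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gamma_pdf(u, a, b):
    # Gamma density with shape a and rate b
    return b ** a * u ** (a - 1) * math.exp(-b * u) / math.gamma(a)

def t_by_mixture(x, mu, lam, v, n_grid=20000, u_max=60.0):
    """Numerically integrate N(x | mu, lam/u) G(u | v/2, v/2) du over u."""
    h = u_max / n_grid
    total = 0.0
    for i in range(1, n_grid + 1):
        u = i * h
        total += normal_pdf(x, mu, lam / u) * gamma_pdf(u, v / 2, v / 2)
    return total * h

def t_pdf(x, mu, lam, v):
    """Closed-form Student-t density, location mu, squared scale lam, dof v."""
    c = math.gamma((v + 1) / 2) / (math.gamma(v / 2) * math.sqrt(v * math.pi * lam))
    return c * (1 + (x - mu) ** 2 / (v * lam)) ** (-(v + 1) / 2)

approx = t_by_mixture(1.3, 0.0, 1.0, 4.0)
exact = t_pdf(1.3, 0.0, 1.0, 4.0)
```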
A Wishart prior is also created for the scale, λ_m ~ W(γ, S_m), m = 1, 2, 3, 4.
By the definition of the Wishart distribution, the conditional probability density
of λ_m is
\[
W(\lambda_m \mid \gamma, S_m) = \frac{\exp\{-\tfrac{1}{2} \operatorname{tr}(S_m^{-1} \lambda_m)\}\; |\lambda_m|^{(r-3)/2}}{2^r \sqrt{2\pi}\; \Gamma(\tfrac{r}{2})\, \Gamma(\tfrac{r-1}{2})\; |S_m|^{r/2}}, \qquad r \ge 2 \tag{2.3.14}
\]
where S_m is a 2 × 2 positive definite matrix which can be calculated from
equation (2.3.13), and the scale parameter λ_m is a 2 × 2 positive definite random
matrix. The bivariate normal probability density functions are
\[
N(\mu_m \mid m_0, \eta_0 \lambda_m) = \frac{1}{2\pi} |\eta_0 \lambda_m|^{-1/2} \exp\Big\{ -\frac{1}{2} (\mu_m - m_0)^t (\eta_0 \lambda_m)^{-1} (\mu_m - m_0) \Big\} \tag{2.3.15}
\]
\[
N(x_i \mid \mu_m, \mu_i \lambda_m) = \frac{1}{2\pi} |\mu_i \lambda_m|^{-1/2} \exp\Big\{ -\frac{1}{2} (x_i - \mu_m)^t (\mu_i \lambda_m)^{-1} (x_i - \mu_m) \Big\} \tag{2.3.16}
\]
To define the scale parameters for each genotype component, a normal-Wishart
prior is created as the combination of the Wishart distribution (2.3.14) and the
normal distribution (2.3.15):
\[
p(\mu_m, \lambda_m) = N(\mu_m \mid m_0, \eta_0 \lambda_m)\, W(\lambda_m \mid \gamma, S_m).
\]
Overall, θ is the set of parameters π, µ and λ. Because the Student-t distribution
has a complicated form that does not simplify the calculation, the joint density is
instead written out through the factorisation in equation (2.3.10). The probability
density function for this four-component mixture model can then be defined as
\[
\begin{aligned}
f(z_i, \mu_i, x_i, \theta) &= f(x_i \mid \mu_i, z_i, \theta) \cdot f(\mu_i \mid z_i, \theta) \cdot f(z_i \mid \theta) \cdot f(\theta) \\
&= f(x_i \mid \mu_i, z_i, \theta) \cdot f(\mu_i \mid z_i, \theta) \cdot f(z_i \mid \theta) \cdot f(\pi \mid \kappa) \cdot f(\mu_m, \lambda_m) \\
&= f(x_i \mid \mu_i, z_i, \theta) \cdot f(\mu_i \mid z_i, \theta) \cdot f(z_i \mid \theta) \cdot f((\pi_1, \pi_2, \pi_3, \pi_4) \mid \kappa) \\
&\quad \times \frac{1}{2\pi} |\eta_0 \lambda_m|^{-1/2} \exp\Big\{ -\frac{1}{2} (\mu_m - m_0)^t (\eta_0 \lambda_m)^{-1} (\mu_m - m_0) \Big\} \\
&\quad \times \frac{\exp\big\{ -\tfrac{1}{2} \operatorname{tr}\big( \big( \int_0^{\infty} N(x \mid \mu, u\lambda_m) G(u \mid v/2, v/2)\, du \big)^{-1} \lambda_m \big) \big\}\; |\lambda_m|^{(r-3)/2}}{2^r \sqrt{2\pi}\; \Gamma(\tfrac{r}{2})\, \Gamma(\tfrac{r-1}{2})\; \big| \int_0^{\infty} N(x \mid \mu, u\lambda_m) G(u \mid v/2, v/2)\, du \big|^{r/2}}
\end{aligned} \tag{2.3.17}
\]
where λ = {λ_1, λ_2, λ_3, λ_4}.
Firstly, the observed-data log-likelihood ln L(x_n | θ), without the latent data, is
\[
\ln L(x_n \mid \theta) = \ln \prod_{i=1}^{n} f(x_i \mid \theta) = \sum_{i=1}^{n} \ln \int f(x_i, \mu_i \mid \theta)\, d\mu_i.
\]
Now, including the latent data z_i, the complete-data log-likelihood
ln L(x_n | Z, θ) is
\[
\ln L(x_n \mid Z, \theta) = \ln \prod_{i=1}^{n} f(x_i \mid \mu_i, z_i, \theta)\, f(\mu_i, z_i \mid \theta) = \sum_{i=1}^{n} \ln \big\{ f(x_i \mid \mu_i, z_i, \theta)\, f(\mu_i, z_i \mid \theta) \big\}.
\]
Secondly, a posterior model is fitted and maximised by iteratively computing the
expectation over the latent variables. Applying the EM algorithm to obtain
θ^{(n+1)} requires computing Q(θ, θ^{(n)}) first; in this case
\[
Q(\theta, \theta^{(n)}) = E[\ln f(x_n, z_n, \mu_n \mid \theta) \mid x_n, \theta^{(n)}]
= \sum_{z} \int f(z_i, \mu_i \mid x_i, \theta^{(n)}) \ln f(z_i, \mu_i, x_i \mid \theta)\, d\mu.
\]
Hence θ^{(n+1)} is obtained by maximising Q(θ, θ^{(n)}):
\[
\theta^{(n+1)} = \arg\max_{\theta} Q(\theta, \theta^{(n)}) = \arg\max_{\theta} \Big\{ \sum_{z} \int f(z_i, \mu_i \mid x_i, \theta^{(n)}) \ln f(z_i, \mu_i, x_i \mid \theta)\, d\mu \Big\}.
\]
2. Second approach: the statistical model behind GenoSNP based on the Variational Bayesian EM algorithm

The second method of posterior inference used by GenoSNP is the Variational
Bayesian EM (VB-EM) algorithm, introduced by Beal and Ghahramani in 2003 (Beal and
Ghahramani, 2003). The method constructs and optimises a lower bound on the
marginal likelihood, generalising the standard EM algorithm through a variational
approximation to the posterior distributions. The main difference between the
standard EM and VB-EM algorithms is that VB-EM maintains posterior distributions
over both the latent variables and the parameters (Beal and Ghahramani, 2003).
Here µ, x, θ and z denote the same variables as before. Firstly, the marginal
likelihood of the set of observed data x is
\[
\begin{aligned}
\ln p(x) &= \ln \int p(x, z, \mu, \theta)\, dz\, d\mu\, d\theta \\
&= \ln \int q(x, z, \mu)\, \frac{p(x, z, \mu, \theta)}{q(x, z, \mu)}\, dz\, d\mu\, d\theta \\
&\ge \int q(x, z, \mu) \ln \frac{p(x, z, \mu, \theta)}{q(x, z, \mu)}\, dz\, d\mu\, d\theta \qquad \text{(by Jensen's inequality)}
\end{aligned}
\]
Assume a simple factorised approximation to the distribution,
q(x, z, µ) = q_{z,µ}(z, µ) q_θ(θ). The bound becomes an equality when
q(x, z, µ) = p(z, µ, θ | x), but this substitution does not simplify the problem,
since the true posterior p(z, µ, θ | x) requires knowing the normalising constant.
So ln p(x) becomes:
\[
\ln p(x) \ge \int q_{z,\mu}(z, \mu)\, q_\theta(\theta) \ln \frac{p(x, z, \mu, \theta)}{q_{z,\mu}(z, \mu)\, q_\theta(\theta)}\, dz\, d\mu\, d\theta = G(q_{z,\mu}(z, \mu), q_\theta(\theta))
\]
Here G is a functional of the distributions q_{z,µ}(z, µ) and q_θ(θ) (Beal and
Ghahramani, 2003). The main purpose of the variational Bayesian EM algorithm is to
maximise G with respect to q_{z,µ}(z, µ) and q_θ(θ) iteratively. The update
equations at iteration t are:
\[
q_{z,\mu}^{(t+1)}(z, \mu) \propto \exp\Big\{ \int \ln p(z, \mu, x \mid \theta)\, q_\theta^{(t)}(\theta)\, d\theta \Big\}
\]
\[
q_\theta^{(t+1)}(\theta) \propto \exp\Big\{ \int \ln p(z, \mu, x \mid \theta)\, q_{z,\mu}^{(t+1)}(z, \mu)\, dz\, d\mu \Big\}
\]
2.4  Illuminus
Illuminus is a model-based genotyping algorithm published in 2007 by Teo et al
(Teo et al, 2007). The dataset needs to be normalised before the algorithm is
applied; the normalisation is the five-step, six-degrees-of-freedom affine
transformation described above, which at present occurs automatically within the
Illumina BeadStudio software and outputs the normalised intensities. The model
behind the method is also based on the EM algorithm. A general input file is
constructed from the normalised intensities, with a separate input file for the X
chromosome containing gender information. A three-component mixture model is set
up and the EM algorithm is used to compute the genotype for each SNP iteratively.
The authors separate the data into two parts: general SNPs and SNPs on chromosome
X (Teo et al, 2007), with the X chromosome SNPs assigned genotypes separately.
Contrast and strength variables are constructed from the normalised signal
intensities from GenCall, denoted (x_ij, y_ij). The contrast c_ij and strength
s_ij are defined as
\[
c_{ij} = \frac{x_{ij} - y_{ij}}{x_{ij} + y_{ij}} \tag{2.4.18}
\]
\[
s_{ij} = \log(x_{ij} + y_{ij}) \tag{2.4.19}
\]
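Equations (2.4.18)-(2.4.19) are a simple reparameterisation of the intensities and translate directly into code (the intensity values below are invented for illustration):

```python
import math

def contrast_strength(x, y):
    """Contrast c = (x - y)/(x + y) and strength s = log(x + y) for one
    sample/SNP pair of normalised intensities."""
    c = (x - y) / (x + y)
    s = math.log(x + y)
    return c, s

c, s = contrast_strength(1500.0, 500.0)
```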
The distribution of X_ij = (c_ij, s_ij) is modelled using a three-component
mixture of multivariate truncated t distributions. The density function of X_ij is
written as
\[
F(X_{ij}) = \sum_{k=1}^{3} \lambda_k\, \phi_k(X_{ij}, \mu_k, \sigma_k, \upsilon_k)
\]
where the λ_k are mixture proportions calculated using Hardy-Weinberg equilibrium,
σ_k is the variance-covariance matrix and υ_k is the pre-determined degrees of
freedom. φ_k, k = 1, 2, 3, denote the density functions for the three genotypes
AA = 1, AB = 2 and BB = 3.
Source of figure: Teo et al (2007)
Figure 2.3: Cluster-plot of a given SNP on an Illumina array for the Illuminus method.
Figure 2.3 is a cluster-plot of a given SNP on the Illumina array. The grey points
represent the observed data and the black lines the kernel densities. The variance
profiles show that the distributions of the homozygous clusters are more peaked
than the heterozygous cluster, and the contrast intensities of the homozygote
samples differ from those of the heterozygote samples. This illustrates the reason
for setting υ_1 = υ_3 ≤ υ_2. Based on this, the densities for X_ij take the form
\[
\phi_1(X_{ij}, \mu_1, \sigma_1, \upsilon_1) = \frac{f(X_{ij}, \mu_1, \sigma_1, \upsilon_1)}{1 - \int_{1}^{\infty} f(X_{ij}, \mu_1, \sigma_1, \upsilon_1)\, dc_{ij}}
\]
\[
\phi_2(X_{ij}, \mu_2, \sigma_2, \upsilon_2) = \frac{f(X_{ij}, \mu_2, \sigma_2, \upsilon_2)}{\int_{-1}^{1} f(X_{ij}, \mu_2, \sigma_2, \upsilon_2)\, dc_{ij}}
\]
\[
\phi_3(X_{ij}, \mu_3, \sigma_3, \upsilon_3) = \frac{f(X_{ij}, \mu_3, \sigma_3, \upsilon_3)}{1 - \int_{-\infty}^{-1} f(X_{ij}, \mu_3, \sigma_3, \upsilon_3)\, dc_{ij}}
\]
A fourth component, corresponding to a null class introduced to capture outliers,
is modelled as a degenerate variable with zero covariance and variances large
enough that the density is flat across the whole range of possible values. For
each iteration of the Expectation-Maximisation procedure, the (m+1)-th update of θ
is constructed from the parameter updates
\[
\mu_k^{(m+1)} = \big( \mu_c^{(m+1)}, \mu_s^{(m+1)} \big) = \Big( \frac{1}{n_k} \sum_{i} c_{ij}^{(m)},\; \frac{1}{n_k} \sum_{i} s_{ij}^{(m)} \Big)
\]
\[
\sigma_k^{(m+1)} = \frac{1}{n_k - 1}
\begin{pmatrix}
\sum_i \big( c_{ij}^{(m)} - \mu_c^{(m+1)} \big)^2 & \sum_i \big( c_{ij}^{(m)} - \mu_c^{(m+1)} \big)\big( s_{ij}^{(m)} - \mu_s^{(m+1)} \big) \\
\sum_i \big( c_{ij}^{(m)} - \mu_c^{(m+1)} \big)\big( s_{ij}^{(m)} - \mu_s^{(m+1)} \big) & \sum_i \big( s_{ij}^{(m)} - \mu_s^{(m+1)} \big)^2
\end{pmatrix}
\]
for k = 1, 2, 3. For SNPs in the null class, µ = (0, 0) and
\[
\sigma = \begin{pmatrix} 10000 & 0 \\ 0 & 10000 \end{pmatrix}
\]
Chromosome X

For SNPs on chromosome X, the genotyping algorithm is modified using gender
information, noting that the genotype for males can never be heterozygous, since
males carry only one X chromosome. The genotyping procedure is therefore unchanged
for females, while for males it sets
\[
\phi_2(X_{ij}, \mu_2, \sigma_2, \upsilon_2 \mid \text{gender} = \text{male}) = 0.
\]
The calculation of the mixture proportions using Hardy-Weinberg equilibrium is
done by assuming that males contribute only one allele copy.
2.5  CRLMM
The Corrected Robust Linear Model with Maximum likelihood classification method,
known as CRLMM, is a genotyping algorithm originally developed for Affymetrix SNP
arrays and recently adapted to Illumina's Infinium BeadChips. CRLMM was developed
by Carvalho et al. in 2007 (Carvalho et al, 2007) and redeveloped by Lin et al. in
2008 (Lin et al, 2008); Ritchie et al. presented a crlmm package for Illumina
BeadChips in 2009 (Ritchie et al, 2009). In this section we only discuss the
statistical method of CRLMM for Illumina BeadChips. The idea is to fit a
three-component mixture model with cubic splines and to classify the three
genotypes using a two-stage hierarchical model. The response data are quantile
normalised before the genotyping algorithm is applied. The average intensity S is
defined below; since it is difficult to find cases where the sum of the
intensities provides useful discriminating information, the log-ratio M is also
created, with S serving mainly for quality control. For each array, letting
(x_A, x_B) denote the normalised intensities for alleles A and B, the log-ratio
and average intensity are
\[
M = \log_2(x_A) - \log_2(x_B), \qquad
S = \frac{\log_2(x_A) + \log_2(x_B)}{2}.
\]
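The log-ratio M and average intensity S are likewise immediate to compute (the intensity values below are invented for illustration):

```python
import math

def m_and_s(x_a, x_b):
    """CRLMM-style log-ratio M and average log2 intensity S for one SNP."""
    m = math.log2(x_a) - math.log2(x_b)
    s = (math.log2(x_a) + math.log2(x_b)) / 2
    return m, s

m, s = m_and_s(1024.0, 256.0)
```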
Exploratory plots show that M has strong discriminating power and that a sequence
effect is present in the log-ratios. Based on these observations, a
three-component mixture model is fitted for each sample:
\[
[M_i \mid Z_i = k] = f_k(S_i) + \varepsilon_{i,k}
\]
where Z_i denotes the unobserved genotype classification variable, k = 1, 2, 3,
f_k describes a cubic spline with five degrees of freedom in the average intensity
S_i for genotype k, and ε_{i,k} is the error term, assumed to be a normal random
variable with zero mean and variance τ_k². Thus [M_i | Z_i = k] is normally
distributed with mean µ_k and variance τ_k².
[Figure: (A) boxplots of log2 allele B intensities by strip (strips 1.1-10.2); (B) smooth-scatter plot of M versus S for a given array.]

Source of figure: Ritchie et al (2009)

Figure 2.4: (A) is a boxplot of intensities for allele B by strip. (B) is a smooth-scatter plot for a given array.
Figure 2.4A is a box-plot of allele B intensities by strip. The log2-scale
intensity increases as the strip number increases (strips can be considered as
rows); the intensities were therefore quantile normalised by strip. Figure 2.4B is
a smooth-scatter plot for a given array, with M the log-ratio and S the average
intensity as defined above. The figure shows that the SNP intensities separate
sufficiently well into three clusters. The red lines represent the smoothing
splines for the three clusters. Most SNPs in the upper and lower clusters have
genotypes AA and BB respectively, while SNPs in the middle cluster have genotype
AB. S appears to have an effect on M, described by a smoothing spline, only for
the AA and BB genotype SNPs; the two homozygote spline functions have the same
form with opposite signs, and the mean function for genotype AB can be set to zero
since S shows no effect on M there. As a result, the conditional distributions for
the three genotypes are
\[
[M_i \mid Z_i = 1] \stackrel{d}{=} N(\mu_1, \tau_1^2), \qquad
[M_i \mid Z_i = 2] \stackrel{d}{=} N(0, \tau_2^2), \qquad
[M_i \mid Z_i = 3] \stackrel{d}{=} N(-\mu_1, \tau_1^2)
\]
Here µ_1 can be interpreted as a combination of mean levels for each SNP. Since
f_k is a spline function, for sample j it is assumed that f_{j,2} = 0 and
f_{j,1} = −f_{j,3}. The model
\[
f_1(S_i, b_i) = \mu_{b_i} + f_s(S_i)
\]
where f_s is a cubic spline with five degrees of freedom and µ_{b_i} is a mean
level for each SNP, with b_i denoting one of the six SNP base pairs AC, AG, AT,
CG, CT, GT, has sixteen parameters and is fitted using the EM algorithm. The
estimator π_{i,j,k} denotes the probability that sample j has genotype k at SNP i,
which acts as a weight in the EM algorithm.
We should note that CRLMM assumes a three-component mixture model, implying it has
no no-call class: the method calls genotypes for all SNPs. A supervised learning
approach to genotype calling is used (Carvalho et al., 2007), with the HapMap data
as training data. However, the HapMap calls are not available for all SNPs, so a
two-level hierarchical model is used to solve this problem.
Suppose Z_{i,j} = k denotes genotype k for SNP i on sample j. The first level of
the two-level hierarchical model describes the variation seen across samples in
the location of the genotype regions:
\[
[M_{i,j} \mid Z_{i,j} = k, m_{i,k}] = f_{j,k}(X_{i,j}) + m_{i,k} + \varepsilon_{i,j,k}
\]
where X_{i,j} represents covariates known to cause bias for SNP i on sample j, and
m_{i,k} is the SNP-specific shift from the typical genotype region centres. The
vector of genotype region centres, m_i = (m_{i,1}, m_{i,2}, m_{i,3}), defines the
first level of the model. The second level describes the variation seen across
samples within each SNP. Let σ²_{i,k} be the SNP-specific variance for SNP i with
genotype k. If there is not enough data available, an inverse-χ² prior is used to
improve the estimates:
\[
\frac{1}{\sigma_{i,k}^2} \propto \frac{1}{d_{0,k}\, s_{0,k}^2}\, \chi^2_{d_{0,k}}
\]
where d_{0,k} is the degrees of freedom of the χ² distribution and s²_{0,k}
denotes the variance of a typical SNP.
The training data.

Because of the large number of SNPs, the estimate of the effect function f on
sample j is treated as known. The main idea of this supervised learning approach
is to treat the HapMap genotype calls as known and use this information to obtain
maximum likelihood estimates of m_{i,k} and the variance σ²_{i,k}; these estimates
are then updated by the posterior means derived from the model (Carvalho et al,
2007). With the genotypes Z_{i,j} for SNP i on sample j and the effect function f
known, the maximum likelihood estimates of m_{i,k} and σ²_{i,k} are
\[
m_{i,k} = N_{i,k}^{-1} \sum_j \big( M_{i,j} - f_{j,k}(X_{i,j}) \big), \qquad
\sigma_{i,k}^2 = N_{i,k}^{-1} \sum_j \big( M_{i,j} - f_{j,k}(X_{i,j}) - m_{i,k} \big)^2.
\]
Here N_{i,k} represents the number of samples with genotype k on SNP i. Some cases
do not have enough available data, in which case m_{i,k} and σ²_{i,k} are not
accurate enough to serve as estimates of the region centre and scale; shrinkage of
the estimates is a good solution in this situation. Letting V be the
variance-covariance matrix of m, the shrinkage step is defined as follows:
\[
m_i = \big( V^{-1} + N_i \Sigma^{-1} \big)^{-1} N_i \Sigma^{-1} m_i \tag{2.5.20}
\]
\[
\sigma_{i,k}^2 = \frac{(N_{i,k} - 1)\, \sigma_{i,k}^2 + d_{0,k}\, s_{0,k}^2}{(N_{i,k} - 1) + d_{0,k}}, \qquad N_{i,k} > 1 \tag{2.5.21}
\]
In equations (2.5.20) and (2.5.21), m_i is the vector of three sample means
(m_{i,1}, m_{i,2}, m_{i,3}), Σ is a 3 × 3 matrix with Σ_{k,k} = s²_{0,k}, and N_i
is a 3 × 3 matrix with entries (N_{i,1}, N_{i,2}, N_{i,3}). These estimates m_i
and σ²_{i,k} are stored and used to call genotypes in other datasets.
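The variance shrinkage in equation (2.5.21) is a weighted average of the SNP-specific estimate and the prior "typical SNP" variance; a sketch with invented values shows how the weight shifts with N_{i,k}:

```python
def shrink_variance(var_ml, n_k, d0, s0_sq):
    """Shrink a SNP-specific variance towards the prior typical-SNP
    variance s0_sq, as in equation (2.5.21)."""
    return ((n_k - 1) * var_ml + d0 * s0_sq) / ((n_k - 1) + d0)

# few samples -> estimate pulled strongly towards the prior (0.04)
v_small = shrink_variance(var_ml=0.09, n_k=3, d0=10, s0_sq=0.04)
# many samples -> estimate dominated by the data (0.09)
v_large = shrink_variance(var_ml=0.09, n_k=1001, d0=10, s0_sq=0.04)
```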
To call genotypes for observed log-ratios, the M values are estimated from the
HapMap data and the following likelihood-based distance function is formed for
genotype prediction:
\[
\delta_{i,k} = \sigma_{i,k} + \Big( \frac{M_{i,j} - f_{j,k}(X_{i,j}) - m_{i,k}}{\sigma_{i,k}} \Big)^2
\]
The genotype k minimising δ_{i,k} is predicted. This function is also used to
measure the confidence of each call: the confidence score is δ_{i,2} − δ_{i,k} for
homozygous calls and the minimum of (δ_{i,1} − δ_{i,2}, δ_{i,3} − δ_{i,2}) for
heterozygous calls. The confidence scores are used for comparing the genotyping
algorithms in detail in Chapter 3.
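Calling a genotype by minimising δ_{i,k} and deriving the confidence score can be sketched as follows (the centres, scales and observed value are invented, and f_{j,k}(X_{i,j}) is passed in as a single precomputed number):

```python
def call_genotype(m_obs, f_val, centers, sds):
    """Compute delta_k for the three genotype classes and return the call
    (1=AA, 2=AB, 3=BB) together with a confidence score."""
    deltas = [sd + ((m_obs - f_val - c) / sd) ** 2
              for c, sd in zip(centers, sds)]
    k = min(range(3), key=lambda i: deltas[i]) + 1
    if k in (1, 3):            # homozygous call: distance to the AB class
        conf = deltas[1] - deltas[k - 1]
    else:                      # heterozygous call
        conf = min(deltas[0] - deltas[1], deltas[2] - deltas[1])
    return k, conf

call, conf = call_genotype(m_obs=2.1, f_val=0.0,
                           centers=[2.0, 0.0, -2.0], sds=[0.3, 0.4, 0.3])
```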
2.6  GenCall
GenCall is the proprietary genotyping algorithm provided by Illumina. The GenCall
software automatically clusters, calls and assigns a confidence score to the input
data sets; its output is used by GenoSNP and Illuminus. In this genotyping
analysis, the data from each array are normalised on their own, using
array-specific information, and outliers can be removed by the normalisation
(Kermani, 2006). Once the data have been normalised, the calling algorithm is
applied to genotype each individual's DNA. Firstly, the intensities are converted
to polar coordinates (R, θ). The radius R is the Manhattan distance from the
origin,
\[
R_i = x_i + y_i,
\]
and the angle θ comes from the standard conversion of Cartesian to polar
coordinates,
\[
\theta_i = f\Big( \tan^{-1}\Big( \frac{y_i}{x_i} \Big) \Big)
\]
where (x_i, y_i) are the normalised X and Y intensities for SNP i. For each SNP,
the locations of the genotype clusters in the polar coordinates (R, θ) are
determined by the best-fitting cluster model, using HapMap data sets as training
data. A GenCall score giving a confidence level is assigned to each call. GenCall
has only three clusters, corresponding to the three genotypes AA, AB and BB; a no
call is defined by a GenCall score less than 0.15.
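The coordinate transformation can be written directly. Since GenCall's exact mapping f is proprietary, the common normalised form 2 tan⁻¹(y/x)/π, which maps the angle onto [0, 1] for non-negative intensities, is used here as an assumption:

```python
import math

def polar(x, y):
    """Manhattan radius and normalised angle for one SNP's (x, y)
    intensities.  The 2/pi scaling is an assumed choice of f."""
    r = x + y                         # Manhattan distance from the origin
    theta = (2 / math.pi) * math.atan2(y, x)
    return r, theta

r, theta = polar(300.0, 300.0)
```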
The clustering and scoring processes are implemented using an Artificial Neural
Network (ANN) (Kermani, 2006). An ANN is a supervised learning method which
requires training data and is effective at modelling nonlinear behaviour and
complex interactions. It maps a set of inputs to a set of outputs; it is a
black-box model in the sense that it is quite difficult to know what is happening
inside, so one only really sees the inputs and outputs. As the details of the ANN
used by GenCall are not available, this section only introduces the basic process
of an ANN.

An ANN is usually represented visually as a graphical network of neurons. Let X be
the original predictors, Z the derived predictors, Y the response neurons and y
the actual response. Figure 2.5 is an example network diagram: the original
predictor neurons form the first layer, the derived predictors the middle layers,
and the response the last layer. Each predictor neuron feeds all elements of the
next layer. Suppose σ is an activation function applied to a linear combination of
the inputs. Using the example network, the derived predictors are
\[
Z^{(1,i)} = \sigma(\alpha_{0i} + \alpha_i^T X), \qquad i = 1, 2, 3, 4
\]
\[
Z^{(2,j)} = \sigma(\beta_{0j} + \beta_j^T Z^{(1,\cdot)}), \qquad j = 1, 2, 3
\]
\[
Y^{(k)} = \sigma(\lambda_{0k} + \lambda_k^T Z^{(2,\cdot)}), \qquad k = 1, 2
\]
where i, j, k index the neurons in each layer. Now suppose there is only a single
hidden layer, the first layer has p neurons X^{(l)}, l = 1, ..., p, and the hidden
layer Z^{(m)} has M neurons, m = 1, ..., M. Then
\[
Z_i^{(m)} = \sigma_m(\alpha_m^T X_i) \equiv \sigma\Big( \sum_{l=1}^{p} \alpha_{ml} X_i^{(l)} \Big)
\]
and Y_i can be written as
\[
Y_i = \rho(\beta^T Z_i) \equiv \rho\Big( \sum_{m=1}^{M} \beta_m Z_i^{(m)} \Big).
\]
The total loss can be expressed as
\[
T(\theta) = \sum_{i=1}^{n} \big\{ y_i - \rho(\beta^T Z_i) \big\}^2
= \sum_{i=1}^{n} \Big\{ y_i - \rho\Big[ \sum_{m=1}^{M} \beta_m \sigma_m(\alpha_m^T X_i) \Big] \Big\}^2
\]
where θ denotes all the parameters β_m and α_{ml}.
The derivatives of the total loss are
\[
\frac{dT}{d\beta_m} = -2 \sum_{i=1}^{n} \big\{ y_i - \rho(\beta^T Z_i) \big\}\, \rho'(\beta^T Z_i)\, Z_i^{(m)}
\]
\[
\frac{dT}{d\alpha_{ml}} = -2 \sum_{i=1}^{n} \big\{ y_i - \rho(\beta^T Z_i) \big\}\, \rho'(\beta^T Z_i)\, \beta_m\, \sigma_m'(\alpha_m^T X_i)\, X_i^{(l)}
\]
The model parameters may be updated according to the usual gradient descent
approach using the total loss expression:
\[
\beta_m^{(r+1)} = \beta_m^{(r)} - \gamma_r \frac{dT}{d\beta_m}\Big|_{\theta = \theta^{(r)}}, \qquad
\alpha_{ml}^{(r+1)} = \alpha_{ml}^{(r)} - \gamma_r \frac{dT}{d\alpha_{ml}}\Big|_{\theta = \theta^{(r)}}.
\]
The number γ_r is referred to as the learning rate. Extra hidden layers add extra
chain-rule elements to the derivatives, and extra responses enclose the total loss
expression in an extra summation. More hidden layers require more processing time.
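The loss and gradient expressions translate into a small single-hidden-layer network trained by stochastic gradient descent. This is a pure-Python sketch, not GenCall's ANN: the output ρ is taken to be the identity (so ρ′ = 1), bias terms are included as in the layer equations, and the XOR-style training data are invented:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyNet:
    """Single hidden layer: Z_m = sigma(alpha_0m + alpha_m^T X),
    Y = beta^T Z (identity rho), trained by the gradients derived above."""
    def __init__(self, p, m_hidden, seed=0):
        rng = random.Random(seed)
        # alpha[m][0] is the bias alpha_0m; alpha[m][1:] are the weights
        self.alpha = [[rng.uniform(-0.5, 0.5) for _ in range(p + 1)]
                      for _ in range(m_hidden)]
        self.beta = [rng.uniform(-0.5, 0.5) for _ in range(m_hidden)]

    def hidden(self, x):
        return [sigmoid(a[0] + sum(w * xl for w, xl in zip(a[1:], x)))
                for a in self.alpha]

    def predict(self, x):
        return sum(b * z for b, z in zip(self.beta, self.hidden(x)))

    def sgd_step(self, x, y, lr):
        z = self.hidden(x)
        err = y - self.predict(x)
        for m, b in enumerate(self.beta):
            # dT/dbeta_m = -2 err z_m, so move against the gradient
            self.beta[m] += lr * 2 * err * z[m]
            # dT/dalpha_ml factor: sigma'(u) = sigma(u)(1 - sigma(u))
            g = 2 * err * b * z[m] * (1 - z[m])
            self.alpha[m][0] += lr * g
            for l, xl in enumerate(x):
                self.alpha[m][l + 1] += lr * g * xl

def mse(net, xs, ys):
    return sum((y - net.predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ys = [0.0, 1.0, 1.0, 0.0]                      # XOR-style target
net = TinyNet(p=2, m_hidden=4)
before = mse(net, xs, ys)
for _ in range(3000):
    for x, y in zip(xs, ys):
        net.sgd_step(x, y, lr=0.2)
after = mse(net, xs, ys)
```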
Figure 2.5: An example network diagram for ANN.
Chapter 3
Comparison of Genotyping Algorithms
In this chapter we compare the four genotyping algorithms. We compared accuracy,
quality measures for SNPs and samples, and minor allele frequency, for HapMap data
(where independent genotype calls are available) and for MS-GWAS data. The
relevant R code is provided in the Appendix.

Currently there is no study comparing the four algorithms with each other using
large-scale data sets, so we give a more thorough study of the four algorithms.
The code was written in the statistical programming language R, with plots
produced; we include only a few plots from typical chips in this chapter. All the
plots are available in the submitted paper (Ritchie et al., 2010).
3.1  Comparison by call confidence
As introduced in Chapter 1, we compare the genotyping algorithms on two types of
data set: HapMap data and association study data. The HapMap data are used to
check the accuracy rate. We applied all four algorithms to both training and test
data and generated accuracy versus drop-rate plots for the comparison. To
calculate the accuracy, we load the output calls and confidence scores from the
four algorithms. To provide a fair comparison, any SNP assigned to the no-call
class by Illuminus, GenCall or GenoSNP is excluded, and we investigate the
performance of the four algorithms on the set of remaining SNPs. We then assess
the agreement of each algorithm with the independent HapMap genotype calls,
removing a varying percentage of calls based on the different quality measures;
this percentage is referred to as the drop rate (Ritchie et al, 2010). Figure 3.1
shows the accuracy of each algorithm for autosomal SNPs from the omni1exp12 chip.
[Figure: accuracy (y-axis, 0.994-0.999) versus drop rate (x-axis, 0-0.03) for GenCall, Illuminus, GenoSNP and CRLMM; panel title: omni1exp12NonX.]
Figure 3.1: Accuracy versus drop rate plot from omni1exp12 chip (autosome only).
Omni1exp12 is a high-density chip type with 270 samples. The calls from the four
methods were compared with calls from the HapMap project, with the null genotypes
within the HapMap project removed. The Y-axis is the accuracy at each drop point
and the X-axis the drop rate from 0 to 0.03; each point in the graph represents
the proportion of calls above a given quality threshold that agree with the HapMap
project. The performance of GenCall is slightly worse than the other methods,
while the accuracy of CRLMM is marginally better than Illuminus and GenoSNP except
at drop rates below 0.005.
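The accuracy versus drop-rate curve itself is simple to compute: sort the calls by confidence, drop the least confident fraction, and measure concordance with the reference calls on the remainder. A sketch in Python (the thesis analysis used R; the toy calls below are invented):

```python
def accuracy_vs_drop_rate(calls, scores, truth, drop_rates):
    """For each drop rate, discard that fraction of lowest-confidence calls
    and return the concordance of the remainder with the reference calls."""
    order = sorted(range(len(calls)), key=lambda i: scores[i])  # least confident first
    out = []
    for d in drop_rates:
        keep = order[int(round(d * len(calls))):]
        agree = sum(calls[i] == truth[i] for i in keep)
        out.append(agree / len(keep))
    return out

calls = ["AA", "AB", "BB", "AA", "AB", "BB", "AA", "AB", "BB", "AA"]
truth = ["AA", "AB", "BB", "AA", "AB", "BB", "AA", "AB", "AB", "AA"]
scores = [0.99, 0.98, 0.97, 0.96, 0.95, 0.94, 0.93, 0.92, 0.10, 0.91]
acc = accuracy_vs_drop_rate(calls, scores, truth, [0.0, 0.1])
```

Here the single discordant call also has the lowest confidence, so dropping 10% of calls raises the accuracy from 0.9 to 1.0.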
Table 3.1: Overall concordance

Method      Overall concordance for valid genotype calls   Number of no-call SNPs
Illuminus   0.9974122                                      403,169
GenoSNP     0.9962048                                      2,558
CRLMM       0.9962343                                      0
GenCall     0.9966172                                      322,650
Table 3.1 shows the overall concordance between the valid genotype calls from the
omni1exp12 chip and the HapMap calls, together with the number of no-call SNPs for
each algorithm. The second column gives the overall concordance for each method
excluding no-call SNPs, and the third column the total number of no-call SNPs.
Illuminus achieved the highest overall concordance, 0.9974122, but also has by far
the largest number of no-call SNPs. CRLMM is the only method that provides
genotype calls for all SNPs.
[Figure: four panels of accuracy (y-axis) versus drop rate (x-axis, 0-0.03) for CRLMM, GenoSNP, BeadStudio/GenCall and Illuminus: 370kDuo test, 370kDuo training, 1mDuo test, 1mDuo training.]

Accuracy versus drop rate for the four methods tested: 370kDuo test with 45 samples, 370kDuo training with 115 samples, 1mDuo test with 12 samples, 1mDuo training with 269 samples.

Figure 3.2: Accuracy versus drop rate from 370kDuo and 1mDuo chips.
Figure 3.2 shows accuracy versus drop rate for two data sets, 370kDuo and 1mDuo,
each with both training and test data. The accuracy of CRLMM is slightly higher
than the other methods. The 370kDuo test data has only 45 samples, and the
accuracy of Illuminus starts at 0.9964 there, whereas on the 370kDuo training data
with 225 samples it starts at 0.998. The accuracy of Illuminus therefore seems to
increase with the number of samples, and Illuminus may perform worse with fewer
samples. We then tested the no-call rate of Illuminus using the omni1exp12 chip,
where the average no-call rate is calculated as the proportion of no-call SNPs per
sample.
[Figure: average no call percentage (y-axis, roughly 1-20%) versus number of samples (x-axis, 5-100); panel title: Illuminus trend in no call rate.]
Figure 3.3: Average no call rate versus number of samples for omni1exp12 chip
We repeatedly ran the Illuminus program, starting with 5 samples and adding 5 more
each time until 100 samples were tested. We found the average no-call rate to be
much higher when fewer samples are available (Figure 3.3): with 5 samples the
no-call rate is 19.799%, and it decreases as the number of samples increases,
reaching its lowest value of 1.2269% with 100 samples. The accuracy was also found
to improve as the total number of samples increases (Ritchie et al, 2010).

As Illuminus and CRLMM handle X chromosome SNPs in males separately, the
performance of each algorithm on X chromosome SNPs is shown in the following
figures.
Figure 3.4: Concordance versus drop rate from the omni1exp12 chip (X chromosome).
The first plot in Figure 3.4 shows accuracy versus drop rate for X chromosome SNPs
across all samples; the second and third plots show it for male and female samples
separately. The statistical models used in CRLMM and Illuminus take a different
form for X chromosome SNPs, and the plots show that both CRLMM and Illuminus have
better overall performance on X chromosome SNPs than the other two algorithms.
This improvement comes mainly from the male samples. The accuracy of GenCall is
slightly worse than the other algorithms on X chromosome SNPs.
3.2  Comparison by SNP quality measures for HapMap data
CRLMM and GenCall produce SNP confidence measures, giving a confidence score for
each SNP. Illuminus instead performs a perturbation analysis on each SNP: an error
term is added to the input intensities and each SNP is recalled with the perturbed
values (Ritchie et al, 2010), with the concordance between the original and
perturbed genotypes output as a perturbation score. The GenoSNP software does not
provide a confidence score directly, but its authors note that the average
posterior probabilities can be used as a quality metric, so we use the average
posterior probabilities as the confidence score for GenoSNP.
50
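The perturbation idea can be illustrated with a toy caller. This Python sketch is not Illuminus's actual model: it assigns allele ratios to the nearest of three hypothetical cluster centres, recalls after adding Gaussian noise, and reports the concordance as a perturbation score:

```python
import numpy as np

rng = np.random.default_rng(0)

def call_genotypes(ratios, centres=(0.0, 0.5, 1.0)):
    """Toy caller: assign each allele ratio to the nearest of three
    cluster centres (a stand-in for the real Illuminus model)."""
    ratios = np.asarray(ratios)
    return np.abs(ratios[:, None] - np.asarray(centres)).argmin(axis=1)

def perturbation_score(ratios, sd=0.05):
    """Concordance between calls on original and perturbed intensities."""
    original = call_genotypes(ratios)
    perturbed = call_genotypes(ratios + rng.normal(0.0, sd, len(ratios)))
    return float((original == perturbed).mean())

ratios = np.array([0.02, 0.98, 0.51, 0.49, 0.95, 0.05])
print(perturbation_score(ratios))  # a value between 0 and 1
```

Calls near a cluster boundary (such as the ratios around 0.5 above) are the ones most likely to flip under perturbation, which is why the score measures robustness.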
[Plot: "Quality Measures (omni1exp12NonX)" — confidence score versus drop rate for GenCall, Illuminus, GenoSNP and CRLMM.]
Figure 3.5: Confidence score versus drop rate from the omni1exp12 chip (autosomal
SNPs only).
Figure 3.5 shows plots similar to the earlier accuracy versus drop rate plots, for data sets of different quality and from different labs. The autosomal SNPs are from the omni1exp12 chip, with the accuracy rate replaced by confidence scores. Illuminus appears to perform better than the others, while GenoSNP has the worst confidence scores.
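The drop-rate curves behind these plots come from thresholding calls by confidence. A Python sketch of that calculation, mirroring the quantile-based loop in Appendix 2 but with made-up toy data:

```python
import numpy as np

def concordance_at_drop_rate(calls, confs, reference, drop):
    """Concordance with reference calls after dropping the `drop`
    fraction of least-confident calls (the thresholding behind the
    accuracy-versus-drop-rate plots)."""
    calls, confs, reference = map(np.asarray, (calls, confs, reference))
    keep = confs >= np.quantile(confs, drop)
    return (calls[keep] == reference[keep]).mean()

# Toy data: 8 calls (1 = AA, 2 = AB, 3 = BB) with one low-confidence error
calls     = np.array([1, 2, 3, 1, 2, 3, 1, 3])
reference = np.array([1, 2, 3, 2, 2, 3, 1, 3])
confs     = np.array([.99, .98, .97, .60, .95, .96, .94, .93])
print(concordance_at_drop_rate(calls, confs, reference, 0.0))    # 0.875
print(concordance_at_drop_rate(calls, confs, reference, 0.125))  # 1.0
```

Dropping the least-confident 12.5% removes the single wrong call here, so concordance rises to 1.0 — the mechanism by which accuracy improves along the drop-rate axis.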
3.3 Comparison by sample quality measures for MS-GWAS data
The MS-GWAS data are tested by processing samples obtained from different centres and batches. We use these data to compare the performance of each method on calling poor quality samples. CRLMM and Illuminus allow for the identification of these poor quality samples. For samples with less than 40% agreement with the other methods, GenoSNP returns an error message rather than producing calls. A dataset that includes poor quality samples may distort the calling algorithms to such a degree that mistaken calls are made even on high quality chips. Figure 3.6 shows smoothed scatter plots of three typical poor quality samples. Figure 3.6A shows only one cluster of points, most of which have intensities less than 7. There are no separated clusters of points in Figures 3.6B and 3.6C. These poor quality samples can be flagged by checking the signal-to-noise ratio in CRLMM or the no call rate in the other methods. Sample quality measures are essential in studies involving large numbers of samples.
[Smoothed scatter plots of M versus S intensities for three poor quality samples, panels A to C.]
Figure 3.6: Sample quality measures for MS-GWAS data
It is necessary to have a metric for flagging poor quality samples, so that samples with failed hybridisation can be quickly identified. Figure 3.7 shows the sample quality measures for the MS-GWAS data for the four genotyping algorithms. The x-axis gives the sample number, and six colours denote batches 1 to 6. For GenCall, Illuminus and GenoSNP, the y-axis gives the percentage of no calls per sample; for CRLMM, it gives the signal-to-noise ratio (SNR) score per sample (Ritchie et al, 2010). The four methods agree on most of the poor quality samples: GenoSNP, GenCall, Illuminus and CRLMM flag many of the same samples as potential outliers. GenoSNP, GenCall and Illuminus detect some of the same low quality samples in batch 1, where the worst sample was assigned 60 percent no call SNPs by these three methods.
[Plots of quality measure versus sample number for GenCall, Illuminus, GenoSNP and CRLMM, coloured by batch.]
Source of figure: Ritchie et al (2010)
Figure 3.7: Sample quality measures for MS-GWAS data
[Venn diagram of the 20 lowest quality samples flagged by GenCall, Illuminus, GenoSNP and CRLMM. Unique objects: All = 22; S1 = 20; S2 = 20; S3 = 20; S4 = 20.]
Source of figure: Ritchie et al (2010)
Figure 3.8: Agreement between 20 lowest quality samples from four genotyping
algorithms
Figure 3.8 shows the agreement among the twenty lowest quality samples detected by the four methods (Ritchie et al, 2010). For GenCall and Illuminus, the twenty samples with the highest no call rates were chosen; for GenoSNP, the twenty samples with the lowest average posterior probability over all SNPs in a given sample; and for CRLMM, the twenty samples with the lowest SNR scores. All four methods agree on 18 of the 20 samples. GenoSNP selects one sample that is not identified by the other three methods, and the sample selected by GenCall and GenoSNP was ranked 21st by both CRLMM and Illuminus.
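The agreement counts behind such a Venn diagram reduce to set operations. A Python sketch with hypothetical sample IDs (the counts below are illustrative, not the thesis numbers):

```python
# Flagged samples per method: hypothetical IDs, illustrative overlap only
flagged = {
    "GenCall":   {f"s{i}" for i in range(20)},
    "illuminus": {f"s{i}" for i in range(20)},
    "crlmm":     {f"s{i}" for i in range(20)},
    "GenoSNP":   {f"s{i}" for i in range(19)} | {"s25"},
}
in_all = set.intersection(*flagged.values())  # flagged by every method
in_any = set.union(*flagged.values())         # flagged by at least one
print(len(in_all), len(in_any))  # 19 21
```

The same intersection/union counts over the real ranked lists give the "18 of 20" and "22 unique objects" figures quoted above.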
3.4 Comparison by Minor Allele Frequency for HapMap data
Minor allele frequency (MAF) is the frequency at which the less common allele of a SNP occurs in a given population; each SNP can thus be assigned a minor allele frequency within a population. The lower the minor allele frequency selected, the greater the number of tagSNPs required to represent the variation in a given genomic region. The MAF is calculated by finding the minimum proportion of the homozygous genotypes (AA or BB) for a given SNP.
We assess the agreement of all the SNPs in a given MAF interval. SNPs in the no call class for any of GenCall, Illuminus or GenoSNP are excluded from our calculations for a fair comparison.
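As an illustration, the standard allele-count estimate of MAF can be sketched in Python. The 1/2/3 genotype coding is assumed; note this uses allele frequencies rather than the genotype-proportion shortcut described above:

```python
import numpy as np

def minor_allele_frequency(genotypes):
    """MAF for one SNP from genotype codes (1 = AA, 2 = AB, 3 = BB),
    via allele counts: freq(A) = (2*nAA + nAB) / (2*n)."""
    g = np.asarray(genotypes)
    n_aa = int((g == 1).sum())
    n_ab = int((g == 2).sum())
    freq_a = (2 * n_aa + n_ab) / (2 * len(g))
    return min(freq_a, 1 - freq_a)

# 6 samples: 3 AA, 2 AB, 1 BB -> freq(A) = 8/12, so MAF = 1/3
print(minor_allele_frequency([1, 1, 1, 2, 2, 3]))
```

Binning SNPs by this value gives the MAF intervals over which agreement is assessed below.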
[Six-panel plot (A) to (F): accuracy versus minor allele frequency for GenCall, Illuminus, GenoSNP and CRLMM.]
Plots (A) to (C) are results from 610Quad training; plots (D) to (F) are results from 610Quad test. (A) and (D) are accuracy versus MAF plots with a 0% drop rate, (B) and (E) with a 1% drop rate, and (C) and (F) with a 5% drop rate.
Figure 3.9: Accuracy versus minor allele frequency from 610Quad training and test
datasets.
Figure 3.9 shows accuracy versus minor allele frequency for the four algorithms tested, using the 610Quad training and test datasets. Each dataset was tested at three drop rates: 0%, 1% and 5%. CRLMM offers the best accuracy at low MAF (MAF ≤ 0.1) across all drop rates, followed by GenoSNP. At low MAF, the between-sample methods (Illuminus and GenCall) have worse accuracy than the within-sample methods. The 610Quad training data has 225 samples (excluding replicates), and the accuracy with MAF less than 0.05 is above 0.994 for all methods. Illuminus performs well with large sample sizes, but for the 610Quad training dataset the plots do not show its performance at low MAF. We therefore test accuracy versus minor allele frequency for Illuminus alone. Figure 3.10 is an accuracy plot from the 610Quad test datasets for Illuminus, with points of different colours showing the performance at different drop rates. It shows that Illuminus performs poorly at low MAF (MAF ≤ 0.15), with accuracy much lower than the other three methods.
[Plot: accuracy versus minor allele frequency for Illuminus at several drop rates.]
Points of different colours show the accuracy at different drop rates; for example, the red points show accuracy versus MAF at a 0% drop rate.
Figure 3.10: Accuracy versus minor allele frequency with fewer samples (Illuminus only).
Chapter 4
Discussion
Several genotype calling algorithms have been developed in the literature for the purpose of processing raw intensities from SNPs into genotype calls. Genotype calling algorithms provide a foundation for genetic analysis and are widely used in genome-wide association studies. We compared four genotyping algorithms: GenoSNP, Illuminus, CRLMM and GenCall, and introduced the model-based approach each algorithm uses to call genotypes. GenoSNP, Illuminus and CRLMM are implemented within an Expectation-Maximisation (EM) framework.
GenoSNP calls genotypes using a within-sample method for the Illumina Infinium SNP genotyping array. This is a truly independent genotyping method: the performance of GenoSNP does not depend on the size of the study. GenoSNP does not require much memory for computation, but it takes much longer to run than the other methods.
Illuminus is a model-based genotype calling algorithm which does not require the data to be partitioned into training and test parts. It provides a metric for assessing the robustness of the assigned genotypes to minor changes in the allelic signals, and this quality metric can be used to identify SNPs with low call rates and accuracy. Illuminus is efficient to run: its running time is only one tenth of GenoSNP's.
GenCall is a software package which automatically clusters, calls genotypes and assigns confidence scores. To call the genotypes for individual SNPs, it takes the intensity values and information, then identifies which cluster the data for a specific locus corresponds to.
CRLMM is a method which requires training data and prior information on genotype calls. The training data are generated from HapMap individuals. CRLMM is able to produce genotypes for all SNPs, and also allows for the identification of poor quality, low call rate chips.
Illuminus and GenCall are between-sample methods; CRLMM is the only method using both between-sample and within-sample modelling. The normalisation methods also differ across the four algorithms. Generally, using normalised intensities as input data gives better results. GenoSNP is the only method which fits a within-sample model and uses non-normalised raw intensities as input data. The normalisation algorithm for GenCall uses six degrees of freedom for the affine transformation. SNPs with allelic intensities smaller than the first percentile or larger than the 99th percentile are defined as outliers; these outliers are removed, then four estimations are applied at each SNP. Illuminus uses the normalised data from GenCall. CRLMM uses a different normalisation method from GenCall, in which SNPs are quantile normalised between the two alleles and across samples.
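The idea behind quantile normalisation can be sketched in a few lines. This simplified Python version forces every column (sample) of an intensity matrix onto the same empirical distribution; it is an illustration of the principle only, not CRLMM's exact procedure:

```python
import numpy as np

def quantile_normalise(mat):
    """Map each column of `mat` (SNPs x samples) onto the mean of the
    sorted columns, so all samples share one intensity distribution."""
    order = np.argsort(mat, axis=0)          # sort order within each column
    ranks = np.argsort(order, axis=0)        # rank of each entry in its column
    ref = np.sort(mat, axis=0).mean(axis=1)  # reference distribution
    return ref[ranks]

x = np.array([[5., 4.],
              [2., 1.],
              [3., 4.],
              [4., 2.]])
out = quantile_normalise(x)
# After normalisation, both columns contain the same set of values
print(np.sort(out, axis=0))
```

Each value is replaced by the reference value at its within-column rank, so between-sample intensity differences that are purely distributional are removed.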
In statistical terms, GenoSNP models the distribution of the probe intensities using a four-component Student-t mixture model, with components corresponding to AA, AB, BB and a null class respectively. Illuminus uses a one-dimensional three-component Gaussian mixture model, with the no call class introduced as a fourth bivariate Gaussian component with large variances and zero covariance. In GenCall, the clustering and scoring processes are implemented using an artificial neural network; three genotypes are defined in the Bayesian model, and the null class is defined by a SNP's confidence score being less than 0.15. CRLMM is the only method that guesses genotypes for all SNPs. Its underlying statistical model is a three-component mixture model, with a spline used to model the smooth mean function; a two-level hierarchical model is then used to obtain genotypes. Both GenCall and CRLMM were trained on data generated from HapMap individuals. Note that the calls from the HapMap Project are known to have an error rate of their own.
One disadvantage of using an artificial neural network is that the computation becomes complex when the ANN has many hidden layers. It is impossible to work out the nth derivative of the total loss, and the accuracy suffers when a learning rate is used instead of derivatives. Another problem is that the neural network needs training data to operate: for datasets without training information in HapMap, GenCall does not work at all. CRLMM needs training data as well.
For the methods using the EM algorithm, efficiency is low due to the iterative process involved in probabilistic reasoning when imputing the incomplete data, so the estimates will not be fully efficient for over-identified models. However, the EM algorithm avoids wildly overshooting or undershooting the maximum of the likelihood along its direction of search, and it handles parameter constraints elegantly.
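As a concrete illustration of these EM iterations, the following Python sketch fits a one-dimensional three-component Gaussian mixture (the kind of model Illuminus uses) to simulated allele-ratio data. All starting values and cluster parameters here are arbitrary:

```python
import math
import random

def em_gmm(x, means, sds=None, iters=50):
    """Minimal EM for a 1-D k-component Gaussian mixture."""
    k = len(means)
    means = list(means)
    sds = list(sds) if sds else [0.1] * k
    pis = [1.0 / k] * k                      # mixing weights
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        resp = []
        for xi in x:
            w = [p * math.exp(-0.5 * ((xi - m) / s) ** 2) / s
                 for p, m, s in zip(pis, means, sds)]
            tot = sum(w)
            resp.append([wi / tot for wi in w])
        # M-step: re-estimate weights, means and standard deviations
        for j in range(k):
            nj = sum(r[j] for r in resp)
            pis[j] = nj / len(x)
            means[j] = sum(r[j] * xi for r, xi in zip(resp, x)) / nj
            var = sum(r[j] * (xi - means[j]) ** 2
                      for r, xi in zip(resp, x)) / nj
            sds[j] = max(math.sqrt(var), 1e-3)  # guard against collapse
    return means, sds, pis

random.seed(1)
# Simulated allele ratios: three clusters standing in for AA, AB, BB
x = ([random.gauss(0.05, 0.02) for _ in range(30)] +
     [random.gauss(0.50, 0.02) for _ in range(30)] +
     [random.gauss(0.95, 0.02) for _ in range(30)])
means, sds, pis = em_gmm(x, [0.2, 0.5, 0.8])
print([round(m, 2) for m in means])  # close to [0.05, 0.5, 0.95]
```

Each iteration only ever moves the parameters in the ascent direction of the likelihood, which is the "no overshooting" property noted above, at the cost of many iterations.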
Regarding the performance of the four genotyping algorithms, the accuracy rates differ only slightly across methods. On the high quality HapMap chips, the four methods offer similar accuracy. CRLMM is marginally better than GenoSNP and Illuminus, followed by GenCall, on some high quality chips such as the 660Quad, omni1Quad and omni1exp12 chips. These high quality chips have large sample sizes, which keeps the no call rate of Illuminus sufficiently small. We tested the average no call rate for Illuminus using the omni1exp12 chip with 270 samples. Smaller sample sizes give a higher no call rate; the no call rate for Illuminus decreases as the sample size increases. The performance of Illuminus depends on the sample size: large sample sizes give better accuracy and a lower no call rate. Overall, Illuminus has a higher no call rate than GenCall, followed by GenoSNP. For studies which have smaller sample sizes, Illuminus is not the best method of choice.
GenoSNP's authors mention that GenoSNP can be used to improve the calls for SNPs with lower minor allele frequencies. From the results of Section 3.4, we observed that both CRLMM and GenoSNP do well for SNPs with lower MAF. For large numbers of samples, the accuracy on SNPs with lower MAF for Illuminus and GenCall is lower than for the other two methods. For MAF less than 0.15, Illuminus performs much worse than the others on datasets with fewer samples. The performance of Illuminus depends on the sample size, so for such cases with smaller sample sizes Illuminus would not be the method of choice.
A computer with a large amount of memory is required for this project to process the high dimensional matrices. GenCall is only available on Windows, Illuminus is only available on Linux, and both CRLMM and GenoSNP are available for Linux and Windows. GenoSNP takes the longest time to run, followed by GenCall. Illuminus and CRLMM run much faster than the other two methods, but CRLMM requires the largest amount of memory; in contrast, GenCall and GenoSNP use very little RAM. In addition, the input files for GenoSNP and Illuminus need to be prepared in R, which takes several hours and a large amount of memory (greater than 20GB).
Chapter 5
Conclusion
We have discussed the basic statistical models and performance of four genotyping algorithms: GenoSNP, Illuminus, CRLMM and GenCall. This study compares the methods using widely-used datasets: the HapMap data and the MS-GWAS data. The statistical models are presented theoretically, and the performance of the four algorithms is compared in terms of accuracy rate, sample quality measurements and error rates across minor allele frequencies. In general, there are no significant differences between the methods in terms of accuracy, though CRLMM is slightly better than the other methods. The performance of Illuminus depends on the total number of samples: accuracy increases as the number of samples increases, and although Illuminus always has the highest no call rate, that rate is acceptable with more than 50 samples. CRLMM produces genotypes for all the SNPs. Illuminus and CRLMM have separate models for X chromosome SNPs, and their performance on X chromosome SNPs is significantly better than that of the other two methods. All four genotyping algorithms agree on flagging low quality samples; however, GenoSNP cannot produce genotype calls on some extremely poor quality samples. Between-sample methods (Illuminus and GenCall) calling SNPs with low minor allele frequency have lower accuracy than GenoSNP and CRLMM. For studies with smaller sample sizes, Illuminus would not be the method of choice due to its poor accuracy.
Moreover, as CRLMM and GenCall use HapMap data as training data, they cannot be used when the HapMap training data is not available. For genotyping in non-model organisms, Illuminus and GenoSNP are recommended.
Appendix A
Appendix 1
R code for making input files for Illuminus and GenoSNP.
############################################################
#Illuminus#
############################################################
sample=read.csv("/omni1exp12/HumanOmniExpress-12v1_B.csv")
sub2=sub1[sub1[,10]!="X",]
allelesab = gsub("\\[", "", sub2[,"SNP"])
allelesab= gsub("\\/", "", allelesab)
allelesab = gsub("\\]", "", allelesab)
x=read.table("/omni1exp12/OmniExp12_X_IllNorm.txt")
y=read.table("/omni1exp12/OmniExp12_Y_IllNorm.txt")
p<-NULL
rsID1=sample[,2]
coord=subx[,11]#mapinfo change it
for(i in 1:ncol(x)){
if(i==1){
p=data.frame(x[,i],y[,i])
}
else{
p=cbind(p,x[,i],y[,i])
}
}
colnames(p)[seq(1,2*ncol(x),by=2)]=paste(colnames(x),"_X",sep="")
colnames(p)[seq(2,2*ncol(x),by=2)]=paste(colnames(x),"_Y",sep="")
p=cbind(rsID1,p)
row.names(p)<-NULL
mer=merge(subx,p,by.x="Name",by.y="rsID1")
d=cbind(rsID1,coord,allelesab,mer[,27:568])
############################################################
#GenoSNP#
############################################################
load("/omniexp12/humanomniexpress12v1bCrlmm/inst/extdata/annotationOrig.rda")
tmpmx=read.table("/omni1exp12/OmniExp12_X_Raw.txt")
tmpmy=read.table("/omni1exp12/OmniExp12_Y_Raw.txt")
fn <- "/genosnp/input/"
a=NULL
n=NULL
sub=annotorig[annotorig[,23]=="1",]
rsID=annotorig[,2]
ID=annotorig[,19]
for(i in 1:ncol(tmpmx)){
cat(i)
a[i]=colnames(tmpmx)[i]
n[[i]]=data.frame(ID,rsID,tmpmx[,i],tmpmy[,i])
m=match(sub[,2],n[[i]][,2])
n[[i]]=n[[i]][m,]
write.table(n[[i]],paste(fn,a[[i]],".txt", sep = ""),row.names=F,
col.names=F, quote=FALSE,append=T)
}
R code for running CRLMM
############################################################
#CRLMM#
############################################################
library(Biobase)
library(crlmm) # using crlmm package#
setwd("/thumper/ctsa/snpmicroarray/HapMap/processed/illumina/omni1exp12/")
X = read.table("OmniExp12_X_Raw.txt" , header=TRUE, as.is=TRUE,
check.names=FALSE, sep="\t")
Y = read.table("OmniExp12_Y_Raw.txt" , header=TRUE, as.is=TRUE,
check.names=FALSE, sep="\t")
zeroes = read.table("OmniExp12_zeroes.txt", header=TRUE,
as.is=TRUE, check.names=FALSE, sep="\t")
X = as.matrix(X)
Y = as.matrix(Y)
zeroes = as.matrix(zeroes)
tmpmat = matrix(1, nrow(X), ncol(X))
rownames(tmpmat) = rownames(X)
createHapmapInfo = function(ids){
load("/thumper/ctsa/snpmicroarray/DataSnpNexsan/SNP/hapmap/hapmapTable.rda")
idx = match(ids, hapmapTable$hapmap)
hapmapTable[idx, c("hapmap", "sex")]
}
samples = colnames(X)
reps = which(duplicated(samples))
# in training data, we’ll remove duplicate samples#
if(length(reps)>0) {
X = X[, -reps]
Y = Y[, -reps]
zeroes = zeroes[, -reps]
samples=samples[-reps]
}
samplenames = samples
colnames(tmpmat)=samplenames
colnames(X) = colnames(Y) = colnames(zeroes) =
colnames(tmpmat) = samplenames
stopifnot(identical(rownames(X), rownames(Y)))
stopifnot(identical(colnames(X), colnames(Y)))
stopifnot(identical(rownames(zeroes), rownames(Y)))
stopifnot(identical(colnames(zeroes), colnames(Y)))
samplesheet =createHapmapInfo(samplenames)
setwd("/thumper/ctsa/snpmicroarray/ms-gwas/crlmm/hapmap/omniexp12")
library(ff) # using ff package to save memory #
XYtrain = new("NChannelSet",
X = initializeBigMatrix(name = "X",
nr = nrow(X), nc = ncol(X), vmode = "integer", initdata=X),
Y = initializeBigMatrix(name = "Y",
nr = nrow(X), nc = ncol(X), vmode = "integer", initdata=Y),
zero = initializeBigMatrix(name = "zero",
nr = nrow(X), nc = ncol(X), vmode = "integer", initdata=zeroes),
annotation = "humanomniexpress12v1b",
storage.mode = "environment")
#featureNames(XYtrain) = rownames(X)
close(XYtrain@assayData$X)
close(XYtrain@assayData$Y)
close(XYtrain@assayData$zero)
crlmmoni1exp12tr = crlmmIllumina(XY=XYtrain,stripNorm=TRUE,
gender=samplesheet$sex,
seed=1, mixtureSampleSize=10^5, verbose=TRUE,
cdfName="humanomniexpress12v1b",
sns=samplenames,
returnParams=TRUE)
Appendix B
Appendix 2
R code for creating hapmapCalls, loading output files for the four algorithms and working out concordance. The code is similar to that used for the quality measure comparison.
############################################################
#creating hapmapCalls for training and test data#
############################################################
#################################
#test data#
#################################
load(anno.rda) #load annotation information#
snpids = annot$Name
sub=annot[annot$ToCall=="1",]
subx=sub[sub$ProbeType=="x snp",] #the ones with "1" & "X"
snpnamex=subx$Name
subn=sub[sub$ProbeType=="snp",] #the non-X ones
snpname=subn$Name
createHapmapInfo = function(ids){
load(hapmapTable)
idx = match(ids, hapmapTable$hapmap)
hapmapTable[idx, c("hapmap", "sex")]
}
samplesheetTrn = createHapmapInfo(samplenames)
samplesheetTst = createHapmapInfo(samplenames)
db = "/thumper/ctsa/snpmicroarray/DataSnpNexsan/SNP/regions/hapmap-ncbi36/rs/bin/hapmap-rs.db"
library(RSQLite)
conn = dbConnect(dbDriver("SQLite"), db)
pops = c("ceu", "chb", "jpt", "yri")
sql = paste("SELECT * FROM @POP@")
all.calls = lapply(pops, function(x){
tmp = dbGetQuery(conn, gsub("@POP@", x, sql))
idx = tmp[["rs"]] %in% snpids
tmp[idx, names(tmp) %in% c("rs", samplenames)]
})
hapmapCallsTest = merge(all.calls[[1]], all.calls[[2]], all=TRUE)
hapmapCallsTest = merge(hapmapCallsTest,
annot, by.x="rs", by.y="Name", all=TRUE)
hapmapCallsTest =
hapmapCallsTest[match(snpids, hapmapCallsTest\$rs),
order(colnames(hapmapCallsTest))]
info = hapmapCallsTest[, -grep("NA[[:digit:]]", names(hapmapCallsTest))]
hapmapCallsTest = as.matrix(hapmapCallsTest[, grep("NA[[:digit:]]",
names(hapmapCallsTest))])
rownames(hapmapCallsTest) = info$rs
hapmapCallsTest = hapmapCallsTest[,match(samplenames,
colnames(hapmapCallsTest))]
hapmapCallsTest[hapmapCallsTest == 0] = NA
load("hapmapCallsRev.rda")
hapmapCallsTest[reversed,] = 4-hapmapCallsTest[reversed,]
hapmapCallsTest[reversed2,] = 4-hapmapCallsTest[reversed2,]
gender = samplesheetTrn$sex
gender=na.omit(gender)
#################################
#training data#
#################################
load(hapmaptr)
hapmapCalls = hapmapCalls[,match(samplenames, colnames(hapmapCalls))]
hapmapCalls=hapmapCalls[,-(91)]
load(geno.rda)
toIgnore = rowSums(regionInfo)!=0
hapmapCalls[toIgnore,] = NA
rm(regionInfo)
rm=match(snpname,rownames(hapmapCalls))
hapmapCallsn=hapmapCalls[rm,]
rmx=match(snpnamex,rownames(hapmapCalls))#x chr#
hapmapCallsx=hapmapCalls[rmx,]
hapmapCallsMx=hapmapCalls[rmx,gender==1]
hapmapCallsFx=hapmapCalls[rmx,gender==2]
###################################################################
#load calls and confidence scores for Illuminus, find accuracy rate#
###################################################################
illcalls=read.table(illtrcalls,header=FALSE, quote="",
as.is=TRUE, check.names=FALSE, sep=" ", strip.white=TRUE )
illcallssex=read.table(illtrcallsx,header=FALSE, quote="",
as.is=TRUE, check.names=FALSE, sep=" ", strip.white=TRUE)
illprob=read.table(illtrprob,header=FALSE, as.is=TRUE, quote = "",
check.names=FALSE, sep=" ",strip.white=TRUE)
illprobsex=read.table(illtrprobx,header=FALSE, as.is=TRUE,quote ="",
check.names=FALSE, sep=" ",strip.white=TRUE)
illcallsTr=rbind(illcalls,illcallssex)
rm(illcalls, illcallssex)
gc()
illcallsTr=illcallsTr[,5:(ncol(illcallsTr)-1)]
illnew=rbind(illprob,illprobsex)
rm(illprob, illprobsex)
gc()
x = matrix(NA, nrow(illnew), ncol(illcallsTr))
a=illnew[,5:(ncol(illnew)-1)]
n=seq(1,ncol(a),by=4)
for(j in 1:length(n)){
cat(j, " ")
i = n[j]
v=a[,i:(i+3)]
x[,j] = apply(v,1,max,na.rm=TRUE)
gc()
}
cat("\n")
rm(v, a, n)
illconfsTr=x
rm(x)
gc()
setwd(rawdatadir)
y=read.table(ynormtr, nrow=10)
illrownames=rownames(illcallsTr)=rownames(illconfsTr)=illnew[,1]
rm(illnew)
gc()
illcolnames=colnames(illcallsTr)=colnames(illconfsTr)=colnames(y)
illcolnames=gsub("\\.([1-3][0-9]|[1-9])","",illcolnames)
illreps = which(duplicated(illcolnames))
rm(y)
gc()
if(length(illreps)>0) {
illcallsTr= illcallsTr[,-illreps]
illconfsTr= illconfsTr[,-illreps]
illcolnames = illcolnames[-illreps]
}else {
illcallsTr=illcallsTr
illconfsTr= illconfsTr
}
cat("Reorganising illuminus calls\n")
illnewordr = match(rownames(hapmapCalls), illrownames)
illnewordc = match(colnames(hapmapCalls), illcolnames)
illconfsTr = illconfsTr[illnewordr,illnewordc]
illcallsTr=illcallsTr[illnewordr,illnewordc]
illrm=match(snpname,rownames(illcallsTr))#nonx Chr#
illcallsTrn=illcallsTr[illrm,]
illconfsTrn=illconfsTr[illrm,]
illrmx=match(snpnamex,rownames(illcallsTr))#x chr#
illcallsTrx=illcallsTr[illrmx,]
illconfsTrx=illconfsTr[illrmx,]
illcallsTrMx=illcallsTr[illrmx,gender==1]
illconfsTrMx=illconfsTr[illrmx,gender==1]
illcallsTrFx=illcallsTr[illrmx,gender==2]
illconfsTrFx=illconfsTr[illrmx,gender==2]
x = matrix(NA, nrow(illcallsTrn), 3*ncol(illcallsTrn))
n = seq(0, 3*ncol(illcallsTrn), by=3)+1
for(i in 1:ncol(illcallsTrn)){
cat(i, " ")
j = n[i]
x[,j:(j+2)] = cbind(illcallsTrn[,i],illconfsTrn[,i],hapmapCallsn[,i])
gc()
}
cat("\n")
z<-NULL
n=seq(1,ncol(x),by=3)
for(i in n){
z[[i]]=x[x[,i]!="4",i:(i+2)]
}
for(i in n){
cat(i, " ")
if(i==1){
illcallsTrn=as.vector(z[[i]][,1])
illconfsTrn=as.vector(z[[i]][,2])
hapmapCallsn1=as.vector(z[[i]][,3])
}else{
illcallsTrn=c(illcallsTrn,as.vector(z[[i]][,1]))
illconfsTrn=c(illconfsTrn,as.vector(z[[i]][,2]))
hapmapCallsn1=c(hapmapCallsn1,as.vector(z[[i]][,3]))
}
}
cat("\n")
rm(x, n, z)
gc()
isnan1 = is.na(hapmapCallsn1)
hapmapvecn1 = as.vector(hapmapCallsn1[!isnan1])
illconfsTrn1=as.vector(illconfsTrn[!isnan1])
illcallsTrn1=as.vector(illcallsTrn[!isnan1])
illagreeTrn=hapmapvecn1==illcallsTrn1
x = matrix(NA, nrow(illcallsTrx), 3*ncol(illcallsTrx))
n = seq(0, 3*ncol(illcallsTrx), by=3)+1
for(i in 1:ncol(illcallsTrx)){
j = n[i]
x[,j:(j+2)] = cbind(illcallsTrx[,i],illconfsTrx[,i],hapmapCallsx[,i])
}
z<-NULL
n=seq(1,ncol(x),by=3)
for(i in n){
z[[i]]=x[x[,i]!="4",i:(i+2)]
}
for(i in n){
if(i==1){
illcallsTrx=as.vector(z[[i]][,1])
illconfsTrx=as.vector(z[[i]][,2])
hapmapCallsx1=as.vector(z[[i]][,3])
}else{
illcallsTrx=c(illcallsTrx,as.vector(z[[i]][,1]))
illconfsTrx=c(illconfsTrx,as.vector(z[[i]][,2]))
hapmapCallsx1=c(hapmapCallsx1,as.vector(z[[i]][,3]))
}
}
rm(x, n, z)
gc()
isnax1 = is.na(hapmapCallsx1)
hapmapvecx1 = as.vector(hapmapCallsx1[!isnax1])
illconfsTrx1=as.vector(illconfsTrx[!isnax1])
illcallsTrx1=as.vector(illcallsTrx[!isnax1])
illagreeTrx=hapmapvecx1==illcallsTrx1
isnax1 = is.na(hapmapCallsMx1)
hapmapvecMx1 = as.vector(hapmapCallsMx1[!isnax1])
illconfsTrMx1=as.vector(illconfsTrMx[!isnax1])
illcallsTrMx1=as.vector(illcallsTrMx[!isnax1])
illagreeTrMx=hapmapvecMx1==illcallsTrMx1
isnax1 = is.na(hapmapCallsFx1)
hapmapvecFx1 = as.vector(hapmapCallsFx1[!isnax1])
illconfsTrFx1=as.vector(illconfsTrFx[!isnax1])
illcallsTrFx1=as.vector(illcallsTrFx[!isnax1])
illagreeTrFx=hapmapvecFx1==illcallsTrFx1
illtrdenomn = itryn = illtrdenomx = itryx =
illtrdenommx = itrymx = illtrdenomfx = itryfx = rep(0, length(qs))
cat("Calculating concordance for illuminus calls\n")
for(i in 1:length(qs)) {
cat(qs[i], " ")
ind=illconfsTrn1>=quantile(illconfsTrn1,qs[i],na.rm=TRUE)
illtrdenomn[i]=sum(ind,na.rm=TRUE)
itryn[i]=sum(illagreeTrn[ind],na.rm=TRUE)
ind=illconfsTrx1>=quantile(illconfsTrx1,qs[i],na.rm=TRUE)
illtrdenomx[i]=sum(ind,na.rm=TRUE)
itryx[i]=sum(illagreeTrx[ind],na.rm=TRUE)
ind=illconfsTrMx1>=quantile(illconfsTrMx1,qs[i],na.rm=TRUE)
illtrdenommx[i]=sum(ind,na.rm=TRUE)
itrymx[i]=sum(illagreeTrMx[ind],na.rm=TRUE)
ind=illconfsTrFx1>=quantile(illconfsTrFx1,qs[i],na.rm=TRUE)
illtrdenomfx[i]=sum(ind,na.rm=TRUE)
itryfx[i]=sum(illagreeTrFx[ind],na.rm=TRUE)
}
###################################################################
#load calls and confidence scores for CRLMM, find accuracy rate#
###################################################################
load(crlmmtr)
tmp = strsplit(sampleNames(crlmmoni1exp12tr), "_")
# Will need to edit this for different chip types
samplenames = rep("", length(tmp))
for(i in 1:length(samplenames)) {
samplenames[i] = tmp[[i]][1]
}
crlmmcallsTr = calls(crlmmoni1exp12tr)
crlmmconfsTr = crlmmoni1exp12tr@assayData$callProbability
newordec=match(colnames(hapmapCalls),colnames(crlmmcallsTr))
neworder=match(rownames(hapmapCalls),rownames(crlmmcallsTr))
crlmmcallsTr=crlmmcallsTr[neworder,newordec]
crlmmconfsTr=crlmmconfsTr[neworder,newordec]
crlmmrm=match(snpname,rownames(crlmmcallsTr))#nonx Chr#
crlmmcallsTrn=crlmmcallsTr[crlmmrm,]
crlmmconfsTrn=crlmmconfsTr[crlmmrm,]
crlmmrmx=match(snpnamex,rownames(crlmmcallsTr))#x chr#
crlmmcallsTrx=crlmmcallsTr[crlmmrmx,]
crlmmconfsTrx=crlmmconfsTr[crlmmrmx,]
crlmmcallsTrMx=crlmmcallsTr[crlmmrmx,gender==1]
crlmmconfsTrMx=crlmmconfsTr[crlmmrmx,gender==1]
crlmmcallsTrFx=crlmmcallsTr[crlmmrmx,gender==2]
crlmmconfsTrFx=crlmmconfsTr[crlmmrmx,gender==2]
###similar codes with Illuminus###
###################################################################
#load calls and confidence scores for GenCall, find accuracy rate#
###################################################################
bsa1tr = read.table(bsfiletra1, header=TRUE, as.is=TRUE,
check.names=FALSE, sep="\t")
bsa2tr = read.table(bsfiletra2, header=TRUE, as.is=TRUE,
check.names=FALSE, sep="\t")
bsgctr = read.table(bsfiletrgc, header=TRUE, as.is=TRUE,
check.names=FALSE, sep="\t")
# zeroes = read.table("370K_zeroes.txt", header=TRUE,
#     as.is=TRUE, check.names=FALSE, sep="\t")
bsa1tr = as.matrix(bsa1tr)
bsa2tr = as.matrix(bsa2tr)
bsgctr = as.matrix(bsgctr)
bsreps = which(duplicated(colnames(bsa1tr)))
if(length(bsreps)>0) {
bsa1tr = bsa1tr[,-bsreps]
bsa2tr = bsa2tr[,-bsreps]
bsgctr = bsgctr[,-bsreps]
}
rm(bsreps)
cat("Reorganising beadstudio calls\n")
newordr = match(rownames(hapmapCalls), rownames(bsgctr))
newordc = match(colnames(hapmapCalls), colnames(bsgctr))
bsgctr = bsgctr[newordr,newordc]
newordr = match(rownames(hapmapCalls), rownames(bsa1tr))
newordc = match(colnames(hapmapCalls), colnames(bsa1tr))
bsa1tr = bsa1tr[newordr,newordc]
newordr = match(rownames(hapmapCalls), rownames(bsa2tr))
newordc = match(colnames(hapmapCalls), colnames(bsa2tr))
bsa2tr = bsa2tr[newordr,newordc]
bsgtTr = matrix(NA, nrow(bsa1tr), ncol(bsa1tr))
bsgtTr[(bsa1tr=="A" & bsa2tr=="G") | (bsa1tr=="A" & bsa2tr=="C") |
(bsa1tr=="T" & bsa2tr=="G") | (bsa1tr=="T" & bsa2tr=="C") |
(bsa1tr=="A" & bsa2tr=="T") | (bsa1tr=="T" & bsa2tr=="A") |
(bsa1tr=="G" & bsa2tr=="C") | (bsa1tr=="C" & bsa2tr=="G")] = 2
bsgtTr[(bsa1tr=="A" & bsa2tr=="A") | (bsa1tr=="T" & bsa2tr=="T")] = 1
bsgtTr[(bsa1tr=="G" & bsa2tr=="G") | (bsa1tr=="C" & bsa2tr=="C")] = 3
rm(bsa1tr, bsa2tr)
gc()
bsrm=match(snpname,rownames(bsgctr))
bsgcTrn=bsgctr[bsrm,]
bsgtTrn=bsgtTr[bsrm,]
bsrmx=match(snpnamex,rownames(bsgctr)) #x chr#
bsgcTrx=bsgctr[bsrmx,]
bsgtTrx=bsgtTr[bsrmx,]
bsgcTrMx=bsgctr[bsrmx,gender==1]
bsgtTrMx=bsgtTr[bsrmx,gender==1]
bsgcTrFx=bsgctr[bsrmx,gender==2]
bsgtTrFx=bsgtTr[bsrmx,gender==2]
###similar steps with Illuminus ###
###################################################################
#load calls and confidence scores for GenoSNP, find accuracy rate#
###################################################################
setwd(rawdatadir)
x<-NULL
xtr=read.table(xtraw, nrows=10)
colnames = unique(colnames(xtr))
setwd(genosnptrdir)
for(i in 1:ncol(xtr)){
cat(i, " ")
n=paste(colnames[i],".txt",sep="")
tmp=read.table(n,header=TRUE, as.is=TRUE,quote="",check.names=FALSE, sep="")
if(i==1){
snps =annotorig[,2]
p = q = matrix(NA, nrow(tmp), ncol(xtr))
}
p[,i] = tmp[,3]
q[,i] = tmp[,4]
rm(tmp)
}
cat("\n")
colnames(p)=colnames(q)=colnames
rownames(q)=rownames(p)=snps
genocallsTr = p
genoconfsTr = q
rm(p,q,xtr)
colnames=gsub("\\.([1-3][0-9]|[1-9])","",colnames(genocallsTr))
genoreps = which(duplicated(colnames))
if(length(genoreps)>0) {
genocallsTr = genocallsTr[,-genoreps]
genoconfsTr = genoconfsTr[,-genoreps]
colnames = colnames[-genoreps]
}
rm(genoreps)
genocallsTr[(genocallsTr=="AA")]=1
genocallsTr[(genocallsTr=="AB")]=2
genocallsTr[(genocallsTr=="BB")]=3
# may need to check ordering of rows and columns to ensure consistency
genonewordr = match(rownames(hapmapCalls), rownames(genocallsTr))
genonewordc = match(colnames(hapmapCalls), colnames(genocallsTr))
genoconfsTr = genoconfsTr[genonewordr,genonewordc]
genocallsTr = genocallsTr[genonewordr,genonewordc]
genorm=match(snpname,rownames(genocallsTr))#nonx Chr#
genocallsTrn=genocallsTr[genorm,]
genoconfsTrn=genoconfsTr[genorm,]
genormx=match(snpnamex,rownames(genocallsTr))#x chr#
genocallsTrx=genocallsTr[genormx,]
genoconfsTrx=genoconfsTr[genormx,]
genocallsTrMx=genocallsTr[genormx,gender==1]
genoconfsTrMx=genoconfsTr[genormx,gender==1]
genocallsTrFx=genocallsTr[genormx,gender==2]
genoconfsTrFx=genoconfsTr[genormx,gender==2]
x = matrix(NA, nrow(genocallsTrn), 3*ncol(genocallsTrn))
n = seq(0, 3*ncol(genocallsTrn), by=3)+1
for(i in 1:ncol(genocallsTrn)){
cat(i, " ")
j = n[i]
x[,j:(j+2)] = cbind(genocallsTrn[,i],genoconfsTrn[,i],hapmapCallsn[,i])
gc()
}
cat("\n")
t<-NULL
n=seq(1,ncol(x),by=3)
for (i in n){
t[[i]]=x[x[,i]!="NC",i:(i+2)]
}
for( i in n){
cat(i, " ")
if(i==1){
genocallsTrn=as.vector(t[[i]][,1])
genoconfsTrn=as.numeric(t[[i]][,2])
hapmapCallsn2=as.vector(t[[i]][,3])
}else{
genocallsTrn=c(genocallsTrn,as.vector(t[[i]][,1]))
genoconfsTrn=c(genoconfsTrn,as.numeric(t[[i]][,2]))
hapmapCallsn2=c(hapmapCallsn2,as.vector(t[[i]][,3]))
}
}
cat("\n")
rm(x, n, t)
gc()
isnan2=is.na(hapmapCallsn2)
hapmapvecn2 = as.vector(hapmapCallsn2[!isnan2])
genoconfsTrn1=as.vector(genoconfsTrn[!isnan2])
genocallsTrn1=as.vector(genocallsTrn[!isnan2])
genoagreeTrn=hapmapvecn2==genocallsTrn1
x = matrix(NA, nrow(genocallsTrx), 3*ncol(genocallsTrx))
n = seq(0, 3*ncol(genocallsTrx), by=3)+1
for( i in 1:ncol(genocallsTrx)){
j = n[i]
x[,j:(j+2)] = cbind(genocallsTrx[,i],genoconfsTrx[,i],hapmapCallsx[,i])
}
t<-NULL
n=seq(1,ncol(x),by=3)
for (i in n){
t[[i]]=x[x[,i]!="NC",i:(i+2)]
}
for( i in n){
if(i==1){
genocallsTrx=as.vector(t[[i]][,1])
genoconfsTrx=as.numeric(t[[i]][,2])
hapmapCallsx2=as.vector(t[[i]][,3])
}else{
genocallsTrx=c(genocallsTrx,as.vector(t[[i]][,1]))
genoconfsTrx=c(genoconfsTrx,as.numeric(t[[i]][,2]))
hapmapCallsx2=c(hapmapCallsx2,as.vector(t[[i]][,3]))
}
}
rm(x, n, t)
gc()
isnax2=is.na(hapmapCallsx2)
hapmapvecx2 = as.vector(hapmapCallsx2[!isnax2])
genoconfsTrx1=as.vector(genoconfsTrx[!isnax2])
genocallsTrx1=as.vector(genocallsTrx[!isnax2])
genoagreeTrx=hapmapvecx2==genocallsTrx1
###similar code for Illuminus###
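The accuracy computation above reduces to measuring concordance between an algorithm's calls and the HapMap reference calls after dropping SNPs with missing reference entries. A minimal self-contained sketch (the two call vectors here are hypothetical, coded 1/2/3 as in the appendix):

```r
# Hypothetical call vectors coded 1 (AA), 2 (AB), 3 (BB); NA = no reference call
algCalls <- c(1, 2, 3, 2, 1, 3)
refCalls <- c(1, 2, 3, 1, NA, 3)
keep <- !is.na(refCalls)                 # exclude SNPs without a reference call
agree <- algCalls[keep] == refCalls[keep]
accuracy <- mean(agree)                  # proportion of concordant calls: 4/5 = 0.8
```

For these illustrative vectors, five SNPs have reference calls and four of them agree, giving an accuracy of 0.8.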
Appendix C
Appendix 3
This appendix provides the R code used to compute the minor allele frequency (MAF). It assumes all the genotype call objects from the earlier appendices are already loaded.
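As a compact preview of the listing below, the following self-contained sketch derives a per-SNP minor allele frequency from the homozygote counts alone, mirroring the min(AA, BB)/(AA + BB) ratio used there (the calls matrix is illustrative; rows are SNPs, entries are genotype codes 1 = AA, 2 = AB, 3 = BB):

```r
# Illustrative calls matrix: 3 SNPs by 4 samples; NA marks a no-call
calls <- matrix(c(1, 1, 2, 3,
                  3, 3, 2, 1,
                  2, 2, 1, NA), nrow = 3, byrow = TRUE)
nAA <- rowSums(calls == 1, na.rm = TRUE)  # homozygous AA counts per SNP
nBB <- rowSums(calls == 3, na.rm = TRUE)  # homozygous BB counts per SNP
maf <- pmin(nAA, nBB) / (nAA + nBB)       # rarer homozygote proportion
```

Note that, like the code below, this ratio uses only the two homozygote counts and leaves the heterozygote count out of the approximation.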
######################################################################
#work out minor allele frequency#
######################################################################
#exclude nocalls first#
illhapmap=hapmapCalls
illhapmap[(illcallsTr=="4")]=NA
illhapmap[(genocallsTr=="NC")]=NA
illhapmap[(bsgctr<=0.15)]=NA
#Illuminus#
illa1=apply(illcallsTr,1,function(x){sum(x==1,na.rm=TRUE)})
illa3=apply(illcallsTr,1,function(x){sum(x==3,na.rm=TRUE)})
m=cbind(illa1,illa3)
s=apply(m,1,function(x){sum(x,na.rm=TRUE)})
x<-NULL
for( i in 1:length(illa1)){
x[i]=min(illa1[i]/s[i],illa3[i]/s[i])
}
n=c(0.05,seq(0.1,0.5,by=0.1))
mafill=cbind(rownames(illcallsTr),x)
o=order(as.numeric(mafill[,2]))
mafill=mafill[o,]
mafv=as.numeric(mafill[,2]) #cbind coerces to character, so keep a numeric copy#
exmill=illhapmap==illcallsTr
exmill=exmill[o,]
exproportionill=exilllevel=rep(0,length(n))
for(i in 1:length(n)){
lo=if(i==1) 1 else 1+sum(mafv<=n[i-1])
hi=sum(mafv<=n[i])
block=as.vector(exmill[lo:hi,])
exilllevel[i]=sum(block==FALSE,na.rm=TRUE)
exproportionill[i]=exilllevel[i]/sum(!is.na(block))
}
#similar code for the other methods#
Bibliography
[1] Beal, M.J. and Ghahramani, Z. (2003). The Variational Bayesian EM Algorithm for Incomplete Data: with Application to Scoring Graphical Model Structures. Oxford University Press.
[2] Brown, M.P.S., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C.W., Furey, T.S. and Haussler, D. (1999). Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences, 262-267.
[3] Teo, Y.Y., Inouye, M., Small, K.S., Gwilliam, R., Deloukas, P., Kwiatkowski, D.P. and Clark, T.G. (2007). A genotype calling algorithm for the Illumina BeadArray platform. Bioinformatics, 23, 2741-2746.
[4] Giannoulatou, E., Yau, C., Colella, S., Ragoussis, J. and Holmes, C.C. (2008). GenoSNP: a variational Bayes within-sample SNP genotyping algorithm that does not require a reference population. Bioinformatics, 24, 2209-2214.
[5] Lin, S., Carvalho, B.S., Cutler, D.J., Arking, D.E., Chakravarti, A. and Irizarry, R.A. (2008). Validation and extension of an empirical Bayes method for SNP calling on Affymetrix microarrays. Genome Biology, 9, R63.
[6] Ritchie, M.E., Carvalho, B.S., Hetrick, K.N., Tavaré, S. and Irizarry, R.A. (2009). R/Bioconductor software for Illumina's Infinium whole-genome genotyping BeadChips. Bioinformatics, 25, 2621-2623.
[7] Carvalho, B.S., Speed, T.P. and Irizarry, R.A. (2007). Exploration, normalization and genotype calls of high density oligonucleotide SNP array data. Biostatistics, 8, 485-499.
[8] Nityasuddhi, D. and Böhning, D. (2003). Asymptotic properties of the EM algorithm estimate for normal mixture models with component specific variances. Computational Statistics & Data Analysis, 41, 591-601.
[9] Ritchie, M.E., Liu, R., Carvalho, B.S., Hetrick, K.N., The Australian and New Zealand Multiple Sclerosis Genetics Consortium (ANZgene Consortium) and Irizarry, R.A. (2010). Comparing genotyping algorithms for Illumina's Infinium whole-genome SNP BeadChips. BMC Bioinformatics (submitted).
[10] International HapMap Consortium (2007). A second generation human haplotype map of over 3.1 million SNPs. Nature, 449, 851-861.
[11] Bahlo, M., Stankovich, J., Danoy, P., Hickey, P.F., Taylor, B.V., Browning, S.R., The Australian and New Zealand Multiple Sclerosis Genetics Consortium (ANZgene Consortium), Brown, M.A. and Rubio, J.P. (2009). Saliva-derived DNA performs well in large-scale, high-density SNP microarray studies. Cancer Epidemiology, Biomarkers & Prevention, 19, 794.
[12] R Development Core Team (2009). R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.
ISBN 3-900051-07-0. URL: http://www.R-project.org
[13] Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39, 1-38.
[14] Kermani, B.G. (2006). Artificial intelligence and global normalization methods for genotyping. United States Patent 7035740.
[15] Steemers, F.J., Chang, W., Lee, G., Barker, D.L., Shen, R. and Gunderson, K.L. (2006). Normalization of Illumina Infinium whole-genome SNP data improves copy number estimates and allelic intensity ratios. BMC Bioinformatics, 9, 409.
[16] Borman, S. (2004). The Expectation Maximisation Algorithm: A short tutorial. URL: http://www.seanborman.com/publications/
[17] Ziegler, A. and Koenig, I.R. (2006). A statistical approach to genetic epidemiology: concepts and applications. Wiley-VCH.
[18] Salanti, G., Amountza, G. and Ntzani, E.E. (2005). Hardy-Weinberg equilibrium in genetic association studies: an empirical evaluation of reporting, deviations, and power. European Journal of Human Genetics, 13, 840-848.
[19] Dunning, M.J., Smith, M.L., Ritchie, M.E. and Tavaré, S. (2007). beadarray: R classes and methods for Illumina bead-based data. Bioinformatics, 23, 2183-2184.
[20] Bolstad, B.M., Irizarry, R.A., Åstrand, M. and Speed, T.P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19, 185-193.
[21] Kermani, B.G. (2005). Artificial intelligence and global normalization methods for genotyping. US Patent 20060224529.