2014
MACHINE LEARNING METHODOLOGIES IN THE DISCOVERY OF THE INTERACTION BETWEEN GENES IN COMPLEX DISEASES
RICARDO JOSÉ MOREIRA PINHO
MASTER'S DISSERTATION PRESENTED TO THE FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO
FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO
Machine Learning methodologies in the
discovery of the interaction between
genes in complex diseases
Ricardo Pinho
DISSERTATION
Mestrado Integrado em Engenharia Informática e Computação
Supervisor: Rui Camacho
Co-Supervisor: Alexessander Alves (Imperial College London, UK)
July 2014
Machine Learning methodologies in the discovery of the
interaction between genes in complex diseases
Ricardo Pinho
Mestrado Integrado em Engenharia Informática e Computação
Approved in oral examination by the committee:
Chair: Auxiliary Professor Ana Cristina Ramada Paiva
External Examiner: Auxiliary Researcher Sérgio Guilherme Aleixo de Matos
(Instituto de Engenharia Electrónica e Telemática de Aveiro)
Supervisor: Associate Professor Rui Carlos Camacho de Sousa Ferreira da Silva
July 2014
Abstract
In recent years, there has been extensive research on gene-gene interactions to analyze how complex
diseases are affected by the genome. Many Genome Wide Association Studies (GWAS) have
been performed, with interesting results. This renewed interest is due to the computing power that is
available today. Machine Learning methodologies quickly became a successful tool to find previously unknown genetic relations. The popularity of this field increased greatly after the discovery of
the potential value of gene-gene studies in the detection and understanding of how phenotypes are
expressed.
The information available in the DNA of the human genome can be divided into functional
subgroups that code different phenotypes. These subgroups are the genes, which can have different
presentations and still have the same behaviour. A part of a gene whose variation changes the gene's
behaviour is called a Single Nucleotide Polymorphism (SNP). These SNPs interact with
each other to affect how genes work, which affects the phenotypes that are expressed.
The purpose of this dissertation is to increase the knowledge obtained from these studies,
detecting more interactions related to the manifestation of complex diseases. This is achieved
by testing algorithms in a thorough empirical study and adding a new and improved Ensemble
approach that shows better results than the existing state-of-the-art algorithms.
To achieve this goal, there are two main stages. The first stage consists of a comparison study
amongst the most recent statistical and Machine Learning methodologies using simulated data sets
containing generated epistatic interactions. The algorithms BEAM3.0, BOOST, MBMDR, Screen
and Clean, SNPHarvester, SNPRuler, and TEAM were run on data sets with many different parameter
settings. The results showed that, with the exception of Screen and Clean, and MBMDR, all
algorithms displayed good results in relation to Power, Type I Error Rate, and Scalability.
The second stage is the creation of a new algorithm based on the results obtained in the first stage.
This new algorithm comprises an aggregation of previously tested methodologies, of which five
algorithms were chosen. This new Ensemble approach manages to
maintain the Power of the best algorithm, while decreasing the Type I Error Rate.
Acknowledgements
First and foremost I would like to thank my supervisor, Professor Rui Camacho, for the effort,
patience and dedication to this project, which would have been impossible to accomplish without his
support. I would also like to thank my co-supervisor, Alexessander Alves, for saving me a lot of
trouble and helping me understand the specifics of the project. Considering that this area was very
new to me, his expertise was very much needed and appreciated.
I want to thank my family, especially my parents, for giving me the opportunity to be able to
learn and work on something that I love and for always believing in me.
Ricardo Pinho
"An approximate answer to the right problem is worth a good deal more than an exact answer to
an approximate problem."
John Wilder Tukey
Contents
1 Introduction
   1.1 Context
   1.2 Project
   1.3 Motivation and Goals
   1.4 Structure of the Report

2 State-of-the-Art
   2.1 Biological concepts
   2.2 Statistical and Machine Learning Algorithms
   2.3 Data analysis evaluation procedures and measures
   2.4 Data Simulation and Analysis Software
   2.5 Chapter Conclusions

3 A Comparative Study of Epistasis and Main Effect Analysis Algorithms
   3.1 Introduction
   3.2 Methods
       3.2.1 Algorithms for interaction analysis
   3.3 Simulation Design
   3.4 Chapter Conclusions

4 Ensemble Approach
   4.1 Introduction
   4.2 Experiments
   4.3 Chapter Conclusions

5 Conclusions
   5.1 Contribution summary

References

A Glossary
   A.1 Biology related terms
   A.2 Data mining terms
   A.3 Lab Notes
List of Figures
2.1 An illustration of the interior of a cell. [cel14]
2.2 Bayesian Network. Nodes represent SNPs. [JNBV11]
2.3 An example of a Neural Network. [HK06]
2.4 A logit transformation and a possible logistic regression function resultant of the logit transformation. [WFH11]
2.5 An example of a ROC curve.
2.6 The main stages of the KDD process. [FPSS96]
2.7 The CRISP-DM life cycle. [CCK00]
2.8 A diagram of the ATHENA software package [HDF+ 13].
2.9 The Weka Explorer interface.
3.1 Results of epistasis detection by population size.
3.2 Results of main effect detection by population size.
3.3 Results of full effect detection by population size.
3.4 Results of epistasis detection by minor allele frequency.
3.5 Results of main effect detection by minor allele frequency.
3.6 Results of full effect detection by minor allele frequency.
4.1 Results of epistasis detection by population size, with a 0.1 minor allele frequency, 2.0 odds ratio, and 0.02 prevalence.
4.2 Results of main effect detection by population size, with a 0.1 minor allele frequency, 2.0 odds ratio, and 0.02 prevalence.
4.3 Results of full effect detection by population size, with a 0.1 minor allele frequency, 2.0 odds ratio, and 0.02 prevalence.
4.4 Results of epistasis detection by minor allele frequency, with 2000 individuals, 2.0 odds ratio, and 0.02 prevalence.
4.5 Results of main effect detection by minor allele frequency, with 2000 individuals, 2.0 odds ratio, and 0.02 prevalence.
4.6 Results of full effect detection by minor allele frequency, with 2000 individuals, 2.0 odds ratio, and 0.02 prevalence.
4.7 Results of epistasis detection by odds ratio, with a 0.1 minor allele frequency, 2000 individuals, and 0.02 prevalence.
4.8 Results of main effect detection by odds ratio, with a 0.1 minor allele frequency, 2000 individuals, and 0.02 prevalence.
4.9 Results of full effect detection by odds ratio, with a 0.1 minor allele frequency, 2000 individuals, and 0.02 prevalence.
4.10 Results of epistasis detection by prevalence, with a 0.1 minor allele frequency, 2000 individuals, and 2.0 odds ratio.
4.11 Results of main effect detection by prevalence, with a 0.1 minor allele frequency, 2000 individuals, and 2.0 odds ratio.
4.12 Results of full effect detection by prevalence, with a 0.1 minor allele frequency, 2000 individuals, and 2.0 odds ratio.
List of Tables
2.1 An example of a penetrance table.
2.2 A description of each data selection algorithm.
2.3 A description of each model creation algorithm designed for this problem.
2.4 A description of each generic model creation algorithm.
2.5 A description of auxiliary algorithms used in model creation or data selection.
2.6 A description of data analysis measures and how they are calculated.
2.7 A description of data analysis procedures.
2.8 A comparison between the different procedures [AS08].
2.9 A comparison of the most relevant features of data simulation tools.
2.10 A comparison of data mining tools.
2.11 Similarities and differences between BEAM3, BOOST, MBMDR, Screen & Clean, SNPHarvester, SNPRuler, and TEAM.
3.1 The values of each parameter used. Each configuration has a unique set of the parameters used.
3.2 Scalability test containing the average running time, CPU usage, and memory usage by data set population size.
4.1 Scalability test containing the average running time, CPU usage, and memory usage by data set population size for epistasis detection.
4.2 Scalability test containing the average running time, CPU usage, and memory usage by data set population size for main effect detection.
4.3 Scalability test containing the average running time, CPU usage, and memory usage by data set population size for full effect detection.
Abbreviations
A          Adenine
ACO        Ant Colony Optimization
ALEPH      A Learning Engine for Proposing Hypotheses
API        Application Programming Interface
ATHENA     Analysis Tool for Heritable and Environmental Network Associations
AUC        Area Under the Curve
BDA        Backward Dropping Algorithm
BEAM       Bayesian Epistasis Association Mapping
BOOST      Boolean Operation-based Screening and Testing
C          Cytosine
CRISP-DM   Cross Industry Standard Process for Data Mining
DAG        Directed Acyclic Graph
DM         Data Mining
FOL        First Order Logic
G          Guanine
GENN       Grammatical Evolution Neural Networks
GPNN       Genetically Programmed Neural Networks
GPAS       Genetic Programming for Association Studies
GUI        Graphical User Interface
GWAS       Genome Wide Association Study
IDE        Integrated Development Environment
IIM        Information Interaction Method
ILP        Inductive Logic Programming
K-NN       K-Nearest Neighbor
KEEL       Knowledge Extraction Evolutionary Learning
KDD        Knowledge Discovery in Databases
KNIME      Konstanz Information Miner
MAGENTA    Meta-Analysis Gene-set Enrichment of variaNT Associations
MCMC       Markov Chain Monte Carlo
MDR        Multifactor Dimensionality Reduction
MDS        MultiDimensional Scaling
ML         Machine Learning
NB         Naïve Bayes
NN         Neural Network
OS         Operating System
PDIS       Dissertation Planning
PMML       Predictive Model Markup Language
ROC        Receiver Operating Characteristic
S&C        Screen & Clean
SAS        Statistical Analysis System
SEMMA      Sample, Explore, Modify, Model and Assess
SNP        Single Nucleotide Polymorphism
SVM        Support Vector Machine
T          Thymine
TEAM       Tree-based Epistasis Association Mapping
Chapter 1
Introduction
This chapter introduces the context by discussing the evolution of epistasis research. The next section contains a discussion of the project and the contribution of the dissertation. It is followed by the
motivation and goals for this work. The last section contains the structure of the document.
1.1 Context
Epistasis is the interaction between genes that work together to affect the manifestation of a complex
disease. The study of epistasis to determine the expression of phenotypes has long been known
to yield interesting results, considering that most phenotypes cannot be explained by simple correlations to Single Nucleotide Polymorphisms (SNPs) or even to a specific gene. However, only
recently have enough technological advances been made, such as in computer processing power,
to allow for whole-genome studies. These studies quickly became a popular tool to discover genetic patterns in the manifestation of several phenotypes, including certain SNP configurations
with high risk of developing complex diseases. This allowed for a better understanding of many
complex diseases that were otherwise undetectable until the manifestation of symptoms.
From studies of allergic sensitization [BMP+ 13] and obesity predisposition [FTW+ 07] to diabetes [SGM+ 10] and breast cancer [RHR+ 01], there have been many successfully identified
associations between genes and the expression of complex diseases.
Considering that these advances in Genome Wide Association Studies (GWAS) are very recent,
there is not yet a well-established method to test for and find significant results. Therefore, many
statistical and machine learning approaches have since been developed.
The study of epistatic interactions is a high-dimensionality problem, caused by the millions of
possible combinations between SNPs, where each combination can involve a varying number of
SNPs. Because of this, the correct identification of interactions becomes a
problem, not only due to outliers and noise, but also because of the many possible configurations.
Another issue with its complexity is the identification of the correlation between interactions and
the actual manifestation of the phenotype in question. There is also an error associated with every
data mining problem, which in this case can be explained by mutations or ambiguity in SNPs.
Very recently, many different algorithms have been proposed to tackle the problems implicit to
GWAS.
Recent machine learning algorithms have tackled this problem by simplifying its complexity,
reducing the inherent dimensionality.
1.2 Project
The project of this dissertation consists of two empirical studies. Initially, there is a review of
the literature to identify a range of different algorithms that are likely to produce better results.
Artificial datasets are then created to test these algorithms.
Based on the results obtained, a second empirical study is carried out with a newly designed algorithm, aiming to find an approach that may obtain better results than the existing algorithms. This
new methodology is a combination of the state-of-the-art algorithms.
Dissertation Contribution
This dissertation contains empirical studies of the state-of-the-art algorithms, which enables a
broad analysis of the factors affecting the performance of these algorithms using relevant evaluation measures. Based on this information, new studies can have a better understanding of what to
expect from each method. With the introduction of new algorithms, this dissertation may produce
methodologies that better suit the needs of the domain problem.
1.3 Motivation and Goals
Genome wide association studies have had a significant impact on SNP identification and analysis in
recent years. They have allowed the discovery of how genes interact with each other and how each
gene affects phenotypes. By mapping epistatic interactions and gene behavior, it is possible to find
risk factors associated with complex diseases. These risk factors can be identified by certain SNP
configurations or genotypes.
From a machine learning standpoint, there is a lot of room for improvement. Considering that
this is a problem that has only recently started to be studied, using very different methodologies, it is still
not as optimized or as accurately solved as it can be. With better, more adapted and efficient
algorithms, the relevance of GWAS can increase considerably. Due to the inherent dimensional complexity, scalability is a very important requirement for the developed methodologies. Algorithms
used in typical prediction problems, such as classification and regression, now need to be adapted
to fit the requirements of this specific problem, which does not fall in the classical convention of
prediction problems. This requires an adaptation to the data and to the output, generating a result
that is understandable by specialists in the genetic field.
The main goal of epistatic studies is to find SNPs responsible for the expression of phenotypes,
which in this case are related to complex diseases. This means that the loci and the alleles that are
active in complex diseases and contribute to their manifestation need to be identified. Their behavior
and the interactions relevant to the disease need to be monitored and assessed. This information
presents a better understanding of the disease in question and can be used in a medical scenario
to mark specific genotypes that have a high probability of manifesting genetic related complex
diseases, which can then be preemptively watched and treated.
1.4 Structure of the Report
The rest of this report is divided into four main chapters: the state of the art, a comparative study of
epistasis and main effect analysis algorithms, the ensemble approach, and conclusions.
Chapter 2 has a brief introduction to the topic, followed by a review of the related
biological concepts in Section 2.1. Section 2.2 consists of a description of state-of-the-art machine learning and statistical algorithms that have been used in data selection, model creation and
other auxiliary algorithms. Section 2.3 contains evaluation measures and procedures that are relevant to estimating and optimizing the data mining process. There is also a description of the many
data mining tools in Section 2.4 including software tools specifically designed for epistasis detection analysis and data simulation software. Section 2.5 of this chapter contains the conclusions
extracted from the study of the existing algorithms, tools and evaluation measures and procedures.
Chapter 3 is composed of an introduction to stage I of the experiments (Section 3.1), a description
of the data and methodologies used in this project (Section 3.2), the experimental procedure and
results (Section 3.3), and finally the conclusions of the chapter (Section 3.4).
In Chapter 4, Section 4.1 contains the introduction to stage II of the experiments. Section
4.2 has the experimental procedure, results, and discussion of the stage II experiments. Section
4.3 contains the final conclusions of this chapter.
Chapter 5 contains a brief summary with the final conclusions from the empirical studies, and
a summary of the contributions of this dissertation in Section 5.1.
Chapter 2
State-of-the-Art
Over the last five years, many methodologies have surfaced to find a solution for genome-wide studies. With the growth of computing power, algorithms that were previously infeasible are
now valid options for the identification of SNPs related to diseases. Most of these algorithms
are based on well-known data mining approaches like prediction, clustering and rule-based algorithms. Some software tools were also developed specifically for this purpose.
In this chapter we first introduce some basic biological concepts as biological background
required to understand the main issues and specifications of the dissertation. The concept of
epistasis and Genome Wide Association Studies are introduced in that section.
Data Mining algorithms are introduced in Section 2.2. This includes data selection, model
creation algorithms specifically designed for epistasis studies and generic model creation algorithms that are related to specific algorithm implementations. Some auxiliary algorithms, used in
the model creation are also included in this section.
In Section 2.3 the most relevant procedures and measures of evaluating the state-of-the-art
methodologies are discussed. These include relevant evaluation metrics for model testing and
machine learning approaches to provide better results for the generated solutions. In that section
there is also a description of the most commonly used data mining methodologies.
Section 2.4 contains data mining software tools, some of which are specifically designed for
this problem. This Section also contains the data simulation software used for artificial generation
of data sets with epistasis and main effect.
2.1 Biological concepts
Basic concepts
Most human beings have 46 chromosomes within the nucleus of each cell. These chromosomes are
divided into 23 groups, with every group having a pair of similar chromosomes. Each chromosome
is composed of a very large double helix structure: the deoxyribonucleic acid (DNA). DNA is
subdivided into sections called genes, which consist of regions that code for proteins. Figure 2.1
illustrates these structures.
A gene is composed of several nucleotide bases. These bases can be adenine, cytosine,
guanine, or thymine. Each base pairs with a complementary base. This means that, at each
position of a nucleotide base, also known as a locus, there is a pairing of adenine-thymine or
cytosine-guanine. Each variation of the nucleotide bases is called an allele. Alleles and loci can
also refer to a variation and a position of a gene, rather than of a single nucleotide base. The pair
of alleles at the same locus, one on each chromosome of a pair, is called a genotype. Considering that there are
usually two alleles for each locus, there is usually a dominant allele and a recessive allele,
which means that one allele is expressed more often than the other. The expression
of a physical trait or the creation of a protein is called a phenotype. For each locus, there are usually
three different genotype configurations: two dominant alleles, two recessive alleles (having two equal alleles
in a genotype is called homozygous), or a dominant and a recessive allele (having different alleles
in a genotype is called heterozygous). A recessive allele is only expressed in a homozygous
recessive genotype.
Figure 2.1: An illustration of the interior of a cell. [cel14]
Single Nucleotide Polymorphisms
A single nucleotide polymorphism (SNP) is a specific nucleotide base, within a gene, that changes
what the gene expresses. This means that a different allele SNP will create a different gene that
will express a different protein or physical trait. Not all nucleotide bases in a gene are relevant to
this process. In this context, SNPs are the most important part of the gene.
Genetic Markers
Genetic markers are specific sets of SNPs, genes or DNA sequences that are used to identify
specific traits, individuals or species. In our study, genetic markers are used to identify specific
traits within complex diseases. In recent years, genetic markers are not limited to genes that encode
visible characteristics. Due to the genetic mapping of the human genome, patterns of SNPs can be
related to traits, without directly encoding a specific characteristic, including dominant/recessive
or co-dominant markers [Avi94].
Main Effect
In this context, the main effect is related to the influence of a SNP on the expression of a phenotype,
in this case a complex disease. Any SNP has a main effect if it has a direct impact on the disease
expression. Multiple SNPs can have a main effect on the same phenotype expression, without
having a relation between them.
Epistasis
The concept of epistasis was first described by Bateson [Mud09] as the control of the manifestation
of the effect of one allele in one locus by another allele in another locus [Cor02]. This definition
has since changed its meaning and subdivided into different, often conflicting, definitions. According to
Philips [Phi08], there are three major categories into which the term “epistasis” can be subdivided:
Functional Epistasis, Compositional Epistasis and Statistical Epistasis.
Functional epistasis refers to the functional applications of the molecular interactions between
genes. The focus is on the proteins that are created by these interactions and on their effects.
Compositional epistasis is used to describe the traditional usage of the term "epistasis". It
describes the interaction between two loci, with specific alleles. This interaction affects the phenotype
expression.
Statistical epistasis describes the average deviation in the effect resulting from the interaction
in a set of alleles at different loci from the effect of those alleles considered independently [Fis19].
It is an additive expectation of the epistasis effect on the allelic function.
Genome Wide Association Studies
The search for genetic markers has helped to determine previously unknown aspects of complex
diseases. Previous studies focused on single-locus analysis and provided underwhelming results
[Cor09]. By changing this approach to include complex relations between genes in the effect of
phenotypes, new information about biological and biochemical pathways has surfaced, and GWAS
have since become a powerful tool for understanding these diseases.
Penetrance Tables
A disease may affect only a proportion of the individuals in a population who carry the disease-related
genetic markers. This proportion is the penetrance of the disease [FES+ 98]. With a high disease penetrance, most of the individuals with a given disease-associated genetic marker will manifest that
disease. Penetrance is estimated from the disease allele frequency and the affected individuals: it is the
proportion of individuals carrying the disease-associated SNP who develop the disease. From these
values, a table can be created showing the percentage of individuals affected with the disease, given
their genotypes. Each genotypic configuration has
a penetrance value associated. Table 2.1 shows an example of a penetrance table.
Genotype configuration    Penetrance
AABB                      0.068
AABb                      0.064
AAbb                      0.040
AaBB                      0.055
AaBb                      0.047
Aabb                      0.103
aaBB                      0.039
aaBb                      0.103
aabb                      0.004
Table 2.1: An example of a penetrance table.
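To make the table concrete, the short sketch below combines the penetrance values of Table 2.1 with genotype frequencies to obtain the overall disease prevalence. The allele frequencies and the Hardy-Weinberg assumption are illustrative choices for this example, not values taken from this dissertation.

def genotype_freqs(maf):
    # Frequencies of the three genotypes at one biallelic locus (Hardy-Weinberg).
    p, q = 1 - maf, maf
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

penetrance = {  # values copied from Table 2.1
    "AABB": 0.068, "AABb": 0.064, "AAbb": 0.040,
    "AaBB": 0.055, "AaBb": 0.047, "Aabb": 0.103,
    "aaBB": 0.039, "aaBb": 0.103, "aabb": 0.004,
}

locus1 = genotype_freqs(0.3)                        # locus with alleles A/a
locus2 = {g.replace("A", "B").replace("a", "b"): f  # locus with alleles B/b
          for g, f in genotype_freqs(0.3).items()}

# Prevalence = sum over genotype combinations of P(genotype) * penetrance.
prevalence = sum(locus1[g1] * locus2[g2] * penetrance[g1 + g2]
                 for g1 in locus1 for g2 in locus2)
print(round(prevalence, 4))

With these assumed frequencies, the computed overall prevalence is about 0.06, even though individual genotype combinations range from 0.4% to roughly 10%.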
2.2 Statistical and Machine Learning Algorithms
The disciplines of Statistics and Machine Learning (ML) have been studying these problems in
the last few years. We now survey a set of algorithms from both Statistics and ML applied to
epistasis problems.
Feature Selection Algorithms
Ant Colony Optimization
Ant Colony Optimization (ACO) [Dor92] is a search wrapper that explores the same mechanism
seen in ant colonies to find shortest paths. This means that it uses a particular classifier to score
subsets of variables based on their relation to the class variable. Basically it transforms the optimization problem into a problem of finding the best path on a weighted graph [Ste12]. In this
context, it uses expert knowledge added in the "pheromones" to select SNPs with better expert
knowledge scores, calculated using fitness functions [GWM08]. For epistasis analysis, ACO is used to
filter SNPs by randomly searching the SNP space and keeping only the ones that are most
relevant to the phenotype.
Classification Trees
Classification trees can be used as a feature selection algorithm by creating an upside down tree
[ZB00] where each node represents a test on an attribute or a set of attributes, and each edge represents the possible outcome value of the test in the "parent" node [NKFS01]. This representation
of a tree emphasizes the connections between attributes and therefore it can correlate to possible
relations between a disease and a connection of attributes [CKS04]. However, it can only use
selections of attributes that somehow connect to each other; it skips attributes that might have a
pure interaction with the disease [Cor09].
ReliefF
ReliefF [RSK03] and its modified version Tuned ReliefF (TuRF) [MW07] are filtering algorithms.
The basic idea of ReliefF is to estimate the quality of an attribute by how well its genotype
values distinguish between instances that are near each other. If an attribute has a different value
in an instance and in a neighbour of the same class (a hit), the attribute separates two instances of
the same class and its quality estimate is lowered. Similarly, if it has a different value in an instance
and in a neighbour of a different class (a miss), its quality estimate is increased. ReliefF can deal
with incomplete and noisy data by searching for k neighbours with the same class and k neighbours
with different classes.
Tuned ReliefF is an improvement over the original algorithm, by removing the worst attributes
and recalculating the estimations of attributes at each step.
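A minimal sketch of the weight update just described is given below. It is a simplified single-neighbour variant on 0/1/2 genotype codes (the full ReliefF uses k hits and k misses and handles missing values); the function name and defaults are illustrative only.

import numpy as np

def relieff_weights(X, y, n_iter=100, seed=0):
    # X: (individuals x SNPs) genotype matrix coded 0/1/2; y: binary phenotype.
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(n_iter):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1).astype(float)
        dist[i] = np.inf                                   # exclude the instance itself
        hit = np.where(y == y[i], dist, np.inf).argmin()   # nearest neighbour, same class
        miss = np.where(y != y[i], dist, np.inf).argmin()  # nearest neighbour, other class
        # Penalise attributes that differ from the hit, reward those that differ from the miss.
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_iter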
Evaporative Cooling
Based on the ReliefF [KR92] algorithm, evaporative cooling removes ∆N of the least informative
attributes using classification accuracy [MRW+ 07][MCGT09]. The energy in the system is given
by:
⟨ε⟩_N = ⟨ε⟩_{N_0} (N / N_0)^η    (2.1)

where ⟨ε⟩ is the average “energy” density of the system, N_0 is the number of attributes before
the evaporation step and η is an adjustable parameter related to the evaporation rate, allowing for
a slow evaporation for higher values and a fast evaporation for lower values, which can lead to
a collection of suboptimal attributes. Evaporative Cooling is often used as a wrapper filter for
attribute selection. This means that the variables are scored based on their predictive power.
Genetic Programming for Association Studies
Genetic Programming for Association Studies (GPAS) works by searching variables, mapping
them to new boolean variables. By randomly choosing two individuals consisting of one randomly
selected literal, a form of genetic algorithm is then applied to select, based on a fitness function,
the score of the current generation in the population [Nun08]. A new, customized form of GPAS
that detects interactions involving a higher number of SNPs has since been developed [NBS+ 07].
Random Forest
The purpose of random forest in this context is to select, according to the various trees formed
by the algorithm, the main attributes that are important to a disease. Random forest is a fast
algorithm that can be applied in parallel, with many customizable parameters, such as the number
of trees, the number of instances to be used at each split and the number of permutations to
assess variable importance [Bre01]. Random forest is an ensemble algorithm that creates several
bootstrap samples from a data set with the same size as the original sample. For each bootstrap, a
tree is grown, considering only a small random set of attributes at each node [LHSV04]. Instances
that were not used in the training phase are then selected to estimate the prediction error. By using
a random forest instead of a single tree, there is an improvement in the classification accuracy
[BDF+ 05].
Random Jungle (RJ) [SKZ10] is an improved approach to random forest that uses parallel processing, making it faster and more viable on larger data sets. Even without parallel processing, RJ
is faster and uses less memory than the standard random forest, implementing variable backward
elimination.
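As an illustration of how a random forest can rank SNPs by importance, the sketch below uses scikit-learn on randomly generated genotype data; scikit-learn and the parameter values are stand-ins chosen for the example, not tools or settings used in this dissertation.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 50))   # 200 individuals x 50 SNPs, coded 0/1/2
y = rng.integers(0, 2, size=200)         # binary phenotype (random, for illustration)

forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                oob_score=True, random_state=0)
forest.fit(X, y)

ranking = np.argsort(forest.feature_importances_)[::-1]   # SNPs ranked by importance
print("top SNPs:", ranking[:5], "OOB accuracy:", forest.oob_score_)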
Summary Table
Table 2.2 shows a summary of all the data selection algorithms discussed.
ACO: Search based on the behavior of ants. The optimization problem becomes a problem of finding the best path based on positive feedback.
Classification Trees: Construction of trees that split nodes based on rules that represent a good division in the outcome variable.
Random Forest: Ensemble approach to Classification Trees.
ReliefF: Calculates the value of each attribute based on its value in neighborhood individuals that have the same outcome variable.
Tuned ReliefF: Modified version of ReliefF that removes the worst attributes and recalculates the weight of the remaining attributes.
Evaporative Cooling: Based on ReliefF; removes the least informative attributes with an adjustable parameter related to the evaporation rate.
GPAS: Searches and maps attributes from random individuals to boolean variables and uses a fitness function to evaluate them.
Table 2.2: A description of each data selection algorithm.
Specific Model Creation Algorithms
Backward Dropping Algorithm
Backward Dropping Algorithm (BDA) tries to find the subset of attributes that have the biggest
impact on the class variable [WLZH12]. The class is assumed to be binary and all other attributes
are assumed to be discrete. The explanatory variables are segregated into partitions of subsets,
which are then used to calculate I-score as:
I = Σ_{j ∈ P_k} n_j² (Ȳ_j − Ȳ)²    (2.2)
where P_k is the partition selected with k variables, n_j is the number of observations in partition
element j, and Ȳ_j and Ȳ are the average of Y in partition element j and the overall average, respectively. In the training set,
a large group of explanatory variables are selected to be sampled into subsets. After computing
the I-score, the variable that contributes least to the I-score is dropped. In each round another
variable is dropped, until there is only one variable left. The subset with the highest I-score in
the whole dropping process is returned. This subset represents the set of variables that contribute
the most to a positive state of Y, the response variable.
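The sketch below illustrates the I-score of Equation 2.2 and one backward dropping pass over a NumPy genotype matrix; it is a simplified reading of the procedure, and all names are hypothetical.

import numpy as np

def i_score(X, y, cols):
    # Equation 2.2: I = sum over partition cells of n_j^2 * (mean(Y_j) - mean(Y))^2.
    y_bar = y.mean()
    cells = {}
    for pattern, label in zip(map(tuple, X[:, cols]), y):
        cells.setdefault(pattern, []).append(label)
    return sum(len(v) ** 2 * (np.mean(v) - y_bar) ** 2 for v in cells.values())

def backward_drop(X, y, cols):
    # Drop the variable that contributes least at each round; keep the best subset seen.
    cols = list(cols)
    best_cols, best_score = list(cols), i_score(X, y, cols)
    while len(cols) > 1:
        scores = [i_score(X, y, cols[:k] + cols[k + 1:]) for k in range(len(cols))]
        k = int(np.argmax(scores))
        cols.pop(k)
        if scores[k] > best_score:
            best_cols, best_score = list(cols), scores[k]
    return best_cols, best_score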
Bayesian Epistasis Association Mapping
Bayesian Epistasis Association Mapping (BEAM) receives genotype markers as input and determines the probability of each marker being associated with the disease, through a Markov Chain
Monte Carlo (MCMC), independently or in epistasis with another marker, and creates partitions
with those markers [ZL07]. It classifies these markers into three categories: SNPs unassociated
with the disease, SNPs associated with the disease independently and SNPs jointly associated with
the disease in epistasis [WYY12]. A B statistic was developed to show the statistical relevance of
the associations made with the disease. BEAM searches for epistasis involving interactions of 2 or 3
SNPs. This is a hypothesis-testing procedure, testing each marker for significant interactions. The
B statistic is defined by:
B_M = ln [P_A(D_M, U_M) / P_0(D_M, U_M)] = ln [ P_join(D_M)(P_ind(U_M) + P_join(U_M)) / (P_ind(D_M, U_M) + P_join(D_M, U_M)) ]    (2.3)
where M represents each set of k markers, representing different complexities of interactions. DM
and UM are genotype data from M cases and controls and P0 (DM ,UM ) and PA (DM ,UM ) are the
Bayes factors. Pind is the distribution that assumes independence among markers in M and Pjoin is
a saturated joint distribution of genotype combinations among all markers in M.
BEAM3.0 is the third iteration of the BEAM algorithm and introduces multi-SNP associations and
high-order interactions flexibility, using graphs, reducing the complexity and increasing the power.
BEAM3 produces cleaner results with improved mapping sensitivity and specificity [ZL07]. The
algorithm is written in C++.
BNMBL
BNMBL is a Bayesian Network that assumes SNPs can either be Adenine(A) and Guanine(G) or
Cytosine(C) and Thymine (T), depending on the nucleotide base, and therefore can only assume
three possible values in the genotype: AA, GG or AG, because A is the same as T, and C is
the same as G, in this context. A directed acyclic graph (DAG) model is created for each data
item to assign a probability of the relationships between SNP. Figure 2.2 shows an example of
a probabilistic model of the relationship between SNPs and the disease D. Using only (1/2) log2 m
bits for each conditional probability, where m is the number of data items, the penalty calculated
in Equation 2.4 is applied in the scoring phase to each DAG, where k is the number of SNPs
[JNBV11].
(3k/2)·log2 k + (m/3)·(2k/2)·log2 m    (2.4)
Figure 2.2: Bayesian Network. Nodes represent SNPs. [JNBV11]
Boolean Operation-based Screening and Testing
Boolean Operation-based Screening and Testing (BOOST) converts the data representation into a
boolean type, using logic operators [Weg60], which allows faster operations and a smaller usage of
memory [WYY+ 10a]. The algorithm uses a pruning approach by removing interactions which are
statistically irrelevant. The ratio at which the pruning occurs is based on the difference between
the full logistic regression model:
log [P(Y = 1 | X_l1 = i, X_l2 = j) / P(Y = 2 | X_l1 = i, X_l2 = j)] = β_0 + β_i^{X_l1} + β_j^{X_l2} + β_ij^{X_l1 X_l2}    (2.5)

and the main effect logistic regression model:

log [P(Y = 1 | X_l1 = i, X_l2 = j) / P(Y = 2 | X_l1 = i, X_l2 = j)] = β_0 + β_i^{X_l1} + β_j^{X_l2}    (2.6)

where X_l1 and X_l2 are genotype variables and i and j are one of the three possible states (0, 1, 2)
[WLFW11]. The algorithm is written in C. A GPU version of the algorithm was developed
(GBOOST) [YYWY11] providing a 40-fold speedup compared to that of BOOST running in a
CPU.
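The sketch below illustrates only the boolean data representation idea: each SNP column is packed into three bit vectors (one per genotype value), so the 9-cell case/control contingency table of a SNP pair can be filled with bitwise ANDs and bit counts. It is not the actual BOOST implementation, and the names are illustrative.

def popcount(x):
    return bin(x).count("1")

def to_bitvectors(genotypes):
    # genotypes: sequence of 0/1/2 codes -> three Python ints used as bit vectors.
    bits = [0, 0, 0]
    for pos, g in enumerate(genotypes):
        bits[g] |= 1 << pos
    return bits

def pair_counts(snp_a, snp_b, phenotype):
    # 3 x 3 x 2 counts: genotype of SNP a, genotype of SNP b, control (0) or case (1).
    a, b = to_bitvectors(snp_a), to_bitvectors(snp_b)
    cases = to_bitvectors([1 if p else 0 for p in phenotype])[1]   # bit vector of cases
    table = [[[0, 0] for _ in range(3)] for _ in range(3)]
    for i in range(3):
        for j in range(3):
            both = a[i] & b[j]
            table[i][j][1] = popcount(both & cases)
            table[i][j][0] = popcount(both) - table[i][j][1]
    return table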
Grammatical Evolution Neural Networks
Based on neural networks, Grammatical Evolution Neural Networks (GENN) uses instructions
and a fitness function to train for classification problems related to genetics [TDR10a]. Based on
genetic algorithms, the populations within the data are heterogeneous and go through a process of
pairing, crossover and mutation to find the best Neural Network (NN) solution, which translates to
finding influential SNPs and correctly evaluating network weights. As the name suggests, linear
genomes and grammars are used to define the population. Grammar is used to increase diversity,
by separating the genotype from the phenotype [TDR10b]. GENN uses a Genetically Programmed
Neural Networks (GPNN) approach to optimize the NN selection, which is an improvement on
the NN architecture using genetic programming. This means that there are binary expression trees
that are evolved in a tree-like structure, fitting into the NN architecture.
Information Interaction Method
Information Interaction Method (IIM) is an exhaustive algorithm that searches for all possible
pairs of SNPs to find relations between them and the expression of the phenotype [OSL13]. If
the synergy between the pair and the phenotype is above a user-defined threshold, then there is a
possible correlation between the pair and the phenotype. This is revealed by Equation 2.7.
I(A; B; Y) = I(A; B|Y) − I(A; B) = I(A; Y|B) − I(A; Y) = I(B; Y|A) − I(B; Y)    (2.7)
Associations between single SNPs and a given phenotype are also tested by applying a mutual
information method, explained in Equation 2.8.
I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X) + H(Y) − H(X, Y) = H(X, Y) − H(X|Y) − H(Y|X)    (2.8)
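A small sketch of how Equations 2.7 and 2.8 can be estimated from genotype and phenotype vectors is shown below, using plug-in (empirical) entropies; the function names are illustrative.

import numpy as np
from collections import Counter

def entropy(*columns):
    # Empirical joint entropy, in bits, of the given columns.
    counts = Counter(zip(*columns))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_info(x, y):                        # Equation 2.8
    return entropy(x) + entropy(y) - entropy(x, y)

def interaction_info(a, b, y):                # Equation 2.7: I(A;B;Y) = I(A;B|Y) - I(A;B)
    i_ab_given_y = entropy(a, y) + entropy(b, y) - entropy(a, b, y) - entropy(y)
    return i_ab_given_y - mutual_info(a, b)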
Meta-Analysis Gene-set Enrichment of variaNT Associations
Meta-Analysis Gene-set Enrichment of variaNT Associations (MAGENTA) consists of four steps:
mapping SNPs to genes, assigning a score to each gene association, applying a correction of ambiguous gene association scores and finally a statistical test is made to find predefined biologically
relevant gene sets in the association scores, compared to randomly sampled gene sets of identical
size [SGM+ 10]. Instead of receiving genotype data, MAGENTA receives p-values of SNPs as an
input. The gene association score is computed from regional SNP p-values.
Multifactor Dimensionality Reduction
Multifactor Dimensionality Reduction (MDR) is one of the most popular methods for the detection
of interactions between SNPs. MDR receives two parameters: the N number of attributes with
the strongest connection to the disease to be selected and the T threshold ratio for affected to
unaffected individuals to distinguish high risk from low risk genotype combinations [HRM03].
MDR uses cross validation and in the training data set of each fold determines the high/low risk
groups. After calculating the misclassification error using the test data, the resulting prediction
error rate is the average of all the folds. In the end, the n-order combination with the minimum
average prediction error and the maximum cross-validation accuracy from all the dimensions is
selected [CLEP07]. The odds ratio for the best combinations is used to generate bootstrap data.
After calculating the odds ratio for the best combination, the confidence intervals are constructed
by using empirical distribution of the odds ratio [Moo04]. The combinations of high risk loci are
the ones that have a stronger connection to the disease outcome [RHR+ 01].
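A compact sketch of the MDR relabelling and cross-validated error estimate for a single SNP pair follows; the fold assignment and threshold handling are simplified, and the names are hypothetical.

import numpy as np

def mdr_error(snp1, snp2, y, T=1.0, n_folds=10):
    # snp1, snp2: genotype vectors coded 0/1/2; y: binary phenotype (1 = case).
    X = np.column_stack([snp1, snp2])
    y = np.asarray(y)
    fold = np.arange(len(y)) % n_folds
    errors = []
    for f in range(n_folds):
        train, test = fold != f, fold == f
        high_risk = set()
        for cell in {tuple(r) for r in X[train]}:
            in_cell = (X[train] == cell).all(axis=1)
            cases = int(y[train][in_cell].sum())
            controls = int(in_cell.sum()) - cases
            if cases > T * max(controls, 1):          # high-risk genotype combination
                high_risk.add(cell)
        pred = np.array([tuple(r) in high_risk for r in X[test]], dtype=int)
        errors.append(float((pred != y[test]).mean()))
    return float(np.mean(errors))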
Model-Based Multifactor Dimensionality Reduction
Model-Based MDR (MB-MDR)[MVV11] tries to overcome many of the drawbacks from the
original algorithm MDR such as missing important interactions due to sampling too many cells
together and only analysing at most one significant epistasis model. MB-MDR merges multi-locus
genotypes that have a significant high or low risk based on testing, rather than a threshold value.
Unassociated loci are placed in a ’No Evidence for risk’ class. This algorithm uses a significance
assessment, correcting type I errors, and evaluating each SNP with a Wald statistic test [MVV11].
The algorithm is written in R. MB-MDR process can be divided into the following steps:
1. Multi-locus cell prioritization - Each two-locus genotype is assigned to either High risk,
Low risk or No Evidence of risk categories.
2. Association test on lower-dimensional construct - The result of the first step creates a new
variable with a value correlated to one of the categories. This new variable is then compared
with the original label to find the weight of high and low risk genotype cells.
3. Significance assessment - This stage tries to correct the inflation of type I errors after the
combination of cells into the weight of High risk and Low risk.
MB-MDR can also be adjusted to consider main effects within interactions.
Screen and Clean
Screen and clean (S&C) [WDR+ 10] is a recent algorithm divided into two main parts: screening
part and cleaning part. The algorithm creates a dictionary with all SNPs and splits the data into
stage 1 data for screening and stage 2 data for cleaning. In stage 1, the data is screened using the
logistic regression model in Equation 2.9 to find SNPs.
g(E[Y|X]) = β_0 + Σ_{j=1}^{N} β_j X_j    (2.9)
where X_j is the encoded genotype value (0, 1 or 2), Y is the encoded phenotype (0 or 1), N is
the number of measured SNPs, g is an appropriate link function, and S = {j : β_j ≠ 0, j ∈ 1, ..., N}
is the set of terms associated with the phenotype as main effects [WLFW11]. According to the
selected SNPs, Screen and Clean tries to find relevant interacting SNPs that fit into the following
interaction model:
g(E[Y|X]) = β_0 + Σ_{j=1}^{N} β_j X_j + Σ_{i<j; i,j=1,...,N} β_ij X_i X_j    (2.10)

where S = {(i, j) : β_ij ≠ 0, (i, j) ∈ 1, ..., N} is the set of terms associated with the phenotype as
epistasis. In stage 2 (clean), false positives are controlled by using the stage-2 data and removing SNPs
with p-values higher than a predetermined threshold (α). The algorithm is written in R.
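The screening stage can be approximated as in the sketch below, which keeps the SNPs whose marginal association passes a significance threshold; a per-SNP chi-square test is used here in place of the logistic fit of Equation 2.9, purely to keep the example short, and the names are illustrative.

import numpy as np
from scipy.stats import chi2_contingency

def screen(X, y, alpha=0.05):
    # X: (individuals x SNPs) genotypes coded 0/1/2; y: 0/1 phenotype (stage-1 data).
    kept = []
    for j in range(X.shape[1]):
        table = np.zeros((3, 2))
        for g, label in zip(X[:, j], y):
            table[int(g), int(label)] += 1
        table = table[table.sum(axis=1) > 0]      # drop genotype rows with no individuals
        if table.shape[0] < 2:
            continue
        chi2, p_value, dof, expected = chi2_contingency(table)
        if p_value < alpha:
            kept.append(j)
    return kept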
SNPHarvester
SNPHarvester is a stochastic search algorithm. It divides the SNPs into three different categories:
unrelated to the disease, related independently, and contributing jointly to the disease with no
independent effect. SNPHarvester is based on multiple path generation with a generic score function
[YHW+ 09]. The first point in each path is generated randomly. Using a created local search algorithm, SNPHarvester finds the local optimum, usually in two or three iterations, and the significant
groups of SNPs in each path, according to a scoring function. This function is a popular χ 2 value
score function [RHR+ 01]. After the scoring, the algorithm randomly picks k SNPs to form an active set, leaving
the rest as a candidate set. Each SNP in the active set is then substituted with one from the candidate set in order to maximize χ². After finding the maximized candidate, the algorithm removes the selected
SNP group and repeats the procedure to identify M groups, where M is a predetermined parameter.
The selected M groups are then fitted into the L2 penalized logistic regression model
L(β_0, β, λ) = −l(β_0, β) + (λ/2)‖β‖²    (2.11)

where l(β_0, β) is the binomial log-likelihood and λ is a regularization parameter [WLFW11].
SNPHarvester is written in Java.
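The path-wise local search can be sketched as below, with the group-level scoring function (for example, the χ² statistic mentioned above) passed in as a parameter; the structure follows the description in the text, not the released Java code, and the names are hypothetical.

import numpy as np

def harvest_one_path(X, y, score, k=2, seed=0):
    # One search path: random active set of k SNPs, then greedy swaps with
    # candidate SNPs while score(X, y, group) improves.
    rng = np.random.default_rng(seed)
    n_snps = X.shape[1]
    active = [int(s) for s in rng.choice(n_snps, size=k, replace=False)]
    best = score(X, y, active)
    improved = True
    while improved:                     # usually converges in two or three iterations
        improved = False
        for pos in range(k):
            for cand in range(n_snps):
                if cand in active:
                    continue
                trial = active[:pos] + [cand] + active[pos + 1:]
                s = score(X, y, trial)
                if s > best:
                    active, best, improved = trial, s, True
    return active, best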
SNPRuler
SNPRuler is a rule-based algorithm. Epistatic interactions promote a set of rules. These rules are
implications of an interaction between SNPs and the disease. To find the rules, SNPRuler uses
trees that represent genotypes in each node, with the leaves representing the phenotypes, creating
a path of epistatic interaction. For each rule, a 3x3 table is generated based on the probability of
each possible genotype combination and phenotype [WYY+ 10b]. When the number of SNPs is large,
an upper bound is used to prune the tree instead of performing an exhaustive search. This threshold is a χ²
test statistic [RHR+ 01]. However, this pruning can wrongly discard many true-positive
epistatic interactions. This algorithm was developed in Java.
TEAM
Tree-based Epistasis Association Mapping (TEAM) is essentially an exhaustive algorithm. TEAM
uses a permutation test to create a contingency table with all the calculated p-values [ZHZW10].
To reduce computation costs, if two SNPs have very frequent genotype values, the computation is
shared among the individuals with the same genotype. TEAM only works with two-locus interactions. It uses a tree-based representation, where nodes contain SNPs and the edges represent
the number of individuals with different values on the two SNPs [WLFW11], further reducing
computation costs when the values are the same. This algorithm is written in C++.
Summary Table
Table 2.3 shows a summary of all model creation algorithms previously discussed.
BDA: Uses an iterative selection process where the SNPs most significant to the disease are selected.
BEAM: Determines the probability of a given SNP being associated with a disease independently, or in epistasis with N SNPs.
BNMBL: Creates a DAG with the probabilistic model of the relationship between SNPs and the disease.
BOOST: Converts data into a boolean type, pruning statistically irrelevant SNPs.
GENN: Creates NNs which are evolved, based on a genetic algorithm approach, to find the NN with the best accuracy.
IIM: Searches all possible pairs of SNPs, finding a relation between each pair and the phenotype above a specified threshold.
MAGENTA: Calculates the statistical relevance of SNPs based on regional SNP p-values instead of genotype data.
MDR: Applies cross-validation, training on the data to find high-risk groups of SNPs.
S&C: Creates a dictionary of all the SNPs and divides the data into two stages: screening, to select SNPs according to a logistic regression model, and cleaning, to decrease false positives.
SNPHarvester: Stochastic search algorithm classifying SNPs according to their relation with the disease, using a random local search algorithm.
SNPRuler: Rule-based algorithm, defining rules based on epistatic interactions.
TEAM: Exhaustive algorithm, creating a table with p-values for each pair of SNPs, and using a tree-based representation to place the results.
Table 2.3: A description of each model creation algorithm designed for this problem.
Generic Model Creation Algorithms
Ensembles
There are many types of ensemble algorithms created by joining several kinds of model creation
algorithms to try and make a more accurate and reliable model. In this context, one of the most
recent ensemble methods [YHZZ10] was created using a genetic algorithm together with a few
classifier algorithms. Several subsets of SNPs are selected by applying the genetic algorithm a
predetermined number of times. These subsets are analyzed and ranked based on the number
of times each SNP combination appears in the selected subsets. After acquiring the fitness for
every SNP subset, the chromosome with the highest fitness is selected, represented by the SNP
subset contained in the chromosome. The genetic algorithm then applies selection, crossover and
mutation to the chosen subset. Considering the large amount of SNPs, in order to reduce the noise
and optimize the process, two classifier strategies and a diversity promoting strategy are used to
preselect and evaluate the SNPs. Blocking uses M classification algorithms to eliminate differences
caused by noise. Voting is used to balance and increase accuracy in evaluating the fitness of SNPs.
Double fault diversity tries to evaluate the diversity between classifiers by calculating the fitness
of misclassified SNPs, focusing on the diversity between them. This particular approach uses
decision-tree-based classifiers and instance based classifiers.
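As a generic illustration of the voting idea (not the specific ensemble strategy of [YHZZ10], nor the ensemble developed later in this dissertation), the sketch below keeps only the SNP pairs reported by a minimum number of detectors; all names and SNP indices are made up.

from collections import Counter

def vote(detections, min_votes=3):
    # detections: algorithm name -> set of reported SNP pairs.
    counts = Counter(pair for pairs in detections.values() for pair in pairs)
    return {pair for pair, n in counts.items() if n >= min_votes}

reported = {
    "BOOST":        {(3, 17), (8, 40)},
    "BEAM3":        {(3, 17)},
    "SNPRuler":     {(3, 17), (12, 29)},
    "SNPHarvester": {(8, 40), (3, 17)},
    "TEAM":         {(3, 17), (8, 40)},
}
print(vote(reported))   # -> {(3, 17), (8, 40)}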
ILP
Inductive Logic Programming (ILP) algorithms work by creating hypotheses that are encoded as
First Order clauses. ILP is characterised by an expressive representation language (First Order
Logic, FOL) used to represent both data and hypotheses. ILP systems are widely used in the bioinformatics field
and produce good results, but have slow runtimes, which affects their scalability.
Initially, ILP algorithms encode background knowledge of the problem as logic propositions.
The training is then made with positive and negative examples. Hypotheses are then generated by
creating new logic propositions using the background knowledge and trained examples.
Success is measured by the classification accuracy of a given hypothesis and the transparency
of a formulated hypothesis, which means the ability to be understood by humans [LD94].
There are many implementations of this type of algorithm. One of the most used systems is A
Learning Engine for Proposing Hypotheses (Aleph) [Sri01]. This algorithm works in 4 different
steps:
1. Select example. Selects an example to be generalized and stops if none exist.
2. Build most-specific clause. Based on the example selected, the most specific clause that
respects the language restrictions is constructed.
3. Search. After creating the most specific clause, a more generalized clause is searched for
in a subset of the clauses in the previous clause.
4. Remove redundant. The clause with the best score is then added to the theory, removing
the redundant examples.
K-Nearest Neighbor
K-Nearest Neighbor (K-NN) is a classification and regression algorithm that determines the value
of new data based on its approximation to other instances [HK06]. In classification, the closest
neighbors to the new instance determine the class of that instance. In regression, the result is
the average of the nearest neighbors. K is the number of the nearest neighbors to be used in the
calculation of new results. For this context, K-NN is mostly used as an attribute selection method.
Methods such as ReliefF use an approach based on the K-NN algorithm.
Naïve Bayes
Bayesian approaches are amongst the most common in this context. The naïve approach assumes
independence between features. There are many optimizations to reduce the naivety [PV08], such
as selecting subsets of attributes that are considered to be conditionally independent [LIT92], or
extending the structure of Naïve Bayes (NB) to represent dependencies in attributes [WBW05].
NB works using Bayesian networks by assigning probabilities to each event, using the model
trained previously. The final result is then chosen based on the most probable outcome. Specific
implementations of this nature can be seen in BEAM and BNMBL.
Neural Networks
NNs are a type of classification and regression algorithm inspired by the central nervous system.
An example of these NNs is the Multilayer Perceptron, which is the most
used type of NN. In a Multilayer Perceptron, there is an input layer, which feeds a second
layer of nodes that represent neurons. These intermediate layers are also called hidden layers. The
last layer, the output layer, represents the prediction of the network. Each layer is densely
connected. Each connection is weighted based on the relations between nodes in the training phase
[HK06]. An example of this is illustrated in Figure 2.3.
Figure 2.3: An example of a Neural Network. [HK06]
NNs can have multiple layers, which can be used in nonlinear problems [DHS01]. There
are some recent implementations of NNs in the discovery of epistatic relations, such as GENN
[HDF+ 13].
Support Vector Machines
Support Vector Machines (SVM) is a classification algorithm that divides data based on pattern
recognition methods. In the training phase, data is divided into two parts, mapping them according to their attributes. SVM then tries to find the best nonlinear mapping to separate data by a
hyperplane [DHS01]. This hyperplane is mapped in order to find the best separation possible
by increasing the distance in the gap between the classes. If a linear classification is not possible, SVM can use the kernel trick to divide the data by increasing the dimension of the problem
[ABR64]. A regression or multiple class alternative of SVM is also available by transforming
the problem into multiple binary class problems [DKN05]. There are no specific implementations
of SVM in this context; however, there are many methods that use pattern recognition in their
implementation.
Summary Table
Table 2.4 contains a summary of all the generic algorithms discussed. A more technical table is
available in Table 2.11.
Ensemble: Many model creation algorithms are joined to "vote" on the most probable outcome, increasing the accuracy and creating a more reliable meta-model.
ILP: Uses logic programming, representing positive and negative examples, background knowledge and hypotheses that use trained examples and background knowledge to classify accurately and transparently.
K-NN: Uses trained data to classify new instances based on the proximity to previously classified neighbors. The outcome is obtained from the closest neighborhood of the new instance.
Naïve Bayes: Creates Bayesian networks, calculating the probability of a relation between events, assuming independence between attributes.
NN: Inspired by the central nervous system, creates a graph using the trained data and, receiving an input, calculates the most weighted path to an outcome node based on its connections to other nodes.
SVM: Maps the training data according to its attributes and finds the hyperplane that best separates the classes, increasing the dimension of the problem when a linear separation is not possible.
Table 2.4: A description of each generic model creation algorithm.
Statistical Methods
Bonferroni Correction
Bonferroni Correction is a conservative approach to multiple comparison testing [BH95]. It is the
simplest correction for selecting a predetermined number of hypotheses based on their statistical
relevance. However, there is no assumption of dependency between the tests. For SNPs, the p-value is calculated
using

p_corrected = 1 − (1 − p_uncorrected)^n    (2.12)

where n is the number of hypotheses tested. This can be further simplified to

p_corrected = n · p_uncorrected    (2.13)

when n · p_uncorrected ≪ 1 [Cor09].
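A small worked example with made-up numbers: for n = 1000 tests and an uncorrected p-value of 5 × 10⁻⁵, the two formulas give nearly the same corrected value.

n, p = 1000, 5e-5
p_exact = 1 - (1 - p) ** n      # Equation 2.12: about 0.0488
p_approx = n * p                # Equation 2.13: 0.05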
Linear Regression
Linear regression models try to fit a straight line to the data points. As in every regression problem, the label is numeric. This label is modeled as a linear function of other random variables, as shown in Equation 2.14. The weights attributed to each variable are calculated from the training data [WFH11].
Figure 2.4: A logit transformation (a) and a possible logistic regression function resulting from the logit transformation (b). [WFH11]
x = w_0 + w_1·a_1 + w_2·a_2 + ... + w_k·a_k    (2.14)
where x is the class outcome, a_i is each attribute, and w_i is the weight of that attribute. In the case of multiple linear regression, more than one variable is involved in predicting the label [SGLT12]. Linear regression models are usually fitted using the least squares approach, minimizing the squared errors between the observed values and the fitted values [HK06]. In this context, linear regression models are often used as fitness functions to test the score of SNPs and their statistical relevance.
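The following minimal sketch, using NumPy on synthetic data, fits the weights of Equation 2.14 by least squares; the variable names are illustrative only.

# A minimal least-squares fit of Equation 2.14 (synthetic data, illustrative only).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))                 # attributes a_1..a_3 for 100 instances
true_w = np.array([1.5, -2.0, 0.5])
x = 0.7 + A @ true_w + rng.normal(scale=0.1, size=100)  # numeric label with noise

# Append a column of ones so the intercept w_0 is estimated as well.
A1 = np.column_stack([np.ones(len(A)), A])
w, *_ = np.linalg.lstsq(A1, x, rcond=None)    # minimizes the squared errors
print(w)  # should be close to [0.7, 1.5, -2.0, 0.5]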
Logistic Regression
Linear functions can be used in classification problems by assigning 1 to the training instances that belong to the class and 0 to the instances that do not belong to the class [WFH11]. A linear function is still applied to new instances, and the class closest to the resulting value is selected. Since these values are not constrained to the interval from 0 to 1, a logit transformation is applied, mapping the variable to a value ranging from 0 to 1. Figure 2.4 illustrates the relation between the logit transformation and the final logistic regression function. To evaluate the logistic regression model, the log-likelihood is used instead of the squared error. The formula used to evaluate the model is
sum_{i=1..n} [ (1 − x^(i)) log(1 − Pr[1 | a_1^(i), a_2^(i), ..., a_k^(i)]) + x^(i) log(Pr[1 | a_1^(i), a_2^(i), ..., a_k^(i)]) ]    (2.15)

where x^(i) is either 0 or 1 and a_j^(i) represents each attribute of instance i.
Logistic regression, in this case, is used as a penalizing model [Ste12] [NCS05], and as a
statistical model when the outcome is binary [PH08][TJZ06][Cor09].
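As a sketch of how Equation 2.15 can be evaluated in practice, the code below fits a logistic regression model on synthetic data (assuming scikit-learn is available) and computes the log-likelihood directly from the predicted probabilities.

# Evaluating the log-likelihood of Equation 2.15 for a fitted logistic regression
# model (synthetic data, illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, x_class = make_classification(n_samples=150, n_features=5, random_state=1)
model = LogisticRegression().fit(X, x_class)

# Pr[1 | a_1, ..., a_k] for each instance, clipped to avoid log(0).
pr1 = np.clip(model.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
log_likelihood = np.sum((1 - x_class) * np.log(1 - pr1) + x_class * np.log(pr1))
print(log_likelihood)  # higher (closer to zero) means a better fit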
A variation of this model, called multinomial regression, is used when there are more than two possible outcomes.
Markov Chain Monte Carlo
The MCMC algorithm is used to sample models from a high-dimensional surface. MCMC finds models using a random walk that tries to converge to a target equilibrium distribution [Smi84], creating a sample of the population to be analyzed. This algorithm is often used in Bayesian statistics [SWS10].
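The following minimal sketch shows the random-walk idea behind MCMC with a simple Metropolis sampler targeting a standard normal distribution; it is illustrative only, and is not the sampler used by any of the epistasis tools discussed here.

# A minimal random-walk Metropolis sampler (a simple MCMC variant).
import numpy as np

def target_density(x):
    return np.exp(-0.5 * x * x)   # unnormalized standard normal

rng = np.random.default_rng(0)
samples, x = [], 0.0
for _ in range(10_000):
    proposal = x + rng.normal(scale=1.0)          # random-walk proposal
    accept_prob = min(1.0, target_density(proposal) / target_density(x))
    if rng.random() < accept_prob:                # accept or keep the current state
        x = proposal
    samples.append(x)

print(np.mean(samples), np.std(samples))  # should approach 0 and 1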
Summary Table
Table 2.5 shows a brief description of the auxiliary algorithms.
Algorithms             Description
Bonferroni Correction  Selects a predetermined number of hypotheses based on their statistical relevance. Used in model creation algorithms.
Linear Regression      Creates a straight line connecting SNPs to find their relevancy. Used for numerical values.
Logistic Regression    Applies a linear function to assess how new data is evaluated, based on trained data. Used for binary values.
MCMC                   Uses a random walk to find the statistical relevancy of SNPs. Used in Bayesian models.

Table 2.5: A description of auxiliary algorithms used in model creation or data selection.
2.3 Data analysis evaluation procedures and measures
Data analysis evaluation measures
Type I error and Type II error
Type I errors refer to the acceptance of a false relation. In this case, it refers to the acceptance of
a relation between an SNP or interaction of SNPs and the disease that does not exist in fact. This
can also be referred to as a false positive. Type II errors refer to the rejection of true relations. This
can also be referred to as a false negative.
Accuracy
The accuracy of a classifier algorithm is determined by how often it correctly predicts future data. This may be misleading when overfitting occurs. To prevent this, data evaluation procedures are employed and the final accuracy is the average of the accuracies obtained from each iteration [Pow11]. For ensemble methods, a voting process takes place and the final result is the most voted outcome.
Accuracy = (true positives + true negatives) / (true positives + false positives + true negatives + false negatives)    (2.16)
Precision
Precision measures the fraction of the results returned by the model that are actually relevant [Pow11]. In this context, this means the number of SNPs or genotypes correctly identified as related to a disease by the model, in relation to all the SNPs or genotypes identified as related to a disease by the model.
Precision = true positives / (true positives + false positives)    (2.17)
Recall
Recall measures the fraction of all relevant results that are actually retrieved [Pow11]. In this context, this means the SNPs or genotypes correctly identified as related to a disease by the model, in relation to all the SNPs or genotypes that are actually related to a disease.
Recall = true positives / (true positives + false negatives)    (2.18)
F-measure
The F-measure combines precision and recall. The F1 measure weights precision and recall equally, which can be a problem when they do not have the same relevance. This is the case for the epistasis problem, where Type I errors should be prioritized [Pow11].
F1 = 2 · (precision · recall) / (precision + recall)    (2.19)
ROC curve
The receiver operating characteristic (ROC) is a graphical representation of the relation between the true positives and false positives of a binary classifier. Multiple classifiers can be plotted to compare results [Pow11]. The area under the curve (AUC) corresponds to the probability that the classifier ranks a randomly chosen true positive above a randomly chosen false positive [WFH11]. Figure 2.5 shows an example of a ROC curve; the greater the AUC, the better. The ROC curve is characterized by the relation between sensitivity and specificity, which can be seen in Equations 2.20 and 2.21.
Sensitivity = number of true positives / (number of true positives + number of false negatives)    (2.20)

Specificity = number of true negatives / (number of true negatives + number of false positives)    (2.21)
Figure 2.5: An example of a ROC curve, plotting the true positive rate against the false positive rate.
Summary Table
Table 2.6 shows the summary of the various evaluation measures.
Data analysis evaluation procedures
Bootstrapping
Given a dataset of n instances, a bootstrap sample is a new dataset of size n drawn from the original dataset with replacement, therefore creating a dataset different from the original one. The probability for any given instance not to be chosen is (1 − 1/n)^n ≈ e^(−1) ≈ 0.368. Because of the high chance that some instances from the original dataset are not included in the new dataset, these instances are used for testing [Koh95]. Considering the high percentage of probable testing instances, the bootstrap procedure is repeated several times with different samples and the final results are averaged [WFH11].
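A minimal sketch of a single bootstrap iteration, using NumPy and working only with instance indices, illustrates the sampling with replacement and the roughly 36.8% of instances left out for testing.

# One bootstrap iteration: sample n instances with replacement and keep the
# left-out instances for testing (indices only, illustrative).
import numpy as np

n = 1000
rng = np.random.default_rng(0)
boot_idx = rng.integers(0, n, size=n)             # training sample, with replacement
oob_idx = np.setdiff1d(np.arange(n), boot_idx)    # out-of-bag instances for testing

print(len(oob_idx) / n)  # close to (1 - 1/n)^n ≈ 0.368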
Cross-Validation
The dataset is split randomly into k mutually exclusive subsets or folds. Each fold has approximately the same size and is tested once and trained k − 1 times. This means that there are k
iterations of model creation and evaluation. In each iteration, a new subset is selected to become
the test set, while the other k − 1 subsets are used for training. The accuracy is estimated by
acc_cv = (1/n) · sum_{<v_i, y_i> ∈ D} δ(L(D \ D^(i), v_i), y_i)    (2.22)
where D^(i) is the test set that includes instance x_i = <v_i, y_i> and n is the number of instances in the dataset [Koh95].
The error rates of the different iterations are averaged to yield the overall error rate [WFH11]. The
dataset can be stratified to place in each fold the same proportions of labels as the original dataset.
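A minimal sketch of stratified k-fold cross-validation, assuming scikit-learn is available and using a synthetic data set with a placeholder classifier, illustrates the procedure; the averaged accuracy corresponds to the averaged estimate described above.

# A minimal stratified k-fold cross-validation sketch (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

accuracies = []
for train_idx, test_idx in kfold.split(X, y):
    model = GaussianNB().fit(X[train_idx], y[train_idx])      # train on k-1 folds
    accuracies.append(model.score(X[test_idx], y[test_idx]))  # test on the held-out fold

print(sum(accuracies) / len(accuracies))  # averaged accuracy over the k iterations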
Measure        Calculation
Type I error   false positives
Type II error  false negatives
Accuracy       (true positives + true negatives) / (true positives + false positives + true negatives + false negatives)
Precision      true positives / (true positives + false positives)
Recall         true positives / (true positives + false negatives)
F-Measure      2 · (precision · recall) / (precision + recall)
ROC curve      Sensitivity = true positives / (true positives + false negatives); Specificity = true negatives / (true negatives + false positives)

Table 2.6: A description of data analysis measures and how they are calculated.
Leave-one-out
Leave-one-out is an n-fold cross-validation where n is the number of instances in the dataset. In each iteration, a single instance is left out for testing, while all the others are used for training. For classification algorithms, this means that the success rate in each fold is either 0% or 100%. This approach allows the maximum use of data for training, which presumably increases accuracy, and is a deterministic process, because there is no random sampling for each fold. However, this process is computationally expensive and cannot be stratified [WFH11].
Hold-out method
The holdout method consists of the strict reservation of a portion of the data for testing. This means that only part of the dataset is used for training, usually 70%, leaving 30% for testing purposes [Koh95]. This method works well for large datasets but can have a negative effect on small datasets, where the accuracy may be underestimated.
Summary Table
Table 2.7 shows the various data testing procedures.
Algorithms        Description
Bootstrapping     Creates a new dataset with the same number of instances as the original, sampled with replacement, which allows repetition of the same instance. This can be repeated multiple times.
Cross-Validation  Divides the original data into subsets of equal size, leaving one subset for testing. This subset changes with each iteration, iterating through all subsets.
Leave-one-out     Similar to Cross-Validation, but the size of each subset is 1, leaving only one instance for testing, iterating through all instances.
Hold-out method   Reserves a specific amount of data for testing, using the rest for training.

Table 2.7: A description of data analysis procedures.
Data Analysis Methodologies
KDD Process
The Knowledge Discovery in Databases (KDD) Process is the extraction of knowledge from DM
methods using the specification of measures and thresholds [AS08].
The KDD process is interactive and iterative, divided into many components which can be
summarized into 5 steps illustrated in Figure 2.6.
1. The selection step consists of learning the application domain and creating a target data set or selecting data samples for knowledge discovery.
2. The pre-processing stage consists of data cleaning and handling missing data fields.
3. The transformation step allows for data reduction methods that reduce the dimensionality and adapt the data for the model creation algorithms.
4. The data mining stage is where the algorithm for model creation is selected and applied.
5. The final stage, interpretation/evaluation, is where the discovered patterns are interpreted and the performance is measured [FPSS96].
Figure 2.6: The main stages of the KDD process.[FPSS96]
CRISP-DM
The Cross Industry Standard Process for Data Mining (CRISP-DM) methodology consists of a model of the life cycle of a data mining project. This life cycle is illustrated in Figure 2.7. There are 6 main phases in a project [CCK00].
1. The business understanding phase is the initial phase, where the main focus is the understanding of the objectives and requirements of the project from a business perspective, defining an initial plan to solve the data mining problem.
2. Data understanding is the collection and comprehension of the data itself. This is where
the first data characteristics become apparent.
3. The data preparation phase is where preprocessing filters and feature selection methods are applied.
4. During the modeling phase, one or more modeling techniques are applied to create specific
models. Each technique has data preparation requirements which vary for each modeling
technique.
5. At the evaluation stage, the models developed in the earlier phase are then put to the test
using different types of evaluation methods. The results are then analyzed.
6. Finally, the deployment stage is where the knowledge created with the data mining process
is put in practice, either by generating a report or by implementing a repeatable data mining
process for the customer.
SEMMA
SEMMA stands for Sample, Explore, Modify, Model and Assess. SEMMA, like CRISP-DM,
follows a data mining life cycle with 5 stages, consistent with each of the letters of the acronym.
1. The sample stage is where sampling of the data and data selection takes place. This stage is
optional.
2. The explore stage consists of searching the data for anomalies to gain understanding of the
data.
3. The modify stage consists of transforming and shaping the data to serve the needs of the model selection process.
Figure 2.7: The CRISP-DM life cycle.[CCK00]
4. The model stage is where the model creation takes place.
5. The assess stage exists to evaluate the usefulness and the reliability of the created model.
Although the process is independent from the data mining (DM) tool, there are some guidelines
connected to the Statistical Analysis System (SAS) Enterprise Miner software [AS08].
Summary Table
Table 2.8 contains the discussed procedures and their various stages.
KDD                        SEMMA        CRISP-DM
Pre KDD                    ————-        Business understanding
Selection                  Sample       Data understanding
Pre processing             Explore      Data understanding
Transformation             Modify       Data preparation
Data mining                Model        Modeling
Interpretation/Evaluation  Assessment   Evaluation
Post KDD                   ————-        Deployment

Table 2.8: A comparison between the different procedures [AS08].
2.4 Data Simulation and Analysis Software
Data Simulation Software
HAPGEN
HAPGEN is used to simulate case-control datasets of SNP markers. These datasets can encode the main effect and interactions of multiple disease SNPs, and can be further customized to change the number of individuals and SNPs. To simulate phenotypes with interactions between disease SNPs, an R package is supplied, using the data with independent disease SNPs generated by HAPGEN [SSDM09].
GenomeSIMLA
genomeSIMLA can be divided into two main programs: SIMLA (SIMulation of Linkage and Association) and simPEN. SIMLA generates large-scale populations used as samples for case-control datasets, while simPEN creates penetrance tables for the disease specifications. A forward-time population simulation is used to specify many gene-related parameters, allowing a controlled evolutionary process [EBT+ 08]. It can be used in family studies and with unrelated individuals. Penetrance models can be generated, allowing specific allelic frequencies, in purely epistatic interactions or associated with main effects. Each interaction may contain various SNPs. The number of chromosomes, SNPs, and individuals in each dataset is configurable. The prevalence and odds ratio of the disease can be adjusted to allow a more realistic manifestation of the disease model.
The software contains 4 main steps:
1. Pool Generation covers the evolution of the population, together with its chromosomes and allelic frequencies.
2. When a generation has the desired characteristics, a Locus Selection is made to allocate the disease model.
3. The Penetrance Specification is used to measure the risk associated with each configuration of the disease alleles.
4. Finally, the Data Simulation creates the datasets, according to the specified configurations.
Gene-Environment iNteraction Simulator 2
Gene-Environment iNteraction Simulator 2 (GENS2) simulates interactions between two genetic SNPs and one environmental factor [MSA+ 12]. Initially, the population used to generate the datasets is created by a simuPOP script [PA10]. This population evolves through a time simulation, based on the desired number of individuals and allele frequencies. The second step involves the simulation of the disease penetrance model. Finally, according to the risk assessment, a disease status is assigned randomly. Some of the data customization options involve the number of individuals, allelic frequencies, prevalence, and associated risk. A GUI is provided to allow a swift customization and configuration of the data.
Summary Table
Table 2.9 compares the data simulation tools discussed previously.
Feature                              HAPGEN                          GenomeSIMLA                                    GENS2
Dataset types                        case-control                    case-control, pedigree and family              case-control
Interaction types                    main effect and/or epistasis    main effect, epistasis                         epistasis and gene-environment
Order of interactions                X SNPs                          X SNPs                                         2 SNPs and 1 environment factor
GUI                                  No                              Generates HTML to illustrate the population    Yes
Population                           not generated                   forward-time population simulation             time simulation
Customizable number of individuals   Yes                             Yes                                            Yes
Customizable number of SNPs          Yes                             Yes                                            No
Customizable allelic frequencies     Yes                             Yes                                            Yes

Table 2.9: A comparison of the most relevant features of data simulation tools.
Data Analysis Software
Analysis Tool for Heritable and Environmental Network Associations
ATHENA (Analysis Tool for Heritable and Environmental Network Associations) is a software
package designed to create models by analyzing various types of data. The organization of the
software package can be seen in Figure 2.8.
ATHENA receives various types of inputs and consequently uses filtering methods for feature
selection and analytical methods for model creation. The final model consists of the best generated
analytical method.
The main usage of ATHENA is in feature selection and model creation. As a filtering method, it uses the Random Jungle algorithm, a bootstrapped, tree-based variable selection version of RF. For modeling, ATHENA uses computational evolution modeling techniques such as GENN [HDF+ 13], as well as other common regression algorithms.
Figure 2.8: A diagram of the ATHENA software package [HDF+ 13]. Input data (SNPs, microarray, proteomics, sequence data, biomarkers, clinical data) passes through filtering methods (Random Jungle) and analytical methods (Symbolic Regression, Bayesian Networks, SVM, GENN) to produce metadimensional models. In the filtering step, variables are prioritized based on their known biological functions. The analytical methods currently consist of computational evolution modeling techniques, but will be further developed to allow more methods. This analysis allows the combination of different types of data in order to identify multi-variable prediction models that include data from different parts of the whole biological process.
Knowledge Extraction Evolutionary Learning
As the name suggests, Knowledge Extraction Evolutionary Learning (KEEL) is a software tool containing many evolutionary algorithms, used for many typical data mining problems.
KEEL contains the most well-known models in evolutionary learning. These can be used for research purposes, using the built-in automation of experiments, or as an educational tool, with emphasis on the execution time and a real-time view of the algorithms during the data mining process [AFSG+ 09].
The currently available function blocks are:
1. Data Management is used for importing or exporting data into other formats, data edition
and visualization.
2. Design of Experiments is where the experimentation takes place, applying the selected
model, type of validation and type of learning on the selected data sets. This module is
available off-line.
3. Educational Experiments works in a similar way to the previous function block, but can be closely monitored, displaying the learning process of the selected model algorithm. This module is available on-line [AFFL+ 11].
Konstanz Information Miner
The Konstanz Information Miner (KNIME) is a Java graphical workflow editor. The architecture was developed based on three main aspects: visual interactive framework, modularity in the
process, to distribute the development of different algorithms, and easy expandability to add new
processing nodes or views [BCD+ 08].
Version 2.0 of KNIME adds loops in the workflow, new database ports, and support for the Predictive Model Markup Language (PMML), used for storing and exchanging predictive models in XML format [BCD09].
Orange
Developed in Python, Orange is a machine learning and data mining toolbox, containing many
hierarchically-organized data mining components. The main hierarchical blocks are: data management and preprocessing for feature selection and data input, classification, regression, association (rules), ensembles (such as bagging and boosting), clustering, evaluation, and projections.
Classification algorithms include bayesian approaches, SVM, rule induction approaches, classification trees and random forest. Regression methods include linear and lasso regression, partial
least square regression, multivariate regression and regression trees or forests. Evaluation contains
the various procedures for testing and scoring the quality of prediction methods or estimation of
reliability. The projections block is where the visual analysis takes place, with multi-dimensional
scaling and self-organizing maps.
Orange works with python shell scripts. This means that new methods can be created or
existing machine learning components can be combined [DCE13].
PLINK
PLINK is a C/C++ tool set designed to handle GWAS datasets. Due to their high complexity and size, simple methods that can achieve good results with more data are preferred. PLINK measures allele, genotype, and haplotype frequencies.
PLINK offers tools for clustering a population into homogeneous subsets, for the classical multidimensional scaling (MDS) algorithm, and for outlier detection. MDS helps to find similarities by plotting objects in many dimensions while trying to preserve the distances between objects [PNTB+ 07].
A graphical user interface (GUI), gPLINK, offers a framework to manage projects. gPLINK
also provides integration with Haploview, which is a tool used in tabulating, filtering, sorting,
merging and visualizing PLINK GWAS output files [BFMD05].
R
The R project is a statistical computing system. R has a command-line-driven interpreter for the S language, with many extension packages available [Rip01]. The advantage of R is the flexibility to create new algorithms, instead of relying on implemented approaches whose source code is not available. R can also produce high quality graphics and mathematical symbols. Some user interfaces are available as packages or by using Integrated Development Environments (IDEs) and adding R as a plugin [VL12]. R also contains many algorithms provided as packages, such as NN, SVM, and MBMDR.
RapidMiner
RapidMiner is a DM software tool which contains many algorithms for all DM problems and
business analysis. It contains a GUI for creation and editing of data mining processes, following
the CRISP-DM methodology [Jun09].
A modular and pipelined view of the process consists of four stages: an input stage where many formats of data can be imported, a preprocessing stage where filtering and data processing take place, a learning stage using the selected algorithm, and an evaluation stage which contains the performance results of the process. RapidMiner can be extended with plug-ins, which developers can use to create new algorithms.
Weka
Weka is a machine learning workbench and application programming interface (API). Weka has
four interfaces: command line, Knowledge Flow, Explorer and Experimenter [FHT+ 04].
Explorer is the main interface in Weka. It contains different tabs with different types of methods. The tab Preprocess contains filtering methods. Classify contains classifier and regression
algorithms. Cluster and Associate contain clustering algorithms and rule association methods respectively. The select attribute tab contains methods for identifying subsets of attributes that are
predictive of other attributes. The final panel, visualize, allows plotting of pairs of attributes with
many customizable options. The user interface of the Explorer can be seen in Figure 2.9.
In the context of bioinformatics, Weka provides a wide variety of algorithms for classification,
regression, clustering and feature selection.
A recent update added many new methods and reduced execution time by using just-in-time compilers [HFH+ 09].
Summary Table
Table 2.10 contains a summary of all the discussed data analysis software.
Figure 2.9: The Weka Explorer interface.
Tool         GUI            Allows scripting   No. of integrated algorithms
ATHENA       No             Yes                few algorithms
KEEL         Yes            Yes                few algorithms (evolutionary learning algorithms)
KNIME        Yes            Yes                many algorithms and i/o converters
Orange       Yes            Yes                many algorithms
PLINK        Yes (gPLINK)   Yes                PLINK
R            No             Yes                many algorithms (packages)
RapidMiner   Yes            No                 many algorithms and i/o converters
Weka         Yes (3)        Yes                many algorithms

Table 2.10: A comparison of data mining tools.
2.5 Chapter Conclusions
This chapter can be divided into 4 main categories: biology background, statistical and machine
learning algorithms, evaluation measures and procedures, and Data Simulation and Analysis
tools.
The study of the biology concepts and background knowledge is vital to understanding the problem. It supports the data understanding, which translates into a better approach to the problem. How the DNA is organized into chromosomes and divided into genes and SNPs is very important for knowing how epistasis works.
The statistical and machine learning algorithms are divided into feature selection algorithms
and model creation algorithms. The feature selection algorithms may produce different results
depending on the generated model. This means that model creation algorithms need to be adapted
with specific feature selection approaches. This is true for most of the model creation algorithms,
where these feature selection approaches are already embedded. Considering the large amount of
model creation algorithms, a pre-selection is necessary. From previous results [WLFW11], algorithms like PLINK, MDR and BEAM are deprecated in preference to BOOST, S & C, SNPHarvester, SNPRuler and TEAM. In the last year, an interesting study [OSL13] revealed that IIM was
better than BEAM and SNPHarvester, making this an interesting approach to be tested. However
this algorithm does not yield a χ 2 score for the significant SNPs, and BEAM has since been improved, and is now in its third iteration [ZL07]. Furthermore, considering that MDR is one of the
first and most popular approaches to GWAS, a new iteration, MDMBR is also a good algorithm to
test.
To optimize the results, machine learning procedures should be used. Cross-validation and bootstrapping are the most popular approaches, due to their high ratio of training to test instances. Hold-out is also very popular for large datasets. As a DM methodology, CRISP-DM is the most widely adopted approach, for being independent of tools and industries.
There are many tools available, including specifically designed tools. However, some include
only a small amount of algorithms and do not allow implementation of new algorithms. This is
important to test existing approaches and generate new ones. A data analysis tool that allows
scripting, such as R, is very useful for creating scripts that evaluate existing algorithms based on
the chosen statistical relevancy tests.
The algorithms selected for the empirical study of state-of-the-art model creation algorithms regarding epistasis and main effect detection are listed in Table 2.11, together with a summary of the main characteristics of each selected algorithm.
Table 2.11: Similarities and differences between BEAM3, BOOST, MBMDR, Screen & Clean, SNPHarvester, SNPRuler, and TEAM.

Feature                BEAM3        BOOST        MBMDR        Screen & Clean   SNPHarvester   SNPRuler    TEAM
Search                 Stochastic   Exhaustive   Exhaustive   Heuristic        Stochastic     Heuristic   Exhaustive
Interactive Effect     −*           √            √            √                √              √           √
Main Effect            √            √            −            √                √              −           −
Full Effect            −*           √            −            √                √              −           −
Programming Language   C++          C            R            R                Java           Java        C++

The chi-square test is done for each SNP in main effect detection, and for each SNP interaction in epistasis detection. Full effect is a disease model with both main effect and epistasis.
*Although BEAM3 can evaluate interactive and full effects, its evaluation test is not comparable between methods: only single SNPs are evaluated with the χ² test. TEAM outputs the χ² test score from the contingency tables, but does not output the individual SNP χ² score. MBMDR and Screen & Clean results are comparable with the other algorithms.
Chapter 3
A Comparative Study of Epistasis and Main Effect Analysis Algorithms
In this chapter, the experimental setup of an empirical analysis with existing epistasis detection
algorithms is presented.
3.1 Introduction
The experiments can be divided into two stages: the empirical analysis of existing methods and
the comparison between a new approach and the existing algorithms.
For stage 1, several algorithms were selected based on the previous state of the art study,
using very different approaches. The algorithms selected are: BEAM 3.0 [Zha12]; BOOST
[WYY+ 10a]; MBMDR [MVV11]; Screen and Clean [WDR+ 10]; SNPHarvester [YHW+ 09];
SNPRuler [WYY+ 10b]; and TEAM [ZHZW10]. The purpose of this study is to evaluate the
results of each algorithm and select the best algorithms according to the evaluation measures for
stage 2.
Stage 2 consists of creating an Ensemble approach based on the characteristics of each algorithm. The existing algorithms are evaluated according to their Power, Scalability, and Type I
Error Rate. Each algorithm is analyzed with each measure, for each parameter configuration.
This allows a correlation between evaluation measures and data set parameters for each algorithm, which provides a greater understanding of the usability of each algorithm according to its parameter settings.
In both of these studies, generated data sets were used, with many different configurations and
varying values. The many configurations use different parameters of: Population size; Minor
Allele Frequency; Odds ratio; Prevalence; and different types of Disease Models. These artificial data sets were created using genomeSIMLA, an open-source data generator with generational evolution capabilities and many parametrization options.
In this chapter, the data sets and their parameters are explained in more detail in Section 3.2. This Section also contains the input, output, and parameters of each algorithm. Section 3.3 contains the experimental procedure used for the stage 1 experiments and a discussion of the obtained results. Finally, Section 3.4 contains the conclusions drawn from the stage 1 experiments.
3.2 Methods
Data sets
In these experiments, there are a total of 270 different configurations of data sets. For each configuration, there are 100 data sets, creating 27,000 results for each algorithm.
Each data set is created in genomeSIMLA, generating a population of 1,000,000 individuals, evolved over 1750 generations. The growth of the initial population, consisting of 10,000 individuals, follows a logistic growth rate. This allows for an organic evolution of SNP allele frequencies. Each data set contains 300 SNPs divided into 2 chromosomes. The first chromosome contains 20 blocks of 10 SNPs each, while the second chromosome contains 10 blocks of 10 SNPs. The alleles infused with disease-related genotypes are chosen from different blocks in different chromosomes.
The following parameters are used to generate different configurations of data sets:
• Allele Frequency - The frequency of the minor allele of the disease SNPs. Considering the
allele frequency of all 300 SNPs, the chosen SNPs that affect the disease are selected among
the SNPs closest to the desired minor allele frequency. The allele frequencies can be seen
in the lab notes [PC14a].
• Population - Number of individuals sampled in the data set. According to each data set, a
given number of individuals are selected from the generated population mentioned earlier.
The ratio of cases to controls is determined by the disease prevalence.
• Disease Model - Type of disease model: main effect, epistasis interaction, and full effect.
The main effect model consists of 2 SNPs that independently affect the phenotype expres-
sion. The epistasis interaction model is determined by 2 SNPs that interact with each other
and affect the phenotype expression only when both disease alleles are present. Full effect
is determined by 2 SNPs that affect the phenotype expression by epistasis interaction and
by their main effect.
• Odds ratio - Relation between disease SNPs. Probability of one disease SNP being present,
given the presence of the other disease SNP.
• Prevalence - The proportion of a population with the disease. Affects the number of cases
and controls in a data set. A prevalence of 0.0001 corresponds to 30% of cases while a
prevalence of 0.02 corresponds to 50% of cases.
For these experiments, the parameters chosen are shown in Table 3.1.
Parameters                           Values
Minor Allele Frequency (0-1)         0.01; 0.05; 0.1; 0.3; 0.5
Population (Number of Individuals)   500; 1000; 2000
Disease Model                        Main Effect; Epistasis; Full Effect
Odds Ratio                           1.1; 1.5; 2.0
Prevalence                           0.001; 0.02

Table 3.1: The values of each parameter used. Each configuration has a unique set of the parameters used.
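Enumerating the parameter values of Table 3.1 confirms the total number of configurations; the following sketch (plain Python, values copied from the table) is only a consistency check.

# Enumerating the data set configurations from Table 3.1.
from itertools import product

maf = [0.01, 0.05, 0.1, 0.3, 0.5]
population = [500, 1000, 2000]
disease_model = ["main effect", "epistasis", "full effect"]
odds_ratio = [1.1, 1.5, 2.0]
prevalence = [0.001, 0.02]

configurations = list(product(maf, population, disease_model, odds_ratio, prevalence))
print(len(configurations))        # 5 * 3 * 3 * 3 * 2 = 270
print(len(configurations) * 100)  # 100 data sets each -> 27000 results per algorithm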
3.2.1 Algorithms for interaction analysis
The following algorithms were selected for these experiments. These algorithms were selected
because of their unique approach and previous results obtained. A more detailed description and
additional result analysis of the algorithms is available in the lab notes of these experiments:
BEAM3.0[PC14b]; BOOST[PC14c]; Screen and Clean[PC14d]; SNPRuler[PC14f]; SNPHarvester[PC14g];
TEAM[PC14h]; and MBMDR[PC14i].
BEAM3
The algorithm allows filtering out SNPs with many missing genotypes, and setting a specific number of interactions for the MCMC and its initial temperature. There is also a prior probability for the likelihood of each SNP being associated with the disease. The default value is p = 5/L, where L is the number of SNPs. This was changed to p = 2/L, considering that there are 2 disease-affected SNPs.
BOOST
The algorithm contains no options to be customized. Considering the transformation of the data into a Boolean type, the χ² tests for interaction analysis have only 4 degrees of freedom.
MBMDR
This algorithm was processed in a different computer setting. The computer used by this algorithm has an Intel(R) Core(TM)2 Quad CPU Q9400 2.66GHz processor and 16.00 GB of RAM.
Screen and Clean
The parameters chosen for this algorithm are:
• L - number of SNPs to be retained with the smallest p-values. Since there are 300 SNPs,
this is the value chosen.
• K_pairs - Number of pairwise interactions to be retained by the lasso. The selected value is
100.
• response - The type of phenotype. Can be binomial or gaussian. The phenotypes are binomial.
• alpha - The Bonferroni correction lower bound for the retention of SNPs. For this experiment, α = 0.05.
• standardize - If true, the genotypes coded as 0, 1, or 2 are centered to mean 0 and standard deviation 1. The data must be standardized to run the Screen & Clean procedure. Considering the input data, this is enabled.
SNPHarvester
This algorithm has two modes: a "Threshold-Based" mode, which outputs all significant SNPs above a specified significance threshold, and a "Top-K Based" mode, which outputs a specified number of SNP interactions. It is possible to choose the minimum and maximum number of interacting SNPs. For these experiments, the mode used is the "Threshold-Based" mode, with a significance level of α = 0.05, a minimum number of interacting SNPs of 1, which tests the main effects of SNPs, and a maximum of 2.
SNPRuler
These results are already limited by a threshold of 0.3, and further reduced to 0.05, with a Bonferroni correction. There are 3 configurable parameters:
• listSize - The expected number of interactions.
• depth - Order of interaction. Number of interacting SNPs.
• updateRatio - The step size of updating a rule. Takes a value between 0 and 1, 0 being not
updated and 1 updating a rule at each step.
The maximum number of rules is 50000, the length of each rule is 2 and the pruning threshold
is 0, to allow for all possible combinations.
TEAM
For this experiment the χ² score was calculated from the contingency tables. The number of permutations used in the significance test is set to 100 and the false discovery rate is set to 1. This controls the error rate using the permutation test, instead of a Bonferroni correction.
3.3 Simulation Design
This section contains the evaluation measures for the obtained results, the experimental methodology used in the experiments, the obtained results, and a discussion of the results.
Experimental Procedures
The results obtained from the various algorithms are evaluated according to their Power, Scalability, and Type I Error Rate.
In each data set, true positives and false positives are calculated based on the P-values that
correspond to α < 0.05 in the statistical test, after a Bonferroni correction.
The Power of each configuration is the percentage of the 100 data sets in the configuration in which true positives were found. If the Power is 100%, the disease-affected SNPs were found in every data set of the configuration.
The Type I Error Rate is calculated similarly to Power. For each configuration, the Type I
Error Rate is the percentage of data sets that contain false positives out of the 100 data sets in the
configuration. If the Type I Error Rate is 100%, that means that all the data sets contain at least
1 false positive. That means that there is at least 1 SNP or interaction of SNPs that is considered
statistically significant, but is not related to the disease.
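As an illustration of how these two measures can be summarized for one configuration, the sketch below uses a hypothetical data structure (a list with one entry of significant findings per data set, after the Bonferroni correction); it is not the actual Bash/R pipeline used in these experiments.

# A sketch of summarizing Power and Type I Error Rate for one configuration.
# `disease_snps` holds the pair infused with the disease signal (hypothetical names).
disease_snps = {("SNP135", "SNP230")}

def summarize(results):
    true_hits = sum(1 for findings in results if disease_snps & set(findings))
    false_hits = sum(1 for findings in results if set(findings) - disease_snps)
    power = 100.0 * true_hits / len(results)    # % of data sets where the true pair is found
    type1 = 100.0 * false_hits / len(results)   # % of data sets with at least one false positive
    return power, type1

# Example with three mock data sets (for illustration only).
mock = [[("SNP135", "SNP230")], [("SNP12", "SNP87")], []]
print(summarize(mock))  # approximately (33.3, 33.3)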
Scalability is evaluated in three different ways: Running Time; CPU Usage; and Memory
Usage. Each of these measures is calculated for each data set, which is then averaged for each
configuration. The Running Time is calculated in seconds, CPU Usage is calculated in percentage
and Memory Usage is calculated in MBytes. All these measures are calculated from the moment
the algorithm is started until it has finished.
For these experiments, the Data Mining Process selected is CRISP-DM. The scripts used to run each algorithm were written for the Unix shell Bash. Each algorithm was implemented in a specific language. For the comparison of results, the R language is used in the statistical relevance test, selecting only the relevant results.
For each allele frequency configuration, a different SNP pair is used, choosing the SNPs that
are closest to the desired minor allele frequency. The SNPs selected according to their minor allele
frequency (MAF) are as follows:
• MAF 0.01 - SNP112 (0.01329) and SNP267 (0.010001)
• MAF 0.05 - SNP4 (0.05239) and SNP239 (0.048355)
• MAF 0.1 - SNP135 (0.09855) and SNP230 (0.089905)
• MAF 0.3 - SNP197 (0.274662) and SNP266 (0.31648)
• MAF 0.5 - SNP80 (0.439337) and SNP229 (0.50654)
The penetrance tables are created differently for each allele frequency, altering the proportions
of each genotype in the disease SNPs.
Initially, the algorithms are tested for the most extreme configurations (minimum and maximum MAF) to see if the results obtained are as expected. After this is confirmed, the algorithms
are executed for all configurations, according to the capabilities of each algorithm.
For each data set, a file containing the scalability measures is created. For each configuration, a file summarizing all the data sets is created for Power, Scalability, and Type 1 Error Rate.
The computer used for these experiments ran the 64-bit Ubuntu 13.10 operating system, with an Intel(R) Core(TM)2 Quad CPU Q6600 2.40GHz processor and 8.00 GB of RAM. The results were obtained using parallel processing.
Results and Discussion
Figures 3.1, 3.2, and 3.3 show the Power and Type I Error results for each algorithm according to each population size, while Figures 3.4, 3.5, and 3.6 display the results according to each minor allele frequency. Not all the algorithms are used for each disease model, due to algorithm limitations and properties, as discussed earlier in Table 2.11. Further results, relating to different parameters, can be seen in Lab Note 9 of these experiments [PC14e]. The lab notes are available in the appendices.
For epistasis detection (Figure 3.1) by population size, in data sets with 500 individuals (a),
no algorithm has a Power above the Type I Error Rate, which is as high as 14%. The Power for
almost all algorithms is 0%, with the exception of BOOST, which is 1%. MBMDR and SNPRuler have 0% Type I Error Rate, while Screen and Clean has the highest error rate of all algorithms,
closely followed by BOOST and SNPHarvester. For 1000 individuals (b), almost all algorithms
have Power higher than Type I Error Rate, with the exception of Screen and Clean. The algorithm with the highest Power is BOOST with 41%, with SNPHarvester and TEAM behind, both with
21%. MBMDR and Screen and Clean have very little Power. Screen and Clean has the highest
Type I Error Rate, with 16%, followed by SNPHarvester, BOOST and TEAM. Both MBMDR
and SNPRuler have 0% error rate. In the data sets with 2000 individuals (c), there are several
algorithms with high Power. BOOST has the best Power with 94%, closely followed by TEAM
with 92% and SNPHarvester with 85%. The worst algorithm by Power is Screen and Clean with
6%. Type I Error Rate is relatively low overall, with TEAM having the highest value, 28%, and Screen and Clean, BOOST, and SNPHarvester close behind, with 21%, 21%, and 19%, respectively.
The algorithm with the lowest error rate is MBMDR.
In Main effect detection (Figure 3.2), for 500 individuals (a), nearly all algorithms present
0% Power, with the exception of BOOST with 2%. Type I Error Rate is high, with Screen and
Clean having the highest value with 21%, followed by BOOST and SNPHarvester with 12% and
11% respectively. BEAM3 has the lowest error rate with 9%. For data sets with 1000 individuals
(b), the algorithm with the highest Power is BOOST with 43%, with SNPHarvester and BEAM3
close behind at 38% and 32% respectively. Screen and Clean has 0% Power. Type I Error Rate
is constant amongst all algorithms, with BOOST and Screen and Clean slightly ahead, with 23%.
For 2000 individuals (c), BOOST has the highest Power with 97%, while BEAM3 and SNPHarvester have 93%. Screen and Clean has 39% Power, but also the lowest error rate, 36%, whereas SNPHarvester has the highest error rate, with 79%.
The data sets with the full effect disease model (Figure 3.3), for 500 individuals (a), show that BOOST has 1% Power and the other algorithms have 0%. The algorithm with the highest Type I Error Rate is Screen and Clean, with 19%, and SNPHarvester has the lowest, with 9%. For 1000 individuals (b), BOOST has the most Power with 42%, SNPHarvester has 32%, and Screen and Clean remains at 0%.
Figure 3.1: These results correspond to epistasis detection by population size. The data sets have a 0.1 minor allele frequency, 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler and TEAM. Each sub figure contains the values for all algorithms in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c).
Type I Error Rates are higher for BOOST, with 38%; Screen and Clean and SNPHarvester have 28% and 27%, respectively. For 2000 individuals (c), the best algorithm is, once again, BOOST with 98% Power, with SNPRuler close behind at 95%. Screen and Clean has 0% Power, but also the lowest error rate, with 33%. BOOST has an 81% error rate, and SNPHarvester has 79%.
In evaluating data set results by minor allele frequency, for epistasis detection (Figure 3.4), there is 0% Power for all algorithms at 0.01 allele frequency (a). The Type I Error Rate is as high as 19% for Screen and Clean. The algorithms with the lowest error rate are MBMDR and
SNPRuler with 0%. In 0.05 allele frequency (b), TEAM has the highest Power, with 43% and
all other algorithms have a Power lower than 20%. TEAM also has the highest error rate, with
37%, and SNPRuler is the algorithm with the lowest error rate, with only 1%. For data sets with
0.1 minor allele frequency (c), BOOST and TEAM are the best algorithms with 94% and 92%
Power, respectively. Screen and Clean is the algorithm with the lowest Power, at 3%. MBMDR
has the lowest Type I Error Rate, while TEAM has the highest error rate, with 28%.
Figure 3.2: These results correspond to main effect detection by population size. The data sets have a 0.1 minor allele frequency, 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type 1 Error Rate of BEAM3, BOOST, Screen and Clean, and SNPHarvester. Each sub figure contains the values for all algorithms in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c).
For 0.3 allele frequency (d), BOOST has 100% Power, with TEAM close behind at 92%, and SNPHarvester
with 85%. Screen and Clean has the lowest Power, at 2%. SNPRuler has the lowest error rate,
with only 2%, while SNPHarvester has the highest, with 19%. Finally, for 0.5 allele frequency
(e), algorithms BOOST, TEAM and SNPRuler have the highest Power, with 100%, 95% and 92%,
respectively. Once again, Screen and Clean has the lowest Power with 0%. SNPHarvester has the
highest error rate, with 11%, and MBMDR together with SNPRuler have the lowest, with 0%.
For main effect detection (Figure 3.5), in 0.01 allele frequency (a), Power is 0% in all algorithms. Type I Error Rate is highest in Screen and Clean, with 13%. In 0.05 allele frequency (b),
Power is nearly 0% for all algorithms except BOOST, with 14%. SNPHarvester has the highest
Type I Error Rate, with 24%, followed by Screen and Clean, with 22%. BOOST has the lowest
error rate with 11%. For 0.1 allele frequency (c), the most powerful algorithm is BOOST (97%),
closely followed by BEAM3(92%) and SNPHarvester(92%). SNPHarvester has the highest error
rate with 79%, and Screen and Clean has the lowest, with 36%. In data sets with 0.3 allele frequency (d), all algorithms have 100% Power, with the exception of Screen and Clean with only 58%. All algorithms have a 100% Error Rate, except Screen and Clean with 38%.
Figure 3.3: These results correspond to full effect detection by population size. The data sets have a 0.1 minor allele frequency, 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type 1 Error Rate of BOOST, Screen and Clean, and SNPHarvester. Each sub figure contains the values for all algorithms in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c).
The results are the same for 0.5 minor allele frequency (e), with the exception of Screen and Clean, with 62% Power and 48% Type I Error Rate.
For full effect detection (Figure 3.6), in 0.01 allele frequency (a), there is 0% Power in all
algorithms, with Screen and Clean having the highest Type I Error Rate (14%), and SNPHarvester
having the lowest (1%). For 0.05 minor allele frequency (b), only BOOST has any Power, with
15%. Screen and Clean has the highest error rate with 21%, followed by SNPHarvester with 20%
and BOOST with 17%. For 0.1 (c), BOOST and SNPHarvester have a high Power percentage,
with 98% and 95%, respectively. Screen and Clean once again has 0% Power. However, Screen and Clean has the lowest error rate (33%), while BOOST has the highest (81%), followed by SNPHarvester (79%). In 0.3 (d) and 0.5 (e) minor allele frequencies, both BOOST and SNPHarvester have the same values, with 100% Power and Type I Error Rate. Screen and Clean has a Power of 40% and 91%, and a Type I Error Rate of 68% and 84%, for 0.3 and 0.5 allele frequencies, respectively.
Table 3.2 contains the scalability analysis. Screen and Clean is revealed to be the slowest algorithm, followed by SNPHarvester. TEAM and BEAM3 have similar values, with SNPRuler taking close to half of their running time. BOOST is the fastest algorithm, with less than 1 second of running time on the biggest data sets.
Figure 3.4: These results correspond to epistasis detection by minor allele frequency. The data sets have 2000 individuals, 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler and TEAM. Each sub figure contains the values for all algorithms in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies.
Screen and Clean also has the biggest increase in running time, followed by SNPHarvester. SNPRuler is the most expensive algorithm in CPU usage, with more than 100% CPU usage, which means that the algorithm uses more than one core to process each data set. In memory usage, SNPRuler also has the highest consumption, closely followed by TEAM, Screen and Clean, SNPHarvester, and BEAM3, with BOOST far behind.
Figure 3.5: These results correspond to main effect detection by minor allele frequency. The data sets have 2000 individuals, 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type 1 Error Rate of BEAM3, BOOST, Screen and Clean, and SNPHarvester. Each sub figure contains the values for all algorithms in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies.
Figure 3.6: These results correspond to full effect detection by minor allele frequency. The data sets have 2000 individuals, 2.0 odds ratio, and 0.02 prevalence. Each bar corresponds to the Power and Type 1 Error Rate of BOOST, Screen and Clean, and SNPHarvester. Each sub figure contains the values for all algorithms in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies.
                Running Time (s)         CPU Usage (%)              Memory Usage (MB)
                500    1000   2000       500    1000    2000        500     1000    2000
BEAM3           4.9    7      8          87.8   96.3    95.5        4       4.3     5.8
BOOST           0.16   0.22   0.34       95.7   98.79   97.87       0.98    1       1.2
MBMDR*          −      −      −          −      −       −           −       −       −
SnC             8.05   18.65  34.65      75.7   98.99   77.25       129.8   137.2   152.5
SNPHarvester    9.29   25.89  33         102.1  86.5    101.6       68.35   71.3    76.86
SNPRuler        2.7    3.09   4.1        130.2  141.9   156.28      312.7   316     320.2
TEAM            3.28   5.28   9.81       66.99  69.71   74.75       162.7   176     228.1

Table 3.2: Scalability test containing the average running time, CPU usage, and memory usage by data set population size. BOOST, Screen and Clean, and SNPHarvester have values related to full effect, TEAM and SNPRuler are related to epistasis detection, and BEAM3 is related to main effect detection. *MBMDR does not contain scalability results because these were obtained from a different computer with different hardware settings from all other results. The average running time of MBMDR for each data set was higher than 3600 seconds. The data sets have a 0.5 minor allele frequency, 2.0 odds ratio, and 0.02 prevalence.
3.4 Chapter Conclusions
The results show that BOOST is the best algorithm overall in terms of Power, but has a high Type
I Error Rate. SNPRuler has a low Type I Error Rate, but not very high Power and only works with
epistasis detection. Screen and Clean has very low Power in general, but has a relatively low error
rate, especially in data sets with a high number of individuals or a high minor allele frequency, in
main effect or full effect disease models. BEAM3 has high Power and slightly lower error rate than
BOOST, but only works with main effect. SNPHarvester has low Power, but also low Type I Error
Rate overall. MBMDR has very low Type I Error Rate with high Power in certain configurations,
but only works with epistasis and has a very high running time. TEAM has very high Power and
low Type I Error Rate, with the exception of certain configurations, particularly of lower number
of individuals and lower minor allele frequency. However, it only works for epistasis detection.
BOOST is the most scalable algorithm, followed by SNPRuler and BEAM3. This is important
for the next stage of the experiments, with an ensemble approach. Based on the data obtained,
we can conclude that some of the algorithms used would not be useful in an ensemble approach,
either because of their scalability, or because they would not add Power without compromising
Type I Error Rate.
These experiments show results similar to previous studies [WYY+ 10b] [SZS+ 11]; however, a much larger variety of data set types and algorithms is covered here, which differs from the previous studies. These experiments can be viewed from different perspectives, using different parameters, and the results can be analyzed according to their Power, Type I Error Rate, and Scalability. Furthermore, the results obtained are available in the lab notes. The lab notes and the created scripts are available at https://github.com/ei09045/EpistasisStudy.
Chapter 4
Ensemble Approach
In this chapter, a new Ensemble approach is discussed. This new approach uses algorithms from
the previous empirical study to improve results.
4.1 Introduction
The results from the empirical study of existing epistasis detection algorithms showed unique
properties in each algorithm. Considering Power and Type I Error Rate, the purpose of this stage
is to create a new approach that maintains the Power of the best algorithms and lowers the Type I
Error Rate associated with these algorithms, which is usually high.
For this purpose, a new approach joining algorithms was developed. The algorithms are:
BEAM 3.0 [Zha12]; BOOST [WYY+ 10a]; SNPRuler [WYY+ 10b]; SNPHarvester [YHW+ 09];
and TEAM [ZHZW10]. These algorithms were selected based on their Power to Type I Error Rate
ratio, and their scalability. BOOST is used both for epistasis detection and main effect detection,
which means that there are a total of 3 algorithms used for each detection type, with the exception
of full effect, which uses all algorithms.
In Section 4.2 the experimental procedure for stage 2 is discussed, involving the process of
selecting and using a voting system for the Ensemble approach. This Section also shows the
results obtained from the Ensemble approach and the comparison with the existing algorithms.
Finally, Section 4.3 shows the conclusions from the discussion of the results.
4.2
Experiments
Experimental Procedure
For these experiments, the same data sets discussed in the previous chapter were used. The same
evaluation measures are used to evaluate the new results. The new approach is an Ensemble, in
which each algorithm votes, based on its relevant SNPs and SNP pairs, in a unified system that
chooses relevant main effect SNPs and relevant epistatic interactions.
For this purpose the algorithms selected for main effect detection are BEAM 3.0, BOOST,
and SNPHarvester. For epistasis detection the algorithms selected are BOOST, SNPRuler, and
TEAM. The Ensemble approach collects and registers the relevant results from each algorithm and
selects the SNPs and pairs of SNPs that are common to at least two algorithms. The algorithms
selected for main effect only work with single SNPs, and the algorithms selected for epistasis
detection only work with SNP pairs. BOOST works for both models, so its results enter the
voting stages of both main effect and epistasis detection. The results obtained from each algorithm are
converted into a unified format, so they can be interpreted in the voting stage. In full effect
detection, both main effect and epistasis detection algorithms intervene in the voting stage. This
helps to reduce the Type I Error Rate while maintaining Power, because an interaction that is truly
related to the phenotype will be reported by most of the algorithms, while unrelated interactions
will not.
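To make the voting stage concrete, the following R sketch shows one possible implementation of the majority vote described above; the unified column layout (SNP1, SNP2), the helper name, and the input file names are illustrative assumptions, not the exact scripts used in the experiments (those are in the repository referenced in Chapter 3).

# Minimal sketch of the majority-voting stage (illustrative only). It assumes
# each algorithm's output has already been converted to the unified format:
# a data frame with columns SNP1 and SNP2 (SNP2 is NA for main effect results).
ensemble_vote <- function(results, min_votes = 2) {
  # 'results' is a named list of data frames, one per algorithm.
  # Build a canonical key for each SNP or SNP pair so that (a, b) == (b, a).
  keys <- lapply(results, function(df) {
    apply(df[, c("SNP1", "SNP2")], 1, function(p) {
      paste(sort(na.omit(p)), collapse = "_")
    })
  })
  # Each algorithm casts at most one vote per SNP or SNP pair.
  votes <- table(unlist(lapply(keys, unique)))
  # Keep the candidates reported by at least 'min_votes' algorithms.
  names(votes)[votes >= min_votes]
}

# Hypothetical usage for epistasis detection with the three selected algorithms:
# epi <- list(BOOST    = read.table("boost_pairs.txt",    header = TRUE),
#             SNPRuler = read.table("snpruler_pairs.txt", header = TRUE),
#             TEAM     = read.table("team_pairs.txt",     header = TRUE))
# ensemble_vote(epi)   # SNP pairs selected by the Ensemble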
The computer used for these experiments ran the 64-bit Debian testing (jessie) operating system, with an Intel(R) Core(TM)2 Quad CPU Q9400 2.66 GHz processor and 16.00 GB of RAM.
Results and Discussion
The results obtained from previous experiments are used to compare the performance of existing
algorithms with the new ensemble approach. Figures 4.1, 4.2, and 4.3 show the Power and Type
I Error Results for each algorithm according to each population size; Figures 4.4, 4.5, and 4.6
display the results according to each minor allele frequency; Figures 4.7, 4.8, and 4.9 show the
results according to each odds ratio tested; and Figures 4.10, 4.11, and 4.12 contain the results
regarding both prevalence values.
The results in epistasis detection by population size for data sets with 500 individuals (a) show
0% Power but also 0% Type I Error Rate for the Ensemble approach. BOOST has the most Power
with 1% but has 7% Type I Error Rate. In data sets with 1000 individuals (b), the Ensemble
approach has 23% Power and 0% error rate, while the algorithm with the most Power is BOOST
with 41% and 5% error rate. For 2000 individuals (c) in data sets, the Ensemble has 92% Power
and 15% error rate, while BOOST has 94% Power but 21% error rate.
In main effect detection, for 500 individuals (a), the Ensemble approach has 0% Power and an 11% error rate.
BOOST has 2% Power but 12% error rate, while BEAM3 has only 9% error rate. For 1000
individuals (b), Ensemble has 37% Power and 19% error rate. BOOST has 43% Power and 23%
error rate while BEAM3 has the least error rate with 18% but has only 32% Power. Finally in data
sets with 2000 individuals (c), Ensemble has 92% Power and a Type I Error Rate of 71%. BOOST has
the most Power with 97% and has 74% error rate.
Full effect detection results show 0% Power and 11% Type I Error Rate in the Ensemble
approach for data sets with 500 individuals (a). SNPHarvester has the least error rate with 9%. For
data sets with 1000 individuals (b), Ensemble has 0% Power and 11% error rate, but SNPHarvester
has less error rate with 9%. For 2000 individuals (c), the Ensemble approach has 95% Power and
75% Type I Error Rate, having the most Power and least error rate.
Figure 4.1: These results correspond to epistasis detection by population size, with a 0.1 minor
allele frequency, 2.0 odds ratio, and 0.02 prevalence. The results of the Power and Type I Error
Rate of BOOST, SNPHarvester, SNPRuler, TEAM, and Ensemble. Each sub figure contains the
values for all algorithms in data sets with 500 individuals (a), 1000 individuals (b), and 2000
individuals (c).
Figure 4.2: These results correspond to main effect detection by population size, with a 0.1 minor
allele frequency, 2.0 odds ratio, and 0.02 prevalence. The results of the Power and Type I Error
Rate of BEAM3, BOOST, SNPHarvester, and Ensemble. Each sub figure contains the values for
all algorithms in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c).
Figure 4.3: These results correspond to full effect detection by population size, with a 0.1 minor allele frequency, 2.0 odds ratio, and 0.02 prevalence. The results of the Power and Type I
Error Rate of BOOST, SNPHarvester, and Ensemble. Each sub figure contains the values for all
algorithms in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c).
In minor allele frequency analysis, for epistasis detection, Ensemble shows 0% Power and
0% Type I Error Rate in data sets with 0.01 allele frequency (a). For 0.05 allele frequency (b),
Ensemble approach has 6% Power and 1% error rate, while TEAM has 43% Power and 37% error
rate. In 0.1 allele frequency (c), Ensemble has 92% Power and 15% error rate. BOOST has 94%
Power but 21% error rate. For 0.3 allele frequency (d), Ensemble has 99% Power and 1% error
rate, while BOOST has 100% Power but 6% error rate. Finally in 0.5 minor allele frequency (e),
Ensemble has 100% Power and 0% Type I Error Rate.
Figure 4.4: These results correspond to epistasis detection by minor allele frequency, with 2000
individuals, 2.0 odds ratio, and 0.02 prevalence. The results of the Power and Type I Error Rate
of BOOST, SNPHarvester, SNPRuler, TEAM, and Ensemble. Each sub figure contains the values
for all algorithms in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies.
In main effect detection, for 0.01 allele frequency (a), Ensemble has 0% Power and a 1% error rate.
In 0.05 allele frequency (b), Ensemble has 1% Power and 20% Type I Error Rate. BOOST is the
best algorithm in this setting, with 14% Power and 11% Type I Error Rate. For 0.1 minor allele
frequency (c), Ensemble has 92% Power and 71% error rate. BOOST has 97% Power but has 74%
error rate. BEAM3 has the same Power as Ensemble, but a slightly lower error rate, with 67%. For
0.3 (d) and 0.5 (e) allele frequencies, all approaches have 100% Power and 100% Type I Error Rate.
Figure 4.5: These results correspond to main effect detection by minor allele frequency, with 2000
individuals, 2.0 odds ratio, and 0.02 prevalence. The results of the Power and Type I Error Rate
of BEAM3, BOOST, SNPHarvester, and Ensemble. Each sub figure contains the values for all
algorithms in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies.
In full effect detection, for 0.01 allele frequency (a), no algorithm has any Power, but Ensemble
has the least error rate, with only 1%. For 0.05 allele frequency (b), Ensemble has the least error
rate with 16% but BOOST has the most Power with 15%. For 0.1 (c), Ensemble approach has 95%
Power and 75% error rate. BOOST has 98% Power but 81% error rate. Finally, all algorithms have
100% Power and Type I Error Rate for 0.3 (d) and 0.5 (e) minor allele frequencies.
Analysing the results by odds ratio, for epistasis detection, at 1.1 odds ratio (a), Ensemble
has 1% Power and a 0% error rate. BOOST has 27% Power, but has a 5% Type I Error Rate. In 1.5 odds
ratio (b), Ensemble has 84% Power and 3% Type I Error Rate, while BOOST has 95% Power, but
Figure 4.6: These results correspond to full effect detection by minor allele frequency, with 2000
individuals, 2.0 odds ratio, and 0.02 prevalence. The results of the Power and Type I Error Rate of
BOOST, SNPHarvester, and Ensemble. Each sub figure contains the values for all algorithms in
data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies.
9% error rate. For 2.0 odds ratio (c), Ensemble has 92% Power and 15% error rate. BOOST has
slightly more Power, with 94%, but once again has higher error rate, with 21%.
In main effect detection, at 1.1 odds ratio (a), all algorithms have the same Power, with 2%,
but BEAM3 has a lower error rate, with 8%, while Ensemble has 10%. In 1.5 odds ratio (b), Ensemble
has the most Power, with 25%, and has 17% error rate, but BEAM3 has 16%, being the algorithm
with the least error rate. For 2.0 odds ratio (c), Ensemble has 92% Power and 71% error rate.
BEAM3 has the same Power but a lower error rate, with 67%, and BOOST has more Power (97%)
but also a higher Type I Error Rate (74%).
Finally, for full effect detection by odds ratio, for 1.1 odds ratio (a), Ensemble has the least
Power, with 3%, but also has the least error rate, with 7%. SNPHarvester has the most Power at
10%, and has 9% error rate. At 1.5 odds ratio (b), BOOST is the algorithm with the highest Power,
Figure 4.7: These results correspond to epistasis detection by odds ratio, with a 0.1 minor allele
frequency, 2000 individuals, and 0.02 prevalence. The results of the Power and Type I Error Rate
of BOOST, SNPHarvester, SNPRuler, TEAM, and Ensemble. Each sub figure contains the values
for all algorithms in data sets with 1.1 odds ratio (a), 1.5 odds ratio (b), and 2.0 odds ratio (c).
Figure 4.8: These results correspond to main effect detection by odds ratio, with a 0.1 minor allele
frequency, 2000 individuals, and 0.02 prevalence. The results of the Power and Type I Error Rate
of BEAM3, BOOST, SNPHarvester, and Ensemble. Each sub figure contains the values for all
algorithms in data sets with 1.1 odds ratio (a), 1.5 odds ratio (b), and 2.0 odds ratio (c).
with 72%, and highest Type I Error Rate, with 51%. Ensemble has the least Power, with 65%, but
also the least error rate, with 40%. For 2.0 odds ratio (c), Ensemble has the least error rate, with
75%, and has 95% Power, while BOOST has 98% Power, but 81% error rate.
Figure 4.9: These results correspond to full effect detection by odds ratio, with a 0.1 minor allele
frequency, 2000 individuals, and 0.02 prevalence. The results of the Power and Type I Error Rate
of BOOST, SNPHarvester, and Ensemble. Each sub figure contains the values for all algorithms
in data sets with 1.1 odds ratio (a), 1.5 odds ratio (b), and 2.0 odds ratio (c).
Looking at the results of data sets by the prevalence of the disease, in epistasis detection,
with 0.0001 prevalence (a), Ensemble has 86% Power and 2% error rate. SNPRuler is the only
algorithm with a lower error rate, at 0%, but has much less Power. BOOST has 91% Power, but has
7% error rate. For 0.02 prevalence (b), the Ensemble approach has 92% Power, and 15% error
rate. SNPRuler has 8% error rate, but has 32% Power. BOOST has 94% Power and a Type I Error
Rate of 21%.
Regarding main effect results, for 0.0001 prevalence (a), Ensemble has 98% Power and 77%
error rate. BEAM3 is the best in this configuration, with 99% Power, and 76% error rate. For 0.02
prevalence (b), Ensemble has 92% Power and 71% error rate. BEAM3 is the algorithm with the
least error rate, with 67%, with the same Power as Ensemble. BOOST has 97% Power and 74%
error rate.
For full effect analysis by prevalence, for 0.0001 prevalence (a), Ensemble has 99% Power and 99%
Figure 4.10: These results correspond to epistasis detection by prevalence, with a 0.1 minor allele
frequency, 2000 individuals, and 2.0 odds ratio. The results of the Power and Type I Error Rate
of BOOST, SNPHarvester, SNPRuler, TEAM, and Ensemble. Each sub figure contains the values
for all algorithms in data sets with 0.0001 prevalence (a), and 0.02 prevalence (b).
Figure 4.11: These results correspond to main effect detection by prevalence, with a 0.1 minor
allele frequency, 2000 individuals, and 2.0 odds ratio. The results of the Power and Type I Error
Rate of BEAM3, BOOST, SNPHarvester, and Ensemble. Each sub figure contains the values for
all algorithms in data sets with 0.0001 prevalence (a), and 0.02 prevalence (b).
Type I Error Rate, while SNPHarvester has the same error rate, but has 100% Power. For 0.02
prevalence (b), Ensemble is the best algorithm, with 95% Power and a 75% error rate.
Figure 4.12: These results correspond to full effect detection by prevalence, with a 0.1 minor allele
frequency, 2000 individuals, and 2.0 odds ratio. The results of the Power and Type I Error Rate of
BOOST, SNPHarvester, and Ensemble. Each sub figure contains the values for all algorithms in
data sets with 0.0001 prevalence (a), and 0.02 prevalence (b).
In order to evaluate the scalability of the Ensemble algorithm in relation to the other algorithms,
the scalability measures were taken both from each algorithm individually, while being executed
within the Ensemble approach, and from the overall Ensemble algorithm, including the voting stage.
Tables 4.1, 4.2, and 4.3 indicate the total running time of all the algorithms, the running time
obtained from the Ensemble approach (with the voting stage), and the difference between them.
The average CPU usage and memory usage throughout the Ensemble algorithm are also registered.
The data sets chosen were the full effect disease model data sets, because these are more likely
to produce the largest number of statistically significant results per data set, which increases
running time and memory usage.
The results show that the difference between the total running time of all algorithms and the
Ensemble running time grows with the data set size. However, for epistasis and main effect detection,
the added running time does not grow as a share of the total, which means that the difference in
running time remains almost the same percentage of the Ensemble running time, independently of the
data set size. In epistasis detection, the overhead is close to 30% of the total running time of the
algorithms, although the difference in seconds is smaller than for main effect and full effect
detection, which have overheads of roughly 14.5% and 9.6% at 2000 individuals. There is no relation
between CPU usage and data set size, but there is a small increase in memory usage with data set
size, especially in full effect detection.
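As an illustration of how the running-time part of these measurements can be collected in R, the sketch below times each algorithm and the full Ensemble run with system.time; the run_* functions are hypothetical wrappers around the external tools, and CPU and memory usage, which were monitored separately, are not reproduced here.

# Illustrative sketch of the running-time measurements (assumed wrappers;
# the real experiments invoke the external tools from scripts).
time_algorithm <- function(run_fn, dataset) {
  # system.time()["elapsed"] gives the wall-clock running time in seconds.
  unname(system.time(run_fn(dataset))["elapsed"])
}

measure_scalability <- function(run_fns, run_ensemble, dataset) {
  individual <- sapply(run_fns, time_algorithm, dataset = dataset)
  total      <- sum(individual)                        # "Total" row of the tables
  ensemble   <- time_algorithm(run_ensemble, dataset)  # includes the voting stage
  c(total = total, ensemble = ensemble, difference = ensemble - total)
}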
                Running Time (s)       CPU Usage (%)          Memory Usage (MB)
                500    1000   2000     500    1000   2000     500    1000   2000
BEAM3           0.5    0.6    0.8      87.8   85.3   88       2.2    2.6    3.4
BOOST           0.1    0.2    0.3      98.6   96.3   96.9     1      1      1.1
SNPHarvester    1.9    3.1    6        119.9  108.2  104.7    52.2   60.2   78.3
SNPRuler        2.3    2.6    3.4      181.2  143.8  136.9    352.4  352.4  353.5
TEAM            2.7    4.6    8.1      99     98.6   98.8     162.7  177    228.1
Total*          7.5    11.1   18.6
Ensemble*       9.8    14.3   23.9     110.6  103    102.1    352.4  352.4  353.5
Difference*     2.3    3.2    5.3
Table 4.1: Scalability test containing the average running time, CPU usage, and memory usage by
data set population size. The data sets have a minor allele frequency of 0.5, a 2.0 odds ratio, 0.02
prevalence, and the disease model is epistasis detection. *Total is the summed running time of all
algorithms; Ensemble is the running time of all algorithms in the Ensemble approach, including the
voting stage; and Difference is the running time increase between them. CPU usage and memory usage
are averages over the whole process, so their totals are not meaningful.
                Running Time (s)       CPU Usage (%)          Memory Usage (MB)
                500    1000   2000     500    1000   2000     500    1000   2000
BEAM3           2.9    4.1    4.6      94.5   94.6   97.3     2.9    3.4    4.2
BOOST           0.1    0.2    0.3      97.9   98.6   98.9     1      1      1.1
SNPHarvester    2.5    11.6   39.4     117.9  105.5  102      59.6   92.5   103.6
SNPRuler        2.3    2.3    3        168    147    160.9    352.2  352.2  354.3
TEAM            2.9    4.2    7.7      98.7   99     99       162.7  177    227.9
Total*          10.7   22.4   55
Ensemble*       13.1   26     63       89.8   88     77.3     349.2  352.2  354.3
Difference*     2.4    3.6    8
Table 4.2: Scalability test containing the average running time, CPU usage, and memory usage by
data set population size. The data sets have a minor allele frequency of 0.5, a 2.0 odds ratio, 0.02
prevalence, and the disease model is main effect. *Total is the summed running time of all
algorithms; Ensemble is the running time of all algorithms in the Ensemble approach, including the
voting stage; and Difference is the running time increase between them. CPU usage and memory usage
are averages over the whole process, so their totals are not meaningful.
                Running Time (s)       CPU Usage (%)          Memory Usage (MB)
                500    1000   2000     500    1000   2000     500    1000   2000
BEAM3           110.8  90.9   196.7    99     90.9   99       37.3   29.8   106.2
BOOST           0.1    0.2    0.3      98.6   97.7   98.5     0.9    1      1.2
SNPHarvester    7.9    20.3   28.9     111    103.8  102.9    98.9   101.3  102.8
SNPRuler        2.3    2.3    2.8      197.7  149.9  154.8    337.4  304.7  353.7
TEAM            2.7    4.5    8        99     98.6   98.9     162.7  176.8  227.8
Total*          123.8  118.2  236.7
Ensemble*       126.8  126.3  259.4    95     79.2   74.6     337.4  304.7  353.7
Difference*     3      8.1    22.7
Table 4.3: Scalability test containing the average running time, CPU usage, and memory usage by
data set population size. The data sets have a minor allele frequency of 0.5, a 2.0 odds ratio, 0.02
prevalence, and the disease model is full effect. *Total is the summed running time of all
algorithms; Ensemble is the running time of all algorithms in the Ensemble approach, including the
voting stage; and Difference is the running time increase between them. CPU usage and memory usage
are averages over the whole process, so their totals are not meaningful.
4.3
Chapter Conclusions
In this chapter, a new epistasis and main effect detection approach is discussed. This is an Ensemble approach, using 5 of the best algorithms from the previous empirical study, shown in Chapter
3. This new Ensemble approach uses 3 algorithms to evaluate relevant epistatic interactions, and
relevant main effects of SNPs. If there is a majority in the voting stage, the SNP or SNP pair is
then selected as a relevant result.
From the results obtained, we can see that, for epistasis detection, the Ensemble method has
slightly less Power than the best algorithm, but it has the lowest Type I Error Rate, with the
exception of SNPRuler in some configurations, which in turn has much less Power than the
Ensemble method. In main effect detection, Ensemble is amongst the algorithms with the lowest
Type I Error Rate, behind BEAM3 in some configurations, but has consistently higher Power.
BOOST has more Power than the Ensemble method, but also has a higher error rate. For
full effect, Ensemble is the algorithm with the least error rate, but has less Power than BOOST in
some configurations.
The goal for these experiments is to create a more efficient method that is able to find the
ground-truth SNPs related to the disease while reducing false positives. The Ensemble method
fulfills these requirements. The scalability test shows that running time and memory usage
increase with the size of the data. However, given that only the 5 most scalable algorithms
were selected, and that resource and time consumption grow in a stable way, it is easy to
estimate the time and resources necessary for large data sets, and, given the Power and Type I
Error Rate results, the Ensemble is far better than any single algorithm.
Chapter 5
Conclusions
This dissertation was created for the purpose of improving the detection of genes, specifically
SNPs, that cause the expression of complex diseases. These diseases have a genetic basis that
increases susceptibility. This means that, given an individual's genotype, a greater risk of
developing a complex disease can be assumed if the genotype contains a SNP allele connected to
the disease manifestation. Therefore it is very important to find the genotype configurations
that are connected to a given disease.
A state-of-the-art study was made of the recent work related to this dissertation. For the
methodologies, the data selection and model creation algorithms were studied, together with the
more generic algorithms on which the specific model creation algorithms are based. Some auxiliary
algorithms were also studied; these are used in different stages by the specific model creation
algorithms. Furthermore, data analysis evaluation procedures and measures were studied. These
represent, respectively, how the data is used to train and test, and the statistically relevant
measures to be taken from the results. Data Mining processes and software were also studied.
CRISP-DM was the procedure selected for the experiments. R was the software used in the
experiments, except for algorithms that are implemented in other programming languages.
Initially, a group of algorithms was selected, based on the state-of-the-art study. These
algorithms are the most recent approaches that showed the most promise and compatibility with
each other, which made them easier to test. For this study, a large amount of data was generated,
using genomeSimla, to create different types of data sets, revealing a wide range of results from
each algorithm. The selected data sets contained different types of disease model, each type being
compatible with a subgroup of the selected algorithms. The purpose of this initial empirical study
was to find the best algorithms overall, according to their Power, Type I Error Rate, and
Scalability. The results showed that BOOST was the best algorithm overall in terms of Power, and
that SNPRuler and MBMDR had the lowest error rates, but MBMDR scaled very poorly. BOOST was also
the most scalable algorithm.
Out of the 7 algorithms selected for the comparison study, 3 were chosen for main effect detection
and 3 for epistasis detection. In each stage, the results are chosen by a majority vote of these
algorithms. For main effect detection, the algorithms chosen were: BEAM3; BOOST; and
SNPHarvester. For epistasis detection, the algorithms chosen were: BOOST; SNPRuler; and
TEAM. BOOST was selected twice because of the high Power results in both disease models.
A new methodology was then created with the selected algorithms: the Ensemble. Using the
selected algorithms, the purpose of this methodology is to maintain the Power of the algorithms,
while reducing Type I Error Rate.
The observed results showed that the Type I Error Rates were lowered significantly, especially
in epistasis detection. However, the Power of the Ensemble was slightly lower than that of BOOST
in some configurations. The Scalability results showed some difference between the running time
of the Ensemble and that of the selected algorithms, due to the voting stage, but this difference
is stable relative to the overall running time and does not grow faster than the data set size,
which means that only a small percentage of the overall running time is dedicated to the voting
stage.
The main conclusion of the empirical study of the state-of-the-art algorithms is that, even if
some algorithms show more dominant results, there is no absolute best algorithm for all types of
diseases. These results used small, artificial data sets, so for large, realistic data sets the
outcome also depends heavily on the scalability of each algorithm, which limits the types of
configurations that each algorithm can process. It is very difficult to obtain true positives
without false positives in a viable period of time. For this purpose, the Ensemble approach was
created to maintain the epistasis and main effect detections without such high amounts of false
positives, but the necessary running time is greater than that of all the algorithms combined,
which may not be viable for larger data sets. However, the Ensemble is the most accurate of the
approaches evaluated.
5.1
Contribution summary
The main contributions of this dissertation are as follows:
• The creation of a vast number of data sets with many different configurations, altering different parameters that affect the data and, consequently, the results of the algorithms. This allows for
a more complete evaluation of a given algorithm.
• An empirical study of the 7 most recent epistasis and main effect detection algorithms,
across many different configurations.
• Creation and evaluation of a new methodology, Ensemble, based on existing state-of-the-art
algorithms. This new methodology was able to yield good results, while decreasing the
Type I Error Rate.
References
[ABR64]
A Aizerman, Emmanuel M Braverman, and L I Rozoner. Theoretical foundations
of the potential function method in pattern recognition learning. Automation and
remote control, 25:821–837, 1964.
[AFFL+ 11] J. Alcala-Fdez, A. Fernandez, J. Luengo, J. Derrac, S. Garcia, L. Sanchez, and
F. Herrera. KEEL Data-Mining Software Tool: Data Set Repository, Integration of
Algorithms and Experimental Analysis Framework. Journal of Mult.-Valued Logic
& Soft Computing, 17:255–287, 2011.
[AFSG+ 09] J Alcalá-Fdez, L Sánchez, S García, M J del Jesus, S Ventura, J M Garrell, J Otero,
C Romero, J Bacardit, V M Rivas, J C Fernández, and F Herrera. KEEL: a software
tool to assess evolutionary algorithms for data mining problems. Soft Computing,
13:307–318, 2009.
[AS08]
Ana Azevedo and Manuel Filipe Santos. KDD, SEMMA and CRISP-DM: a parallel
overview. IADIS European Conference Data Mining, pages 182–185, 2008.
[Avi94]
John C Avise. Molecular markers: natural history and evolution. Springer, 1994.
[BCD+ 08]
Michael R Berthold, Nicolas Cebron, Fabian Dill, Thomas R Gabriel, Tobias Kötter, Thorsten Meinl, Peter Ohl, Christoph Sieb, Kilian Thiel, and Bernd Wiswedel.
KNIME: The Konstanz information miner. Springer, 2008.
[BCD09]
M R Berthold, N Cebron, and F Dill. KNIME-the Konstanz information miner:
version 2.0 and beyond. ACM SIGKDD, 2009.
[BDF+ 05]
Alexandre Bureau, Josée Dupuis, Kathleen Falls, Kathryn L Lunetta, Brooke Hayward, Tim P Keith, and Paul Van Eerdewegh. Identifying SNPs predictive of phenotype using random forests. Genetic epidemiology, 28:171–182, 2005.
[BFMD05]
J C Barrett, B Fry, J Maller, and M J Daly. Haploview: analysis and visualization of
LD and haplotype maps. Bioinformatics (Oxford, England), 21:263–265, 2005.
[BH95]
Yoav Benjamini and Yosef Hochberg. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical
Society. Series B (Methodological), 57:289 – 300, 1995.
[BMP+ 13]
Klaus Bønnelykke, Melanie C Matheson, Tune H Pers, Raquel Granell, David P
Strachan, Alexessander Couto Alves, Allan Linneberg, John a Curtin, Nicole M
Warrington, Marie Standl, Marjan Kerkhof, Ingileif Jonsdottir, Blazenka K Bukvic,
Marika Kaakinen, Patrick Sleimann, Gudmar Thorleifsson, Unnur Thorsteinsdottir, Katharina Schramm, Svetlana Baltic, Eskil Kreiner-Møller, Angela Simpson,
Beate St Pourcain, Lachlan Coin, Jennie Hui, Eugene H Walters, Carla M T Tiesler,
David L Duffy, Graham Jones, Susan M Ring, Wendy L McArdle, Loren Price,
Colin F Robertson, Juha Pekkanen, Clara S Tang, Elisabeth Thiering, Grant W
Montgomery, Anna-Liisa Hartikainen, Shyamali C Dharmage, Lise L Husemoen,
Christian Herder, John P Kemp, Paul Elliot, Alan James, Melanie Waldenberger,
Michael J Abramson, Benjamin P Fairfax, Julian C Knight, Ramneek Gupta, Philip J
Thompson, Patrick Holt, Peter Sly, Joel N Hirschhorn, Mario Blekic, Stephan Weidinger, Hakon Hakonarsson, Kari Stefansson, Joachim Heinrich, Dirkje S Postma,
Adnan Custovic, Craig E Pennell, Marjo-Riitta Jarvelin, Gerard H Koppelman,
Nicholas Timpson, Manuela Ferreira, Hans Bisgaard, and A. John Henderson. Meta-analysis of genome-wide association studies identifies ten loci influencing allergic
sensitization. Nature genetics, 45:902–6, 2013.
[Bre01]
Leo Breiman. Random forests. Machine learning, 45:5–32, 2001.
[CCK00]
P Chapman, J Clinton, and R Kerber. CRISP-DM 1.0. CRISP-DM, 2000.
[cel14]
Eukaryote DNA (http://commons.wikimedia.org/wiki/file:eukaryote_dna.svg),
June 2014.
[CKS04]
Robert Culverhouse, Tsvika Klein, and William Shannon. Detecting epistatic interactions contributing to quantitative traits. Genetic epidemiology, 27:141–152, 2004.
[CLEP07]
Yujin Chung, Seung Yeoun Lee, Robert C Elston, and Taesung Park. Odds ratio
based multifactor-dimensionality reduction method for detecting gene-gene interactions. Bioinformatics (Oxford, England), 23:71–76, 2007.
[Cor02]
Heather J Cordell. Epistasis: what it means, what it doesn’t mean, and statistical
methods to detect it in humans. Human Molecular Genetics, 11(20):2463–2468,
October 2002.
[Cor09]
Heather J Cordell. Detecting gene-gene interactions that underlie human diseases.
Nature reviews. Genetics, 10:392–404, 2009.
[DCE13]
J Demšar, T Curk, and A Erjavec. Orange: Data Mining Toolbox in Python. Journal
of Machine Learning Research, 14:2349–2353, 2013.
[DHS01]
Richard O Duda, Peter E Hart, and David G Stork. Pattern Classification, volume 2.
2001.
[DKN05]
KB Kai Bo Duan, Sathiya Keerthi, and Et al. N.C.Oza. In Multiple Classifier Systems. 2005.
[Dor92]
Marco Dorigo. Optimization, learning and natural algorithms. Ph. D. Thesis, Politecnico di Milano, Italy, 1992.
[EBT+ 08]
Todd L Edwards, William S Bush, Stephen D Turner, Scott M Dudek, Eric S
Torstenson, Mike Schmidt, Eden Martin, and Marylyn D Ritchie. Generating Linkage Disequilibrium Patterns in Data Simulations using genomeSIMLA. Lecture
Notes in Computer Science, 4973:24–35, 2008.
[FES+ 98]
Deborah Ford, D F Easton, M Stratton, S Narod, D Goldgar, P Devilee, D T Bishop,
B Weber, G Lenoir, and J Chang-Claude. Genetic heterogeneity and penetrance
analysis of the BRCA1 and BRCA2 genes in breast cancer families. The American
Journal of Human Genetics, 62(3):676–689, 1998.
[FHT+ 04]
Eibe Frank, Mark Hall, Len Trigg, Geoffrey Holmes, and Ian H Witten. Data mining
in bioinformatics using Weka. Bioinformatics (Oxford, England), 20:2479–2481,
2004.
[Fis19]
R A Fisher. XV.—The Correlation between Relatives on the Supposition of
Mendelian Inheritance. Earth and Environmental Science Transactions of the Royal
Society of Edinburgh, 52:399–433, 1919.
[FPSS96]
Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. The kdd process
for extracting useful knowledge from volumes of data, 1996.
[FTW+ 07]
Timothy M Frayling, Nicholas J Timpson, Michael N Weedon, Eleftheria Zeggini,
Rachel M Freathy, Cecilia M Lindgren, John R B Perry, Katherine S Elliott, Hana
Lango, and Nigel W Rayner. A common variant in the FTO gene is associated
with body mass index and predisposes to childhood and adult obesity. Science,
316(5826):889–894, 2007.
[GWM08]
Casey S Greene, Bill C White, and Jason H Moore. Ant colony optimization for
genome-wide genetic analysis. Lecture Notes in Computer Science, 5217:37–47,
2008.
[HDF+ 13]
Emily R Holzinger, Scott M Dudek, Alex T Frase, Ronald M Krauss, Marisa W
Medina, and Marylyn D Ritchie. ATHENA: a tool for meta-dimensional analysis
applied to genotypes and gene expression data to predict HDL cholesterol levels.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, pages
385–96, 2013.
[HFH+ 09]
Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann,
and Ian H. Witten. The WEKA Data Mining Software: An Update. ACM SIGKDD
Explorations Newsletter, 11:10–18, 2009.
[HK06]
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, volume 54. 2006.
[HRM03]
Lance W Hahn, Marylyn D Ritchie, and Jason H Moore. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions.
Bioinformatics (Oxford, England), 19:376–382, 2003.
[JNBV11]
Xia Jiang, Richard E Neapolitan, M Michael Barmada, and Shyam Visweswaran.
Learning genetic epistasis using Bayesian network scoring criteria. BMC bioinformatics, 12:89, 2011.
[Jun09]
Felix Jungermann. Information Extraction with RapidMiner. GSCL Symposium
Sprachtechnologie und eHumanities 2009, 2009:1–16, 2009.
[Koh95]
Ron Kohavi. A Study of Cross-Validation and Bootstrap for Accuracy Estimation
and Model Selection. In International Joint Conference on Artificial Intelligence,
volume 14, pages 1137–1143, 1995.
[KR92]
Kenji Kira and Larry A. Rendell. The feature selection problem: traditional methods
and a new algorithm. In Proceedings of the tenth national conference on Artificial
intelligence, pages 129–134, 1992.
[LD94]
Nada Lavrac and Saso Dzeroski. Inductive Logic Programming: Techniques and
Applications. E. Horwood, New York, 1994.
[LHSV04]
Kathryn L Lunetta, L Brooke Hayward, Jonathan Segal, and Paul Van Eerdewegh.
Screening large-scale association study data: exploiting interactions using random
forests. BMC genetics, 5:32, 2004.
[LIT92]
Pat Langley, Wayne Iba, and Kevin Thompson. An analysis of Bayesian classifiers.
In Proceedings of the National Conference on Artificial Intelligence, pages 223–
223, 1992.
[MCGT09]
Brett A McKinney, James E Crowe, Jingyu Guo, and Dehua Tian. Capturing the
spectrum of interaction effects in genetic association studies by simulated evaporative cooling network analysis. PLoS genetics, 5:e1000432, 2009.
[Moo04]
Jason H Moore. Computational analysis of gene-gene interactions using multifactor
dimensionality reduction. Expert review of molecular diagnostics, 4:795–803, 2004.
[MRW+ 07] B A McKinney, D M Reif, B C White, J E Crowe, and J H Moore. Evaporative
cooling feature selection for genotypic data involving interactions. Bioinformatics
(Oxford, England), 23:2113–2120, 2007.
[MSA+ 12]
Gennaro Miele, Giovanni Scala, Roberto Amato, Sergio Cocozza, and Michele
Pinelli. Simulating gene-gene and gene-environment interactions in complex diseases: Gene-Environment iNteraction Simulator 2, 2012.
[Mud09]
Geo. P. Mudge. Mendel’s principles of heredity. The Eugenics review, 1:130–137,
1909.
[MVV11]
Jestinah M Mahachie John, Francois Van Lishout, and Kristel Van Steen. Model-Based Multifactor Dimensionality Reduction to detect epistasis for quantitative
traits in the presence of error-free and noisy data. Eur J Hum Genet, 19(6):696–
703, June 2011.
[MW07]
JH Moore and BC White. Tuning ReliefF for genome-wide genetic analysis. Lecture
Notes in Computer Science, 4447:166–175, 2007.
[NBS+ 07]
Robin Nunkesser, Thorsten Bernholt, Holger Schwender, Katja Ickstadt, and Ingo
Wegener. Detecting high-order interactions of single nucleotide polymorphisms using genetic programming. Bioinformatics (Oxford, England), 23:3280–3288, 2007.
[NCS05]
Bernard V North, David Curtis, and Pak C Sham. Application of logistic regression
to case-control association studies involving two causative loci. Human heredity,
59:79–87, 2005.
[NKFS01]
M R Nelson, S L Kardia, R E Ferrell, and C F Sing. A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait
variation. Genome research, 11:458–470, 2001.
[Nun08]
Robin Nunkesser. Analysis of a genetic programming algorithm for association
studies. In Proceedings of the 10th annual conference on Genetic and evolutionary
computation, pages 1259–1266. ACM, 2008.
[OSL13]
Orlando Anunciação, Susana Vinga, and Arlindo L. Oliveira. Using Information Interaction to Discover Epistatic Effects in Complex Diseases. PloS one,
8(10):e76300, 2013.
[PA10]
Bo Peng and Christopher I Amos. Forward-time simulation of realistic samples for
genome-wide association studies. BMC bioinformatics, 11:442, 2010.
[PC14a]
Ricardo Pinho and Rui Camacho. Genetic Epistasis I - Materials and methods. 2014.
[PC14b]
Ricardo Pinho and Rui Camacho. Genetic Epistasis II - Assessing Algorithm BEAM
3.0. 2014.
[PC14c]
Ricardo Pinho and Rui Camacho. Genetic Epistasis III - Assessing Algorithm
BOOST. 2014.
[PC14d]
Ricardo Pinho and Rui Camacho. Genetic Epistasis IV - Assessing Algorithm
Screen and Clean. 2014.
[PC14e]
Ricardo Pinho and Rui Camacho. Genetic Epistasis IX - Comparative Assessment
of the Algorithms. 2014.
[PC14f]
Ricardo Pinho and Rui Camacho. Genetic Epistasis V - Assessing Algorithm
SNPRuler. 2014.
[PC14g]
Ricardo Pinho and Rui Camacho. Genetic Epistasis VI - Assessing Algorithm
SNPHarvester. 2014.
[PC14h]
Ricardo Pinho and Rui Camacho. Genetic Epistasis VII - Assessing Algorithm
TEAM. 2014.
[PC14i]
Ricardo Pinho and Rui Camacho. Genetic Epistasis VIII - Assessing Algorithm
MBMDR. 2014.
[PH08]
Mee Young Park and Trevor Hastie. Penalized logistic regression for detecting gene
interactions. Biostatistics (Oxford, England), 9:30–50, 2008.
[Phi08]
Patrick C Phillips. Epistasis–the essential role of gene interactions in the structure
and evolution of genetic systems. Nature reviews. Genetics, 9:855–867, 2008.
[PNTB+ 07] Shaun Purcell, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel A R Ferreira, David Bender, Julian Maller, Pamela Sklar, Paul I W de Bakker, Mark J Daly,
and Pak C Sham. PLINK: a tool set for whole-genome association and population-based linkage analyses. American journal of human genetics, 81:559–575, 2007.
[Pow11]
D.M.W. Powers. Evaluation : From Precision, Recall and F-Measure To Roc, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies, 2:37–63, 2011.
[PV08]
Anita Prinzie and Dirk Van den Poel. Random Forests for multiclass classification:
Random MultiNomial Logit, 2008.
[RHR+ 01]
M D Ritchie, L W Hahn, N Roodi, L R Bailey, W D Dupont, F F Parl, and J H
Moore. Multifactor-dimensionality reduction reveals high-order interactions among
estrogen-metabolism genes in sporadic breast cancer. American journal of human
genetics, 69:138–147, 2001.
[Rip01]
Brian D Ripley. The R project in statistical computing. MSOR Connections,
1(1):23–25, 2001.
[RSK03]
Marko Robnik-Sikonja and Igor Kononenko. Theoretical and Empirical Analysis of
ReliefF and RReliefF. Machine, 53:23–69, 2003.
[SGLT12]
Ya Su, Xinbo Gao, Xuelong Li, and Dacheng Tao. Multivariate multilinear regression, 2012.
[SGM+ 10]
Ayellet V Segrè, Leif Groop, Vamsi K Mootha, Mark J Daly, and David Altshuler.
Common inherited variation in mitochondrial genes is not enriched for associations
with type 2 diabetes or related glycemic traits. PLoS genetics, 6, 2010.
[SKZ10]
Daniel F Schwarz, Inke R König, and Andreas Ziegler. On safari to Random Jungle:
a fast implementation of Random Forests for high-dimensional data. Bioinformatics
(Oxford, England), 26:1752–1758, 2010.
[Smi84]
R. L. Smith. Efficient Monte Carlo Procedures for Generating Points Uniformly
Distributed over Bounded Regions, 1984.
[Sri01]
Ashwin Srinivasan. The aleph manual. Machine Learning at the Computing Laboratory, Oxford University, 2001.
[SSDM09]
C. C A Spencer, Zhan Su, Peter Donnelly, and Jonathan Marchini. Designing
genome-wide association studies: Sample size, power, imputation, and the choice
of genotyping chip. PLoS Genetics, 5, 2009.
[Ste12]
Kristel Van Steen. Travelling the world of gene-gene interactions. Briefings in
bioinformatics, 13:1–19, 2012.
[SWS10]
Yun S Song, Fulton Wang, and Montgomery Slatkin. General epistatic models of
the risk of complex diseases. Genetics, 186:1467–1473, 2010.
[SZS+ 11]
Junliang Shang, Junying Zhang, Yan Sun, Dan Liu, Daojun Ye, and Yaling Yin.
Performance analysis of novel methods for detecting epistasis, 2011.
[TDR10a]
Stephen D Turner, Scott M Dudek, and Marylyn D Ritchie. ATHENA: A
knowledge-based hybrid backpropagation-grammatical evolution neural network algorithm for discovering epistasis among quantitative trait Loci. BioData mining,
3:5, 2010.
[TDR10b]
Stephen D Turner, Scott M Dudek, and Marylyn D Ritchie. Grammatical Evolution of Neural Networks for Discovering Epistasis among Quantitative Trait Loci.
Lecture Notes in Computer Science, 6023:86–97, 2010.
[TJZ06]
Michael W T Tanck, J Wouter Jukema, and Aeilko H Zwinderman. Simultaneous
estimation of gene-gene and gene-environment interactions for numerous loci using
double penalized log-likelihood. Genetic epidemiology, 30:645–651, 2006.
[VL12]
J. Verzani and M. F. Lawrence. Programming graphical user interfaces with R.
CRC Press, 2012.
[WBW05]
Geoffrey I. Webb, Janice R. Boughton, and Zhihai Wang. Not So Naive Bayes:
Aggregating One-Dependence Estimators, 2005.
[WDR+ 10]
Jing Wu, Bernie Devlin, Steven Ringquist, Massimo Trucco, and Kathryn Roeder.
Screen and clean: a tool for identifying interactions in genome-wide association
studies. Genetic epidemiology, 34:275–285, 2010.
[Weg60]
Peter Wegner. A technique for counting ones in a binary computer. Communications
of the ACM, 3(5):322, 1960.
[WFH11]
Ian H Witten, Eibe Frank, and Mark A Hall. Data Mining: Practical Machine
Learning Tools and Techniques (Google eBook). 2011.
[WLFW11] Yue Wang, Guimei Liu, Mengling Feng, and Limsoon Wong. An empirical comparison of several recent epistatic interaction detection methods. Bioinformatics
(Oxford, England), 27:2936–43, 2011.
[WLZH12]
Haitian Wang, Shaw-Hwa Lo, Tian Zheng, and Inchi Hu. Interaction-based feature
selection and classification for high-dimensional biological data. Bioinformatics
(Oxford, England), 28:2834–42, 2012.
[WYY+ 10a] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Xiaodan Fan, Nelson L S Tang,
and Weichuan Yu. BOOST: A fast approach to detecting gene-gene interactions in
genome-wide case-control studies. American journal of human genetics, 87:325–
340, 2010.
[WYY+ 10b] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Nelson L S Tang, and Weichuan
Yu. Predictive rule inference for epistatic interaction detection in genome-wide
association studies. Bioinformatics (Oxford, England), 26:30–37, 2010.
[WYY12]
Xiang Wan, Can Yang, and Weichuan Yu. Comments on ‘An empirical comparison of several recent epistatic interaction detection methods’. Bioinformatics,
28(1):145–146, 2012.
[YHW+ 09]
Can Yang, Zengyou He, Xiang Wan, Qiang Yang, Hong Xue, and Weichuan Yu.
SNPHarvester: a filtering-based approach for detecting epistatic interactions in
genome-wide association studies. Bioinformatics (Oxford, England), 25:504–511,
2009.
[YHZZ10]
Pengyi Yang, Joshua W K Ho, Albert Y Zomaya, and Bing B Zhou. A genetic
ensemble approach for gene-gene interaction identification. BMC bioinformatics,
11:524, 2010.
[YYWY11] Ling Sing Yung, Can Yang, Xiang Wan, and Weichuan Yu. GBOOST: a GPU-based tool for detecting gene-gene interactions in genome-wide case control studies.
Bioinformatics (Oxford, England), 27:1309–1310, 2011.
[ZB00]
H Zhang and G Bonney. Use of classification trees for association studies. Genetic
epidemiology, 19:323–332, 2000.
[Zha12]
Yu Zhang. A novel bayesian graphical model for genome-wide multi-SNP association mapping. Genetic Epidemiology, 36:36–47, 2012.
[ZHZW10]
Xiang Zhang, Shunping Huang, Fei Zou, and Wei Wang. TEAM: efficient two-locus
epistasis tests in human genome-wide association study. Bioinformatics (Oxford,
England), 26:i217–i227, 2010.
[ZL07]
Yu Zhang and Jun S Liu. Bayesian inference of epistatic interactions in case-control
studies. Nature genetics, 39:1167–1173, 2007.
Appendix A
Glossary
A.1
Biology related terms
• Allele - One of the alternative forms of a gene found at a given position of a specific chromosome.
• Cell - The most basic structural unit of any organism that is capable of independent functioning.
• Chromosome - Genetic material stored in the nucleus of eukaryotic cells. Contains all the
hereditary information. In Humans, there are 23 pairs of chromosomes.
• DNA - Deoxyribonucleic Acid. Molecule where the genetic material is stored. Is capable
of self-replication and RNA synthesis.
• Dominant Gene - The allele that manifests itself in the phenotype when more than one type of
allele is present in the genotype, as well as in homozygous cases. It might be dominant to one
allele but recessive to another.
• Epistasis - If various SNPs interact with each other to express a phenotype, that interaction
is called epistasis. There are 3 main types of Epistasis:
– Compositional Epistasis - How the relationship between genes works.
– Functional Epistasis - The direct effect of the function of a gene on another gene.
– Statistical Epistasis - The deviation of the combined effect of alleles at different loci from the sum of their individual effects.
• Eukaryote - A cell with a defined nucleus. Has a membrane that separates the nucleus of
the cell from the rest of its contents.
• Gene - The basic unit of hereditary information in DNA or RNA. May suffer mutations.
• Genotype - The genetic constitution of a specific trait. A combination of alleles in a corresponding match of chromosomes that determines a trait.
• GWAS - Genome Wide Association Study. Study of the entire genome to find SNPs that
are associated with specific traits. In this case, complex diseases.
• Heterozygous - A genotype composed of two different alleles.
• Homozygous - A genotype composed of two copies of the same allele.
• Locus - The place within the DNA where a given gene is located.
• Mutation - A change in a chromosome, either by the change of a gene or rearrangement of
a part of the chromosome.
• Nucleotide Bases - Different types of molecules that combine with each other to form DNA
and RNA.
– Adenine - Nucleotide base. Links to Thymine in DNA or Uracil in RNA.
– Cytosine - Nucleotide base. Links to Guanine.
– Guanine - Nucleotide base. Links to Cytosine.
– Thymine - Nucleotide base specific to DNA. Links to Adenine.
– Uracil - Nucleotide base specific to RNA. Links to Adenine.
• Phenotype - Expression of a specific trait. The manifestation of a certain gene or interaction
of various genes.
• Recessive Gene - Allele that does not manifest in the phenotype unless in homozygotic
cases. Might be recessive to one allele but dominant to another.
• Ribosome - Molecular machine used in protein synthesis from encoded RNA.
• RNA - Ribonucleic Acid. Molecule synthesized by DNA that expresses genes by being
transcribed by the ribosome.
• SNP - Single Nucleotide Polymorphism. A variation at a single position of the DNA sequence that differs between individuals of the same species.
A.2
Data mining terms
• Association Rules - Relations between variables that are relevant in a significant number of
instances.
• Bayesian Networks - Graphical model that represents the conditional dependencies between a set of random variables.
• Classification - A type of Data Mining prediction problem, where the determined value is
nominal. In specific cases, the determined variable is binary.
• Clustering - Tries to find similarities between instances and joins them together in groups,
or clusters.
• Data Mining - A vast field in computer science. Tries to identify patterns in big data sets.
• Data Set - A collection of data composed of Attributes (columns) and Instances (rows).
Attributes are different variables of the recorded data and each Instance is a new member of
the data set.
• Machine Learning - Area of Artificial Intelligence where algorithms and systems are developed to be able to learn from data. Can be used to solve problems such as Clustering,
Classification, Association Rules, Regression, etc.
• Model - Data Mining models are created by using a specific algorithm on a specific data
set. The result is a model specifically designed to predict or find relations in data, based on
the learned patterns of the data set used.
• Overfitting - When a model is too closely adapted to a subgroup of the data and does not
generalize to the whole data set or to future data.
• pre-processing - The adaptation of data to fit certain criteria, either by transforming the
data type or by reducing the dimensionality in attributes or instances.
– Filter methods - preprocessing methods that select subsets of variables independently
from the model creation algorithms.
– Wrapper methods - Scoring and transformation of subsets of variables to serve a
specific type of algorithm.
– Embedded Methods - Feature selection that occurs during the training of a given
model.
• Pruning - To cut a connection in tree-based methods because the branch is not relevant to
the final result. This increases the efficiency of the algorithm but can also wrongfully cut
significant branches.
• Regression - A type of DM prediction problem, where the determined value is continuous.
• Supervised method - A method trained on data in which the true values of the class variable are provided.
A.3
Lab Notes
Laboratory Note
Genetic Epistasis
I - Materials and methods
LN-1-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr Roberto Frias, s/n,
4200-465 PORTO
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]    www: http://www.fe.up.pt/∼ei09045
e-mail: [email protected]    www: http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
Based on literature results, we have selected 7 epistasis detection
methods. The selected methods were empirically evaluated and compared using data generated with genomeSimla to simulate genome-wide
studies at a smaller scale. The simulated data includes 270 different configurations of data sets to simulate a wide array of disease
models. The selected algorithms are BEAM 3.0, BOOST, MBMDR,
Screen and Clean, SNPRuler, SNPHarvester, and TEAM. These algorithms are evaluated according to their Power, scalability and Type
I Error Rate.
1
Introduction
Genetic predisposition to disease has been researched for a long time.
However, most early studies focused only on single-SNP analyses to
determine disease predisposition. This is not sufficient for most complex
diseases. Generally, such diseases involve thousands or millions of SNPs
interacting with each other on a large scale. Due to the complexity of these
interactions, the computational cost of epistasis detection was prohibitive
until recently.
The main objective of the following experiments is to empirically evaluate
the following algorithms: BEAM 3.0 [Zha12], BOOST [WYY+ 10a], MBMDR
[MVV11], Screen and Clean [WDR+ 10], SNPRuler [WYY+ 10b], SNPHarvester [YHW+ 09], and TEAM [ZHZW10].
These algorithms will be evaluated according to their Power, scalability,
and Type I Error Rate. Each algorithm will be executed with many data
sets that simulate diseases with many different parameters.
These data sets are generated with genomeSimla, an open source data generator that contains many useful parameters to realistically simulate complex
diseases.
The structure of the rest of the lab note consists of a brief description of the
data sets that were used in the experiments, including the application used
to generate them in Section 2. Section 3 includes a description of the evaluation measures used in these experiments. Section 4 contains the experimental
methodology followed in these experiments. Section 5 is the summary of the
experiments that will be detailed in the next lab notes.
2
The Data sets for the Experiments
The data sets were created specifically for these experiments. The program used for their generation was genomeSimla
[EBT+ 08]. In total, 270 different configurations were generated. Each configuration consists of 100 data sets, which means that each algorithm was
executed 27000 times.
Data Generation Application
The data generation application used for these experiments was genomeSimla. Due to its ability
to evolve a population and achieve the desired allele frequencies, with any number of SNPs
distributed over as many chromosomes as desired, genomeSimla is an adequate application for this
kind of experiment. The evolution of the population can follow linear, exponential, or logistic
growth, the last being the preferred model.
Aside from generating and evolving a population for as many iterations as required, genomeSimla
allows the allele frequencies across the population to be observed and used to decide which SNPs
should be allocated, with the number of chromosomes and blocks of SNPs per chromosome for each
individual chosen a priori.
After the generation of the population according to the selected parameters,
genomeSimla can then be used to generate datasets, sampling from the population pool with as many individuals as necessary. The disease model can
be further customized, with the desired odds ratio, prevalence of the disease,
and type of disease model. Based on these values, a penetrance table is generated for each desired parameter.
• Allele Frequency - The frequency of the minor allele of the disease
SNPs.
• Population - Number of individuals sampled in the data set.
• Disease Model - Type of disease model: main effect, epistasis interaction, and full effect.
• Odds ratio - Relation between disease SNPs. Probability of one disease
SNP being present, given the presence of the other disease SNP.
• Prevalence - The proportion of a population with the disease. Affects
the number of cases and controls in a data set.
With this data, data sets can be generated, using a configuration file, embedding the disease model into the desired alleles.
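To make the role of the penetrance table concrete, the following Python sketch shows how such a table can turn a pair of disease-SNP genotypes into a case/control label. The penetrance values used here are purely illustrative; in the experiments they are produced by genomeSimla from the chosen allele frequency, odds ratio, and prevalence.

import random

# Illustrative two-locus penetrance table: entry [a][b] is the probability of
# being affected given genotype a at disease SNP A and genotype b at SNP B
# (0 = homozygous dominant, 1 = heterozygous, 2 = homozygous recessive).
# The numbers are made up for the example.
PENETRANCE = [
    [0.01, 0.01, 0.01],
    [0.01, 0.02, 0.04],
    [0.01, 0.04, 0.08],
]

def label_individual(geno_a, geno_b, rng=random):
    # Draw the affection status from the penetrance probability of this cell.
    return 1 if rng.random() < PENETRANCE[geno_a][geno_b] else 0

print([label_individual(a, b) for a, b in [(0, 0), (1, 2), (2, 2)]])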
Data Set
The data sets were created using many different parameters to maximize
the diversity of disease models and to assess which algorithms are best suited to which
scenarios. The data consist of simulated genotypes and phenotypes.
For each individual, the attributes are the genotypes associated with each
SNP, with three possible states: homozygous dominant, heterozygous, and homozygous recessive. The label is binary, corresponding to an affected or
unaffected individual.
In each data set, a total of two pairs of chromosomes were generated. The
first chromosome contains 20 blocks of 10 SNPs and the second contains 10
blocks of 10 SNPs, giving 300 SNPs in total. Two disease alleles are
placed on different chromosomes, according to the desired allele frequency.
The generated data sets contain three different numbers of individuals: 500, 1000,
and 2000. The disease alleles take five different minor allele frequencies: 0.01, 0.05, 0.1, 0.3, and 0.5. Three different disease models are
used: data sets with marginal effects and no epistatic relations, without
marginal effects and with epistatic relations, and with both marginal effects and
epistatic relations. The odds ratio associated with both disease-related alleles
is 1.1, 1.5, or 2.0. The prevalence of the disease is also configured to
either 0.0001 or 0.02, which also influences the number of cases and controls.
3
Evaluation Measures
The evaluation measures used in these experiments are Power, scalability, and Type I Error Rate.
Power is evaluated, for each configuration, as the number of data sets, out of 100, in which
the ground-truth interaction is both statistically relevant according to the χ2 test and identified as the most significant SNP pair, considering
that SNPs are ranked according to their importance to the phenotype using
statistical hypothesis tests.
Scalability is evaluated as the average running time per data set
within each configuration.
The Type I Error Rate is the proportion, out of the 100 data sets of each configuration, in which non-disease-related SNPs are classified as a statistically
relevant SNP pair according to the χ2 test.
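As an illustration only (these are not the original evaluation scripts), the Python sketch below shows one way the two measures can be computed from per-data-set rankings: each run yields SNP pairs with χ2 values, the ground-truth pair counts towards Power when it is the top-ranked pair and significant at α < 0.05, and a data set counts towards the Type I Error Rate when any non-disease pair reaches significance. The 4 degrees of freedom assume a 3 × 3 genotype table per SNP pair.

from scipy.stats import chi2

ALPHA = 0.05
DF = 4  # assumed: pairwise 3x3 genotype table, (3-1)*(3-1) degrees of freedom

def is_significant(chi2_value):
    return chi2.sf(chi2_value, DF) < ALPHA

def power_and_type1(rankings, ground_truth):
    """rankings: one list of (snp_a, snp_b, chi2_value) per data set."""
    truth = set(ground_truth)
    hits = errors = 0
    for ranking in rankings:
        if not ranking:
            continue
        top = max(ranking, key=lambda r: r[2])
        if is_significant(top[2]) and {top[0], top[1]} == truth:
            hits += 1        # ground truth found as the most significant pair
        if any(is_significant(c) for a, b, c in ranking if {a, b} != truth):
            errors += 1      # at least one non-disease pair passed the test
    return 100.0 * hits / len(rankings), 100.0 * errors / len(rankings)

# usage: power, type1 = power_and_type1(per_dataset_rankings, ("SNP135", "SNP230"))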
4
Experimental Methodology
Initially, the population for the data sets is generated using genomeSimla,
following a logistic growth model with an initial population of 10000 individuals and a maximum capacity of 1000000. The population used
for the data sets is picked from the reported generations, based on the allele frequencies desired for the experiment; generation 1750 was selected for
this purpose. Two SNPs are selected for each configuration. The SNPs, selected
according to their minor allele frequency (MAF), were as follows:
• MAF 0.01 - SNP112 and SNP267
• MAF 0.05 - SNP4 and SNP239
• MAF 0.1 - SNP135 and SNP230
• MAF 0.3 - SNP197 and SNP266
• MAF 0.5 - SNP80 and SNP229
The first 200 SNPs belong to chromosome 1, whereas the last 100 correspond
to chromosome 2. The tables with all the allele frequencies can be seen
in the annexes: Table 1 lists the chromosome 1 allele frequencies and Table 2
the chromosome 2 allele frequencies.
The penetrance tables are created from the allele frequencies in the population, following the configurations discussed earlier. The data sets
are then created, using each unique configuration file to generate 100 data sets, covering all the configurations mentioned before.
With the data sets generated, the algorithms are first tested on the most extreme
configurations (minimum and maximum MAF) to check whether the results are valid.
After asserting the validity of the experiment, all algorithms are executed on all configurations to analyze the potential of each algorithm.
For each algorithm on each data set, a file containing the SNPs ranked according to statistical relevance is generated, together with information about
the time and memory used in each execution. The Power and
Type I Error Rates are taken from the results that reach a statistical
significance of α < 0.05.
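The lab notes do not show the measurement harness itself; the sketch below is one possible way to record the running time and peak memory of each execution on Linux, by launching the algorithm as a child process (the command line in the usage comment is hypothetical).

import resource
import subprocess
import time

def run_and_measure(cmd):
    # Wall-clock time of the child process, plus the peak resident memory
    # (in kilobytes on Linux) accumulated over all child processes so far.
    start = time.time()
    subprocess.run(cmd, check=True)
    elapsed = time.time() - start
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_kb

# usage (hypothetical command line):
# seconds, kb = run_and_measure(["./algorithm", "dataset_001.txt"])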
The computer used for these experiments ran the 64-bit Ubuntu 13.10 operating system, with an Intel(R) Core(TM)2 Quad Q6600 CPU at 2.40GHz
and 8.00 GB of RAM.
A
Loci Frequencies
Chromosome 1
Table 1: Allele frequencies of the generated population
for chromosome 1.
Label
RL0-1
RL0-2
RL0-3
Freq Al1
0.704448
0.467747
0.856627
Freq Al2
0.295552
0.532253
0.143373
4
Map Dist.
0.0002523144
3.65488E-006
3.86582E-006
Position
253
256
259
RL0-4
RL0-5
RL0-6
RL0-7
RL0-8
RL0-9
RL0-10
RL0-11
RL0-12
RL0-13
RL0-14
RL0-15
RL0-16
RL0-17
RL0-18
RL0-19
RL0-20
RL0-21
RL0-22
RL0-23
RL0-24
RL0-25
RL0-26
RL0-27
RL0-28
RL0-29
RL0-30
RL0-31
RL0-32
RL0-33
RL0-34
RL0-35
RL0-36
RL0-37
RL0-38
RL0-39
RL0-40
RL0-41
0.94761
0.747191
0.868644
0.869881
0.634084
0.616899
0.603205
0.951322
0.928004
0.7257
0.547945
0.735312
0.983344
0.809402
0.908173
0.628892
0.824863
0.640543
0.542639
0.776321
0.925422
0.596454
0.80071
0.712163
0.91426
0.902589
0.933652
0.486126
0.553701
0.887238
0.93165
0.887583
0.824546
1
0.817039
0.762831
0.623942
0.886716
0.05239
0.252809
0.131356
0.130119
0.365916
0.383101
0.396795
0.048678
0.071996
0.2743
0.452055
0.264688
0.016655
0.190598
0.091827
0.371108
0.175137
0.359457
0.457361
0.223679
0.074578
0.403546
0.19929
0.287837
0.08574
0.097411
0.066348
0.513874
0.446299
0.112762
0.06835
0.112417
0.175454
0
0.182961
0.237169
0.376058
0.113284
5
1.18175E-006
1.23056E-006
5.41858E-006
9.49181E-006
1.72337E-006
4.81936E-006
8.88582E-006
0.000118908
9.53558E-006
3.20447E-006
3.96875E-006
5.03938E-006
4.10188E-006
7.11582E-006
2.25726E-006
2.13406E-006
9.99491E-006
0.000229233
5.61457E-006
5.05623E-006
4.39722E-006
9.48707E-006
7.38516E-006
3.95139E-006
5.07943E-006
0.000006668
2.4885E-006
0.000296081
8.33422E-006
4.95048E-006
5.32692E-006
2.23131E-006
5.40611E-006
7.03837E-006
1.44855E-006
9.89044E-006
2.53856E-006
0.0003574
260
261
266
275
276
280
288
406
415
418
421
426
430
437
439
441
450
679
684
689
693
702
709
712
717
723
725
1021
1029
1033
1038
1040
1045
1052
1053
1062
1064
1421
RL0-42
RL0-43
RL0-44
RL0-45
RL0-46
RL0-47
RL0-48
RL0-49
RL0-50
RL0-51
RL0-52
RL0-53
RL0-54
RL0-55
RL0-56
RL0-57
RL0-58
RL0-59
RL0-60
RL0-61
RL0-62
RL0-63
RL0-64
RL0-65
RL0-66
RL0-67
RL0-68
RL0-69
RL0-70
RL0-71
RL0-72
RL0-73
RL0-74
RL0-75
RL0-76
RL0-77
RL0-78
RL0-79
0.603873
0.708144
0.722182
0.59756
0.810217
0.679944
0.467092
0.518637
0.918397
0.979136
0.571337
0.615734
0.695586
0.660442
0.910148
0.445087
0.470733
0.858588
0.681468
0.870466
0.646194
0.763207
0.931087
0.7151
0.670911
0.888122
0.694165
0.864311
0.838895
0.823928
0.583947
0.841979
0.6003
0.892639
0.761561
0.900447
0.599257
0.972086
0.396127
0.291856
0.277818
0.40244
0.189783
0.320056
0.532908
0.481363
0.081603
0.020864
0.428663
0.384266
0.304414
0.339558
0.089852
0.554913
0.529267
0.141412
0.318532
0.129534
0.353806
0.236793
0.068913
0.2849
0.329089
0.111878
0.305835
0.135689
0.161105
0.176073
0.416053
0.158021
0.3997
0.107361
0.238439
0.099553
0.400743
0.027914
6
5.73344E-006
7.18489E-006
6.17693E-006
6.57155E-006
3.25347E-006
8.28564E-006
4.45383E-006
2.97358E-006
4.58774E-006
0.0003772277
3.32175E-006
2.23233E-006
2.98606E-006
4.02315E-006
5.9643E-006
6.82648E-006
9.5693E-006
0.000004337
2.15272E-006
0.0003573156
7.20147E-006
3.17006E-006
8.01624E-006
6.35415E-006
2.38872E-006
2.52589E-006
0.000008364
7.35972E-006
2.46709E-006
0.0001992617
6.33832E-006
9.79685E-006
7.07911E-006
5.16523E-006
2.85138E-006
1.53824E-006
3.89272E-006
6.53018E-006
1426
1433
1439
1445
1448
1456
1460
1462
1466
1843
1846
1848
1850
1854
1859
1865
1874
1878
1880
2237
2244
2247
2255
2261
2263
2265
2273
2280
2282
2481
2487
2496
2503
2508
2510
2511
2514
2520
RL0-80
RL0-81
RL0-82
RL0-83
RL0-84
RL0-85
RL0-86
RL0-87
RL0-88
RL0-89
RL0-90
RL0-91
RL0-92
RL0-93
RL0-94
RL0-95
RL0-96
RL0-97
RL0-98
RL0-99
RL0-100
RL0-101
RL0-102
RL0-103
RL0-104
RL0-105
RL0-106
RL0-107
RL0-108
RL0-109
RL0-110
RL0-111
RL0-112
RL0-113
RL0-114
RL0-115
RL0-116
RL0-117
0.560663
0.554206
0.93403
0.542574
0.837702
0.909783
0.91318
0.725569
0.90355
0.716186
0.612835
0.582162
0.83582
0.558802
0.86217
0.617906
0.801595
0.676978
0.738348
0.591386
0.521751
0.508844
0.565387
0.479309
0.745518
0.532452
0.935416
0.662617
0.658306
0.712991
0.665501
0.568289
0.98671
0.79789
0.553154
0.667399
0.700185
0.610748
0.439337
0.445794
0.06597
0.457426
0.162298
0.090217
0.08682
0.274431
0.09645
0.283814
0.387165
0.417838
0.16418
0.441198
0.13783
0.382094
0.198405
0.323022
0.261652
0.408614
0.478249
0.491156
0.434613
0.520691
0.254482
0.467548
0.064584
0.337383
0.341694
0.287009
0.334499
0.431711
0.01329
0.20211
0.446846
0.332601
0.299815
0.389252
7
8.62124E-006
0.000199997
8.61757E-006
9.10087E-006
1.23079E-006
6.84162E-006
4.48263E-006
0.000001848
2.79894E-006
4.00443E-006
6.94976E-006
0.0003616833
0.000009529
9.02466E-006
5.29547E-006
7.09319E-006
6.73657E-006
6.97316E-006
7.87644E-006
3.67391E-006
4.20054E-006
9.09917E-005
9.41043E-006
7.40872E-006
3.35237E-006
4.28727E-006
9.89425E-006
8.74864E-006
2.01241E-006
5.8733E-006
6.69027E-006
8.718047E-005
8.66949E-006
5.05033E-006
9.60618E-006
6.92172E-006
9.52134E-006
5.60877E-006
2528
2727
2735
2744
2745
2751
2755
2756
2758
2762
2768
3129
3138
3147
3152
3159
3165
3171
3178
3181
3185
3275
3284
3291
3294
3298
3307
3315
3317
3322
3328
3415
3423
3428
3437
3443
3452
3457
RL0-118
RL0-119
RL0-120
RL0-121
RL0-122
RL0-123
RL0-124
RL0-125
RL0-126
RL0-127
RL0-128
RL0-129
RL0-130
RL0-131
RL0-132
RL0-133
RL0-134
RL0-135
RL0-136
RL0-137
RL0-138
RL0-139
RL0-140
RL0-141
RL0-142
RL0-143
RL0-144
RL0-145
RL0-146
RL0-147
RL0-148
RL0-149
RL0-150
RL0-151
RL0-152
RL0-153
RL0-154
RL0-155
0.661102
0.820744
0.912926
0.68335
0.707937
0.589477
0.745493
0.698088
0.424467
0.787719
0.860644
0.638396
0.731953
0.744233
1
0.771704
0.878927
0.90145
0.648369
0.80335
0.856866
0.615518
0.788087
0.678961
0.771435
0.503258
0.795211
0.490144
0.488492
0.667302
0.643159
0.673992
0.788535
0.781059
0.502629
0.466542
0.538982
0.841056
0.338898
0.179256
0.087073
0.31665
0.292063
0.410523
0.254507
0.301912
0.575533
0.212281
0.139356
0.361604
0.268047
0.255766
0
0.228296
0.121073
0.09855
0.351631
0.19665
0.143134
0.384482
0.211913
0.321039
0.228565
0.496742
0.204789
0.509856
0.511508
0.332698
0.356841
0.326008
0.211465
0.218941
0.497371
0.533458
0.461018
0.158944
8
6.63784E-006
3.09427E-006
4.1968E-006
0.0003871028
5.00312E-006
2.13525E-006
9.8212E-006
7.02674E-006
5.18827E-006
4.74483E-006
5.22368E-006
3.96526E-006
8.71207E-006
0.0002181738
1.69539E-006
9.71469E-006
0.000002233
4.28905E-006
0.00000754
8.70869E-006
9.44719E-006
3.60345E-006
0.000002436
0.0002748812
5.86447E-006
3.67578E-006
2.75252E-006
4.10642E-006
4.30833E-006
7.3961E-006
2.3613E-006
9.5407E-006
5.39342E-006
0.0002359844
5.62238E-006
2.22743E-006
3.21068E-006
2.43989E-006
3463
3466
3470
3857
3862
3864
3873
3880
3885
3889
3894
3897
3905
4123
4124
4133
4135
4139
4146
4154
4163
4166
4168
4442
4447
4450
4452
4456
4460
4467
4469
4478
4483
4718
4723
4725
4728
4730
RL0-156
RL0-157
RL0-158
RL0-159
RL0-160
RL0-161
RL0-162
RL0-163
RL0-164
RL0-165
RL0-166
RL0-167
RL0-168
RL0-169
RL0-170
RL0-171
RL0-172
RL0-173
RL0-174
RL0-175
RL0-176
RL0-177
RL0-178
RL0-179
RL0-180
RL0-181
RL0-182
RL0-183
RL0-184
RL0-185
RL0-186
RL0-187
RL0-188
RL0-189
RL0-190
RL0-191
RL0-192
RL0-193
0.462765
0.90605
0.681072
0.596135
0.855496
0.727272
0.774272
0.791941
0.644252
0.549582
0.428749
0.376485
0.535948
0.514295
0.700045
0.571955
0.586523
0.783275
0.610016
0.866664
0.75876
0.600093
0.577467
0.789476
0.590153
0.422633
0.526449
0.83354
0.737217
0.650092
0.56464
0.717536
0.961919
0.84241
0.817398
1
1
0.709334
0.537235
0.09395
0.318928
0.403865
0.144504
0.272728
0.225728
0.208059
0.355748
0.450418
0.571251
0.623515
0.464052
0.485705
0.299955
0.428045
0.413477
0.216725
0.389985
0.133336
0.24124
0.399907
0.422533
0.210524
0.409847
0.577367
0.473551
0.16646
0.262783
0.349908
0.43536
0.282464
0.038081
0.15759
0.182602
0
0
0.290666
9
7.40954E-006
3.96506E-006
2.10963E-006
6.71541E-006
0.00000768
0.0002969833
2.62789E-006
6.76876E-006
0.000005599
8.32549E-006
8.10471E-006
9.96927E-006
9.47661E-006
3.16517E-006
5.98168E-006
0.0003862553
2.88618E-006
7.29982E-006
9.43182E-006
7.05865E-006
7.56181E-006
1.005344E-005
2.42474E-006
6.1728E-006
5.99256E-006
9.624393E-005
1.007159E-005
3.23814E-006
8.58028E-006
9.27841E-006
5.87977E-006
3.16557E-006
2.93894E-006
8.25314E-006
4.0069E-006
0.0002386956
4.98276E-006
0.000002811
4737
4740
4742
4748
4755
5051
5053
5059
5064
5072
5080
5089
5098
5101
5106
5492
5494
5501
5510
5517
5524
5534
5536
5542
5547
5643
5653
5656
5664
5673
5678
5681
5683
5691
5695
5933
5937
5939
RL0-194
RL0-195
RL0-196
RL0-197
RL0-198
RL0-199
RL0-200
0.78411
0.932612
0.865947
0.725338
0.795964
0.583016
0.803726
0.21589
0.067388
0.134053
0.274662
0.204036
0.416984
0.196274
0.000008052
2.89373E-006
8.6839E-006
5.21764E-006
7.8731E-006
4.61094E-006
8.37366E-006
5947
5949
5957
5962
5969
5973
5981
Chromosome 2
Table 2: Allele frequencies of the generated population
for chromosome 2.
Label Freq Al1
RL1-201 0.893976
RL1-202 0.584141
RL1-203 0.422083
RL1-204
0.73351
RL1-205 0.694034
RL1-206 0.765355
RL1-207 0.965014
RL1-208 0.668517
RL1-209 0.634885
RL1-210 0.725027
RL1-211 0.698398
RL1-212 0.595985
RL1-213 0.710597
RL1-214 0.663247
RL1-215
0.75663
RL1-216 0.936743
RL1-217 0.663784
RL1-218 0.680104
RL1-219 0.688756
RL1-220
0.9333
RL1-221 0.742415
RL1-222 0.799322
RL1-223 0.709122
RL1-224 0.565597
Freq Al2
0.106024
0.415859
0.577917
0.26649
0.305966
0.234645
0.034986
0.331483
0.365115
0.274973
0.301602
0.404015
0.289403
0.336753
0.24337
0.063257
0.336216
0.319896
0.311244
0.0667
0.257585
0.200678
0.290879
0.434403
10
Map Dist.
Position
0.0003986369
399
2.05934E-006
401
0.000005955
406
5.58855E-006
411
4.1723E-006
415
2.06415E-006
417
7.44318E-006
424
9.60649E-006
433
8.56251E-006
441
6.14954E-006
447
7.386583E-005
520
9.7547E-006
529
1.58667E-006
530
4.37889E-006
534
7.38782E-006
541
8.35938E-006
549
1.64064E-006
550
9.16445E-006
559
0.000007628
566
7.01934E-006
573
0.0003420352
915
4.01391E-006
919
0.000002737
921
6.28353E-006
927
RL1-225
RL1-226
RL1-227
RL1-228
RL1-229
RL1-230
RL1-231
RL1-232
RL1-233
RL1-234
RL1-235
RL1-236
RL1-237
RL1-238
RL1-239
RL1-240
RL1-241
RL1-242
RL1-243
RL1-244
RL1-245
RL1-246
RL1-247
RL1-248
RL1-249
RL1-250
RL1-251
RL1-252
RL1-253
RL1-254
RL1-255
RL1-256
RL1-257
RL1-258
RL1-259
RL1-260
RL1-261
RL1-262
0.863029
0.752561
0.676998
0.840474
0.49346
0.910095
0.960868
0.933743
0.760953
0.748072
0.663473
0.964783
0.905525
0.691349
0.951645
0.989216
0.738781
0.795527
0.795563
0.703822
0.57285
0.767369
0.645825
0.802402
0.944397
0.622399
0.630848
0.818129
0.484804
0.676497
0.880815
0.959511
0.784072
0.52286
0.623466
0.874709
0.803013
0.545178
0.136971
0.247439
0.323002
0.159526
0.50654
0.089905
0.039132
0.066257
0.239047
0.251928
0.336527
0.035217
0.094475
0.308651
0.048355
0.010784
0.261219
0.204473
0.204437
0.296178
0.42715
0.232631
0.354175
0.197598
0.055603
0.377601
0.369152
0.181871
0.515196
0.323503
0.119185
0.040489
0.215928
0.47714
0.376534
0.125291
0.196987
0.454822
11
9.64911E-006
6.74076E-006
1.004539E-005
1.71067E-006
0.000001589
7.41687E-006
0.0002261121
1.91042E-006
4.80473E-006
7.04549E-006
8.21959E-006
2.82873E-006
0.000007663
5.04876E-006
1.59639E-006
8.73616E-006
0.0003243203
8.89964E-006
2.02264E-006
3.36477E-006
9.19778E-006
7.29139E-006
2.43094E-006
1.73925E-006
9.46653E-006
8.82309E-006
0.0002217582
1.91494E-006
1.6334E-006
1.59652E-006
3.35782E-006
2.75846E-006
3.03069E-006
6.06819E-006
6.91131E-006
7.25071E-006
0.000331411
4.47452E-006
936
942
952
953
954
961
1187
1188
1192
1199
1207
1209
1216
1221
1222
1230
1554
1562
1564
1567
1576
1583
1585
1586
1595
1603
1824
1825
1826
1827
1830
1832
1835
1841
1847
1854
2185
2189
RL1-263
RL1-264
RL1-265
RL1-266
RL1-267
RL1-268
RL1-269
RL1-270
RL1-271
RL1-272
RL1-273
RL1-274
RL1-275
RL1-276
RL1-277
RL1-278
RL1-279
RL1-280
RL1-281
RL1-282
RL1-283
RL1-284
RL1-285
RL1-286
RL1-287
RL1-288
RL1-289
RL1-290
RL1-291
RL1-292
RL1-293
RL1-294
RL1-295
RL1-296
RL1-297
RL1-298
RL1-299
RL1-300
0.815965
0.818366
0.724692
0.68352
0.989999
0.985774
0.642113
0.464929
0.734131
0.632632
0.553081
0.764977
0.464551
0.851137
0.739427
0.555538
0.551021
0.593129
0.79749
0.848332
0.812696
0.715573
0.578981
0.786632
0.64689
0.600677
0.552264
0.836774
0.910408
0.705616
0.833055
0.55822
0.684736
0.973315
0.676965
0.698511
0.514109
0.895842
0.184035
0.181634
0.275308
0.31648
0.010001
0.014226
0.357887
0.535071
0.265869
0.367368
0.446919
0.235023
0.535449
0.148863
0.260573
0.444462
0.448979
0.406871
0.20251
0.151668
0.187304
0.284426
0.421019
0.213368
0.35311
0.399323
0.447736
0.163226
0.089592
0.294384
0.166945
0.44178
0.315264
0.026685
0.323035
0.301489
0.485891
0.104158
12
4.89193E-006
1.90565E-006
1.45521E-006
1.001287E-005
3.48414E-006
9.2895E-006
4.82072E-006
0.000002507
0.0003870134
8.94209E-006
0.000004175
5.60863E-006
9.08894E-006
1.002911E-005
9.30477E-006
6.07683E-006
1.71434E-006
6.07637E-006
0.0002724436
7.00277E-006
5.19928E-006
6.3489E-006
3.26024E-006
0.000008282
8.91268E-006
2.59076E-006
7.46941E-006
7.75812E-006
6.30604E-005
9.07012E-006
7.11207E-006
9.56152E-006
2.78967E-006
9.43315E-006
5.50475E-006
4.26716E-006
9.42184E-006
9.44663E-006
2193
2194
2195
2205
2208
2217
2221
2223
2610
2618
2622
2627
2636
2646
2655
2661
2662
2668
2940
2947
2952
2958
2961
2969
2977
2979
2986
2993
3056
3065
3072
3081
3083
3092
3097
3101
3110
3119
References
[EBT+ 08]
Todd L Edwards, William S Bush, Stephen D Turner, Scott M
Dudek, Eric S Torstenson, Mike Schmidt, Eden Martin, and
Marylyn D Ritchie. Generating Linkage Disequilibrium Patterns in Data Simulations using genomeSIMLA. Lecture Notes
in Computer Science, 4973:24–35, 2008.
[MVV11]
Jestinah M Mahachie John, Francois Van Lishout, and Kristel
Van Steen. Model-Based Multifactor Dimensionality Reduction
to detect epistasis for quantitative traits in the presence of errorfree and noisy data. Eur J Hum Genet, 19(6):696–703, June
2011.
[WDR+ 10]
Jing Wu, Bernie Devlin, Steven Ringquist, Massimo Trucco,
and Kathryn Roeder. Screen and clean: a tool for identifying
interactions in genome-wide association studies. Genetic epidemiology, 34:275–285, 2010.
[WYY+ 10a] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Xiaodan Fan,
Nelson L S Tang, and Weichuan Yu. BOOST: A fast approach
to detecting gene-gene interactions in genome-wide case-control
studies. American journal of human genetics, 87:325–340, 2010.
[WYY+ 10b] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Nelson L S
Tang, and Weichuan Yu. Predictive rule inference for epistatic
interaction detection in genome-wide association studies. Bioinformatics (Oxford, England), 26:30–37, 2010.
[YHW+ 09]
Can Yang, Zengyou He, Xiang Wan, Qiang Yang, Hong Xue,
and Weichuan Yu. SNPHarvester: a filtering-based approach
for detecting epistatic interactions in genome-wide association
studies. Bioinformatics (Oxford, England), 25:504–511, 2009.
[Zha12]
Yu Zhang. A novel bayesian graphical model for genome-wide
multi-SNP association mapping. Genetic Epidemiology, 36:36–
47, 2012.
[ZHZW10]
Xiang Zhang, Shunping Huang, Fei Zou, and Wei Wang.
TEAM: efficient two-locus epistasis tests in human genome-wide
association study. Bioinformatics (Oxford, England), 26:i217–
i227, 2010.
13
Laboratory Note
Genetic Epistasis
II - Assessing Algorithm BEAM 3.0
LN-2-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr Roberto Frias, s/n,
4200-465 PORTO
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www : http://www.fe.up.pt/∼ei09045
[email protected]
www : http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
In this lab note, the algorithm BEAM 3.0 is presented and tested
for main effect detection. This is a Bayesian algorithm that creates
a graph relating SNPs to each other and to the disease
expression. The results reveal a high detection rate for data sets
with higher allele frequencies. Detection also improves with population size,
but so do the Type I Error Rates, so Power values are
nearly matched by the error rates. The algorithm scales well
on the data sets used and may scale to large genome-wide
association studies.
1
Introduction
Bayesian Epistasis Association Mapping (BEAM) [ZL07] is a stochastic algorithm that uses Markov chain Monte Carlo (MCMC) [ADH10] to
compute posterior probabilities that each marker is associated with the disease
phenotype.
Instead of the standard χ2 statistic used for epistasis detection, BEAM uses
a new B statistic, defined by:

B_M = ln[ P_A(D_M, U_M) / P_0(D_M, U_M) ] = ln{ P_join(D_M) [P_ind(U_M) + P_join(U_M)] / [ P_ind(D_M, U_M) + P_join(D_M, U_M) ] }    (1)

where M represents a set of k markers, capturing different complexities
of interactions, D_M and U_M are the genotype data at the markers in M for cases and controls,
respectively, and P_0(D_M, U_M) and P_A(D_M, U_M) are the Bayes factors of the null and alternative models. P_ind is the distribution that assumes independence among the markers in M and P_join is a saturated
joint distribution of genotype combinations among all markers in M.
BEAM3 [Zha12] adds flexibility for multi-SNP associations and high-order interactions by using graphs, reducing the complexity and increasing the Power,
and produces cleaner results with improved mapping sensitivity
and specificity.
Initially, the disease graph is built based on the probability that a given genotype configuration is related to the phenotype, considering the frequencies
of that genotype in controls and cases. Cliques (non-overlapping groups of
SNPs) are then generated based on the disease-related SNPs. A joint probability model and MCMC are used to update the disease graph and create
undirected edges between dependent SNPs.
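The joint probability model and proposal moves of BEAM3 are beyond the scope of this note; purely for intuition, the toy Metropolis-Hastings sketch below samples which SNPs are marked as disease-associated under a placeholder log-score. It is not BEAM3's actual model or sampler.

import math
import random

def toy_log_score(selected, per_snp_evidence):
    # Placeholder stand-in for a log-posterior: evidence of the selected SNPs
    # minus a penalty on the number of selected SNPs.
    return sum(per_snp_evidence[i] for i in selected) - 2.0 * len(selected)

def mh_sample(per_snp_evidence, iterations=1000, seed=0):
    rng = random.Random(seed)
    state = set()
    score = toy_log_score(state, per_snp_evidence)
    for _ in range(iterations):
        snp = rng.randrange(len(per_snp_evidence))  # propose flipping one SNP in or out
        proposal = state ^ {snp}
        new_score = toy_log_score(proposal, per_snp_evidence)
        if new_score >= score or rng.random() < math.exp(new_score - score):
            state, score = proposal, new_score       # Metropolis accept step
    return state

print(mh_sample([0.1, 5.0, 0.3, 7.2, 0.2]))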
1.1
Input files
The input file contains the phenotypes of all the individuals in the first row
and the genotypes of each SNP on the subsequent rows.
1.2
Output files
The algorithm outputs three files: a posterior file, a g.dot file, and chi.txt. The
posterior file contains the posterior probabilities of marginal and interaction associations per SNP. The g.dot file contains the disease graph and requires graph visualization software, such as GraphViz, to be viewed. The chi.txt file contains
the chi-square results, together with allele counts.

ID    Chr    Pos    0 1 0 0 1
rs1   chr1   1      1 0 2 0 1
rs2   chr1   2      1 2 1 1 0
rs3   chr1   3      1 2 2 0 1

Table 1: An example of the input file, containing the index of the SNPs,
the chromosome they belong to, and the position of each SNP. The first row corresponds to the phenotype of all individuals and the subsequent rows correspond to the
genotype of each SNP for all individuals.
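As a convenience sketch based only on the format described above and in Table 1 (it is not taken from the BEAM3 distribution), the following Python function writes a genotype matrix in that layout; the exact separators or header expected by BEAM3 may differ.

def write_beam_input(path, phenotypes, snps):
    """phenotypes: list of 0/1 labels; snps: list of (snp_id, chrom, pos, genotypes)."""
    with open(path, "w") as out:
        # First row: column names followed by the phenotype of each individual.
        out.write("ID Chr Pos " + " ".join(str(p) for p in phenotypes) + "\n")
        # One row per SNP: identifier, chromosome, position, then genotypes (0/1/2).
        for snp_id, chrom, pos, genotypes in snps:
            out.write(f"{snp_id} {chrom} {pos} "
                      + " ".join(str(g) for g in genotypes) + "\n")

write_beam_input("beam_input.txt", [0, 1, 0, 0, 1],
                 [("rs1", "chr1", 1, [1, 0, 2, 0, 1]),
                  ("rs2", "chr1", 2, [1, 2, 1, 1, 0]),
                  ("rs3", "chr1", 3, [1, 2, 2, 0, 1])])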
1.3
Parameters
There are some options available to the user:
• "-filter k": tells the program to filter out SNPs with too many missing
genotypes.
• "-sample burnin mcmc": specifies the number of sampling iterations
used by the MCMC. The default value is 100.
• "-prior p": specifies how likely each SNP is to be associated with the disease.
By default, p = 5/L, where L is the number of SNPs.
• "-T t": specifies the temperature at which the MCMC starts running.
With a high temperature, the program can jump out of local modes
in few iterations; however, it can make the program very slow in
the first iterations.
2
Experimental Settings
The data sets used in the experiments are characterized in Lab Note 1. The
computer used for these experiments ran the 64-bit Ubuntu 13.10 operating
system, with an Intel(R) Core(TM)2 Quad Q6600 CPU at 2.40GHz
and 8.00 GB of RAM.
The parameters used in this experiment are the defaults, with the
exception of "-prior p", which was set to p = 2/L (with L = 300 SNPs per data set, p ≈ 0.0067).
(Bar chart: Power (%) versus allele frequency (0.01, 0.05, 0.1, 0.3, 0.5), with bars for 500, 1000, and 2000 individuals.)
Figure 1: Power by allele frequency. For each frequency, three sizes of data
sets were used to measure the Power, with odds ratio of 2.0 and prevalence
of 0.02. The Power is measured by the number of data sets where the ground
truth was amongst the most relevant results, out of all 100 data sets.
3
Results
The epistasis detection results of this algorithm consist of posterior probabilities, which are not comparable with χ2 tests; therefore only main effect
detections are considered in this experiment.
Figure 1 shows near 0% Power for allele frequencies lower than 0.1, but Power increases greatly, reaching 100% for frequencies of 0.3 and 0.5. There
is also a clear growth with population size, especially in data sets with a 0.1
minor allele frequency.
The running time (a) of these experiments shows a steady increase, with
a difference of nearly 3 seconds between data sets with 500 individuals and
data sets with 2000 individuals. The increase in running time is not very
significant, which suggests the algorithm may cope with larger data sets. This is also true for
memory usage (c), with only a 1.5 MB increase from 500 to 2000 individuals
per data set. The CPU usage (b) increases by nearly 10%
from 500 to 1000 individuals, lowering slightly for 2000 individuals.
The error rate results in Figure 3 contain high numbers of false positives.
The Type I Error Rate is higher than the Power for smaller
allele frequencies. For frequencies higher than 0.1, the Type I Error Rate is
lower than the Power, but the difference between the two percentages decreases as the
number of individuals increases. This means that for larger data
sets it is more likely to find the ground truth, but it is also more likely to be
accompanied by false positives.
(Bar charts: (a) average running time in seconds, (b) average CPU usage in %, and (c) average memory usage in MB, each versus the number of individuals: 500, 1000, 2000.)
Figure 2: Comparison of scalability measures between different sized data
sets. The figure shows the average running time, CPU usage, and memory
usage for each data set size. The data sets have a minor allele frequency of 0.5,
odds ratio of 2.0, and prevalence of 0.02.
The distribution of Power by odds ratio in Figure 5 reveals a big increase in Power
as the odds ratio increases. This is similar to the Power by
population size in Figure 4. In Figure 7, data sets with low allele frequencies have near
0% Power; with a 0.1 minor allele frequency there is a significant increase,
to 92% Power, reaching 100% for higher allele frequencies. There is no clear difference in Power with prevalence changes in
Figure 6.
(Bar chart: Type I Error Rate (%) versus allele frequency (0.01, 0.05, 0.1, 0.3, 0.5), with bars for 500, 1000, and 2000 individuals.)
Figure 3: Type I Error Rate by allele frequency and population size, with
odds ratio of 2.0 and prevalence of 0.02. The Type I Error Rate is measured
by the number of data sets where false positives were amongst the most
relevant results, out of all 100 data sets.
4
Summary
BEAM3 is the third iteration of a Bayesian algorithm that uses posterior
probabilities to detect epistasis. BEAM3 generates a disease graph representing multi-SNP associations that have a high probability of being related
to the expression of the disease phenotype, and this graph is updated using MCMC.
This version of BEAM also outputs χ2 values for single SNPs, which are comparable with those of other algorithms; because of this, the results reported here consist of main effect
detection only. The results reveal similar values for Power and Type
I Error Rate, both increasing with allele frequency and population size, although Type
I errors remain lower relative to Power in data sets with high allele frequency
and low population size. The scalability of the algorithm is promising.
References
[ADH10] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 72:269–342,
2010.
[Zha12]
Yu Zhang. A novel bayesian graphical model for genome-wide
multi-SNP association mapping. Genetic Epidemiology, 36:36–47,
2012.
[ZL07]
Yu Zhang and Jun S Liu. Bayesian inference of epistatic interactions in case-control studies. Nature genetics, 39:1167–1173, 2007.
A
Bar graphs
(Bar chart: Power (%) versus population size: 500, 1000, 2000 individuals.)
Figure 4: Distribution of the Power by population. The allele frequency is
0.1, the odds ratio is 2.0, and the prevalence is 0.02.
(Bar chart: Power (%) versus odds ratio: 1.1, 1.5, 2.0.)
Figure 5: Distribution of the Power by odds ratios. The allele frequency is
0.1, the number of individuals is 2000, and the prevalence is 0.02.
(Bar chart: Power (%) versus prevalence: 0.0001, 0.02.)
Figure 6: Distribution of the Power by prevalence. The allele frequency is
0.1, the number of individuals is 2000, and the odds ratio is 2.0.
(Bar chart: Power (%) versus allele frequency: 0.01, 0.05, 0.1, 0.3, 0.5.)
Figure 7: Distribution of the Power by allele frequency. The number of
individuals is 2000, the odds ratio is 2.0, and the prevalence is 0.02.
B
Table of Results
Table 2: A table containing the percentage of true positives and false positives in each configuration. The first column contains the description of the
configuration. The second and third columns contain the number of datasets
with true positives and false positives respectively, out of all 100 data sets
per configuration.
Configuration*
0.5,500,ME,2.0,0.02
0.5,500,ME,2.0,0.0001
7
TP (%)
100
100
FP (%)
99
95
0.5,500,ME,1.5,0.02
0.5,500,ME,1.5,0.0001
0.5,500,ME,1.1,0.02
0.5,500,ME,1.1,0.0001
0.5,2000,ME,2.0,0.02
0.5,2000,ME,2.0,0.0001
0.5,2000,ME,1.5,0.02
0.5,2000,ME,1.5,0.0001
0.5,2000,ME,1.1,0.02
0.5,2000,ME,1.1,0.0001
0.5,1000,ME,2.0,0.02
0.5,1000,ME,2.0,0.0001
0.5,1000,ME,1.5,0.02
0.5,1000,ME,1.5,0.0001
0.5,1000,ME,1.1,0.02
0.5,1000,ME,1.1,0.0001
0.3,500,ME,2.0,0.02
0.3,500,ME,2.0,0.0001
0.3,500,ME,1.5,0.02
0.3,500,ME,1.5,0.0001
0.3,500,ME,1.1,0.02
0.3,500,ME,1.1,0.0001
0.3,2000,ME,2.0,0.02
0.3,2000,ME,2.0,0.0001
0.3,2000,ME,1.5,0.02
0.3,2000,ME,1.5,0.0001
0.3,2000,ME,1.1,0.02
0.3,2000,ME,1.1,0.0001
0.3,1000,ME,2.0,0.02
0.3,1000,ME,2.0,0.0001
0.3,1000,ME,1.5,0.02
0.3,1000,ME,1.5,0.0001
0.3,1000,ME,1.1,0.02
0.3,1000,ME,1.1,0.0001
0.1,500,ME,2.0,0.02
0.1,500,ME,2.0,0.0001
0.1,500,ME,1.5,0.02
0.1,500,ME,1.5,0.0001
8
100
100
80
79
100
100
100
100
100
100
100
100
100
100
100
100
100
100
88
89
21
23
100
100
100
100
100
100
100
100
100
100
90
81
0
12
0
0
53
57
20
22
100
100
100
100
100
98
100
100
100
97
57
60
71
79
24
30
11
6
100
100
99
100
54
50
99
100
68
63
25
25
9
17
5
6
0.1,500,ME,1.1,0.02
0.1,500,ME,1.1,0.0001
0.1,2000,ME,2.0,0.02
0.1,2000,ME,2.0,0.0001
0.1,2000,ME,1.5,0.02
0.1,2000,ME,1.5,0.0001
0.1,2000,ME,1.1,0.02
0.1,2000,ME,1.1,0.0001
0.1,1000,ME,2.0,0.02
0.1,1000,ME,2.0,0.0001
0.1,1000,ME,1.5,0.02
0.1,1000,ME,1.5,0.0001
0.1,1000,ME,1.1,0.02
0.1,1000,ME,1.1,0.0001
0.05,500,ME,2.0,0.02
0.05,500,ME,2.0,0.0001
0.05,500,ME,1.5,0.02
0.05,500,ME,1.5,0.0001
0.05,500,ME,1.1,0.02
0.05,500,ME,1.1,0.0001
0.05,2000,ME,2.0,0.02
0.05,2000,ME,2.0,0.0001
0.05,2000,ME,1.5,0.02
0.05,2000,ME,1.5,0.0001
0.05,2000,ME,1.1,0.02
0.05,2000,ME,1.1,0.0001
0.05,1000,ME,2.0,0.02
0.05,1000,ME,2.0,0.0001
0.05,1000,ME,1.5,0.02
0.05,1000,ME,1.5,0.0001
0.05,1000,ME,1.1,0.02
0.05,1000,ME,1.1,0.0001
0.01,500,ME,2.0,0.02
0.01,500,ME,2.0,0.0001
0.01,500,ME,1.5,0.02
0.01,500,ME,1.5,0.0001
0.01,500,ME,1.1,0.02
0.01,500,ME,1.1,0.0001
9
0
0
92
99
24
44
2
1
32
59
1
6
0
0
0
0
0
0
0
0
1
7
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
6
5
67
76
16
29
8
7
18
38
6
10
7
5
3
6
4
4
5
1
17
25
3
13
5
6
3
18
2
5
7
3
0
6
0
6
0
6
0.01,2000,ME,2.0,0.02
0.01,2000,ME,2.0,0.0001
0.01,2000,ME,1.5,0.02
0.01,2000,ME,1.5,0.0001
0.01,2000,ME,1.1,0.02
0.01,2000,ME,1.1,0.0001
0.01,1000,ME,2.0,0.02
0.01,1000,ME,2.0,0.0001
0.01,1000,ME,1.5,0.02
0.01,1000,ME,1.5,0.0001
0.01,1000,ME,1.1,0.02
0.01,1000,ME,1.1,0.0001
0
0
0
0
0
0
0
0
0
0
0
0
1
3
3
2
3
2
6
3
7
3
3
4
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with
or without main effect and with or without epistasis effect), OR is the odds
ratio and PREV is the prevalence of the disease.
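When post-processing these result tables, a small helper like the one below (an assumption of ours, not part of the original tooling) can decode the configuration strings, e.g. "0.5,500,ME,2.0,0.02", into their named fields.

def parse_configuration(config):
    maf, pop, model, odds_ratio, prevalence = config.split(",")
    return {
        "maf": float(maf),            # minor allele frequency
        "population": int(pop),       # number of individuals
        "model": model,               # disease model code, e.g. ME, I, or ME+I
        "odds_ratio": float(odds_ratio),
        "prevalence": float(prevalence),
    }

print(parse_configuration("0.5,500,ME,2.0,0.02"))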
Table 3: A table containing the running time, cpu usage and memory usage
in each configuration.
Configuration*
0.5,500,ME,2.0,0.02
0.5,500,ME,2.0,0.0001
0.5,500,ME,1.5,0.02
0.5,500,ME,1.5,0.0001
0.5,500,ME,1.1,0.02
0.5,500,ME,1.1,0.0001
0.5,2000,ME,2.0,0.02
0.5,2000,ME,2.0,0.0001
0.5,2000,ME,1.5,0.02
0.5,2000,ME,1.5,0.0001
0.5,2000,ME,1.1,0.02
0.5,2000,ME,1.1,0.0001
0.5,1000,ME,2.0,0.02
0.5,1000,ME,2.0,0.0001
0.5,1000,ME,1.5,0.02
0.5,1000,ME,1.5,0.0001
0.5,1000,ME,1.1,0.02
Running Time (s)
04.90
03.30
02.16
02.15
01.82
01.73
08.02
05.17
02.78
02.59
02.34
02.30
06.96
03.79
02.38
02.25
02.10
10
CPU Usage (%)
87.81
87.16
86.74
82.36
80.97
83.54
95.53
94.16
92.74
93.39
93.38
93.32
96.31
95.00
93.54
93.99
93.08
Memory Usage (KB)
4152.80
3446.24
2723.76
2757.20
2566.12
2556.08
5986.72
4108.72
3512.88
3508.48
3493.44
3492.60
4437.08
3240.00
2771.80
2729.16
2686.12
0.5,1000,ME,1.1,0.0001
0.3,500,ME,2.0,0.02
0.3,500,ME,2.0,0.0001
0.3,500,ME,1.5,0.02
0.3,500,ME,1.5,0.0001
0.3,500,ME,1.1,0.02
0.3,500,ME,1.1,0.0001
0.3,2000,ME,2.0,0.02
0.3,2000,ME,2.0,0.0001
0.3,2000,ME,1.5,0.02
0.3,2000,ME,1.5,0.0001
0.3,2000,ME,1.1,0.02
0.3,2000,ME,1.1,0.0001
0.3,1000,ME,2.0,0.02
0.3,1000,ME,2.0,0.0001
0.3,1000,ME,1.5,0.02
0.3,1000,ME,1.5,0.0001
0.3,1000,ME,1.1,0.02
0.3,1000,ME,1.1,0.0001
0.1,500,ME,2.0,0.02
0.1,500,ME,2.0,0.0001
0.1,500,ME,1.5,0.02
0.1,500,ME,1.5,0.0001
0.1,500,ME,1.1,0.02
0.1,500,ME,1.1,0.0001
0.1,2000,ME,2.0,0.02
0.1,2000,ME,2.0,0.0001
0.1,2000,ME,1.5,0.02
0.1,2000,ME,1.5,0.0001
0.1,2000,ME,1.1,0.02
0.1,2000,ME,1.1,0.0001
0.1,1000,ME,2.0,0.02
0.1,1000,ME,2.0,0.0001
0.1,1000,ME,1.5,0.02
0.1,1000,ME,1.5,0.0001
0.1,1000,ME,1.1,0.02
0.1,1000,ME,1.1,0.0001
0.05,500,ME,2.0,0.02
02.02
02.60
02.32
01.93
01.83
01.17
01.09
02.77
02.95
02.38
02.32
02.30
02.28
02.42
02.45
02.04
02.04
01.82
01.76
0.80
0.95
0.61
0.64
0.57
0.58
02.24
02.26
01.37
01.45
01.02
0.99
01.38
01.50
0.78
0.83
0.69
0.68
0.59
11
93.41
94.60
93.51
93.41
92.49
89.70
88.25
94.79
95.25
94.49
94.27
94.73
94.56
94.03
94.19
93.96
93.88
93.43
92.86
85.95
88.33
82.27
84.21
82.88
81.66
93.47
94.12
90.66
91.55
90.22
90.46
89.81
91.44
88.49
88.49
83.77
89.10
81.11
2665.64
2970.00
2917.44
2615.88
2607.24
2483.28
2476.68
3534.72
3563.44
3493.60
3492.92
3491.44
3490.44
2886.64
2831.72
2675.80
2671.00
2665.28
2662.68
2471.00
2520.12
2383.04
2432.96
2367.56
2408.72
3493.84
3492.40
3489.68
3484.24
3482.16
3483.44
2681.04
2696.48
2655.24
2653.08
2652.16
2648.20
2380.88
0.05,500,ME,2.0,0.0001
0.05,500,ME,1.5,0.02
0.05,500,ME,1.5,0.0001
0.05,500,ME,1.1,0.02
0.05,500,ME,1.1,0.0001
0.05,2000,ME,2.0,0.02
0.05,2000,ME,2.0,0.0001
0.05,2000,ME,1.5,0.02
0.05,2000,ME,1.5,0.0001
0.05,2000,ME,1.1,0.02
0.05,2000,ME,1.1,0.0001
0.05,1000,ME,2.0,0.02
0.05,1000,ME,2.0,0.0001
0.05,1000,ME,1.5,0.02
0.05,1000,ME,1.5,0.0001
0.05,1000,ME,1.1,0.02
0.05,1000,ME,1.1,0.0001
0.01,500,ME,2.0,0.02
0.01,500,ME,2.0,0.0001
0.01,500,ME,1.5,0.02
0.01,500,ME,1.5,0.0001
0.01,500,ME,1.1,0.02
0.01,500,ME,1.1,0.0001
0.01,2000,ME,2.0,0.02
0.01,2000,ME,2.0,0.0001
0.01,2000,ME,1.5,0.02
0.01,2000,ME,1.5,0.0001
0.01,2000,ME,1.1,0.02
0.01,2000,ME,1.1,0.0001
0.01,1000,ME,2.0,0.02
0.01,1000,ME,2.0,0.0001
0.01,1000,ME,1.5,0.02
0.01,1000,ME,1.5,0.0001
0.01,1000,ME,1.1,0.02
0.01,1000,ME,1.1,0.0001
0.93
0.57
0.60
0.59
0.57
01.18
01.19
0.98
0.98
0.94
0.94
0.70
0.81
0.67
0.70
0.66
0.69
0.55
0.59
0.54
0.58
0.55
0.59
0.91
0.93
0.91
0.92
0.91
0.93
0.66
0.67
06.55
0.67
0.66
0.66
12
84.09
81.72
81.99
79.48
81.46
89.59
89.80
89.07
89.80
89.82
90.33
85.95
86.89
81.01
82.83
84.68
80.38
77.93
79.62
81.51
79.48
78.36
79.67
85.28
91.10
91.18
91.62
91.07
91.07
86.84
89.19
88.46
80.52
84.18
81.46
2439.40
2361.20
2390.04
2361.20
2381.48
3485.56
3484.76
3480.08
3480.16
3479.28
3481.12
2651.56
2653.84
2647.16
2648.96
2648.20
2647.76
2340.40
2391.20
2345.64
2387.76
2349.92
2393.76
3476.88
3479.40
3478.80
3480.64
3477.44
3478.96
2645.76
2649.60
6100.36
2646.28
2644.68
2645.48
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with
or without main effect and with or without epistasis effect), OR is the odds
ratio and PREV is the prevalence of the disease.
13
Laboratory Note
Genetic Epistasis
III - Assessing Algorithm BOOST
LN-3-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr Roberto Frias, s/n,
4200-465 PORTO
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www : http://www.fe.up.pt/∼ei09045
[email protected]
www : http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
In this lab note, the algorithm BOOST is discussed. Its main
features are transforming the genotype data into a
Boolean representation, performing logic operations on it, and pruning statistically
irrelevant epistatic interactions. The results show a higher Power for
main effect detection than for epistasis detection, but also a much higher Type I
Error Rate; the same holds for full effect
detection. The scalability of the algorithm is very good, revealing
only a slight increase in resource usage and running time as
the population size grows.
1
Introduction
BOOST (BOolean Operation-based Screening and Testing) [WYY+ 10] transforms the data representation into a Boolean type, making logic operations
more efficient, and prunes insignificant epistatic interactions using an upper
bound based on the likelihood ratio test statistic. BOOST works in two
stages:
• Stage 1: Screening. All pairwise interactions are evaluated using
contingency tables collected through Boolean operations, removing interactions that fail to meet a predefined threshold (a small sketch of this
Boolean encoding idea is given after this list). The evaluation of the
interactions at this stage is based on the Kullback-Leibler divergence
D = N · D_KL(π̂ || ρ̂), where π̂ represents the joint distribution under the full
logistic regression model M_S = β_0 + β_i^{x1} + β_j^{x2} + β_{ij}^{x1 x2}, and ρ̂ is the approximate joint distribution under the main-effect logistic regression model
M_H = β_0 + β_i^{x1} + β_j^{x2}, obtained using the Kirkwood superposition approximation.
• Stage 2: Testing. Two statistical tests are used: a likelihood ratio test,
fitting the log-linear models M_H and M_S, and a χ2 test with four degrees of
freedom (the 3 × 3 genotype table of a SNP pair has (3 − 1)(3 − 1) = 4 interaction degrees of freedom). The p value is adjusted by a Bonferroni correction.
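The sketch below illustrates the Boolean encoding idea behind the screening stage under simplifying assumptions (it is not the BOOST implementation): each SNP is stored as three bitmasks, one per genotype value, so the 3 × 3 contingency table of any SNP pair over a group of individuals (e.g. the cases, or the controls) reduces to bitwise ANDs and popcounts.

def encode_snp(genotypes):
    # Three integer bitmasks, one per genotype value 0/1/2; bit i is set when
    # individual i carries that genotype.
    masks = [0, 0, 0]
    for idx, g in enumerate(genotypes):
        masks[g] |= 1 << idx
    return masks

def pair_contingency(masks_a, masks_b):
    # 3x3 genotype-combination counts for one SNP pair over the same individuals.
    return [[bin(masks_a[i] & masks_b[j]).count("1") for j in range(3)]
            for i in range(3)]

# Toy example with 5 individuals, genotypes coded 0/1/2 as in the input files.
snp1 = encode_snp([1, 0, 2, 0, 1])
snp2 = encode_snp([1, 2, 1, 1, 0])
print(pair_contingency(snp1, snp2))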
1.1
Input files
The input consists of data files with the SNP and phenotype information,
plus a file containing the names of all data set files. In the data files, the
first column corresponds to the phenotype, taking values in {0, 1}. Each column from the
second to the last corresponds to a SNP, taking values in {0, 1, 2}.
1.2
Output files
The output consists of two files. The first contains the interaction results, listing all SNP
pairs with a χ2 result above 30, with the following columns (a sketch for parsing this file is given at the end of this subsection):
• Index : number of interaction. Begins with 0
• SNP1 : first SNP in the interaction. Numeration begins with 0.
• SNP2 : second SNP in the interaction. Numeration begins with 0.
• SinglelocusAssoc1 : value of the marginal effect for the first associated SNP.
• SinglelocusAssoc2 : value of the marginal effect for the second associated SNP.
• InteractionBOOST : The statistical value of BOOST from the χ2
test.
• InteractionPLINK : value obtained by using the statistic of PLINK.
The second file contains the marginal effect value for every SNP. The file
contains two columns:
• SNPindex : number of the SNP.
• Single-locusTestValue : value of the χ2 test.
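As referenced above, the sketch below reads the interaction results assuming whitespace-separated columns in the order listed for the first output file; the real file may differ in delimiter or header line. It keeps the pairs whose BOOST χ2 statistic (four degrees of freedom) passes a Bonferroni-corrected threshold over all SNP pairs, as in the testing stage.

from math import comb
from scipy.stats import chi2

def significant_pairs(path, n_snps, alpha=0.05):
    threshold = alpha / comb(n_snps, 2)    # Bonferroni correction over all pairs
    pairs = []
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 7 or not fields[0].isdigit():
                continue                   # skip a possible header or blank line
            snp1, snp2 = int(fields[1]), int(fields[2])
            boost_stat = float(fields[5])  # InteractionBOOST column (chi-square, 4 df)
            if chi2.sf(boost_stat, 4) < threshold:
                pairs.append((snp1, snp2, boost_stat))
    return pairs

# usage: pairs = significant_pairs("interaction_results.txt", n_snps=300)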
2
Experimental Settings
The data sets used in the experiments are characterized in Lab Note 1. The
computer used for these experiments ran the 64-bit Ubuntu 13.10 operating
system, with an Intel(R) Core(TM)2 Quad Q6600 CPU at 2.40GHz processor
and 8.00 GB of RAM.
BOOST is a C program. There are no user-configurable settings for BOOST.
3
Results
The Power displayed in epistasis detection (a) is inferior to the Power obtained by main
effect detection (b), and the best results were obtained by full effect detection (c) in
almost all configurations. In epistasis detection, Figure 1 shows an increase
of Power with population size for nearly all allele frequencies. Increasing the
allele frequency also increases the Power.
Figure 2 shows a varying CPU usage (b), with a very slight increase
in running time (a) and memory usage (c). This increase is not significant,
which reveals good scalability.
The Type I Error Rate shows a maximum value of 21% for epistasis detection, but reaches 100% in the data sets with the highest population size and
allele frequency for main effect and full effect detection. Most of the Type I errors
in epistasis are below 10%, so there is a bigger difference between
Power and Type I errors in epistasis detection. For main effect and full effect,
the Type I Error Rate increases with data set size and minor allele
frequency.
(Bar charts: Power (%) versus allele frequency (0.01, 0.05, 0.1, 0.3, 0.5), with bars for 500, 1000, and 2000 individuals; panels (a) Epistasis, (b) Main Effect, (c) Full Effect.)
Figure 1: Power by allele frequency. For each frequency, three sizes of data
sets were used to measure the Power, with odds ratio of 2.0 and prevalence
of 0.02. The Power is measured by the number of data sets where the ground
truth was amongst the most relevant results, out of all 100 data sets. Panels
(a), (b), and (c) represent epistatic, main effect, and main effect + epistatic interactions, respectively.
(Bar charts: (a) average running time in seconds, (b) average CPU usage in %, and (c) average memory usage in KB, each versus the number of individuals: 500, 1000, 2000.)
Figure 2: Comparison of scalability measures between different sized data
sets. The data sets have a minor allele frequency of 0.5, odds ratio of 2.0, prevalence of
0.02, and use the full effect disease model.
The Power distributions by population (Figure 4) and by odds ratio (Figure 5) show a big
increase with higher population sizes and odds ratios. The prevalence results (Figure 6) reveal very similar values for both prevalences, and the distribution by
allele frequency (Figure 7) increases slightly at a 0.05 minor allele frequency and greatly at 0.1, reaching 100% for higher allele
frequencies.
(Bar charts: Type I Error Rate (%) versus allele frequency (0.01, 0.05, 0.1, 0.3, 0.5), with bars for 500, 1000, and 2000 individuals; panels (a) Epistasis, (b) Main Effect, (c) Full Effect.)
Figure 3: Type I Error Rate by allele frequency. For each frequency, three
sizes of data sets were used, with odds ratio of 2.0 and
prevalence of 0.02. The Type I Error Rate is measured by the number of
data sets where false positives were amongst the most relevant results,
out of all 100 data sets. Panels (a), (b), and (c) represent epistatic, main effect,
and main effect + epistatic interactions, respectively.
5
4
Summary
BOOST is an exhaustive algorithm that converts the data into a binary
format and prunes irrelevant interactions using the contingency tables collected
through Boolean operations. The results show very good scalability, with only a slight
increase in running time, memory usage, and CPU usage. The
gap between Power and Type I Error Rate is larger in
epistasis detection, but the overall Power there is lower than for main effect and full
effect detection.
References
[WYY+ 10] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Xiaodan Fan,
Nelson L S Tang, and Weichuan Yu. BOOST: A fast approach
to detecting gene-gene interactions in genome-wide case-control
studies. American journal of human genetics, 87:325–340, 2010.
A
Bar Graphs
(Bar charts: Power (%) versus population size (500, 1000, 2000 individuals); panels (a) Epistasis, (b) Main Effect, (c) Full Effect.)
Figure 4: Distribution of the Power by population for all disease models.
The allele frequency is 0.1, the odds ratio is 2.0, and the prevalence is 0.02.
(Bar charts: Power (%) versus odds ratio (1.1, 1.5, 2.0); panels (a) Epistasis, (b) Main Effect, (c) Full Effect.)
Figure 5: Distribution of the Power by odds ratios for all disease models.
The allele frequency is 0.1, the population size is 2000 individuals, and the
prevalence is 0.02.
(Bar charts: Power (%) versus prevalence (0.0001, 0.02); panels (a) Epistasis, (b) Main Effect, (c) Full Effect.)
Figure 6: Distribution of the Power by prevalence for all disease models. The
allele frequency is 0.1, the odds ratio is 2.0, and the population size is 2000
individuals.
(Bar charts: Power (%) versus allele frequency (0.01, 0.05, 0.1, 0.3, 0.5); panels (a) Epistasis, (b) Main Effect, (c) Full Effect.)
Figure 7: Distribution of the Power by allele frequency for all disease models. The population size is 2000 individuals, the odds ratio is 2.0, and the
prevalence is 0.02.
10
B
Table of Results
Table 1: A table containing the percentage of true positives and false positives in each configuration. The first column contains the description of the
configuration. The second and third columns contain the number of datasets
with true positives and false positives respectively, out of all 100 data sets
per configuration.
Configuration*
0.5,500,I,2.0,0.02
0.5,500,I,2.0,0.0001
0.5,500,I,1.5,0.02
0.5,500,I,1.5,0.0001
0.5,500,I,1.1,0.02
0.5,500,I,1.1,0.0001
0.5,2000,I,2.0,0.02
0.5,2000,I,2.0,0.0001
0.5,2000,I,1.5,0.02
0.5,2000,I,1.5,0.0001
0.5,2000,I,1.1,0.02
0.5,2000,I,1.1,0.0001
0.5,1000,I,2.0,0.02
0.5,1000,I,2.0,0.0001
0.5,1000,I,1.5,0.02
0.5,1000,I,1.5,0.0001
0.5,1000,I,1.1,0.02
0.5,1000,I,1.1,0.0001
0.3,500,I,2.0,0.02
0.3,500,I,2.0,0.0001
0.3,500,I,1.5,0.02
0.3,500,I,1.5,0.0001
0.3,500,I,1.1,0.02
0.3,500,I,1.1,0.0001
0.3,2000,I,2.0,0.02
0.3,2000,I,2.0,0.0001
0.3,2000,I,1.5,0.02
0.3,2000,I,1.5,0.0001
0.3,2000,I,1.1,0.02
0.3,2000,I,1.1,0.0001
0.3,1000,I,2.0,0.02
11
TP (%)
26
20
19
1
0
0
100
100
100
67
15
7
91
90
79
13
1
0
14
46
15
12
0
0
100
100
100
100
7
8
66
FP (%)
4
3
2
4
1
3
8
17
11
2
6
7
8
7
8
4
5
6
6
6
3
2
4
1
6
10
30
10
4
6
2
0.3,1000,I,2.0,0.0001
0.3,1000,I,1.5,0.02
0.3,1000,I,1.5,0.0001
0.3,1000,I,1.1,0.02
0.3,1000,I,1.1,0.0001
0.1,500,I,2.0,0.02
0.1,500,I,2.0,0.0001
0.1,500,I,1.5,0.02
0.1,500,I,1.5,0.0001
0.1,500,I,1.1,0.02
0.1,500,I,1.1,0.0001
0.1,2000,I,2.0,0.02
0.1,2000,I,2.0,0.0001
0.1,2000,I,1.5,0.02
0.1,2000,I,1.5,0.0001
0.1,2000,I,1.1,0.02
0.1,2000,I,1.1,0.0001
0.1,1000,I,2.0,0.02
0.1,1000,I,2.0,0.0001
0.1,1000,I,1.5,0.02
0.1,1000,I,1.5,0.0001
0.1,1000,I,1.1,0.02
0.1,1000,I,1.1,0.0001
0.05,500,I,2.0,0.02
0.05,500,I,2.0,0.0001
0.05,500,I,1.5,0.02
0.05,500,I,1.5,0.0001
0.05,500,I,1.1,0.02
0.05,500,I,1.1,0.0001
0.05,2000,I,2.0,0.02
0.05,2000,I,2.0,0.0001
0.05,2000,I,1.5,0.02
0.05,2000,I,1.5,0.0001
0.05,2000,I,1.1,0.02
0.05,2000,I,1.1,0.0001
0.05,1000,I,2.0,0.02
0.05,1000,I,2.0,0.0001
0.05,1000,I,1.5,0.02
12
100
81
71
0
0
1
0
1
1
0
0
94
91
95
77
27
24
41
13
36
10
0
0
0
1
0
0
0
0
7
70
65
47
0
0
0
8
11
8
10
6
2
3
7
1
3
5
7
4
21
7
9
6
5
2
5
7
4
3
6
5
7
2
8
6
11
3
2
35
49
28
7
0
4
8
7
0.05,1000,I,1.5,0.0001
0.05,1000,I,1.1,0.02
0.05,1000,I,1.1,0.0001
0.01,500,I,2.0,0.02
0.01,500,I,2.0,0.0001
0.01,500,I,1.5,0.02
0.01,500,I,1.5,0.0001
0.01,500,I,1.1,0.02
0.01,500,I,1.1,0.0001
0.01,2000,I,2.0,0.02
0.01,2000,I,2.0,0.0001
0.01,2000,I,1.5,0.02
0.01,2000,I,1.5,0.0001
0.01,2000,I,1.1,0.02
0.01,2000,I,1.1,0.0001
0.01,1000,I,2.0,0.02
0.01,1000,I,2.0,0.0001
0.01,1000,I,1.5,0.02
0.01,1000,I,1.5,0.0001
0.01,1000,I,1.1,0.02
0.01,1000,I,1.1,0.0001
0.5,500,ME,2.0,0.02
0.5,500,ME,2.0,0.0001
0.5,500,ME,1.5,0.02
0.5,500,ME,1.5,0.0001
0.5,500,ME,1.1,0.02
0.5,500,ME,1.1,0.0001
0.5,2000,ME,2.0,0.02
0.5,2000,ME,2.0,0.0001
0.5,2000,ME,1.5,0.02
0.5,2000,ME,1.5,0.0001
0.5,2000,ME,1.1,0.02
0.5,2000,ME,1.1,0.0001
0.5,1000,ME,2.0,0.02
0.5,1000,ME,2.0,0.0001
0.5,1000,ME,1.5,0.02
0.5,1000,ME,1.5,0.0001
0.5,1000,ME,1.1,0.02
13
1
0
0
0
0
0
0
0
0
0
1
0
1
0
0
0
0
0
0
0
0
100
100
100
100
81
78
100
100
100
100
100
100
100
100
100
100
100
5
4
6
4
2
5
4
5
7
2
6
4
1
2
6
7
4
2
4
5
8
97
95
63
61
30
22
100
100
100
100
100
97
100
100
100
97
61
0.5,1000,ME,1.1,0.0001
0.3,500,ME,2.0,0.02
0.3,500,ME,2.0,0.0001
0.3,500,ME,1.5,0.02
0.3,500,ME,1.5,0.0001
0.3,500,ME,1.1,0.02
0.3,500,ME,1.1,0.0001
0.3,2000,ME,2.0,0.02
0.3,2000,ME,2.0,0.0001
0.3,2000,ME,1.5,0.02
0.3,2000,ME,1.5,0.0001
0.3,2000,ME,1.1,0.02
0.3,2000,ME,1.1,0.0001
0.3,1000,ME,2.0,0.02
0.3,1000,ME,2.0,0.0001
0.3,1000,ME,1.5,0.02
0.3,1000,ME,1.5,0.0001
0.3,1000,ME,1.1,0.02
0.3,1000,ME,1.1,0.0001
0.1,500,ME,2.0,0.02
0.1,500,ME,2.0,0.0001
0.1,500,ME,1.5,0.02
0.1,500,ME,1.5,0.0001
0.1,500,ME,1.1,0.02
0.1,500,ME,1.1,0.0001
0.1,2000,ME,2.0,0.02
0.1,2000,ME,2.0,0.0001
0.1,2000,ME,1.5,0.02
0.1,2000,ME,1.5,0.0001
0.1,2000,ME,1.1,0.02
0.1,2000,ME,1.1,0.0001
0.1,1000,ME,2.0,0.02
0.1,1000,ME,2.0,0.0001
0.1,1000,ME,1.5,0.02
0.1,1000,ME,1.5,0.0001
0.1,1000,ME,1.1,0.02
0.1,1000,ME,1.1,0.0001
0.05,500,ME,2.0,0.02
14
100
100
100
90
86
25
18
100
100
100
100
100
100
100
100
100
100
92
77
2
8
0
0
0
0
97
98
33
34
2
1
43
50
5
1
0
0
0
59
78
82
29
33
17
10
100
100
99
100
57
51
99
100
69
66
28
27
12
14
7
8
7
4
74
71
22
23
11
5
23
32
9
9
10
5
1
0.05,500,ME,2.0,0.0001
0.05,500,ME,1.5,0.02
0.05,500,ME,1.5,0.0001
0.05,500,ME,1.1,0.02
0.05,500,ME,1.1,0.0001
0.05,2000,ME,2.0,0.02
0.05,2000,ME,2.0,0.0001
0.05,2000,ME,1.5,0.02
0.05,2000,ME,1.5,0.0001
0.05,2000,ME,1.1,0.02
0.05,2000,ME,1.1,0.0001
0.05,1000,ME,2.0,0.02
0.05,1000,ME,2.0,0.0001
0.05,1000,ME,1.5,0.02
0.05,1000,ME,1.5,0.0001
0.05,1000,ME,1.1,0.02
0.05,1000,ME,1.1,0.0001
0.01,500,ME,2.0,0.02
0.01,500,ME,2.0,0.0001
0.01,500,ME,1.5,0.02
0.01,500,ME,1.5,0.0001
0.01,500,ME,1.1,0.02
0.01,500,ME,1.1,0.0001
0.01,2000,ME,2.0,0.02
0.01,2000,ME,2.0,0.0001
0.01,2000,ME,1.5,0.02
0.01,2000,ME,1.5,0.0001
0.01,2000,ME,1.1,0.02
0.01,2000,ME,1.1,0.0001
0.01,1000,ME,2.0,0.02
0.01,1000,ME,2.0,0.0001
0.01,1000,ME,1.5,0.02
0.01,1000,ME,1.5,0.0001
0.01,1000,ME,1.1,0.02
0.01,1000,ME,1.1,0.0001
0.5,500,ME+I,2.0,0.02
0.5,500,ME+I,2.0,0.0001
0.5,500,ME+I,1.5,0.02
15
0
0
0
0
0
14
13
2
1
0
2
1
4
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
100
100
100
4
1
2
3
2
11
10
2
6
3
3
3
6
0
3
3
2
1
7
1
7
1
7
1
4
5
5
5
5
7
3
8
3
4
2
100
100
100
0.5,500,ME+I,1.5,0.0001
0.5,500,ME+I,1.1,0.02
0.5,500,ME+I,1.1,0.0001
0.5,2000,ME+I,2.0,0.02
0.5,2000,ME+I,2.0,0.0001
0.5,2000,ME+I,1.5,0.02
0.5,2000,ME+I,1.5,0.0001
0.5,2000,ME+I,1.1,0.02
0.5,2000,ME+I,1.1,0.0001
0.5,1000,ME+I,2.0,0.02
0.5,1000,ME+I,2.0,0.0001
0.5,1000,ME+I,1.5,0.02
0.5,1000,ME+I,1.5,0.0001
0.5,1000,ME+I,1.1,0.02
0.5,1000,ME+I,1.1,0.0001
0.3,500,ME+I,2.0,0.02
0.3,500,ME+I,2.0,0.0001
0.3,500,ME+I,1.5,0.02
0.3,500,ME+I,1.5,0.0001
0.3,500,ME+I,1.1,0.02
0.3,500,ME+I,1.1,0.0001
0.3,2000,ME+I,2.0,0.02
0.3,2000,ME+I,2.0,0.0001
0.3,2000,ME+I,1.5,0.02
0.3,2000,ME+I,1.5,0.0001
0.3,2000,ME+I,1.1,0.02
0.3,2000,ME+I,1.1,0.0001
0.3,1000,ME+I,2.0,0.02
0.3,1000,ME+I,2.0,0.0001
0.3,1000,ME+I,1.5,0.02
0.3,1000,ME+I,1.5,0.0001
0.3,1000,ME+I,1.1,0.02
0.3,1000,ME+I,1.1,0.0001
0.1,500,ME+I,2.0,0.02
0.1,500,ME+I,2.0,0.0001
0.1,500,ME+I,1.5,0.02
0.1,500,ME+I,1.5,0.0001
0.1,500,ME+I,1.1,0.02
16
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
79
89
100
100
100
100
100
100
100
100
100
100
100
100
1
30
0
1
0
100
89
86
100
100
100
100
100
100
100
100
100
100
100
100
100
100
79
93
28
25
100
100
100
100
97
99
100
100
99
100
63
66
15
34
10
10
5
0.1,500,ME+I,1.1,0.0001
0.1,2000,ME+I,2.0,0.02
0.1,2000,ME+I,2.0,0.0001
0.1,2000,ME+I,1.5,0.02
0.1,2000,ME+I,1.5,0.0001
0.1,2000,ME+I,1.1,0.02
0.1,2000,ME+I,1.1,0.0001
0.1,1000,ME+I,2.0,0.02
0.1,1000,ME+I,2.0,0.0001
0.1,1000,ME+I,1.5,0.02
0.1,1000,ME+I,1.5,0.0001
0.1,1000,ME+I,1.1,0.02
0.1,1000,ME+I,1.1,0.0001
0.05,500,ME+I,2.0,0.02
0.05,500,ME+I,2.0,0.0001
0.05,500,ME+I,1.5,0.02
0.05,500,ME+I,1.5,0.0001
0.05,500,ME+I,1.1,0.02
0.05,500,ME+I,1.1,0.0001
0.05,2000,ME+I,2.0,0.02
0.05,2000,ME+I,2.0,0.0001
0.05,2000,ME+I,1.5,0.02
0.05,2000,ME+I,1.5,0.0001
0.05,2000,ME+I,1.1,0.02
0.05,2000,ME+I,1.1,0.0001
0.05,1000,ME+I,2.0,0.02
0.05,1000,ME+I,2.0,0.0001
0.05,1000,ME+I,1.5,0.02
0.05,1000,ME+I,1.5,0.0001
0.05,1000,ME+I,1.1,0.02
0.05,1000,ME+I,1.1,0.0001
0.01,500,ME+I,2.0,0.02
0.01,500,ME+I,2.0,0.0001
0.01,500,ME+I,1.5,0.02
0.01,500,ME+I,1.5,0.0001
0.01,500,ME+I,1.1,0.02
0.01,500,ME+I,1.1,0.0001
0.01,2000,ME+I,2.0,0.02
17
1
98
100
72
46
4
2
42
91
2
8
0
0
0
0
0
1
0
0
15
27
1
4
0
0
2
8
1
1
0
0
0
0
0
0
0
0
0
9
81
100
51
38
14
16
38
70
18
18
13
14
4
11
3
11
8
7
17
30
13
8
7
7
16
11
6
5
5
6
10
13
13
12
7
11
7
0.01,2000,ME+I,2.0,0.0001
0.01,2000,ME+I,1.5,0.02
0.01,2000,ME+I,1.5,0.0001
0.01,2000,ME+I,1.1,0.02
0.01,2000,ME+I,1.1,0.0001
0.01,1000,ME+I,2.0,0.02
0.01,1000,ME+I,2.0,0.0001
0.01,1000,ME+I,1.5,0.02
0.01,1000,ME+I,1.5,0.0001
0.01,1000,ME+I,1.1,0.02
0.01,1000,ME+I,1.1,0.0001
0
0
0
0
0
0
0
0
0
0
0
12
13
8
16
8
11
8
13
7
10
3
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with
or without main effect and with or without epistasis effect), OR is the odds
ratio and PREV is the prevalence of the disease.
Table 2: A table containing the running time, cpu usage and memory usage
in each configuration.
Configuration*
0.5,500,ME+I,2.0,0.02
0.5,500,ME+I,2.0,0.0001
0.5,500,ME+I,1.5,0.02
0.5,500,ME+I,1.5,0.0001
0.5,500,ME+I,1.1,0.02
0.5,500,ME+I,1.1,0.0001
0.5,500,ME,2.0,0.02
0.5,500,ME,2.0,0.0001
0.5,500,ME,1.5,0.02
0.5,500,ME,1.5,0.0001
0.5,500,ME,1.1,0.02
0.5,500,ME,1.1,0.0001
0.5,500,I,2.0,0.02
0.5,500,I,2.0,0.0001
0.5,500,I,1.5,0.02
0.5,500,I,1.5,0.0001
0.5,500,I,1.1,0.02
0.5,500,I,1.1,0.0001
Running Time (s)
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
18
CPU Usage (%)
95.70
96.05
96.07
95.08
96.93
95.35
96.97
95.79
97.07
98.08
97.83
97.93
98.11
98.02
97.67
98.41
97.59
97.38
Memory Usage (KB)
1003.04
1003.56
1001.04
991.88
995.92
973.64
997.92
980.40
996.76
972.08
993.84
971.52
996.16
973.60
998.08
970.00
995.36
967.72
0.5,2000,ME+I,2.0,0.02
0.5,2000,ME+I,2.0,0.0001
0.5,2000,ME+I,1.5,0.02
0.5,2000,ME+I,1.5,0.0001
0.5,2000,ME+I,1.1,0.02
0.5,2000,ME+I,1.1,0.0001
0.5,2000,ME,2.0,0.02
0.5,2000,ME,2.0,0.0001
0.5,2000,ME,1.5,0.02
0.5,2000,ME,1.5,0.0001
0.5,2000,ME,1.1,0.02
0.5,2000,ME,1.1,0.0001
0.5,2000,I,2.0,0.02
0.5,2000,I,2.0,0.0001
0.5,2000,I,1.5,0.02
0.5,2000,I,1.5,0.0001
0.5,2000,I,1.1,0.02
0.5,2000,I,1.1,0.0001
0.5,1000,ME+I,2.0,0.02
0.5,1000,ME+I,2.0,0.0001
0.5,1000,ME+I,1.5,0.02
0.5,1000,ME+I,1.5,0.0001
0.5,1000,ME+I,1.1,0.02
0.5,1000,ME+I,1.1,0.0001
0.5,1000,ME,2.0,0.02
0.5,1000,ME,2.0,0.0001
0.5,1000,ME,1.5,0.02
0.5,1000,ME,1.5,0.0001
0.5,1000,ME,1.1,0.02
0.5,1000,ME,1.1,0.0001
0.5,1000,I,2.0,0.02
0.5,1000,I,2.0,0.0001
0.5,1000,I,1.5,0.02
0.5,1000,I,1.5,0.0001
0.5,1000,I,1.1,0.02
0.5,1000,I,1.1,0.0001
0.3,500,ME+I,2.0,0.02
0.3,500,ME+I,2.0,0.0001
0.34
0.34
0.32
0.32
0.31
0.31
0.32
0.32
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.22
0.22
0.22
0.22
0.21
0.21
0.22
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.16
0.16
19
97.87
97.56
97.89
97.26
97.13
97.29
97.95
97.69
97.78
98.06
97.20
97.86
98.14
97.71
97.82
98.39
97.77
98.57
98.79
98.28
95.92
96.97
97.21
98.45
98.66
98.77
98.88
98.09
98.19
98.18
97.86
97.54
96.90
97.46
96.44
97.60
98.13
98.23
1226.44
1217.20
1190.48
1188.16
1171.92
1171.28
1175.04
1174.32
1169.32
1168.92
1155.96
1142.44
1146.04
1134.80
1155.12
1132.64
1149.60
1127.76
1071.92
1053.88
1059.68
1046.96
1053.52
1028.16
1056.92
1036.44
1049.04
1016.32
1043.88
1002.92
1045.96
1003.00
1048.20
1006.16
1044.28
1000.88
998.44
1003.00
0.3,500,ME+I,1.5,0.02
0.3,500,ME+I,1.5,0.0001
0.3,500,ME+I,1.1,0.02
0.3,500,ME+I,1.1,0.0001
0.3,500,ME,2.0,0.02
0.3,500,ME,2.0,0.0001
0.3,500,ME,1.5,0.02
0.3,500,ME,1.5,0.0001
0.3,500,ME,1.1,0.02
0.3,500,ME,1.1,0.0001
0.3,500,I,2.0,0.02
0.3,500,I,2.0,0.0001
0.3,500,I,1.5,0.02
0.3,500,I,1.5,0.0001
0.3,500,I,1.1,0.02
0.3,500,I,1.1,0.0001
0.3,2000,ME+I,2.0,0.02
0.3,2000,ME+I,2.0,0.0001
0.3,2000,ME+I,1.5,0.02
0.3,2000,ME+I,1.5,0.0001
0.3,2000,ME+I,1.1,0.02
0.3,2000,ME+I,1.1,0.0001
0.3,2000,ME,2.0,0.02
0.3,2000,ME,2.0,0.0001
0.3,2000,ME,1.5,0.02
0.3,2000,ME,1.5,0.0001
0.3,2000,ME,1.1,0.02
0.3,2000,ME,1.1,0.0001
0.3,2000,I,2.0,0.02
0.3,2000,I,2.0,0.0001
0.3,2000,I,1.5,0.02
0.3,2000,I,1.5,0.0001
0.3,2000,I,1.1,0.02
0.3,2000,I,1.1,0.0001
0.3,1000,ME+I,2.0,0.02
0.3,1000,ME+I,2.0,0.0001
0.3,1000,ME+I,1.5,0.02
0.3,1000,ME+I,1.5,0.0001
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.32
0.34
0.31
0.32
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.21
0.22
0.21
0.21
20
97.97
98.37
98.12
97.90
98.13
98.44
98.78
98.05
98.11
98.36
98.39
98.55
98.36
97.51
97.83
97.29
97.82
96.01
96.61
96.50
96.16
97.24
97.27
95.77
96.48
97.31
96.43
95.85
95.56
95.81
97.12
98.11
97.10
95.79
94.97
96.59
96.59
96.29
996.72
983.32
997.60
966.76
998.60
971.80
997.36
969.24
996.00
968.72
997.20
978.32
995.88
964.56
997.64
968.68
1184.36
1225.96
1171.32
1175.60
1150.92
1148.96
1170.44
1171.48
1160.12
1151.52
1150.12
1129.60
1153.76
1131.24
1154.00
1134.88
1147.48
1128.28
1058.36
1059.72
1053.20
1043.68
0.3,1000,ME+I,1.1,0.02
0.3,1000,ME+I,1.1,0.0001
0.3,1000,ME,2.0,0.02
0.3,1000,ME,2.0,0.0001
0.3,1000,ME,1.5,0.02
0.3,1000,ME,1.5,0.0001
0.3,1000,ME,1.1,0.02
0.3,1000,ME,1.1,0.0001
0.3,1000,I,2.0,0.02
0.3,1000,I,2.0,0.0001
0.3,1000,I,1.5,0.02
0.3,1000,I,1.5,0.0001
0.3,1000,I,1.1,0.02
0.3,1000,I,1.1,0.0001
0.1,500,ME+I,2.0,0.02
0.1,500,ME+I,2.0,0.0001
0.1,500,ME+I,1.5,0.02
0.1,500,ME+I,1.5,0.0001
0.1,500,ME+I,1.1,0.02
0.1,500,ME+I,1.1,0.0001
0.1,500,ME,2.0,0.02
0.1,500,ME,2.0,0.0001
0.1,500,ME,1.5,0.02
0.1,500,ME,1.5,0.0001
0.1,500,ME,1.1,0.02
0.1,500,ME,1.1,0.0001
0.1,500,I,2.0,0.02
0.1,500,I,2.0,0.0001
0.1,500,I,1.5,0.02
0.1,500,I,1.5,0.0001
0.1,500,I,1.1,0.02
0.1,500,I,1.1,0.0001
0.1,2000,ME+I,2.0,0.02
0.1,2000,ME+I,2.0,0.0001
0.1,2000,ME+I,1.5,0.02
0.1,2000,ME+I,1.5,0.0001
0.1,2000,ME+I,1.1,0.02
0.1,2000,ME+I,1.1,0.0001
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.31
0.31
0.31
0.31
0.31
0.31
21
98.29
97.93
97.33
98.02
95.34
95.54
96.92
97.23
97.65
96.54
95.78
96.49
96.77
96.72
96.61
95.81
95.41
97.03
96.58
97.02
96.29
96.80
95.68
96.15
95.78
96.61
96.69
96.31
96.01
95.68
95.82
96.19
96.47
97.14
97.21
98.25
98.12
96.25
1047.48
1006.04
1047.56
1025.24
1048.40
1006.60
1040.96
1001.80
1046.96
1010.36
1051.16
1006.16
1045.64
999.52
995.84
971.08
995.32
966.40
995.56
973.00
995.56
968.88
999.52
968.08
996.88
968.36
996.84
970.08
996.00
970.40
996.20
969.84
1154.56
1157.40
1151.28
1131.32
1148.64
1125.76
0.1,2000,ME,2.0,0.02
0.1,2000,ME,2.0,0.0001
0.1,2000,ME,1.5,0.02
0.1,2000,ME,1.5,0.0001
0.1,2000,ME,1.1,0.02
0.1,2000,ME,1.1,0.0001
0.1,2000,I,2.0,0.02
0.1,2000,I,2.0,0.0001
0.1,2000,I,1.5,0.02
0.1,2000,I,1.5,0.0001
0.1,2000,I,1.1,0.02
0.1,2000,I,1.1,0.0001
0.1,1000,ME+I,2.0,0.02
0.1,1000,ME+I,2.0,0.0001
0.1,1000,ME+I,1.5,0.02
0.1,1000,ME+I,1.5,0.0001
0.1,1000,ME+I,1.1,0.02
0.1,1000,ME+I,1.1,0.0001
0.1,1000,ME,2.0,0.02
0.1,1000,ME,2.0,0.0001
0.1,1000,ME,1.5,0.02
0.1,1000,ME,1.5,0.0001
0.1,1000,ME,1.1,0.02
0.1,1000,ME,1.1,0.0001
0.1,1000,I,2.0,0.02
0.1,1000,I,2.0,0.0001
0.1,1000,I,1.5,0.02
0.1,1000,I,1.5,0.0001
0.1,1000,I,1.1,0.02
0.1,1000,I,1.1,0.0001
0.05,500,ME+I,2.0,0.02
0.05,500,ME+I,2.0,0.0001
0.05,500,ME+I,1.5,0.02
0.05,500,ME+I,1.5,0.0001
0.05,500,ME+I,1.1,0.02
0.05,500,ME+I,1.1,0.0001
0.05,500,ME,2.0,0.02
0.05,500,ME,2.0,0.0001
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
22
97.41
97.62
97.73
97.73
97.87
97.22
97.24
95.68
94.91
95.62
94.25
96.06
97.34
97.42
97.26
97.05
96.66
98.06
97.68
97.66
97.74
97.69
97.95
97.83
96.10
95.81
96.72
96.96
96.48
96.97
97.24
96.63
97.31
97.56
97.33
97.15
96.23
97.46
1153.04
1139.76
1149.56
1131.12
1148.72
1129.48
1150.88
1133.92
1151.36
1135.04
1148.12
1132.36
1043.80
1012.40
1044.76
1000.32
1042.76
1000.16
1043.00
1003.92
1043.64
1001.60
1046.68
1002.68
1046.00
1003.88
1041.84
1000.64
1045.88
1003.84
997.76
969.04
995.32
966.84
995.60
969.08
997.08
971.00
0.05,500,ME,1.5,0.02
0.05,500,ME,1.5,0.0001
0.05,500,ME,1.1,0.02
0.05,500,ME,1.1,0.0001
0.05,500,I,2.0,0.02
0.05,500,I,2.0,0.0001
0.05,500,I,1.5,0.02
0.05,500,I,1.5,0.0001
0.05,500,I,1.1,0.02
0.05,500,I,1.1,0.0001
0.05,2000,ME+I,2.0,0.02
0.05,2000,ME+I,2.0,0.0001
0.05,2000,ME+I,1.5,0.02
0.05,2000,ME+I,1.5,0.0001
0.05,2000,ME+I,1.1,0.02
0.05,2000,ME+I,1.1,0.0001
0.05,2000,ME,2.0,0.02
0.05,2000,ME,2.0,0.0001
0.05,2000,ME,1.5,0.02
0.05,2000,ME,1.5,0.0001
0.05,2000,ME,1.1,0.02
0.05,2000,ME,1.1,0.0001
0.05,2000,I,2.0,0.02
0.05,2000,I,2.0,0.0001
0.05,2000,I,1.5,0.02
0.05,2000,I,1.5,0.0001
0.05,2000,I,1.1,0.02
0.05,2000,I,1.1,0.0001
0.05,1000,ME+I,2.0,0.02
0.05,1000,ME+I,2.0,0.0001
0.05,1000,ME+I,1.5,0.02
0.05,1000,ME+I,1.5,0.0001
0.05,1000,ME+I,1.1,0.02
0.05,1000,ME+I,1.1,0.0001
0.05,1000,ME,2.0,0.02
0.05,1000,ME,2.0,0.0001
0.05,1000,ME,1.5,0.02
0.05,1000,ME,1.5,0.0001
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
23
97.55
96.71
95.80
96.52
98.23
96.40
96.81
96.51
96.55
96.97
98.08
97.88
97.97
97.98
97.77
98.04
98.05
98.14
98.21
98.15
98.11
97.26
97.86
97.84
98.56
98.07
97.36
98.04
96.67
96.80
95.98
97.60
97.59
97.92
95.85
96.70
97.88
97.31
994.76
967.52
995.36
969.68
995.56
967.08
998.12
971.36
996.88
973.32
1149.36
1131.84
1145.80
1127.56
1145.32
1127.08
1149.92
1128.48
1146.80
1128.40
1148.16
1124.52
1126.84
1135.60
1155.56
1134.72
1145.32
1127.44
1042.08
999.68
1043.12
1000.24
1044.92
998.92
1044.24
1001.08
1041.64
998.80
0.05,1000,ME,1.1,0.02
0.05,1000,ME,1.1,0.0001
0.05,1000,I,2.0,0.02
0.05,1000,I,2.0,0.0001
0.05,1000,I,1.5,0.02
0.05,1000,I,1.5,0.0001
0.05,1000,I,1.1,0.02
0.05,1000,I,1.1,0.0001
0.01,500,ME+I,2.0,0.02
0.01,500,ME+I,2.0,0.0001
0.01,500,ME+I,1.5,0.02
0.01,500,ME+I,1.5,0.0001
0.01,500,ME+I,1.1,0.02
0.01,500,ME+I,1.1,0.0001
0.01,500,ME,2.0,0.02
0.01,500,ME,2.0,0.0001
0.01,500,ME,1.5,0.02
0.01,500,ME,1.5,0.0001
0.01,500,ME,1.1,0.02
0.01,500,ME,1.1,0.0001
0.01,500,I,2.0,0.02
0.01,500,I,2.0,0.0001
0.01,500,I,1.5,0.02
0.01,500,I,1.5,0.0001
0.01,500,I,1.1,0.02
0.01,500,I,1.1,0.0001
0.01,2000,ME+I,2.0,0.02
0.01,2000,ME+I,2.0,0.0001
0.01,2000,ME+I,1.5,0.02
0.01,2000,ME+I,1.5,0.0001
0.01,2000,ME+I,1.1,0.02
0.01,2000,ME+I,1.1,0.0001
0.01,2000,ME,2.0,0.02
0.01,2000,ME,2.0,0.0001
0.01,2000,ME,1.5,0.02
0.01,2000,ME,1.5,0.0001
0.01,2000,ME,1.1,0.02
0.01,2000,ME,1.1,0.0001
0.21
0.21
0.21
0.21
0.22
0.21
0.21
0.21
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.16
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
0.31
24
97.01
98.04
97.76
97.61
97.33
97.58
97.67
97.16
95.84
96.53
97.68
97.60
96.47
97.49
96.47
97.93
97.54
96.80
98.11
97.90
96.09
96.16
95.26
97.09
97.41
97.18
97.00
96.89
96.86
97.57
97.07
98.56
98.16
97.32
97.36
97.76
98.06
97.91
1043.12
1000.96
1039.76
1006.80
1045.32
1000.04
1043.52
1000.44
995.28
967.96
995.84
971.52
995.80
965.56
995.40
965.68
995.92
965.56
995.92
965.64
997.20
968.28
997.04
968.28
994.24
966.48
1146.36
1128.12
1140.88
1125.52
1148.24
1125.16
1145.76
1128.20
1140.92
1125.64
1140.92
1125.76
0.01,2000,I,2.0,0.02
0.01,2000,I,2.0,0.0001
0.01,2000,I,1.5,0.02
0.01,2000,I,1.5,0.0001
0.01,2000,I,1.1,0.02
0.01,2000,I,1.1,0.0001
0.01,1000,ME+I,2.0,0.02
0.01,1000,ME+I,2.0,0.0001
0.01,1000,ME+I,1.5,0.02
0.01,1000,ME+I,1.5,0.0001
0.01,1000,ME+I,1.1,0.02
0.01,1000,ME+I,1.1,0.0001
0.01,1000,ME,2.0,0.02
0.01,1000,ME,2.0,0.0001
0.01,1000,ME,1.5,0.02
0.01,1000,ME,1.5,0.0001
0.01,1000,ME,1.1,0.02
0.01,1000,ME,1.1,0.0001
0.01,1000,I,2.0,0.02
0.01,1000,I,2.0,0.0001
0.01,1000,I,1.5,0.02
0.01,1000,I,1.5,0.0001
0.01,1000,I,1.1,0.02
0.01,1000,I,1.1,0.0001
0.31
0.31
0.31
0.31
0.31
0.31
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
0.21
98.10
98.33
97.82
97.97
97.91
98.36
97.16
97.67
97.29
97.29
97.45
97.45
97.59
97.19
96.83
97.27
97.37
97.50
97.94
96.81
97.98
98.02
97.05
97.81
1144.24
1128.08
1148.32
1128.08
1144.64
1125.80
1046.00
1001.36
1043.36
998.12
1046.16
1000.64
1043.72
998.16
1042.20
998.08
1046.36
1000.84
1044.48
1001.80
1043.52
1000.44
1042.88
1002.12
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with
or without main effect and with or without epistasis effect), OR is the odds
ratio and PREV is the prevalence of the disease.
25
Laboratory Note
Genetic Epistasis
IV - Assessing Algorithm Screen and Clean
LN-4-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr Roberto Frias, s/n,
4200-465 PORTO
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www : http://www.fe.up.pt/∼ei09045
[email protected]
www : http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
In this lab note, the algorithm Screen and Clean is presented. As
the name indicates, the algorithm screens all relevant SNPs and fits
them using regression models for main effects and interactions. The
second stage consists of cleaning the previously selected interactions
using a held-out portion of the data, removing possible false positives.
The results show that the algorithm is nearly incapable of finding
epistatic interactions, but produces reasonable results in main effect
and full effect detection for data sets with high allele frequencies.
The scalability of the algorithm is poor, due to the steep increase in
running time with data set size.
1 Introduction
Screen and Clean [WDR+ 10] is a two-stage algorithm that works by creating
a dictionary of disease-related SNPs and SNP interactions. The dictionary
contracts or expands during a multi-step statistical procedure and is then
revised to control the Type I Error Rate.
In the beginning, a dictionary including all SNPs with a minor allele frequency
above 0.01 is created. If the number of SNPs is greater than the specified
upper limit of covariates allowed to enter the screen stage, the SNPs with the
lowest marginal p-values are selected. The data is divided between step 1
(screen) and step 2 (clean). In step 1, a screen stage is applied to restrict
the number of terms. This restriction is applied using regression models for
main effects or interactions. For main effect models, the function used is
g(E[Y | X]) = β_0 + Σ_{j=1}^{N} β_j X_j                                        (1)

where g is an appropriate link function, X_j is the encoded genotype value
(0, 1 or 2) and Y is the encoded phenotype (0 or 1). Based on the selected
SNPs, the algorithm then tries to find relevant interacting SNPs that fit the
following interaction model:

g(E[Y | X]) = β_0 + Σ_{j=1}^{N} β_j X_j + Σ_{i<j; i,j=1,...,N} β_{ij} X_i X_j  (2)

where S = {j : β_j ≠ 0, j ∈ 1, ..., L} ∪ {(i, j) : β_{ij} ≠ 0, (i, j) ∈ 1, ..., L} is
the set of terms associated with the phenotype as main effects or interactions.
Cross-validation is applied at this stage to further restrict the set. In stage 2,
the resulting dictionary is cleaned, keeping only terms with p-values < α. This
is done using the traditional t-statistic obtained from a least squares analysis
of the screened model.
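To make the two-stage procedure concrete, the sketch below implements the screen-and-clean idea in a simplified form: half of the data screens main-effect and pairwise-interaction terms with an L1-penalised logistic regression, and the other half cleans the surviving terms with a refit and a Bonferroni-corrected p-value threshold. This is only an illustration of the idea, not the authors' R implementation; the function names, the choice of penalty, and the threshold handling are assumptions.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

def screen_and_clean_sketch(X, y, alpha=0.05, max_terms=200):
    """Illustrative two-stage screen-and-clean on a 0/1/2 genotype matrix X and a
    0/1 phenotype vector y (a sketch, not the original R implementation)."""
    n, p = X.shape
    half = n // 2
    pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]

    def design(M):
        # Main-effect columns followed by all pairwise interaction columns.
        inter = np.column_stack([M[:, i] * M[:, j] for i, j in pairs])
        return np.hstack([M, inter])

    # Step 1 (screen): an L1-penalised logistic regression keeps a sparse set of terms.
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    lasso.fit(design(X[:half]), y[:half])
    kept = np.flatnonzero(lasso.coef_[0])[:max_terms]
    if kept.size == 0:
        return []

    # Step 2 (clean): refit the screened terms on the held-out half and keep those
    # whose p-value survives a Bonferroni-corrected threshold.
    D_clean = sm.add_constant(design(X[half:])[:, kept])
    fit = sm.Logit(y[half:], D_clean).fit(disp=0)
    threshold = alpha / kept.size
    survivors = [kept[k] for k, pv in enumerate(fit.pvalues[1:]) if pv < threshold]

    # Map surviving design columns back to SNPs or SNP pairs.
    names = [f"SNP{i}" for i in range(p)] + [f"SNP{i}xSNP{j}" for i, j in pairs]
    return [names[s] for s in survivors]

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 8))   # toy genotypes: 200 individuals, 8 SNPs
y = rng.integers(0, 2, size=200)        # toy phenotype
print(screen_and_clean_sketch(X, y))
```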
1.1 Input files
The input consists of two files: one containing the phenotype of all individuals
and one containing the genotype (all SNPs) of all individuals.
1.2 Output files
There are many outputs available:
1
[Table 1 summary: (a) a genotype file with columns rs1, rs2, rs3, rs4, and rs5, holding one 0/1/2 value per SNP for each of six individuals; (b) a phenotype file with a Label column of 0/1 values for the same six individuals.]
Table 1: An example of the input files containing genotype and phenotype
information with 5 SNPs and 6 individuals. Genotype 0, 1, 2 corresponds
to homozygous dominant, heterozygous, and homozygous recessive. The
phenotype 0, 1 corresponds to control and case, respectively.
• snp screen - a vector of the column names of the SNPS picked by the
screen.
• snp screen2 - vector of K pairs and SNP pairs retained by the second
lasso screen.
• snp clean - vector of screened SNPs also retained by the multivariate
regression clean.
• clean - a data frame with regression output for all of the screened SNPs.
The snp2 column corresponds to the pairwise interaction or "NA" for
main effects.
• final - a data frame with output from the regression of the phenotype on
the final cleaned SNPs.
1.3 Parameters
The following parameters can be configured:
• L - number of SNPs to be retained with the smallest p-values.
• K pairs - number of pairwise interactions to be retained by the lasso.
• response - the type of phenotype. Can be binomial or gaussian.
• alpha - the Bonferroni correction lower bound limit for retention of SNPs.
• snp fix - index of SNPs that are forced into the lasso and multivariate
regression models. Optional.
• cov struct - matrix of covariates that are forced into every model fit by
Screen & Clean. Optional.
• standardize - if true, the genotypes coded as 0, 1, or 2 are centered to
mean 0 and scaled to standard deviation 1 (see the sketch after this list).
The data must be standardized to run the Screen & Clean procedure.
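As a small illustration of the standardize option, the sketch below centres each 0/1/2 genotype column to mean 0 and scales it to standard deviation 1, mirroring the description above; it is not the package's internal code.

```python
import numpy as np

def standardize_genotypes(G):
    """Center each SNP column of a 0/1/2 genotype matrix to mean 0 and scale it to
    standard deviation 1, as the standardize option requires."""
    G = np.asarray(G, dtype=float)
    std = G.std(axis=0)
    std[std == 0] = 1.0   # leave monomorphic SNPs unscaled instead of dividing by zero
    return (G - G.mean(axis=0)) / std

G = np.array([[0, 1, 2], [1, 1, 0], [2, 1, 1]])
print(standardize_genotypes(G).round(2))
```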
2 Experimental Settings
The data sets used in the experiments are characterized in Lab Note 1. The
computer used for these experiments ran the 64-bit Ubuntu 13.10 operating
system, with an Intel(R) Core(TM)2 Quad CPU Q6600 2.40GHz processor
and 8.00 GB of RAM.
The parameters used for these experiments were: L = 200, K pairs = 100,
response = "binomial", standardize = TRUE, alpha = 0.05.
3 Results
Figure 1 shows interesting results. For epistasis detection, the Power is nearly
0 in all configurations. In main effect detection, the ground truth is detected
in data sets with an allele frequency higher than 0.05. In full effect detection,
only data sets with an allele frequency of 0.3 or higher have any Power. There
is no clear pattern between the Power and the data set size.
The scalability test shows a clear increase in running time as the number of
individuals in the data sets grows, which may become a serious obstacle for
larger data sets. Memory usage also increases with data set size, but not as
significantly as running time. CPU usage does not show a clear increase.
The Type I Error Rate in epistasis detection shows a seemingly random
distribution; overall, the error rate is fairly constant. For main effect detection,
there is a larger increase with population size and allele frequency. This is
even clearer in full effect detection, reaching a maximum of 84% for the
configuration with 2000 individuals and 0.5 allele frequency.
Figures 4 and 5 show an indication of Power only for data sets with 2000
individuals, except for the full effect data sets, exactly the same as for the
odds ratio variation.
[Figure 1: bar charts of average Power (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5) for 500, 1000, and 2000 individuals; panels (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 1: Average Power by allele frequency. For each frequency, three sizes
of data sets were used to measure the Power, with an odds ratio of 2.0 and
a prevalence of 0.02. The Power is measured by the number of data sets,
out of all 100 data sets, where the ground truth was amongst the most
relevant results. (a), (b), and (c) represent epistatic, main effect, and main
effect + epistatic interactions, respectively.
[Figure 2: charts of (a) average running time (s), (b) average CPU usage (%), and (c) average memory usage (MB) by number of individuals (500, 1000, 2000).]
Figure 2: Comparison of scalability measures between different sized data
sets. The data sets have a minor allele frequency of 0.5, an odds ratio of 2.0,
a prevalence of 0.02, and use the full effect disease model.
[Figure 3: bar charts of Type I Error Rate (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5) for 500, 1000, and 2000 individuals; panels (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 3: Type I Error Rate by allele frequency. For each frequency, three
sizes of data sets were used, with an odds ratio of 2.0 and a prevalence of
0.02. The Type I Error Rate is measured by the number of data sets, out of
all 100 data sets, where false positives were amongst the most relevant
results. (a), (b), and (c) represent epistatic, main effect, and main effect +
epistatic interactions.
The prevalence variation shows a small Power change in Figure 6, decreasing
as the prevalence increases. Figure 7 reveals that there is relevant Power only
for higher allele frequencies, and only for the main effect and full effect models.
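For reference, the Power and Type I Error Rate values reported in the figures above can be computed per configuration as in the following sketch; the data structures are illustrative.

```python
def power_and_type1(reported_per_dataset, ground_truth):
    """reported_per_dataset: one set of reported SNP pairs per simulated data set;
    ground_truth: the simulated interacting pair of the configuration."""
    n = len(reported_per_dataset)
    true_hits = sum(1 for rep in reported_per_dataset if ground_truth in rep)
    false_hits = sum(1 for rep in reported_per_dataset if rep - {ground_truth})
    return 100.0 * true_hits / n, 100.0 * false_hits / n   # Power (%), Type I Error Rate (%)

# Toy example with 4 data sets and ground truth (SNP3, SNP7):
results = [{("SNP3", "SNP7")}, {("SNP1", "SNP9")}, set(),
           {("SNP3", "SNP7"), ("SNP2", "SNP4")}]
print(power_and_type1(results, ("SNP3", "SNP7")))   # (50.0, 50.0)
```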
4 Summary
Screen and Clean is a heuristic algorithm that works by applying regression
models for main effect and epistasis detection, and then pruning statistically
irrelevant interactions. The results obtained show very low Power for epistasis
detection. In main effect detection there is an increase in Power, especially
for larger data sets, and Power also increases with higher allele frequencies.
In full effect detection, there is only Power for data sets with an allele
frequency higher than 0.1. The scalability is poor, due to the large increase
in running time with data set size. The Type I Error Rate does not vary
significantly with population size or allele frequency in epistasis detection,
contrary to main effect and full effect detection.
References
[WDR+ 10] Jing Wu, Bernie Devlin, Steven Ringquist, Massimo Trucco, and
Kathryn Roeder. Screen and clean: a tool for identifying interactions in genome-wide association studies. Genetic epidemiology,
34:275–285, 2010.
A Bar Graphs
7
[Figure 4: bar charts of Power (%) by population (500, 1000, 2000); panels (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 4: Distribution of the Power by population for all disease models.
The allele frequency is 0.1, the odds ratio is 2.0, and the prevalence is 0.02.
[Figure 5: bar charts of Power (%) by odds ratio (1.1, 1.5, 2.0); panels (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 5: Distribution of the Power by odds ratio for all disease models.
The allele frequency is 0.1, the population size is 2000 individuals, and the
prevalence is 0.02.
[Figure 6: bar charts of Power (%) by prevalence (0.0001, 0.02); panels (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 6: Distribution of the Power by prevalence for all disease models. The
allele frequency is 0.5, the odds ratio is 2.0, and the population size is 2000
individuals.
[Figure 7: bar charts of Power (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5); panels (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 7: Distribution of the Power by allele frequency for all disease models.
The population size is 2000 individuals, the odds ratio is 2.0, and the
prevalence is 0.02.
11
B Table of Results
Table 2: A table containing the percentage of true positives and false positives in each configuration. The first column contains the description of the
configuration. The second and third columns contain the number of datasets
with true positives and false positives respectively, out of all 100 data sets
per configuration.
Configuration*
0.5,500,I,2.0,0.02
0.5,500,I,2.0,0.0001
0.5,500,I,1.5,0.02
0.5,500,I,1.5,0.0001
0.5,500,I,1.1,0.02
0.5,500,I,1.1,0.0001
0.5,2000,I,2.0,0.02
0.5,2000,I,2.0,0.0001
0.5,2000,I,1.5,0.02
0.5,2000,I,1.5,0.0001
0.5,2000,I,1.1,0.02
0.5,2000,I,1.1,0.0001
0.5,1000,I,2.0,0.02
0.5,1000,I,2.0,0.0001
0.5,1000,I,1.5,0.02
0.5,1000,I,1.5,0.0001
0.5,1000,I,1.1,0.02
0.5,1000,I,1.1,0.0001
0.3,500,I,2.0,0.02
0.3,500,I,2.0,0.0001
0.3,500,I,1.5,0.02
0.3,500,I,1.5,0.0001
0.3,500,I,1.1,0.02
0.3,500,I,1.1,0.0001
0.3,2000,I,2.0,0.02
0.3,2000,I,2.0,0.0001
0.3,2000,I,1.5,0.02
0.3,2000,I,1.5,0.0001
0.3,2000,I,1.1,0.02
0.3,2000,I,1.1,0.0001
0.3,1000,I,2.0,0.02
12
TP (%)
0
0
0
0
0
0
0
12
5
0
1
0
0
3
2
0
0
0
0
0
0
0
0
0
2
0
0
1
0
0
0
FP (%)
18
9
11
17
18
25
14
19
25
17
20
10
22
17
11
17
15
13
19
14
14
11
18
14
16
19
20
11
19
14
15
0.3,1000,I,2.0,0.0001
0.3,1000,I,1.5,0.02
0.3,1000,I,1.5,0.0001
0.3,1000,I,1.1,0.02
0.3,1000,I,1.1,0.0001
0.1,500,I,2.0,0.02
0.1,500,I,2.0,0.0001
0.1,500,I,1.5,0.02
0.1,500,I,1.5,0.0001
0.1,500,I,1.1,0.02
0.1,500,I,1.1,0.0001
0.1,2000,I,2.0,0.02
0.1,2000,I,2.0,0.0001
0.1,2000,I,1.5,0.02
0.1,2000,I,1.5,0.0001
0.1,2000,I,1.1,0.02
0.1,2000,I,1.1,0.0001
0.1,1000,I,2.0,0.02
0.1,1000,I,2.0,0.0001
0.1,1000,I,1.5,0.02
0.1,1000,I,1.5,0.0001
0.1,1000,I,1.1,0.02
0.1,1000,I,1.1,0.0001
0.05,500,I,2.0,0.02
0.05,500,I,2.0,0.0001
0.05,500,I,1.5,0.02
0.05,500,I,1.5,0.0001
0.05,500,I,1.1,0.02
0.05,500,I,1.1,0.0001
0.05,2000,I,2.0,0.02
0.05,2000,I,2.0,0.0001
0.05,2000,I,1.5,0.02
0.05,2000,I,1.5,0.0001
0.05,2000,I,1.1,0.02
0.05,2000,I,1.1,0.0001
0.05,1000,I,2.0,0.02
0.05,1000,I,2.0,0.0001
0.05,1000,I,1.5,0.02
13
0
1
0
0
0
0
0
0
0
0
0
6
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
1
0
0
0
0
0
14
10
11
13
12
14
11
17
13
8
16
21
19
13
15
17
16
16
20
11
12
17
10
15
5
23
21
19
17
20
14
25
19
15
14
22
15
14
0.05,1000,I,1.5,0.0001
0.05,1000,I,1.1,0.02
0.05,1000,I,1.1,0.0001
0.01,500,I,2.0,0.02
0.01,500,I,2.0,0.0001
0.01,500,I,1.5,0.02
0.01,500,I,1.5,0.0001
0.01,500,I,1.1,0.02
0.01,500,I,1.1,0.0001
0.01,2000,I,2.0,0.02
0.01,2000,I,2.0,0.0001
0.01,2000,I,1.5,0.02
0.01,2000,I,1.5,0.0001
0.01,2000,I,1.1,0.02
0.01,2000,I,1.1,0.0001
0.01,1000,I,2.0,0.02
0.01,1000,I,2.0,0.0001
0.01,1000,I,1.5,0.02
0.01,1000,I,1.5,0.0001
0.01,1000,I,1.1,0.02
0.01,1000,I,1.1,0.0001
0.5,500,ME,2.0,0.02
0.5,500,ME,2.0,0.0001
0.5,500,ME,1.5,0.02
0.5,500,ME,1.5,0.0001
0.5,500,ME,1.1,0.02
0.5,500,ME,1.1,0.0001
0.5,2000,ME,2.0,0.02
0.5,2000,ME,2.0,0.0001
0.5,2000,ME,1.5,0.02
0.5,2000,ME,1.5,0.0001
0.5,2000,ME,1.1,0.02
0.5,2000,ME,1.1,0.0001
0.5,1000,ME,2.0,0.02
0.5,1000,ME,2.0,0.0001
0.5,1000,ME,1.5,0.02
0.5,1000,ME,1.5,0.0001
0.5,1000,ME,1.1,0.02
14
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
54
35
40
21
15
10
62
74
77
83
91
82
70
72
71
77
53
12
22
16
15
13
15
19
13
11
18
13
11
19
19
11
17
11
11
17
13
11
15
23
17
19
14
27
48
35
31
17
17
21
30
24
22
18
21
0.5,1000,ME,1.1,0.0001
0.3,500,ME,2.0,0.02
0.3,500,ME,2.0,0.0001
0.3,500,ME,1.5,0.02
0.3,500,ME,1.5,0.0001
0.3,500,ME,1.1,0.02
0.3,500,ME,1.1,0.0001
0.3,2000,ME,2.0,0.02
0.3,2000,ME,2.0,0.0001
0.3,2000,ME,1.5,0.02
0.3,2000,ME,1.5,0.0001
0.3,2000,ME,1.1,0.02
0.3,2000,ME,1.1,0.0001
0.3,1000,ME,2.0,0.02
0.3,1000,ME,2.0,0.0001
0.3,1000,ME,1.5,0.02
0.3,1000,ME,1.5,0.0001
0.3,1000,ME,1.1,0.02
0.3,1000,ME,1.1,0.0001
0.1,500,ME,2.0,0.02
0.1,500,ME,2.0,0.0001
0.1,500,ME,1.5,0.02
0.1,500,ME,1.5,0.0001
0.1,500,ME,1.1,0.02
0.1,500,ME,1.1,0.0001
0.1,2000,ME,2.0,0.02
0.1,2000,ME,2.0,0.0001
0.1,2000,ME,1.5,0.02
0.1,2000,ME,1.5,0.0001
0.1,2000,ME,1.1,0.02
0.1,2000,ME,1.1,0.0001
0.1,1000,ME,2.0,0.02
0.1,1000,ME,2.0,0.0001
0.1,1000,ME,1.5,0.02
0.1,1000,ME,1.5,0.0001
0.1,1000,ME,1.1,0.02
0.1,1000,ME,1.1,0.0001
0.05,500,ME,2.0,0.02
15
50
20
16
7
4
0
0
58
48
62
49
33
29
54
52
25
17
11
0
0
0
0
0
0
0
39
32
0
0
0
0
0
0
0
0
0
0
0
20
23
18
20
21
17
15
38
53
37
40
19
29
28
25
23
15
11
21
21
20
17
16
17
14
36
50
24
28
21
15
23
19
9
10
13
15
17
0.05,500,ME,2.0,0.0001
0.05,500,ME,1.5,0.02
0.05,500,ME,1.5,0.0001
0.05,500,ME,1.1,0.02
0.05,500,ME,1.1,0.0001
0.05,2000,ME,2.0,0.02
0.05,2000,ME,2.0,0.0001
0.05,2000,ME,1.5,0.02
0.05,2000,ME,1.5,0.0001
0.05,2000,ME,1.1,0.02
0.05,2000,ME,1.1,0.0001
0.05,1000,ME,2.0,0.02
0.05,1000,ME,2.0,0.0001
0.05,1000,ME,1.5,0.02
0.05,1000,ME,1.5,0.0001
0.05,1000,ME,1.1,0.02
0.05,1000,ME,1.1,0.0001
0.01,500,ME,2.0,0.02
0.01,500,ME,2.0,0.0001
0.01,500,ME,1.5,0.02
0.01,500,ME,1.5,0.0001
0.01,500,ME,1.1,0.02
0.01,500,ME,1.1,0.0001
0.01,2000,ME,2.0,0.02
0.01,2000,ME,2.0,0.0001
0.01,2000,ME,1.5,0.02
0.01,2000,ME,1.5,0.0001
0.01,2000,ME,1.1,0.02
0.01,2000,ME,1.1,0.0001
0.01,1000,ME,2.0,0.02
0.01,1000,ME,2.0,0.0001
0.01,1000,ME,1.5,0.02
0.01,1000,ME,1.5,0.0001
0.01,1000,ME,1.1,0.02
0.01,1000,ME,1.1,0.0001
0.5,500,ME+I,2.0,0.02
0.5,500,ME+I,2.0,0.0001
0.5,500,ME+I,1.5,0.02
16
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
49
37
54
24
13
12
16
16
22
18
10
12
16
15
21
17
16
13
15
16
14
22
14
22
12
24
13
16
20
14
19
14
14
15
13
16
18
17
37
33
33
0.5,500,ME+I,1.5,0.0001
0.5,500,ME+I,1.1,0.02
0.5,500,ME+I,1.1,0.0001
0.5,2000,ME+I,2.0,0.02
0.5,2000,ME+I,2.0,0.0001
0.5,2000,ME+I,1.5,0.02
0.5,2000,ME+I,1.5,0.0001
0.5,2000,ME+I,1.1,0.02
0.5,2000,ME+I,1.1,0.0001
0.5,1000,ME+I,2.0,0.02
0.5,1000,ME+I,2.0,0.0001
0.5,1000,ME+I,1.5,0.02
0.5,1000,ME+I,1.5,0.0001
0.5,1000,ME+I,1.1,0.02
0.5,1000,ME+I,1.1,0.0001
0.3,500,ME+I,2.0,0.02
0.3,500,ME+I,2.0,0.0001
0.3,500,ME+I,1.5,0.02
0.3,500,ME+I,1.5,0.0001
0.3,500,ME+I,1.1,0.02
0.3,500,ME+I,1.1,0.0001
0.3,2000,ME+I,2.0,0.02
0.3,2000,ME+I,2.0,0.0001
0.3,2000,ME+I,1.5,0.02
0.3,2000,ME+I,1.5,0.0001
0.3,2000,ME+I,1.1,0.02
0.3,2000,ME+I,1.1,0.0001
0.3,1000,ME+I,2.0,0.02
0.3,1000,ME+I,2.0,0.0001
0.3,1000,ME+I,1.5,0.02
0.3,1000,ME+I,1.5,0.0001
0.3,1000,ME+I,1.1,0.02
0.3,1000,ME+I,1.1,0.0001
0.1,500,ME+I,2.0,0.02
0.1,500,ME+I,2.0,0.0001
0.1,500,ME+I,1.5,0.02
0.1,500,ME+I,1.5,0.0001
0.1,500,ME+I,1.1,0.02
17
43
47
33
91
93
94
96
89
88
73
69
78
80
78
75
30
37
14
18
5
3
40
92
61
77
50
61
58
72
45
50
23
12
0
0
0
0
0
30
20
27
84
76
78
81
60
72
45
56
51
50
33
39
19
28
20
22
17
14
68
92
50
74
35
35
35
57
34
37
19
18
19
20
23
11
12
0.1,500,ME+I,1.1,0.0001
0.1,2000,ME+I,2.0,0.02
0.1,2000,ME+I,2.0,0.0001
0.1,2000,ME+I,1.5,0.02
0.1,2000,ME+I,1.5,0.0001
0.1,2000,ME+I,1.1,0.02
0.1,2000,ME+I,1.1,0.0001
0.1,1000,ME+I,2.0,0.02
0.1,1000,ME+I,2.0,0.0001
0.1,1000,ME+I,1.5,0.02
0.1,1000,ME+I,1.5,0.0001
0.1,1000,ME+I,1.1,0.02
0.1,1000,ME+I,1.1,0.0001
0.05,500,ME+I,2.0,0.02
0.05,500,ME+I,2.0,0.0001
0.05,500,ME+I,1.5,0.02
0.05,500,ME+I,1.5,0.0001
0.05,500,ME+I,1.1,0.02
0.05,500,ME+I,1.1,0.0001
0.05,2000,ME+I,2.0,0.02
0.05,2000,ME+I,2.0,0.0001
0.05,2000,ME+I,1.5,0.02
0.05,2000,ME+I,1.5,0.0001
0.05,2000,ME+I,1.1,0.02
0.05,2000,ME+I,1.1,0.0001
0.05,1000,ME+I,2.0,0.02
0.05,1000,ME+I,2.0,0.0001
0.05,1000,ME+I,1.5,0.02
0.05,1000,ME+I,1.5,0.0001
0.05,1000,ME+I,1.1,0.02
0.05,1000,ME+I,1.1,0.0001
0.01,500,ME+I,2.0,0.02
0.01,500,ME+I,2.0,0.0001
0.01,500,ME+I,1.5,0.02
0.01,500,ME+I,1.5,0.0001
0.01,500,ME+I,1.1,0.02
0.01,500,ME+I,1.1,0.0001
0.01,2000,ME+I,2.0,0.02
18
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
23
33
20
23
17
15
10
28
23
15
12
8
19
15
19
16
11
15
21
21
22
19
18
18
13
21
15
12
20
14
18
18
13
15
15
14
23
14
0.01,2000,ME+I,2.0,0.0001
0.01,2000,ME+I,1.5,0.02
0.01,2000,ME+I,1.5,0.0001
0.01,2000,ME+I,1.1,0.02
0.01,2000,ME+I,1.1,0.0001
0.01,1000,ME+I,2.0,0.02
0.01,1000,ME+I,2.0,0.0001
0.01,1000,ME+I,1.5,0.02
0.01,1000,ME+I,1.5,0.0001
0.01,1000,ME+I,1.1,0.02
0.01,1000,ME+I,1.1,0.0001
0
0
0
0
0
0
0
0
0
0
0
18
22
14
16
15
14
9
15
15
18
17
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with
or without main effect and with or without epistasis effect), OR is the odds
ratio and PREV is the prevalence of the disease.
Table 3: A table containing the running time, cpu usage and memory usage
in each configuration.
Configuration*
0.5,500,ME+I,2.0,0.02
0.5,500,ME+I,2.0,0.0001
0.5,500,ME+I,1.5,0.02
0.5,500,ME+I,1.5,0.0001
0.5,500,ME+I,1.1,0.02
0.5,500,ME+I,1.1,0.0001
0.5,500,ME,2.0,0.02
0.5,500,ME,2.0,0.0001
0.5,500,ME,1.5,0.02
0.5,500,ME,1.5,0.0001
0.5,500,ME,1.1,0.02
0.5,500,ME,1.1,0.0001
0.5,500,I,2.0,0.02
0.5,500,I,2.0,0.0001
0.5,500,I,1.5,0.02
0.5,500,I,1.5,0.0001
0.5,500,I,1.1,0.02
0.5,500,I,1.1,0.0001
Running Time (s)
8.05
8.10
9.37
9.03
11.23
10.48
10.43
9.98
11.88
11.01
13.19
12.12
14.39
12.97
14.32
13.16
14.44
13.06
19
CPU Usage (%)
75.72
75.95
75.86
75.38
75.14
75.46
76.02
76.18
75.85
80.20
77.06
77.23
76.70
76.82
76.82
76.95
77.07
76.99
Memory Usage (KB)
132928.28
133723.36
132094.44
132148.28
133080.40
131997.88
132144.56
132479.40
133979.16
132260.32
135044.20
133516.64
133500.72
132901.40
133835.88
132729.96
133436.20
132833.76
0.5,2000,ME+I,2.0,0.02
0.5,2000,ME+I,2.0,0.0001
0.5,2000,ME+I,1.5,0.02
0.5,2000,ME+I,1.5,0.0001
0.5,2000,ME+I,1.1,0.02
0.5,2000,ME+I,1.1,0.0001
0.5,2000,ME,2.0,0.02
0.5,2000,ME,2.0,0.0001
0.5,2000,ME,1.5,0.02
0.5,2000,ME,1.5,0.0001
0.5,2000,ME,1.1,0.02
0.5,2000,ME,1.1,0.0001
0.5,2000,I,2.0,0.02
0.5,2000,I,2.0,0.0001
0.5,2000,I,1.5,0.02
0.5,2000,I,1.5,0.0001
0.5,2000,I,1.1,0.02
0.5,2000,I,1.1,0.0001
0.5,1000,ME+I,2.0,0.02
0.5,1000,ME+I,2.0,0.0001
0.5,1000,ME+I,1.5,0.02
0.5,1000,ME+I,1.5,0.0001
0.5,1000,ME+I,1.1,0.02
0.5,1000,ME+I,1.1,0.0001
0.5,1000,ME,2.0,0.02
0.5,1000,ME,2.0,0.0001
0.5,1000,ME,1.5,0.02
0.5,1000,ME,1.5,0.0001
0.5,1000,ME,1.1,0.02
0.5,1000,ME,1.1,0.0001
0.5,1000,I,2.0,0.02
0.5,1000,I,2.0,0.0001
0.5,1000,I,1.5,0.02
0.5,1000,I,1.5,0.0001
0.5,1000,I,1.1,0.02
0.5,1000,I,1.1,0.0001
0.3,500,ME+I,2.0,0.02
0.3,500,ME+I,2.0,0.0001
34.65
31.37
67.20
51.19
106.54
76.74
86.58
65.83
115.43
84.74
141.42
101.23
177.72
121.39
175.38
121.20
175.60
121.17
18.65
17.79
27.78
23.17
36.70
30.35
31.89
27.30
38.06
31.63
43.46
36.17
52.11
42.34
52.57
41.92
52.19
41.54
10.43
8.47
20
77.25
76.78
98.96
98.96
99.00
99.00
99.00
98.98
99.00
99.00
98.99
98.99
98.97
98.99
99.00
99.00
99.00
99.00
98.99
98.96
98.93
98.90
98.96
98.96
98.98
98.98
99.00
99.00
99.00
99.00
98.93
98.70
98.75
98.96
98.94
98.92
77.27
77.17
156137.44
153924.36
156944.56
156487.44
157200.08
157071.08
157251.32
156912.40
156853.92
157327.68
156264.16
156739.32
155566.84
155160.08
155246.92
155048.64
155732.32
155220.24
140519.08
140777.40
140225.08
140721.96
139726.28
139558.00
139635.88
140585.16
140003.76
139501.32
139474.84
139679.12
138762.56
138701.96
138734.88
138545.88
138834.24
138252.56
132217.92
132292.96
0.3,500,ME+I,1.5,0.02
0.3,500,ME+I,1.5,0.0001
0.3,500,ME+I,1.1,0.02
0.3,500,ME+I,1.1,0.0001
0.3,500,ME,2.0,0.02
0.3,500,ME,2.0,0.0001
0.3,500,ME,1.5,0.02
0.3,500,ME,1.5,0.0001
0.3,500,ME,1.1,0.02
0.3,500,ME,1.1,0.0001
0.3,500,I,2.0,0.02
0.3,500,I,2.0,0.0001
0.3,500,I,1.5,0.02
0.3,500,I,1.5,0.0001
0.3,500,I,1.1,0.02
0.3,500,I,1.1,0.0001
0.3,2000,ME+I,2.0,0.02
0.3,2000,ME+I,2.0,0.0001
0.3,2000,ME+I,1.5,0.02
0.3,2000,ME+I,1.5,0.0001
0.3,2000,ME+I,1.1,0.02
0.3,2000,ME+I,1.1,0.0001
0.3,2000,ME,2.0,0.02
0.3,2000,ME,2.0,0.0001
0.3,2000,ME,1.5,0.02
0.3,2000,ME,1.5,0.0001
0.3,2000,ME,1.1,0.02
0.3,2000,ME,1.1,0.0001
0.3,2000,I,2.0,0.02
0.3,2000,I,2.0,0.0001
0.3,2000,I,1.5,0.02
0.3,2000,I,1.5,0.0001
0.3,2000,I,1.1,0.02
0.3,2000,I,1.1,0.0001
0.3,1000,ME+I,2.0,0.02
0.3,1000,ME+I,2.0,0.0001
0.3,1000,ME+I,1.5,0.02
0.3,1000,ME+I,1.5,0.0001
12.12
10.55
13.41
12.22
11.96
10.87
13.01
11.80
13.89
12.64
14.56
13.11
14.48
13.12
14.62
12.91
80.13
38.01
108.52
68.37
133.62
93.15
94.83
73.01
129.27
93.37
148.69
104.23
163.18
112.34
165.19
113.22
165.32
112.89
29.58
18.54
35.96
26.77
21
76.84
76.87
76.91
76.76
76.68
76.92
78.76
80.22
76.88
76.67
76.78
76.36
76.47
76.53
76.97
79.52
77.12
76.67
77.82
79.27
76.53
76.23
72.29
76.34
77.26
77.30
76.90
77.27
76.81
76.76
77.06
76.96
76.90
77.25
76.65
76.70
76.58
76.25
134113.20
131771.24
133854.48
132673.56
134613.12
132077.72
134016.68
133295.48
134210.72
133055.16
133087.44
133321.08
133449.28
132977.32
132934.04
132643.12
157067.76
156089.92
156891.04
157415.12
156605.84
156330.20
145398.52
157256.60
156373.32
156781.52
155918.68
156126.56
155362.60
155090.44
155644.08
155052.12
155570.60
155295.48
140007.28
140461.08
139731.16
139746.88
0.3,1000,ME+I,1.1,0.02
0.3,1000,ME+I,1.1,0.0001
0.3,1000,ME,2.0,0.02
0.3,1000,ME,2.0,0.0001
0.3,1000,ME,1.5,0.02
0.3,1000,ME,1.5,0.0001
0.3,1000,ME,1.1,0.02
0.3,1000,ME,1.1,0.0001
0.3,1000,I,2.0,0.02
0.3,1000,I,2.0,0.0001
0.3,1000,I,1.5,0.02
0.3,1000,I,1.5,0.0001
0.3,1000,I,1.1,0.02
0.3,1000,I,1.1,0.0001
0.1,500,ME+I,2.0,0.02
0.1,500,ME+I,2.0,0.0001
0.1,500,ME+I,1.5,0.02
0.1,500,ME+I,1.5,0.0001
0.1,500,ME+I,1.1,0.02
0.1,500,ME+I,1.1,0.0001
0.1,500,ME,2.0,0.02
0.1,500,ME,2.0,0.0001
0.1,500,ME,1.5,0.02
0.1,500,ME,1.5,0.0001
0.1,500,ME,1.1,0.02
0.1,500,ME,1.1,0.0001
0.1,500,I,2.0,0.02
0.1,500,I,2.0,0.0001
0.1,500,I,1.5,0.02
0.1,500,I,1.5,0.0001
0.1,500,I,1.1,0.02
0.1,500,I,1.1,0.0001
0.1,2000,ME+I,2.0,0.02
0.1,2000,ME,1.5,0.02
0.1,2000,ME,1.5,0.0001
0.1,2000,ME,1.1,0.02
0.1,2000,ME,1.1,0.0001
0.1,2000,I,2.0,0.02
39.73
33.51
35.14
28.44
39.75
32.14
41.93
34.67
44.55
35.88
45.07
36.22
45.38
36.54
14.39
13.63
15.68
13.83
15.71
13.91
15.48
13.81
15.56
14.08
15.56
14.01
15.56
14.02
15.87
14.09
15.77
14.07
158.73
179.10
121.32
179.21
123.67
179.43
22
74.45
60.15
76.79
76.98
76.60
78.60
72.53
76.61
76.64
76.58
77.84
76.64
76.60
76.65
76.21
96.56
99.00
99.00
98.92
99.00
99.00
98.98
98.99
98.96
99.00
98.98
98.99
98.98
99.00
98.98
99.00
98.97
77.20
99.00
99.00
94.98
94.92
93.90
135312.52
138464.68
139728.88
139477.16
139325.40
139321.04
134819.68
139146.12
137672.52
138353.48
138644.36
138355.68
138536.52
138406.04
133049.56
133010.00
133365.44
132876.08
133175.88
132622.44
133278.80
132322.28
133141.08
132418.28
133151.84
133099.92
133436.84
133378.84
133192.84
132645.00
133209.40
133120.92
155763.80
155353.08
155054.96
155592.84
155266.52
155530.20
0.1,2000,I,2.0,0.0001
0.1,2000,I,1.5,0.02
0.1,2000,I,1.5,0.0001
0.1,2000,I,1.1,0.02
0.1,2000,I,1.1,0.0001
0.1,1000,ME+I,2.0,0.02
0.1,1000,ME+I,2.0,0.0001
0.1,1000,ME+I,1.5,0.02
0.1,1000,ME+I,1.5,0.0001
0.1,1000,ME+I,1.1,0.02
0.1,1000,ME+I,1.1,0.0001
0.1,1000,ME,2.0,0.02
0.1,1000,ME,2.0,0.0001
0.1,1000,ME,1.5,0.02
0.1,1000,ME,1.5,0.0001
0.1,1000,ME,1.1,0.02
0.1,1000,ME,1.1,0.0001
0.1,1000,I,2.0,0.02
0.1,1000,I,2.0,0.0001
0.1,1000,I,1.5,0.02
0.1,1000,I,1.5,0.0001
0.1,1000,I,1.1,0.02
0.1,1000,I,1.1,0.0001
0.05,500,ME+I,2.0,0.02
0.05,500,ME+I,2.0,0.0001
0.05,500,ME+I,1.5,0.02
0.05,500,ME+I,1.5,0.0001
0.05,500,ME+I,1.1,0.02
0.05,500,ME+I,1.1,0.0001
0.05,500,ME,2.0,0.02
0.05,500,ME,2.0,0.0001
0.05,500,ME,1.5,0.02
0.05,500,ME,1.5,0.0001
0.05,500,ME,1.1,0.02
0.05,500,ME,1.1,0.0001
0.05,500,I,2.0,0.02
0.05,500,I,2.0,0.0001
0.05,500,I,1.5,0.02
122.25
178.88
122.16
177.35
121.66
48.96
39.09
50.81
40.85
51.26
41.67
50.57
39.97
50.41
40.33
50.04
40.06
49.73
40.38
50.46
39.94
49.87
40.13
16.57
14.88
16.49
14.89
16.64
15.06
16.73
15.19
16.43
13.11
14.49
13.12
14.35
13.03
14.30
23
99.00
98.99
99.00
99.00
99.00
99.00
99.00
95.48
96.90
97.28
92.17
95.22
97.55
98.21
98.03
98.85
98.99
99.00
98.19
98.66
98.98
99.00
98.99
98.78
98.78
98.75
98.83
98.82
98.96
98.89
98.79
98.87
76.92
76.75
76.65
76.96
78.87
79.83
155230.16
155394.24
155728.40
155322.72
155222.80
139186.68
138626.88
138308.36
138674.00
138804.08
138171.24
138776.32
138625.88
138871.48
138225.84
138539.48
138220.52
138651.80
138314.80
139010.80
138255.48
138711.12
138267.92
133539.84
132866.20
133801.12
132811.36
133192.68
133419.44
133147.12
132763.08
133810.48
133189.44
133125.68
133707.68
133531.60
133306.00
133015.40
0.05,500,I,1.5,0.0001
0.05,500,I,1.1,0.02
0.05,500,I,1.1,0.0001
0.05,2000,ME+I,2.0,0.02
0.05,2000,ME+I,2.0,0.0001
0.05,2000,ME+I,1.5,0.02
0.05,2000,ME+I,1.5,0.0001
0.05,2000,ME+I,1.1,0.02
0.05,2000,ME+I,1.1,0.0001
0.05,2000,ME,2.0,0.02
0.05,2000,ME,2.0,0.0001
0.05,2000,ME,1.5,0.02
0.05,2000,ME,1.5,0.0001
0.05,2000,ME,1.1,0.02
0.05,2000,ME,1.1,0.0001
0.05,2000,I,2.0,0.02
0.05,2000,I,2.0,0.0001
0.05,2000,I,1.5,0.02
0.05,2000,I,1.5,0.0001
0.05,2000,I,1.1,0.02
0.05,2000,I,1.1,0.0001
0.05,1000,ME+I,2.0,0.02
0.05,1000,ME+I,2.0,0.0001
0.05,1000,ME+I,1.5,0.02
0.05,1000,ME+I,1.5,0.0001
0.05,1000,ME+I,1.1,0.02
0.05,1000,ME+I,1.1,0.0001
0.05,1000,ME,2.0,0.02
0.05,1000,ME,2.0,0.0001
0.05,1000,ME,1.5,0.02
0.05,1000,ME,1.5,0.0001
0.05,1000,ME,1.1,0.02
0.05,1000,ME,1.1,0.0001
0.05,1000,I,2.0,0.02
0.05,1000,I,2.0,0.0001
0.05,1000,I,1.5,0.02
0.05,1000,I,1.5,0.0001
0.05,1000,I,1.1,0.02
12.98
14.35
12.98
190.54
131.34
180.23
123.29
178.12
124.22
178.90
123.05
178.36
123.93
180.79
123.34
119.66
122.52
178.53
121.41
178.34
122.37
50.33
40.63
50.46
40.53
50.30
40.63
50.16
40.17
49.89
40.33
49.85
39.79
49.24
40.29
50.96
41.18
50.33
24
78.90
78.62
79.54
99.00
99.00
99.00
98.98
99.00
99.00
99.00
99.00
99.00
99.00
98.53
99.00
99.00
99.00
99.00
99.00
99.00
98.98
98.97
98.91
98.92
98.89
98.92
97.77
99.00
98.59
98.97
98.91
98.98
99.00
98.88
98.85
92.36
88.89
97.64
133355.08
133549.00
133045.08
155711.80
155593.60
155621.12
155334.44
155591.40
155121.64
155657.64
155412.04
155709.28
155206.40
155466.84
155389.40
155255.48
155137.48
155502.28
155484.52
155644.44
155444.68
138882.24
138206.72
138581.16
138158.84
138835.84
138088.72
138770.92
138376.08
138975.00
138648.36
138953.24
138266.52
138535.88
138387.88
138616.68
137906.16
138686.16
0.05,1000,I,1.1,0.0001
0.01,500,ME+I,2.0,0.02
0.01,500,ME+I,2.0,0.0001
0.01,500,ME+I,1.5,0.02
0.01,500,ME+I,1.5,0.0001
0.01,500,ME+I,1.1,0.02
0.01,500,ME+I,1.1,0.0001
0.01,500,ME,2.0,0.02
0.01,500,ME,2.0,0.0001
0.01,500,ME,1.5,0.02
0.01,500,ME,1.5,0.0001
0.01,500,ME,1.1,0.02
0.01,500,ME,1.1,0.0001
0.01,500,I,2.0,0.02
0.01,500,I,2.0,0.0001
0.01,500,I,1.5,0.02
0.01,500,I,1.5,0.0001
0.01,500,I,1.1,0.02
0.01,500,I,1.1,0.0001
0.01,2000,ME+I,2.0,0.02
0.01,2000,ME+I,2.0,0.0001
0.01,2000,ME+I,1.5,0.02
0.01,2000,ME+I,1.5,0.0001
0.01,2000,ME+I,1.1,0.02
0.01,2000,ME+I,1.1,0.0001
0.01,2000,ME,2.0,0.02
0.01,2000,ME,2.0,0.0001
0.01,2000,ME,1.5,0.02
0.01,2000,ME,1.5,0.0001
0.01,2000,ME,1.1,0.02
0.01,2000,ME,1.1,0.0001
0.01,2000,I,2.0,0.02
0.01,2000,I,2.0,0.0001
0.01,2000,I,1.5,0.02
0.01,2000,I,1.5,0.0001
0.01,2000,I,1.1,0.02
0.01,2000,I,1.1,0.0001
0.01,1000,ME+I,2.0,0.02
40.04
15.85
14.01
15.65
13.97
15.61
14.17
15.55
13.99
15.52
13.96
15.65
14.10
15.71
14.17
15.64
13.96
15.63
14.00
164.42
113.09
162.51
112.62
164.66
123.62
165.13
111.69
162.59
113.86
164.27
113.76
164.08
113.61
163.14
111.07
162.67
109.01
50.30
25
98.91
98.98
98.96
98.99
98.94
98.98
98.99
98.97
98.94
98.96
98.95
98.92
98.95
98.97
98.97
98.95
98.90
98.94
98.93
77.57
77.42
77.55
77.64
77.83
99.00
77.53
79.50
78.46
76.92
76.97
76.75
77.22
77.24
78.94
79.09
77.36
75.25
98.92
138828.36
133601.64
132918.96
132987.16
133053.60
133764.08
132439.24
133469.20
132721.64
133784.12
132771.04
133548.44
132695.16
132971.68
133037.36
133494.84
132756.08
133574.76
133471.80
155881.28
155392.92
155558.80
155313.16
155642.44
155250.16
155705.44
155394.84
155294.12
155176.88
155396.96
155034.04
155443.76
154994.48
155548.28
155093.92
155497.36
150508.88
138738.04
0.01,1000,ME+I,2.0,0.0001
0.01,1000,ME+I,1.5,0.02
0.01,1000,ME+I,1.5,0.0001
0.01,1000,ME+I,1.1,0.02
0.01,1000,ME+I,1.1,0.0001
0.01,1000,ME,2.0,0.02
0.01,1000,ME,2.0,0.0001
0.01,1000,ME,1.5,0.02
0.01,1000,ME,1.5,0.0001
0.01,1000,ME,1.1,0.02
0.01,1000,ME,1.1,0.0001
0.01,1000,I,2.0,0.02
0.01,1000,I,2.0,0.0001
0.01,1000,I,1.5,0.02
0.01,1000,I,1.5,0.0001
0.01,1000,I,1.1,0.02
0.01,1000,I,1.1,0.0001
40.31
50.50
40.31
50.41
40.45
50.41
40.44
50.07
40.31
49.94
40.09
49.99
40.10
49.52
40.10
50.00
40.43
99.00
99.00
98.97
99.00
99.00
98.99
99.00
98.99
99.00
98.96
98.99
98.99
98.95
98.92
98.89
98.97
97.41
138451.80
138824.72
138218.68
138501.24
138278.32
138407.60
138331.40
138824.28
138470.76
138802.64
138732.52
138941.92
138575.56
138679.60
138683.16
138552.00
138510.32
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with
or without main effect and with or without epistasis effect), OR is the odds
ratio and PREV is the prevalence of the disease.
26
Laboratory Note
Genetic Epistasis
V - Assessing Algorithm SNPRuler
LN-5-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr Roberto Frias, s/n,
4200-465 PORTO
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www : http://www.fe.up.pt/∼ei09045
[email protected]
www : http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
In this lab note, the algorithm SNPRuler is presented. SNPRuler
is an epistasis detection algorithm, written in Java, that creates rules
based on the epistatic interactions detected in data sets. Using many
configurations of data sets, the results obtained show a correlation
between Power and the number of sampled individuals, and between
Power and the minor allele frequency. The algorithm therefore has very
high accuracy under optimal conditions, but very low accuracy under
sub-optimal conditions. The algorithm scales well with the number of
individuals, with only a slight increase in running time and memory
usage. The Type I Error Rate is very low in all configurations.
1 Introduction
SNPRuler [WYY+ 10] is a rule-based algorithm that creates association rules
between SNPs and the phenotype related to the expression of a disease.
The order (magnitude) of these interactions can involve any number of SNPs.
For each rule, a 3x3 table is generated, relating each possible genotype
combination to the probability of the phenotype being expressed.
The way rules are defined is described in the following steps:
1. Literal - A literal s is an index-value pair (i, v), with i denoting an index
and v a value in {1, 2, 3} representing the possible genotypes. A sample
satisfies a literal (i, v) if and only if its i-th SNP has the value v.
2. Predictive Rule - A predictive rule (r, ζ) : s1 ∩ s2 ∩ ... ∩ sn → ζ is an
association between a conjunction of n literals, denoted r, and a class
label ζ. A sample satisfies (r, ζ) if and only if it satisfies all literals in
r and its class label is ζ.
3. Literal Relevance - Given a predictive rule (r, ζ) and a utility function
U(r, ζ) for rule measurement, a literal si in the rule r is relevant if and
only if U(r, ζ) > U(r − si, ζ). Here, r − si means removing si from r.
4. Closed Rule - A predictive rule (r, ζ) is closed if and only if there is
no literal si which satisfies U(r + si, ζ) > U(r, ζ). Here, r + si means
adding si to r.
The utility function used to measure relevance is the χ2 statistic. Considering
that most epistatic interactions involve several SNPs, before a rule is extended
an upper bound is used to determine whether adding a new SNP can still make
the rule significant. This immensely decreases the number of rules created,
compared to an exhaustive search. A branch-and-bound approach is used for
this purpose.
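As a concrete, simplified illustration of the utility function, the sketch below scores a candidate rule with the χ2 statistic of the contingency between rule satisfaction and the class label; the exact contingency table used by SNPRuler may differ, and all names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

def satisfies(sample, rule):
    """A rule is a conjunction of literals, i.e. (SNP index, genotype value) pairs."""
    return all(sample[i] == v for i, v in rule)

def rule_chi2(genotypes, labels, rule):
    """Chi-square utility of a rule: 2x2 contingency of rule satisfaction vs. class label."""
    hit = np.array([satisfies(s, rule) for s in genotypes])
    table = np.array([
        [np.sum(hit & (labels == 1)), np.sum(hit & (labels == 0))],
        [np.sum(~hit & (labels == 1)), np.sum(~hit & (labels == 0))],
    ])
    chi2_stat, p_value, _, _ = chi2_contingency(table, correction=False)
    return chi2_stat, p_value

# Toy example: rule "SNP0 == 1 and SNP2 == 2" on a small genotype matrix.
genotypes = np.array([[1, 0, 2], [1, 1, 2], [0, 2, 1], [2, 0, 0]])
labels = np.array([1, 1, 0, 0])
print(rule_chi2(genotypes, labels, [(0, 1), (2, 2)]))
```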
1.1 Input files
The algorithm is written in Java and receives a file containing the genotype
and the phenotype expressed for each individual. The first row contains
the number of each SNP, and the final column corresponds to the label.
Each subsequent row contains an individual's genotypes 0, 1, 2, corresponding
to the homozygous dominant genotype (AA), the heterozygous genotype (Aa),
and the homozygous recessive genotype (aa). The label 0, 1 corresponds to
control and disease affected, respectively.
1
X1  X2  X3  X4  Label
1   1   0   2   0
1   0   1   1   1
1   2   2   0   1
Table 1: An example of the input file containing genotype and phenotype
information with 4 SNPs and 3 individuals.
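A file in this format could be read as in the sketch below; the file name is hypothetical and the separator is assumed to be whitespace.

```python
import pandas as pd

# Hypothetical file in the format of Table 1: one column per SNP plus a final Label column.
data = pd.read_csv("snpruler_input.txt", sep=r"\s+")
genotypes = data.drop(columns=["Label"]).to_numpy()   # n_individuals x n_SNPs, values 0/1/2
labels = data["Label"].to_numpy()                     # 0 = control, 1 = disease affected
print(genotypes.shape, labels)
```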
1.2 Output files
The output is a list of interactions ranked by their significance in the χ2 test.
A post-processing step computes the p-value of these interactions, adjusting
the χ2 test with a Bonferroni correction and using a significance threshold
of 0.3.
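The post-processing step can be pictured as in the sketch below, under the assumption that the Bonferroni correction simply multiplies each raw χ2 p-value by the number of tested interactions; the degrees of freedom and the number of tests shown are illustrative.

```python
from scipy.stats import chi2

def bonferroni_adjusted_p(chi2_stat, df, n_tests):
    """Raw p-value of a chi-square statistic, multiplied by the number of tests (capped at 1)."""
    p = chi2.sf(chi2_stat, df)
    return min(1.0, p * n_tests)

# Example: a statistic of 40.0 with 4 degrees of freedom among 4950 tested SNP pairs
# (both values chosen for illustration only).
adjusted = bonferroni_adjusted_p(40.0, df=4, n_tests=4950)
print(adjusted, adjusted < 0.3)   # an interaction is retained if below the 0.3 threshold
```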
1.3 Parameters
There are 3 configurable parameters:
• listSize - the expected number of interactions.
• depth - the order of interaction, i.e. the number of interacting SNPs.
• updateRatio - the step size used when updating a rule. Takes a value
between 0 and 1, where 0 means a rule is never updated and 1 means a
rule is updated at every step.
2 Experimental Settings
The data sets used in the experiments are characterized in Lab Note 1. The
computer used for these experiments ran the 64-bit Ubuntu 13.10 operating
system, with an Intel(R) Core(TM)2 Quad CPU Q6600 2.40GHz processor
and 8.00 GB of RAM.
The algorithm's settings consist of a -Xmx7000M heap size, with the maximum
number of rules set to 50 000. The length of the rules is 2, since the data sets
used contain ground truths consisting of pairs of SNPs. The pruning threshold
is 0, which means that all possible combinations will be tested.
3 Results
SNPRuler is used exclusively for the interaction effect; therefore, data sets
with main effect and full effect will not be analyzed.
Figure 1 displays the Power obtained for each allele frequency with different
population sizes. For data sets with 500 individuals, the Power is nearly 0 for
all allele frequencies. However, as the population size increases, the Power
starts to rise for data sets with an allele frequency higher than 0.1. This is
also true for data sets with 2000 individuals, which show a slightly higher
Power than smaller data sets. The configuration with the most Power
corresponds to the data sets with 2000 individuals and 0.5 minor allele
frequency.
[Figure 1: bar chart of Power (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5) for 500, 1000, and 2000 individuals.]
Figure 1: Power by allele frequency. For each frequency, three sizes of data
sets were used to measure the Power, with an odds ratio of 2.0 and a
prevalence of 0.02. The Power is measured by the number of data sets, out
of all 100 data sets, where the ground truth was amongst the most relevant
results.
In Figure 2, the average running time, CPU usage, and memory usage are
displayed by the number of individuals in the data set, to evaluate the
scalability of the algorithm. The results show a slight increase in running
time when the algorithm is applied to larger data sets; in these results, the
increase is not very significant. CPU usage increases with the data set size,
with all data sets showing a CPU usage higher than 100%, which means that
more than one core was used for each data set. The memory usage results
show an increase of nearly 10 megabytes. This increase may be significant
for more complex data sets, but it is not as significant as the running time
increase or the CPU usage.
For the Type I Error Rate test, Figure 3 shows that the Type I Error Rate
is relatively small across all data sets, with outliers at an allele frequency of
0.1 and 2000 individuals. This is the only group of configurations that yields
a Type I Error Rate higher than 1%.
According to Figure 4, we can conclude that the number of individuals has a
large influence on the Power of the algorithm. This is also true for the allele
frequency. With a very small number of individuals, the Power is nearly 0.
3
[Figure 2: charts of (a) average running time (s), (b) average CPU usage (%), and (c) average memory usage (MB) by number of individuals (500, 1000, 2000).]
Figure 2: Comparison of scalability measures between different sized data
sets. This figure shows the average running time, CPU usage, and memory
usage for each data set size. The data sets have a minor allele frequency of
0.5, an odds ratio of 2.0, and a prevalence of 0.02.
The Power also increases with the frequency of the alleles carrying the ground
truth. In Figures 5 and 6, the influence of the odds ratio (through the
penetrance table of the disease) and of the prevalence of the disease is
inconclusive. There is an increase in Power for an odds ratio of 1.5, but it
decreases for an odds ratio of 2.0. The difference in prevalence does not
produce a very significant difference in Power. Figure 7 shows the Power by
frequency, independent of population size.
Overall, the algorithm shows very high accuracy in certain configurations
with optimal conditions, but also very low Power in many other configurations.
[Figure 3: bar chart of Type I Error Rate (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5) for 500, 1000, and 2000 individuals.]
Figure 3: Type I Error Rate by allele frequency and population size, with an
odds ratio of 2.0 and a prevalence of 0.02. The Type I Error Rate is measured
by the number of data sets, out of all 100 data sets, where false positives
were amongst the most relevant results.
4 Summary
In this lab note, the algorithm SNPRuler was presented and tested for
detecting the epistatic interactions that manifest complex diseases, using
generated data sets. The results obtained showed that the number of
individuals is important for epistasis detection, and that diseases whose
ground truths lie in high-frequency SNPs are easier to detect. The scalability
test revealed a significant increase in the use of computer resources and
running time as the number of individuals grows, which may have a
significant impact on data sets with a larger number of SNPs. The Type I
Error Rate results show a very low error rate in all configurations.
References
[WYY+ 10] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Nelson L S Tang,
and Weichuan Yu. Predictive rule inference for epistatic interaction detection in genome-wide association studies. Bioinformatics (Oxford, England), 26:30–37, 2010.
A Bar graphs
5
[Figure 4: bar chart of Power (%) by population (500, 1000, 2000).]
Figure 4: Distribution of the Power by population. The allele frequency is
0.1, the odds ratio is 2.0, and the prevalence is 0.02.
[Figure 5: bar chart of Power (%) by odds ratio (1.1, 1.5, 2.0).]
Figure 5: Distribution of the Power by odds ratio. The allele frequency is
0.1, the number of individuals is 2000, and the prevalence is 0.02.
6
[Figure 6: bar chart of Power (%) by prevalence (0.0001, 0.02).]
Figure 6: Distribution of the Power by prevalence. The allele frequency is
0.1, the number of individuals is 2000, and the odds ratio is 2.0.
[Figure 7: bar chart of Power (%) by allele frequency (0.01, 0.05, 0.1, 0.3, 0.5).]
Figure 7: Distribution of the Power by allele frequency. The number of
individuals is 2000, the odds ratio is 2.0, and the prevalence is 0.02.
7
B Table of Results
Table 2: A table containing the percentage of true positives and false positives in each configuration. The first column contains the description of the
configuration. The second and third columns contain the number of datasets
with true positives and false positives respectively, out of all 100 data sets
per configuration.
Configuration*
0.5,2000,I,2.0,0.02
0.3,2000,I,1.5,0.02
0.3,2000,I,1.5,0.0001
0.5,2000,I,1.5,0.02
0.5,1000,I,2.0,0.02
0.1,2000,I,1.5,0.02
0.5,2000,I,2.0,0.0001
0.5,1000,I,2.0,0.0001
0.05,2000,I,2.0,0.0001
0.3,2000,I,2.0,0.02
0.3,1000,I,1.5,0.02
0.5,1000,I,1.5,0.02
0.3,1000,I,2.0,0.0001
0.3,1000,I,2.0,0.02
0.05,2000,I,1.5,0.0001
0.5,2000,I,1.5,0.0001
0.1,2000,I,2.0,0.02
0.3,1000,I,1.5,0.0001
0.1,2000,I,2.0,0.0001
0.1,2000,I,1.5,0.0001
0.05,2000,I,1.5,0.02
0.3,2000,I,2.0,0.0001
0.1,1000,I,2.0,0.02
0.5,500,I,2.0,0.02
0.3,500,I,2.0,0.0001
0.5,2000,I,1.1,0.02
0.5,500,I,2.0,0.0001
0.5,1000,I,1.5,0.0001
0.3,500,I,2.0,0.02
0.5,2000,I,1.1,0.0001
0.3,500,I,1.5,0.0001
8
TP (%)
92
89
84
81
71
67
52
50
50
44
41
40
36
35
35
34
32
29
29
29
23
12
10
6
6
5
4
3
3
2
2
FP (%)
0
7
3
1
0
2
8
1
19
0
1
0
0
0
12
1
8
1
0
1
14
1
0
0
0
1
0
0
0
0
0
0.3,2000,I,1.1,0.0001
0.1,1000,I,1.5,0.02
0.05,1000,I,2.0,0.0001
0.5,500,I,1.5,0.02
0.3,2000,I,1.1,0.02
0.1,2000,I,1.1,0.0001
0.1,1000,I,2.0,0.0001
0.5,500,I,1.5,0.0001
0.5,500,I,1.1,0.02
0.5,500,I,1.1,0.0001
0.5,1000,I,1.1,0.02
0.5,1000,I,1.1,0.0001
0.3,500,I,1.5,0.02
0.3,500,I,1.1,0.02
0.3,500,I,1.1,0.0001
0.3,1000,I,1.1,0.02
0.3,1000,I,1.1,0.0001
0.1,500,I,2.0,0.02
0.1,500,I,2.0,0.0001
0.1,500,I,1.5,0.02
0.1,500,I,1.5,0.0001
0.1,500,I,1.1,0.02
0.1,500,I,1.1,0.0001
0.1,2000,I,1.1,0.02
0.1,1000,I,1.5,0.0001
0.1,1000,I,1.1,0.02
0.1,1000,I,1.1,0.0001
0.05,500,I,2.0,0.02
0.05,500,I,2.0,0.0001
0.05,500,I,1.5,0.02
0.05,500,I,1.5,0.0001
0.05,500,I,1.1,0.02
0.05,500,I,1.1,0.0001
0.05,2000,I,2.0,0.02
0.05,2000,I,1.1,0.02
0.05,2000,I,1.1,0.0001
0.05,1000,I,2.0,0.02
0.05,1000,I,1.5,0.02
9
2
2
2
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
1
0
0
0.05,1000,I,1.5,0.0001
0.05,1000,I,1.1,0.02
0.05,1000,I,1.1,0.0001
0.01,500,I,2.0,0.02
0.01,500,I,2.0,0.0001
0.01,500,I,1.5,0.02
0.01,500,I,1.5,0.0001
0.01,500,I,1.1,0.02
0.01,500,I,1.1,0.0001
0.01,2000,I,2.0,0.02
0.01,2000,I,2.0,0.0001
0.01,2000,I,1.5,0.02
0.01,2000,I,1.5,0.0001
0.01,2000,I,1.1,0.02
0.01,2000,I,1.1,0.0001
0.01,1000,I,2.0,0.02
0.01,1000,I,2.0,0.0001
0.01,1000,I,1.5,0.02
0.01,1000,I,1.5,0.0001
0.01,1000,I,1.1,0.02
0.01,1000,I,1.1,0.0001
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
1
0
0
0
0
0
1
*MAF,POP,MOD,OR,PREV where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the used model (with
or without main effect and with or without epistasis effect), OR is the odds
ratio and PREV is the prevalence of the disease.
Table 3: A table containing the running time, cpu usage and memory usage
in each configuration.
Configuration*
0.5,500,I,2.0,0.02
0.5,500,I,2.0,0.0001
0.5,500,I,1.5,0.02
0.5,500,I,1.5,0.0001
0.5,500,I,1.1,0.02
0.5,500,I,1.1,0.0001
0.5,2000,I,2.0,0.02
0.5,2000,I,2.0,0.0001
Running Time (s)
2.70
2.69
2.68
2.69
2.73
2.70
4.10
4.16
10
CPU Usage (%)
130.19
136.88
140.78
141.46
136.88
136.47
156.28
143.03
Memory Usage (KB)
320211.28
319311.36
319508.72
320285.24
320504.08
319897.04
327876.12
330393.48
0.5,2000,I,1.5,0.02
0.5,2000,I,1.5,0.0001
0.5,2000,I,1.1,0.02
0.5,2000,I,1.1,0.0001
0.5,1000,I,2.0,0.02
0.5,1000,I,2.0,0.0001
0.5,1000,I,1.5,0.02
0.5,1000,I,1.5,0.0001
0.5,1000,I,1.1,0.02
0.5,1000,I,1.1,0.0001
0.3,500,I,2.0,0.02
0.3,500,I,2.0,0.0001
0.3,500,I,1.5,0.02
0.3,500,I,1.5,0.0001
0.3,500,I,1.1,0.02
0.3,500,I,1.1,0.0001
0.3,2000,I,2.0,0.02
0.3,2000,I,2.0,0.0001
0.3,2000,I,1.5,0.02
0.3,2000,I,1.5,0.0001
0.3,2000,I,1.1,0.02
0.3,2000,I,1.1,0.0001
0.3,1000,I,2.0,0.02
0.3,1000,I,2.0,0.0001
0.3,1000,I,1.5,0.02
0.3,1000,I,1.5,0.0001
0.3,1000,I,1.1,0.02
0.3,1000,I,1.1,0.0001
0.1,500,I,2.0,0.02
0.1,500,I,2.0,0.0001
0.1,500,I,1.5,0.02
0.1,500,I,1.5,0.0001
0.1,500,I,1.1,0.02
0.1,500,I,1.1,0.0001
0.1,2000,I,2.0,0.02
0.1,2000,I,2.0,0.0001
0.1,2000,I,1.5,0.02
0.1,2000,I,1.5,0.0001
4.10
4.01
3.96
3.97
3.09
3.12
3.08
3.11
3.09
3.12
2.75
2.73
2.73
2.75
2.74
2.74
4.05
4.04
4.04
4.07
4.12
4.11
3.07
3.10
3.07
3.11
3.09
3.09
2.75
2.74
2.73
2.77
2.77
2.76
4.01
4.05
4.07
4.09
140.41
136.85
125.00
126.28
141.88
139.30
141.47
140.43
142.06
141.69
148.18
149.82
149.43
150.12
150.35
150.21
124.62
119.74
122.47
126.54
125.71
123.02
118.96
127.03
124.95
131.41
134.61
138.13
119.13
119.29
118.35
119.58
118.50
121.01
119.18
122.05
127.11
126.69
11
329206.28
327414.84
325492.92
325792.92
323600.36
324334.68
323865.08
323880.44
323780.88
323507.80
321318.64
319605.00
321487.72
320878.40
320952.24
319914.16
325950.12
325417.16
325669.04
326147.32
325679.80
325735.24
322399.76
323056.56
322673.52
323709.60
323485.68
323444.76
320066.32
319312.12
320222.28
319002.32
320626.68
320034.20
325869.52
325484.96
326038.04
326636.80
0.1,2000,I,1.1,0.02
0.1,2000,I,1.1,0.0001
0.1,1000,I,2.0,0.02
0.1,1000,I,2.0,0.0001
0.1,1000,I,1.5,0.02
0.1,1000,I,1.5,0.0001
0.1,1000,I,1.1,0.02
0.1,1000,I,1.1,0.0001
0.05,500,I,2.0,0.02
0.05,500,I,2.0,0.0001
0.05,500,I,1.5,0.02
0.05,500,I,1.5,0.0001
0.05,500,I,1.1,0.02
0.05,500,I,1.1,0.0001
0.05,2000,I,2.0,0.02
0.05,2000,I,2.0,0.0001
0.05,2000,I,1.5,0.02
0.05,2000,I,1.5,0.0001
0.05,2000,I,1.1,0.02
0.05,2000,I,1.1,0.0001
0.05,1000,I,2.0,0.02
0.05,1000,I,2.0,0.0001
0.05,1000,I,1.5,0.02
0.05,1000,I,1.5,0.0001
0.05,1000,I,1.1,0.02
0.05,1000,I,1.1,0.0001
0.01,500,I,2.0,0.02
0.01,500,I,2.0,0.0001
0.01,500,I,1.5,0.02
0.01,500,I,1.5,0.0001
0.01,500,I,1.1,0.02
0.01,500,I,1.1,0.0001
0.01,2000,I,2.0,0.02
0.01,2000,I,2.0,0.0001
0.01,2000,I,1.5,0.02
0.01,2000,I,1.5,0.0001
0.01,2000,I,1.1,0.02
0.01,2000,I,1.1,0.0001
4.10
4.12
3.13
3.13
3.12
3.14
3.14
3.14
2.73
2.76
2.73
2.76
2.75
2.77
3.85
3.93
3.87
3.88
3.99
3.94
2.94
3.00
3.00
3.02
3.00
3.02
2.63
2.65
2.64
2.75
2.76
2.72
3.99
4.03
4.06
4.02
4.00
4.04
127.66
126.83
128.79
128.00
126.40
125.43
126.95
126.27
135.34
139.71
131.66
139.02
137.41
132.74
128.39
135.36
144.42
137.91
131.54
131.49
147.28
149.84
146.13
143.14
143.31
146.23
154.11
150.07
126.83
129.40
129.15
182.19
130.97
129.72
126.38
110.41
121.24
128.06
12
326390.36
326720.76
323402.72
323800.64
323558.52
323584.04
323569.56
323193.08
319177.48
320980.88
320560.40
320381.20
320737.96
320620.16
325633.16
324273.96
326558.92
325713.84
325690.40
324629.08
323110.24
323443.36
323144.92
323136.72
323410.08
323356.00
320784.96
320432.16
320529.56
319814.80
320633.56
321332.20
325901.32
325971.00
325816.40
324423.52
325429.32
326333.80
0.01,1000,I,2.0,0.02
0.01,1000,I,2.0,0.0001
0.01,1000,I,1.5,0.02
0.01,1000,I,1.5,0.0001
0.01,1000,I,1.1,0.02
0.01,1000,I,1.1,0.0001
3.06
3.07
3.07
3.05
3.03
3.01
127.62
127.73
126.69
156.55
163.41
156.46
323421.92
323639.96
323483.56
325006.56
324945.28
320749.28
*MAF,POP,MOD,OR,PREV, where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the disease model used (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
Laboratory Note
Genetic Epistasis
VI - Assessing Algorithm SNPHarvester
LN-6-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr Roberto Frias, s/n,
4200-465 PORTO
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www : http://www.fe.up.pt/∼ei09045
[email protected]
www : http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
In this lab note, the algorithm SNPHarvester is presented and tested. The algorithm is a stochastic approach that searches for SNPs relevant to main effect and epistatic interactions, using a PathSeeker procedure and assessing the relevance of the results with the χ2 test. The results show that both the Power and the Type 1 Error Rate of the algorithm are high for main effect and full effect detection, while epistasis detection shows lower Power but very low error rates. The scalability test shows that running time may become a problem for larger data sets.
1
Introduction
SNPHarvester [YHW+ 09] is a stochastic algorithm that generates multiple paths among the SNPs and joins them into groups. Significant groups are selected if their scores are above a predetermined statistical threshold. The score function used to measure the association between a k-SNP group, where k is the number of SNPs in the epistatic interaction, and the phenotype is the χ2 test.
For this purpose, a PathSeeker algorithm was developed: it randomly starts a new path and, for each group, tries to increase the score by changing only one SNP in the active set at a time, converging to a local optimum, typically in two or three iterations. The evaluation is based on the χ2 value, with a significance threshold of α = 0.05 after Bonferroni correction.
A post-processing stage is applied to eliminate k-SNP groups that may be significant only due to a sub-group, as well as SNPs that may show a falsely strong association due to a small marginal effect. An L2 penalized logistic regression is used to filter out these interactions:
L(β0, β, λ) = −l(β0, β) + (λ/2) ‖β‖²    (1)
where l(β0, β) is the binomial log-likelihood and λ is a regularization parameter.
The difference between SNPHarvester and the other algorithms is that SNPHarvester focuses on local optima instead of a global optimum. Each local optimum is significant because there are usually multiple interaction patterns. SNPHarvester also uses sequential rather than parallel optimization, removing local optima found earlier from the search process, so the search space becomes smaller in later stages. SNPHarvester also uses a model-free approach, randomly creating paths to directly detect significant associations.
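As an illustration of the search described above, the sketch below scores a candidate SNP group with a χ2 test on its joint-genotype contingency table and greedily swaps one SNP of the active set at a time until no swap improves the score. This is only a simplified reconstruction; the function and variable names are ours and do not come from the SNPHarvester implementation.

import numpy as np
from scipy.stats import chi2_contingency

def chi2_score(genotypes, phenotype, group):
    """Chi-square statistic of the joint genotype of 'group' against a 0/1 phenotype."""
    joint = np.zeros(len(phenotype), dtype=int)
    for snp in group:                       # encode the multi-locus genotype as one category
        joint = joint * 3 + genotypes[:, snp]
    table = np.zeros((3 ** len(group), 2))
    for g, y in zip(joint, phenotype):
        table[g, y] += 1
    table = table[table.sum(axis=1) > 0]    # keep only observed genotype combinations
    return chi2_contingency(table)[0]

def path_seeker(genotypes, phenotype, k=2, seed=0):
    """Greedy local search: start from a random k-SNP group and swap one SNP at a time."""
    rng = np.random.default_rng(seed)
    n_snps = genotypes.shape[1]
    group = list(rng.choice(n_snps, size=k, replace=False))
    best = chi2_score(genotypes, phenotype, group)
    improved = True
    while improved:                          # typically converges in two or three sweeps
        improved = False
        for pos in range(k):
            for cand in range(n_snps):
                if cand in group:
                    continue
                trial = group[:pos] + [cand] + group[pos + 1:]
                score = chi2_score(genotypes, phenotype, trial)
                if score > best:
                    best, group, improved = score, trial, True
    return group, best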
1.1 Input files
The input of SNPHarvester consists of the names of each column in the first row, i.e. the SNPs, with the phenotype label as the last column. The following rows contain the genotypes, coded 0, 1, 2 for homozygous dominant, heterozygous, and homozygous recessive, respectively. The label is 0 for controls and 1 for cases.
X1,X2,X3,X4,Label
1, 1, 0, 2, 0
1, 1, 2, 1, 1
1, 2, 2, 0, 1
Table 1: An example of the input file containing genotype and phenotype
information with 4 SNPs and 3 individuals.
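A minimal sketch of writing a data set in this format; the file name and the in-memory representation of the genotypes are assumptions made for the example.

import csv

def write_snpharvester_input(path, genotypes, labels):
    """genotypes: list of per-individual lists coded 0/1/2; labels: 0 (control) / 1 (case)."""
    n_snps = len(genotypes[0])
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([f"X{i + 1}" for i in range(n_snps)] + ["Label"])
        for row, label in zip(genotypes, labels):
            writer.writerow(list(row) + [label])

# The 4-SNP, 3-individual example of Table 1.
write_snpharvester_input("example_input.csv",
                         [[1, 1, 0, 2], [1, 1, 2, 1], [1, 2, 2, 0]],
                         [0, 1, 1])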
1.2 Output files
The algorithm's output contains the extracted single SNPs or SNP interactions, together with the χ2 value of each interaction or single SNP, and the running time of the algorithm.
1.3 Parameters
There are two modes: the "Threshold-Based" mode, where the program outputs all of the significant SNPs above a user-specified significance threshold, and the "Top-K Based" mode, where the program outputs a specified number of SNP interactions regardless of their significance level. Both modes have parameters to choose the minimum and maximum number of interacting SNPs to be detected. If the minimum is 1, main effects of single SNPs are also tested. A small sketch of both selection rules is given below.
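The two output modes amount to a simple selection rule over the scored SNP groups. The sketch below assumes the Bonferroni-corrected χ2 cutoff mentioned in the introduction; it illustrates only the selection logic and is not SNPHarvester code.

from scipy.stats import chi2

def select_threshold(scored_groups, n_tests, dof, alpha=0.05):
    """'Threshold-Based' mode: keep groups above a Bonferroni-corrected chi-square cutoff."""
    cutoff = chi2.ppf(1 - alpha / n_tests, dof)
    return [(group, score) for group, score in scored_groups if score > cutoff]

def select_top_k(scored_groups, k):
    """'Top-K Based' mode: keep the k best-scoring groups regardless of significance."""
    return sorted(scored_groups, key=lambda gs: gs[1], reverse=True)[:k]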
2 Experimental Settings
The datasets used in the experiments are characterized in Lab Note 1. The computer used for these experiments runs the 64-bit Ubuntu 13.10 operating system, with an Intel(R) Core(TM)2 Quad CPU Q6600 2.40GHz processor and 8.00 GB of RAM.
SNPHarvester is provided as a Java program. The "Threshold-Based" mode was chosen for this analysis, with a significance level of α = 0.05. The Java heap size is set to -Xmx7000M. Main effects and pairwise interactions are tested in this experiment.
3 Results
SNPHarvester works in epistasis detection, main effect detection, and full effect detection. All data set configurations were used in this experiment.
Figure 1 shows the Power obtained in relation to allele frequency and population size for epistasis (a), main effect (b), and epistasis + main effect (c). The results show that the Power is higher in main effect detection than in epistasis detection overall, reaching 100% Power in data sets with 0.3 and 0.5 allele frequency. Epistasis detection shows much lower results, with 0.1 allele frequency and 2000 individuals giving the best Power for epistasis detection, at 85%. However, significant results can be seen with allele frequencies as low as 0.05, which is not true for main effect detection. In full effect detection, the results are very similar to main effect detection.
Regarding scalability, Figure 2 shows a very significant difference in running time (a) between the data sets with 500 individuals, running for an average of 9.29 seconds, and the data sets with 2000 individuals, with an average of 33 seconds. This somewhat linear growth, together with the slight increase in memory usage (c), reveals a scalability problem. The CPU usage is near 100% across all data set sizes.
The Type 1 Error Rates show a concerning increase in main effect and full effect data sets, in relation to epistasis detection. This disproportion is due to the ease of main effect detection, which reveals highly valued ground truths but also increases the chance of detecting false positives, even if their statistical significance is considerably lower than that of the ground truth. The Power is still higher than the Type 1 Error Rate in most cases, with the exception of high allele frequencies and large populations, which reveal error rates of 100%. This is not true for epistasis detection, which has a maximum error rate of 27% for data sets with 2000 individuals and an allele frequency of 0.05. The other configurations show a slight increase of the error rate with the data set population size. There is no clear Type 1 Error Rate difference between allele frequencies for epistasis detection.
The relation between Power, population size, and allele frequency is reinforced in Figures 4 and 7. However, the Power by allele frequency in epistasis detection shows a peak at 0.1 minor allele frequency and a decrease for higher allele frequencies. Figure 6 shows a slight but not significant increase of Power with prevalence for epistasis detection and a slight decrease for main and full effect detection. The linear increase in Power by odds ratio shown in Figure 5 is similar to the distribution of Power by population.
[Figure 1: bar charts of Power (%) by allele frequency for 500, 1000, and 2000 individuals; panels (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 1: Power by allele frequency. For each frequency, three sizes of data sets were used to measure the Power, with an odds ratio of 2.0 and prevalence of 0.02. The Power is measured by the number of data sets where the ground truth was amongst the most relevant results, out of all 100 data sets. (a), (b), and (c) represent epistatic, main effect, and main effect + epistatic interactions.
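The Power and Type 1 Error Rate reported in this and the following figures can be computed from the per-data-set outputs as sketched below; the data structure (one list of reported SNP pairs per data set plus the known ground-truth pair) is an assumption made for the example.

def power_and_type1_error(reported_pairs_per_dataset, ground_truth_pair):
    """Power: percentage of data sets whose output contains the ground-truth pair.
    Type 1 Error Rate: percentage of data sets whose output contains any other pair."""
    n = len(reported_pairs_per_dataset)
    true_hits = sum(ground_truth_pair in pairs for pairs in reported_pairs_per_dataset)
    false_hits = sum(any(pair != ground_truth_pair for pair in pairs)
                     for pairs in reported_pairs_per_dataset)
    return 100.0 * true_hits / n, 100.0 * false_hits / n

With 100 simulated data sets per configuration, the two returned values correspond directly to the percentages shown in the figures and in the tables of Appendix B.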
[Figure 2: plots of (a) average running time (seconds), (b) average CPU usage (%), and (c) average memory usage (Mbytes), by number of individuals.]
Figure 2: Comparison of scalability measures between different sized data sets. The data sets have a minor allele frequency of 0.5, 2.0 odds ratio, 0.02 prevalence, and use the full effect disease model.
[Figure 3: bar charts of Type 1 Error Rate (%) by allele frequency for 500, 1000, and 2000 individuals; panels (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 3: Type 1 Error Rate by allele frequency. For each frequency, three sizes of data sets were used to measure the Type 1 Error Rate, with an odds ratio of 2.0 and prevalence of 0.02. The Type 1 Error Rate is measured by the number of data sets where false positives were amongst the most relevant results, out of all 100 data sets. (a), (b), and (c) represent epistatic, main effect, and main effect + epistatic interactions.
4 Summary
In this experiment, the SNPHarvester algorithm was tested using many data sets with significantly different configurations. The results show that the algorithm has a high Power in main effect and full effect detection, but also a high Type 1 Error Rate. For epistasis, the Power is lower but the Type 1 Error Rate values are very low. There is a linear increase of Power with the number of individuals and with the odds ratio, and a significant increase with allele frequency. The algorithm shows scalability problems, due to the large increase in running time, which may be crucial in genome-wide studies.
References
[YHW+ 09] Can Yang, Zengyou He, Xiang Wan, Qiang Yang, Hong Xue,
and Weichuan Yu. SNPHarvester: a filtering-based approach
for detecting epistatic interactions in genome-wide association
studies. Bioinformatics (Oxford, England), 25:504–511, 2009.
A Bar Graphs
[Figure 4: bar charts of Power (%) by population size (500, 1000, 2000); panels (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 4: Distribution of the Power by population for all disease models.
The allele frequency is 0.1, the odds ratio is 2.0, and the prevalence is 0.02.
[Figure 5: bar charts of Power (%) by odds ratio (1.1, 1.5, 2.0); panels (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 5: Distribution of the Power by odds ratios for all disease models.
The allele frequency is 0.1, the population size is 2000 individuals, and the
prevalence is 0.02.
[Figure 6: bar charts of Power (%) by prevalence (0.0001, 0.02); panels (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 6: Distribution of the Power by prevalence for all disease models. The
allele frequency is 0.1, the odds ratio is 2.0, and the population size is 2000
individuals.
[Figure 7: bar charts of Power (%) by allele frequency; panels (a) Epistasis, (b) Main Effect, (c) Full Effect.]
Figure 7: Distribution of the Power by allele frequency for all disease models. The population size is 2000 individuals, the odds ratio is 2.0, and the
prevalence is 0.02.
B Table of Results
Table 2: A table containing the percentage of true positives and false positives for each configuration. The first column contains the description of the configuration. The second and third columns contain the number of data sets with true positives and false positives, respectively, out of all 100 data sets per configuration.
Configuration*
0.01,500,ME+I,2.0,0.02
0.01,500,ME+I,2.0,0.0001
0.01,500,ME+I,1.5,0.02
0.01,500,ME+I,1.5,0.0001
0.01,500,ME+I,1.1,0.02
0.01,500,ME+I,1.1,0.0001
0.01,500,ME,2.0,0.02
0.01,500,ME,2.0,0.0001
0.01,500,ME,1.5,0.02
0.01,500,ME,1.5,0.0001
0.01,500,ME,1.1,0.02
0.01,500,ME,1.1,0.0001
0.01,500,I,2.0,0.02
0.01,500,I,2.0,0.0001
0.01,500,I,1.5,0.02
0.01,500,I,1.5,0.0001
0.01,500,I,1.1,0.02
0.01,500,I,1.1,0.0001
0.01,2000,ME+I,2.0,0.02
0.01,2000,ME+I,2.0,0.0001
0.01,2000,ME+I,1.5,0.02
0.01,2000,ME+I,1.5,0.0001
0.01,2000,ME+I,1.1,0.02
0.01,2000,ME+I,1.1,0.0001
0.01,2000,ME,2.0,0.02
0.01,2000,ME,2.0,0.0001
0.01,2000,ME,1.5,0.02
0.01,2000,ME,1.5,0.0001
0.01,2000,ME,1.1,0.02
0.01,2000,ME,1.1,0.0001
0.01,2000,I,2.0,0.02
12
TP (%)
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
FP (%)
2
5
7
5
1
6
1
6
1
6
1
6
4
5
5
10
3
6
1
3
6
4
6
6
1
3
6
4
6
4
2
0.01,2000,I,2.0,0.0001
0.01,2000,I,1.5,0.02
0.01,2000,I,1.5,0.0001
0.01,2000,I,1.1,0.02
0.01,2000,I,1.1,0.0001
0.01,1000,ME+I,2.0,0.02
0.01,1000,ME+I,2.0,0.0001
0.01,1000,ME+I,1.5,0.02
0.01,1000,ME+I,1.5,0.0001
0.01,1000,ME+I,1.1,0.02
0.01,1000,ME+I,1.1,0.0001
0.01,1000,ME,2.0,0.02
0.01,1000,ME,2.0,0.0001
0.01,1000,ME,1.5,0.02
0.01,1000,ME,1.5,0.0001
0.01,1000,ME,1.1,0.02
0.01,1000,ME,1.1,0.0001
0.01,1000,I,2.0,0.02
0.01,1000,I,2.0,0.0001
0.01,1000,I,1.5,0.02
0.01,1000,I,1.5,0.0001
0.01,1000,I,1.1,0.02
0.01,1000,I,1.1,0.0001
0.05,500,ME+I,2.0,0.02
0.05,500,ME+I,2.0,0.0001
0.05,500,ME+I,1.5,0.02
0.05,500,ME+I,1.5,0.0001
0.05,500,ME+I,1.1,0.02
0.05,500,ME+I,1.1,0.0001
0.05,500,ME,2.0,0.02
0.05,500,ME,2.0,0.0001
0.05,500,ME,1.5,0.02
0.05,500,ME,1.5,0.0001
0.05,500,ME,1.1,0.02
0.05,500,ME,1.1,0.0001
0.05,500,I,2.0,0.02
0.05,500,I,2.0,0.0001
0.05,500,I,1.5,0.02
13
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
9
8
12
7
6
4
3
10
5
3
5
10
5
11
5
3
5
4
4
4
5
6
5
8
8
4
5
3
7
5
7
5
6
7
4
4
6
5
0.05,500,I,1.5,0.0001
0.05,500,I,1.1,0.02
0.05,500,I,1.1,0.0001
0.05,2000,ME+I,2.0,0.02
0.05,2000,ME+I,2.0,0.0001
0.05,2000,ME+I,1.5,0.02
0.05,2000,ME+I,1.5,0.0001
0.05,2000,ME+I,1.1,0.02
0.05,2000,ME+I,1.1,0.0001
0.05,2000,ME,2.0,0.02
0.05,2000,ME,2.0,0.0001
0.05,2000,ME,1.5,0.02
0.05,2000,ME,1.5,0.0001
0.05,2000,ME,1.1,0.02
0.05,2000,ME,1.1,0.0001
0.05,2000,I,2.0,0.02
0.05,2000,I,2.0,0.0001
0.05,2000,I,1.5,0.02
0.05,2000,I,1.5,0.0001
0.05,2000,I,1.1,0.02
0.05,2000,I,1.1,0.0001
0.05,1000,ME+I,2.0,0.02
0.05,1000,ME+I,2.0,0.0001
0.05,1000,ME+I,1.5,0.02
0.05,1000,ME+I,1.5,0.0001
0.05,1000,ME+I,1.1,0.02
0.05,1000,ME+I,1.1,0.0001
0.05,1000,ME,2.0,0.02
0.05,1000,ME,2.0,0.0001
0.05,1000,ME,1.5,0.02
0.05,1000,ME,1.5,0.0001
0.05,1000,ME,1.1,0.02
0.05,1000,ME,1.1,0.0001
0.05,1000,I,2.0,0.02
0.05,1000,I,2.0,0.0001
0.05,1000,I,1.5,0.02
0.05,1000,I,1.5,0.0001
0.05,1000,I,1.1,0.02
14
0
0
0
0
26
0
1
0
0
1
9
0
0
0
0
18
45
39
40
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
2
0
1
0
11
5
3
20
42
13
16
4
9
24
30
5
15
6
8
27
5
9
9
5
6
8
21
1
7
4
5
4
26
2
6
7
4
13
5
4
4
3
0.05,1000,I,1.1,0.0001
0.1,500,ME+I,2.0,0.02
0.1,500,ME+I,2.0,0.0001
0.1,500,ME+I,1.5,0.02
0.1,500,ME+I,1.5,0.0001
0.1,500,ME+I,1.1,0.02
0.1,500,ME+I,1.1,0.0001
0.1,500,ME,2.0,0.02
0.1,500,ME,2.0,0.0001
0.1,500,ME,1.5,0.02
0.1,500,ME,1.5,0.0001
0.1,500,ME,1.1,0.02
0.1,500,ME,1.1,0.0001
0.1,500,I,2.0,0.02
0.1,500,I,2.0,0.0001
0.1,500,I,1.5,0.02
0.1,500,I,1.5,0.0001
0.1,500,I,1.1,0.02
0.1,500,I,1.1,0.0001
0.1,2000,ME+I,2.0,0.02
0.1,2000,ME+I,2.0,0.0001
0.1,2000,ME+I,1.5,0.02
0.1,2000,ME+I,1.5,0.0001
0.1,2000,ME+I,1.1,0.02
0.1,2000,ME+I,1.1,0.0001
0.1,2000,ME,2.0,0.02
0.1,2000,ME,2.0,0.0001
0.1,2000,ME,1.5,0.02
0.1,2000,ME,1.5,0.0001
0.1,2000,ME,1.1,0.02
0.1,2000,ME,1.1,0.0001
0.1,2000,I,2.0,0.02
0.1,2000,I,2.0,0.0001
0.1,2000,I,1.5,0.02
0.1,2000,I,1.5,0.0001
0.1,2000,I,1.1,0.02
0.1,2000,I,1.1,0.0001
0.1,1000,ME+I,2.0,0.02
15
0
0
41
0
1
0
1
0
13
0
0
0
0
0
1
0
0
0
0
95
100
66
61
3
3
92
99
25
48
2
1
85
74
41
23
0
2
32
11
9
38
4
9
3
4
11
20
7
7
6
6
7
3
5
1
5
7
79
99
40
48
8
17
79
88
20
41
11
7
19
11
9
6
7
9
27
0.1,1000,ME+I,2.0,0.0001
0.1,1000,ME+I,1.5,0.02
0.1,1000,ME+I,1.5,0.0001
0.1,1000,ME+I,1.1,0.02
0.1,1000,ME+I,1.1,0.0001
0.1,1000,ME,2.0,0.02
0.1,1000,ME,2.0,0.0001
0.1,1000,ME,1.5,0.02
0.1,1000,ME,1.5,0.0001
0.1,1000,ME,1.1,0.02
0.1,1000,ME,1.1,0.0001
0.1,1000,I,2.0,0.02
0.1,1000,I,2.0,0.0001
0.1,1000,I,1.5,0.02
0.1,1000,I,1.5,0.0001
0.1,1000,I,1.1,0.02
0.1,1000,I,1.1,0.0001
0.3,500,ME+I,2.0,0.02
0.3,500,ME+I,2.0,0.0001
0.3,500,ME+I,1.5,0.02
0.3,500,ME+I,1.5,0.0001
0.3,500,ME+I,1.1,0.02
0.3,500,ME+I,1.1,0.0001
0.3,500,ME,2.0,0.02
0.3,500,ME,2.0,0.0001
0.3,500,ME,1.5,0.02
0.3,500,ME,1.5,0.0001
0.3,500,ME,1.1,0.02
0.3,500,ME,1.1,0.0001
0.3,500,I,2.0,0.02
0.3,500,I,2.0,0.0001
0.3,500,I,1.5,0.02
0.3,500,I,1.5,0.0001
0.3,500,I,1.1,0.02
0.3,500,I,1.1,0.0001
0.3,2000,ME+I,2.0,0.02
0.3,2000,ME+I,2.0,0.0001
0.3,2000,ME+I,1.5,0.02
16
97
1
13
0
0
38
59
2
6
0
0
21
9
1
1
0
0
100
100
100
100
77
93
100
100
89
89
25
25
4
20
1
3
0
0
100
100
100
74
11
12
7
8
22
43
9
12
7
12
9
9
2
4
6
12
100
100
75
96
21
28
78
89
27
37
13
9
3
8
3
6
3
6
100
100
100
0.3,2000,ME+I,1.5,0.0001
0.3,2000,ME+I,1.1,0.02
0.3,2000,ME+I,1.1,0.0001
0.3,2000,ME,2.0,0.02
0.3,2000,ME,2.0,0.0001
0.3,2000,ME,1.5,0.02
0.3,2000,ME,1.5,0.0001
0.3,2000,ME,1.1,0.02
0.3,2000,ME,1.1,0.0001
0.3,2000,I,2.0,0.02
0.3,2000,I,2.0,0.0001
0.3,2000,I,1.5,0.02
0.3,2000,I,1.5,0.0001
0.3,2000,I,1.1,0.02
0.3,2000,I,1.1,0.0001
0.3,1000,ME+I,2.0,0.02
0.3,1000,ME+I,2.0,0.0001
0.3,1000,ME+I,1.5,0.02
0.3,1000,ME+I,1.5,0.0001
0.3,1000,ME+I,1.1,0.02
0.3,1000,ME+I,1.1,0.0001
0.3,1000,ME,2.0,0.02
0.3,1000,ME,2.0,0.0001
0.3,1000,ME,1.5,0.02
0.3,1000,ME,1.5,0.0001
0.3,1000,ME,1.1,0.02
0.3,1000,ME,1.1,0.0001
0.3,1000,I,2.0,0.02
0.3,1000,I,2.0,0.0001
0.3,1000,I,1.5,0.02
0.3,1000,I,1.5,0.0001
0.3,1000,I,1.1,0.02
0.3,1000,I,1.1,0.0001
0.5,500,ME+I,2.0,0.02
0.5,500,ME+I,2.0,0.0001
0.5,500,ME+I,1.5,0.02
0.5,500,ME+I,1.5,0.0001
0.5,500,ME+I,1.1,0.02
17
100
100
100
100
100
100
100
100
100
70
73
58
53
1
1
100
100
100
100
100
100
100
100
100
100
93
84
43
79
30
27
0
0
100
100
100
100
100
100
99
100
100
100
100
100
67
62
11
20
8
7
5
8
100
100
99
100
66
69
99
100
78
75
30
33
9
9
3
9
4
5
100
100
100
100
79
0.5,500,ME+I,1.1,0.0001
0.5,500,ME,2.0,0.02
0.5,500,ME,2.0,0.0001
0.5,500,ME,1.5,0.02
0.5,500,ME,1.5,0.0001
0.5,500,ME,1.1,0.02
0.5,500,ME,1.1,0.0001
0.5,500,I,2.0,0.02
0.5,500,I,2.0,0.0001
0.5,500,I,1.5,0.02
0.5,500,I,1.5,0.0001
0.5,500,I,1.1,0.02
0.5,500,I,1.1,0.0001
0.5,2000,ME+I,2.0,0.02
0.5,2000,ME+I,2.0,0.0001
0.5,2000,ME+I,1.5,0.02
0.5,2000,ME+I,1.5,0.0001
0.5,2000,ME+I,1.1,0.02
0.5,2000,ME+I,1.1,0.0001
0.5,2000,ME,2.0,0.02
0.5,2000,ME,2.0,0.0001
0.5,2000,ME,1.5,0.02
0.5,2000,ME,1.5,0.0001
0.5,2000,ME,1.1,0.02
0.5,2000,ME,1.1,0.0001
0.5,2000,I,2.0,0.02
0.5,2000,I,2.0,0.0001
0.5,2000,I,1.5,0.02
0.5,2000,I,1.5,0.0001
0.5,2000,I,1.1,0.02
0.5,2000,I,1.1,0.0001
0.5,1000,ME+I,2.0,0.02
0.5,1000,ME+I,2.0,0.0001
0.5,1000,ME+I,1.5,0.02
0.5,1000,ME+I,1.5,0.0001
0.5,1000,ME+I,1.1,0.02
0.5,1000,ME+I,1.1,0.0001
0.5,1000,ME,2.0,0.02
18
100
100
100
100
100
80
79
2
4
1
0
0
0
100
99
100
100
100
100
100
100
100
100
100
100
33
78
65
21
7
2
100
100
100
100
100
100
100
89
99
97
63
62
27
28
4
5
4
7
1
8
100
99
100
100
100
100
100
100
100
100
100
98
5
9
2
2
8
7
100
100
100
100
100
100
100
0.5,1000,ME,2.0,0.0001
0.5,1000,ME,1.5,0.02
0.5,1000,ME,1.5,0.0001
0.5,1000,ME,1.1,0.02
0.5,1000,ME,1.1,0.0001
0.5,1000,I,2.0,0.02
0.5,1000,I,2.0,0.0001
0.5,1000,I,1.5,0.02
0.5,1000,I,1.5,0.0001
0.5,1000,I,1.1,0.02
0.5,1000,I,1.1,0.0001
100
100
100
100
100
14
52
28
1
0
0
100
100
97
64
69
3
6
5
5
3
9
*MAF,POP,MOD,OR,PREV, where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the disease model used (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
Table 3: A table containing the running time, CPU usage, and memory usage for each configuration.
Configuration*
0.5,500,ME+I,2.0,0.02
0.5,500,ME+I,2.0,0.0001
0.5,500,ME+I,1.5,0.02
0.5,500,ME+I,1.5,0.0001
0.5,500,ME+I,1.1,0.02
0.5,500,ME+I,1.1,0.0001
0.5,500,ME,2.0,0.02
0.5,500,ME,2.0,0.0001
0.5,500,ME,1.5,0.02
0.5,500,ME,1.5,0.0001
0.5,500,ME,1.1,0.02
0.5,500,ME,1.1,0.0001
0.5,500,I,2.0,0.02
0.5,500,I,2.0,0.0001
0.5,500,I,1.5,0.02
0.5,500,I,1.5,0.0001
0.5,500,I,1.1,0.02
0.5,500,I,1.1,0.0001
Running Time (s)
0
0
0.07
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
19
CPU Usage (%)
75.44
76.83
73.91
74.17
78.78
76.66
77.81
77.69
78.18
76.23
80.54
78.98
76.74
75.78
77.19
78.27
78.71
78.75
Memory Usage (KB)
8975.12
8975.00
9593.72
8975.28
8975.20
8974.80
8974.84
8975.60
8975.36
8975.16
8975.24
8975.04
8975.16
8974.76
8975.40
8974.64
8975.20
8975.00
0.5,2000,ME+I,2.0,0.02
0.5,2000,ME+I,2.0,0.0001
0.5,2000,ME+I,1.5,0.02
0.5,2000,ME+I,1.5,0.0001
0.5,2000,ME+I,1.1,0.02
0.5,2000,ME+I,1.1,0.0001
0.5,2000,ME,2.0,0.02
0.5,2000,ME,2.0,0.0001
0.5,2000,ME,1.5,0.02
0.5,2000,ME,1.5,0.0001
0.5,2000,ME,1.1,0.02
0.5,2000,ME,1.1,0.0001
0.5,2000,I,2.0,0.02
0.5,2000,I,2.0,0.0001
0.5,2000,I,1.5,0.02
0.5,2000,I,1.5,0.0001
0.5,2000,I,1.1,0.02
0.5,2000,I,1.1,0.0001
0.5,1000,ME+I,2.0,0.02
0.5,1000,ME+I,2.0,0.0001
0.5,1000,ME+I,1.5,0.02
0.5,1000,ME+I,1.5,0.0001
0.5,1000,ME+I,1.1,0.02
0.5,1000,ME+I,1.1,0.0001
0.5,1000,ME,2.0,0.02
0.5,1000,ME,2.0,0.0001
0.5,1000,ME,1.5,0.02
0.5,1000,ME,1.5,0.0001
0.5,1000,ME,1.1,0.02
0.5,1000,ME,1.1,0.0001
0.5,1000,I,2.0,0.02
0.5,1000,I,2.0,0.0001
0.5,1000,I,1.5,0.02
0.5,1000,I,1.5,0.0001
0.5,1000,I,1.1,0.02
0.5,1000,I,1.1,0.0001
0.3,500,ME+I,2.0,0.02
0.3,500,ME+I,2.0,0.0001
1.03
35.91
45.54
47.30
53.11
51.93
54.63
54.31
44.44
39.89
20.10
18.37
13.30
14.32
14.52
12.63
12.16
11.82
25.89
26.33
26.16
25.22
14.21
12.89
19.63
17.44
10.19
9.25
6.62
6.88
6.60
7.38
7.03
6.38
6.47
6.54
6.26
12.21
20
84.06
100.13
99.68
99.04
99.33
98.62
100.05
99.80
101.10
101.33
100.50
101.77
101.99
101.73
102.26
102.34
102.54
101.52
86.51
89.30
98.22
100.91
103.72
104.18
102.59
103.13
105.15
105.96
107.41
109.39
108.50
108.82
100.62
101.91
102.15
102.20
104.90
102.21
11814.04
76053.20
77689.64
78211.72
78391.92
79691.56
77422.96
79153.36
76040.16
75383.84
72422.88
72461.68
71876.44
70963.00
70528.40
71372.92
73187.88
71635.04
73035.92
73462.04
73189.12
73075.68
71475.92
71463.92
72507.32
72036.12
70972.68
70377.84
69163.76
69143.72
69107.64
68815.08
68200.44
67529.52
68962.80
68464.08
68340.96
70501.56
0.3,500,ME+I,1.5,0.02
0.3,500,ME+I,1.5,0.0001
0.3,500,ME+I,1.1,0.02
0.3,500,ME+I,1.1,0.0001
0.3,500,ME,2.0,0.02
0.3,500,ME,2.0,0.0001
0.3,500,ME,1.5,0.02
0.3,500,ME,1.5,0.0001
0.3,500,ME,1.1,0.02
0.3,500,ME,1.1,0.0001
0.3,500,I,2.0,0.02
0.3,500,I,2.0,0.0001
0.3,500,I,1.5,0.02
0.3,500,I,1.5,0.0001
0.3,500,I,1.1,0.02
0.3,500,I,1.1,0.0001
0.3,2000,ME+I,2.0,0.02
0.3,2000,ME+I,2.0,0.0001
0.3,2000,ME+I,1.5,0.02
0.3,2000,ME+I,1.5,0.0001
0.3,2000,ME+I,1.1,0.02
0.3,2000,ME+I,1.1,0.0001
0.3,2000,ME,2.0,0.02
0.3,2000,ME,2.0,0.0001
0.3,2000,ME,1.5,0.02
0.3,2000,ME,1.5,0.0001
0.3,2000,ME,1.1,0.02
0.3,2000,ME,1.1,0.0001
0.3,2000,I,2.0,0.02
0.3,2000,I,2.0,0.0001
0.3,2000,I,1.5,0.02
0.3,2000,I,1.5,0.0001
0.3,2000,I,1.1,0.02
0.3,2000,I,1.1,0.0001
0.3,1000,ME+I,2.0,0.02
0.3,1000,ME+I,2.0,0.0001
0.3,1000,ME+I,1.5,0.02
0.3,1000,ME+I,1.5,0.0001
4.13
5.38
3.74
3.82
4.08
4.43
3.73
3.80
3.67
3.73
3.79
3.90
3.68
3.70
3.67
3.75
52.54
33.90
50.68
54.60
18.08
22.20
47.70
51.00
23.96
23.65
13.11
12.94
14.49
13.89
14.63
14.30
11.88
12.30
24.07
25.39
18.57
24.64
21
106.37
105.98
105.94
105.86
106.11
106.19
106.90
109.81
109.74
105.00
102.17
105.07
105.09
100.02
105.56
103.87
97.45
101.81
100.31
101.14
104.45
103.66
101.73
98.52
98.85
99.04
96.47
100.18
99.58
100.05
102.00
99.79
100.25
100.42
99.29
99.83
98.83
98.82
66912.32
68096.64
65277.00
65443.96
65608.68
67463.80
65098.52
65059.36
64770.44
65267.44
65143.80
65876.56
65198.44
65044.64
64918.80
65366.36
78867.88
75939.52
76332.04
75759.68
71770.00
72230.52
76081.80
77481.44
73314.16
72820.92
71377.16
72271.40
70987.12
70761.04
71405.12
71175.52
72587.36
72231.52
72695.04
72831.76
71546.72
72522.36
0.3,1000,ME+I,1.1,0.02
0.3,1000,ME+I,1.1,0.0001
0.3,1000,ME,2.0,0.02
0.3,1000,ME,2.0,0.0001
0.3,1000,ME,1.5,0.02
0.3,1000,ME,1.5,0.0001
0.3,1000,ME,1.1,0.02
0.3,1000,ME,1.1,0.0001
0.3,1000,I,2.0,0.02
0.3,1000,I,2.0,0.0001
0.3,1000,I,1.5,0.02
0.3,1000,I,1.5,0.0001
0.3,1000,I,1.1,0.02
0.3,1000,I,1.1,0.0001
0.1,500,ME+I,2.0,0.02
0.1,500,ME+I,2.0,0.0001
0.1,500,ME+I,1.5,0.02
0.1,500,ME+I,1.5,0.0001
0.1,500,ME+I,1.1,0.02
0.1,500,ME+I,1.1,0.0001
0.1,500,ME,2.0,0.02
0.1,500,ME,2.0,0.0001
0.1,500,ME,1.5,0.02
0.1,500,ME,1.5,0.0001
0.1,500,ME,1.1,0.02
0.1,500,ME,1.1,0.0001
0.1,500,I,2.0,0.02
0.1,500,I,2.0,0.0001
0.1,500,I,1.5,0.02
0.1,500,I,1.5,0.0001
0.1,500,I,1.1,0.02
0.1,500,I,1.1,0.0001
0.1,2000,ME+I,2.0,0.02
0.1,2000,ME+I,2.0,0.0001
0.1,2000,ME+I,1.5,0.02
0.1,2000,ME+I,1.5,0.0001
0.1,2000,ME+I,1.1,0.02
0.1,2000,ME+I,1.1,0.0001
11.58
11.98
16.77
19.44
12.06
12.44
11.00
11.27
12.35
13.31
12.28
12.17
11.20
11.12
6.07
6.13
6.06
6.12
6.13
6.14
6.12
6.15
6.16
6.16
6.12
6.13
6.11
6.11
6.16
6.08
6.11
6.15
24.39
35.63
21.49
22.66
20.98
21.20
22
98.75
98.73
98.95
98.35
98.94
98.89
98.78
98.53
98.54
98.89
98.94
98.91
98.96
98.62
99.82
99.08
99.41
98.96
99.57
99.51
99.33
99.98
99.87
99.11
99.47
99.55
99.21
98.95
98.99
99.29
98.84
99.60
98.80
98.93
98.70
98.90
98.79
98.72
71113.48
71254.12
71090.84
71731.32
71088.80
70971.28
71013.08
70955.92
71127.96
70009.72
71714.48
71538.40
71779.92
71854.36
70315.16
69049.48
70168.84
69809.04
70186.84
70018.72
70033.16
69367.68
70112.36
70216.16
70135.28
70127.36
70007.60
70187.96
70377.76
70226.68
70228.80
70123.92
72612.80
74894.12
72437.88
72547.36
73179.28
73060.00
0.1,2000,ME,2.0,0.02
0.1,2000,ME,2.0,0.0001
0.1,2000,ME,1.5,0.02
0.1,2000,ME,1.5,0.0001
0.1,2000,ME,1.1,0.02
0.1,2000,ME,1.1,0.0001
0.1,2000,I,2.0,0.02
0.1,2000,I,2.0,0.0001
0.1,2000,I,1.5,0.02
0.1,2000,I,1.5,0.0001
0.1,2000,I,1.1,0.02
0.1,2000,I,1.1,0.0001
0.1,1000,ME+I,2.0,0.02
0.1,1000,ME+I,2.0,0.0001
0.1,1000,ME+I,1.5,0.02
0.1,1000,ME+I,1.5,0.0001
0.1,1000,ME+I,1.1,0.02
0.1,1000,ME+I,1.1,0.0001
0.1,1000,ME,2.0,0.02
0.1,1000,ME,2.0,0.0001
0.1,1000,ME,1.5,0.02
0.1,1000,ME,1.5,0.0001
0.1,1000,ME,1.1,0.02
0.1,1000,ME,1.1,0.0001
0.1,1000,I,2.0,0.02
0.1,1000,I,2.0,0.0001
0.1,1000,I,1.5,0.02
0.1,1000,I,1.5,0.0001
0.1,1000,I,1.1,0.02
0.1,1000,I,1.1,0.0001
0.05,500,ME+I,2.0,0.02
0.05,500,ME+I,2.0,0.0001
0.05,500,ME+I,1.5,0.02
0.05,500,ME+I,1.5,0.0001
0.05,500,ME+I,1.1,0.02
0.05,500,ME+I,1.1,0.0001
0.05,500,ME,2.0,0.02
0.05,500,ME,2.0,0.0001
23.49
27.55
20.96
22.40
20.94
21.46
24.99
25.33
24.72
23.07
21.07
21.58
11.19
13.23
11.13
11.32
11.01
11.26
11.10
11.59
11.11
11.11
11.00
11.18
11.52
11.29
11.12
11.12
11.11
11.16
6.23
6.14
6.13
6.15
6.20
6.13
6.19
6.13
23
98.81
98.84
98.65
98.84
98.07
99.20
98.67
98.95
99.68
99.88
101.62
101.93
104.18
103.98
104.34
104.25
104.52
104.54
104.39
104.16
104.37
104.28
104.53
104.47
104.40
104.52
104.56
104.58
104.71
104.38
108.59
108.27
108.57
108.44
108.71
108.78
108.97
108.19
72854.64
73112.76
72501.96
72125.24
73506.48
73131.60
71017.84
71315.96
72858.24
72926.08
73633.20
73469.88
71543.48
71146.12
71897.60
71671.52
72502.08
72783.76
71333.52
71271.44
71968.00
71837.24
72532.48
72076.52
71641.20
72135.84
72310.20
72486.12
72571.52
72365.84
69799.12
69907.08
70177.28
70259.12
70280.92
69942.24
70123.56
69865.56
0.05,500,ME,1.5,0.02
0.05,500,ME,1.5,0.0001
0.05,500,ME,1.1,0.02
0.05,500,ME,1.1,0.0001
0.05,500,I,2.0,0.02
0.05,500,I,2.0,0.0001
0.05,500,I,1.5,0.02
0.05,500,I,1.5,0.0001
0.05,500,I,1.1,0.02
0.05,500,I,1.1,0.0001
0.05,2000,ME+I,2.0,0.02
0.05,2000,ME+I,2.0,0.0001
0.05,2000,ME+I,1.5,0.02
0.05,2000,ME+I,1.5,0.0001
0.05,2000,ME+I,1.1,0.02
0.05,2000,ME+I,1.1,0.0001
0.05,2000,ME,2.0,0.02
0.05,2000,ME,2.0,0.0001
0.05,2000,ME,1.5,0.02
0.05,2000,ME,1.5,0.0001
0.05,2000,ME,1.1,0.02
0.05,2000,ME,1.1,0.0001
0.05,2000,I,2.0,0.02
0.05,2000,I,2.0,0.0001
0.05,2000,I,1.5,0.02
0.05,2000,I,1.5,0.0001
0.05,2000,I,1.1,0.02
0.05,2000,I,1.1,0.0001
0.05,1000,ME+I,2.0,0.02
0.05,1000,ME+I,2.0,0.0001
0.05,1000,ME+I,1.5,0.02
0.05,1000,ME+I,1.5,0.0001
0.05,1000,ME+I,1.1,0.02
0.05,1000,ME+I,1.1,0.0001
0.05,1000,ME,2.0,0.02
0.05,1000,ME,2.0,0.0001
0.05,1000,ME,1.5,0.02
0.05,1000,ME,1.5,0.0001
6.14
6.14
6.12
6.11
6.14
6.16
6.09
6.18
6.22
6.21
21.74
22.90
21.44
21.35
21.32
21.79
21.33
22.45
22.20
21.67
21.64
21.36
21.75
24.67
23.76
24.41
21.99
21.83
11.21
11.22
11.07
11.12
10.98
11.04
10.99
11.11
10.91
11.01
24
109.00
108.88
108.32
108.61
108.80
108.74
108.90
108.73
107.68
101.49
96.81
92.35
97.61
97.65
100.62
100.34
102.42
100.06
97.06
101.75
102.07
102.47
101.65
99.33
99.50
99.04
98.53
99.55
102.98
102.77
103.08
101.16
103.88
105.79
105.88
105.82
105.93
105.99
70221.36
69994.16
69700.40
70141.16
70028.96
70185.16
69967.76
69698.04
69844.80
69872.32
72821.72
72652.16
73224.84
73010.56
73583.24
73084.72
72800.80
72805.60
73334.84
73099.68
73360.48
73394.80
72672.52
72954.52
72027.12
72224.64
73594.44
73442.16
71934.28
71696.96
72868.56
72054.76
72243.88
71986.88
72255.96
71831.64
72275.60
72655.04
0.05,1000,ME,1.1,0.02
0.05,1000,ME,1.1,0.0001
0.05,1000,I,2.0,0.02
0.05,1000,I,2.0,0.0001
0.05,1000,I,1.5,0.02
0.05,1000,I,1.5,0.0001
0.05,1000,I,1.1,0.02
0.05,1000,I,1.1,0.0001
0.01,500,ME+I,2.0,0.02
0.01,500,ME+I,2.0,0.0001
0.01,500,ME+I,1.5,0.02
0.01,500,ME+I,1.5,0.0001
0.01,500,ME+I,1.1,0.02
0.01,500,ME+I,1.1,0.0001
0.01,500,ME,2.0,0.02
0.01,500,ME,2.0,0.0001
0.01,500,ME,1.5,0.02
0.01,500,ME,1.5,0.0001
0.01,500,ME,1.1,0.02
0.01,500,ME,1.1,0.0001
0.01,500,I,2.0,0.02
0.01,500,I,2.0,0.0001
0.01,500,I,1.5,0.02
0.01,500,I,1.5,0.0001
0.01,500,I,1.1,0.02
0.01,500,I,1.1,0.0001
0.01,2000,ME+I,2.0,0.02
0.01,2000,ME+I,2.0,0.0001
0.01,2000,ME+I,1.5,0.02
0.01,2000,ME+I,1.5,0.0001
0.01,2000,ME+I,1.1,0.02
0.01,2000,ME+I,1.1,0.0001
0.01,2000,ME,2.0,0.02
0.01,2000,ME,2.0,0.0001
0.01,2000,ME,1.5,0.02
0.01,2000,ME,1.5,0.0001
0.01,2000,ME,1.1,0.02
0.01,2000,ME,1.1,0.0001
10.91
10.88
10.89
10.98
11.06
11.02
11.07
11.17
6.10
6.19
6.23
6.19
6.21
6.18
6.17
6.20
6.20
6.20
6.21
6.24
6.23
6.17
6.25
6.18
6.30
6.27
21.61
21.65
21.57
21.52
21.20
21.51
21.70
21.51
21.44
21.44
21.53
21.16
25
106.09
106.08
105.87
105.90
105.79
106.11
105.81
105.82
110.76
100.90
88.83
92.12
95.10
94.75
100.01
98.25
96.74
97.67
100.55
91.17
102.47
103.53
101.20
101.27
99.65
99.71
99.05
98.32
99.21
98.94
99.76
96.31
92.67
99.72
99.98
99.25
97.16
101.37
72631.16
72100.20
72380.60
72264.56
72350.44
72392.64
72310.56
72267.92
70623.48
70173.60
70297.80
70435.84
70178.08
69908.24
70107.48
70174.40
70149.52
69945.92
70039.24
69905.40
70315.16
70046.48
69696.88
69886.68
70166.36
70019.80
73708.60
73273.96
73551.20
73723.64
73592.64
73391.08
73853.88
73350.92
73758.00
73351.20
73346.52
73605.40
0.01,2000,I,2.0,0.02
0.01,2000,I,2.0,0.0001
0.01,2000,I,1.5,0.02
0.01,2000,I,1.5,0.0001
0.01,2000,I,1.1,0.02
0.01,2000,I,1.1,0.0001
0.01,1000,ME+I,2.0,0.02
0.01,1000,ME+I,2.0,0.0001
0.01,1000,ME+I,1.5,0.02
0.01,1000,ME+I,1.5,0.0001
0.01,1000,ME+I,1.1,0.02
0.01,1000,ME+I,1.1,0.0001
0.01,1000,ME,2.0,0.02
0.01,1000,ME,2.0,0.0001
0.01,1000,ME,1.5,0.02
0.01,1000,ME,1.5,0.0001
0.01,1000,ME,1.1,0.02
0.01,1000,ME,1.1,0.0001
0.01,1000,I,2.0,0.02
0.01,1000,I,2.0,0.0001
0.01,1000,I,1.5,0.02
0.01,1000,I,1.5,0.0001
0.01,1000,I,1.1,0.02
0.01,1000,I,1.1,0.0001
21.07
21.55
21.19
21.60
21.43
21.91
11.24
11.18
11.29
11.29
11.19
11.28
11.34
11.35
11.34
11.39
11.21
11.35
11.19
11.36
11.25
11.29
11.28
11.36
101.02
100.96
100.46
99.60
100.51
97.81
98.26
97.90
98.82
99.32
98.55
98.35
97.87
99.01
95.71
96.87
99.33
98.86
98.84
97.89
97.67
96.01
97.48
97.21
73651.60
73217.44
73303.00
73325.36
73260.24
73582.96
71785.68
72111.08
71760.68
71912.76
71999.88
72015.68
71920.68
72120.60
71681.44
71781.12
71747.48
71964.16
71847.96
71583.32
72072.00
71814.48
71709.20
71803.64
*MAF,POP,MOD,OR,PREV, where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the disease model used (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
Laboratory Note
Genetic Epistasis
VII - Assessing Algorithm TEAM
LN-7-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr Roberto Frias, s/n,
4200-465 PORTO
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www : http://www.fe.up.pt/∼ei09045
[email protected]
www : http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
In this lab note, the algorithm TEAM is presented. TEAM is an exhaustive algorithm that works by updating contingency tables through a minimum spanning tree built over the SNPs. The results obtained show an increase in Power with population size, allele frequency, and odds ratio. There is also an increase in Type 1 Error Rate with population size, but no clear trend with allele frequency. The scalability of the algorithm is questionable, considering the large increase in running time between data sets of different population sizes, which is not critical for these experiments but may be problematic for larger data sets.
1 Introduction
Tree-based Epistasis Association Mapping (TEAM) [ZHZW10] is an exhaustive algorithm that computes all two-locus pairs and supports a permutation test, which is applicable to any statistical relevance test, due to the contingency tables it generates. TEAM also uses the family-wise error rate (FWER) and the false discovery rate (FDR) to control the error rate through the permutation test, which is better than Bonferroni correction but also more computationally expensive. The algorithm builds a minimum spanning tree in which the SNPs are the nodes and the edges represent the genotype difference between two SNPs. This tree is used to update the contingency tables, allowing many individuals to be skipped in the updates.
The algorithm receives the SNP genotypes and phenotypes of each individual and creates a specified number of phenotype permutations. The contingency tables for each single locus are generated first. The minimum spanning tree is then built, with the genotype differences associated to each edge. The tree is then updated, for each leaf node, with the information of the contingency table relating the genotypes of the SNPs. Finally, the test values are calculated using the contingency tables.
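A minimal sketch of the tree-building step, under the assumption that the weight of an edge between two SNPs is the number of individuals whose genotypes differ at those two SNPs (the genotype difference mentioned above). It uses Prim's algorithm and is only an illustration, not TEAM's implementation.

import numpy as np

def genotype_mst(genotypes):
    """Minimum spanning tree over SNPs using Prim's algorithm.
    genotypes: array of shape (n_individuals, n_snps) coded 0/1/2.
    Edge weight between SNPs i and j = number of individuals with different
    genotypes at i and j."""
    n_individuals, n_snps = genotypes.shape
    visited = np.zeros(n_snps, dtype=bool)
    visited[0] = True
    # cost[j] is the cheapest known connection of SNP j to the current tree.
    cost = np.count_nonzero(genotypes != genotypes[:, [0]], axis=0)
    parent = np.zeros(n_snps, dtype=int)
    edges = []
    for _ in range(n_snps - 1):
        j = int(np.argmin(np.where(visited, np.inf, cost)))
        visited[j] = True
        edges.append((int(parent[j]), j, int(cost[j])))
        new_cost = np.count_nonzero(genotypes != genotypes[:, [j]], axis=0)
        closer = (new_cost < cost) & ~visited
        cost = np.where(closer, new_cost, cost)
        parent = np.where(closer, j, parent)
    return edges  # list of (parent SNP, child SNP, genotype difference)

Each edge then indicates which individuals' entries need to be revisited when propagating contingency-table updates along the tree.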
      S1  S2  S3  S4  S5  S6  S7  S8  S9  S10
X1     0   0   0   1   2   0   2   0   2    0
X2     2   0   0   2   0   2   0   1   2    1
X3     2   2   0   1   2   2   0   1   1    0
X4     0   2   2   0   0   0   0   1   0    1
X5     2   1   2   0   1   2   0   1   0    2
Y1     1   1   1   0   1   0   1   1   1    0
Y2     0   0   0   1   1   0   1   0   1    0
Y3     1   0   1   1   1   0   1   0   1    0
Table 1: An example of the input data, consisting of 5 SNPs X1,...,X5, the original phenotype Y1, and two permutations Y2, Y3, for 10 individuals S1,...,S10.
1.1 Input files
The input consists of 2 files: a file containing the genotype information and
another containing the phenotype information for each individual.
          Xi=0                  Xi=1                  Xi=2                Total
          Xj=0   Xj=1   Xj=2    Xj=0   Xj=1   Xj=2    Xj=0   Xj=1   Xj=2
Yk=0      a1     a2     a3      b1     b2     b3      e1     e2     e3
Yk=1      c1     c2     c3      d1     d2     d3      f1     f2     f3
Total                                                                        M
Table 2: The contingency table between two SNPs Xi and Xj for a given phenotype Yk. Each cell holds the count of the corresponding event (a1, ..., f3); M refers to the total number of individuals.
(a) Genotype
0011001121
1212111121
1001000102
2202121111
(b) Phenotype
0000000010
Table 3: An example of the input file containing genotype and phenotype
information with 4 SNPs and 10 individuals. Genotype 0,1,2 corresponds
to homozygous dominant, heterozygous, and homozygous recessive. The
phenotype 0,1 corresponds to control and case respectively.
1.2 Output files
The output consists of a list of every SNP pair and the corresponding test score. The score can be calculated for any statistic defined on the contingency table. In this experiment, the test score corresponds to the chi-square statistic.
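For reference, the chi-square score of a SNP pair can be computed directly from the two-locus contingency table sketched in Table 2; the helper below is our own illustration, not TEAM code, and assumes both cases and controls are present.

import numpy as np

def pairwise_chi2(gi, gj, phenotype):
    """Chi-square statistic of the joint genotype of two SNPs against a 0/1 phenotype.
    gi, gj: genotype vectors coded 0/1/2; phenotype: 0 (control) or 1 (case)."""
    table = np.zeros((9, 2))
    for a, b, y in zip(gi, gj, phenotype):
        table[3 * a + b, y] += 1
    table = table[table.sum(axis=1) > 0]          # keep observed genotype combinations
    row_totals = table.sum(axis=1, keepdims=True)
    col_totals = table.sum(axis=0, keepdims=True)
    expected = row_totals @ col_totals / table.sum()
    return float(((table - expected) ** 2 / expected).sum())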
1.3 Parameters
The customizable parameters are as follows:
• individual - The number of individuals in the data. In this case it depends on the data set parameters.
• SNPs - The number of SNPs in the data. In this case it depends on the data sets (which is fixed at 300).
• permutation - The number of permutations used in the significance test.
• fdr_threshold - The FDR threshold for significance.
2 Experimental Settings
The datasets used in the experiments are characterized in Lab Note 1. The computer used for these experiments runs the 64-bit Ubuntu 13.10 operating system, with an Intel(R) Core(TM)2 Quad CPU Q6600 2.40GHz processor and 8.00 GB of RAM.
TEAM is provided as a C++ program that takes as parameters the genotype file, the phenotype file, the number of individuals, the number of SNPs, the number of permutations for the significance test, and the FDR threshold. The number of permutations is set to 100 and the FDR threshold is set to 1.
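The permuted phenotypes (such as Y2 and Y3 shown in Table 1) can be produced by shuffling the original case/control labels; a minimal sketch, with the number of permutations fixed at 100 as in this experiment:

import numpy as np

def permute_phenotypes(phenotype, n_permutations=100, seed=0):
    """Return an (n_permutations, n_individuals) array of shuffled 0/1 labels.
    Each permutation keeps the original number of cases and controls."""
    rng = np.random.default_rng(seed)
    phenotype = np.asarray(phenotype)
    return np.array([rng.permutation(phenotype) for _ in range(n_permutations)])

The test statistics recomputed over these permuted phenotypes provide the null distribution used to estimate the FWER and FDR thresholds.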
3 Results
The algorithm only outputs pairwise relations between SNPs. Due to this, only epistasis detection is evaluated.
The Power observed in Figure 1 shows a maximum value of 8% for data sets with 500 individuals, 65% for 1000 individuals, and 95% for 2000 individuals. There is a strong correlation between the Power and the size of the data sets. However, for frequencies smaller than 0.1 the Power is near 0% for most configurations, with the exception of data sets with 2000 individuals and 0.05 minor allele frequency. These values also increase with allele frequency, with the exception of the 0.5 allele frequency.
[Figure 1: bar chart of Power (%) by allele frequency for 500, 1000, and 2000 individuals.]
Figure 1: Power by allele frequency. For each frequency, three sizes of data sets were used to measure the Power, with an odds ratio of 2.0 and prevalence of 0.02. The Power is measured by the number of data sets where the ground truth was amongst the most relevant results, out of all 100 data sets.
The Type 1 Error Rate in Figure 2 has an interesting pattern, clearly showing a growth in error rate with the population size. However, the error does not necessarily increase with allele frequency, reaching a maximum of 37% in data sets with 0.05 allele frequency and 2000 individuals. There is also a decrease for higher allele frequencies in data sets with 2000 individuals. Therefore the relation between error rate and allele frequency is undetermined.
[Figure 2: bar chart of Type 1 Error Rate (%) by allele frequency for 500, 1000, and 2000 individuals.]
Figure 2: Type 1 Error Rate by allele frequency and population size. The Type 1 Error Rate is measured by the number of data sets where false positives were amongst the most relevant results, out of all 100 data sets.
There is a 10% difference in CPU usage (b) and a 7 second difference in running time (a), with maxima of 74% and 10 seconds, respectively. The memory usage increases from 162 MB to 228 MB, which is a 40% increase. The most relevant increase is in the running time, because the running time for 2000 individuals is triple that for 500 individuals, which is a problem for big data sets.
There is a clear increase in Power with the odds ratio in Figure 5, especially from 1.1 to 1.5, and with the population size in Figure 4, with emphasis on the difference between 1000 and 2000 individuals. The prevalence test in Figure 6 shows very little difference between disease prevalences, and the allele frequency plot shows a growth of Power with the increase in minor allele frequency.
[Figure 3: plots of (a) average running time (seconds), (b) average CPU usage (%), and (c) average memory usage (Mbytes), by number of individuals.]
Figure 3: Comparison of scalability measures between different sized data sets. The data sets have a minor allele frequency of 0.5, 2.0 odds ratio, and 0.02 prevalence.
4 Summary
TEAM is an exhaustive algorithm that generates contingency tables supporting permutation tests, which can then be applied to any relevance test. The results show an increase in Power with the increase in population size and allele frequency. The scalability test shows that the running time for the data sets with the largest population size is triple the running time for the data sets with the smallest population size. The Type 1 Error Rate increases with the population size, but the relation between error rate and allele frequency is undetermined. The results of the data set configurations by population and allele frequency confirm the previously discussed results. The odds ratio increase yields a clear increase in Power, but the prevalence increase shows nearly the same Power.
References
[ZHZW10] Xiang Zhang, Shunping Huang, Fei Zou, and Wei Wang. TEAM:
efficient two-locus epistasis tests in human genome-wide association study. Bioinformatics (Oxford, England), 26:i217–i227, 2010.
A Bar Graphs
[Figure 4: bar chart of Power (%) by population size (500, 1000, 2000).]
Figure 4: Distribution of the Power by population. The allele frequency is
0.1, the odds ratio is 2.0, and the prevalence is 0.02.
[Figure 5: bar chart of Power (%) by odds ratio (1.1, 1.5, 2.0).]
Figure 5: Distribution of the Power by odds ratios. The allele frequency is
0.1, the number of individuals is 2000, and the prevalence is 0.02.
[Figure 6: bar chart of Power (%) by prevalence (0.0001, 0.02).]
Figure 6: Distribution of the Power by prevalence. The allele frequency is
0.1, the number of individuals is 2000, and the odds ratio is 2.0.
[Figure 7: bar chart of averaged Power (%) by allele frequency.]
Figure 7: Distribution of the averaged Power by allele frequency. The number
of individuals is 2000, the odds ratio is 2.0, and the prevalence is 0.02.
B Table of Results
Table 4: A table containing the percentage of true positives and false positives for each configuration. The first column contains the description of the configuration. The second and third columns contain the number of data sets with true positives and false positives, respectively, out of all 100 data sets per configuration.
Configuration*
0.5,500,I,2.0,0.02
0.5,500,I,2.0,0.0001
0.5,500,I,1.5,0.02
0.5,500,I,1.5,0.0001
0.5,500,I,1.1,0.02
0.5,500,I,1.1,0.0001
0.5,2000,I,2.0,0.02
0.5,2000,I,2.0,0.0001
0.5,2000,I,1.5,0.02
0.5,2000,I,1.5,0.0001
0.5,2000,I,1.1,0.02
0.5,2000,I,1.1,0.0001
0.5,1000,I,2.0,0.02
0.5,1000,I,2.0,0.0001
0.5,1000,I,1.5,0.02
0.5,1000,I,1.5,0.0001
0.5,1000,I,1.1,0.02
0.5,1000,I,1.1,0.0001
0.3,500,I,2.0,0.02
0.3,500,I,2.0,0.0001
0.3,500,I,1.5,0.02
0.3,500,I,1.5,0.0001
0.3,500,I,1.1,0.02
0.3,500,I,1.1,0.0001
0.3,2000,I,2.0,0.02
0.3,2000,I,2.0,0.0001
0.3,2000,I,1.5,0.02
0.3,2000,I,1.5,0.0001
0.3,2000,I,1.1,0.02
0.3,2000,I,1.1,0.0001
0.3,1000,I,2.0,0.02
9
TP (%)
8
9
3
0
0
0
95
100
93
47
10
3
65
79
53
4
0
0
6
26
2
5
0
0
92
100
95
98
2
2
47
FP (%)
1
0
0
1
0
3
1
22
7
1
3
2
0
5
3
0
1
3
0
3
0
1
0
4
10
56
15
10
1
4
1
0.3,1000,I,2.0,0.0001
0.3,1000,I,1.5,0.02
0.3,1000,I,1.5,0.0001
0.3,1000,I,1.1,0.02
0.3,1000,I,1.1,0.0001
0.1,500,I,2.0,0.02
0.1,500,I,2.0,0.0001
0.1,500,I,1.5,0.02
0.1,500,I,1.5,0.0001
0.1,500,I,1.1,0.02
0.1,500,I,1.1,0.0001
0.1,2000,I,2.0,0.02
0.1,2000,I,2.0,0.0001
0.1,2000,I,1.5,0.02
0.1,2000,I,1.5,0.0001
0.1,2000,I,1.1,0.02
0.1,2000,I,1.1,0.0001
0.1,1000,I,2.0,0.02
0.1,1000,I,2.0,0.0001
0.1,1000,I,1.5,0.02
0.1,1000,I,1.5,0.0001
0.1,1000,I,1.1,0.02
0.1,1000,I,1.1,0.0001
0.05,500,I,2.0,0.02
0.05,500,I,2.0,0.0001
0.05,500,I,1.5,0.02
0.05,500,I,1.5,0.0001
0.05,500,I,1.1,0.02
0.05,500,I,1.1,0.0001
0.05,2000,I,2.0,0.02
0.05,2000,I,2.0,0.0001
0.05,2000,I,1.5,0.02
0.05,2000,I,1.5,0.0001
0.05,2000,I,1.1,0.02
0.05,2000,I,1.1,0.0001
0.05,1000,I,2.0,0.02
0.05,1000,I,2.0,0.0001
0.05,1000,I,1.5,0.02
10
100
40
49
0
0
0
1
1
0
0
0
92
89
81
42
1
5
21
12
5
1
0
0
0
0
0
0
0
0
43
57
40
43
0
0
1
3
0
12
3
5
0
1
2
1
0
1
1
1
28
20
5
3
3
2
5
6
0
1
2
4
0
1
0
3
1
1
37
21
24
19
3
3
4
2
1
0.05,1000,I,1.5,0.0001
0.05,1000,I,1.1,0.02
0.05,1000,I,1.1,0.0001
0.01,500,I,2.0,0.02
0.01,500,I,2.0,0.0001
0.01,500,I,1.5,0.02
0.01,500,I,1.5,0.0001
0.01,500,I,1.1,0.02
0.01,500,I,1.1,0.0001
0.01,2000,I,2.0,0.02
0.01,2000,I,2.0,0.0001
0.01,2000,I,1.5,0.02
0.01,2000,I,1.5,0.0001
0.01,2000,I,1.1,0.02
0.01,2000,I,1.1,0.0001
0.01,1000,I,2.0,0.02
0.01,1000,I,2.0,0.0001
0.01,1000,I,1.5,0.02
0.01,1000,I,1.5,0.0001
0.01,1000,I,1.1,0.02
0.01,1000,I,1.1,0.0001
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
3
1
6
0
2
0
4
1
0
1
5
3
4
1
2
2
1
0
2
2
3
*MAF,POP,MOD,OR,PREV, where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the disease model used (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
Table 5: A table containing the running time, CPU usage, and memory usage for each configuration.
Configuration*
0.5,500,I,2.0,0.02
0.5,500,I,2.0,0.0001
0.5,500,I,1.5,0.02
0.5,500,I,1.5,0.0001
0.5,500,I,1.1,0.02
0.5,500,I,1.1,0.0001
0.5,2000,I,2.0,0.02
0.5,2000,I,2.0,0.0001
Running Time (s)
3.28
3.81
3.07
3.76
3.08
3.91
9.81
11.00
11
CPU Usage (%)
66.99
54.75
74.40
57.98
68.52
55.00
74.75
72.09
Memory Usage (KB)
166590.64
166590.28
166590.44
166590.60
161592.60
166590.72
233543.92
233802.28
0.5,2000,I,1.5,0.02
0.5,2000,I,1.5,0.0001
0.5,2000,I,1.1,0.02
0.5,2000,I,1.1,0.0001
0.5,1000,I,2.0,0.02
0.5,1000,I,2.0,0.0001
0.5,1000,I,1.5,0.02
0.5,1000,I,1.5,0.0001
0.5,1000,I,1.1,0.02
0.5,1000,I,1.1,0.0001
0.3,500,I,2.0,0.02
0.3,500,I,2.0,0.0001
0.3,500,I,1.5,0.02
0.3,500,I,1.5,0.0001
0.3,500,I,1.1,0.02
0.3,500,I,1.1,0.0001
0.3,2000,I,2.0,0.02
0.3,2000,I,2.0,0.0001
0.3,2000,I,1.5,0.02
0.3,2000,I,1.5,0.0001
0.3,2000,I,1.1,0.02
0.3,2000,I,1.1,0.0001
0.3,1000,I,2.0,0.02
0.3,1000,I,2.0,0.0001
0.3,1000,I,1.5,0.02
0.3,1000,I,1.5,0.0001
0.3,1000,I,1.1,0.02
0.3,1000,I,1.1,0.0001
0.1,500,I,2.0,0.02
0.1,500,I,2.0,0.0001
0.1,500,I,1.5,0.02
0.1,500,I,1.5,0.0001
0.1,500,I,1.1,0.02
0.1,500,I,1.1,0.0001
0.1,2000,I,2.0,0.02
0.1,2000,I,2.0,0.0001
0.1,2000,I,1.5,0.02
0.1,2000,I,1.5,0.0001
9.83
10.98
9.82
10.99
5.28
6.08
5.53
6.10
5.40
6.09
3.12
3.79
3.13
3.77
3.08
3.78
9.84
10.94
9.92
10.95
9.95
11.00
5.34
6.09
5.35
6.12
5.44
6.11
3.28
3.78
3.13
3.81
3.22
3.84
9.91
10.95
9.94
10.97
72.85
66.89
73.74
69.46
69.71
68.02
66.72
66.35
68.68
65.91
71.53
56.19
70.93
56.60
72.65
56.01
72.54
73.45
72.03
73.75
72.35
70.49
67.05
63.97
69.00
63.37
67.27
65.06
65.33
56.18
71.07
55.52
67.56
54.26
72.77
73.18
71.34
71.25
12
233535.72
233821.76
233562.12
233832.84
181210.92
181210.72
181210.60
181210.64
181210.64
181210.84
166590.44
166590.60
166590.72
166590.72
166590.68
166590.40
233557.00
233778.48
233546.36
233801.92
233546.48
233828.96
181210.88
181210.80
181210.56
181210.80
181210.44
181210.68
166590.76
166590.60
166590.60
166590.84
166590.64
166590.52
233527.88
233788.92
233538.28
233803.52
0.1,2000,I,1.1,0.02
0.1,2000,I,1.1,0.0001
0.1,1000,I,2.0,0.02
0.1,1000,I,2.0,0.0001
0.1,1000,I,1.5,0.02
0.1,1000,I,1.5,0.0001
0.1,1000,I,1.1,0.02
0.1,1000,I,1.1,0.0001
0.05,500,I,2.0,0.02
0.05,500,I,2.0,0.0001
0.05,500,I,1.5,0.02
0.05,500,I,1.5,0.0001
0.05,500,I,1.1,0.02
0.05,500,I,1.1,0.0001
0.05,2000,I,2.0,0.02
0.05,2000,I,2.0,0.0001
0.05,2000,I,1.5,0.02
0.05,2000,I,1.5,0.0001
0.05,2000,I,1.1,0.02
0.05,2000,I,1.1,0.0001
0.05,1000,I,2.0,0.02
0.05,1000,I,2.0,0.0001
0.05,1000,I,1.5,0.02
0.05,1000,I,1.5,0.0001
0.05,1000,I,1.1,0.02
0.05,1000,I,1.1,0.0001
0.01,500,I,2.0,0.02
0.01,500,I,2.0,0.0001
0.01,500,I,1.5,0.02
0.01,500,I,1.5,0.0001
0.01,500,I,1.1,0.02
0.01,500,I,1.1,0.0001
0.01,2000,I,2.0,0.02
0.01,2000,I,2.0,0.0001
0.01,2000,I,1.5,0.02
0.01,2000,I,1.5,0.0001
0.01,2000,I,1.1,0.02
0.01,2000,I,1.1,0.0001
9.82
10.76
5.46
6.10
5.41
6.17
5.49
6.25
3.06
3.67
3.10
3.70
3.09
3.74
10.87
10.88
9.76
10.84
9.76
10.89
5.45
6.01
5.24
6.04
5.34
5.99
3.13
3.69
3.02
3.77
3.20
3.72
9.70
10.87
9.76
10.83
9.81
10.86
69.45
73.08
66.40
65.14
67.52
63.74
65.42
57.76
74.66
63.00
73.32
60.99
74.07
60.54
75.38
76.33
75.67
77.87
76.10
78.26
69.61
69.13
74.24
68.74
71.82
68.72
72.69
60.83
76.70
59.05
69.81
60.54
77.18
76.75
76.94
77.23
81.09
76.11
13
231225.76
233841.76
181210.92
181210.64
181210.80
181210.80
181210.52
181210.68
166590.52
166590.84
166590.60
166590.68
166590.96
166590.84
233762.72
233830.32
233551.64
233818.16
233559.48
233821.16
181210.40
181210.88
181210.68
181211.00
181210.52
181210.64
166590.40
166590.96
166590.72
166590.52
166590.60
166590.64
233557.00
233813.88
233554.60
233817.52
233562.32
233839.32
0.01,1000,I,2.0,0.02
0.01,1000,I,2.0,0.0001
0.01,1000,I,1.5,0.02
0.01,1000,I,1.5,0.0001
0.01,1000,I,1.1,0.02
0.01,1000,I,1.1,0.0001
5.28
6.00
5.35
6.05
5.35
5.99
73.35
69.20
71.68
68.01
71.71
68.05
181210.76
181210.72
181210.80
181210.36
181210.84
181211.00
*MAF,POP,MOD,OR,PREV, where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the disease model used (with or without main effect and with or without epistasis effect), OR is the odds ratio, and PREV is the prevalence of the disease.
Laboratory Note
Genetic Epistasis
VIII - Assessing Algorithm MBMDR
LN-8-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr Roberto Frias, s/n,
4200-465 PORTO
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]
www : http://www.fe.up.pt/∼ei09045
[email protected]
www : http://www.fe.up.pt/∼rcamacho
May 2014
Abstract
Model-Based Multifactor Dimensionality Reduction (MBMDR) is an algorithm that builds on the earlier MDR methodology, which consists of dividing SNPs into two clusters based on their risk of determining the disease. Instead of using a predetermined threshold derived from the frequency of SNPs in the data, MBMDR uses a testing approach followed by a significance assessment. The results show a high Power only for large data sets and a very low Type 1 Error Rate for all configurations. The running time makes the algorithm unviable for larger data sets.
1 Introduction
Multifactor Dimensionality Reduction (MDR) [CLEP07] is one of the most referenced algorithms for epistasis detection. MDR filters SNPs based on their frequency in case-control data, dividing them into high risk or low risk groups according to a predetermined threshold. Using cross-validation and permutations to determine the high/low risk groups, the algorithm returns the high risk loci that have the strongest connection to the disease outcome. However, it samples many SNPs together and analyses at most one significant epistasis model, skipping other possible SNP groups that may not have such a significant connection but may also be related to the disease.
Model-Based Multifactor Dimensionality Reduction [MVV11] merges multilocus genotypes that have significantly high or low risk based on association testing, rather than on a threshold value.
The MB-MDR process can be divided into the following steps, illustrated by the sketch after this list:
1. Multi-locus cell prioritization - Each two-locus genotype cell is assigned to the High risk, Low risk, or No Evidence of risk category.
2. Association test on a lower-dimensional construct - The result of the first step creates a new variable whose value corresponds to one of the categories. This new variable is then tested against the original label to estimate the weight of the high and low risk genotype cells.
3. Significance assessment - This stage corrects the inflation of type I errors caused by combining cells into the High risk and Low risk groups. This is done using the Wald statistic.
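The sketch below illustrates the first two steps for a single SNP pair. The cell-level test and the use of risk.threshold as its significance level are assumptions made for illustration; the actual MB-MDR implementation may use different tests, and step 3 (the Wald-based correction) is omitted.

import numpy as np
from scipy.stats import chi2_contingency

def mbmdr_steps_1_2(gi, gj, phenotype, risk_threshold=0.1):
    """Step 1: label each two-locus genotype cell High risk (+1), Low risk (-1),
    or No Evidence (0) by testing the cell against the rest of the sample.
    Step 2: test the merged High-risk group against the phenotype."""
    gi, gj, y = np.asarray(gi), np.asarray(gj), np.asarray(phenotype)
    category = np.zeros(len(y), dtype=int)
    for a in range(3):
        for b in range(3):
            cell = (gi == a) & (gj == b)
            if cell.sum() == 0 or cell.all():
                continue
            table = [[(y[cell] == 1).sum(), (y[cell] == 0).sum()],
                     [(y[~cell] == 1).sum(), (y[~cell] == 0).sum()]]
            _, p_value, _, _ = chi2_contingency(table)
            if p_value < risk_threshold:
                category[cell] = 1 if y[cell].mean() > y.mean() else -1
    high = category == 1
    if 0 < high.sum() < len(y):                      # step 2: association of the H group
        step2 = [[(y[high] == 1).sum(), (y[high] == 0).sum()],
                 [(y[~high] == 1).sum(), (y[~high] == 0).sum()]]
        statistic, p_value, _, _ = chi2_contingency(step2)
        return category, statistic, p_value
    return category, None, None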
1.1 Input files
The input file consists of an index and the phenotype in the first two columns, and the genotype of each SNP in the following columns. The first row contains the name of each column.
"","Y","SNP1","SNP2","SNP3","SNP4","SNP5"
"0", 0, 1, 2, 0, 0, 0
"1", 0, 0, 2, 1, 2, 0
"2", 1, 1, 0, 1, 0, 1
"3", 1, 1, 1, 2, 1, 0
Table 1: An example of the input file containing genotype and phenotype
information with 5 SNPs and 4 individuals.
1.2 Output files
The output consists of a list of the selected SNP interactions, with the following columns for each interaction:
1. SNP1...SNPx - Names of the SNPs in the interaction.
2. NH - Number of significant High risk genotypes in the interaction.
3. betaH - Regression coefficient of step 2 for the High risk exposure.
4. WH - Wald statistic for the High risk category.
5. PH - P-value of the Wald test for the High risk category.
6. NL - Number of significant Low risk genotypes in the interaction.
7. betaL - Regression coefficient of step 2 for the Low risk exposure.
8. WL - Wald statistic for the Low risk category.
9. PL - P-value of the Wald test for the Low risk category.
10. MIN.P - Minimum p-value (min(PH, PL)) for the interaction model.
1.3 Parameters
MBMDR accepts the following arguments:
• order - Dimension of the interactions to be analyzed.
• covar - (Optional) A data frame containing the covariates for adjusting the regression models.
• exclude - (Optional) Value(s) encoding missing data.
• risk.threshold - Threshold used to define the risk category of a multi-locus genotype. The default value is 0.1.
• adjust - (Optional) Type of regression adjustment. Can be "none", "covariates", "main effects" or "both". The default value is "none".
• first.model - Specifies the first interaction to be tested. Useful when a run was stopped before the complete analysis finished.
• list.models - (Optional) Exhaustive list of models to be analyzed. Only interactions in this list will be analyzed.
• use.logistf - Boolean value indicating whether or not the logistf package should be used. The default value is TRUE.
• printStep1 - Boolean value; every model obtained is printed if the value is TRUE. The default value is FALSE.
2 Experimental Settings
The datasets used in the experiments are characterized in Lab Note 1.
The interaction order is limited to 2, considering that the ground truth is a pairwise interaction, and all SNPs are tested against each other for pairwise interactions.
3 Results
The algorithm only outputs the statistical relevance test of interactions between SNPs. Due to this, only epistatic disease model data sets are used in this experiment. Because of time constraints, several computers were used to obtain the results, which means that it is not possible to compare scalability results.
Figure 1 reveals a large increase in Power with population size for data sets with a minor allele frequency higher than 0.01. There is a big increase for data sets with 2000 individuals from a minor allele frequency of 0.05 to 0.1. Data sets with smaller population sizes have much lower Power, with 0 Power for almost all data sets with 500 individuals. There is also a clear increase with minor allele frequency.
According to Figure 2, the Type 1 Error Rate is very low across all allele frequencies and data set sizes, with maxima of 6% and 2% at 0.05 minor allele frequency for 2000 and 1000 individuals respectively. For the other allele frequencies, only 0.1 and 0.3 contain false positives, for data sets with 2000 individuals.
Figures 3 and 6 show the same results as Figure 1, from a different perspective. Figure 4 also shows an increase in Power with the increase in odds ratio. Figure 5 shows a smaller increase in Power with the increase in prevalence.
[Figure 1: bar chart of Power (%) by allele frequency for 500, 1000, and 2000 individuals.]
Figure 1: Power by allele frequency. For each frequency, three sizes of data
sets were used to measure the Power, with an odds ratio of 2.0 and prevalence
of 0.02. The Power is measured by the amount of data sets where the ground
truth was amongst the most relevant results, out of all 100 data sets.
Figure 2: Type 1 Error Rate by allele frequency and population size. The Type 1 Error Rate is measured by the number of data sets where false positives were amongst the most relevant results, out of all 100 data sets.
4 Summary
MBMDR is an algorithm based on the popular MDR approach, clustering multilocus genotypes into high- and low-risk categories for the disease phenotype. The results show very high Power for data sets with 2000 individuals, but very low Power for all other configurations. The Type 1 Error Rate is very low, reaching a maximum of only 6% for 0.05 allele frequency and 2000 individuals. The fact that no scalability results could be obtained, due to the expected running time of the algorithm, indicates that it is not viable to use this algorithm on big data sets that might contain thousands or millions of SNPs.
References
[CLEP07] Yujin Chung, Seung Yeoun Lee, Robert C Elston, and Taesung Park. Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions. Bioinformatics (Oxford, England), 23:71–76, 2007.
[MVV11] Jestinah M Mahachie John, Francois Van Lishout, and Kristel Van Steen. Model-Based Multifactor Dimensionality Reduction to detect epistasis for quantitative traits in the presence of error-free and noisy data. Eur J Hum Genet, 19(6):696–703, June 2011.
A Bar Graphs
Figure 3: Distribution of the Power by population. The allele frequency is 0.1, the odds ratio is 2.0, and the prevalence is 0.02.
Figure 4: Distribution of the Power by odds ratio. The allele frequency is 0.1, the number of individuals is 2000, and the prevalence is 0.02.
Figure 5: Distribution of the Power by prevalence. The allele frequency is 0.1, the number of individuals is 2000, and the odds ratio is 2.0.
Figure 6: Distribution of the Power by allele frequency. The number of individuals is 2000, the odds ratio is 2.0, and the prevalence is 0.02.
B Table of Results
Table 2: A table containing the percentage of true positives and false positives in each configuration. The first column contains the description of the configuration. The second and third columns contain the number of data sets with true positives and false positives respectively, out of all 100 data sets per configuration.

Configuration*            TP (%)   FP (%)
0.5,500,I,2.0,0.02        1        0
0.5,500,I,2.0,0.0001      2        0
0.5,500,I,1.5,0.0001      0        0
0.5,500,I,1.1,0.02        0        0
0.5,500,I,1.1,0.0001      0        0
0.5,2000,I,2.0,0.02       85       0
0.5,2000,I,2.0,0.0001     91       2
0.5,2000,I,1.5,0.02       17       1
0.5,2000,I,1.5,0.0001     2        0
0.5,2000,I,1.1,0.02       0        0
0.5,2000,I,1.1,0.0001     0        0
0.5,1000,I,2.0,0.02       12       0
0.5,1000,I,2.0,0.0001     26       0
0.5,1000,I,1.5,0.02       0        0
0.5,1000,I,1.5,0.0001     0        10
0.3,500,I,2.0,0.02        0        0
0.3,500,I,2.0,0.0001      11       0
0.3,500,I,1.5,0.02        0        0
0.3,500,I,1.5,0.0001      0        0
0.3,500,I,1.1,0.02        0        0
0.3,500,I,1.1,0.0001      0        0
0.3,2000,I,2.0,0.02       71       2
0.3,2000,I,2.0,0.0001     100      8
0.3,2000,I,1.5,0.02       5        0
0.3,2000,I,1.5,0.0001     43       2
0.3,2000,I,1.1,0.02       0        0
0.3,2000,I,1.1,0.0001     0        0
0.3,1000,I,2.0,0.02       7        0
0.3,1000,I,2.0,0.0001     62       0
0.3,1000,I,1.5,0.02       0        0
0.3,1000,I,1.5,0.0001     5        0
0.3,1000,I,1.1,0.02       0        0
0.3,1000,I,1.1,0.0001     0        0
0.1,500,I,2.0,0.02        0        0
0.1,500,I,2.0,0.0001      0        0
0.1,500,I,1.5,0.02        0        0
0.1,500,I,1.5,0.0001      0        0
0.1,500,I,1.1,0.02        0        0
0.1,500,I,1.1,0.0001      0        0
0.1,2000,I,2.0,0.02       54       3
0.1,2000,I,2.0,0.0001     36       2
0.1,2000,I,1.5,0.02       1        0
0.1,2000,I,1.5,0.0001     0        0
0.1,2000,I,1.1,0.02       0        0
0.1,2000,I,1.1,0.0001     0        0
0.1,1000,I,2.0,0.02       2        0
0.1,1000,I,2.0,0.0001     1        0
0.1,1000,I,1.5,0.02       0        0
0.1,1000,I,1.5,0.0001     0        0
0.1,1000,I,1.1,0.02       0        0
0.1,1000,I,1.1,0.0001     0        0
0.05,500,I,2.0,0.02       0        0
0.05,500,I,2.0,0.0001     0        0
0.05,500,I,1.5,0.02       0        0
0.05,500,I,1.5,0.0001     0        0
0.05,500,I,1.1,0.02       0        0
0.05,500,I,1.1,0.0001     0        0
0.05,2000,I,2.0,0.02      14       6
0.05,2000,I,2.0,0.0001    3        1
0.05,2000,I,1.5,0.02      7        3
0.05,2000,I,1.5,0.0001    17       7
0.05,2000,I,1.1,0.02      0        0
0.05,2000,I,1.1,0.0001    0        0
0.05,1000,I,2.0,0.02      0        2
0.05,1000,I,2.0,0.0001    0        0
0.05,1000,I,1.5,0.02      0        0
0.05,1000,I,1.5,0.0001    0        0
0.05,1000,I,1.1,0.02      0        0
0.05,1000,I,1.1,0.0001    0        1
0.01,500,I,2.0,0.02       0        0
0.01,500,I,2.0,0.0001     0        0
0.01,500,I,1.5,0.02       0        0
0.01,500,I,1.5,0.0001     0        1
0.01,500,I,1.1,0.02       0        0
0.01,500,I,1.1,0.0001     0        0
0.01,1000,I,1.5,0.0001    0        0
0.01,1000,I,1.1,0.0001    0        0

*MAF,POP,MOD,OR,PREV, where MAF represents the minor allele frequency, POP is the number of individuals, MOD is the model used (with or without main effect and with or without epistasis effect), OR is the odds ratio and PREV is the prevalence of the disease.
Laboratory Note
Genetic Epistasis IX - Comparative Assessment of the Algorithms
LN-9-2014
Ricardo Pinho and Rui Camacho
FEUP
Rua Dr Roberto Frias, s/n,
4200-465 PORTO
Portugal
Fax: (+351) 22 508 1440
e-mail: [email protected]    www: http://www.fe.up.pt/~ei09045
e-mail: [email protected]    www: http://www.fe.up.pt/~rcamacho
May 2014
Abstract
This lab note contains the results obtained with the algorithms discussed in previous lab notes. All algorithms are compared by their characteristics and by their Power, scalability, and Type 1 Error Rate in epistatic detection, main effect detection, and full effect detection. From the results obtained, we can see that the algorithm BOOST has the highest Power in epistatic detection and main effect detection, but has a high error rate. Screen and Clean has a constant but high error rate overall, very low Power in epistatic detection and average Power in the other models. SNPHarvester and SNPRuler have relatively low Power, but low error rates. TEAM has good Power, but a high error rate. MBMDR has good Power and a low Type 1 Error Rate, but very bad scalability. BEAM3 has high Power in main effect detection, but also a high error rate. In terms of scalability, BOOST is the most scalable and MBMDR the least scalable.
1 Introduction
In this lab note, the epistasis detection algorithms assessed in earlier lab notes ([PC14b] [PC14c] [PC14d] [PC14e] [PC14f] [PC14g] [PC14h]) are compared, using the results from the data sets and measurements discussed in Lab Note LN-1-2014 [PC14a].
The algorithms used in this empirical study are BEAM 3.0 [Zha12]; BOOST [WYY+10a]; MBMDR [MVV11]; Screen and Clean [WDR+10]; SNPRuler [WYY+10b]; SNPHarvester [YHW+09]; and TEAM [ZHZW10]. Table 1 and Table 2 show the main characteristics of the search methods, scoring techniques, types of disease models detected, and the programming languages of the tested algorithms [SZS+11].
Table 1: Similarities and differences between BEAM3, BOOST, MBMDR, and Screen & Clean. The compared features are the search strategy, the use of a permutation test, a chi-square test, a tree/graph structure, and Bonferroni correction, the ability to detect interactive, main, and full effects, and the programming language. BEAM3 uses a stochastic search and is written in C++; BOOST uses an exhaustive search and is written in C; MBMDR uses an exhaustive search and is written in R; Screen & Clean uses a heuristic search and is also written in R.
*Although BEAM3 can evaluate interactive and full effects, the evaluation test is not comparable between methods; only single SNPs are evaluated with a χ2 test. MBMDR and Screen & Clean results are comparable with the other algorithms.
Table 2: Similarities and differences between SNPHarvester, SNPRuler, and TEAM, compared on the same features as Table 1. SNPHarvester uses a stochastic search, a chi-square test without a permutation test, no tree structure, and Bonferroni correction; it detects interactive, main, and full effects and is written in Java. SNPRuler uses a heuristic search and is written in Java; TEAM uses an exhaustive search and is written in C++.
2 Comparative Assessment
The measures used to assess the quality of each algorithm are: Power, Scalability, and Type 1 Error Rate.
2.1 Power
The Power of an algorithm is related to its ability to find the ground truth
of the disease. In this case, the Power is evaluated by the number of data
sets, out of 100, where the algorithm finds the ground truth and is measured
as a percentage for each data set configuration. In each data set, the most
significant interactions, i.e. α < 0.05, are selected.
2.2 Scalability
Scalability is determined by three main factors: execution time, CPU usage, and memory usage. Execution time is measured in seconds, CPU usage is measured as the percentage of processor time used by the algorithm, and memory usage is measured in kilobytes of RAM used by the algorithm. All measures are averaged over the 100 data sets in each data set configuration.
2.3 Type 1 Error Rate
Similarly to the Power, the Type 1 Error Rate is determined by the number of data sets, out of 100, in which false positives appear among the most significant interactions, i.e. α < 0.05 (a small sketch covering both measures follows).
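A minimal sketch of how both measures could be computed for one data set configuration, assuming two logical vectors with one entry per simulated data set have already been filled in by the per-data-set analysis (the vectors below are placeholders):

  found_truth        <- c(TRUE, FALSE, TRUE, TRUE)    # ground truth among the significant results?
  has_false_positive <- c(FALSE, TRUE, FALSE, FALSE)  # any false positive among them?
  power_pct <- 100 * mean(found_truth)                # Power (%), Section 2.1
  t1er_pct  <- 100 * mean(has_false_positive)         # Type 1 Error Rate (%), Section 2.3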
3 Experimental Procedure
As mentioned in Lab Note LN-1-2014 [PC14a], there are 270 different configurations of data sets, with different parameters: allele frequency (0.01, 0.05, 0.1, 0.3, and 0.5); population size (500, 1000, and 2000); odds ratio (1.1, 1.5, and 2.0); prevalence (0.0001 and 0.02); and disease model (Epistasis, Main Effect, and Epistasis + Main Effect).
To test the Power and Type 1 Error Rate of the algorithms, the output of each algorithm is gathered for each data set configuration and the corresponding confusion matrix is created. The output of each algorithm is filtered, selecting only interactions with a statistical relevance of 5%. From these confusion matrices, the number of true positives and false positives of the data sets within each configuration is obtained and used as the basis for comparing Power and Type 1 Error Rate respectively. For scalability, the built-in shell command time was used to obtain all the scalability measures for all algorithms (one way to instrument a run is sketched below).
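One way a single run could be instrumented is sketched below. It assumes GNU time is installed as /usr/bin/time (the note itself refers to the shell's built-in time command), whose -f format reports elapsed seconds, CPU percentage, and peak memory in kilobytes; the command line ./algorithm dataset.txt is a placeholder, not the invocation of any particular tool:

  system("/usr/bin/time -f '%e %P %M' -o timing.txt ./algorithm dataset.txt")
  fields    <- strsplit(readLines("timing.txt")[1], " ")[[1]]
  elapsed_s <- as.numeric(fields[1])                  # wall-clock time in seconds
  cpu_pct   <- as.numeric(sub("%", "", fields[2]))    # CPU usage in percent
  memory_kb <- as.numeric(fields[3])                  # maximum resident set size in KB
  # Averaging these values over the 100 data sets of a configuration gives the entries of Table 5.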
4 Results
To compare each criterion, Tables 2, 3, and 4 present the Power and Type 1 Error Rate of each algorithm, by number of individuals and allele frequency, for epistasis, main effect, and full effect detection respectively. Table 5 shows the results of the scalability measures used to evaluate each algorithm.
For epistasis detection, we can see that, for data sets with 500 individuals, no algorithm has a Power above 26%. This shows a great difficulty in detecting epistasis with few individuals. The algorithm with the best Power for these data sets is BOOST, followed by TEAM, SNPRuler and SNPHarvester. In terms of error rate, however, the algorithm with the lowest values is SNPRuler, followed by TEAM, SNPHarvester and BOOST. For data sets with 1000 individuals, there is a big increase in Power in all algorithms, reaching a maximum of 91%. BOOST has the best Power in all allele frequencies, followed by TEAM, SNPRuler and SNPHarvester. SNPRuler is once again the algorithm with the lowest Type 1 Error Rate, followed by TEAM, BOOST and SNPHarvester. For 2000 individuals, BOOST has the best Power with a maximum of 100%, followed by TEAM and SNPHarvester, with SNPRuler being better than SNPHarvester for 0.5 minor allele frequency. The lowest error rate is achieved by SNPRuler. Each of the other algorithms has a high Type 1 Error Rate in at least one setting. Screen and Clean is clearly the worst algorithm due to its lack of Power and high Type 1 Error Rate across all data set sizes. The Power increases with allele frequency for each algorithm, reaching its maximum at 0.5 allele frequency. There is no clear correlation between error rate and allele frequency for any algorithm.
500 individuals
MAF          0.01          0.05          0.1           0.3           0.5
             P     T1ER    P     T1ER    P     T1ER    P     T1ER    P     T1ER
BOOST        0%    4%      0%    7%      1%    7%      14%   6%      26%   4%
SnC          0%    15%     0%    15%     0%    14%     0%    19%     0%    18%
SNPH         0%    4%      0%    4%      0%    7%      4%    3%      2%    4%
SNPR         0%    0%      0%    0%      0%    0%      3%    0%      6%    0%
TEAM         0%    0%      0%    0%      0%    2%      6%    0%      8%    1%

1000 individuals
MAF          0.01          0.05          0.1           0.3           0.5
             P     T1ER    P     T1ER    P     T1ER    P     T1ER    P     T1ER
BOOST        0%    7%      0%    4%      41%   5%      66%   2%      91%   8%
SnC          0%    17%     0%    22%     0%    16%     0%    15%     0%    22%
SNPH         0%    4%      0%    13%     21%   9%      43%   9%      14%   3%
SNPR         0%    0%      0%    0%      10%   0%      35%   0%      71%   0%
TEAM         0%    2%      1%    4%      21%   5%      47%   1%      65%   0%

2000 individuals
MAF          0.01          0.05          0.1           0.3           0.5
             P     T1ER    P     T1ER    P     T1ER    P     T1ER    P     T1ER
BOOST        0%    2%      7%    2%      94%   21%     100%  6%      100%  8%
SnC          0%    18%     0%    20%     6%    21%     2%    16%     0%    14%
SNPH         0%    2%      18%   27%     85%   19%     70%   11%     33%   5%
SNPR         0%    0%      0%    1%      32%   8%      44%   0%      92%   0%
TEAM         0%    1%      43%   37%     92%   28%     92%   10%     95%   1%

Table 2: This table contains the results for epistasis detection. A comparison between the tested algorithms: BOOST, Screen and Clean, SNPHarvester, SNPRuler, and TEAM. The table is organized by population size (POP) and minor allele frequency (MAF), with an odds ratio of 2.0 and a prevalence of 0.02. For each allele frequency, there are two columns: the Power (P) obtained, and the Type 1 Error Rate (T1ER).
In main effect detection, for 500 individuals, the best algorithm is BEAM3, closely followed by BOOST, with SNPHarvester and Screen and Clean far behind. The Type 1 Error Rate is lowest for MBMDR and Screen and Clean, with BEAM3, SNPHarvester, and BOOST very close to each other with very high error rates, BOOST having the highest error rate. For 1000 individuals, BOOST has better Power than BEAM3, followed by SNPHarvester and Screen and Clean, with MBMDR far behind. The Type 1 Error Rate is highest for BOOST, very closely followed by BEAM3, SNPHarvester, and Screen and Clean, with MBMDR having the lowest error rate. For data sets with 2000 individuals, BOOST and MBMDR have better Power for data sets with allele frequencies lower than 0.1, while BEAM3, BOOST and SNPHarvester are equally good for allele frequencies higher than 0.1. The error rate is generally lower for MBMDR, followed by Screen and Clean.
Table 4 shows the full effect detection results for BOOST, Screen and Clean, and SNPHarvester. BOOST and SNPHarvester have the highest Power for all allele frequencies but also a high Type 1 Error Rate. Screen and Clean has high Power for high allele frequencies, but zero Power for configurations below 0.3 and a higher Type 1 Error Rate for configurations below 0.1. Screen and Clean has the lowest Type 1 Error Rate overall but also the worst Power. BOOST has the best ratio of Power to Type 1 Error Rate.
Table 5 shows the running time, CPU usage and memory usage of all algorithms for the scalability measure. Screen and Clean is the slowest recorded algorithm, followed by SNPHarvester, TEAM, BEAM3 and SNPRuler, with BOOST being the fastest algorithm. Screen and Clean also has the highest increase in running time, followed by SNPHarvester and TEAM, with BOOST, BEAM3, and SNPRuler far behind. SNPRuler is the algorithm with the highest CPU usage, having to resort to more than one core to finish each task. SNPHarvester, BOOST, BEAM3, and Screen and Clean are all close to 100%, with TEAM being the algorithm with the lowest CPU usage. BEAM3, BOOST, and TEAM show an increase in CPU usage with data set size, TEAM having the highest increase. In memory usage, SNPRuler shows the highest usage of memory, closely followed by TEAM, Screen and Clean, SNPHarvester, and BEAM3, with BOOST far behind.
500 individuals
MAF          0.01          0.05          0.1           0.3           0.5
             P     T1ER    P     T1ER    P     T1ER    P     T1ER    P     T1ER
BEAM3        0%    0%      0%    3%      0%    9%      100%  71%     100%  99%
BOOST        0%    1%      0%    1%      2%    12%     100%  78%     100%  97%
MBMDR        0%    0%      0%    0%      0%    0%      0%    0%      1%    0%
SnC          0%    14%     0%    17%     0%    21%     20%   23%     54%   15%
SNPH         0%    1%      0%    5%      0%    11%     100%  78%     100%  99%

1000 individuals
MAF          0.01          0.05          0.1           0.3           0.5
             P     T1ER    P     T1ER    P     T1ER    P     T1ER    P     T1ER
BEAM3        0%    6%      0%    3%      32%   18%     100%  99%     100%  100%
BOOST        0%    7%      1%    3%      43%   23%     100%  99%     100%  100%
MBMDR        0%    0%      0%    2%      2%    0%      7%    0%      12%   0%
SnC          0%    14%     0%    21%     0%    23%     54%   28%     70%   30%
SNPH         0%    10%     0%    4%      38%   22%     100%  99%     100%  100%

2000 individuals
MAF          0.01          0.05          0.1           0.3           0.5
             P     T1ER    P     T1ER    P     T1ER    P     T1ER    P     T1ER
BEAM3        0%    1%      1%    17%     92%   67%     100%  100%    100%  100%
BOOST        0%    1%      14%   11%     97%   74%     100%  100%    100%  100%
MBMDR        0%    0%      14%   6%      54%   3%      71%   2%      85%   0%
SnC          0%    13%     0%    22%     39%   36%     58%   38%     62%   48%
SNPH         0%    1%      1%    24%     92%   79%     100%  100%    100%  100%

Table 3: This table contains the results for main effect detection. A comparison between the tested algorithms: BEAM3, BOOST, MBMDR, Screen and Clean, and SNPHarvester. The table is organized by population size (POP) and minor allele frequency (MAF), with an odds ratio of 2.0 and a prevalence of 0.02. For each allele frequency, there are two columns: the Power (P) obtained, and the Type 1 Error Rate (T1ER).
500 individuals
MAF          0.01          0.05          0.1           0.3           0.5
             P     T1ER    P     T1ER    P     T1ER    P     T1ER    P     T1ER
BOOST        0%    10%     0%    4%      1%    15%     100%  100%    100%  100%
SnC          0%    18%     0%    15%     0%    19%     30%   19%     49%   37%
SNPH         0%    2%      0%    8%      0%    9%      100%  100%    100%  100%

1000 individuals
MAF          0.01          0.05          0.1           0.3           0.5
             P     T1ER    P     T1ER    P     T1ER    P     T1ER    P     T1ER
BOOST        0%    11%     2%    16%     42%   38%     100%  100%    100%  100%
SnC          0%    14%     0%    21%     0%    28%     58%   35%     73%   45%
SNPH         0%    4%      0%    8%      32%   27%     100%  100%    100%  100%

2000 individuals
MAF          0.01          0.05          0.1           0.3           0.5
             P     T1ER    P     T1ER    P     T1ER    P     T1ER    P     T1ER
BOOST        0%    7%      15%   17%     98%   81%     100%  100%    100%  100%
SnC          0%    14%     0%    21%     0%    33%     40%   68%     91%   84%
SNPH         0%    1%      0%    20%     95%   79%     100%  100%    100%  100%

Table 4: This table contains the results for full effect detection. A comparison between the tested algorithms: BOOST, Screen and Clean, and SNPHarvester. The table is organized by population size (POP) and minor allele frequency (MAF), with an odds ratio of 2.0 and a prevalence of 0.02. For each allele frequency, there are two columns: the Power (P) obtained, and the Type 1 Error Rate (T1ER).
                Running Time (s)         CPU Usage (%)              Memory Usage (MB)
                500    1000   2000       500     1000    2000       500     1000    2000
BEAM3           4.9    7      8          87.8    96.3    95.5       4       4.3     5.8
BOOST           0.16   0.22   0.34       95.7    98.79   97.87      0.98    1       1.2
MBMDR*          -      -      -          -       -       -          -       -       -
SnC             8.05   18.65  34.65      75.7    98.99   77.25      129.8   137.2   152.5
SNPHarvester    9.29   25.89  33         102.1   86.5    101.6      68.35   71.3    76.86
SNPRuler        2.7    3.09   4.1        130.2   141.9   156.28     312.7   316     320.2
TEAM            3.28   5.28   9.81       66.99   69.71   74.75      162.7   176     228.1

Table 5: Scalability test containing the average running time, CPU usage, and memory usage by data set population size. *MBMDR does not contain scalability results because these were obtained from different computers with different hardware settings from all other results. The data sets have a minor allele frequency of 0.5, an odds ratio of 2.0, and a prevalence of 0.02.
5 Results Discussion
The results obtained from all the different algorithms show interesting qualities among them. BOOST is clearly the algorithm with the highest Power, but it has a high Type 1 Error Rate. SNPRuler has a low Type 1 Error Rate, but not very high Power, and it only works for epistasis detection. Screen and Clean is ineffective in most settings, but has a relatively low Type 1 Error Rate and high Power for main effect and full effect detection in data sets with high allele frequency. BEAM3 only works for main effect detection but has high Power with a slightly lower error rate than BOOST. SNPHarvester has low Power, but also a low Type 1 Error Rate in all model types. MBMDR has good Power for certain configurations and a very low Type 1 Error Rate; however, it has a very high running time for each data set. TEAM has good Power, with a slightly high Type 1 Error Rate in certain configurations.
BOOST is the most scalable algorithm, followed by SNPRuler and BEAM3. This is especially important for large data sets and for the ability to work in an ensemble approach. In epistasis detection, considering the Power, Screen and Clean and SNPHarvester show the worst potential. For main effect detection, the Power is lowest for Screen and Clean and MBMDR. For full effect detection, Screen and Clean is once again the weakest algorithm.
With this information, the best algorithms for each scenario can be used together to maximize Power and lower the Type 1 Error Rate.
These experiments use more configurations than any previous empirical study. These configurations were processed by 7 of the state-of-the-art algorithms, which yielded interesting results. The contribution of these experiments is an unprecedentedly large comparison study, using various relevant measures, which allows for a full understanding of each algorithm.
References
[MVV11] Jestinah M Mahachie John, Francois Van Lishout, and Kristel Van Steen. Model-Based Multifactor Dimensionality Reduction to detect epistasis for quantitative traits in the presence of error-free and noisy data. Eur J Hum Genet, 19(6):696–703, June 2011.
[PC14a] Ricardo Pinho and Rui Camacho. Genetic Epistasis I - Materials and methods. 2014.
[PC14b] Ricardo Pinho and Rui Camacho. Genetic Epistasis II - Assessing Algorithm BEAM 3.0. 2014.
[PC14c] Ricardo Pinho and Rui Camacho. Genetic Epistasis III - Assessing Algorithm BOOST. 2014.
[PC14d] Ricardo Pinho and Rui Camacho. Genetic Epistasis IV - Assessing Algorithm Screen and Clean. 2014.
[PC14e] Ricardo Pinho and Rui Camacho. Genetic Epistasis V - Assessing Algorithm SNPRuler. 2014.
[PC14f] Ricardo Pinho and Rui Camacho. Genetic Epistasis VI - Assessing Algorithm SNPHarvester. 2014.
[PC14g] Ricardo Pinho and Rui Camacho. Genetic Epistasis VII - Assessing Algorithm TEAM. 2014.
[PC14h] Ricardo Pinho and Rui Camacho. Genetic Epistasis VIII - Assessing Algorithm MBMDR. 2014.
[SZS+11] Junliang Shang, Junying Zhang, Yan Sun, Dan Liu, Daojun Ye, and Yaling Yin. Performance analysis of novel methods for detecting epistasis, 2011.
[WDR+10] Jing Wu, Bernie Devlin, Steven Ringquist, Massimo Trucco, and Kathryn Roeder. Screen and clean: a tool for identifying interactions in genome-wide association studies. Genetic Epidemiology, 34:275–285, 2010.
[WYY+10a] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Xiaodan Fan, Nelson L S Tang, and Weichuan Yu. BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies. American Journal of Human Genetics, 87:325–340, 2010.
[WYY+10b] Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Nelson L S Tang, and Weichuan Yu. Predictive rule inference for epistatic interaction detection in genome-wide association studies. Bioinformatics (Oxford, England), 26:30–37, 2010.
[YHW+09] Can Yang, Zengyou He, Xiang Wan, Qiang Yang, Hong Xue, and Weichuan Yu. SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies. Bioinformatics (Oxford, England), 25:504–511, 2009.
[Zha12] Yu Zhang. A novel bayesian graphical model for genome-wide multi-SNP association mapping. Genetic Epidemiology, 36:36–47, 2012.
[ZHZW10] Xiang Zhang, Shunping Huang, Fei Zou, and Wei Wang. TEAM: efficient two-locus epistasis tests in human genome-wide association study. Bioinformatics (Oxford, England), 26:i217–i227, 2010.
A Bar Graphs
A.1 Population size
Figure 1: These results correspond to epistasis detection by population size, with a 0.1 minor allele frequency, 2.0 odds ratio, and 0.02 prevalence. The Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler and TEAM are shown; each subfigure contains the values for all algorithms in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c).
Figure 2: These results correspond to main effect detection by population size, with a 0.1 minor allele frequency, 2.0 odds ratio, and 0.02 prevalence. The Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler and TEAM are shown; each subfigure contains the values for all algorithms in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c).
Figure 3: These results correspond to full effect detection by population size, with a 0.1 minor allele frequency, 2.0 odds ratio, and 0.02 prevalence. The Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler and TEAM are shown; each subfigure contains the values for all algorithms in data sets with 500 individuals (a), 1000 individuals (b), and 2000 individuals (c).
A.2 Frequency
Figure 4: These results correspond to epistasis detection by minor allele frequency, with 2000 individuals, 2.0 odds ratio, and 0.02 prevalence. The Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler and TEAM are shown; each subfigure contains the values for all algorithms in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies.
Figure 5: These results correspond to main effect detection by minor allele frequency, with 2000 individuals, 2.0 odds ratio, and 0.02 prevalence. The Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler and TEAM are shown; each subfigure contains the values for all algorithms in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies.
Figure 6: These results correspond to full effect detection by minor allele frequency, with 2000 individuals, 2.0 odds ratio, and 0.02 prevalence. The Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler and TEAM are shown; each subfigure contains the values for all algorithms in data sets with 0.01 (a), 0.05 (b), 0.1 (c), 0.3 (d), and 0.5 (e) allele frequencies.
A.3 Odds Ratio
Figure 7: These results correspond to epistasis detection by odds ratio, with a minor allele frequency of 0.1, 2000 individuals, and a 0.02 prevalence. The Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler and TEAM are shown; each subfigure contains the values for all algorithms in data sets with 1.1 (a), 1.5 (b), and 2.0 (c) odds ratio.
Figure 8: These results correspond to main effect detection by odds ratio, with a minor allele frequency of 0.1, 2000 individuals, and a 0.02 prevalence. The Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler and TEAM are shown; each subfigure contains the values for all algorithms in data sets with 1.1 (a), 1.5 (b), and 2.0 (c) odds ratio.
Figure 9: These results correspond to full effect detection by odds ratio, with a minor allele frequency of 0.1, 2000 individuals, and a 0.02 prevalence. The Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler and TEAM are shown; each subfigure contains the values for all algorithms in data sets with 1.1 (a), 1.5 (b), and 2.0 (c) odds ratio.
A.4 Prevalence
Figure 10: These results correspond to epistasis detection by prevalence, with a minor allele frequency of 0.1, 2000 individuals, and a 2.0 odds ratio. The Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler and TEAM are shown; each subfigure contains the values for all algorithms in data sets with 0.0001 (a) and 0.02 (b) prevalence.
Figure 11: These results correspond to main effect detection by prevalence, with a minor allele frequency of 0.1, 2000 individuals, and a 2.0 odds ratio. The Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler and TEAM are shown; each subfigure contains the values for all algorithms in data sets with 0.0001 (a) and 0.02 (b) prevalence.
Figure 12: These results correspond to full effect detection by prevalence, with a minor allele frequency of 0.1, 2000 individuals, and a 2.0 odds ratio. The Power and Type 1 Error Rate of BOOST, MBMDR, Screen and Clean, SNPHarvester, SNPRuler and TEAM are shown; each subfigure contains the values for all algorithms in data sets with 0.0001 (a) and 0.02 (b) prevalence.