Prediction of Fetal Hemoglobin in Sickle Cell Anemia Using an

DOI: 10.1161/CIRCGENETICS.113.000387
Prediction of Fetal Hemoglobin in Sickle Cell Anemia Using an Ensemble of
Genetic Risk Prediction Models
Running title: Milton et al.; Predicting Fetal Hemoglobin in Sickle Cell Anemia
Jacqueline N. Milton, PhD1; Victor R. Gordeuk, MD2; James G. Taylor VI, MD3;
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
Mark T. Gladwin, MD4; Martin H. Steinberg, MD5; Paola Se
Seba
Sebastiani,
bast
ba
stia
st
iani
ia
n , Ph
ni
PhD
D1
1
Department
n of
nt
of Biostatistics,
Biosta
Bi
tati
t sticcs, Boston University Schooll ooff Public Healt
Health,
th, B
Boston,
os
oston,
MA; 2Center for Sickle
Cell Disease,
se, H
se
Howard
oward
a Uni
University,
niveersit
i y, Washington,
Was
ash
hing
ngttonn, DC
ng
DC;
C; 3Sick
Sickle
ckle C
Cell
elll V
Vascular
asscuular D
Disease
isea
eaase S
Section,
ecti
ec
t on,
ti
on He
Hem
Hematology
matt
Branch, National
a on
ation
onal
a Heart,
Heartt, Lung,
Lu
ung, and
an
nd Blood
Blood In
Bloo
Inst
Institute,
tituute,
e,, B
Bethesda,
ethhesdaa, MD
MD;
D; 4Di
Division
ivision
onn off Pulmonary,
Pulm
mon
onaary, A
Allergy
llergg and
Critical Care Medicine an
and
d Vascul
Vascular
ular
ul
a Medicine Institute,
Instittut
u e, Univ
University
iver
ve sity
y of Pittsburgh,
Piitttsb
s urgh
g , Pittsburgh, PA;
P
5
Department off Medicine,
M d
Me
diiciine
ne, Boston
B ston
Bo
stton U
University
nive
ni
ive
v rs
rsit
ityy School
Scchool
S
ol ooff Me
M
Medicine,
edi
dici
di
cine
ine
ne, Boston,
B ston, MA
Bo
Correspondence:
Jacqueline N. Milton
Department of Biostatistics
Boston University
801 Massachusetts Ave.
Boston, MA 02118
Tel: (617) 414-7944
Fax: (617) 638-6484
E-mail: [email protected]
Journal Subject Codes: [109] Clinical genetics, [146] Genomics, [98] Other research
1
DOI: 10.1161/CIRCGENETICS.113.000387
Abstract
Background - Fetal hemoglobin (HbF) is the major modifier of the clinical course of sickle cell
anemia. Its levels are highly heritable and its interpersonal variability is modulated in part by
three quantitative trait loci (QTL) that effect HbF gene expression. Genome-wide association
studies (GWAS) have identified single nucleotide polymorphisms (SNPs) in these QTLs that are
highly associated with HbF but explain only 10 to 12% of the variance of HbF. Combining SNPs
into a genetic risk score (GRS) can help to explain a larger amount of the variability of HbF level
but the challenge of this approach is to select the optimal number of SNPs to be included in the
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
GRS.
Methods and Results - We develop a collection of 14 models with GRS
RS composed
com
ompo
pose
po
sedd of different
se
diff
di
ffeer
ff
er
numbers of SNPs, and use the ensemble of these models to predict HbF
F in si
sick
sickle
ckle
ck
le ccell
elll an
el
anem
anemia
e
em
patients. The
h models
he
mode
mo
dels
de
l were
ls
werre trained in 841 sickle cell
lll anemia
annemia patients and
and were tested in three
three
independent
n co
nt
cohorts.
ohorts. The
Th en
ensemble
nsem
embl
em
blee off 114
bl
4m
models
ode
d ls explained
exp
xplaain
ined
ed
d 23.4%
23.
3.4%
4% ooff thee vvariability
ari
riiab
abil
iliit
il
ity in
ity
n HbF
Hb in
the discovery
ryy cohort,
co
while
w ilee the
wh
th
he correlation
corr
rrel
rr
elatio
el
on between
b tw
be
weeen pr
pred
predicted
dic
ictedd an
and
nd observed
obsserrved
ed
d HbF
F inn tthe
he 3
independent
n cohorts ranged
nt
rangeed be
bbetween
tw
wee
eenn 0.
00.28
28 and
andd 0.44.
0.4
. 4. The
The mod
models
odel
od
e s includ
included
uded
ud
ed SN
SNPs in BCL11A,
BCL11A
A the
HBS1L-MYB
Y interg
YB
intergenic
genic region
reg
gion andd th
the
he si
site
ite off th
the
he HB
HBB
B ge
ggene
ne cluster,, QT
Q
QTL
L pr
ppreviously
eviously
y associated
associi
with HbF.
Conclusions - An ensemble of 14 genetic risk models can predict HbF levels with accuracy
between 0.28 and 0.44 and the approach may prove useful in other applications.
Key words: sickle cell disease, hemoglobin, genetics, association studies, risk prediction, risk
factor
2
DOI: 10.1161/CIRCGENETICS.113.000387
Introduction
HbF is the major modifier of the clinical features of sickle cell anemia (homozygosity for HBB
glu6val) and ȕ thalassemia. HbF inhibits sickle hemoglobin (HbS) polymerization and
FRPSHQVDWHVIRUWKHGHILFLWRIQRUPDO+E$LQȕWKDODVVHPLD1. If it were possible to know at birth
the HbF level likely to be present after stabilization of this measurement at about age 5 years2,
then a better patient-specific prognosis might be given and HbF-inducing treatments better
tailored to the individual. 7KHȖ-globin chain of HbF is encoded by the linked HBG1 and HBG2
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
genes. Levels of HbF in adults are highly heritable and the production of HbF is ggenetically
eneticcal
ally
l
3-66
regulated by several quantitative trait loci that modulate HBG1 and HBG2
BG2 expression.
expressiion.3Genome
G
e
i on sstudies
iation
tudiies ((GWAS)
tu
tudi
G AS) in sickle cell anemia
GW
aneemi
m a have identified
ed
d single nucleotide
wide association
polymorphisms
ism
ism
ms (SNPs) in
n BC
B
BCL11A,
CL1
L11A
L1
A, the HBS1L-MYB
HBSS1L--MYB
HB
B intergenic
inte
terg
rgen
rg
e ic re
en
region
egio
on and
ndd eleme
elements
m ntss link
me
linked
nkked
d to
e clu
ene
lust
lu
ust
ster
e that
er
att jo
join
i tl
in
tly
l explain
expl
plai
pl
ain 10-12%
ai
100 12
12%
2% off tthe
he vvariability
he
aria
ar
iabi
iabi
biliity
y of
of HbF.
Hb 22,, 7 Howe
However,
H
oweve
verr, iitt is
the HBB gene
cluster
jointly
possible that
a SNPs that are si
at
significantly
ign
g if
ific
i antlly associated
d wi
with
ithh HbF
F levels
l but
but do
do not reach ge
ggenomenom
m
wide significance
may
ficance
fi
i
ma explain
e plain
pllaii additional
addi
dditii all variability
ariabilit
iabi
bili
lit andd be
be used
sed
d to predict
dict HbF
di
HbF levels.
le
l els
ell This
Th
Thi
h is in
part due to the difference in the goals of the analysis of GWAS and phenotype prediction; in
order to increase the amount of variability explained in a phenotype one may need to use SNPs
that fall below the genome-wide significance threshold.8
One approach to genetic risk prediction uses a summary of the risk alleles in the form of
a genetic risk score (GRS) as a covariate of the model.9-12 A GRS can summarize a large amount
of genetic information into a single covariate, but the challenge is to identify the optimal number
of SNPs to be included in the score. To overcome this challenge, we present a novel method of
creating an ensemble of models with the GRS composed of a different number of SNPs to
produce more robust predictions. This method extends the approach introduced in Sebastiani et
3
DOI: 10.1161/CIRCGENETICS.113.000387
al 2012 for building genetic risk prediction models from case control studies to predict
quantitative traits.13 We show that an ensemble of 14 models with GRSs comprising 1 to 14
SNPs that were trained in a set of 841 sickle cell anemia patients from the Cooperative Study of
Sickle Cell Disease (CSSCD) can predict HbF in three different cohorts of African Americans
with sickle cell anemia with correlation between observed and predicted values between 0.28 and
0.44.
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
Methods
Participants
HbF levels were m
measured
e suureed in 841 African American ssubjects
ea
ubjects from thee CSSCD (NCT00005277)
(NCT00005
homozygous
us for
us
for the HbS
S ggene
en
ne or
or w
with
i h Hb
it
H
HbS-ȕ
S-ȕ0 tha
tthalassemia.
th
halasssemi
miaa.14 Th
mi
The
he va
validation
ali
l dati
tion
ti
on cohorts
coh
ohor
orrtss iincluded
nclu
nc
l d
181 patientss ffrom
rom
ro
m tthe
h P
he
Pulmonary
ulmo
ul
monnary Hypertension
mo
Hyp
yper
erte
er
t nssio
te
ionn an
andd Si
Sick
Sickle
ckle
ck
le Cell
Cel
elll Disease
Dise
Di
seaase wi
se
with
th S
Sildenafil
ilde
il
dena
de
nafi
na
fill Therapy
fi
Thee
Th
(Walk-PHaSST)
a
aSST)
Study (NCT00492531),
(N
NCT
C 00
0 499225531
31),
), 77
77 patients
paati
tien
en
nts from
fro
rom
m a st
sstudy
udyy of pulmonary
ud
pul
u mo
ona
nary hypertension
hypertensi in
children with
th si
sick
sickle
ckle
le ccell
elll disease
dise
di
seas
asee (P
((PUSH
PUS
USH
H NC
NCT
CT 00
0049
00495638),
49956
5 38
38),
), aand
nd 1127
27 ssickle
27
i kl
ic
klee ce
cell
ll anemia
ane
nemi
miaa patien
pati
pa
patients
tien
en
from the Comprehensive Sickle Cell Centers Collaborative Data (C-data) project. Subjects for
all cohorts were selected based on the following criteria: age >5 years for the HbF measurement,
no hydroxyurea use, and no recent transfusion. In addition, patients in the validation cohorts had
hemoglobin phenotypes similar to that of the discovery set. The demographics of these studies
have been described.14-16 Some characteristics of the sickle cell anemia patients are described in
Table 1. All studies were approved by the institutional review board (IRB) of each participating
institution.
HbF Values
HbF was measured by alkali denaturation in the CSSCD or by high-pressure liquid
chromatography. Within the range of observed HbF, both alkali denaturation and high-pressure
4
DOI: 10.1161/CIRCGENETICS.113.000387
liquid chromatography give similar results. For the CSSCD subjects, longitudinal HbF values
were gathered from phases 1 through 3 of the study, only steady-state values were used (ie,
measurements not taken during an acute event) and summarized by the median of the
longitudinal values.2 Because HbF is known to decrease in early years of life, we only used HbF
measurements at age 5 years or older. The cubic root transformation of HbF was used in all
statistical analyses, to remove asymmetry.
Genotyping
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
DNA from the CSSCD, PUSH, C-Data and Walk-PHaSST samples that
at formed the disc
discovery
cov
ove
and replication cohorts were genotyped at Boston University using Illumina
umina Human610-Quad
Human66100-Q
Qu
ithh ap
it
appr
prox
o im
mately 600,000 SNPs. Samples
Sam
ampl
am
p es were proces
ssed according the
SNP arrays wit
with
approximately
processed
rer’ss protocol and
re
an
nd BeadStudio
Beead
dSt
S udio
io
o Software
Sof
oftwar
are was
wa us
used
ed
d too ma
m
ake ggenotype
enot
otype ca
ot
call
ls util
lizzinn the
manufacturer’s
make
calls
utilizing
e fi
e-defi
fine
nedd clusters.
ne
clus
ustters
ters. Samples
Samp
Samp
mple
less withh lless
le
ess than
ess
than a 995%
5%
% ccall
all rate
al
rat
ra
ate
te were
wer
ere re
rem
moveed an
andd SN
N
Illumina pre-defined
removed
SNPs
r <97.5% were re-cl
lustered.
d After
Aft
f er re-cl
lustering,
g, S
NPs with
NP
wiith
h call
calll rates >97.5%,
>97.5%
with a call rate
re-clustered.
re-clustering,
SNPs
aration
tii score > 0.25,
0 225
5 excess
e cess heterozygosity
hetero
het
gosit
it between
bet
bet een -0.10
0 10
10 and
d 00.10,
10 and
d mi
minor
i
a
cluster separation
allele
frequency > 5% were retained in the analysis. We used the genome-wide identity by descent
analysis in PLINK to discover unknown relatedness.17 Pairs with IBD measurements greater than
0.2 were deemed to be related and related subjects within individual or different studies were
removed. We also removed samples with inconsistent gender findings defined by heterozygosity
of the X chromosome and gender recorded in the database.
Statistical Analysis
Genotype data from the CSSCD were used as the training data set to develop different GRSs.
Initially, the association of each SNP was tested using a linear regression model adjusted for
gender using the additive coding in PLINK17, and the p-values for each association were
5
DOI: 10.1161/CIRCGENETICS.113.000387
computed. No significant associations were found between HbF and the first ten principal
components (PCs) computed using EIGENSOFT.18 Lack of inflation of the associations was
confirmed by the genomic inflation factor 1.003.18 SNPs were sorted by increasing p-values,
starting from the most significant SNP associated with HbF (rs766432, p-value=2.61x10-21), and
the list of SNPs was pruned by removing those SNPs in high linkage disequilibrium (r2 > 0.8). If
two SNPs were found to be in high linkage disequilibrium, the SNP with lowest p-value was
kept and the other SNP was removed. This process was repeated until no SNPs in high linkage
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
disequilibrium remained which left 500,325 SNPs. This pruned and sorted
orted list of SNPs was
wa used
to generate a sequence of unweighted GRS for each subject in the CSSCD
SCD
D by
b cumulatively
cumulativ
l iv
ivel
ely
el
n mbe
num
berr off risk
riskk alleles
a leles (an allele that caus
al
uses
us
es a decrease in H
b ) for each SNP. Th
bF
h first
adding the number
causes
HbF)
The
ded only the m
de
ost ssignificant
iggnif
iffic
i antt SN
SNP
P, the
th
he second
seco
condd G
RS was
was generated
wa
geeneraate
ted by add
dddingg tthe
he
GRS included
most
SNP,
GRS
adding
P from
m th
the so
ortted
d llist
ist of
ist
o S
NPs to the
NP
the ffirst
irst
stt GRS
GRS
R and
and so
so onn using
usi
sin
ing
ng the
the iterative
itterat
attive
ive formula:
f rm
fo
m
second SNP
sorted
SNPs
௡
‫݁ݓ݊ݑ‬
‫݊ݑ‬
‫݃݅݁ݓ‬
‫݃݅݁ݓ‬
݄݅݃
݄‫݁ݐ‬
‫݀݁ݐ‬
݀ ‫ܴܩ‬
‫ܴܵܩ‬
ܵ௜௜,௡
ܴ݅‫݇ݏ‬
‫݈݈݈݁݁ܣ ݇ݏ‬
‫݈݈݁ܣ‬
‫݈ܣ‬
݈݈݁݁
݈݁௜௜,௝,௝௝
‫݀݁ݐ݄݃݅݁ݓ݊ݑ‬
,௡
௡ = ෍ ܴ݅
௝ୀଵ
where n is the number of SNPs, and Risk Allelei,j is the number of risk alleles carried by
individual i for the jth SNP. A sequence of weighted GRS was also generated by weighting each
risk allele as:
௡
‫ܴܵܩ ݀݁ݐ݄݃݅݁ݓ‬௜,௡ = ෍ ‫ݐ‬௝ ‫݈݈݈݁݁ܣ ݇ݏܴ݅ כ‬௜,௝
௝ୀଵ
where tj is the t-statistic to test the association of the jth SNP with HbF. This analysis was
repeated for the first 10,000 SNPs (p-value< .02185) and generated 10,000 GRS, for each of the
subjects in the CSSCD. Each of these GRS was included as covariate in a linear regression
model and the regression coefficients of the resultant 10,000 linear regression models were
6
DOI: 10.1161/CIRCGENETICS.113.000387
estimated using Least Squares methods in the CSSCD data in the R package. The fitted
regression models were used to predict HbF using the formula:
෣௜,௡ = ߚ෢
෢
‫ܨܾܪ‬
଴,௡ + ߚଵ,௡ ‫ܴܵܩ‬௜,௡
෣௜,௡ is the predicted HbF value for individual i using the nth GRS, and ߚ෢
෢
where ‫ܨܾܪ‬
଴,௡ , ߚଵ,௡ are the
estimate of the regression coefficients. Cumulative ensembles of the predictions13, 19-21 were
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
computed using the formula:
௡
1
෣
෍ ‫ܨܾܪ‬
‫ܨ‬పప,ఫ,ఫఫ
݊
௝ୀଵ
Thee ppredictive
reedictive value
vallue ooff the ggenetic
enetic rrisk
isk m
models
odeels an
and
nd th
the
their
eir eensembles
nsemb
m lees wass eevaluated
valluattedd iin the
nd in the three independent
in
nde
d pe
p ndden
ent cohorts. The pproportion
ropo
p rtio
on of variability
variaabili
biil ty
y explained in th
CSSCD, and
the
ensemble off GR
GRS
S models
mode
mo
dels
de
ls in
in the
the CSSCD
CSSC
CS
SCD
SC
D set
set was
was computed
comp
co
mput
mp
uted
ut
ed as
as the
the squared
sqqua
uare
redd Pearson
re
Pear
Pe
arso
ar
sonn co
so
corr
correlation
rrel
rr
elat
el
atii
at
between the predicted and observed values while the predictive accuracy in the independent sets
was evaluated by computing the Pearson correlation between the observed and predicted values
of HbF. The number of SNPs with GRS that maximized the correlation between observed and
predicted values in the three independent cohorts was selected as the optimal number of SNPs.
To evaluate the predictive value of genetic data relative to non-genetic risk factors, a non-genetic
prediction model based on age, gender and the presence of alpha thalassemia was estimated and
the genetic risk models were also adjusted by age, gender and alpha thalassemia. The accuracy
of these additional models was assessed by the correlation between the predicted and observed
HbF values.
7
DOI: 10.1161/CIRCGENETICS.113.000387
Results
Table 1 shows patient characteristics of the four studies. The percent male and HbF
concentration were approximately the same across the Walk-PhaSST, C-Data and CSSCD
cohorts. Average age varied across the four cohorts; the CSSCD studied both children and
adults, C-Data and PUSH were skewed toward pediatric age patients while the Walk-PHaSST
cohort consisted mainly of adults. Patients in the PUSH study were younger and had
significantly higher HbF levels (p-value=0.0016).
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
Figure 1 shows the correlation between the observed and predicted
icted HbF values uusing
sn
si
genetic risk models with unweighted GRS (left panel) and the ensemble
models
le off tthese
h se GR
he
GRS
S mo
mod
d
(right panel)
the
l for
l)
for th
he to
topp 500 SNPs, in each of the validation
vali
lida
li
d tion cohorts. The
da
The
h correlation ranged
between 0.2
and
with
SNP;
GRSs
2 an
nd 0.4 for pprediction
reediccti
tion w
ith
h oonly
nly
ly 1 S
NP;; it
it peaked
pea
eake
kedd att 00.45
ke
.45 ffor
orr pprediction
rediict
c io
i n with
th G
that includee 10 to 115
genetic
and
5 SN
SNPs
NPs
P in
in the
th
he PUSH
PU
USH
H data,
datta,
a with
wit
ithh both
b th the
bo
thee standard
sta
tand
ndar
nd
ard
ar
rd ge
gen
netiic risk
ri models
mod
odel
dells an
a
their ensembles,
numbers
m
mbles,
, and declines ffor
or llarger
arg
ger numbe
b rs off SNPs
SNP in the
th
he models.
mode
d ls
l . Inclusion
Incllusion of more than
50 SNPs decreased
new SNPs
GRS
ecreasedd th
the
h correlation
elati
l io eeven
en ffurther.
rther
th
h While
Whil
Wh
il the
th
h incl
iinclusion
ncll sion
io off ne
SNP in
in the
thhe GR
G
R
had substantial effects on the predictive accuracy of the genetic risk model, as shown by the up
and down pattern from one model to the next (left panel of Figure 1), the accuracy of the
ensemble of these GRS models was more stable. The results using the weighted GRS were
similar (Supplementary Figure 1).
An ensemble of the first 14 GRS models had the highest average correlation among all
three data sets and explained 23.4% of the variability in HbF in the CSSCD cohort. The
correlation between observed and predicted HbF using the ensemble of 14 GRS models was
0.44, 0.28 and 0.39 in the PUSH, Walk-PHaSST and C-Data cohorts, respectively. Of these 14
SNPs, 5 were located in BCL11A; other SNPs were located in the olfactory receptor region on
8
DOI: 10.1161/CIRCGENETICS.113.000387
chromosome 11p15 and the site of the HBB gene cluster, and in the HBS1L-MYB interval on
chromosome 6q and were found previously to be associated with HbF.2, 22, 23 Table 2 shows
details of these 14 SNPs.
Adding non-genetic risk factors age and gender to the GRS models did not increase the
amount of variability explained of HbF in the Howard and Walk-PHaSST cohorts. The nongenetic prediction model which included information on age, alpha thalassemia and gender only
explained 6.8% of the variability in HbF in the CSSCD cohort.
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
It is noticeable that the genetic risk models had consistently higher
gher pr
ppredictive
edictive acc
accuracy
cur
u a in
the PUSH cohort in comparison with the C-Data and Walk-PHaSST cohorts.
ohortts. The
Th age
g
distributionn off the
t e CSSCD
th
C SC
CS
S D and PUSH cohorts was skewed
skeewed toward chil
sk
children
ildr
il
d en and young
adolescentss wh
while the agee ddistribution
isttriibu
butiion off Walk-PHaSST
W lk-P
Wa
PHaSS
SST cohort
coho
h rt was
waas skewed
skkew
ewed
ed
d tow
toward
wardd aadult
duultt pa
patients
a
e
entary
y Figure
Figgure
Fi
re 22).
). E
xact age
xa
gee of pa
ati
tien
ents
en
ts was
as not
ot available
ava
vaiilab
va
able
ab
le for
for the
the C-Data
C-D
Data
t cohort.
cohhor
ortt.
(Supplementary
Exact
patients
Discussion
In African Americans with sickle cell anemia, HbF level does not stabilize until the age of 5
years.2 Early prediction of stable adult HbF levels might help foresee some complications of
sickle cell anemia and aid in its clinical management by guiding the decision of how vigorously
to pursue HbF induction therapies. Our goal was to identify methods that could combine SNPs to
provide a better prediction of HbF levels than single SNP analysis. We developed an ensemble of
GRS that predicted HbF in 3 independent cohorts. In an ensemble of GRS models, as few as 14
SNPs explained a larger fraction of the variability in HbF and better predicted HbF levels when
compared with single SNP analysis, or a single GRS model. Even though the ensemble of GRS
explains 23.4% of the variability in HbF in the CSSCD cohort, it does not explain the totality of
the variability in HbF that is due to heritability, estimated to be between 60% and 90%.5, 24 This
9
DOI: 10.1161/CIRCGENETICS.113.000387
missing heritability could be due to gene-gene interactions, epigenetic factors or multiple rare
variants with small effects that GWAS are poorly designed to detect.25 Our GRS included SNPs
with a MAF > 5%; however, as sickle cell anemia is a rare disease it is possible that some major
genetic modifiers are rare variants with a lower allele frequency. Next generation sequencing
might discover rare alleles that could be incorporated into a GRS to increase prediction accuracy.
Increasing the number of genetic variants in the GRS may increase the total explained variability
of HbF26; however, this could lead to overfitting and as one continues to add more genetic
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
variants to the GRS the prediction accuracy in independent cohorts will
ll decrease (as sho
shown
own
w in
Figure 1). The prediction accuracy of our weighted GRS model was simila
similar
l r ttoo that
that off oour
ur
unweighted
d GRS
RS model.
moddel
e . We
We hypothesize that this is due to the fact tha
that
haat weighted GRS mo
models
perform better
tterr when theree is a ddifference
tt
ifffeerencce in
n tthe
he ge
genetic
eneetiic eeffects;
ffec
ff
ectss; ho
ec
hhowever,
oweve
veer, Table
T blee 2 sshows
Ta
hows that
ho
th the
regression coeff
coefficients
ffic
ff
icie
ic
cie
ient
nts an
nt
andd st
standard
tan
anda
d rdd eerrors
rrors fr
rr
ffrom
om
m tthe
he G
he
GWAS
W S off our
WA
ur ttop
op S
SNPs
NPss are
NP
ar al
all
ll ve
very
ry similar.
ssim
im
An interesting
i
g resul
result
lt off our analy
analysis
lysiis iiss th
the
he sy
systematically
ystematically
ly higher
hiiggher correlation betweenn
predicted and
ndd obser
ob
observed
b
ed
dH
HbF
bF values
all es in
in the
thhe PUSH
PUS
USH
H study.
st d This
Thi res
result
lltt might
ighht suggest
s ggest
st that
thhat the
th
h ggenetic
models predict more accurately in children and young adults. Blood counts decline with
advancing age in sickle cell disease and could reflect decades of bone marrow damage due to
sickle vasoocclusion with relative bone marrow “failure”.27-29 Perhaps the higher correlation
between the predicted and observed HbF levels in the younger patients of the PUSH cohort is a
result of gene x environment interaction, where the genetic elements regulating HbF production
are not impeded by the erythroid bone marrow injury associated with aging in the older cohorts.
However, the difference in prediction accuracy could also be due to unobservable patient
characteristics not accounted for in the model. For example, the CSSCD was conducted before
the establishment of hydroxyurea as standard therapy for sickle cell anemia patients, and there
10
DOI: 10.1161/CIRCGENETICS.113.000387
may be survivor effects in more contemporary cohorts that are not accounted for in older cohorts.
Testing these results in additional contemporary cohorts of sickle cell anemia will be necessary
to explain the result.
The 14 SNP model included SNPs in the BCL11A region, the HBS1L-MYB intergenic
region and SNPs in the olfactory receptor gene cluster 5’ to the HBB gene complex. BCL11A
down regulates HBG expression 30, 31, the HBS1L-MYB intergenic region might affect
erythropoiesis and modulate BCL11A expression while the olfactory receptor region might
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
control expression with the HBB gene-like cluster.32
There are alternate methods of genetic prediction that we did not
ot explore
expplore in
in this
th paper
pap
33 34
including combining
omb
omb
mbin
inin
in
ingg SNP
in
S P information into a haplotype-based
SN
haplo
oty
typpe-based analysis
analysis,
is,,33,
is
multivariate
regression models
moddels and machine
maachin
ne learning
lear
a ningg type
ar
typ
pe approaches
ap
pproaachess suc
such
uchh ass suppo
uc
support
poortt vvector
ecto
or m
machines
acchin
nes aand
5, 366
338-42
8-42
cross validation
a
ation
(CV)
(CV
CV))335,
, principal
prin
pr
i ci
in
cipal
i l co
component
omponnen
ent aanalysis
nal
allys
ysis
i 377, and
is
an
nd Ba
Baye
Bayesian
yesiian net
yesi
networks
etwo
et
w rkss38
wo
. In
In our
analysis, wee used traditionall regression
regr
g ession
i bbased
ased
dm
models
oddels
l and ensemble
ensemb
ble off th
these
hese models: we tra
trained
a
the models in
i the
th
h discovery
disco
di
er cohorts
hort andd examined
e amined
ined
d th
the
he prediction
dictiio res
di
results
llts
t in
in independent
indd
dent cohorts
coh
ohh
to determine which model predicts most accurately. We investigated using 10 fold cross
validation (CV) method to select the best number of SNPs in the discovery set. However, we
noted in a large simulation study that using 10 fold cross validation tends to underestimate the
true number of SNPs (Supplementary Information; Supplementary Figure 3), and noted that an
ensemble of regression models appear to be more robust for prediction. This result is consistent
with published literature.43. Since the 3 test sets were used to determine the optimal number of
SNPs, replication of the results in additional independent sets is necessary to confirm the results.
However, the substantial agreement of prediction of the ensemble of 14 models in the 3 sets is
reassuring that this result is robust.
11
DOI: 10.1161/CIRCGENETICS.113.000387
The ensemble of GRS models explained 23.4% of the variability in HbF in the CSSCD
data and the correlation between predicted and observed HbF values in the 3 independent sets
ranged from 0.28 to 0.44. These numbers are higher than results reported in the literature when
GRS have been used as a predictive tool. For example, a study of 3,575 subjects from the
Doetinchem Cohort Study computed a GRS to predict plasma total cholesterol levels which are
highly genetically determined with a heritability estimated to be 40 to 60%.44, 45 Using 12 SNPs
they were able to explain 6.9% of the total variability in total cholesterol levels.46 Participants in
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
the Atherosclerosis Risk in Communities cohort of 10,745 individuals were used to construct
const
stru
st
r a
GRS of obesity in order to predict BMI.47, 48 The obesity GRS showedd a correlation
correllatio
ti n wi
with
ith BMI
BM of
r=0.12 for the
t uunweighted
nwei
nw
e gh
ei
ghtedd GRS model and r=0.13 ffor
or the weighted mo
model. Our ensemble of
o
GRS models
ls of
ls
of HbF can explain
exxplain
in more
more phenotype
p en
ph
notyppe va
variability
ariiabil
ilit
il
ityy and
it
andd hhad
ad a high
higher
gh
her ppredictive
reedictivee
accuracy in
n comp
comparison
mp
par
aris
ison wi
is
with
ith
t ssingle
ingl
ingl
gle SN
SNP
P anal
analyses.
alys
al
ysees.
ys
Onee of the maj
major
jor ggoals
oalls off GW
G
GWAS
A was to id
AS
identify
dentiiffyy genetic
genetic
i variants
variiants that
th
hat are associated with
disease or meas
measures
severity
personalized
medicine.
Many
studies
res off disease
dis
se
erit
it to be
b used
sed
ed
d in
i personali
ali
li edd medicine
edi
diciin M
Man
a st
t di
dies have
shown the importance of including genetic variants beyond those that meet the genome-wide
association threshold of 5x10-08 but many of these SNPs associations may be false positives and
lower the accuracy of a prediction model.49, 50 Our study shows that the use of an ensemble of
genetic risk models is robust to inclusion of false positive associations and the approach may
prove useful in other applications.
Funding Sources: This work was supported by National Institutes of Health Grants
R21HL114237 (PS), R01 HL87681 (MHS), RC2 L101212 (MHS), 5T32 HL007501 (JNM),
2R25 HL003679-8 (VRG), R01 HL079912 (VRG), 2M01 RR10284-10 (VRG), R01HL098032
(MTG), R01HL096973 (MTG), P01HL103455 (MTG), the Institute for Transfusion Medicine
and the Hemophilia Center of Western Pennsylvania (MTG), U54 HL70583. The C-Data
Project of the Comprehensive Sickle Cell Centers, U54HL70583 and NIH intramural funding 1
ZIA HL005116 and 1 ZIA HL006016
12
DOI: 10.1161/CIRCGENETICS.113.000387
Conflict-of Interest Disclosure: The authors declare no competing financial interests.
References:
1. Steinberg MH, Nagel RL, Forget BG, Higgs DR, Weatherall DJ. Genetic Modulation of Sickle
Cell Disease and Thalassemia. In: M. H. Steinberg, B. G. Forget, D. R. Higgs and D. J.
Weatherall, eds. Disorders of hemoglobin: Genetics, Pathophysiology and Clinical Management
Cambridge: Cambridge University Press; 2009: 638-657.
2. Solovieff N, Milton JN, Hartley SW, Sherva R, Sebastiani P, Dworkis DA, et al. Fetal
hemoglobin in sickle cell anemia: genome-wide association studies suggest a regulatory region
in the 5' olfactory receptor gene cluster. Blood. 2010;115:1815-1822.
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
e. Am J Hematol.
3. Steinberg MH, Sebastiani P. Genetic modifiers of sickle cell disease.
2012;87:795-803.
L et al.
al Gender
Gen
ende
derr an
de
andd
4. Steinberg MH, Hsu H, Nagel RL, Milner PF, Adams JG, Benjamin L,
eeffects
ect
ctss up
upon
on hem
ematological manifestations
em
ns of
of adult sickle cell
cel
ell anemia. Am J Hem
m
haplotype effe
hematological
Hematol.
75-18
5-- 81.
1995;48:175-181.
C T
atu T, R
at
e ttiee JE,
ei
E,, Lit
ttl
tlew
e oo
oodd T
arleyy J, C
ervi
vino
no S
l. G
enettic inf
nfluen
nf
ncess on F
5. Garner C,
Tatu
Reittie
Littlewood
T,, D
Darley
Cervino
S,, et aal.
Genetic
influences
t
ther
hhematologic
he
ema
m tolo
ma
ogiic va
riab
ri
ia les:
s: a twinn he
heri
r ta
ri
tabi
bili
bi
liity
t sstudy.
tudy
tu
dy. B
dy
lood.
d 20
22000;95:342-346.
0000;9
; 5:
5 34422 34
3466.
cells and other
variables:
heritability
Blood.
S, Thein SL. Genetic
Genetiic ar
chit
hi ecture of hhemoglobin
emogl
gllobin
b F control.
controll. C
urr Opin
Opi
p n Hematol.
6. Menzel S,
architecture
Curr
79-186.
2009;16:179-186.
7. Bae HT, Baldwin CT, Sebastiani P, Telen MJ, Ashley-Koch A, Garrett M, et al. Meta-analysis
of 2040 sickle cell anemia patients: BCL11A and HBS1L-MYB are the major modifiers of HbF
in African Americans. Blood. 2012;120:1961-1962.
8. Gibson G. Rare and common variants: twenty arguments. Nat Rev Genet. 2011;13:135-145.
9. Meigs JB, Shrader P, Sullivan LM, McAteer JB, Fox CS, Dupuis J, et al. Genotype score in
addition to common risk factors for prediction of type 2 diabetes. N Engl J Med. 2008;359:22082219.
10. Purcell SM, Wray NR, Stone JL, Visscher PM, O'Donovan MC, Sullivan PF, et al. Common
polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature.
2009;460:748-752.
11. Paynter NP, Chasman DI, Pare G, Buring JE, Cook NR, Miletich JP, et al. Association
between a literature-based genetic risk score and cardiovascular events in women. JAMA.
2010;303:631-637.
13
DOI: 10.1161/CIRCGENETICS.113.000387
12. Sebastiani P, Solovieff N, Sun JX. Naive Bayesian Classifier and Genetic Risk Score for
Genetic Risk Prediction of a Categorical Trait: Not so Different after all! Front Genet.
2012;3:26.
13. Sebastiani P, Solovieff N, Dewan AT, Walsh KM, Puca A, Hartley SW, et al. Genetic
signatures of exceptional longevity in humans. PLoS ONE. 2012;7:e29848.
14. Gaston M, Rosse WF. The cooperative study of sickle cell disease: review of study design
and objectives. Am J Pediatr Hematol Oncol. 1982;4:197-201.
15. Dham N, Ensing G, Minniti C, Campbell A, Arteta M, Rana S, et al Prospective
echocardiography assessment of pulmonary hypertension and its potential etiologies in children
with sickle cell disease. Am J Cardiol. 2009;104:713-720.
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
16. Machado RF, Barst RJ, Yovetich NA, Hassell KL, Kato GJ, Gordeuk
euk VR, et al.
Hospitalization for pain in patients with sickle cell disease treated with
h ssildenafil
ilde
il
dena
de
nafi
na
fill fo
fi
forr eelevated
lev
vatt
TRV and low exercise capacity. Blood. 2011;118:855-864.
17. Purcell S, Neale
B,, To
Todd-Brown
Bender
Nea
eale
lee B
Todd
d -Brown K, Thomas L, Ferreira
Fe
MA, Ben
nde
der D, et al. PLINK: A
Tool Set forr Whole-Genome
Association
Population-Based
Hum
Whole-G
Geennome
om
m As
A
sso
soociat
atio
at
ionn and
io
a d Po
an
P
p laationn-Ba
pu
Base
Ba
s d Linkage
Link
Li
nkag
nk
ag
age
g Analyses.
Anal
An
alys
al
y ess. Am J H
ys
Genet. 2007;81:559-575.
7;811:559-575.
7;
18. Price AL,
NJ,
Plenge
RM,
Principal
L, Pa
P
Patterson
att
ttter
ersson NJ
J, Pl
Plen
enge
ge R
M, Weinblatt
Wei
einb
ei
nbla
nb
bla
l tt ME,
ME, Shadick
Sha
hadi
dicck NA
di
NA an
andd Reich
Reiic
Re
ich D. P
ich
Principa
riinc
nciipa
ip
componentss analysis corrects
genome-wide
Nat Genet.
corre
r ectss fo
forr st
sstratification
rati
ra
tiifi
fica
caati
tion
on iin
n ge
geno
nome
no
m -w
me
wid
idee as
aassociation
sooci
c at
a io
ionn studies.
st
Genn
2006;38:904-9.
0
04-9.
19. Breiman
L. B
Predictors.
Machine
Learning.
nL
Bagging
gii Predictors
P di
dict
M
hi L
Learning
in 11996;24:123-140.
1996;24:123
9966;24
99
24:123
123 1140
40
20. Mevik B-H, Segtnan VH, Næs T. Ensemble methods and partial least squares regression. J
Chemom. 2004;18:498-507.
21. Rokach L. Ensembled-based classifiers. Art Intell Review. 2010;33:1-39.
22. Sedgewick AE, Timofeev N, Sebastiani P, So JC, Ma ES, Chan LC, et al. BCL11A is a
major HbF quantitative trait locus in three different populations with beta-hemoglobinopathies.
Blood Cells Mol Dis. 2008;41:255-258.
23. Uda M, Galanello R, Sanna S, Lettre G, Sankaran VG, Chen W, et al. Genome-wide
association study shows BCL11A associated with persistent fetal hemoglobin and amelioration
of the phenotype of beta-thalassemia. Proc Natl Acad Sci U S A. 2008;105:1620-1625.
24. Pilia G, Chen WM, Scuteri A, Orru M, Albai G, Dei M, et al. Heritability of cardiovascular
and personality traits in 6,148 Sardinians. PLoS Genet. 2006;2:e132.
14
DOI: 10.1161/CIRCGENETICS.113.000387
25. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, et al. Potential
etiologic and functional implications of genome-wide association loci for human diseases and
traits. Proc Natl Acad Sci U S A. 2009;106:9362-9367.
26. Galarneau G, Palmer CD, Sankaran VG, Orkin SH, Hirschhorn JN, Lettre G. Fine-mapping
at three loci known to affect fetal hemoglobin levels explains additional genetic variation. Nat
Genet. 2010;42:1049-1051.
27. West MS, Wethers D, Smith J, Steinberg M. Laboratory profile of sickle cell disease: a
cross-sectional analysis. The Cooperative Study of Sickle Cell Disease. J Clin Epidemiol.
1992;45:893-909.
28. Morris J, Dunn D, Beckford M, Grandison Y, Mason K, Higgs D, et al. The haematology of
homozygous sickle cell disease after the age of 40 years. Br J Haematol. 1991;77:382-385.
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
29. Hayes RJ, Beckford M, Grandison Y, Mason K, Serjeant BE, Serjeant
GR.
eant
ea
nt G
R. The
The haematology
haema
m t
of steady state homozygous sickle cell disease: frequency distributions,
s, variation
varia
i ti
tion with
wit
ithh age
it
age and
a
sex, longitudinal observations. Br J Haematol. 1985;59:369-382.
30. Sankaran
VG,
Menne
TF,
Xu
Akie
TE,, Le
Lettre
Handel
B,, et aal.
Human
an V
an
G, Men
nnee T
F,, X
u J,
J A
kiee TE
ki
T
L
t re G,
tt
G, Van
Vaan Ha
and
n el B
l. H
u an ffetal
um
ettal
a
hemoglobinn expression
stage-specific
BCL11A.
exxpression iss regulated
regul
ullated
ed
d by
by the
t e developmental
th
deveelo
opm
mentaal st
stag
a e--sppecif
ag
i ic rrepressor
ep
presssor B
CL11A
CL
A
Science. 2008;322:1839-1842.
088;3
;322:183399 188422.
31. Chen Z,, Luo HY, Steinberg
MH,, Ch
BCL11A
transcription
K562
Steiinnbberrg MH
M
Chui
uii DH.
DH. B
CL11
CL
111A represses
reepr
p es
esse
s s HB
se
HBG
G tr
ran
anscription in K
cells. Blood
d Cells Mol Dis. 20
22009;42:144-149.
09;4
09
;4
42:14
44-1449.
32. Feingold
EA,
Penny LA
LA,
Nienhuis
BG.
olfactory
located
ld
dE
EA
A Penn
P
L
A Nienh
Nienhh is
i AW,
AW Forget
Fo et BG
B
G An
A olfactor
olf
lfact
receptor
pt gene iiss llocat
at in
the extended human beta-globin gene cluster and is expressed in erythroid cells. Genomics.
1999;61:15-23.
33. Hughes MF, Saarela O, Stritzke J, Kee F, Silander K, Klopp N, et al. Genetic Markers
Enhance Coronary Risk Prediction in Men: The MORGAM Prospective Cohorts. PLoS ONE.
2012;7:e40922.
34. Onuki R, Shibuya T, Kanehisa M. New kernel methods for phenotype prediction from
genotype data. Genome Inform. 2010;22:132-141.
35. Wei Z, Wang K, Qu HQ, Zhang H, Bradfield J, Kim C, et al. From disease association to risk
assessment: an optimistic view from genome-wide association studies on type 1 diabetes. PLoS
Genet. 2009;5:e1000678.
36. Wu C, Walsh K, DeWan A, Hoh J, Wang Z. Disease risk prediction with rare and common
variants. BMC Proceedings. 2011;5:S61.
15
DOI: 10.1161/CIRCGENETICS.113.000387
37. Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, et al. A flexible
computational framework for detecting, characterizing, and interpreting statistical patterns of
epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006;241:252-261.
38. Rodin AS, Boerwinkle E. Mining genetic epidemiology data with Bayesian networks I:
Bayesian networks and example application (plasma apoE levels). Bioinformatics.
2005;21:3273-3278.
39. Sebastiani P, Ramoni MF, Nolan V, Baldwin CT, Steinberg MH. Genetic dissection and
prognostic modeling of overt stroke in sickle cell anemia. Nat Genet. 2005;37:435-440.
40. Jiang X, Barmada MM, Cooper GF, Becich MJ. A Bayesian method for evaluating and
discovering disease loci associations. PLoS ONE. 2011;6:e22075.
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
41. Kang J, Zheng W, Li L, Lee J, Yan X, Zhao H. Use of Bayesian networks
etworks to dissect the
the
h
complexity of genetic disease: application to the Genetic Analysis Workshop
orrks
orks
ksho
hopp 17 simulated
ho
sim
imul
ulat
ul
atted
ed data.
BMC Proceedings. 2011;5:S37.
42. Sebastiani
P,, Perls
data.
a P
ani
Perl
Pe
rlss TT.
rl
TT
T. Prediction models that iinclude
nclu
l de genetic dat
ta. Circ Cardiovasc Genet.
Gee
2010;3:1-2..
43. Hartley SW,
Monti
Liu
CT,
Steinberg
Sebastiani
P.. Bayesian
SW
W, Mont
ti S, L
iu
uC
T, S
t inbe
te
b rg MH,
H Seb
ebassti
tianii P
Bay
yessiaan methods
method
ods for
od
foor
multivariatee modeling
SNP
mod
odel
od
elin
el
i g off pleiotropic
in
pleio
leio
i trop
trop
opic S
NP associations
ass
ssoocia
ss
i ti
ia
tion
onss and
and genetic
geeneti
ticc ri
ti
risk
iskk pprediction.
redi
dict
di
ctiion. Front
ct
Frontt Genet.
G
2012;3:176.
6
6.
44. Lusis AJ,
part
genes
AJ, Fogelman
AJ
Fogelm
man AM,
AM,
M, Fonarow
Fonaro
Fo
Fona
naaro
row
w GC.
GC. Genetic
Geeneti
Gene
netic
tiic basis
baasi
basi
s s of atherosclerosis:
ath
ther
heros
oscl
os
cler
cler
eros
osis
os
iss: pa
is:
art II:: new ge
en
and pathways.
2004;110:1868-1873.
a s Circulation.
Ci l tii
2004;110:1868
2004
20
04;110
110:186
18688 1873
1873
45. Verschuren WM, Blokstra A, Picavet HS, Smit HA. Cohort profile: the Doetinchem Cohort
Study. Int J Epidemiol. 2008;37:1236-1241.
46. Lu Y, Feskens EJ, Boer JM, Imholz S, Verschuren WM, Wijmenga C, et al. Exploring
genetic determinants of plasma total cholesterol levels and their predictive value in a longitudinal
study. Atherosclerosis. 2010;213:200-205.
47. Belsky DW, Moffitt TE, Sugden K, Williams B, Houts R, McCarthy J, et al. Development
and evaluation of a genetic risk score for obesity. Biodemography Soc Biol. 2013;59:85-100.
48. Folsom AR, Chambless LE, Ballantyne CM, Coresh J, Heiss G, Wu KK, et al. An
assessment of incremental coronary risk prediction using C-reactive protein and other novel risk
markers: the atherosclerosis risk in communities study. Arch Intern Med. 2006;166:1368-1373.
49. Kooperberg C, LeBlanc M, Obenchain V. Risk prediction using genome-wide association
studies. Genet Epidemiol. 2009;34:643-652.
16
DOI: 10.1161/CIRCGENETICS.113.000387
50. Yang J, Manolio TA, Pasquale LR, Boerwinkle E, Caporaso N, Cunningham JM, et al.
Genome partitioning of genetic variation for complex traits using common SNPs. Nat Genet.
2010;43:519-525.
Table 1 Patient characteristics. Summary statistics of patient characteristics in the CSSCD
cohort, and the PUSH, WALK-PHaSST and C-Data replication cohorts. For each cohort
statistics (mean and standard deviate or frequencies) are reported for all patients included in the
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
analysis. The last row reports the frequency and proportion of individuals
uals with gene
gene deletion
dele
leti
le
to
alpha thalassemia (at least one alpha gene deletion).
CSSCD
C
CS
S D
SC
PUSH
PU
USH
H
Walk-PHaSST
Walk
Wa
lk-P
lk
PHaS
aSST
aS
ST
T
C-Data
C-Da
CD t
(N=841)
(N=8
=841
=8
4 )
(N=77)
(N
N=77)
7
7)
((N=181)
(N
=1881))
(N=127)
(N=
=122
Mean
an (StD)
(StD))
Mean
an (StD)
(StD)
D))
Mean
n (StD)
(StD))
Mean (StD)
(S
S
Age (years)
17.1
17
17.19(10.69)
19(
9(10
100.69)
10.6
.6
69
12.4
12
12.49
.499 (4
.4
(4.6
(4.69)
69)
36
36.35
6.3
.355 (12.54)
(12.
(1
2.54
2.
54))
54
13-177
Gender (% male)
53.7%
50.6%
51.9%
55.9%
HbF (%)
6.65 (5.50)
9.81 (8.23)
6.07 (5.60)
7.59(5.09)
Hemoglobin (d/mL)
8.42 (1.33)
8.62 (1.25)
8.50 (1.69)
NA
ĮWKDODVVHPLD\HV
143 (31.4%)
21 (27.2%)
46 (25.4%)
NA
17
DOI: 10.1161/CIRCGENETICS.113.000387
Table 2 Summary of SNPs in prediction model. GWAS results of the 14 SNPs in the best GRS
prediction model in the CSSCD cohort. ȕ=estimate of genetic effect from additive model; SE=
standard error.
SNP
Gene
Coding
ȕ
SE
pvalue
Allele
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
rs766432
Chr 2; BCL11A
C
0.25
0.02
1.22E-22
rs10195871
Chr 2; BCL11A
A
0.21
0.02
3.30E-18
rs6706648
Chr 2; BCL11A
A
-0.19
0.02
1.30E-16
rs6709302
Chr 2; BCL11A
A
-0.14
0.02
0 02
0.
3.69E-08
3.69
3.
69
rs9494145
Chr 6; intergenic
G
0.24
0.05
0.05
1.80E-07
1 80
1.
rs6732518
Chrr 2; BC
C
Ch
BCL11A
CL1
L 1A
G
0.13
0.02
1.86E-07
1.866
rs6446085
Chr 3; FHIT
FH T
A
-0.11
-0.
0 11
1
0.02
0.02
02
9.43E-07
9.433
rs10152034
4
Chr 14;
14 in
interg
intergenic
genicc
A
-0.11
-0.11
1
11
0.033
1.21E-05
1.211
rs17114175
5
Chr 14; in
intergenic
nte
terg
rggen
e icc
C
-0.16
-0.
0.16
16
0.02
0 02
0.
1.59E-06
1.599
rs2855039
Chrr 11;
Ch
11; HBG1,
HBG1
HB
G1,, HBG2
HBG2
G
0.18
0.18
0.04
0.04
2.24E-06
2.24
2.
244
rs2239580
Chrr 14;
Ch
14; CO
COCH
CH
A
-0.13
-00.13
.13
13
0.03
0.03
03
4.46E-06
44.46
.446
rs5006883
Chr 11; OR51B5,OR51B6
G
0.17
0.04
5.32E-06
rs9525079
Chr 13; UGGT2
G
0.13
0.03
6.28E-06
rs416586
Chr 11; OR51A
G
0.02
6.37E-06
rs11794652
Chr 9; FUBP3
A
-0.10
0.02
6.51E-06
rs12469604
Chr 2; intergenic
A
0.19
0.04
8.31E-06
rs6932510
Chr 6; RPS6KA2
A
0.16
0.04
8.52E-06
rs7113817
Chr 11; intergenic
A
0.17
0.04
9.88E-06
rs10837814
Chr 11; OR51B2,OR51B3P
A
0.14
0.03
1.39E-05
rs2021966
Chr 6; intergenic
G
0.10
0.02
1.79E-05
18
0.13
DOI: 10.1161/CIRCGENETICS.113.000387
Figure Legend
Figure 1. Correlation between observed and predicted values for increasing number of SNPs in
the unweighted GRS. Plot of the correlation between the observed and predicted HbF in the
three independent cohorts versus the number of SNPs in the top 50 unweighted GRS models (left
panel) and ensemble of unweighted GRS models (right panel). The vertical bars are at No.
SNP=14.
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
19
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
Prediction of Fetal Hemoglobin in Sickle Cell Anemia Using an Ensemble of Genetic Risk
Prediction Models
Jacqueline N. Milton, Victor R. Gordeuk, James G. Taylor, Mark T. Gladwin, Martin H. Steinberg
and Paola Sebastiani
Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017
Circ Cardiovasc Genet. published online March 1, 2014;
Circulation: Cardiovascular Genetics is published by the American Heart Association, 7272 Greenville Avenue, Dallas,
TX 75231
Copyright © 2014 American Heart Association, Inc. All rights reserved.
Print ISSN: 1942-325X. Online ISSN: 1942-3268
The online version of this article, along with updated information and services, is located on the
World Wide Web at:
http://circgenetics.ahajournals.org/content/early/2014/02/28/CIRCGENETICS.113.000387
Data Supplement (unedited) at:
http://circgenetics.ahajournals.org/content/suppl/2014/03/01/CIRCGENETICS.113.000387.DC1
Permissions: Requests for permissions to reproduce figures, tables, or portions of articles originally published in
Circulation: Cardiovascular Genetics can be obtained via RightsLink, a service of the Copyright Clearance Center,
not the Editorial Office. Once the online version of the published article for which permission is being requested is
located, click Request Permissions in the middle column of the Web page under Services. Further information about
this process is available in the Permissions and Rights Question and Answer document.
Reprints: Information about reprints can be found online at:
http://www.lww.com/reprints
Subscriptions: Information about subscribing to Circulation: Cardiovascular Genetics is online at:
http://circgenetics.ahajournals.org//subscriptions/
SUPPLEMENTAL MATERIAL
Supplementary Figure 1. Scatterplot of the correlation versus the Number of SNPs in the
weighted GRS model. Plot of the correlation between the observed and predicted HbF in the
three independent cohorts versus the number of SNPs in the top 50 weighted GRS and ensemble
weighted GRS models.
1
Supplementary Figure 2. Histograms of the age distributions for the CSSCD, PUSH and WalkPHaSST cohorts.
2
Supplement Figure 3. Side by side boxplots of the correlation between the observed and
predicted phenotype (y-axis) versus the number of SNPs in the model (x-axis) for the GRS
model (left panel) and ensemble of GRS models (middle panel). The right panel shows the
distribution of the optimal number of SNPs chosen to maximize prediction using 10-fold cross
validation. The results are from a simulation study with 1,000 simulated genotype data sets, each
3
with 1,000 individuals. In each dataset, a continuous phenotype with a heritability of 0.40 was
simulated with 30 causal SNPs contributing equally to the genetic effect. The continuous
phenotype was simulated to have a variability equal to 1/4 the mean of the phenotype (low
variability, row 1), 1/2 the mean of the phenotype (medium variability, row 2) and 3/4 the mean
of the phenotype (high variability, row 3). To evaluate the predictive accuracy, each set of 1,000
individuals was randomly split into a training set (N=900) and test set (N=100), and each
training set was used to build the GRS and the ensemble of GRS models that were tested in the
test set. When 10-fold cross-validation was used, each dataset of 1,000 individuals was
randomly partitioned into 10 equal parts where each part was used as test set of the models
trained in the other 9 parts, and the results were averaged over the 10 folds. The simulation
study shows that for both the GRS and ensemble GRS methods the correlation peaks at 30 SNPs;
however, when more than 30 SNPs are included in the GRS model, the correlation rapidly
decreases while the decrease in correlation is more gradual for the ensemble of GRS
models. The results suggest that if one were to incorrectly choose the wrong model, the
ensemble of GRS models would result in smaller loss of prediction accuracy than a single GRS
model. In addition, selection of the optimal number of SNPs using cross-validation may be too
conservative and produce sub-optimal models. As expected, the prediction accuracy decreases
as the variability of the phenotype increases.
4