DOI: 10.1161/CIRCGENETICS.113.000387 Prediction of Fetal Hemoglobin in Sickle Cell Anemia Using an Ensemble of Genetic Risk Prediction Models Running title: Milton et al.; Predicting Fetal Hemoglobin in Sickle Cell Anemia Jacqueline N. Milton, PhD1; Victor R. Gordeuk, MD2; James G. Taylor VI, MD3; Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 Mark T. Gladwin, MD4; Martin H. Steinberg, MD5; Paola Se Seba Sebastiani, bast ba stia st iani ia n , Ph ni PhD D1 1 Department n of nt of Biostatistics, Biosta Bi tati t sticcs, Boston University Schooll ooff Public Healt Health, th, B Boston, os oston, MA; 2Center for Sickle Cell Disease, se, H se Howard oward a Uni University, niveersit i y, Washington, Was ash hing ngttonn, DC ng DC; C; 3Sick Sickle ckle C Cell elll V Vascular asscuular D Disease isea eaase S Section, ecti ec t on, ti on He Hem Hematology matt Branch, National a on ation onal a Heart, Heartt, Lung, Lu ung, and an nd Blood Blood In Bloo Inst Institute, tituute, e,, B Bethesda, ethhesdaa, MD MD; D; 4Di Division ivision onn off Pulmonary, Pulm mon onaary, A Allergy llergg and Critical Care Medicine an and d Vascul Vascular ular ul a Medicine Institute, Instittut u e, Univ University iver ve sity y of Pittsburgh, Piitttsb s urgh g , Pittsburgh, PA; P 5 Department off Medicine, M d Me diiciine ne, Boston B ston Bo stton U University nive ni ive v rs rsit ityy School Scchool S ol ooff Me M Medicine, edi dici di cine ine ne, Boston, B ston, MA Bo Correspondence: Jacqueline N. Milton Department of Biostatistics Boston University 801 Massachusetts Ave. Boston, MA 02118 Tel: (617) 414-7944 Fax: (617) 638-6484 E-mail: [email protected] Journal Subject Codes: [109] Clinical genetics, [146] Genomics, [98] Other research 1 DOI: 10.1161/CIRCGENETICS.113.000387 Abstract Background - Fetal hemoglobin (HbF) is the major modifier of the clinical course of sickle cell anemia. Its levels are highly heritable and its interpersonal variability is modulated in part by three quantitative trait loci (QTL) that effect HbF gene expression. Genome-wide association studies (GWAS) have identified single nucleotide polymorphisms (SNPs) in these QTLs that are highly associated with HbF but explain only 10 to 12% of the variance of HbF. Combining SNPs into a genetic risk score (GRS) can help to explain a larger amount of the variability of HbF level but the challenge of this approach is to select the optimal number of SNPs to be included in the Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 GRS. Methods and Results - We develop a collection of 14 models with GRS RS composed com ompo pose po sedd of different se diff di ffeer ff er numbers of SNPs, and use the ensemble of these models to predict HbF F in si sick sickle ckle ck le ccell elll an el anem anemia e em patients. The h models he mode mo dels de l were ls werre trained in 841 sickle cell lll anemia annemia patients and and were tested in three three independent n co nt cohorts. ohorts. The Th en ensemble nsem embl em blee off 114 bl 4m models ode d ls explained exp xplaain ined ed d 23.4% 23. 3.4% 4% ooff thee vvariability ari riiab abil iliit il ity in ity n HbF Hb in the discovery ryy cohort, co while w ilee the wh th he correlation corr rrel rr elatio el on between b tw be weeen pr pred predicted dic ictedd an and nd observed obsserrved ed d HbF F inn tthe he 3 independent n cohorts ranged nt rangeed be bbetween tw wee eenn 0. 00.28 28 and andd 0.44. 0.4 . 4. The The mod models odel od e s includ included uded ud ed SN SNPs in BCL11A, BCL11A A the HBS1L-MYB Y interg YB intergenic genic region reg gion andd th the he si site ite off th the he HB HBB B ge ggene ne cluster,, QT Q QTL L pr ppreviously eviously y associated associi with HbF. Conclusions - An ensemble of 14 genetic risk models can predict HbF levels with accuracy between 0.28 and 0.44 and the approach may prove useful in other applications. Key words: sickle cell disease, hemoglobin, genetics, association studies, risk prediction, risk factor 2 DOI: 10.1161/CIRCGENETICS.113.000387 Introduction HbF is the major modifier of the clinical features of sickle cell anemia (homozygosity for HBB glu6val) and ȕ thalassemia. HbF inhibits sickle hemoglobin (HbS) polymerization and FRPSHQVDWHVIRUWKHGHILFLWRIQRUPDO+E$LQȕWKDODVVHPLD1. If it were possible to know at birth the HbF level likely to be present after stabilization of this measurement at about age 5 years2, then a better patient-specific prognosis might be given and HbF-inducing treatments better tailored to the individual. 7KHȖ-globin chain of HbF is encoded by the linked HBG1 and HBG2 Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 genes. Levels of HbF in adults are highly heritable and the production of HbF is ggenetically eneticcal ally l 3-66 regulated by several quantitative trait loci that modulate HBG1 and HBG2 BG2 expression. expressiion.3Genome G e i on sstudies iation tudiies ((GWAS) tu tudi G AS) in sickle cell anemia GW aneemi m a have identified ed d single nucleotide wide association polymorphisms ism ism ms (SNPs) in n BC B BCL11A, CL1 L11A L1 A, the HBS1L-MYB HBSS1L--MYB HB B intergenic inte terg rgen rg e ic re en region egio on and ndd eleme elements m ntss link me linked nkked d to e clu ene lust lu ust ster e that er att jo join i tl in tly l explain expl plai pl ain 10-12% ai 100 12 12% 2% off tthe he vvariability he aria ar iabi iabi biliity y of of HbF. Hb 22,, 7 Howe However, H oweve verr, iitt is the HBB gene cluster jointly possible that a SNPs that are si at significantly ign g if ific i antlly associated d wi with ithh HbF F levels l but but do do not reach ge ggenomenom m wide significance may ficance fi i ma explain e plain pllaii additional addi dditii all variability ariabilit iabi bili lit andd be be used sed d to predict dict HbF di HbF levels. le l els ell This Th Thi h is in part due to the difference in the goals of the analysis of GWAS and phenotype prediction; in order to increase the amount of variability explained in a phenotype one may need to use SNPs that fall below the genome-wide significance threshold.8 One approach to genetic risk prediction uses a summary of the risk alleles in the form of a genetic risk score (GRS) as a covariate of the model.9-12 A GRS can summarize a large amount of genetic information into a single covariate, but the challenge is to identify the optimal number of SNPs to be included in the score. To overcome this challenge, we present a novel method of creating an ensemble of models with the GRS composed of a different number of SNPs to produce more robust predictions. This method extends the approach introduced in Sebastiani et 3 DOI: 10.1161/CIRCGENETICS.113.000387 al 2012 for building genetic risk prediction models from case control studies to predict quantitative traits.13 We show that an ensemble of 14 models with GRSs comprising 1 to 14 SNPs that were trained in a set of 841 sickle cell anemia patients from the Cooperative Study of Sickle Cell Disease (CSSCD) can predict HbF in three different cohorts of African Americans with sickle cell anemia with correlation between observed and predicted values between 0.28 and 0.44. Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 Methods Participants HbF levels were m measured e suureed in 841 African American ssubjects ea ubjects from thee CSSCD (NCT00005277) (NCT00005 homozygous us for us for the HbS S ggene en ne or or w with i h Hb it H HbS-ȕ S-ȕ0 tha tthalassemia. th halasssemi miaa.14 Th mi The he va validation ali l dati tion ti on cohorts coh ohor orrtss iincluded nclu nc l d 181 patientss ffrom rom ro m tthe h P he Pulmonary ulmo ul monnary Hypertension mo Hyp yper erte er t nssio te ionn an andd Si Sick Sickle ckle ck le Cell Cel elll Disease Dise Di seaase wi se with th S Sildenafil ilde il dena de nafi na fill Therapy fi Thee Th (Walk-PHaSST) a aSST) Study (NCT00492531), (N NCT C 00 0 499225531 31), ), 77 77 patients paati tien en nts from fro rom m a st sstudy udyy of pulmonary ud pul u mo ona nary hypertension hypertensi in children with th si sick sickle ckle le ccell elll disease dise di seas asee (P ((PUSH PUS USH H NC NCT CT 00 0049 00495638), 49956 5 38 38), ), aand nd 1127 27 ssickle 27 i kl ic klee ce cell ll anemia ane nemi miaa patien pati pa patients tien en from the Comprehensive Sickle Cell Centers Collaborative Data (C-data) project. Subjects for all cohorts were selected based on the following criteria: age >5 years for the HbF measurement, no hydroxyurea use, and no recent transfusion. In addition, patients in the validation cohorts had hemoglobin phenotypes similar to that of the discovery set. The demographics of these studies have been described.14-16 Some characteristics of the sickle cell anemia patients are described in Table 1. All studies were approved by the institutional review board (IRB) of each participating institution. HbF Values HbF was measured by alkali denaturation in the CSSCD or by high-pressure liquid chromatography. Within the range of observed HbF, both alkali denaturation and high-pressure 4 DOI: 10.1161/CIRCGENETICS.113.000387 liquid chromatography give similar results. For the CSSCD subjects, longitudinal HbF values were gathered from phases 1 through 3 of the study, only steady-state values were used (ie, measurements not taken during an acute event) and summarized by the median of the longitudinal values.2 Because HbF is known to decrease in early years of life, we only used HbF measurements at age 5 years or older. The cubic root transformation of HbF was used in all statistical analyses, to remove asymmetry. Genotyping Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 DNA from the CSSCD, PUSH, C-Data and Walk-PHaSST samples that at formed the disc discovery cov ove and replication cohorts were genotyped at Boston University using Illumina umina Human610-Quad Human66100-Q Qu ithh ap it appr prox o im mately 600,000 SNPs. Samples Sam ampl am p es were proces ssed according the SNP arrays wit with approximately processed rer’ss protocol and re an nd BeadStudio Beead dSt S udio io o Software Sof oftwar are was wa us used ed d too ma m ake ggenotype enot otype ca ot call ls util lizzinn the manufacturer’s make calls utilizing e fi e-defi fine nedd clusters. ne clus ustters ters. Samples Samp Samp mple less withh lless le ess than ess than a 995% 5% % ccall all rate al rat ra ate te were wer ere re rem moveed an andd SN N Illumina pre-defined removed SNPs r <97.5% were re-cl lustered. d After Aft f er re-cl lustering, g, S NPs with NP wiith h call calll rates >97.5%, >97.5% with a call rate re-clustered. re-clustering, SNPs aration tii score > 0.25, 0 225 5 excess e cess heterozygosity hetero het gosit it between bet bet een -0.10 0 10 10 and d 00.10, 10 and d mi minor i a cluster separation allele frequency > 5% were retained in the analysis. We used the genome-wide identity by descent analysis in PLINK to discover unknown relatedness.17 Pairs with IBD measurements greater than 0.2 were deemed to be related and related subjects within individual or different studies were removed. We also removed samples with inconsistent gender findings defined by heterozygosity of the X chromosome and gender recorded in the database. Statistical Analysis Genotype data from the CSSCD were used as the training data set to develop different GRSs. Initially, the association of each SNP was tested using a linear regression model adjusted for gender using the additive coding in PLINK17, and the p-values for each association were 5 DOI: 10.1161/CIRCGENETICS.113.000387 computed. No significant associations were found between HbF and the first ten principal components (PCs) computed using EIGENSOFT.18 Lack of inflation of the associations was confirmed by the genomic inflation factor 1.003.18 SNPs were sorted by increasing p-values, starting from the most significant SNP associated with HbF (rs766432, p-value=2.61x10-21), and the list of SNPs was pruned by removing those SNPs in high linkage disequilibrium (r2 > 0.8). If two SNPs were found to be in high linkage disequilibrium, the SNP with lowest p-value was kept and the other SNP was removed. This process was repeated until no SNPs in high linkage Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 disequilibrium remained which left 500,325 SNPs. This pruned and sorted orted list of SNPs was wa used to generate a sequence of unweighted GRS for each subject in the CSSCD SCD D by b cumulatively cumulativ l iv ivel ely el n mbe num berr off risk riskk alleles a leles (an allele that caus al uses us es a decrease in H b ) for each SNP. Th bF h first adding the number causes HbF) The ded only the m de ost ssignificant iggnif iffic i antt SN SNP P, the th he second seco condd G RS was was generated wa geeneraate ted by add dddingg tthe he GRS included most SNP, GRS adding P from m th the so ortted d llist ist of ist o S NPs to the NP the ffirst irst stt GRS GRS R and and so so onn using usi sin ing ng the the iterative itterat attive ive formula: f rm fo m second SNP sorted SNPs ݁ݓ݊ݑ ݊ݑ ݃݅݁ݓ ݃݅݁ݓ ݄݅݃ ݄݁ݐ ݀݁ݐ ݀ ܴܩ ܴܵܩ ܵ, ܴ݅݇ݏ ݈݈݈݁݁ܣ ݇ݏ ݈݈݁ܣ ݈ܣ ݈݈݁݁ ݈݁,, ݀݁ݐ݄݃݅݁ݓ݊ݑ , = ܴ݅ ୀଵ where n is the number of SNPs, and Risk Allelei,j is the number of risk alleles carried by individual i for the jth SNP. A sequence of weighted GRS was also generated by weighting each risk allele as: ܴܵܩ ݀݁ݐ݄݃݅݁ݓ, = ݐ ݈݈݈݁݁ܣ ݇ݏܴ݅ כ, ୀଵ where tj is the t-statistic to test the association of the jth SNP with HbF. This analysis was repeated for the first 10,000 SNPs (p-value< .02185) and generated 10,000 GRS, for each of the subjects in the CSSCD. Each of these GRS was included as covariate in a linear regression model and the regression coefficients of the resultant 10,000 linear regression models were 6 DOI: 10.1161/CIRCGENETICS.113.000387 estimated using Least Squares methods in the CSSCD data in the R package. The fitted regression models were used to predict HbF using the formula: , = ߚ ܨܾܪ , + ߚଵ, ܴܵܩ, , is the predicted HbF value for individual i using the nth GRS, and ߚ where ܨܾܪ , , ߚଵ, are the estimate of the regression coefficients. Cumulative ensembles of the predictions13, 19-21 were Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 computed using the formula: 1 ܨܾܪ ܨపప,ఫ,ఫఫ ݊ ୀଵ Thee ppredictive reedictive value vallue ooff the ggenetic enetic rrisk isk m models odeels an and nd th the their eir eensembles nsemb m lees wass eevaluated valluattedd iin the nd in the three independent in nde d pe p ndden ent cohorts. The pproportion ropo p rtio on of variability variaabili biil ty y explained in th CSSCD, and the ensemble off GR GRS S models mode mo dels de ls in in the the CSSCD CSSC CS SCD SC D set set was was computed comp co mput mp uted ut ed as as the the squared sqqua uare redd Pearson re Pear Pe arso ar sonn co so corr correlation rrel rr elat el atii at between the predicted and observed values while the predictive accuracy in the independent sets was evaluated by computing the Pearson correlation between the observed and predicted values of HbF. The number of SNPs with GRS that maximized the correlation between observed and predicted values in the three independent cohorts was selected as the optimal number of SNPs. To evaluate the predictive value of genetic data relative to non-genetic risk factors, a non-genetic prediction model based on age, gender and the presence of alpha thalassemia was estimated and the genetic risk models were also adjusted by age, gender and alpha thalassemia. The accuracy of these additional models was assessed by the correlation between the predicted and observed HbF values. 7 DOI: 10.1161/CIRCGENETICS.113.000387 Results Table 1 shows patient characteristics of the four studies. The percent male and HbF concentration were approximately the same across the Walk-PhaSST, C-Data and CSSCD cohorts. Average age varied across the four cohorts; the CSSCD studied both children and adults, C-Data and PUSH were skewed toward pediatric age patients while the Walk-PHaSST cohort consisted mainly of adults. Patients in the PUSH study were younger and had significantly higher HbF levels (p-value=0.0016). Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 Figure 1 shows the correlation between the observed and predicted icted HbF values uusing sn si genetic risk models with unweighted GRS (left panel) and the ensemble models le off tthese h se GR he GRS S mo mod d (right panel) the l for l) for th he to topp 500 SNPs, in each of the validation vali lida li d tion cohorts. The da The h correlation ranged between 0.2 and with SNP; GRSs 2 an nd 0.4 for pprediction reediccti tion w ith h oonly nly ly 1 S NP;; it it peaked pea eake kedd att 00.45 ke .45 ffor orr pprediction rediict c io i n with th G that includee 10 to 115 genetic and 5 SN SNPs NPs P in in the th he PUSH PU USH H data, datta, a with wit ithh both b th the bo thee standard sta tand ndar nd ard ar rd ge gen netiic risk ri models mod odel dells an a their ensembles, numbers m mbles, , and declines ffor or llarger arg ger numbe b rs off SNPs SNP in the th he models. mode d ls l . Inclusion Incllusion of more than 50 SNPs decreased new SNPs GRS ecreasedd th the h correlation elati l io eeven en ffurther. rther th h While Whil Wh il the th h incl iinclusion ncll sion io off ne SNP in in the thhe GR G R had substantial effects on the predictive accuracy of the genetic risk model, as shown by the up and down pattern from one model to the next (left panel of Figure 1), the accuracy of the ensemble of these GRS models was more stable. The results using the weighted GRS were similar (Supplementary Figure 1). An ensemble of the first 14 GRS models had the highest average correlation among all three data sets and explained 23.4% of the variability in HbF in the CSSCD cohort. The correlation between observed and predicted HbF using the ensemble of 14 GRS models was 0.44, 0.28 and 0.39 in the PUSH, Walk-PHaSST and C-Data cohorts, respectively. Of these 14 SNPs, 5 were located in BCL11A; other SNPs were located in the olfactory receptor region on 8 DOI: 10.1161/CIRCGENETICS.113.000387 chromosome 11p15 and the site of the HBB gene cluster, and in the HBS1L-MYB interval on chromosome 6q and were found previously to be associated with HbF.2, 22, 23 Table 2 shows details of these 14 SNPs. Adding non-genetic risk factors age and gender to the GRS models did not increase the amount of variability explained of HbF in the Howard and Walk-PHaSST cohorts. The nongenetic prediction model which included information on age, alpha thalassemia and gender only explained 6.8% of the variability in HbF in the CSSCD cohort. Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 It is noticeable that the genetic risk models had consistently higher gher pr ppredictive edictive acc accuracy cur u a in the PUSH cohort in comparison with the C-Data and Walk-PHaSST cohorts. ohortts. The Th age g distributionn off the t e CSSCD th C SC CS S D and PUSH cohorts was skewed skeewed toward chil sk children ildr il d en and young adolescentss wh while the agee ddistribution isttriibu butiion off Walk-PHaSST W lk-P Wa PHaSS SST cohort coho h rt was waas skewed skkew ewed ed d tow toward wardd aadult duultt pa patients a e entary y Figure Figgure Fi re 22). ). E xact age xa gee of pa ati tien ents en ts was as not ot available ava vaiilab va able ab le for for the the C-Data C-D Data t cohort. cohhor ortt. (Supplementary Exact patients Discussion In African Americans with sickle cell anemia, HbF level does not stabilize until the age of 5 years.2 Early prediction of stable adult HbF levels might help foresee some complications of sickle cell anemia and aid in its clinical management by guiding the decision of how vigorously to pursue HbF induction therapies. Our goal was to identify methods that could combine SNPs to provide a better prediction of HbF levels than single SNP analysis. We developed an ensemble of GRS that predicted HbF in 3 independent cohorts. In an ensemble of GRS models, as few as 14 SNPs explained a larger fraction of the variability in HbF and better predicted HbF levels when compared with single SNP analysis, or a single GRS model. Even though the ensemble of GRS explains 23.4% of the variability in HbF in the CSSCD cohort, it does not explain the totality of the variability in HbF that is due to heritability, estimated to be between 60% and 90%.5, 24 This 9 DOI: 10.1161/CIRCGENETICS.113.000387 missing heritability could be due to gene-gene interactions, epigenetic factors or multiple rare variants with small effects that GWAS are poorly designed to detect.25 Our GRS included SNPs with a MAF > 5%; however, as sickle cell anemia is a rare disease it is possible that some major genetic modifiers are rare variants with a lower allele frequency. Next generation sequencing might discover rare alleles that could be incorporated into a GRS to increase prediction accuracy. Increasing the number of genetic variants in the GRS may increase the total explained variability of HbF26; however, this could lead to overfitting and as one continues to add more genetic Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 variants to the GRS the prediction accuracy in independent cohorts will ll decrease (as sho shown own w in Figure 1). The prediction accuracy of our weighted GRS model was simila similar l r ttoo that that off oour ur unweighted d GRS RS model. moddel e . We We hypothesize that this is due to the fact tha that haat weighted GRS mo models perform better tterr when theree is a ddifference tt ifffeerencce in n tthe he ge genetic eneetiic eeffects; ffec ff ectss; ho ec hhowever, oweve veer, Table T blee 2 sshows Ta hows that ho th the regression coeff coefficients ffic ff icie ic cie ient nts an nt andd st standard tan anda d rdd eerrors rrors fr rr ffrom om m tthe he G he GWAS W S off our WA ur ttop op S SNPs NPss are NP ar al all ll ve very ry similar. ssim im An interesting i g resul result lt off our analy analysis lysiis iiss th the he sy systematically ystematically ly higher hiiggher correlation betweenn predicted and ndd obser ob observed b ed dH HbF bF values all es in in the thhe PUSH PUS USH H study. st d This Thi res result lltt might ighht suggest s ggest st that thhat the th h ggenetic models predict more accurately in children and young adults. Blood counts decline with advancing age in sickle cell disease and could reflect decades of bone marrow damage due to sickle vasoocclusion with relative bone marrow “failure”.27-29 Perhaps the higher correlation between the predicted and observed HbF levels in the younger patients of the PUSH cohort is a result of gene x environment interaction, where the genetic elements regulating HbF production are not impeded by the erythroid bone marrow injury associated with aging in the older cohorts. However, the difference in prediction accuracy could also be due to unobservable patient characteristics not accounted for in the model. For example, the CSSCD was conducted before the establishment of hydroxyurea as standard therapy for sickle cell anemia patients, and there 10 DOI: 10.1161/CIRCGENETICS.113.000387 may be survivor effects in more contemporary cohorts that are not accounted for in older cohorts. Testing these results in additional contemporary cohorts of sickle cell anemia will be necessary to explain the result. The 14 SNP model included SNPs in the BCL11A region, the HBS1L-MYB intergenic region and SNPs in the olfactory receptor gene cluster 5’ to the HBB gene complex. BCL11A down regulates HBG expression 30, 31, the HBS1L-MYB intergenic region might affect erythropoiesis and modulate BCL11A expression while the olfactory receptor region might Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 control expression with the HBB gene-like cluster.32 There are alternate methods of genetic prediction that we did not ot explore expplore in in this th paper pap 33 34 including combining omb omb mbin inin in ingg SNP in S P information into a haplotype-based SN haplo oty typpe-based analysis analysis, is,,33, is multivariate regression models moddels and machine maachin ne learning lear a ningg type ar typ pe approaches ap pproaachess suc such uchh ass suppo uc support poortt vvector ecto or m machines acchin nes aand 5, 366 338-42 8-42 cross validation a ation (CV) (CV CV))335, , principal prin pr i ci in cipal i l co component omponnen ent aanalysis nal allys ysis i 377, and is an nd Ba Baye Bayesian yesiian net yesi networks etwo et w rkss38 wo . In In our analysis, wee used traditionall regression regr g ession i bbased ased dm models oddels l and ensemble ensemb ble off th these hese models: we tra trained a the models in i the th h discovery disco di er cohorts hort andd examined e amined ined d th the he prediction dictiio res di results llts t in in independent indd dent cohorts coh ohh to determine which model predicts most accurately. We investigated using 10 fold cross validation (CV) method to select the best number of SNPs in the discovery set. However, we noted in a large simulation study that using 10 fold cross validation tends to underestimate the true number of SNPs (Supplementary Information; Supplementary Figure 3), and noted that an ensemble of regression models appear to be more robust for prediction. This result is consistent with published literature.43. Since the 3 test sets were used to determine the optimal number of SNPs, replication of the results in additional independent sets is necessary to confirm the results. However, the substantial agreement of prediction of the ensemble of 14 models in the 3 sets is reassuring that this result is robust. 11 DOI: 10.1161/CIRCGENETICS.113.000387 The ensemble of GRS models explained 23.4% of the variability in HbF in the CSSCD data and the correlation between predicted and observed HbF values in the 3 independent sets ranged from 0.28 to 0.44. These numbers are higher than results reported in the literature when GRS have been used as a predictive tool. For example, a study of 3,575 subjects from the Doetinchem Cohort Study computed a GRS to predict plasma total cholesterol levels which are highly genetically determined with a heritability estimated to be 40 to 60%.44, 45 Using 12 SNPs they were able to explain 6.9% of the total variability in total cholesterol levels.46 Participants in Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 the Atherosclerosis Risk in Communities cohort of 10,745 individuals were used to construct const stru st r a GRS of obesity in order to predict BMI.47, 48 The obesity GRS showedd a correlation correllatio ti n wi with ith BMI BM of r=0.12 for the t uunweighted nwei nw e gh ei ghtedd GRS model and r=0.13 ffor or the weighted mo model. Our ensemble of o GRS models ls of ls of HbF can explain exxplain in more more phenotype p en ph notyppe va variability ariiabil ilit il ityy and it andd hhad ad a high higher gh her ppredictive reedictivee accuracy in n comp comparison mp par aris ison wi is with ith t ssingle ingl ingl gle SN SNP P anal analyses. alys al ysees. ys Onee of the maj major jor ggoals oalls off GW G GWAS A was to id AS identify dentiiffyy genetic genetic i variants variiants that th hat are associated with disease or meas measures severity personalized medicine. Many studies res off disease dis se erit it to be b used sed ed d in i personali ali li edd medicine edi diciin M Man a st t di dies have shown the importance of including genetic variants beyond those that meet the genome-wide association threshold of 5x10-08 but many of these SNPs associations may be false positives and lower the accuracy of a prediction model.49, 50 Our study shows that the use of an ensemble of genetic risk models is robust to inclusion of false positive associations and the approach may prove useful in other applications. Funding Sources: This work was supported by National Institutes of Health Grants R21HL114237 (PS), R01 HL87681 (MHS), RC2 L101212 (MHS), 5T32 HL007501 (JNM), 2R25 HL003679-8 (VRG), R01 HL079912 (VRG), 2M01 RR10284-10 (VRG), R01HL098032 (MTG), R01HL096973 (MTG), P01HL103455 (MTG), the Institute for Transfusion Medicine and the Hemophilia Center of Western Pennsylvania (MTG), U54 HL70583. The C-Data Project of the Comprehensive Sickle Cell Centers, U54HL70583 and NIH intramural funding 1 ZIA HL005116 and 1 ZIA HL006016 12 DOI: 10.1161/CIRCGENETICS.113.000387 Conflict-of Interest Disclosure: The authors declare no competing financial interests. References: 1. Steinberg MH, Nagel RL, Forget BG, Higgs DR, Weatherall DJ. Genetic Modulation of Sickle Cell Disease and Thalassemia. In: M. H. Steinberg, B. G. Forget, D. R. Higgs and D. J. Weatherall, eds. Disorders of hemoglobin: Genetics, Pathophysiology and Clinical Management Cambridge: Cambridge University Press; 2009: 638-657. 2. Solovieff N, Milton JN, Hartley SW, Sherva R, Sebastiani P, Dworkis DA, et al. Fetal hemoglobin in sickle cell anemia: genome-wide association studies suggest a regulatory region in the 5' olfactory receptor gene cluster. Blood. 2010;115:1815-1822. Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 e. Am J Hematol. 3. Steinberg MH, Sebastiani P. Genetic modifiers of sickle cell disease. 2012;87:795-803. L et al. al Gender Gen ende derr an de andd 4. Steinberg MH, Hsu H, Nagel RL, Milner PF, Adams JG, Benjamin L, eeffects ect ctss up upon on hem ematological manifestations em ns of of adult sickle cell cel ell anemia. Am J Hem m haplotype effe hematological Hematol. 75-18 5-- 81. 1995;48:175-181. C T atu T, R at e ttiee JE, ei E,, Lit ttl tlew e oo oodd T arleyy J, C ervi vino no S l. G enettic inf nfluen nf ncess on F 5. Garner C, Tatu Reittie Littlewood T,, D Darley Cervino S,, et aal. Genetic influences t ther hhematologic he ema m tolo ma ogiic va riab ri ia les: s: a twinn he heri r ta ri tabi bili bi liity t sstudy. tudy tu dy. B dy lood. d 20 22000;95:342-346. 0000;9 ; 5: 5 34422 34 3466. cells and other variables: heritability Blood. S, Thein SL. Genetic Genetiic ar chit hi ecture of hhemoglobin emogl gllobin b F control. controll. C urr Opin Opi p n Hematol. 6. Menzel S, architecture Curr 79-186. 2009;16:179-186. 7. Bae HT, Baldwin CT, Sebastiani P, Telen MJ, Ashley-Koch A, Garrett M, et al. Meta-analysis of 2040 sickle cell anemia patients: BCL11A and HBS1L-MYB are the major modifiers of HbF in African Americans. Blood. 2012;120:1961-1962. 8. Gibson G. Rare and common variants: twenty arguments. Nat Rev Genet. 2011;13:135-145. 9. Meigs JB, Shrader P, Sullivan LM, McAteer JB, Fox CS, Dupuis J, et al. Genotype score in addition to common risk factors for prediction of type 2 diabetes. N Engl J Med. 2008;359:22082219. 10. Purcell SM, Wray NR, Stone JL, Visscher PM, O'Donovan MC, Sullivan PF, et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460:748-752. 11. Paynter NP, Chasman DI, Pare G, Buring JE, Cook NR, Miletich JP, et al. Association between a literature-based genetic risk score and cardiovascular events in women. JAMA. 2010;303:631-637. 13 DOI: 10.1161/CIRCGENETICS.113.000387 12. Sebastiani P, Solovieff N, Sun JX. Naive Bayesian Classifier and Genetic Risk Score for Genetic Risk Prediction of a Categorical Trait: Not so Different after all! Front Genet. 2012;3:26. 13. Sebastiani P, Solovieff N, Dewan AT, Walsh KM, Puca A, Hartley SW, et al. Genetic signatures of exceptional longevity in humans. PLoS ONE. 2012;7:e29848. 14. Gaston M, Rosse WF. The cooperative study of sickle cell disease: review of study design and objectives. Am J Pediatr Hematol Oncol. 1982;4:197-201. 15. Dham N, Ensing G, Minniti C, Campbell A, Arteta M, Rana S, et al Prospective echocardiography assessment of pulmonary hypertension and its potential etiologies in children with sickle cell disease. Am J Cardiol. 2009;104:713-720. Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 16. Machado RF, Barst RJ, Yovetich NA, Hassell KL, Kato GJ, Gordeuk euk VR, et al. Hospitalization for pain in patients with sickle cell disease treated with h ssildenafil ilde il dena de nafi na fill fo fi forr eelevated lev vatt TRV and low exercise capacity. Blood. 2011;118:855-864. 17. Purcell S, Neale B,, To Todd-Brown Bender Nea eale lee B Todd d -Brown K, Thomas L, Ferreira Fe MA, Ben nde der D, et al. PLINK: A Tool Set forr Whole-Genome Association Population-Based Hum Whole-G Geennome om m As A sso soociat atio at ionn and io a d Po an P p laationn-Ba pu Base Ba s d Linkage Link Li nkag nk ag age g Analyses. Anal An alys al y ess. Am J H ys Genet. 2007;81:559-575. 7;811:559-575. 7; 18. Price AL, NJ, Plenge RM, Principal L, Pa P Patterson att ttter ersson NJ J, Pl Plen enge ge R M, Weinblatt Wei einb ei nbla nb bla l tt ME, ME, Shadick Sha hadi dicck NA di NA an andd Reich Reiic Re ich D. P ich Principa riinc nciipa ip componentss analysis corrects genome-wide Nat Genet. corre r ectss fo forr st sstratification rati ra tiifi fica caati tion on iin n ge geno nome no m -w me wid idee as aassociation sooci c at a io ionn studies. st Genn 2006;38:904-9. 0 04-9. 19. Breiman L. B Predictors. Machine Learning. nL Bagging gii Predictors P di dict M hi L Learning in 11996;24:123-140. 1996;24:123 9966;24 99 24:123 123 1140 40 20. Mevik B-H, Segtnan VH, Næs T. Ensemble methods and partial least squares regression. J Chemom. 2004;18:498-507. 21. Rokach L. Ensembled-based classifiers. Art Intell Review. 2010;33:1-39. 22. Sedgewick AE, Timofeev N, Sebastiani P, So JC, Ma ES, Chan LC, et al. BCL11A is a major HbF quantitative trait locus in three different populations with beta-hemoglobinopathies. Blood Cells Mol Dis. 2008;41:255-258. 23. Uda M, Galanello R, Sanna S, Lettre G, Sankaran VG, Chen W, et al. Genome-wide association study shows BCL11A associated with persistent fetal hemoglobin and amelioration of the phenotype of beta-thalassemia. Proc Natl Acad Sci U S A. 2008;105:1620-1625. 24. Pilia G, Chen WM, Scuteri A, Orru M, Albai G, Dei M, et al. Heritability of cardiovascular and personality traits in 6,148 Sardinians. PLoS Genet. 2006;2:e132. 14 DOI: 10.1161/CIRCGENETICS.113.000387 25. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A. 2009;106:9362-9367. 26. Galarneau G, Palmer CD, Sankaran VG, Orkin SH, Hirschhorn JN, Lettre G. Fine-mapping at three loci known to affect fetal hemoglobin levels explains additional genetic variation. Nat Genet. 2010;42:1049-1051. 27. West MS, Wethers D, Smith J, Steinberg M. Laboratory profile of sickle cell disease: a cross-sectional analysis. The Cooperative Study of Sickle Cell Disease. J Clin Epidemiol. 1992;45:893-909. 28. Morris J, Dunn D, Beckford M, Grandison Y, Mason K, Higgs D, et al. The haematology of homozygous sickle cell disease after the age of 40 years. Br J Haematol. 1991;77:382-385. Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 29. Hayes RJ, Beckford M, Grandison Y, Mason K, Serjeant BE, Serjeant GR. eant ea nt G R. The The haematology haema m t of steady state homozygous sickle cell disease: frequency distributions, s, variation varia i ti tion with wit ithh age it age and a sex, longitudinal observations. Br J Haematol. 1985;59:369-382. 30. Sankaran VG, Menne TF, Xu Akie TE,, Le Lettre Handel B,, et aal. Human an V an G, Men nnee T F,, X u J, J A kiee TE ki T L t re G, tt G, Van Vaan Ha and n el B l. H u an ffetal um ettal a hemoglobinn expression stage-specific BCL11A. exxpression iss regulated regul ullated ed d by by the t e developmental th deveelo opm mentaal st stag a e--sppecif ag i ic rrepressor ep presssor B CL11A CL A Science. 2008;322:1839-1842. 088;3 ;322:183399 188422. 31. Chen Z,, Luo HY, Steinberg MH,, Ch BCL11A transcription K562 Steiinnbberrg MH M Chui uii DH. DH. B CL11 CL 111A represses reepr p es esse s s HB se HBG G tr ran anscription in K cells. Blood d Cells Mol Dis. 20 22009;42:144-149. 09;4 09 ;4 42:14 44-1449. 32. Feingold EA, Penny LA LA, Nienhuis BG. olfactory located ld dE EA A Penn P L A Nienh Nienhh is i AW, AW Forget Fo et BG B G An A olfactor olf lfact receptor pt gene iiss llocat at in the extended human beta-globin gene cluster and is expressed in erythroid cells. Genomics. 1999;61:15-23. 33. Hughes MF, Saarela O, Stritzke J, Kee F, Silander K, Klopp N, et al. Genetic Markers Enhance Coronary Risk Prediction in Men: The MORGAM Prospective Cohorts. PLoS ONE. 2012;7:e40922. 34. Onuki R, Shibuya T, Kanehisa M. New kernel methods for phenotype prediction from genotype data. Genome Inform. 2010;22:132-141. 35. Wei Z, Wang K, Qu HQ, Zhang H, Bradfield J, Kim C, et al. From disease association to risk assessment: an optimistic view from genome-wide association studies on type 1 diabetes. PLoS Genet. 2009;5:e1000678. 36. Wu C, Walsh K, DeWan A, Hoh J, Wang Z. Disease risk prediction with rare and common variants. BMC Proceedings. 2011;5:S61. 15 DOI: 10.1161/CIRCGENETICS.113.000387 37. Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, et al. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006;241:252-261. 38. Rodin AS, Boerwinkle E. Mining genetic epidemiology data with Bayesian networks I: Bayesian networks and example application (plasma apoE levels). Bioinformatics. 2005;21:3273-3278. 39. Sebastiani P, Ramoni MF, Nolan V, Baldwin CT, Steinberg MH. Genetic dissection and prognostic modeling of overt stroke in sickle cell anemia. Nat Genet. 2005;37:435-440. 40. Jiang X, Barmada MM, Cooper GF, Becich MJ. A Bayesian method for evaluating and discovering disease loci associations. PLoS ONE. 2011;6:e22075. Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 41. Kang J, Zheng W, Li L, Lee J, Yan X, Zhao H. Use of Bayesian networks etworks to dissect the the h complexity of genetic disease: application to the Genetic Analysis Workshop orrks orks ksho hopp 17 simulated ho sim imul ulat ul atted ed data. BMC Proceedings. 2011;5:S37. 42. Sebastiani P,, Perls data. a P ani Perl Pe rlss TT. rl TT T. Prediction models that iinclude nclu l de genetic dat ta. Circ Cardiovasc Genet. Gee 2010;3:1-2.. 43. Hartley SW, Monti Liu CT, Steinberg Sebastiani P.. Bayesian SW W, Mont ti S, L iu uC T, S t inbe te b rg MH, H Seb ebassti tianii P Bay yessiaan methods method ods for od foor multivariatee modeling SNP mod odel od elin el i g off pleiotropic in pleio leio i trop trop opic S NP associations ass ssoocia ss i ti ia tion onss and and genetic geeneti ticc ri ti risk iskk pprediction. redi dict di ctiion. Front ct Frontt Genet. G 2012;3:176. 6 6. 44. Lusis AJ, part genes AJ, Fogelman AJ Fogelm man AM, AM, M, Fonarow Fonaro Fo Fona naaro row w GC. GC. Genetic Geeneti Gene netic tiic basis baasi basi s s of atherosclerosis: ath ther heros oscl os cler cler eros osis os iss: pa is: art II:: new ge en and pathways. 2004;110:1868-1873. a s Circulation. Ci l tii 2004;110:1868 2004 20 04;110 110:186 18688 1873 1873 45. Verschuren WM, Blokstra A, Picavet HS, Smit HA. Cohort profile: the Doetinchem Cohort Study. Int J Epidemiol. 2008;37:1236-1241. 46. Lu Y, Feskens EJ, Boer JM, Imholz S, Verschuren WM, Wijmenga C, et al. Exploring genetic determinants of plasma total cholesterol levels and their predictive value in a longitudinal study. Atherosclerosis. 2010;213:200-205. 47. Belsky DW, Moffitt TE, Sugden K, Williams B, Houts R, McCarthy J, et al. Development and evaluation of a genetic risk score for obesity. Biodemography Soc Biol. 2013;59:85-100. 48. Folsom AR, Chambless LE, Ballantyne CM, Coresh J, Heiss G, Wu KK, et al. An assessment of incremental coronary risk prediction using C-reactive protein and other novel risk markers: the atherosclerosis risk in communities study. Arch Intern Med. 2006;166:1368-1373. 49. Kooperberg C, LeBlanc M, Obenchain V. Risk prediction using genome-wide association studies. Genet Epidemiol. 2009;34:643-652. 16 DOI: 10.1161/CIRCGENETICS.113.000387 50. Yang J, Manolio TA, Pasquale LR, Boerwinkle E, Caporaso N, Cunningham JM, et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nat Genet. 2010;43:519-525. Table 1 Patient characteristics. Summary statistics of patient characteristics in the CSSCD cohort, and the PUSH, WALK-PHaSST and C-Data replication cohorts. For each cohort statistics (mean and standard deviate or frequencies) are reported for all patients included in the Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 analysis. The last row reports the frequency and proportion of individuals uals with gene gene deletion dele leti le to alpha thalassemia (at least one alpha gene deletion). CSSCD C CS S D SC PUSH PU USH H Walk-PHaSST Walk Wa lk-P lk PHaS aSST aS ST T C-Data C-Da CD t (N=841) (N=8 =841 =8 4 ) (N=77) (N N=77) 7 7) ((N=181) (N =1881)) (N=127) (N= =122 Mean an (StD) (StD)) Mean an (StD) (StD) D)) Mean n (StD) (StD)) Mean (StD) (S S Age (years) 17.1 17 17.19(10.69) 19( 9(10 100.69) 10.6 .6 69 12.4 12 12.49 .499 (4 .4 (4.6 (4.69) 69) 36 36.35 6.3 .355 (12.54) (12. (1 2.54 2. 54)) 54 13-177 Gender (% male) 53.7% 50.6% 51.9% 55.9% HbF (%) 6.65 (5.50) 9.81 (8.23) 6.07 (5.60) 7.59(5.09) Hemoglobin (d/mL) 8.42 (1.33) 8.62 (1.25) 8.50 (1.69) NA ĮWKDODVVHPLD\HV 143 (31.4%) 21 (27.2%) 46 (25.4%) NA 17 DOI: 10.1161/CIRCGENETICS.113.000387 Table 2 Summary of SNPs in prediction model. GWAS results of the 14 SNPs in the best GRS prediction model in the CSSCD cohort. ȕ=estimate of genetic effect from additive model; SE= standard error. SNP Gene Coding ȕ SE pvalue Allele Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 rs766432 Chr 2; BCL11A C 0.25 0.02 1.22E-22 rs10195871 Chr 2; BCL11A A 0.21 0.02 3.30E-18 rs6706648 Chr 2; BCL11A A -0.19 0.02 1.30E-16 rs6709302 Chr 2; BCL11A A -0.14 0.02 0 02 0. 3.69E-08 3.69 3. 69 rs9494145 Chr 6; intergenic G 0.24 0.05 0.05 1.80E-07 1 80 1. rs6732518 Chrr 2; BC C Ch BCL11A CL1 L 1A G 0.13 0.02 1.86E-07 1.866 rs6446085 Chr 3; FHIT FH T A -0.11 -0. 0 11 1 0.02 0.02 02 9.43E-07 9.433 rs10152034 4 Chr 14; 14 in interg intergenic genicc A -0.11 -0.11 1 11 0.033 1.21E-05 1.211 rs17114175 5 Chr 14; in intergenic nte terg rggen e icc C -0.16 -0. 0.16 16 0.02 0 02 0. 1.59E-06 1.599 rs2855039 Chrr 11; Ch 11; HBG1, HBG1 HB G1,, HBG2 HBG2 G 0.18 0.18 0.04 0.04 2.24E-06 2.24 2. 244 rs2239580 Chrr 14; Ch 14; CO COCH CH A -0.13 -00.13 .13 13 0.03 0.03 03 4.46E-06 44.46 .446 rs5006883 Chr 11; OR51B5,OR51B6 G 0.17 0.04 5.32E-06 rs9525079 Chr 13; UGGT2 G 0.13 0.03 6.28E-06 rs416586 Chr 11; OR51A G 0.02 6.37E-06 rs11794652 Chr 9; FUBP3 A -0.10 0.02 6.51E-06 rs12469604 Chr 2; intergenic A 0.19 0.04 8.31E-06 rs6932510 Chr 6; RPS6KA2 A 0.16 0.04 8.52E-06 rs7113817 Chr 11; intergenic A 0.17 0.04 9.88E-06 rs10837814 Chr 11; OR51B2,OR51B3P A 0.14 0.03 1.39E-05 rs2021966 Chr 6; intergenic G 0.10 0.02 1.79E-05 18 0.13 DOI: 10.1161/CIRCGENETICS.113.000387 Figure Legend Figure 1. Correlation between observed and predicted values for increasing number of SNPs in the unweighted GRS. Plot of the correlation between the observed and predicted HbF in the three independent cohorts versus the number of SNPs in the top 50 unweighted GRS models (left panel) and ensemble of unweighted GRS models (right panel). The vertical bars are at No. SNP=14. Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 19 Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 Prediction of Fetal Hemoglobin in Sickle Cell Anemia Using an Ensemble of Genetic Risk Prediction Models Jacqueline N. Milton, Victor R. Gordeuk, James G. Taylor, Mark T. Gladwin, Martin H. Steinberg and Paola Sebastiani Downloaded from http://circgenetics.ahajournals.org/ by guest on June 18, 2017 Circ Cardiovasc Genet. published online March 1, 2014; Circulation: Cardiovascular Genetics is published by the American Heart Association, 7272 Greenville Avenue, Dallas, TX 75231 Copyright © 2014 American Heart Association, Inc. All rights reserved. Print ISSN: 1942-325X. Online ISSN: 1942-3268 The online version of this article, along with updated information and services, is located on the World Wide Web at: http://circgenetics.ahajournals.org/content/early/2014/02/28/CIRCGENETICS.113.000387 Data Supplement (unedited) at: http://circgenetics.ahajournals.org/content/suppl/2014/03/01/CIRCGENETICS.113.000387.DC1 Permissions: Requests for permissions to reproduce figures, tables, or portions of articles originally published in Circulation: Cardiovascular Genetics can be obtained via RightsLink, a service of the Copyright Clearance Center, not the Editorial Office. Once the online version of the published article for which permission is being requested is located, click Request Permissions in the middle column of the Web page under Services. Further information about this process is available in the Permissions and Rights Question and Answer document. Reprints: Information about reprints can be found online at: http://www.lww.com/reprints Subscriptions: Information about subscribing to Circulation: Cardiovascular Genetics is online at: http://circgenetics.ahajournals.org//subscriptions/ SUPPLEMENTAL MATERIAL Supplementary Figure 1. Scatterplot of the correlation versus the Number of SNPs in the weighted GRS model. Plot of the correlation between the observed and predicted HbF in the three independent cohorts versus the number of SNPs in the top 50 weighted GRS and ensemble weighted GRS models. 1 Supplementary Figure 2. Histograms of the age distributions for the CSSCD, PUSH and WalkPHaSST cohorts. 2 Supplement Figure 3. Side by side boxplots of the correlation between the observed and predicted phenotype (y-axis) versus the number of SNPs in the model (x-axis) for the GRS model (left panel) and ensemble of GRS models (middle panel). The right panel shows the distribution of the optimal number of SNPs chosen to maximize prediction using 10-fold cross validation. The results are from a simulation study with 1,000 simulated genotype data sets, each 3 with 1,000 individuals. In each dataset, a continuous phenotype with a heritability of 0.40 was simulated with 30 causal SNPs contributing equally to the genetic effect. The continuous phenotype was simulated to have a variability equal to 1/4 the mean of the phenotype (low variability, row 1), 1/2 the mean of the phenotype (medium variability, row 2) and 3/4 the mean of the phenotype (high variability, row 3). To evaluate the predictive accuracy, each set of 1,000 individuals was randomly split into a training set (N=900) and test set (N=100), and each training set was used to build the GRS and the ensemble of GRS models that were tested in the test set. When 10-fold cross-validation was used, each dataset of 1,000 individuals was randomly partitioned into 10 equal parts where each part was used as test set of the models trained in the other 9 parts, and the results were averaged over the 10 folds. The simulation study shows that for both the GRS and ensemble GRS methods the correlation peaks at 30 SNPs; however, when more than 30 SNPs are included in the GRS model, the correlation rapidly decreases while the decrease in correlation is more gradual for the ensemble of GRS models. The results suggest that if one were to incorrectly choose the wrong model, the ensemble of GRS models would result in smaller loss of prediction accuracy than a single GRS model. In addition, selection of the optimal number of SNPs using cross-validation may be too conservative and produce sub-optimal models. As expected, the prediction accuracy decreases as the variability of the phenotype increases. 4
© Copyright 2026 Paperzz