UK Biobank Phasing and Imputation Documentation

Version 1.2
13 November 2015

Documentation author: Jonathan Marchini
Department of Statistics, University of Oxford
on behalf of UK Biobank

Contributors to UK Biobank Phasing and Imputation

Jonathan Marchini (Statistics Dept, Oxford), Jared O’Connell (WTCHG, Oxford), Olivier Delaneau (University of Geneva), Kevin Sharp (Statistics Dept, Oxford), Warren Kretzschmar (WTCHG, Oxford), Gavin Band (WTCHG, Oxford), Shane McCarthy (WTSI, Hinxton), Desislava Petkova (WTCHG, Oxford), Claire Bycroft (WTCHG, Oxford), Colin Freeman (WTCHG, Oxford), Peter Donnelly (WTCHG, Oxford).

Table of Contents

Introduction
Phasing
    Filtering before phasing
    Phasing method description
    Validation of the phasing method
    Whole genome phasing
Genotype imputation
    Assessment of the UK Biobank Array for imputation
    Reference panel used for imputation
    Imputation method description
    Whole genome imputation
    Information scores, minor allele frequencies and filtering
    Imputed genotype files
        Sample files
    Differences between raw genotypes and imputed files
An exemplar genome wide association study
    Sample filtering
    Taking account of the different arrays used
    Association testing
    Results
File processing
References

Introduction

This document describes the analysis carried out to perform genotype imputation for the interim release of the UK Biobank (UKB) genotype data. It also provides advice about using the imputed data to carry out genome wide association studies (GWAS) or for extracting genotypes for use as covariates in other types of association study.

Genotype imputation [1,2] is the process of predicting genotypes that are not directly assayed in a sample of individuals. A reference panel of haplotypes at a dense set of SNPs, indels and structural variants is used to impute genotypes into a study sample of individuals that have been genotyped at a subset of the SNPs. These ‘in silico’ genotypes can then be used to boost the number of SNPs that can be tested for association. This increases the power of the study and the ability to resolve or fine-map the causal variants, and facilitates meta-analysis. The result of the imputation process is a dataset with 73,355,667 SNPs, short indels and large structural variants in 152,249 individuals. See Box 1 of [1] for a quick visual overview of how genotype imputation works.

The process of imputation is divided into two steps: (i) pre-phasing, and (ii) imputation. In the first step, the samples to be imputed are ‘pre-phased’, i.e. a statistical method is applied to genotype data to infer the underlying haplotypes of each individual. In the second step, a different statistical method is used to combine the inferred haplotypes with a reference panel of haplotypes and impute the unobserved genotypes in each sample. The following two sections of this document describe how the pre-phasing and imputation were carried out on the ~150,000 samples.

Phasing and imputation can be a computationally intensive process. To avoid many different research groups having to carry this out independently, phasing and imputation have been carried out centrally.

Questions about using the imputed genotypes should be sent to the UKB Genetics mail list set up for this purpose. You can subscribe to the mail list here:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=UKB-GENETICS

Phasing

Filtering before phasing

To create the input data for the phasing we applied SNP QC filters as described in the UK Biobank QC documentation [3]. The samples were genotyped on two slightly different chips. Approximately 50,000 were genotyped as part of the UK BiLEVE study using a chip designed for that study (denoted UKBL), with the remaining samples (~100,000) genotyped on the UKB chip. Therefore, we applied different missingness filters on SNPs dependent upon chip. SNPs were removed based on the number of batches in which they are completely missing:

  i.   SNPs on both the UKB chip and the UKBL chip: remove them if they are missing in more than 3 batches (out of 33 batches)
  ii.  SNPs on the UKB chip and not the UKBL chip: remove them if they are missing in more than 2 batches (out of 22 batches)
  iii. SNPs on the UKBL chip and not the UKB chip: remove them if they are missing in more than 1 batch (out of 11 batches)

1,037 sample outliers [3] were removed. Multi-allelic SNPs and SNPs with a minor allele frequency (MAF) < 1% were then removed from the dataset. These filters resulted in a dataset with 641,018 autosomal SNPs in 152,256 samples. Chromosome X phasing and imputation will be carried out at a later date.

Phasing method description

Phasing on the autosomes was carried out using a version of the SHAPEIT2 [4] program modified to allow for very large sample sizes. This new method (which we refer to as SHAPEIT3) modifies SHAPEIT2’s surrogate family approach to remove a quadratic complexity component of the algorithm [5]. In small sample sizes of a few thousand samples, this part of the algorithm, which involves calculating Hamming distances between current haplotype estimates, contributes only a relatively small part of the computational cost. As sample sizes increase beyond 10,000 samples, this component becomes significant. The new algorithm uses a divisive clustering algorithm to identify clusters of haplotypes, and then calculates Hamming distances only between pairs of haplotypes within each cluster. Only haplotypes within each cluster are used as candidates for the surrogate family copying states in the HMM model. The resulting algorithm has complexity O(N log N), where N is the number of haplotypes in the dataset being phased. In practice, we have observed that the method exhibits scaling close to linear. This is a crucial feature of the method, especially for very large sample sizes, and a property not shared by other approaches [6,7]. The development of this approach is ongoing and there is substantial scope to make further improvements in speed and accuracy. A newer version is likely to offer an order of magnitude reduction in run time.
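
To see why restricting Hamming distance calculations to clusters removes the quadratic term, the following sketch counts haplotype pairs with and without clustering. It is an illustration only and not the SHAPEIT3 implementation: the function names and the fixed cluster size of 1,000 are assumptions made for the example, and the clusters produced by a real divisive clustering step would vary in size.

```python
def all_pairs(n_haplotypes: int) -> int:
    """Number of unordered haplotype pairs when every pair is compared."""
    return n_haplotypes * (n_haplotypes - 1) // 2


def within_cluster_pairs(n_haplotypes: int, cluster_size: int) -> int:
    """Number of unordered pairs when comparisons stay inside clusters.

    Haplotypes are split into clusters of at most `cluster_size`, a
    hypothetical fixed size chosen purely for illustration.
    """
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    full_clusters, remainder = divmod(n_haplotypes, cluster_size)
    return full_clusters * all_pairs(cluster_size) + all_pairs(remainder)


if __name__ == "__main__":
    # 300,000 is roughly two haplotypes per sample for ~150,000 samples.
    for n in (2_000, 20_000, 300_000):
        print(n, all_pairs(n), within_cluster_pairs(n, cluster_size=1_000))
```

With a fixed cluster size the within-cluster count grows linearly in the number of haplotypes, whereas the all-pairs count grows quadratically. The cost of the clustering step itself is not modelled here.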

Validation of the phasing method

The accuracy of this new method was assessed by taking advantage of 72 mother-father-child trios that were identified in the UKB dataset [3]. This family information can be used to infer the phase of a large number of SNPs in the trio parents. These family-inferred haplotypes were used as a truth set, as is common in the phasing literature [4]. The parents of each trio were removed from the dataset and then haplotypes were estimated across chromosome 20 in a single run of SHAPEIT3. This dataset consisted of 16,762 autosomal SNPs. The inferred haplotypes were then compared to the truth set using the switch error metric [4]. We obtained an exceptionally low switch error rate of 0.4% across the trio children reporting British ancestry. By adjusting parameters of the method we have observed switch error rates lower than 0.3%.

With switch error rates this low, long chunks of sequence of many megabases will be inferred correctly. Downstream imputation from such haplotypes will be highly accurate.

To assess the performance gain of phasing all 152,112 samples together, versus phasing in smaller subsets of samples, two other test datasets of size 1,072 and 10,072 samples were created, also containing the trio children. The results are shown in full detail in Table 1 and highlight the benefits of joint phasing of all the samples. These results clearly demonstrate the close to linear scaling of the SHAPEIT3 algorithm.

Sample size  Method    Switch error (%)  Run time (hrs)  Run time scaling  Sample size scaling  Threads
1,072        SHAPEIT3  2.6               0.25            1                 1                    10
10,072       SHAPEIT3  1.3               2.5             10                9.4                  10
152,112      SHAPEIT3  0.4               38.5            154               142                  10

Table 1: Phasing performance on UKB samples.

Whole genome phasing

Phasing was carried out in chunks of 5,000 SNPs, with an overlap of 250 SNPs between chunks. SHAPEIT3 was run on each chunk using 4 cores per job and S=200 copying states. Any remaining missing genotypes were imputed as part of the phasing process. Chunks were ligated using a modified version of the hapfuse program.
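
The chunking scheme can be sketched as follows. This is an illustration rather than the pipeline that was actually run: the function name is hypothetical, and it assumes that "overlap" means consecutive chunks share their 250 boundary SNPs, which the text does not spell out.

```python
from typing import Iterator, Tuple


def snp_chunks(n_snps: int, chunk_size: int = 5000,
               overlap: int = 250) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) SNP index ranges covering 0..n_snps.

    Consecutive ranges share `overlap` SNPs (an assumed reading of the
    overlap described in the text).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    start = 0
    while start < n_snps:
        end = min(start + chunk_size, n_snps)
        yield start, end
        if end == n_snps:
            break  # the final chunk may be shorter than chunk_size
        start += step


if __name__ == "__main__":
    # 16,762 is the number of chromosome 20 SNPs quoted in the validation above.
    for start, end in snp_chunks(16_762):
        print(start, end)
```

Each chunk could then be phased as an independent job, with the shared SNPs giving a ligation step the information it needs to join neighbouring chunks consistently.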

Genotype imputation

Assessment of the UK Biobank Array for imputation

The UK Biobank Axiom array from Affymetrix was specifically designed to optimize imputation performance in GWAS studies [8]. An experiment was carried out to assess the imputation performance of the array, stratified by allele frequency, and to compare performance to some other commercially available arrays.

Performance was assessed using high-coverage, whole-genome sequence data made publicly available by Complete Genomics (CG).

Data from 10 samples from the European ancestry (CEU) population was used. All variant sites with a call rate below 90% were filtered out in order to only consider very reliable sites in the analysis. Only data from chromosome 20 was used.

To mimic a typical imputation analysis, a pseudo-GWAS dataset was constructed by extracting the CG SNP genotypes at all the sites included on a given array. All sites not on the array were then imputed using the UK10K reference panel [9]. Imputation was carried out using IMPUTE2 [10], which chooses a custom reference panel for each study individual in each 1Mb segment of the genome. The khap parameter of IMPUTE2 was set to 1,000. All other parameters were set to default values. This experiment was repeated for 4 different genome-wide SNP arrays: (a) Affymetrix UK Biobank Axiom array, (b) Illumina Omni 2.5M array, (c) Illumina Omni 1M Quad, (d) Illumina Omni Express.

Variants were stratified into allele frequency bins and the squared correlation (R^2) was calculated between the allele dosages at variants in each bin and the masked CG genotypes. Since different arrays contain different numbers of variants it is important to make sure that imputation performance is measured at the same set of variants when comparing chips. To achieve this, both imputed and array variants were included in the R^2 analysis, so that the comparison measures the overall performance of each array. As a consequence, an array with more variants will gain an advantage, as it is reasonable to expect that directly genotyping a variant will yield more accurate genotypes than imputation. Figure 1 shows the results of this analysis. The x-axis is non-reference allele frequency (%) on a log scale, which focuses in on rarer variants. The y-axis is imputation performance (R^2).

The salient points are

  a. The UK Biobank chip (purple) outperforms the Illumina Omni 1M Quad (blue) and Illumina Omni Express (green), both of which have comparable numbers of variants.
  b. The UK Biobank chip performs almost as well as the Illumina 2.5M chip (red), which has ~3 times the number of SNPs. It is worth noting that the UKB chip and Illumina Omni 2.5M chip are very close in the 1-5% range, a likely consequence of the chip design process focusing in part on this frequency range [8].

The overall conclusion of this analysis is that the Affymetrix UKB array is a very good array from which to carry out genotype imputation. The caveat is that this analysis is focused on samples with European ancestry.

[Figure 1 plot. Title: Genotyping accuracy after imputation from UK10k (7562 haplotypes). Samples: 10 EUR CG2. Comparison at 219303 sites on chr20 (includes genotyped SNPs). Allele frequency calculated from reference panel. x-axis: non reference allele frequency (%), 0.02 to 100. y-axis: Aggregate R2, 0.0 to 1.0. Legend (genotyping array): Illumina Omni 2.5M, Illumina Omni 1M Quad, Illumina Omni Express, Affy UK Biobank.]

Figure 1: Comparison of imputation performance of the UK Biobank Array and several other commercially available genotyping arrays.

Reference panel used for imputation

There are a number of factors that influence the accuracy of genotype imputation [1], but generally accuracy will increase as the number of haplotypes in the reference panel grows and if the ancestry of the sample haplotypes is a good match to the ancestry of the reference panel haplotypes. The UKB dataset consists of samples with a diverse range of ancestries, but with the majority of samples having British (or European) ancestry. For this reason it was desirable to use a reference panel with a large number of haplotypes with British and European ancestry, and also a diverse set of haplotypes from other world-wide populations. To achieve this the UK10K haplotype reference panel was merged together with the 1000 Genomes Phase 3 reference panel using the -merge_ref_panels option in the IMPUTE2 software (link).
This merged panel has been shown to be a high-quality reference panel for imputation [9]. An advantage of this reference panel is that it includes SNPs, short indels and larger structural variants. The reference panel consists of 87,696,888 bi-allelic variants in 12,570 haplotypes.

Imputation method description

Imputation was carried out using the same algorithm as is implemented in the IMPUTE2 program. The current IMPUTE2 program is a very flexible tool for phasing and imputation that implements a general set of options. A new C++ program was written from scratch to focus exclusively on the haploid imputation needed when samples have been pre-phased. This new version is both memory and computationally efficient compared to IMPUTE2. The method takes advantage of high correlations between inferred copying states in the HMM to reduce computation. We refer to this program as IMPUTE3.

Whole genome imputation

Imputation was carried out in chunks of 2Mb with a 250kb buffer region. A set of 2,000 haplotype copying states was used to impute each sample. Imputed variants in each non-overlapping part of each chunk were concatenated into per-chromosome files.

Information scores, minor allele frequencies and filtering

QCTOOL was used to calculate the minor allele frequency (MAF) and imputation information score of each imputed variant. The imputation information is a metric between 0 and 1. A value of 1 indicates that there is no uncertainty in the imputed genotypes whereas a value of 0 means that there is complete uncertainty about the genotypes. A value of α in a sample of N individuals indicates that the amount of data at the imputed SNP is approximately equivalent to a set of perfectly observed genotype data in a sample size of αN.

Many GWAS carried out to date have used filters on MAF and information score by applying a threshold on these metrics. There is no single correct threshold to use. However, as MAF decreases it is generally the case that imputation quality decreases.
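
The αN interpretation of the information score can be turned into a small calculation, which helps when judging what a given threshold means at a given study size. The sketch below is illustrative only: the function name is hypothetical, and it uses the rounded figure of 150,000 samples rather than the exact sample count.

```python
def effective_sample_size(info: float, n_samples: int) -> float:
    """Approximate number of perfectly genotyped samples equivalent to
    `n_samples` imputed samples at a variant with information score `info`."""
    if not 0.0 <= info <= 1.0:
        raise ValueError("information score must lie between 0 and 1")
    return info * n_samples


if __name__ == "__main__":
    for info in (0.3, 0.5, 0.9):
        print(info, round(effective_sample_size(info, 150_000)))
```

The same score of 0.3 in a study of 2,000 samples corresponds to roughly 600 effectively observed samples, which is why a fixed threshold has very different consequences at different sample sizes.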

Previous studies have tended to use a filter on information between 0.3 and 0.5. Since these studies have typically consisted of hundreds or low thousands of samples, an information of 0.3 corresponds to an effective sample size with limited power to detect associations. However, the UK Biobank dataset is considerably larger in size than most previous GWAS. An information measure of 0.3 in ~150,000 samples roughly corresponds to an effective sample size of ~45,000, which would be expected to yield very good power to detect association.

Some variants are imputed as monomorphic, or close to monomorphic, i.e. no or almost no variation in the genotypes. Such sites were removed using QCTOOL with a filter on MAF of 0.001%. In addition, 7 samples were removed from the dataset due to these individuals having requested their data be removed from the study. The resulting dataset consists of 73,355,667 variants in 152,249 individuals.

The distribution of information scores at these 73,355,667 variants is shown in Figure 2 (a). Plots stratified by MAF are also shown: (b) MAF >= 5%, (c) 1% <= MAF < 5%, (d) 0.1% <= MAF < 1%, (e) 0.01% <= MAF < 0.1%, (f) 0.001% <= MAF < 0.01%.

[Figure 2 panels (histograms; x-axis: Information, 0.0 to 1.0; y-axis: Frequency): (a) All variants; (b) MAF >= 5%: #SNPs = 7011470; (c) 1% <= MAF < 5%: #SNPs = 2889302; (d) 0.1% <= MAF < 1%: #SNPs = 10051623; (e) 0.01% <= MAF < 0.1%: #SNPs = 26262886; (f) 0.001% <= MAF < 0.01%: #SNPs = 26140277.]

Figure 2: Distribution of information scores at variants in the imputed dataset. The x-axis shows the information score on the scale 0 to 1.

Imputed genotype files

Let $G_{ij}$ denote the genotype of the ith sample at the jth variant. The process of genotype imputation produces a probability distribution for each genotype, i.e.

$$p_{ij0} = P(G_{ij} = AA)$$
$$p_{ij1} = P(G_{ij} = AB)$$
$$p_{ij2} = P(G_{ij} = BB)$$

where A and B are the two alleles at the variant. This probability triple (which sums to 1) is provided in the imputed genotype files for each imputed variant in all samples. SNP variants included in the phased dataset also occur in the imputed files in this format.

The imputed data is provided in a compressed binary BGEN file format. The BGEN file format is a binary version of the GEN file format.

The BGEN file format was chosen to provide good compression of the imputed data and ease of use for genetic association testing against traits and phenotypes. For example, commonly used programs such as SNPTEST and PLINK already read BGEN files, and QCTOOL can be used to filter, summarize, manipulate and convert the files to other formats.

The format stores one variant at a time (i.e. per row). As MAF decreases more compression is possible due to increased similarity between imputed genotypes across samples. The total size of the UKB Interim release dataset is 1.3Tb, with each chromosome file ranging in size from 20Gb to 109Gb. As the file format is binary the files are not viewable in normal text editors. Later in this document there is advice and guidance on working with these files.

The files are named as

  chrNimpv1.bgen

where N is the number of the autosome (N=1,…,22).

RSIDs were added into the BGEN files for as many variants as possible using RSID lists available from the UK10K website and the 1000 Genomes website.

RSIDs are useful, unique identifiers of SNPs and other variants and can be looked up in the dbSNP database. When researchers report associations of variants with diseases and traits they normally report the results using the RSID.

Variant positions are reported in Genome Reference Consortium Human genome build 37 co-ordinates (GRCh37).

Sample files

In addition to the 22 autosomal BGEN files, there is a file called

  impv1.sample

This file (referred to as the ‘sample file’) is the part of the BGEN file format that stores information about each sample in the dataset. The format of this file is described on the GEN file format web page.
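
As a convenience, the expected set of release files can be enumerated from the naming scheme above, for example to check that a download is complete. This is a minimal sketch: the helper name and the directory argument are hypothetical, and only the chrNimpv1.bgen and impv1.sample names come from this document.

```python
from pathlib import Path
from typing import List, Tuple


def release_files(data_dir: str) -> Tuple[List[Path], Path]:
    """Return the 22 autosomal BGEN paths and the shared sample file path."""
    root = Path(data_dir)
    bgen_files = [root / f"chr{n}impv1.bgen" for n in range(1, 23)]
    sample_file = root / "impv1.sample"
    return bgen_files, sample_file


if __name__ == "__main__":
    # "ukb_interim" is a hypothetical local directory name.
    bgen_files, sample_file = release_files("ukb_interim")
    missing = [p for p in [*bgen_files, sample_file] if not p.exists()]
    print(f"{len(missing)} of {len(bgen_files) + 1} expected files are missing")
```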

The sample file has two header lines, followed by 1 line for each individual in the BGEN file. The order of the individuals in the sample file matches the order of the individuals in the BGEN file. The order is important. Programs that read bgen/sample pairs assume that the order matches between the files.

The sample file can be used to store information about each individual, i.e. phenotypes and covariates. If phenotypes and covariates are added into the sample file then SNPTEST can be used to carry out association testing at each variant. Care should be taken in making sure that such information is correctly added to sample files. The format allows discrete and continuous phenotypes and covariates, as well as missing values (see file format web page link above).

Differences between raw genotypes and imputed files

SNPs below 1% MAF were filtered out before the phasing step; however, many of these SNPs will have been imputed. Therefore these SNPs will appear in both the raw genotype files and the imputed files, but may have different genotypes. As such, researchers should not be surprised if the results of analysis at these SNPs differ depending upon which files are used.

An exemplar genome wide association study

A GWAS for the phenotype of height was carried out to assess the use of the UK Biobank genetic data as a resource for genetic association studies. There are already a substantial number of replicated associations [11]. The purpose of this analysis was not to report new associations, but rather to check that a reasonably standard GWAS pipeline produced valid results.

Sample filtering

Principal component analysis and the self-declared ethnicity were used to derive a “White British” subset of samples. In addition, samples were excluded if

  (a) they had at least one related sample,
  (b) their genetically inferred gender did not match the self-reported gender, or
  (c) they were among ~500 extreme outliers [3].

These filters resulted in a dataset with 112,338 samples.

Taking account of the different arrays used

Some SNPs are only included on one of the UKB or UKBL arrays. At such SNPs, missing genotypes will have been imputed as part of the phasing process, so that the genotypes at these SNPs will consist of a mixture of genotyped and imputed values. This can lead to bias in association testing if there is some correlation between the phenotype and which array a sample was assayed on. The samples involved in the UKBL study were selected based on phenotypes associated with lung function [12], thus it may be possible for such associations to occur. There are at least 2 solutions to ameliorate any possible confounding due to array:

  a. carry out association tests conditioning on a binary indicator of array.
  b. carry out separate tests of association in UKB samples and UKBL samples and combine the results using meta-analysis.

Association testing

GWAS was performed at all variants using SNPTEST. An additive genetic model was fitted at each SNP, using gender, age, array and 10 principal components as covariates. That is, the example uses option (a) above.

The program option -method expected was used in the SNPTEST software, which converts the genotype probability triple to an expected genotype, $d_{ij}$ (often called the dosage), calculated as

$$d_{ij} = \sum_{k=0}^{2} k \, p_{ijk}$$

Results

The GWAS for height produced a substantial number of associated regions. These regions had a high correspondence to those genetic regions that have previously been replicated for height and described in the NHGRI GWAS Catalog [11]. The analysis suggested a significant number of novel loci could be identified. Figure 3 shows a plot of the -log10 p-values for the height and BMI scans on chromosome 4.

Figure 3: Chromosome 4 GWAS for height. The x-axis shows physical position. The y-axis is -log10 p-value for each tested variant. Variants on the array are shown as black dots, imputed variants are shown as grey dots. Reported associations from the NHGRI GWAS Catalog are shown as red crosses. The blue and red horizontal lines are drawn at a -log10 p-value of 5 and 7.5 respectively.

File processing

We recommend that researchers use the QCTOOL program to handle the BGEN files.
This program has options for extraction or removal of subsets of the data (SNPs and/or samples), and for file format conversion. See the QCTOOL examples page for information on command lines used to perform specific tasks.

The program SNPTEST can process BGEN files. It will automatically detect the BGEN file format if data files are named with the .bgen extension.

PLINK v1.9 can process BGEN files; at the time of writing BGEN files are specified using the --bgen option.

For further information on tools supporting the BGEN format, see the BGEN file format website.

References

1. Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 11, 499–511 (2010).
2. Howie, B., Fuchsberger, C., Stephens, M., Marchini, J. & Abecasis, G. R. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat. Genet. 44, 955–959 (2012).
3. The UK Biobank. UK Biobank Genotyping QC documentation. (2015).
4. Delaneau, O., Zagury, J.-F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods 10, 5–6 (2013).
5. O'Connell, J., Sharp, K., Delaneau, O. & Marchini, J. Haplotype estimation for biobank scale datasets. (2015) (submitted).
6. Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat. Genet. 40, 1068–1075 (2008).
7. Williams, A. L., Patterson, N., Glessner, J., Hakonarson, H. & Reich, D. Phasing of many thousands of genotyped samples. Am. J. Hum. Genet. 91, 238–251 (2012).
8. The UK Biobank Array Design Group. UK Biobank Axiom Array Content Summary. (2014).
9. Huang, J. et al. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel. Nature Communications 6, 8111 (2015).
10. Howie, B., Marchini, J. & Stephens, M. Genotype imputation with thousands of genomes. G3 (Bethesda) 1, 457–470 (2011).
11. Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucl. Acids Res. 42, D1001–6 (2014).
12. Wain, L. V. et al. Novel insights into the genetics of smoking behaviour, lung function, and chronic obstructive pulmonary disease (UK BiLEVE): a genetic association study in UK Biobank. Lancet Respir Med 3, 769–781 (2015).