http://biobank.ctsu.ox.ac.uk/crystal/docs/impute_ukb_v1.pdf

UK Biobank Phasing and Imputation Documentation
Version 1.2
13 November 2015
Documentation author: Jonathan Marchini
Department of Statistics, University of Oxford
on behalf of UK Biobank
Contributors to UK Biobank Phasing and Imputation
Jonathan Marchini (Statistics Dept, Oxford), Jared O’Connell (WTCHG, Oxford), Olivier
Delaneau (University of Geneva), Kevin Sharp (Statistics Dept, Oxford), Warren
Kretzschmar (WTCHG, Oxford), Gavin Band (WTCHG, Oxford), Shane McCarthy (WTSI,
Hinxton), Desislava Petkova (WTCHG, Oxford), Claire Bycroft (WTCHG, Oxford), Colin
Freeman (WTCHG, Oxford), Peter Donnelly (WTCHG, Oxford).
Table of Contents
Introduction
Phasing
  Filtering before phasing
  Phasing method description
  Validation of the phasing method
  Whole genome phasing
Genotype imputation
  Assessment of the UK Biobank Array for imputation
  Reference panel used for imputation
  Imputation method description
  Whole genome imputation
  Information scores, minor allele frequencies and filtering
  Imputed genotype files
    Sample files
  Differences between raw genotypes and imputed files
An exemplar genome-wide association study
  Sample filtering
  Taking account of the different arrays used
  Association testing
  Results
File processing
References
Introduction
This document describes the analysis carried out to perform genotype imputation for
the interim release of the UK Biobank (UKB) genotype data. It also provides advice
about using the imputed data to carry out genome-wide association studies (GWAS) or
for extracting genotypes for use as covariates in other types of association study.
Genotype imputation [1,2] is the process of predicting genotypes that are not directly
assayed in a sample of individuals. A reference panel of haplotypes at a dense set of
SNPs, indels and structural variants is used to impute genotypes into a study sample of
individuals that have been genotyped at a subset of the SNPs. These ‘in silico’
genotypes can then be used to boost the number of SNPs that can be tested for
association. This increases the power of the study and the ability to resolve or fine-map
the causal variants, and facilitates meta-analysis. The result of the imputation process is a
dataset with 73,355,667 SNPs, short indels and large structural variants in 152,249
individuals. See Box 1 of [1] for a quick visual overview of how genotype imputation
works.
The process of imputation is divided into two steps: (i) pre-phasing, and (ii) imputation.
In the first step, the samples to be imputed are ‘pre-phased’, i.e. a statistical method is
applied to genotype data to infer the underlying haplotypes of each individual. In the
second step, a different statistical method is used to combine the inferred haplotypes
with a reference panel of haplotypes and impute the unobserved genotypes in each
sample. The following two sections of this document describe how the pre-phasing and
imputation were carried out on the ~150,000 samples.
Phasing and imputation can be a computationally intensive process. To avoid many
different research groups having to carry this out independently, phasing and
imputation were carried out centrally.
Questions about using the imputed genotypes should be sent to the UKB Genetics mail
list set up for this purpose. You can subscribe to the mail list here:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=UKB-GENETICS
Phasing
Filtering before phasing
To create an input dataset for phasing, we applied SNP QC filters as described in the UK
Biobank QC documentation [3]. The samples were genotyped on two slightly different
chips. Approximately 50,000 were genotyped as part of the UK BiLEVE study using a
chip designed for that study (denoted UKBL), with the remaining samples (~100,000)
genotyped on the UKB chip. Therefore, we applied different missingness filters on SNPs
dependent upon chip. SNPs were removed based on the number of batches in which
they are completely missing:
i. SNPs on both the UKB chip and the UKBL chip: remove them if they are missing in more
than 3 batches (out of 33 batches)
ii. SNPs on the UKB chip and not the UKBL chip: remove them if they are missing
in more than 2 batches (out of 22 batches)
iii. SNPs on the UKBL chip and not the UKB chip: remove them if they are missing
in more than 1 batch (out of 11 batches)
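The three batch-missingness rules above can be written as a small lookup. The following is a minimal sketch, not the pipeline's actual code: the function name, the chip labels used as keys and the input representation (a set of chip names plus a count of completely missing batches) are all assumptions made for illustration.

```python
# Maximum number of batches in which a SNP may be completely missing,
# keyed by the chips that carry the SNP (thresholds taken from the list above).
MAX_MISSING_BATCHES = {
    ("UKB", "UKBL"): 3,  # on both chips, 33 batches in total
    ("UKB",): 2,         # UKB chip only, 22 batches
    ("UKBL",): 1,        # UKBL chip only, 11 batches
}


def keep_snp(chips, n_batches_completely_missing):
    """Return True if the SNP passes the batch-missingness filter.

    chips: iterable of chip labels carrying the SNP (hypothetical labels
           "UKB" and "UKBL").
    n_batches_completely_missing: number of batches with no calls at all.
    """
    key = tuple(sorted(chips))
    return n_batches_completely_missing <= MAX_MISSING_BATCHES[key]


print(keep_snp({"UKB", "UKBL"}, 3))  # missing in exactly 3 of 33 batches
print(keep_snp({"UKBL"}, 2))         # missing in 2 of 11 batches
```

A SNP is removed only when the count exceeds the threshold, matching the "more than" wording of each rule.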
1,037 sample outliers [3] were removed. Multi-allelic SNPs and SNPs with a minor allele
frequency (MAF) < 1% were then removed from the dataset. These filters resulted in a
dataset with 641,018 autosomal SNPs in 152,256 samples. Chromosome X phasing and
imputation will be carried out at a later date.
Phasing method description
Phasing on the autosomes was carried out using a version of the SHAPEIT2 [4]
program modified to allow for very large sample sizes. This new method (which we
refer to as SHAPEIT3) modifies SHAPEIT2’s surrogate family approach to remove a
quadratic complexity component of the algorithm [5]. In small sample sizes of a few
thousand samples, this part of the algorithm, which involves calculating Hamming
distances between current haplotype estimates, contributes only a relatively small
part of the computational cost. As sample sizes increase beyond 10,000 samples, this
component becomes significant. The new algorithm uses a divisive clustering algorithm
to identify clusters of haplotypes, and then calculates Hamming distances only
between pairs of haplotypes within each cluster. Only haplotypes within each cluster
are used as candidates for the surrogate family copying states in the HMM. The
resulting algorithm has complexity O(N log N), where N is the number of haplotypes in
the dataset being phased. In practice, we have observed that the method exhibits
scaling close to linear. This is a crucial feature of the method, especially for very large
sample sizes, and a property not shared by other approaches [6,7]. The development of
this approach is ongoing and there is substantial scope to make further improvements
in speed and accuracy. A newer version is likely to offer an order of magnitude
reduction in run time.
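To illustrate why restricting comparisons to clusters removes the quadratic cost, the sketch below computes Hamming distances only for pairs of haplotypes that share a cluster. It is not SHAPEIT3 code: the divisive clustering step is not shown, and the `clusters` argument (lists of haplotype indices) is assumed to be supplied by some earlier step.

```python
from itertools import combinations


def hamming(a, b):
    """Number of sites at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def within_cluster_distances(haplotypes, clusters):
    """Hamming distances for pairs of haplotypes that share a cluster.

    haplotypes: list of equal-length 0/1 sequences.
    clusters:   list of lists of haplotype indices (assumed to be given;
                the clustering itself is outside this sketch).
    """
    distances = {}
    for members in clusters:
        for i, j in combinations(sorted(members), 2):
            distances[(i, j)] = hamming(haplotypes[i], haplotypes[j])
    return distances


haps = [[0, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
d = within_cluster_distances(haps, [[0, 1], [2, 3]])
print(len(d), "within-cluster pairs instead of", 4 * 3 // 2, "all-pairs comparisons")
```

With clusters of bounded size, the number of comparisons grows with the number of clusters rather than with the square of the number of haplotypes.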
Validation of the phasing method
The accuracy of this new method was assessed by taking advantage of 72 mother-father-child
trios that were identified in the UKB dataset [3]. This family information can
be used to infer the phase of a large number of SNPs in the trio parents. These
family-inferred haplotypes were used as a truth set, as is common in the phasing literature [4].
The parents of each trio were removed from the dataset and then haplotypes were
estimated across chromosome 20 in a single run of SHAPEIT3. This dataset consisted of
16,762 autosomal SNPs. The inferred haplotypes were then compared to the truth set
using the switch error metric [4]. We obtained an exceptionally low switch error rate of
0.4% across the trio children reporting British ancestry. By adjusting parameters of the
method we have observed switch error rates lower than 0.3%.
With switch error rates this low, long chunks of sequence of many megabases will be
inferred correctly. Downstream imputation from such haplotypes will be highly
accurate.
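For readers unfamiliar with the switch error metric, here is a minimal sketch of one common way to compute it for a single individual. It assumes haplotypes coded as 0/1 lists and that the estimated genotypes agree with the truth at every site; the function name and input layout are illustrative, not taken from the SHAPEIT software.

```python
def switch_error_rate(true_h1, true_h2, est_h1, est_h2):
    """Fraction of consecutive heterozygous site pairs whose relative phase
    differs between the estimated and true haplotypes."""
    orientation = []
    for t1, t2, e1, e2 in zip(true_h1, true_h2, est_h1, est_h2):
        if sorted((t1, t2)) != sorted((e1, e2)):
            raise ValueError("estimated genotype differs from the truth set")
        if t1 == t2:
            continue  # homozygous sites carry no phase information
        # True if the first estimated haplotype carries the allele of the
        # first true haplotype at this heterozygous site.
        orientation.append(e1 == t1)
    if len(orientation) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(orientation, orientation[1:]))
    return switches / (len(orientation) - 1)


truth = ([0, 1, 0, 1, 0], [1, 0, 1, 0, 1])
estimate = ([0, 1, 1, 0, 1], [1, 0, 0, 1, 0])
print(switch_error_rate(*truth, *estimate))  # one switch among four intervals: 0.25
```

A single switch flips the phase of everything downstream of it, which is why low switch error rates translate into long correctly phased stretches.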
To assess the performance gain of phasing all 152,112 samples together versus
phasing in smaller subsets of samples, two other test datasets of size 1,072 and 10,072
samples were created, also containing the trio children. The results are shown in full
detail in Table 1 and highlight the benefits of joint phasing of all the samples. These
results clearly demonstrate the close to linear scaling of the SHAPEIT3 algorithm.
Sample size  Method    Switch error (%)  Run time (hrs)  Run time scaling  Sample size scaling  Threads
1,072        SHAPEIT3  2.6               0.25            1                 1                    10
10,072       SHAPEIT3  1.3               2.5             10                9.4                  10
152,112      SHAPEIT3  0.4               38.5            154               142                  10
Table 1: Phasing performance on UKB samples.
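The two scaling columns of Table 1 are ratios relative to the smallest dataset, and can be recomputed from the sample sizes and run times as a quick check. The sketch below uses only the figures in the table; the function name is illustrative.

```python
def scaling(rows):
    """Return (sample_size, run_time_ratio, sample_size_ratio) per row,
    with both ratios taken relative to the first row."""
    base_n, base_hours = rows[0]
    return [
        (n, round(hours / base_hours, 1), round(n / base_n, 1))
        for n, hours in rows
    ]


# (sample size, run time in hours) from Table 1
table1 = [(1_072, 0.25), (10_072, 2.5), (152_112, 38.5)]
for row in scaling(table1):
    print(row)
```

Run time growing by a factor of 154 while the sample size grows by a factor of about 142 is what the text describes as close to linear scaling.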
Whole genome phasing
Phasing was carried out in chunks of 5,000 SNPs, with an overlap of 250 SNPs between
chunks. SHAPEIT3 was run on each chunk using 4 cores per job and S=200 copying
states. As part of the phasing process, any remaining missing genotypes were imputed.
Chunks were ligated using a modified version of the hapfuse
program.
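The chunking scheme can be sketched as follows. This is an illustration only: the exact chunk boundaries used in the release, and in particular how the final, shorter chunk was handled, are assumptions here.

```python
def snp_chunks(n_snps, chunk_size=5000, overlap=250):
    """Return half-open (start, end) SNP index ranges such that consecutive
    chunks share `overlap` SNPs."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, n_snps)
        chunks.append((start, end))
        if end == n_snps:
            break
        start += step
    return chunks


print(snp_chunks(12_000))  # [(0, 5000), (4750, 9750), (9500, 12000)]
```

The shared SNPs at each boundary are what allow neighbouring chunks to be ligated into chromosome-length haplotypes with a consistent phase.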
Genotype imputation
Assessment of the UK Biobank Array for imputation
The UK Biobank Axiom array from Affymetrix was specifically designed to optimize
imputation performance in GWAS [8]. An experiment was carried out to assess
the imputation performance of the array, stratified by allele frequency, and to compare
performance to some other commercially available arrays.
Performance was assessed using high-coverage, whole-genome sequence data made
publicly available by Complete Genomics (CG).
Data from 10 samples from the European ancestry (CEU) population were used. All
variant sites with a call rate below 90% were filtered out in order to only consider very
reliable sites in the analysis. Only data from chromosome 20 were used.
To mimic a typical imputation analysis, a pseudo-GWAS dataset was constructed by
extracting the CG SNP genotypes at all the sites included on a given array. All sites not
on the array were then imputed using the UK10K reference panel [9]. Imputation was
carried out using IMPUTE2 [10], which chooses a custom reference panel for each study
individual in each 1Mb segment of the genome. The khap parameter of IMPUTE2 was
set to 1,000. All other parameters were set to default values. This experiment was
repeated for 4 different genome-wide SNP arrays: (a) Affymetrix UK Biobank Axiom
array, (b) Illumina Omni 2.5M array, (c) Illumina Omni 1M Quad, (d) Illumina Omni
Express.
Variants were stratified into allele frequency bins and the squared correlation (R2) was
calculated between the allele dosages at variants in each bin and the masked CG
genotypes. Since different arrays contain different numbers of variants, it is important
to make sure that imputation performance is measured at the same set of variants
when comparing chips. To achieve this, both imputed and array variants were included
in the R2 analysis, so that the comparison measures the overall performance of each
array. As a consequence, an array with more variants will gain an advantage, as it is
reasonable to expect that directly genotyping a variant will yield more accurate
genotypes than imputation. Figure 1 shows the results of this analysis. The x-axis is
non-reference allele frequency (%) on a log scale, which focuses in on rarer variants.
The y-axis is imputation performance (R2).
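As a sketch of the accuracy measure, the function below computes the squared Pearson correlation between imputed allele dosages and the masked true genotypes pooled within one frequency bin. Pooling all genotypes of a bin into two flat lists is an assumption about how the aggregate R2 was formed; the numbers in the example are made up for illustration.

```python
def squared_correlation(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages (0 to 2) and
    true genotypes (0, 1 or 2). Returns NaN if either input is constant."""
    n = len(dosages)
    if n != len(genotypes) or n == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_d = sum(dosages) / n
    mean_g = sum(genotypes) / n
    s_dg = sum((d - mean_d) * (g - mean_g) for d, g in zip(dosages, genotypes))
    s_dd = sum((d - mean_d) ** 2 for d in dosages)
    s_gg = sum((g - mean_g) ** 2 for g in genotypes)
    if s_dd == 0 or s_gg == 0:
        return float("nan")
    return s_dg * s_dg / (s_dd * s_gg)


# Illustrative values only: imputed dosages against masked true genotypes.
print(squared_correlation([0.1, 0.9, 1.8, 1.2], [0, 1, 2, 1]))
```

Directly genotyped variants contribute dosages that equal the true genotypes, which is why including array variants favours denser arrays.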
The salient points are:
a. The UK Biobank chip (purple) outperforms the Illumina Omni 1M Quad (blue)
and Illumina Omni Express (green), both of which have comparable numbers of
variants.
b. The UK Biobank chip performs almost as well as the Illumina Omni 2.5M chip (red),
which has ~3 times the number of SNPs. It is worth noting that the UKB chip and
Illumina Omni 2.5M chip are very close in the 1-5% range. This is a likely consequence
of the chip design process focusing in part on this frequency range [8].
The overall conclusion of this analysis is that the Affymetrix UKB array is a very good
array from which to carry out genotype imputation. The caveat is that this analysis is
focused on samples with European ancestry.
[Figure 1 plot. Title: Genotyping accuracy after imputation from UK10k (7562 haplotypes). Samples: 10 EUR CG2. Comparison at 219303 sites on chr20 (includes genotyped SNPs). Allele frequency calculated from reference panel. x-axis: non reference allele frequency (%), log scale from 0.02 to 100. y-axis: Aggregate R2, from 0.0 to 1.0. Legend (Genotyping array): Illumina Omni 2.5M, Illumina Omni 1M Quad, Illumina Omni Express, Affy UK Biobank.]
Figure 1: Comparison of imputation performance of the UK Biobank Array and several
other commercially available genotyping arrays.
Reference panel used for imputation
There are a number of factors that influence the accuracy of genotype imputation [1],
but generally accuracy will increase as the number of haplotypes in the reference panel
grows and if the ancestry of the sample haplotypes is a good match to the ancestry of
the reference panel haplotypes. The UKB dataset consists of samples with a diverse
range of ancestries, but with the majority of samples having British (or European)
ancestry. For this reason it was desirable to use a reference panel with a large number
of haplotypes with British and European ancestry, and also a diverse set of haplotypes
from other world-wide populations. To achieve this, the UK10K haplotype reference
panel was merged together with the 1000 Genomes Phase 3 reference panel using the
-merge_ref_panels option in the IMPUTE2 software.
This merged panel has been shown to be a high-quality reference panel for
imputation [9]. An advantage of this reference panel is that it includes SNPs, short indels
and larger structural variants. The reference panel consists of 87,696,888 bi-allelic
variants in 12,570 haplotypes.
Imputation method description
Imputation was carried out using the same algorithm as is implemented in the
IMPUTE2 program. The current IMPUTE2 program is a very flexible tool for phasing and
imputation that implements a general set of options. A new C++ program was written
from scratch to focus exclusively on the haploid imputation needed when samples have
been pre-phased. This new version is more efficient than IMPUTE2 in both memory
and computation. The method takes advantage of high correlations between
inferred copying states in the HMM to reduce computation. We refer to this program
as IMPUTE3.
Whole genome imputation
Imputation was carried out in chunks of 2Mb with a 250kb buffer region. A set of 2,000
haplotype copying states was used to impute each sample. Imputed variants in each
non-overlapping part of each chunk were concatenated into per-chromosome files.
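The relationship between the 2Mb chunks and their buffers can be sketched as below. This is an illustration under stated assumptions: coordinates are 0-based and half-open, and the buffer is applied to both sides of each chunk, neither of which is specified in the text.

```python
def imputation_windows(chrom_length, core=2_000_000, buffer=250_000):
    """Return (core_start, core_end, buffered_start, buffered_end) tuples.

    The core regions tile the chromosome without overlap; only variants in
    the core region of each window would be kept when concatenating.
    """
    windows = []
    for core_start in range(0, chrom_length, core):
        core_end = min(core_start + core, chrom_length)
        windows.append((
            core_start,
            core_end,
            max(core_start - buffer, 0),
            min(core_end + buffer, chrom_length),
        ))
    return windows


for window in imputation_windows(5_000_000):
    print(window)
```

The buffer gives the HMM flanking data at chunk edges, while discarding buffer variants on output avoids duplicates in the per-chromosome files.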
Information scores, minor allele frequencies and filtering
QCTOOL was used to calculate the minor allele frequency (MAF) and imputation
information score of each imputed variant. The imputation information score is a metric
between 0 and 1. A value of 1 indicates that there is no uncertainty in the imputed
genotypes, whereas a value of 0 means that there is complete uncertainty about the
genotypes. A value of α in a sample of N individuals indicates that the amount of data
at the imputed SNP is approximately equivalent to a set of perfectly observed genotype
data in a sample size of αN.
Many GWAS carried out to date have used filters on MAF and information score by
applying a threshold on these metrics. There is no single correct threshold to use.
However, as MAF decreases it is generally the case that imputation quality decreases.
Previous studies have tended to use a filter on information between 0.3 and 0.5. Since
these studies have typically consisted of hundreds or low thousands of samples, an
information of 0.3 corresponds to an effective sample size with limited power to detect
associations. However, the UK Biobank dataset is considerably larger in size than most
previous GWAS. An information measure of 0.3 in ~150,000 samples roughly
corresponds to an effective sample size of ~45,000, which would be expected to yield
very good power to detect association.
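The effective sample size argument is simple arithmetic, shown here as a short sketch. The function name is illustrative, and the 2,000-sample study is a made-up stand-in for the "low thousands" case mentioned above, not a figure from this document.

```python
def effective_sample_size(info, n_samples):
    """Approximate number of perfectly genotyped samples that carry the same
    amount of data as n_samples imputed with information score `info`."""
    if not 0.0 <= info <= 1.0:
        raise ValueError("information score must lie between 0 and 1")
    return info * n_samples


# Same information score, very different effective sample sizes.
for n in (2_000, 150_000):
    print(n, round(effective_sample_size(0.3, n)))
```

The same threshold therefore removes far more usable signal from a large cohort than from a small one, which is the reason a lower information filter can be reasonable for UK Biobank.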
Some variants are imputed as monomorphic, or close to monomorphic, i.e. with no or almost
no variation in the genotypes. Such sites were removed with QCTOOL using a filter on
MAF of 0.001%. In addition, 7 samples were removed from the dataset due to these
individuals having requested their data be removed from the study. The resulting
dataset consists of 73,355,667 variants in 152,249 individuals.
The distribution of information scores at these 73,355,667 variants is shown in Figure 2
(a). Plots stratified by MAF are also shown: (b) MAF >= 5%, (c) 1% <= MAF < 5%, (d)
0.1% <= MAF < 1%, (e) 0.01% <= MAF < 0.1%, (f) 0.001% <= MAF < 0.01%.
[Figure 2 plot: six histograms, each with x-axis Information (0.0 to 1.0) and y-axis Frequency. Panels: (a) All variants; (b) MAF >= 5% : #SNPs = 7011470; (c) 1% <= MAF < 5% : #SNPs = 2889302; (d) 0.1% <= MAF < 1% : #SNPs = 10051623; (e) 0.01% <= MAF < 0.1% : #SNPs = 26262886; (f) 0.001% <= MAF < 0.01% : #SNPs = 26140277.]
Figure 2: Distribution of information scores at variants in the imputed dataset. The
x-axis shows the information score on the scale 0 to 1.
Imputed genotype files
Let G_{ij} denote the genotype of the ith sample at the jth variant. The process of genotype
imputation produces a probability distribution for each genotype, i.e.
p_{ij0} = P(G_{ij} = AA)
p_{ij1} = P(G_{ij} = AB)
p_{ij2} = P(G_{ij} = BB)
where A and B are the two alleles at the variant. This probability triple (which sums to
1) is provided in the imputed genotype files for each imputed variant in all samples.
SNP variants included in the phased dataset also occur in the imputed files in this
format.
The imputed data is provided in a compressed binary BGEN file format. The BGEN file
format is a binary version of the GEN file format.
The BGEN file format was chosen to provide good compression of the imputed data
and ease of use for genetic association testing against traits and phenotypes. For
example, commonly used programs such as SNPTEST and PLINK already read BGEN
files, and QCTOOL can be used to filter, summarize, manipulate and convert the files to
other formats.
The format stores one variant at a time (i.e. per row). As MAF decreases, more
compression is possible due to increased similarity between imputed genotypes across
samples. The total size of the UKB interim release dataset is 1.3Tb, with each
chromosome file ranging in size from 20Gb to 109Gb. As the file format is binary, the
files are not viewable in normal text editors. Later in this document there is advice and
guidance on working with these files.
The files are named as
chrNimpv1.bgen
where N is the number of the autosome (N = 1, ..., 22).
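When scripting over the release, the 22 file names can be generated from the naming pattern above. A minimal sketch (the variable name is arbitrary, and any directory prefix would depend on where the files were downloaded):

```python
# One BGEN file per autosome, following the chrNimpv1.bgen pattern.
bgen_files = [f"chr{n}impv1.bgen" for n in range(1, 23)]

print(bgen_files[0], "...", bgen_files[-1])
```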
RSIDs were added into the BGEN files for as many variants as possible using
RSID lists available from the UK10K website and the 1000 Genomes website.
RSIDs are useful, unique identifiers of SNPs and other variants and can be looked up in
the dbSNP database. When researchers report associations of variants with diseases
and traits they normally report the results using the RSID.
Variant positions are reported in Genome Reference Consortium Human genome build
37 co-ordinates (GRCh37).
Sample files
In addition to the 22 autosomal BGEN files, there is a file called impv1.sample.
This file (referred to as the ‘sample file’) is the part of the BGEN file format that stores
information about each sample in the dataset. The format of this file is described on
the GEN file format webpage.
The sample file has two header lines, followed by 1 line for each individual in the BGEN
file. The order of the individuals in the sample file matches the order of the individuals
in the BGEN file. The order is important. Programs that read bgen/sample pairs assume
that the order matches between the files.
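As an illustration of the layout just described (two header lines, then one line per individual in BGEN order), the sketch below splits a sample file into its parts. The column names and type codes in the example data are illustrative; the authoritative definition is the GEN file format webpage.

```python
def parse_sample_lines(lines):
    """Split a sample file into its two header lines and per-individual rows.

    lines: any iterable of text lines, e.g. an open file handle.
    Returns (column_names, column_types, rows); rows keep the file order,
    which must match the order of individuals in the BGEN file.
    """
    iterator = iter(lines)
    column_names = next(iterator).split()
    column_types = next(iterator).split()
    rows = [line.split() for line in iterator if line.strip()]
    for row in rows:
        if len(row) != len(column_names):
            raise ValueError("row width does not match the header")
    return column_names, column_types, rows


# Hypothetical file content, for illustration only.
example = [
    "ID_1 ID_2 missing height\n",
    "0 0 0 P\n",
    "1001 1001 0 172.5\n",
    "1002 1002 0 NA\n",
]
names, types, rows = parse_sample_lines(example)
print(names, [row[0] for row in rows])
```

For the released file, the same function could be called as `parse_sample_lines(open("impv1.sample"))`. Because row order carries meaning, rows should never be sorted or filtered without applying the same operation to the BGEN data.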
The sample file can be used to store information about each individual, i.e. phenotypes
and covariates. If phenotypes and covariates are added into the sample file then
SNPTEST can be used to carry out association testing at each variant. Care should be
taken in making sure that such information is correctly added to sample files. The
format allows discrete and continuous phenotypes and covariates, as well as missing
values (see the GEN file format webpage).
Differences between raw genotypes and imputed files
SNPs below 1% MAF were filtered out before the phasing step; however, many of these
SNPs will have been imputed. Therefore these SNPs will appear in both the raw genotype
files and the imputed files, but may have different genotypes. As such, researchers
should not be surprised if the results of analysis at these SNPs differ depending upon
which files are used.
An exemplar genome-wide association study
A GWAS for the phenotype of height was carried out to assess the use of the UK
Biobank genetic data as a resource for genetic association studies. There are already a
substantial number of replicated associations for height [11]. The purpose of this analysis was not to
report new associations, but rather to check that a reasonably standard GWAS pipeline
produced valid results.
Sample filtering
Principal component analysis and the self-declared ethnicity were used to derive a
“White British” subset of samples. In addition, samples were excluded if they
(a) had at least one related sample,
(b) had a genetically inferred gender that did not match the self-reported gender, or
(c) were among ~500 extreme outliers [3].
These filters resulted in a dataset with 112,338 samples.
Taking account of the different arrays used
Some SNPs are only included on one of the UKB or UKBL arrays. At such SNPs, missing
genotypes will have been imputed as part of the phasing process, so that these SNPs
will consist of a mixture of genotyped and imputed genotypes. This can lead to bias in
association testing if there is some correlation between the phenotype and which array
a sample was assayed on. The samples involved in the UK BiLEVE study were selected based
on phenotypes associated with lung function [12], thus it may be possible for such
associations to occur. There are at least 2 solutions to ameliorate any possible
confounding due to array:
a. carry out association tests conditioning on a binary indicator of array.
b. carry out separate tests of association in UKB samples and UKBL samples and
combine the results using meta-analysis.
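Option (b) leaves the choice of meta-analysis method open. As one possibility, the sketch below combines per-array effect estimates with a fixed-effect inverse-variance weighted meta-analysis; this particular method, the function name and the example numbers are assumptions for illustration, not something prescribed by this document.

```python
import math


def fixed_effect_meta(estimates):
    """Inverse-variance weighted combination of (beta, standard_error) pairs.

    Returns the pooled effect estimate and its standard error.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    return beta, se


# Hypothetical per-array results at one variant: (beta, standard error).
ukb_result = (0.10, 0.02)
ukbl_result = (0.14, 0.02)
print(fixed_effect_meta([ukb_result, ukbl_result]))
```

Because each array is analysed separately, any difference in phenotype distribution between the two sets of samples cannot masquerade as a genotype effect.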
Association testing
GWAS was performed at all variants using SNPTEST. An additive genetic model was
fitted at each SNP, using gender, age, array and 10 principal components as covariates.
That is, the example uses option (a) above.
The program option -method expected was used in the SNPTEST software, which
converts the genotype probability triple to an expected genotype, d_{ij} (often called the
dosage), calculated as
\[ d_{ij} = \sum_{k=0}^{2} k \, p_{ijk} \]
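The expected genotype (dosage) is the probability-weighted count of B alleles, i.e. 0 times P(AA) plus 1 times P(AB) plus 2 times P(BB). A minimal sketch of that conversion follows; the function name and the tolerance on the sum of the triple are illustrative choices, not part of SNPTEST.

```python
def expected_genotype(p_aa, p_ab, p_bb, tol=1e-6):
    """Convert a genotype probability triple into an expected B-allele count."""
    total = p_aa + p_ab + p_bb
    if abs(total - 1.0) > tol:
        raise ValueError("genotype probabilities should sum to 1")
    return 0 * p_aa + 1 * p_ab + 2 * p_bb


print(expected_genotype(0.25, 0.5, 0.25))  # 1.0
print(expected_genotype(0.0, 0.0, 1.0))    # 2.0
```

A confidently imputed genotype gives a dosage close to 0, 1 or 2, while an uncertain one gives an intermediate value that down-weights its contribution to the additive model.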
Results
The GWAS for height produced a substantial number of associated regions. These
regions had a high correspondence to those genetic regions that have previously been
replicated for height and described in the NHGRI GWAS Catalog [11]. The analysis
suggested that a significant number of novel loci could be identified. Figure 3 shows a plot
of the -log10 p-values for the height and BMI scans on chromosome 4.
Figure 3: Chromosome 4 GWAS for height. The x-axis shows physical position. The
y-axis is -log10 p-value for each tested variant. Variants on the array are shown as black
dots, imputed variants are shown as grey dots. Reported associations from the NHGRI
GWAS Catalog are shown as red crosses. The blue and red horizontal lines are drawn at
a -log10 p-value of 5 and 7.5 respectively.
File processing
We recommend that researchers use the QCTOOL program to handle the BGEN files.
This program has options for extraction or removal of subsets of the data (SNPs and/or
samples), and for file format conversion. See the QCTOOL examples page for
information on command lines used to perform specific tasks.
The program SNPTEST can process BGEN files. It will automatically detect the BGEN file
format if data files are named with the .bgen extension.
PLINK v1.9 can process BGEN files; at the time of writing BGEN files are specified using
the --bgen option.
For further information on tools supporting the BGEN format, see the BGEN file format
website.
References
1. Marchini, J. & Howie, B. Genotype imputation for genome-wide association
studies. Nat. Rev. Genet. 11, 499–511 (2010).
2. Howie, B., Fuchsberger, C., Stephens, M., Marchini, J. & Abecasis, G. R. Fast and
accurate genotype imputation in genome-wide association studies through
pre-phasing. Nat. Genet. 44, 955–959 (2012).
3. The UK Biobank. UK Biobank Genotyping QC documentation. (2015).
4. Delaneau, O., Zagury, J.-F. & Marchini, J. Improved whole-chromosome phasing
for disease and population genetic studies. Nat. Methods 10, 5–6 (2013).
5. O'Connell, J., Sharp, K., Delaneau, O. & Marchini, J. Haplotype estimation for
biobank scale datasets. (2015) (submitted)
6. Kong, A. et al. Detection of sharing by descent, long-range phasing and
haplotype imputation. Nat. Genet. 40, 1068–1075 (2008).
7. Williams, A. L., Patterson, N., Glessner, J., Hakonarson, H. & Reich, D. Phasing of
many thousands of genotyped samples. Am. J. Hum. Genet. 91, 238–251 (2012).
8. The UK Biobank Array Design Group. UK Biobank Axiom Array Content Summary.
(2014).
9. Huang, J. et al. Improved imputation of low-frequency and rare variants using
the UK10K haplotype reference panel. Nature Communications 6, 8111 (2015).
10. Howie, B., Marchini, J. & Stephens, M. Genotype imputation with thousands of
genomes. G3 (Bethesda) 1, 457–470 (2011).
11. Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait
associations. Nucl. Acids Res. 42, D1001–6 (2014).
12. Wain, L. V. et al. Novel insights into the genetics of smoking behaviour, lung
function, and chronic obstructive pulmonary disease (UK BiLEVE): a genetic
association study in UK Biobank. Lancet Respir Med 3, 769–781 (2015).