Zhang

Big Data Training for Translational Omics Research
Validation Discussion
Week 2, Day 2
GSE6532
• Link to this dataset: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
• Sample size: 87
• Number of total markers: 54675
• The genes HOXB13 and IL17RB and the ESTs are included in this dataset.
• We use this dataset for validation.
• Result: They are not significant on this independent set.
Simple Association Test
PLINK
• PLINK is a free, open-source toolset for whole-genome association
analysis.
• It is designed to perform a range of basic, large-scale analyses in
a computationally efficient manner; see
http://pngu.mgh.harvard.edu/~purcell/plink/.
PLINK Input Files
• PED file: 5SNPs.ped
• MAP file: 5SNPs.map
• Alternate phenotype file: conty.txt or biny.txt
• Optional covariate file: covariates0.txt
• PED file
Col 1: Family ID
Col 2: Individual ID
Col 3: Paternal ID
Col 4: Maternal ID
Col 5: Sex (1 = male; 2 = female; any other character = unknown)
Col 6: Phenotype (The missing phenotype value for quantitative traits is, by default, -9)
Col 7-: Genotypes (The missing genotype value is denoted as 0, by default)
• Example:
FAM001 1 0 0 1 3.4 A A G G A C C C
FAM001 2 0 0 1 2.5 A A A G 0 0 A C
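The column layout above can be made concrete with a short parser. This is a minimal Python sketch for illustration only: the function name `parse_ped_line` and the dictionary keys are assumptions, not part of PLINK, and a genotype is treated as missing whenever either allele equals the missing code.

```python
def parse_ped_line(line, missing_genotype="0"):
    """Split one whitespace-delimited PED line into its six fixed columns
    and a list of (allele1, allele2) genotype pairs."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 columns plus an even number of alleles")
    alleles = fields[6:]
    genotypes = []
    for i in range(0, len(alleles), 2):
        pair = (alleles[i], alleles[i + 1])
        # Treat the genotype as missing (None) if either allele is the missing code.
        genotypes.append(None if missing_genotype in pair else pair)
    return {
        "fid": fields[0], "iid": fields[1],
        "pat": fields[2], "mat": fields[3],
        "sex": fields[4], "phenotype": fields[5],
        "genotypes": genotypes,
    }

# Second line of the example above: the third SNP is missing (0 0).
record = parse_ped_line("FAM001 2 0 0 1 2.5 A A A G 0 0 A C")
print(record["phenotype"], record["genotypes"])
```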
• MAP file
Col 1: Chromosome (1-22, X, Y or 0 if unplaced)
Col 2: rs# or SNP identifier
Col 3: Genetic distance (morgans)
Col 4: Base-pair position (bp units)
• Example (a negative base-pair position tells PLINK to exclude that SNP):
1 rs123456 0 1234555
1 rs234567 0 1237793
1 rs224534 0 -1237697
1 rs233556 0 1337456
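The MAP layout can be read in the same way. The sketch below is illustrative Python, not PLINK code; `parse_map_lines` and its dictionary keys are hypothetical names. It relies on the PLINK convention that a negative base-pair position marks a SNP to be excluded from analysis.

```python
def parse_map_lines(lines):
    """Parse 4-column MAP lines (chromosome, SNP ID, genetic distance, bp position)."""
    snps = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        chrom, snp_id, dist, bp = line.split()
        bp = int(bp)
        snps.append({
            "chr": chrom,
            "snp": snp_id,
            "dist": float(dist),
            "bp": bp,
            "excluded": bp < 0,  # negative position = SNP flagged for exclusion
        })
    return snps

example = [
    "1 rs123456 0 1234555",
    "1 rs234567 0 1237793",
    "1 rs224534 0 -1237697",
    "1 rs233556 0 1337456",
]
snps = parse_map_lines(example)
kept = [s["snp"] for s in snps if not s["excluded"]]
print(kept)
```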
• To specify an alternate phenotype for analysis, i.e. other than
the one in the *.ped file, use the option --pheno.
• Alternate phenotype file
Col 1: Family ID
Col 2: Individual ID
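Col 3: Phenotype (additional phenotype columns may follow; --mpheno N selects the Nth one)
• Example (three phenotype columns):
F1 1110 2.3 22.22 2
F2 2202 34.12 18.23 1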
• The phenotype can be either a quantitative trait or an
affection status column.
• PLINK automatically detects which type it is, based
on whether a value other than 0, 1, 2 or the missing
phenotype code (-9 by default) is observed.
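The detection rule can be sketched in a few lines of Python. This is a simplified illustration, not PLINK's implementation: the function name `detect_phenotype_type` is hypothetical, values are compared numerically, and the missing code is assumed to be -9.

```python
def detect_phenotype_type(values, missing_code=-9):
    """Return 'case/control' if every value is 0, 1, 2 or the missing code,
    otherwise 'quantitative'."""
    allowed = {0.0, 1.0, 2.0, float(missing_code)}
    if all(float(v) in allowed for v in values):
        return "case/control"
    return "quantitative"

# Values as they would be read from the phenotype column of a text file.
binary_column = ["1", "2", "2", "-9", "1"]
continuous_column = ["2.3", "34.12", "-9"]
print(detect_phenotype_type(binary_column))
print(detect_phenotype_type(continuous_column))
```

One consequence of this rule is that a quantitative trait that happens to take only the values 0, 1 and 2 would be read as an affection status.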
Perform a Quantitative Trait Association
Test for a Continuous Trait and 5 SNPs
• Genotype data: 5SNPs.ped & 5SNPs.map
• Alternate phenotype data: conty.txt
• PLINK commands:
plink --noweb --file 5SNPs --assoc --pheno conty.txt --out younameit
• Usage:
--noweb  skips the online version check
--file   specifies the .ped and .map files
--assoc  performs case/control or QTL association
--pheno  specifies the alternate phenotype file
--out    specifies the output filename root
• This generates the file younameit.qassoc with the following fields:
CHR    Chromosome number
SNP    SNP identifier
BP     Physical position (base-pair)
NMISS  Number of non-missing genotypes
BETA   Regression coefficient
SE     Standard error
R2     Regression r-squared
T      Wald test (based on t-distribution)
P      Wald test asymptotic p-value
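To show where these fields come from, the following Python sketch fits a simple linear regression of the phenotype on the allele count (0, 1 or 2) for one SNP. It is an illustration under assumptions: additive allele-count coding is assumed, the function name `qassoc_stats` and the toy data are hypothetical, and the P column is left out because it needs the t-distribution, which the standard library does not provide.

```python
import math

def qassoc_stats(allele_counts, phenotypes):
    """Regress phenotype on allele count; individuals with a missing
    genotype (None) are skipped. Returns NMISS, BETA, SE, R2 and T."""
    pairs = [(g, y) for g, y in zip(allele_counts, phenotypes) if g is not None]
    n = len(pairs)
    mean_g = sum(g for g, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((g - mean_g) ** 2 for g, _ in pairs)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    beta = sxy / sxx                      # least-squares slope
    rss = syy - beta * sxy                # residual sum of squares
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of the slope
    r2 = (sxy * sxy) / (sxx * syy)        # regression r-squared
    return {"NMISS": n, "BETA": beta, "SE": se, "R2": r2, "T": beta / se}

# Hypothetical data for eight individuals; the seventh has a missing genotype.
counts = [0, 1, 2, 1, 0, 2, None, 1]
trait = [1.0, 2.1, 2.9, 1.8, 1.2, 3.2, 2.0, 2.2]
stats = qassoc_stats(counts, trait)
print(stats)
```

In this single-predictor setting the statistics are linked by T = BETA / SE and T^2 = (NMISS - 2) * R2 / (1 - R2).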
• Results:
CHR  SNP         BP      NMISS  BETA     SE       R2        T       P
1    rs3094315   752566  342    0.05021  0.1407   0.000375  0.3569  0.7214
1    rs2073813   753541  330    0.1148   0.09175  0.004754  1.252   0.2116
1    rs3131969   754182  333    0.1631   0.08759  0.01037   1.863   0.06341
1    rs3131967   754334  341    0.1407   0.09679  0.006198  1.454   0.1469
1    rs77598327  757120  334    0.1532   0.122    0.004731  1.256   0.2099
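As a quick sanity check on these results, the T column should equal BETA divided by SE, up to the rounding of the printed values. The Python snippet below simply redoes that arithmetic with the numbers reported above; the variable names are illustrative.

```python
# (SNP, BETA, SE, T) copied from the results above.
rows = [
    ("rs3094315", 0.05021, 0.1407, 0.3569),
    ("rs2073813", 0.1148, 0.09175, 1.252),
    ("rs3131969", 0.1631, 0.08759, 1.863),
    ("rs3131967", 0.1407, 0.09679, 1.454),
    ("rs77598327", 0.1532, 0.122, 1.256),
]

for snp, beta, se, t in rows:
    # Print the recomputed ratio next to the reported T statistic.
    print(snp, round(beta / se, 3), t)

# Largest discrepancy between BETA/SE and the reported T.
max_diff = max(abs(beta / se - t) for _, beta, se, t in rows)
```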
Perform a Case/Control Association Test
for a Binary Trait and 5 SNPs
• Genotype data: 5SNPs.ped & 5SNPs.map
• Alternate phenotype data: biny.txt
• PLINK commands:
plink --noweb --file 5SNPs --assoc --pheno biny.txt --out younameit
• This generates the file younameit.assoc with the following fields:
CHR    Chromosome
SNP    SNP ID
BP     Physical position (base-pair)
A1     Minor allele name (based on whole sample)
F_A    Frequency of this allele in cases
F_U    Frequency of this allele in controls
A2     Major allele name
CHISQ  Basic allelic test chi-square (1df)
P      Asymptotic p-value for this test
OR     Estimated odds ratio (for A1, i.e. A2 is reference)
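The allelic test behind these fields can be written out for a single SNP. The Python sketch below computes the Pearson chi-square on the 2x2 table of allele counts (A1/A2 by case/control), its 1-df p-value and the odds ratio for A1. The function name `allelic_test` and the allele counts are hypothetical, chosen only to illustrate the arithmetic.

```python
import math

def allelic_test(a1_cases, a2_cases, a1_controls, a2_controls):
    """Basic allelic chi-square test (1 df) with the odds ratio for A1."""
    a, b, c, d = a1_cases, a2_cases, a1_controls, a2_controls
    n = a + b + c + d
    chisq = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chisq / 2.0))
    return {
        "F_A": a / (a + b),            # A1 frequency in cases
        "F_U": c / (c + d),            # A1 frequency in controls
        "CHISQ": chisq,
        "P": p,
        "OR": (a * d) / (b * c),       # odds of A1 in cases vs controls
    }

# Hypothetical counts: 100 cases and 100 controls, i.e. 200 alleles per group.
result = allelic_test(a1_cases=30, a2_cases=170, a1_controls=20, a2_controls=180)
print(result)
```

The same odds ratio can be obtained from the frequencies alone as F_A * (1 - F_U) / (F_U * (1 - F_A)).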
• Results:
CHR  SNP         BP      A1  F_A      F_U     A2  CHISQ   P
1    rs3094315   752566  G   0.07353  0.0814  A   0.148   0.7004
1    rs2073813   753541  A   0.2126   0.2362  G   0.5292  0.467
1    rs3131969   754182  A   0.2134   0.2633  G   2.281   0.131
1    rs3131967   754334  A   0.1845   0.2312  G   2.254   0.1333
1    rs77598327  757120  G   0.1054   0.1399  A   1.841   0.1748
Perform a Quantitative Trait Association Test for
a Continuous Trait and 5 SNPs with 3 Covariates
• Genotype data: 5SNPs.ped & 5SNPs.map
• Phenotype data: conty0.txt
• Covariates: covariates0.txt (age, gender, bmi)
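• Covariate adjustment (using R):
# Regress each phenotype column on the covariates and keep the residuals.
# Assumes both files are whitespace-delimited, have no header row, contain
# only numeric columns, and list the individuals in the same order.
pheno <- as.matrix(read.table("conty0.txt"))
covar <- as.matrix(read.table("covariates0.txt"))  # age, gender, bmi
n <- nrow(pheno)
p <- ncol(pheno)
fit <- vector("list", p)
residpheno <- matrix(NA_real_, n, p)
for (i in seq_len(p)) {
  # na.exclude pads the residuals with NA so that rows stay aligned
  fit[[i]] <- lm(pheno[, i] ~ covar, na.action = na.exclude)
  residpheno[, i] <- resid(fit[[i]])
}
# Missing residuals are written as -9, the default missing-phenotype code.
# Assumption: the file later passed to PLINK (resid_phenotype.txt) is this
# output with Family ID and Individual ID columns prepended, as --pheno requires.
write.table(residpheno, "resid_phenotype0.txt", row.names = FALSE,
            col.names = FALSE, quote = FALSE, sep = " ", na = "-9")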
• Residual (alternate) phenotype data after covariate
adjustment: resid_phenotype.txt
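The --pheno format needs Family ID and Individual ID in the first two columns, whereas the R script writes residuals only (to resid_phenotype0.txt). One possible way to bridge the two is sketched below in Python. This is an assumption about the workflow, not something the slides state: it presumes the residual rows are in the same order as the individuals in 5SNPs.ped, and the function name `attach_ids` is hypothetical.

```python
def attach_ids(ped_lines, residual_lines):
    """Pair the first two PED columns (FID, IID) with the residual rows,
    line by line, producing lines in the alternate phenotype format."""
    ped_rows = [line.split() for line in ped_lines if line.strip()]
    resid_rows = [line.split() for line in residual_lines if line.strip()]
    if len(ped_rows) != len(resid_rows):
        raise ValueError("PED file and residual file have different numbers of rows")
    return [" ".join(ped[:2] + resid) for ped, resid in zip(ped_rows, resid_rows)]

# Hypothetical two-individual example.
ped_example = [
    "FAM001 1 0 0 1 3.4 A A G G",
    "FAM001 2 0 0 1 2.5 A A A G",
]
resid_example = ["0.45", "-0.45"]
pheno_lines = attach_ids(ped_example, resid_example)
print("\n".join(pheno_lines))
```

With real data, the two inputs would be the lines of 5SNPs.ped and resid_phenotype0.txt, and the returned lines would be written to resid_phenotype.txt.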
• PLINK commands:
plink --noweb --file 5SNPs --assoc --pheno resid_phenotype.txt --out younameit
• This generates the file younameit.qassoc with the following fields:
CHR    Chromosome number
SNP    SNP identifier
BP     Physical position (base-pair)
NMISS  Number of non-missing genotypes
BETA   Regression coefficient
SE     Standard error
R2     Regression r-squared
T      Wald test (based on t-distribution)
P      Wald test asymptotic p-value
• Results:
CHR  SNP         BP      NMISS  BETA     SE       R2        T       P
1    rs3094315   752566  342    0.04844  0.1405   0.000349  0.3446  0.7306
1    rs2073813   753541  330    0.1181   0.09161  0.005039  1.289   0.1984
1    rs3131969   754182  333    0.1665   0.08749  0.01083   1.903   0.05786
1    rs3131967   754334  341    0.1411   0.09669  0.006241  1.459   0.1454
1    rs77598327  757120  334    0.1577   0.1218   0.005026  1.295   0.1962
Quality Control Using PLINK
Inclusion Thresholds
• Common options that can be used to filter out individuals or
SNPs on the basis of the summary statistic measures:
Feature                      As summary statistic   As inclusion criterion
Missingness per individual   --missing              --mind N
Missingness per marker       --missing              --geno N
Allele frequency             --freq                 --maf N
Hardy-Weinberg equilibrium   --hardy                --hwe N
• Reference:
http://pngu.mgh.harvard.edu/~purcell/plink/thresh.shtml
PLINK Output Files
• Option 1:
We can write the data back out in the same format as the input (--recode), which is
useful if we want to use MACH afterwards.
• Outputs: *.ped & *.map
*.ped:
FAM001 1 0 0 1 2 A A G G
FAM001 2 0 0 1 2 A A A G
*.map:
1 rs123456 0 1234555
1 rs234567 0 1237793
• Option 2:
We can write the genotypes in additive coding (--recodeA), i.e. as the number of
minor alleles (0, 1 or 2) per SNP, which is useful if we would like to run an
association study afterwards.
• Outputs: *.raw
FAM001 1 0 0 1 2 0 0
FAM001 2 0 0 1 2 0 1
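The additive coding can be illustrated with a small Python sketch that counts minor alleles per genotype for one SNP. It is a simplification rather than PLINK's implementation: the function name `additive_coding` is hypothetical, the minor allele is determined from the genotypes passed in, a monomorphic SNP is coded as all zeros, and missing genotypes are returned as None.

```python
from collections import Counter

def additive_coding(genotypes):
    """Code one SNP's genotypes as the number of minor alleles (0, 1 or 2).
    `genotypes` is a list of (allele1, allele2) pairs; None marks a missing genotype."""
    counts = Counter(a for g in genotypes if g is not None for a in g)
    if len(counts) < 2:
        # Monomorphic (or entirely missing) SNP: no minor allele is observed.
        return [None if g is None else 0 for g in genotypes]
    # Rarest allele is the minor allele; ties are broken alphabetically.
    minor = min(counts, key=lambda a: (counts[a], a))
    return [None if g is None else sum(1 for a in g if a == minor) for g in genotypes]

# The two SNPs of the two-individual --recode example (A A / A A and G G / A G).
snp1 = [("A", "A"), ("A", "A")]
snp2 = [("G", "G"), ("A", "G")]
print(additive_coding(snp1), additive_coding(snp2))
```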
Example - Summary Statistic
• Input files:
5SNPs.ped
5SNPs.map
• PLINK commands:
plink --noweb --file 5SNPs --missing --out younameit
• Because --out younameit is given, this option creates two files:
younameit.imiss
younameit.lmiss
which detail missingness by individual and by SNP
(locus), respectively.
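The two missingness summaries are simple proportions, as the Python sketch below illustrates. The function name `missingness` and the toy genotypes are hypothetical; the last line mirrors the idea of a --mind 0.1 threshold, which drops individuals whose missing rate exceeds 0.1.

```python
def missingness(genotype_rows):
    """genotype_rows holds one list per individual; each entry is a genotype
    pair or None (missing). Returns (per-individual rates, per-SNP rates)."""
    n_ind = len(genotype_rows)
    n_snp = len(genotype_rows[0])
    per_individual = [sum(g is None for g in row) / n_snp for row in genotype_rows]
    per_snp = [sum(row[j] is None for row in genotype_rows) / n_ind
               for j in range(n_snp)]
    return per_individual, per_snp

# Two individuals, four SNPs; the second individual is missing the third SNP.
rows = [
    [("A", "A"), ("G", "G"), ("A", "C"), ("C", "C")],
    [("A", "A"), ("A", "G"), None, ("A", "C")],
]
imiss, lmiss = missingness(rows)
keep = [rate <= 0.1 for rate in imiss]  # individuals passing a 0.1 threshold
print(imiss, lmiss, keep)
```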
• PLINK commands:
plink --noweb --file 5SNPs --freq --out younameit

This option creates the file younameit.frq with six columns:
CHR      Chromosome
SNP      SNP identifier
A1       Allele 1 code (minor allele)
A2       Allele 2 code (major allele)
MAF      Minor allele frequency
NCHROBS  Non-missing allele count
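The A1, A2, MAF and NCHROBS columns can be reproduced for one SNP by counting alleles. The Python sketch below is illustrative only: `allele_frequency` is a hypothetical name, the SNP is assumed to be biallelic, and ties between allele counts are broken alphabetically.

```python
from collections import Counter

def allele_frequency(genotypes):
    """Return (A1, A2, MAF, NCHROBS) for one SNP from a list of
    (allele1, allele2) pairs, where None marks a missing genotype."""
    counts = Counter(a for g in genotypes if g is not None for a in g)
    nchrobs = sum(counts.values())        # non-missing allele count
    ordered = sorted(counts, key=lambda a: (counts[a], a))
    if not ordered:
        return None, None, None, 0        # every genotype is missing
    if len(ordered) == 1:
        return None, ordered[0], 0.0, nchrobs  # monomorphic SNP
    a1, a2 = ordered[0], ordered[-1]      # rarest and most common allele
    return a1, a2, counts[a1] / nchrobs, nchrobs

# Four individuals, one of them with a missing genotype.
a1, a2, maf, nchrobs = allele_frequency([("G", "G"), ("A", "G"), None, ("G", "G")])
print(a1, a2, maf, nchrobs)
```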
• PLINK commands:
plink --noweb --file 5SNPs --hardy --out younameit

This option creates the file younameit.hwe with the following format:
SNP     SNP identifier
TEST    Code indicating sample (e.g. ALL, AFF, UNAFF)
A1      Minor allele code
A2      Major allele code
GENO    Genotype counts: 11/12/22
O(HET)  Observed heterozygosity
E(HET)  Expected heterozygosity
P       H-W p-value
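The two heterozygosity columns follow directly from the genotype counts in GENO. The Python sketch below shows the arithmetic, taking expected heterozygosity as 2pq under Hardy-Weinberg equilibrium. The function name `heterozygosity` and the counts are hypothetical, and the P column is not reproduced here.

```python
def heterozygosity(n11, n12, n22):
    """Observed and expected heterozygosity from genotype counts 11/12/22."""
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)   # frequency of allele 1
    observed = n12 / n              # proportion of heterozygotes
    expected = 2 * p * (1 - p)      # 2pq under Hardy-Weinberg equilibrium
    return observed, expected

# Hypothetical genotype counts 10/80/210 for 300 individuals.
o_het, e_het = heterozygosity(10, 80, 210)
print(round(o_het, 4), round(e_het, 4))
```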
Example - Inclusion Thresholds
• Input files:
5SNPs.ped
5SNPs.map
• PLINK commands:
plink --noweb --file 5SNPs --mind 0.1 --geno 0.1 --maf 0.05 --hwe 0.001 --recode --out younameit
Example of the Log File
5 (of 5) markers to be included from [ 5SNPs.map ]
344 individuals read from [ 5SNPs.ped ]
Before frequency and genotyping pruning, there are 5 SNPs
344 founders and 0 non-founders found
32 of 344 individuals removed for low genotyping ( MIND > 0.1 )
0 markers to be excluded based on HWE test ( p <= 0.001 )
0 markers failed HWE test in cases
0 markers failed HWE test in controls
Total genotyping rate in remaining individuals is 1
0 SNPs failed missingness test ( GENO > 0.1 )
0 SNPs failed frequency test ( MAF < 0.05 )
After frequency and genotyping pruning, there are 5 SNPs
After filtering, there are 312 individuals
Writing recoded ped file to [ younameit.ped ]
Writing new map file to [ younameit.map ]
Remove Very Closely Related Individuals
• Input files:
5SNPs.ped
5SNPs.map
• PLINK commands:
plink --noweb --file 5SNPs --genome --out younameit
• This creates the file younameit.genome, which has the following fields:
FID1    Family ID for first individual
IID1    Individual ID for first individual
FID2    Family ID for second individual
IID2    Individual ID for second individual
RT      Relationship type given PED file
EZ      Expected IBD sharing given PED file
Z0      P(IBD=0)
Z1      P(IBD=1)
Z2      P(IBD=2)
PI_HAT  P(IBD=2) + 0.5*P(IBD=1) (proportion IBD)
PHE     Pairwise phenotypic code (1, 0, -1 = AA, AU and UU pairs)
DST     IBS distance (IBS2 + 0.5*IBS1) / (N SNP pairs)
PPC     IBS binomial test
RATIO   Of HETHET : IBS 0 SNPs (expected value is 2)
• Scan the younameit.genome file for any pairs of individuals with high
PI_HAT values (e.g. greater than 0.05). Optionally, remove one
member of each pair if you find close relatives.
• Reference:
http://pngu.mgh.harvard.edu/~purcell/plink/ibdibs.shtml
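Scanning the .genome file for high PI_HAT values can be automated. The Python sketch below assumes the file is whitespace-delimited with a header line naming the fields listed earlier. The function name `related_pairs` and the example rows, which use a reduced set of columns, are hypothetical.

```python
def related_pairs(genome_lines, threshold=0.05):
    """Return (FID1, IID1, FID2, IID2, PI_HAT) for every pair whose PI_HAT
    exceeds the threshold. The first line must be the header."""
    header = genome_lines[0].split()
    idx = {name: header.index(name)
           for name in ("FID1", "IID1", "FID2", "IID2", "PI_HAT")}
    flagged = []
    for line in genome_lines[1:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        pi_hat = float(fields[idx["PI_HAT"]])
        if pi_hat > threshold:
            flagged.append((fields[idx["FID1"]], fields[idx["IID1"]],
                            fields[idx["FID2"]], fields[idx["IID2"]], pi_hat))
    return flagged

# Hypothetical excerpt: the second pair has PI_HAT = Z2 + 0.5*Z1 = 0.5.
example = [
    "FID1 IID1 FID2 IID2 Z0 Z1 Z2 PI_HAT",
    "F1 1 F2 1 1.0000 0.0000 0.0000 0.0000",
    "F1 1 F3 1 0.2500 0.5000 0.2500 0.5000",
]
pairs = related_pairs(example)
print(pairs)
```

For a real run, the input would be the lines of younameit.genome, for example `open("younameit.genome").read().splitlines()`.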