For submission to Genetic Epidemiology

GEIRA: Gene-Environment and Gene-Gene Interaction Research Application
SAS version 1.0
User’s manual
Bo Ding
Institute of Environmental Medicine
Karolinska Institutet
171 77 Stockholm, Sweden
June 11, 2010
1 Introduction
GEIRA is a tool designed to perform genome-wide gene-environment interaction
analyses. It can be easily extended to genome-wide gene-gene interaction analyses,
provided that the genetic variable (such as a SNP, CNV or haplotype) can be
dichotomized (0 = unexposed, 1 = exposed) in the same way as an environmental variable.
GEIRA calculates measures of both additive and multiplicative interaction.
Multiplicative interaction refers here to an interaction term in the logistic regression
model. Additive interaction is defined as a deviation from additivity of the absolute
effects of two risk factors as originally described by Rothman (1). The attributable
proportion due to interaction (AP) is used to quantify the contribution of interaction to
disease risk, as compared with the sum of the separate contributions of the two risk
factors. AP is calculated as follows:
AP = (RR11 - RR10 - RR01 + 1) / RR11,
where RR10 is the relative risk for the first risk factor in the absence of the second, RR01
is the relative risk for the second risk factor in the absence of the first, and RR11 is the
relative risk in the exposure category where both risk factors are present. Those who are
unexposed to both the first and the second risk factor form the reference
category, i.e., RR00 = 1. Symmetrical confidence intervals for AP are estimated
using the formulas developed by Hosmer and
Lemeshow (2).
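As a small numerical illustration of the AP formula, the following Python sketch evaluates it for three relative risks. It is not part of GEIRA (which is written in SAS); the function name and the example values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def attributable_proportion(rr10, rr01, rr11):
    """AP = (RR11 - RR10 - RR01 + 1) / RR11, with RR00 = 1 as the reference."""
    if rr11 <= 0:
        raise ValueError("RR11 must be positive")
    return (rr11 - rr10 - rr01 + 1.0) / rr11


# Hypothetical relative risks, for illustration only.
ap = attributable_proportion(rr10=2.0, rr01=3.0, rr11=8.0)
print(ap)  # (8 - 2 - 3 + 1) / 8 = 0.5

# When the joint effect is exactly additive (RR11 = RR10 + RR01 - 1), AP is 0.
print(attributable_proportion(rr10=2.0, rr01=3.0, rr11=4.0))
```

An AP above 0 indicates a joint effect larger than expected under additivity, and an AP below 0 a smaller one.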
Calculations are made separately for dominant, recessive and co-dominant
genetic models. All interaction estimates are displayed in a single table to make
the output easy to read. GEIRA (SAS version) is written in the SAS macro
language and incorporates procedures in SAS/Base (SAS 9.2). However, using GEIRA
requires no previous experience with SAS. This manual provides a detailed guide to get
you started.
2 Quick start
Download the GEIRA (SAS) package from www.epinet.se and save it on your local
machine, for example in C:\GEIRA. Unzip the package. It contains 1 pdf file with the User’s
manual (GEIRA_manual_SAS.pdf), 3 sample input data files (sample.tped,
sample.tfam, and SampleCov.txt), 1 program file (GEIRA.sas), and 6 sample output
files (out_dom.sas7bdat, out_dom_adjust.sas7bdat, out_rec.sas7bdat,
out_rec_adjust.sas7bdat, out_add.sas7bdat, and out_add_adjust.sas7bdat).
A. Invocation of SAS on a Windows Platform
Start SAS by clicking on the Start menu and highlighting the Programs
tab. Locate and highlight The SAS System and double-click on SAS 9.2. SAS will
load.
B. Import the covariate data file (in txt format, e.g., SampleCov.txt) through the SAS
Import Wizard
(a) Open the wizard by clicking on the File menu and selecting Import Data. The
Import Wizard should appear.
(b) Follow the steps of the Import Wizard:
1. Select the type of data you want to import (for the sample data, select Tab Delimited
File (*.txt)).
2. Browse for the file (e.g., C:\GEIRA\SampleCov.txt), click Open and then click Next.
3. Choose the Library WORK and enter a Member name (e.g., Covariate).
4. The Import Wizard will then ask whether you would like to create a file
containing the PROC IMPORT command. Click on Finish.
If the import was successful, the Log window shows a note like the following:
NOTE: WORK.COVARIATE data set was successfully created
Running GEIRA with sample datasets
Step 1. Start the SAS system, select the program editor window, and use File/Open
Program to open your copy of the GEIRA program file (GEIRA.sas) in the program
editor window.
Step 2. Once your GEIRA program appears in the editor, run it by selecting Submit
from the Run pull-down menu.
Step 3. Assign the macro variables as follows for the sample datasets (data_covariate=
must match the Member name chosen in the Import Wizard):
%GEIRA (data_covariate= samplecov,
id=indid,
nongenovar=%str(cov env),
path_tped=C:\GEIRA\sample.tped,
path_tfam=C:\GEIRA\sample.tfam,
SampleSize=2000,
covar_cat=%str(),
envir=env,
covar_cont=%str(cov),
GenetModel=dom,
chr=1)
%Multi(GenetModel=dom,p_AP=p_AP)
See section 4.3 for a detailed description of the macro variables.
Viewing output
In the Explorer window, you will see two output files (Out_dom and Out_dom_adjust).
Open the two files by double-clicking them and compare their contents with the provided files
(out_dom.sas7bdat and out_dom_adjust.sas7bdat). Make sure the results are
identical.
C. Export the SAS output data file to an external file through SAS Export Wizard
(a) From a SAS session’s Results window, select the File menu and then select the
Export Data item.
(b) Follow the steps of the Export Wizard:
1. Choose the Library WORK and Member (e.g., out_dom).
2. Select the type of data you want to export (e.g., Tab Delimited File (*.txt)) and click
Next.
3. Browse for the location where you want your output data file to reside (e.g., C:\GEIRA),
give a file name, then click Save and Next.
4. Click on Finish.
If the export was successful, the Log window shows a note like the following:
NOTE: "C:\Documents and Settings\XX\Desktop\sampe.txt" file was successfully
created.
Congratulations! You’ve done the sample test run and can move on to your own data.
3 Features
Upon invocation, GEIRA performs the following procedures (Fig. 1 and Fig. 2):
1. Data importing.
Here we assume that some other software package has previously been used to quality
control genome-wide genotype data, including dataset filtering on the basis of SNP
genotype call rates, minor allele frequency, Hardy-Weinberg equilibrium, individual
missing genotypes, outliers, etc. The program will read in transposed PLINK format
data files, i.e., TPED (containing SNP and genotype information where one row is a
SNP) and TFAM (containing individual and family information where one row is an
individual).
The first 4 columns of the TPED file must be:
chromosome (1-22, X, Y or 0 if unplaced)
SNP rs# (must start with ‘rs’)
Genetic distance (morgans)
Base-pair position (bp units)
Genotypes (column 5 onwards) should also be white-space delimited. Alleles are coded as characters
(A, C, G, T), except 0 which is, by default, the missing genotype character. All markers
should be biallelic. All SNPs (whether haploid or not) must have two alleles specified.
Either both alleles should be missing (i.e., 0) or neither. No header row should be given.
The TFAM file is a white-space (space or tab) delimited file with the following 6 columns:
Family ID
Individual ID
Paternal ID
Maternal ID
Sex (1=male; 2=female; other=unknown)
Phenotype
Affection status (Phenotype), by default, should be coded:
-9 missing
0 missing
1 unaffected
2 affected
Example TPED/TFAM files:
<------------- tped ------------->
1 rs1 0 5000650 A A A C C C A C C C C C
1 rs2 0 5000830 G T G T G G T T G T T T
<- tfam ->
1 1 0 0 1 1
2 1 0 0 1 1
3 1 0 0 1 1
4 1 0 0 1 2
5 1 0 0 1 2
6 1 0 0 1 2
(see PLINK documentation for details
http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#tr).
In addition to TPED and TFAM files, a covariate file (txt file) containing non-genetic
covariate information (including the environmental factor under study) is needed.
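To make the two file layouts concrete, here is a minimal Python sketch that splits one TPED row and one TFAM row into their fields. It is an illustration only and not how GEIRA reads the files (GEIRA does this in SAS); the function names and dictionary keys are hypothetical. It takes the four leading TPED columns listed above and treats the remaining fields as allele pairs, one pair per individual.

```python
def parse_tped_line(line):
    """Split one TPED row into its 4 leading fields and per-individual genotypes."""
    fields = line.split()  # white-space (space or tab) delimited
    chrom, snp, distance, position = fields[:4]
    alleles = fields[4:]
    if len(alleles) % 2 != 0:
        raise ValueError("each individual needs two alleles")
    # Pair consecutive alleles: one (allele1, allele2) tuple per individual.
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    for a1, a2 in genotypes:
        # Either both alleles are missing ('0') or neither.
        if (a1 == "0") != (a2 == "0"):
            raise ValueError("half-missing genotype in " + snp)
    return chrom, snp, float(distance), int(position), genotypes


def parse_tfam_line(line):
    """Split one TFAM row into its six fields."""
    family, individual, paternal, maternal, sex, phenotype = line.split()
    return {
        "family": family,
        "individual": individual,
        "paternal": paternal,
        "maternal": maternal,
        "sex": int(sex),              # 1 = male, 2 = female, other = unknown
        "phenotype": int(phenotype),  # 1 = unaffected, 2 = affected, 0 or -9 = missing
    }


tped_row = parse_tped_line("1 rs1 0 5000650 A A A C C C A C C C C C")
tfam_row = parse_tfam_line("4 1 0 0 1 2")
print(tped_row[1], tped_row[4])
print(tfam_row["individual"], tfam_row["phenotype"])
```

The i-th genotype pair in a TPED row belongs to the individual on the i-th row of the TFAM file, so the two files must list individuals in the same order.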
2. Risk allele assigning.
A minor allele is determined using all subjects (both cases and controls). A risk
allele is then determined by comparing the minor allele frequency (MAF) in cases and
controls. If the MAF in cases is greater than or equal to that in controls, the minor allele
is designated the risk allele; otherwise the major allele is (see Fig. 2 for details).
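The rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not GEIRA's SAS code: the function names are hypothetical, missing genotypes (coded 0) are skipped, subjects with missing phenotype are ignored, and a tie between the two pooled allele counts is broken alphabetically, which the manual does not specify.

```python
from collections import Counter


def allele_counts(genotypes):
    """Count alleles, skipping missing genotypes coded as '0'."""
    counts = Counter()
    for a1, a2 in genotypes:
        if a1 != "0" and a2 != "0":
            counts.update([a1, a2])
    return counts


def assign_risk_allele(genotypes, phenotypes):
    """Return (minor, major, risk) for one biallelic SNP.

    phenotypes use the TFAM coding: 1 = unaffected (control), 2 = affected (case).
    """
    cases = [g for g, p in zip(genotypes, phenotypes) if p == 2]
    controls = [g for g, p in zip(genotypes, phenotypes) if p == 1]
    pooled = allele_counts(cases + controls)
    if len(pooled) != 2:
        raise ValueError("expected exactly two observed alleles")
    # Assumption: ties in the pooled counts are broken alphabetically.
    minor, major = sorted(pooled, key=lambda a: (pooled[a], a))

    def freq(group, allele):
        counts = allele_counts(group)
        total = sum(counts.values())
        return counts[allele] / total if total else 0.0

    # The minor allele is the risk allele when its frequency in cases
    # is greater than or equal to its frequency in controls.
    risk = minor if freq(cases, minor) >= freq(controls, minor) else major
    return minor, major, risk


# SNP rs1 from the example TPED/TFAM files: three controls, then three cases.
rs1 = [("A", "A"), ("A", "C"), ("C", "C"), ("A", "C"), ("C", "C"), ("C", "C")]
print(assign_risk_allele(rs1, [1, 1, 1, 2, 2, 2]))  # ('A', 'C', 'C')
```

For rs1 the minor allele A is less frequent in cases (1/6) than in controls (3/6), so the major allele C becomes the risk allele.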
3. Data converting
The raw genotype dataset will then be converted to three datasets corresponding to three
genetic models, i.e., dominant, recessive and co-dominant, based on the following:
Assuming C is the minor allele and also the risk allele.
Dominant model coding: A_A→0, A_C→1, C_C→1
Recessive model coding: A_A→0, A_C→0, C_C→1
Co-dominant model coding: A_A→0, A_C→1, C_C→2
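The three coding schemes amount to counting risk alleles and then thresholding the count. The Python sketch below illustrates this; it is not GEIRA's SAS implementation. The function name is hypothetical, the model labels reuse the GenetModel= values (dom, rec, add), and returning None for a missing genotype is an assumption made for this example.

```python
def code_genotype(a1, a2, risk, model):
    """Code one genotype under the given genetic model.

    model is 'dom' (dominant), 'rec' (recessive) or 'add' (co-dominant).
    Returns None for a missing genotype (an assumption of this sketch).
    """
    if a1 == "0" or a2 == "0":
        return None
    n_risk = (a1 == risk) + (a2 == risk)  # number of risk alleles: 0, 1 or 2
    if model == "dom":
        return 1 if n_risk >= 1 else 0
    if model == "rec":
        return 1 if n_risk == 2 else 0
    if model == "add":
        return n_risk
    raise ValueError("unknown model: " + model)


# C is the risk allele, as in the coding table above.
for model in ("dom", "rec", "add"):
    coded = [code_genotype(a1, a2, "C", model)
             for a1, a2 in (("A", "A"), ("A", "C"), ("C", "C"))]
    print(model, coded)
```

Under the dominant and recessive models the genetic variable is already dichotomous; under the co-dominant model each of the two exposed levels (1 and 2) is compared with level 0.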
4. Interaction calculation. Under the dominant, recessive or co-dominant model, GEIRA
calculates estimates for both additive and multiplicative interactions.
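The two kinds of interaction can be illustrated from a table of case and control counts per exposure category. The Python sketch below is a simplification and an assumption on our part: it uses crude (unadjusted) odds ratios in place of the relative risks, whereas GEIRA's estimates come from logistic regression and may be adjusted for covariates. The function name, dictionary keys and counts are hypothetical; the category labels follow the output naming (first digit environmental variable, second digit genetic variable).

```python
def crude_interaction(counts):
    """counts maps '00', '01', '10', '11' to (controls, cases).

    First digit: environmental exposure; second digit: genetic exposure.
    """
    ctrl00, case00 = counts["00"]

    def odds_ratio(category):
        ctrl, case = counts[category]
        return (case * ctrl00) / (ctrl * case00)

    or01, or10, or11 = odds_ratio("01"), odds_ratio("10"), odds_ratio("11")
    return {
        "OR01": or01,
        "OR10": or10,
        "OR11": or11,
        # Additive interaction: attributable proportion due to interaction.
        "AP": (or11 - or10 - or01 + 1.0) / or11,
        # Multiplicative interaction: ratio of the joint OR to the product
        # of the separate ORs (1 means no multiplicative interaction).
        "ratio": or11 / (or10 * or01),
    }


# Hypothetical counts, (controls, cases) per exposure category.
table = {"00": (100, 50), "01": (100, 100), "10": (50, 75), "11": (25, 100)}
result = crude_interaction(table)
print(result["OR01"], result["OR10"], result["OR11"])  # 2.0 3.0 8.0
print(result["AP"])                                    # 0.5
```

In this example the joint odds ratio (8) exceeds both the additive expectation (2 + 3 - 1 = 4) and the multiplicative expectation (2 x 3 = 6), so both measures point to a positive interaction.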
5. Supervising macro. We now have four sections to automate the major tasks for the
analysis. Since these modules should be executed in proper order, we create a
supervising macro that will pass the correct parameters to each module in order and
automate the process.
6. Adjustment for multiple testing. Bonferroni, Sidak and False Discovery Rate (FDR)
adjusted p-values account for all tests performed and are calculated
using simple functions of the raw p-values.
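For readers who want to see what such functions of the raw p-values look like, the Python sketch below uses the standard textbook definitions of the three adjustments. Whether GEIRA's SAS code computes exactly these quantities is an assumption; in particular, the FDR adjustment is taken here to be the Benjamini-Hochberg step-up procedure. Function names and the example p-values are hypothetical.

```python
def bonferroni(pvals):
    """Bonferroni: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def sidak(pvals):
    """Sidak: 1 - (1 - p)^m for m tests."""
    m = len(pvals)
    return [1.0 - (1.0 - p) ** m for p in pvals]


def fdr_bh(pvals):
    """Benjamini-Hochberg step-up FDR adjustment (assumed variant)."""
    m = len(pvals)
    # Visit p-values from largest to smallest, keeping a running minimum so
    # that adjusted values never increase as the raw p-values decrease.
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for steps_from_top, i in enumerate(order):
        rank = m - steps_from_top  # rank in ascending order (1 = smallest)
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.5]  # hypothetical raw p-values
print(bonferroni(raw))
print(sidak(raw))
print(fdr_bh(raw))
```

Each function returns the adjusted p-values in the same order as the input, so they can be placed next to the raw values SNP by SNP.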
Fig. 1. A pictorial representation of the GEIRA structure and procedures.
Fig. 2. Detailed procedure for assigning a risk allele.
4 User-defined parameters
4.1 Directory location
The user must specify the path where the log file should be placed. It should be
specified as follows:
proc printto log='mylogpath\logfile.txt';
The user should modify mylogpath, leaving the single quotation marks in place. In Windows,
the statement would be as follows:
proc printto log='C:\GEIRA\logfile.txt';
4.2 Input datasets
GEIRA reads two genetic data files in the transposed PLINK format (TPED and
TFAM) and a covariate file containing non-genetic covariate information (including the
environmental factor under study).
At least two variables must be included in the covariate dataset:
Individual ID (indid), which is a character variable of at most 15 characters (it should
match the individual ID in the TFAM file).
Variable for an environmental exposure (env), which is a numerical variable coded with
0 (unexposed) or 1 (exposed).
In addition to the individual ID and the variable for an environmental exposure, one or
more covariates (cov) used for controlling for confounding, either numerical or
categorical variables, can also be included in the covariate data file.
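Before importing a covariate file into SAS, it can help to check it against the two requirements above. The Python sketch below is one hypothetical way to do so and is not part of GEIRA. It assumes a tab-delimited file whose first row holds the variable names, and it uses the sample variable names indid and env as defaults.

```python
import csv
import io


def check_covariates(text, id_var="indid", env_var="env"):
    """Return a list of problems found in tab-delimited covariate data."""
    problems = []
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        ind = row.get(id_var)
        if not ind or len(ind) > 15:
            problems.append(
                f"line {line_no}: {id_var} missing or longer than 15 characters")
        if row.get(env_var) not in ("0", "1"):
            problems.append(f"line {line_no}: {env_var} must be 0 or 1")
    return problems


# Hypothetical file content: the third subject has an invalid exposure code.
sample = "indid\tenv\tcov\n1\t0\t34.5\n2\t1\t41.0\n3\t2\t29.1\n"
for problem in check_covariates(sample):
    print(problem)
```

To check a file on disk, read its content first (for example with open(path).read()) and pass the resulting string to check_covariates.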
4.3. Macro variable parameters
Input macro-variables:
data_covariate=
specifies the name of the covariate data file
id=
specifies individual id variable name in the covariate data file
nongenovar=
specifies the variable for the environmental exposure and a list of
confounding variables that you want to adjust for in the covariate
data file. E.g., nongenovar=%str(age, sex, smoking)
path_tped=
specifies the path where the tped file is located. e.g.,
C:\GEIRA\myfile.tped
path_tfam=
specifies the path where the tfam file is located. e.g.,
C:\GEIRA\myfile.tfam
SampleSize=
provides total sample size including both cases and controls
covar_cat=
specifies a list of categorical variables that you want to adjust for,
leaving it empty if you don’t have any categorical variables to
adjust for. E.g., covar_cat=%str( sex) or covar_cat=%str()
envir=
provides the name of the variable for environmental exposure
covar_cont=
specifies a list of numerical variables that you want to adjust for,
leaving it empty if you don’t have any numerical variables to
adjust for. E.g., covar_cont=%str( age) or covar_cont=%str()
GenetModel=
specifies the genetic model that you want to test: dom=dominant
model, rec=recessive model and add=co-dominant model
chr=
specifies the chromosome number that you want to analyze
p_AP=
specifies the variable name of test p value that you want to adjust
for multiple testing, e.g., p_AP, or P_AP2_ADD
5. OUTPUT
For each genetic model there is one output data file, plus one additional file if the
multiple-testing adjustment module is also run.
5.1 Dominant or recessive model
5.1.1 Out_dom: output file name for dominant model.
Out_dom_adjust: name of output file with adjustment for multiple testing for
dominant model.
Out_rec: output file name for recessive model.
Out_rec_adjust: name of output file with adjustment for multiple testing for
recessive model.
5.1.2 Variables in Out_dom or Out_rec:
chr – chromosome number
rsn – SNP rs number
AP – the attributable proportion due to interaction
APlowci – the low boundary of 95% confidence interval (CI) for AP
APhighci – the high boundary of 95% CI for AP
P_AP – P value for AP (P value for biological interaction)
Pmulti – P value for statistical interaction (multiplicative)
OR_ind01 – Odds ratio (OR) for the “Environmental variable No, Genetic
variable Yes” exposure category compared with the “Environmental variable
No, Genetic variable No” exposure category
lowci_ind01 – the low boundary of 95% CI for OR_ind01
uppci_ind01 – the high boundary of 95% CI for OR_ind01
OR_ind10 – OR for the “Environmental variable Yes, Genetic variable No”
exposure category compared with the “Environmental variable No, Genetic
variable No” exposure category
lowci_ind10 – the low boundary of 95% CI for OR_ind10
uppci_ind10 – the high boundary of 95% CI for OR_ind10
OR_ind11 – OR for the “Environmental variable Yes, Genetic variable Yes”
exposure category compared with the “Environmental variable No, Genetic
variable No” exposure category
lowci_ind11 – the low boundary of 95% CI for OR_ind11
uppci_ind11 – the high boundary of 95% CI for OR_ind11
ind00_0 – number of controls in the exposure category “Environmental variable
No, Genetic variable No”
ind00_1 – number of cases in the exposure category “Environmental variable
No, Genetic variable No”
ind01_0 – number of controls in the exposure category “Environmental variable
No, Genetic variable Yes”
ind01_1 – number of cases in the exposure category “Environmental variable
No, Genetic variable Yes”
ind10_0 – number of controls in the exposure category “Environmental variable
Yes, Genetic variable No”
ind10_1 – number of cases in the exposure category “Environmental variable
Yes, Genetic variable No”
ind11_0 – number of controls in the exposure category “Environmental variable
Yes, Genetic variable Yes”
ind11_1 – number of cases in the exposure category “Environmental variable
Yes, Genetic variable Yes”
minor – minor allele
major – major allele
risk – risk allele
cc_minor – minor allele frequency in both cases and controls
cc_major – major allele frequency in both cases and controls
case_minor – minor allele frequency in cases
ctrl_minor – minor allele frequency in controls
ctrl_major – major allele frequency in controls
case_major – major allele frequency in cases
5.1.3 Variable name in Out_dom_adjust or Out_rec_adjust
Four variables are added to those in Out_dom or Out_rec:
raw_p – the p value named in the user-defined macro variable p_AP (the variable name of
the test p value that you want to adjust for multiple testing, e.g., p_AP)
Bonferroni p-value – Bonferroni adjusted p value
Sidak p-value – Sidak adjusted p value
False Discovery Rate p-value – False discovery rate adjusted p-value
5.2. Co-dominant model
5.2.1 Out_add: output file name for co-dominant model.
Out_add_adjust: name of output file with adjustment for multiple testing for
co-dominant model.
5.2.2 Variables in Out_add:
chr – chromosome number
rsn – SNP rs number
AP1_ADD – the attributable proportion due to interaction (level 1 genetic
variable: one-dose risk allele)
AP1lowci_ADD – the low boundary of 95% CI for AP1_ADD
AP1highci_ADD – the high boundary of 95% CI for AP1_ADD
P_AP1_ADD – P value for AP1_ADD
AP2_ADD – the attributable proportion due to interaction (level 2 genetic
variable: two-dose risk allele)
AP2lowci_ADD – the low boundary of 95% CI for AP2_ADD
AP2highci_ADD – the high boundary of 95% CI for AP2_ADD
P_AP2_ADD – P value for AP2_ADD
Pmulti – P value for multiplicative interaction
OR_ind01_ADD – OR for the “Environmental variable No, Genetic variable
Yes 1” exposure category compared with the “Environmental variable No,
Genetic variable No” exposure category
lowci_ind01_ADD – the low boundary of 95% CI for OR_ind01_ADD
uppci_ind01_ADD – the high boundary of 95% CI for OR_ind01_ADD
OR_ind02_ADD – OR for the “Environmental variable No, Genetic variable
Yes 2” exposure category compared with the “Environmental variable No,
Genetic variable No” exposure category
lowci_ind02_ADD – the low boundary of 95% CI for OR_ind02_ADD
uppci_ind02_ADD – the high boundary of 95% CI for OR_ind02_ADD
OR_ind10_ADD – OR for the “Environmental variable Yes, Genetic variable
No 0” exposure category compared with the “Environmental variable No,
Genetic variable No” exposure category
lowci_ind10_ADD – the low boundary of 95% CI for OR_ind10_ADD
uppci_ind10_ADD – the high boundary of 95% CI for OR_ind10_ADD
OR_ind11_ADD – OR for the “Environmental variable Yes, Genetic variable
Yes 1” exposure category compared with the “Environmental variable No,
Genetic variable No” exposure category
lowci_ind11_ADD – the low boundary of 95% CI for OR_ind11_ADD
uppci_ind11_ADD – the high boundary of 95% CI for OR_ind11_ADD
OR_ind12_ADD – OR for the “Environmental variable Yes, Genetic variable
Yes 2” exposure category compared with the “Environmental variable No,
Genetic variable No” exposure category
lowci_ind12_ADD – the low boundary of 95% CI for OR_ind12_ADD
uppci_ind12_ADD – the high boundary of 95% CI for OR_ind12_ADD
ind00_0 – number of controls in the exposure category “Environmental variable
No, Genetic variable No”
ind00_1 – number of cases in the exposure category “Environmental variable
No, Genetic variable No”
ind01_0 – number of controls in the exposure category “Environmental variable
No, Genetic variable Yes 1”
ind01_1 – number of cases in the exposure category “Environmental variable
No, Genetic variable Yes 1”
ind02_0 – number of controls in the exposure category “Environmental variable
No, Genetic variable Yes 2”
ind02_1 – number of cases in the exposure category “Environmental variable
No, Genetic variable Yes 2”
ind10_0 – number of controls in the exposure category “Environmental variable
Yes, Genetic variable No”
ind10_1 – number of cases in the exposure category “Environmental variable
Yes, Genetic variable No”
ind11_0 – number of controls in the exposure category “Environmental variable
Yes, Genetic variable Yes 1”
ind11_1 – number of cases in the exposure category “Environmental variable
Yes, Genetic variable Yes 1”
ind12_0 – number of controls in the exposure category “Environmental variable
Yes, Genetic variable Yes 2”
ind12_1 – number of cases in the exposure category “Environmental variable
Yes, Genetic variable Yes 2”
minor – minor allele
major – major allele
risk – risk allele
cc_minor – minor allele frequency in both cases and controls
cc_major – major allele frequency in both cases and controls
case_minor – minor allele frequency in cases
ctrl_minor – minor allele frequency in controls
ctrl_major – major allele frequency in controls
case_major – major allele frequency in cases
5.2.3 Variable name in Out_add_adjust
Four variables are added to those in Out_add:
raw_p – the p value named in the user-defined macro variable p_AP (the variable name of
the test p value that you want to adjust for multiple testing, e.g.,
P_AP1_ADD or P_AP2_ADD)
Bonferroni p-value – Bonferroni adjusted p value
Sidak p-value – Sidak adjusted p value
False Discovery Rate p-value – False discovery rate adjusted p-value
6 Getting started with GEIRA
6.1 Download and store the GEIRA program
GEIRA is available for free download. Go to the website http://www.epinet.se,
download the GEIRA SAS version, and save the program file on your local machine. The
file should have the extension “.sas”.
6.2. Running GEIRA with your own data sets
Step 1. Start the SAS system, select the program editor window, and use the Open
Program command to open your copy of the GEIRA program file (GEIRA.sas) into the
program editor window.
Step 2. Set the log file location, e.g., proc printto log='C:\GEIRA\LOG\logfile.txt';
Step 3. Assign values to each of the macro variables as described in 4.3.
Step 4. Submit the program.
7 Contact and support
Users may contact the program author at [email protected] for assistance with issues not
detailed in this manual. Users are free to modify and distribute the program. However,
just like any other free program, we cannot guarantee that it does not contain bugs.
8. References
1. Rothman, K.J., Greenland, S. & Walker, A.M. Concepts of interaction. Am J
Epidemiol 112, 467-470 (1980).
2. Hosmer, D.W. & Lemeshow, S. Confidence interval estimation of interaction.
Epidemiology 3, 452-456 (1992).