GEIRA: Gene-Environment and Gene-Gene Interaction Research Application
SAS version 1.0
User’s manual

Bo Ding
Institute of Environmental Medicine
Karolinska Institutet
171 77 Stockholm, Sweden
June 11, 2010

1 Introduction

GEIRA is a tool designed to perform genome-wide gene-environment interaction analyses. It can easily be extended to genome-wide gene-gene interaction analyses, provided that the genetic variable (such as a SNP, CNV or haplotype) can be dichotomized (0 = unexposed, 1 = exposed) in the same way as an environmental variable.

GEIRA calculates measures of both additive and multiplicative interaction. Multiplicative interaction refers here to an interaction term in the logistic regression model. Additive interaction is defined as a deviation from additivity of the absolute effects of two risk factors, as originally described by Rothman (1). The attributable proportion due to interaction (AP) quantifies the contribution of interaction to disease risk, as compared with the contribution of each of the two risk factors added to each other. AP is calculated as:

    AP = (RR11 - RR10 - RR01 + 1) / RR11

where RR10 is the relative risk for the first risk factor in the absence of the second, RR01 is the relative risk for the second risk factor in the absence of the first, and RR11 is the relative risk in the exposure category where both risk factors are present. Those who are unexposed to both risk factors form the reference category, i.e., RR00 = 1. A symmetrical confidence interval for AP is calculated using the formulas developed by Hosmer and Lemeshow (2).

Calculations are made separately for the dominant, recessive and co-dominant genetic models. All interaction estimates are displayed in a single table so that the output is easy to read.

GEIRA (SAS version) is written in the SAS macro language and incorporates procedures in SAS/Base (SAS 9.2).
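As a worked illustration of the AP formula given in the Introduction, the following minimal Python sketch evaluates AP for one set of relative risks. The function name attributable_proportion and the numeric values are hypothetical and are not part of GEIRA, which performs this calculation inside its SAS macros.

```python
def attributable_proportion(rr10: float, rr01: float, rr11: float) -> float:
    """Return AP = (RR11 - RR10 - RR01 + 1) / RR11, with RR00 = 1 as reference."""
    if rr11 <= 0:
        raise ValueError("rr11 must be positive")
    return (rr11 - rr10 - rr01 + 1.0) / rr11


# Hypothetical relative risks, chosen only for illustration.
ap = attributable_proportion(rr10=2.0, rr01=1.5, rr11=5.0)
print(f"AP = {ap:.2f}")  # (5.0 - 2.0 - 1.5 + 1) / 5.0, prints "AP = 0.50"
```

With these values the joint effect exceeds the sum of the separate effects, so AP is positive. When RR11 = RR10 + RR01 - 1 the effects are exactly additive and AP is 0.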
Using GEIRA requires no previous experience with SAS. This manual provides a detailed guide to get you started.

2 Quick start

Download the GEIRA (SAS) package from www.epinet.se and save it on your local machine, for example in C:\GEIRA. Unzip the package. It contains one PDF file with the User’s manual (GEIRA_manual_SAS.pdf), three sample input data files (sample.tped, sample.tfam, and SampleCov.txt), one program file (GEIRA.sas), and six sample output files (out_dom.sas7bdat, out_dom_adjust.sas7bdat, out_rec.sas7bdat, out_rec_adjust.sas7bdat, out_add.sas7bdat, and out_add_adjust.sas7bdat).

A. Invocation of SAS on a Windows platform

Start SAS by clicking on the Start menu and highlighting the Programs tab. Locate and highlight The SAS System and double-click on SAS 9.2. SAS will load.

B. Import the covariate data file (in txt format, e.g., SampleCov.txt) through the SAS Import Wizard

(a) Open the wizard by clicking on the File menu and selecting Import Data. The Import Wizard should appear.
(b) Following the steps of the Import Wizard, you will:
    1. Select the type of data you want to import (for the sample data, select Tab Delimited File (*.txt)).
    2. Browse for the file (e.g., C:\GEIRA\SampleCov.txt), click Open and then click Next.
    3. Choose the Library WORK and enter the Member name (name it Covariate).
    4. The Import Wizard will then ask whether you would like to create a file containing the PROC IMPORT command. Click on Finish.

If the import was successful, the Log window shows a note like the following:

    NOTE: WORK.COVARIATE data set was successfully created

Running GEIRA with sample datasets

Step 1. Start the SAS system, select the program editor window, and use File/Open Program to open your copy of the GEIRA program file (GEIRA.sas) in the program editor window.
Step 2. Once the GEIRA program appears in the editor, run it by selecting Submit from the Run pull-down menu.
Step 3. Assign the macro variables as follows (for the sample datasets):

    %GEIRA(data_covariate=samplecov,
           id=indid,
           nongenovar=%str(cov env),
           path_tped=C:\GEIRA\sample.tped,
           path_tfam=C:\GEIRA\sample.tfam,
           SampleSize=2000,
           covar_cat=%str(),
           envir=env,
           covar_cont=%str(cov),
           GenetModel=dom,
           chr=1)
    %Multi(GenetModel=dom, p_AP=p_AP)

See 4.3 for a detailed description of the macro variables.

Viewing output

In the Explorer window, you will see two output files (Out_dom and Out_dom_adjust). Open them by double-clicking and compare their contents with the provided files (out_dom.sas7bdat and out_dom_adjust.sas7bdat). The results should be identical.

C. Export the SAS output data file to an external file through the SAS Export Wizard

(a) From a SAS session’s Results window, select the File menu and then select the Export Data item.
(b) Following the steps of the Export Wizard, you will:
    1. Choose the Library WORK and Member (e.g., out_dom).
    2. Select the type of data you want to export (e.g., Tab Delimited File (*.txt)) and click Next.
    3. Browse for the location where the output data file should reside (e.g., C:\GEIRA), give a file name, then click Save and Next.
    4. Click on Finish.

If the export was successful, the Log window shows a note like the following:

    NOTE: "C:\Documents and Settings\XX\Desktop\sampe.txt" file was successfully created.

Congratulations! You have completed the sample test run and can move on to your own data.

3 Features

Upon invocation, GEIRA performs the following procedures (Figs. 1 and 2):

1. Data importing. Here we assume that another software package has already been used for quality control of the genome-wide genotype data, including dataset filtering on the basis of SNP genotype call rates, minor allele frequency, Hardy-Weinberg equilibrium, individual missing genotypes, outliers, etc.
The program reads transposed PLINK format data files, i.e., TPED (containing SNP and genotype information, where one row is a SNP) and TFAM (containing individual and family information, where one row is an individual).

The first four columns of the TPED file must be:

    Chromosome (1-22, X, Y, or 0 if unplaced)
    SNP rs# (must start with ‘rs’)
    Genetic distance (morgans)
    Base-pair position (bp units)

Genotypes (column 5 onwards) should also be white-space delimited. Alleles are characters (A, C, G, T), except 0, which is, by default, the missing genotype character. All markers should be biallelic. All SNPs (whether haploid or not) must have two alleles specified. Either both alleles should be missing (i.e., 0) or neither. No header row should be given.

The TFAM file is a white-space (space or tab) delimited file with the following columns:

    Family ID
    Individual ID
    Paternal ID
    Maternal ID
    Sex (1 = male; 2 = female; other = unknown)
    Phenotype

Affection status (phenotype) should, by default, be coded:

    -9  missing
     0  missing
     1  unaffected
     2  affected

Example TPED/TFAM files:

    tped:
    1 rs1 0 5000650 A A A C C C A C C C C C
    1 rs2 0 5000830 G T G T G G T T G T T T

    tfam:
    1 1 0 0 1 1
    2 1 0 0 1 1
    3 1 0 0 1 1
    4 1 0 0 1 2
    5 1 0 0 1 2
    6 1 0 0 1 2

(See the PLINK documentation for details: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#tr)

In addition to the TPED and TFAM files, a covariate file (txt file) containing non-genetic covariate information (including the environmental factor under study) is needed.

2. Risk allele assigning. The minor allele is determined using all subjects (both cases and controls). The risk allele is determined by comparing the minor allele frequency (MAF) in cases and controls: if the MAF in cases is greater than or equal to that in controls, the minor allele is designated the risk allele; otherwise the major allele is (see Fig. 2 for details).

3. Data converting. The raw genotype dataset is then converted to three datasets corresponding to the three genetic models (dominant, recessive and co-dominant). Assuming C is the minor allele and also the risk allele, the coding is:

    Dominant model coding:     A_A→0, A_C→1, C_C→1
    Recessive model coding:    A_A→0, A_C→0, C_C→1
    Co-dominant model coding:  A_A→0, A_C→1, C_C→2

4. Interaction calculation. For the dominant, recessive or co-dominant model, estimates are calculated for both additive and multiplicative interaction.

5. Supervising macro. The four sections above automate the major tasks of the analysis. Since these modules must be executed in the proper order, a supervising macro passes the correct parameters to each module in order and automates the process.

6. Adjustment for multiple testing. P-values are adjusted for all tests performed using the Bonferroni, Sidak and False Discovery Rate (FDR) methods; the adjusted values are calculated using simple functions of the raw p-values.

Fig. 1. A pictorial representation of the GEIRA structure and procedures.
Fig. 2. Detailed procedures for assigning a risk allele.

4 User-defined parameters

4.1 Directory location

The user must specify the path where the log file should be placed, as follows:

    proc printto log='mylogpath\logfile.txt';

Modify mylogpath, leaving the single quotation marks in place. In Windows, the statement would be, for example:

    proc printto log='C:\GEIRA\logfile.txt';

4.2 Input datasets

GEIRA reads two genetic data files in the transposed PLINK format (TPED and TFAM) and a covariate file containing non-genetic covariate information (including the environmental factor under study). At least two variables must be included in the covariate dataset:

Individual ID (indid), a categorical variable of at most 15 characters (it should match the individual ID in the TFAM file).
Variable for an environmental exposure (env), a numerical variable coded 0 (unexposed) or 1 (exposed).

In addition to the individual ID and the environmental exposure variable, one or more covariates (cov) used to control for confounding, either numerical or categorical, can also be included in the covariate data file.

4.3 Macro variable parameters

Input macro variables:

data_covariate= specifies the name of the covariate data file
id= specifies the individual ID variable name in the covariate data file
nongenovar= specifies the variable for the environmental exposure and the list of confounding variables that you want to adjust for in the covariate data file, e.g., nongenovar=%str(age, sex, smoking)
path_tped= specifies the path where the tped file is located, e.g., C:\GEIRA\myfile.tped
path_tfam= specifies the path where the tfam file is located, e.g., C:\GEIRA\myfile.tfam
SampleSize= provides the total sample size, including both cases and controls
covar_cat= specifies the list of categorical variables that you want to adjust for; leave it empty if there are none, e.g., covar_cat=%str( sex) or covar_cat=%str()
envir= provides the name of the variable for the environmental exposure
covar_cont= specifies the list of numerical variables that you want to adjust for; leave it empty if there are none, e.g., covar_cont=%str( age) or covar_cont=%str()
GenetModel= specifies the genetic model that you want to test: dom = dominant model, rec = recessive model, add = co-dominant model
chr= specifies the chromosome number that you want to analyze
p_AP= specifies the variable name of the test p value that you want to adjust for multiple testing, e.g., p_AP or P_AP2_ADD

5 Output

For each genetic model there is one output data file, plus one additional file if the multiple-testing adjustment module is also run.
5.1 Dominant or recessive model

5.1.1 Out_dom: output file name for the dominant model.
Out_dom_adjust: name of the output file with adjustment for multiple testing for the dominant model.
Out_rec: output file name for the recessive model.
Out_rec_adjust: name of the output file with adjustment for multiple testing for the recessive model.

5.1.2 Variable names in Out_dom or Out_rec

chr – chromosome number
rsn – SNP rs number
AP – the attributable proportion due to interaction
APlowci – the lower bound of the 95% confidence interval (CI) for AP
APhighci – the upper bound of the 95% CI for AP
P_AP – P value for AP (P value for biological interaction)
Pmulti – P value for statistical interaction (multiplicative)
OR_ind01 – odds ratio (OR) for the “Environmental variable No, Genetic variable Yes” exposure category compared with the “Environmental variable No, Genetic variable No” exposure category
lowci_ind01 – the lower bound of the 95% CI for OR_ind01
uppci_ind01 – the upper bound of the 95% CI for OR_ind01
OR_ind10 – OR for the “Environmental variable Yes, Genetic variable No” exposure category compared with the “Environmental variable No, Genetic variable No” exposure category
lowci_ind10 – the lower bound of the 95% CI for OR_ind10
uppci_ind10 – the upper bound of the 95% CI for OR_ind10
OR_ind11 – OR for the “Environmental variable Yes, Genetic variable Yes” exposure category compared with the “Environmental variable No, Genetic variable No” exposure category
lowci_ind11 – the lower bound of the 95% CI for OR_ind11
uppci_ind11 – the upper bound of the 95% CI for OR_ind11
ind00_0 – number of controls in the exposure category “Environmental variable No, Genetic variable No”
ind00_1 – number of cases in the exposure category “Environmental variable No, Genetic variable No”
ind01_0 – number of controls in the exposure category “Environmental variable No, Genetic variable Yes”
ind01_1 – number of cases in the exposure category “Environmental variable No, Genetic variable Yes”
ind10_0 – number of controls in the exposure category “Environmental variable Yes, Genetic variable No”
ind10_1 – number of cases in the exposure category “Environmental variable Yes, Genetic variable No”
ind11_0 – number of controls in the exposure category “Environmental variable Yes, Genetic variable Yes”
ind11_1 – number of cases in the exposure category “Environmental variable Yes, Genetic variable Yes”
minor – minor allele
major – major allele
risk – risk allele
cc_minor – minor allele frequency in both cases and controls
cc_major – major allele frequency in both cases and controls
case_minor – minor allele frequency in cases
ctrl_minor – minor allele frequency in controls
ctrl_major – major allele frequency in controls
case_major – major allele frequency in cases

5.1.3 Variable names in Out_dom_adjust or Out_rec_adjust

Four additional variables are added to those of Out_dom or Out_rec:

raw_p – the same p value as the user-defined macro variable p_AP (the variable name of the test p value that you want to adjust for multiple testing, e.g., p_AP)
Bonferroni p-value – Bonferroni adjusted p value
Sidak p-value – Sidak adjusted p value
False Discovery Rate p-value – False discovery rate adjusted p value

5.2 Co-dominant model

5.2.1 Out_add: output file name for the co-dominant model.
Out_add_adjust: name of the output file with adjustment for multiple testing for the co-dominant model.
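The manual does not show how GEIRA computes the adjusted p values stored in the *_adjust files. The Python sketch below illustrates the standard textbook definitions of the three adjustments, assuming that the FDR adjustment is the Benjamini-Hochberg step-up procedure. The function names and the raw p values are hypothetical and are not part of GEIRA.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: p * m, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def sidak(pvals):
    """Sidak adjustment: 1 - (1 - p)^m."""
    m = len(pvals)
    return [1.0 - (1.0 - p) ** m for p in pvals]


def fdr_bh(pvals):
    """Benjamini-Hochberg adjustment (assumed here to be the FDR method)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p values for four SNPs.
raw_p = [0.001, 0.01, 0.04, 0.5]
for name, func in [("Bonferroni", bonferroni), ("Sidak", sidak), ("FDR", fdr_bh)]:
    print(name, [round(p, 4) for p in func(raw_p)])
```

All three adjustments depend only on the raw p values and the number of tests, which is consistent with the description of the adjustment as simple functions of the raw p values.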
5.2.2 Variables in Out_add

chr – chromosome number
rsn – SNP rs number
AP1_ADD – the attributable proportion due to interaction (level 1 genetic variable: one-dose risk allele)
AP1lowci_ADD – the lower bound of the 95% CI for AP1_ADD
AP1highci_ADD – the upper bound of the 95% CI for AP1_ADD
P_AP1_ADD – P value for AP1_ADD
AP2_ADD – the attributable proportion due to interaction (level 2 genetic variable: two-dose risk allele)
AP2lowci_ADD – the lower bound of the 95% CI for AP2_ADD
AP2highci_ADD – the upper bound of the 95% CI for AP2_ADD
P_AP2_ADD – P value for AP2_ADD
Pmulti – P value for multiplicative interaction
OR_ind01_ADD – OR for the “Environmental variable No, Genetic variable Yes 1” exposure category compared with the “Environmental variable No, Genetic variable No” exposure category
lowci_ind01_ADD – the lower bound of the 95% CI for OR_ind01_ADD
uppci_ind01_ADD – the upper bound of the 95% CI for OR_ind01_ADD
OR_ind02_ADD – OR for the “Environmental variable No, Genetic variable Yes 2” exposure category compared with the “Environmental variable No, Genetic variable No” exposure category
lowci_ind02_ADD – the lower bound of the 95% CI for OR_ind02_ADD
uppci_ind02_ADD – the upper bound of the 95% CI for OR_ind02_ADD
OR_ind10_ADD – OR for the “Environmental variable Yes, Genetic variable No” exposure category compared with the “Environmental variable No, Genetic variable No” exposure category
lowci_ind10_ADD – the lower bound of the 95% CI for OR_ind10_ADD
uppci_ind10_ADD – the upper bound of the 95% CI for OR_ind10_ADD
OR_ind11_ADD – OR for the “Environmental variable Yes, Genetic variable Yes 1” exposure category compared with the “Environmental variable No, Genetic variable No” exposure category
lowci_ind11_ADD – the lower bound of the 95% CI for OR_ind11_ADD
uppci_ind11_ADD – the upper bound of the 95% CI for OR_ind11_ADD
OR_ind12_ADD – OR for the “Environmental variable Yes, Genetic variable Yes 2” exposure category compared with the “Environmental variable No, Genetic variable No” exposure category
lowci_ind12_ADD – the lower bound of the 95% CI for OR_ind12_ADD
uppci_ind12_ADD – the upper bound of the 95% CI for OR_ind12_ADD
ind00_0 – number of controls in the exposure category “Environmental variable No, Genetic variable No”
ind00_1 – number of cases in the exposure category “Environmental variable No, Genetic variable No”
ind01_0 – number of controls in the exposure category “Environmental variable No, Genetic variable Yes 1”
ind01_1 – number of cases in the exposure category “Environmental variable No, Genetic variable Yes 1”
ind02_0 – number of controls in the exposure category “Environmental variable No, Genetic variable Yes 2”
ind02_1 – number of cases in the exposure category “Environmental variable No, Genetic variable Yes 2”
ind10_0 – number of controls in the exposure category “Environmental variable Yes, Genetic variable No”
ind10_1 – number of cases in the exposure category “Environmental variable Yes, Genetic variable No”
ind11_0 – number of controls in the exposure category “Environmental variable Yes, Genetic variable Yes 1”
ind11_1 – number of cases in the exposure category “Environmental variable Yes, Genetic variable Yes 1”
ind12_0 – number of controls in the exposure category “Environmental variable Yes, Genetic variable Yes 2”
ind12_1 – number of cases in the exposure category “Environmental variable Yes, Genetic variable Yes 2”
minor – minor allele
major – major allele
risk – risk allele
cc_minor – minor allele frequency in both cases and controls
cc_major – major allele frequency in both cases and controls
case_minor – minor allele frequency in cases
ctrl_minor – minor allele frequency in controls
ctrl_major – major allele frequency in controls
case_major – major allele frequency in cases

5.2.3 Variable names in Out_add_adjust

Four additional variables are added to those of Out_add:

raw_p – the same p value as the user-defined macro variable p_AP (the variable name of the test p value that you want to adjust for multiple testing, e.g., P_AP1_ADD or P_AP2_ADD)
Bonferroni p-value – Bonferroni adjusted p value
Sidak p-value – Sidak adjusted p value
False Discovery Rate p-value – False discovery rate adjusted p value

6 Getting started with GEIRA

6.1 Download and store the GEIRA program

GEIRA is available for free download. Go to the website http://www.epinet.se, download the GEIRA SAS version, and save the program file on your local machine. The file should be assigned the extension “.sas”.

6.2 Running GEIRA with your own data sets

Step 1. Start the SAS system, select the program editor window, and use the Open Program command to open your copy of the GEIRA program file (GEIRA.sas) in the program editor window.
Step 2. Modify the log file statement, for example to:

    proc printto log='C:\GEIRA\LOG\logfile.txt';

Step 3. Assign values to each of the macro variables as described in 4.3.
Step 4. Submit the program.

7 Contact and support

Users may contact the program author at [email protected] for assistance with issues not detailed in this manual. Users are free to modify and distribute the program. However, as with any other free program, we cannot guarantee that it is free of bugs.

8 References

1. Rothman, K.J., Greenland, S. & Walker, A.M. Concepts of interaction. Am J Epidemiol 112, 467-470 (1980).
2. Hosmer, D.W. & Lemeshow, S. Confidence interval estimation of interaction. Epidemiology 3, 452-456 (1992).