Positional effects revealed in Illumina Methylation Array

Supplemental Materials
Positional effects revealed in Illumina Methylation Array and the
impact on analysis
Chuan Jiao1, Chunling Zhang2, Rujia Dai1, Yan Xia1, Kangli Wang1, Gina Giase3, Chao Chen1,* and
Chunyu Liu1, 3,*
1 The State Key Laboratory of Medical Genetics, Central South University, Changsha, Hunan,
410012, China
2 Center for Research Informatics, University of Chicago, Chicago, IL, 60607, USA
3 Department of Psychiatry, University of Illinois at Chicago, Chicago, IL, 60607, USA
* To whom correspondence should be addressed. Tel: (+1)3124132599; Email: [email protected]. Correspondence
may also be addressed to Chao Chen. Tel: (+86)18874114280; Fax: (+86)0731-84478152; Email:
[email protected].
Here we first processed all datasets with the same pipeline and corrected the batch and positional
effects using ComBat and the lm function in R. Second, we used four evaluation methods to assess the
methods for correcting positional effects, and we demonstrate that the positional effect exists in the
Illumina HumanMethylation BeadChip. Last, to provide guidance for analyzing Illumina Infinium
HumanMethylation datasets, we propose a method to control this artifact.
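The linear-model correction mentioned above can be illustrated with a minimal sketch. It is written in Python rather than R, and it assumes a model with chip position as the only categorical covariate; the exact lm formula used in the analysis is not given here, so the function name `remove_position_effect` and the example values are hypothetical. Under that assumption, the least-squares fitted value of a sample is the mean of its position group, and the corrected value is the residual plus the grand mean.

```python
from collections import defaultdict


def remove_position_effect(values, positions):
    """Residualize one probe's values on chip position (a categorical factor).

    With a single categorical covariate, the least-squares fitted value of a
    sample is the mean of its position group, so the residual is
    value - group mean. The grand mean is added back to keep the scale.
    """
    if len(values) != len(positions):
        raise ValueError("values and positions must have the same length")

    # Collect the values observed at each chip position.
    groups = defaultdict(list)
    for v, p in zip(values, positions):
        groups[p].append(v)

    group_mean = {p: sum(vs) / len(vs) for p, vs in groups.items()}
    grand_mean = sum(values) / len(values)

    return [v - group_mean[p] + grand_mean for v, p in zip(values, positions)]


# Hypothetical methylation values for one probe at chip positions 1 and 2.
probe_values = [1.0, 1.2, 2.0, 2.2]
chip_pos = [1, 1, 2, 2]
corrected = remove_position_effect(probe_values, chip_pos)
```

After this correction every position group has the same mean, while the differences between samples within a position are unchanged. A real analysis would usually include biological covariates in the model so that they are not removed along with the positional effect.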
Figure S1. The sample maps of Methyl450 (left) and Methyl27 (right). The numbers from one to
twelve are the position identifiers we used in the analysis. The figures were obtained from the
Illumina website (http://www.illumina.com/).
Other datasets results
We regarded the arrays as the batches in some datasets (GSE58885, GSE26133, BrainCloud and
GSE38873); the boxplot results for the different batches are not listed here.
Figure S2. The distribution of methylation levels in the eight differently processed GSE74193
datasets. (a) Line chart of average methylation levels at different positions. (b) Boxplot of
methylation levels in different batches.
Figure S3. Average methylation levels in different positions. (a) GSE58885 dataset. (b) GSE38873
dataset. (c) GSE26133 dataset. (d) BrainCloud dataset.
Figure S4. PVCA results in 450k datasets. (a) GSE58885 dataset. PVCA estimated the contribution
of each factor to the overall variation. We considered four possible sources of variation, including: days
post-conception (DPC), sex, batches (Array) and positions (Position); as well as the weight of residual
effect (resid) that known factors could not explain. (b) GSE74193 dataset. We considered six possible
sources of variation, including: Age, Disease status (Dx), Race, Sex, Batches and Positions (Position);
as well as the weight of residual effect (resid) that known factors could not explain.
Figure S5. PVCA results in 27k datasets. (a) GSE38873 dataset. PVCA estimated the contribution
of each factor to the overall variation. We considered nine possible sources of variation, including: Age,
Sex, Disease status (Dx), lifetime use of antipsychotics (LTantiPSY), Smoking, BrainPH, Postmortem
interval (PMI), Batch and Position; as well as the weight of residual effect (resid) that known factors
could not explain. (b) GSE26133 dataset. We considered four possible sources of variation, including:
Gender, Array, Batch and Position; as well as the weight of residual effect (resid) that known factors
could not explain. (c) BrainCloud dataset. We considered five possible sources of variation, including:
Age, Sex, Race, Batch and Position; as well as the weight of residual effect (resid) that known factors
could not explain.
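The idea of attributing a share of the overall variation to each factor can be illustrated with a much simpler quantity than PVCA. PVCA itself combines principal components with variance components from a mixed model; the sketch below is not PVCA and only computes, for a single variable, the between-group sum of squares divided by the total sum of squares for one factor at a time. The function name `variance_fraction` and all example values are hypothetical.

```python
from collections import defaultdict


def variance_fraction(values, labels):
    """Return between-group sum of squares divided by total sum of squares."""
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0  # no variation at all, so nothing to attribute

    # Collect the values that share each factor level.
    groups = defaultdict(list)
    for v, g in zip(values, labels):
        groups[g].append(v)

    between_ss = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in groups.values()
    )
    return between_ss / total_ss


# Hypothetical scores for four samples (e.g. one probe or one component).
scores = [1.0, 1.2, 2.0, 2.2]
position = [1, 1, 2, 2]
sex = ["F", "M", "F", "M"]
frac_position = variance_fraction(scores, position)
frac_sex = variance_fraction(scores, sex)
```

In this toy example position accounts for most of the variation and sex for very little, which is the kind of contrast the PVCA bar charts summarize across all known factors and the residual.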
Figure S6. The comparison between different processed datasets in GSE74193. Fig. S6a,
S6b, S6c and S6d are used to compare the correction methods for positional effects, Fig. S6e is
used to compare the order in which positional effects and batch effects are corrected, and Fig. S6f is
used to evaluate the efficiency of the correction of positional effects. (a) Pos(ComBat)_data versus
Pos(lm)_data, (b) BatchPos(ComBat)_data versus Batch(ComBat)Pos(lm)_data,
(c) PosBatch(ComBat)_data versus Pos(lm)Batch(ComBat)_data, (d) PosBatch(ComBat)_data
versus FN_data, (e) PosBatch(ComBat)_data versus BatchPos(ComBat)_data,
(f) PosBatch(ComBat)_data versus Batch_data, (g) PosBatch(ComBat)_data versus Raw_data.
The red lines indicate y=x. The values in the top left corner give the result of the one-tailed
Wilcoxon signed-rank test; W is the test statistic, the sum of the signed ranks, which can be
compared to a critical value from a reference table to obtain a p-value.
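The W statistic described in the caption can be computed as in the following minimal sketch. The function name `signed_rank_w` and the example values are hypothetical. As assumptions of this sketch, zero differences are dropped, tied absolute differences share their average rank, and the one-tailed p-value comes from a normal approximation rather than from a reference table.

```python
import math


def signed_rank_w(x, y):
    """Return (W, one-tailed p) for paired samples x and y.

    W is the sum of the signed ranks of the non-zero differences x - y.
    The p-value uses a normal approximation for the alternative that x
    tends to exceed y.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    if not diffs:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving ties their average rank.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w = sum(r if d > 0 else -r for r, d in zip(ranks, diffs))
    # Under the null hypothesis W has mean 0 and variance sum(rank^2).
    sd = math.sqrt(sum(r * r for r in ranks))
    p_greater = 0.5 * math.erfc(w / (sd * math.sqrt(2)))
    return w, p_greater


# Hypothetical paired values from two processed versions of the same data.
w, p = signed_rank_w([5, 7, 9, 4], [4, 9, 6, 8])
```

A large positive W with a small p-value would indicate that the first dataset of a pair tends to have larger values than the second; for small sample sizes an exact table is more appropriate than the normal approximation used here.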