
Supporting Information
Text S1: Materials and data collection
We acquired flash frozen lung tissues, computed tomography (CT) data, and clinical data on
319 samples from the Lung Genomics Research Consortium (LGRC, https://www.lunggenomics.org/). These samples were initially given major clinical diagnoses of either interstitial
lung disease (ILD) or chronic obstructive pulmonary disease (COPD) based on their clinical,
pathologic, and radiographic data (136 COPD and 183 ILD). 669 clinical variables were
collected by questionnaires (demographic, medical history, family history, smoking history,
concomitant therapy, symptom, SF-12 health, St. Georges respiratory, environmental, and
occupational), tests (six-minute walk test, cardiopulmonary exercise test, PFT, blood test, and
CT scan), and diagnosis reports (central and local pathology, clinical report). More details on the
data collection are publicly available (http://www.ltrcpublic.com/clinical_IQ.htm;
"SI_survey_form_1.pdf" and "SI_survey_form_2.pdf"). Gene expression and miRNA data sets
were generated on Agilent microarray platforms (for details, refer to S13 Table; GSE47460
and GSE72967). After we matched probes with gene symbols and applied loess normalization [1]
to the two Agilent platforms, we obtained 15,966 genes and 937 miRNAs. For gene expression,
redundant (e.g., non-expressed and/or non-informative) features with mean (< 7) and standard
deviation (< 0.4) were removed. For miRNAs, we filtered the miRNA probes based on the mean
expression (< 1.7) across a total of n samples. We analyzed 669 clinical variables, 4,258 gene
expression features, and 438 miRNAs from 319 samples.
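The gene expression filter described above can be sketched as follows. This is a minimal illustration in Python, not the authors' code: the function name `filter_features`, the toy data, and the `mode` switch are assumptions. The text does not say whether a feature was removed only when both conditions held or when either one did, so the sketch exposes both readings.

```python
from statistics import mean, stdev

def filter_features(expr, mean_cutoff=7.0, sd_cutoff=0.4, mode="all"):
    """Drop low-information features (hypothetical helper).

    expr maps a feature name to its intensities across samples.
    mode="all": drop a feature only if mean < mean_cutoff AND sd < sd_cutoff.
    mode="any": drop a feature if either condition holds.
    """
    kept = {}
    for name, values in expr.items():
        low_mean = mean(values) < mean_cutoff
        low_sd = stdev(values) < sd_cutoff  # sample standard deviation
        drop = (low_mean and low_sd) if mode == "all" else (low_mean or low_sd)
        if not drop:
            kept[name] = values
    return kept

# Toy intensities, invented for illustration only.
toy = {
    "GENE_A": [9.1, 8.7, 10.2, 9.5],  # high mean, high spread
    "GENE_B": [6.1, 6.2, 6.0, 6.1],   # low mean, low spread
    "GENE_C": [6.0, 8.5, 5.2, 7.9],   # low mean, high spread
}
kept_all = sorted(filter_features(toy, mode="all"))
kept_any = sorted(filter_features(toy, mode="any"))
```

Under the "all" reading only GENE_B is removed; under the "any" reading GENE_C is removed as well, which shows why the choice of reading matters for the final feature count.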
Text S2: Details of smoothing and Feature Topology Plots (FTP)
Let $\vec{X}_j^{(m)} = (X_{1j}^{(m)}, \ldots, X_{|I_m|j}^{(m)})$ denote an intensity vector for sample $j$ and the $m$th omics data,
where $i \in I_m$, $j \in J$, $m = 1, \ldots, M$, and $|I_m|$ is the size of the feature set $I_m$ for the $m$th omics data.
Denote by $(u_{i1}^{(m)}, u_{i2}^{(m)})$ the two-dimensional MDS coordinates of feature $i \in I_m$ in the $m$th omics
data. With sample $j$ and omics data $m$ fixed, we fit a generalized additive model using a thin
plate spline penalty on $\vec{X}_j^{(m)}$ and $(u_{i1}^{(m)}, u_{i2}^{(m)})$:
\[
E\left(X_{ij}^{(m)} \mid u_{i1}^{(m)}, u_{i2}^{(m)}\right) = \beta_{j0}^{(m)} + \beta_{j1}^{(m)} s_{j1}^{(m)}\left(u_{i1}^{(m)}, u_{i2}^{(m)}\right) + \beta_{j2}^{(m)} s_{j2}^{(m)}\left(u_{i1}^{(m)}, u_{i2}^{(m)}\right) + \cdots + \beta_{jp_j}^{(m)} s_{jp_j}^{(m)}\left(u_{i1}^{(m)}, u_{i2}^{(m)}\right),
\]
where $i \in I_m$ and $p_j$ is the optimal number of spline bases. We applied penalized thin plate
regression splines for the generalized additive models using the "mgcv" package in R [8]. The
estimated coefficients and splines derived from the penalized approach are used to smooth the
intensity estimates:
\[
\hat{f}_j^{(m)}(x_1, x_2) = \hat{\beta}_{j0}^{(m)} + \hat{\beta}_{j1}^{(m)} \hat{s}_{j1}^{(m)}(x_1, x_2) + \hat{\beta}_{j2}^{(m)} \hat{s}_{j2}^{(m)}(x_1, x_2) + \cdots + \hat{\beta}_{jp_j}^{(m)} \hat{s}_{jp_j}^{(m)}(x_1, x_2)
\]
for $i \in I_m$, $j \in J$, $m = 1, \ldots, M$, and $(x_1, x_2) \in \mathbb{R}^2$. In the model, features with missing values are
allowed. The top panel of Supplementary Figure 1b illustrates feature intensities at scaled MDS
coordinates $U'^{(m)}$, adjusted by the minimum and maximum of the MDS coordinates $u_{i1}^{(m)}$ and
$u_{i2}^{(m)}$, respectively, where $U'^{(m)} = \{u'^{(m)}_{ik}\}_{|I_m| \times 2}$ and
\[
u'^{(m)}_{i1} = \frac{u_{i1}^{(m)} - \min(u_{i1}^{(m)})}{\max(u_{i1}^{(m)}) - \min(u_{i1}^{(m)})}
\quad \text{and} \quad
u'^{(m)}_{i2} = \frac{u_{i2}^{(m)} - \min(u_{i2}^{(m)})}{\max(u_{i2}^{(m)}) - \min(u_{i2}^{(m)})}
\]
for $i \in I_m$, $m = 1, \ldots, M$, and $k = 1$ and $2$. The feature intensities in the middle panel of Supplementary
Figure 1b represent predicted smoothed intensities,
$h_j^{(m)} = \{\hat{f}_j^{(m)}(s/n, t/n)\}_{(n+1) \times (n+1)}$, at lattice points in
$G = \{(s/n, t/n)\}_{(n+1) \times (n+1)}$ for $s = 0, \ldots, n$, $t = 0, \ldots, n$, $i \in I_m$, $j \in J$, $m = 1, \ldots, M$, where $n$ refers
to the grid number (we use default=19). To make sure that the lattice points in $G$ account for the
distribution of MDS coordinates in $U'^{(m)}$, we remove grid points in $G$ whose distances
from any points in $U'^{(m)}$ are greater than 0.2. As a result, the smoothed intensity plot does not
cover the entire two-dimensional space (i.e., the removed region remains white), as shown in the middle and
bottom panels of Supplementary Figure 1b.
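The coordinate scaling and lattice pruning steps can be illustrated with a short Python sketch. It covers only those two steps, not the thin plate spline fit itself (which the authors performed with "mgcv" in R). The function names `minmax_scale` and `pruned_lattice` and the toy MDS coordinates are assumptions, and the sketch reads the pruning rule as "drop a grid point when its nearest scaled feature coordinate is more than 0.2 away", which is one interpretation of the text. It also assumes each MDS axis has distinct minimum and maximum values.

```python
import math

def minmax_scale(coords):
    """Rescale each MDS axis to [0, 1] using that axis's own min and max."""
    scaled_axes = []
    for axis in zip(*coords):
        lo, hi = min(axis), max(axis)
        scaled_axes.append([(v - lo) / (hi - lo) for v in axis])
    return list(zip(*scaled_axes))

def pruned_lattice(scaled, n=19, max_dist=0.2):
    """Build the (n+1) x (n+1) lattice {(s/n, t/n)} and keep only the points
    lying within max_dist of at least one scaled feature coordinate."""
    kept = []
    for s in range(n + 1):
        for t in range(n + 1):
            g = (s / n, t / n)
            nearest = min(math.dist(g, u) for u in scaled)
            if nearest <= max_dist:
                kept.append(g)
    return kept

# Toy MDS coordinates for four features, invented for illustration only.
mds = [(-2.0, 0.5), (-1.0, 1.5), (0.0, 0.0), (2.0, 2.0)]
scaled = minmax_scale(mds)
grid = pruned_lattice(scaled)
```

Lattice points that survive the pruning are the ones at which smoothed intensities would be predicted; the discarded points correspond to the white regions of the plot.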
Text S3: Simulation setting to evaluate iPF
We adopted a simulation scheme introduced by Qiu et al. [2] to evaluate iPF. Simulated data
sets were generated with the "clusterGeneration" package in R (http://www.r-project.org), with an
adjustment to the degree of noise features to mimic the complexity of real data sets. A total of M=5
different omics data sets, each holding |J|=100 samples belonging to five disease subtypes (20
samples each), were simulated. In each simulated data set, numNonNoisy = 200, 400, or 600
features from five pre-defined gene modules were simulated, which lead to the five disease
subtypes. Extra noise features (numNoisy) were sampled from a uniform distribution, with
numNoisy set in proportion to numNonNoisy (e.g., if numNonNoisy = 200, then numNoisy = 200
(= numNonNoisy x 1), 400 (= numNonNoisy x 2), 600 (= numNonNoisy x 3), and
800 (= numNonNoisy x 4)). The parameter "sepVal", which determines the degree of separation
of the five disease subtypes, was set at 0.7.
To evaluate the clustering accuracy of iPF, we integrated the five simulated data sets and
applied eight different clustering methods ("naïve", "spK", "mClust", "var", "pca", "FF", "FFspK",
and "FFmClust") to cluster samples into five subtypes, and compared the results to the true
clustering labels. K-means clustering ("naïve") [3], sparse K-means clustering ("spK") [4], and
model-based clustering ("mClust") [5] are applied to all original features of the five integrated
omics data sets (i.e., clustering based on $\vec{X}_j = (\vec{X}_j^{(1)}, \ldots, \vec{X}_j^{(5)})$ for each sample $j$).
The method labeled "var" sorts features by variance in decreasing order and applies K-means
clustering to only the top 100 features in each omics data set. The method labeled "pca" performs
principal component analysis (PCA), selects the top three PCs, and applies K-means clustering
to them. The methods "FF", "FFspK", and "FFmClust" apply K-means, sparse K-means, and
model-based clustering, respectively, to smoothed feature intensities (Feature Fusion). More
specifically, the smoothed intensities on the grid points are
$h_j^{(m)} = \{\hat{f}_j^{(m)}(s/n, t/n)\}_{(n+1) \times (n+1)}$, as in
Supplementary material B. To assess performance numerically, we calculated the adjusted Rand
index (ARI), which measures the similarity between the inferred clustering labels and the
underlying true subtype labels. The simulations were repeated 100 times, and average ARIs are
presented in Supplementary Figure 7. Supplementary Figure 7 shows that the three Feature
Fusion methods ("FF", "FFspK", and "FFmClust") clearly outperform the other
methods. In particular, the Feature Fusion methods are fairly robust to the effects of noise
features, even as the number of noise features increases. Therefore, the Feature Fusion technique
promotes effective integrative clustering, which is mostly attributed to dimension reduction and
smoothing.
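For readers unfamiliar with the adjusted Rand index used to score the simulations, the following Python sketch computes it from two label vectors using the standard Hubert and Arabie formulation (reference [6]). The function name and the toy labels are assumptions made for illustration; this is not the code used to produce the reported ARIs, and it assumes at least two samples.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same samples."""
    n = len(labels_a)
    # Pairs of samples that are together in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of samples that are together within each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions place all samples in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Toy labels, invented for illustration only.
truth = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]   # same partition, different cluster names
crossed = [0, 1, 0, 1]     # splits every true cluster in half
ari_same = adjusted_rand_index(truth, relabeled)
ari_crossed = adjusted_rand_index(truth, crossed)
```

The index is invariant to how clusters are named, so a relabeled but otherwise identical partition scores 1, while a partition that agrees with the truth less often than chance yields a negative value.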
Text S4: Comprehensive validation scheme for iPF
As shown in Supplementary Figure 2a-d, we first divided all samples (n=319) into two groups: a
training set (n=91, Batch 1) and a testing set (n=228, Batch 2). In the discovery phase, we applied
iPF to the training data and identified three distinct patterns of feature topology plots (FTPs) in
the clinical and transcriptome (mRNA + miRNA) data sets. In the prediction phase, we produced
FTPs of each testing sample (n=228) using the MDS coordinates derived from the training data
sets. To validate whether the training and testing sets share homogeneous variation structures in
2D space, we compared the FTPs and the disease proportions (shown as pie charts) of the training
and testing sets. In Supplementary Figure 2b and 2c, the visual patterns of the FTPs of the training
and testing sets look alike, and similar disease compositions were observed across the nine
sub-clusters. After this visual confirmation, we independently applied iPF to the testing set (the
validation phase). We thus estimated new feature relocations (MDS coordinates) of the testing
data and created de novo FTPs of the testing set. These de novo FTPs are therefore no longer
associated with the FTPs derived from the training set. Interestingly, the FTPs in Supplementary
Figure 3b and 2d appear similar. This result implies that the training and testing data sets share a
homogeneous distance structure across all features. We measured the concordance of the
sub-clusters generated in the prediction phase and the validation phase by means of the
adjusted Rand index (ARI). Using the 228 samples of the testing data set, we obtained ARI=0.76 for
the three clusters of the clinical data set and ARI=0.43 for the three clusters of the transcriptome
data set. Taken together, we conclude that the sub-clusters of the training and testing data sets
adequately represent common clustering patterns, which provides a rationale for a pooled
analysis; we therefore applied iPF to all patients of both batches in Figure 4.
Figure S5: (a) An illustration of integrated omics data sets. (b) A workflow to generate a feature
topology plot (FTP)
Figure S6: Flowchart of the validation scheme for the integrative phenotyping framework for
multiple omics data sets (clinical, gene expression + miRNA)
Figure S7: An example of iPF that utilizes fused multiple data sets at the stage (vi)
Figure S8: Examples of iPF using various combinations of the omics data sets (pooled
analysis)
Figure S9A: The gap statistic and its scree plot used to choose the optimal number of clusters
(clinical and miRNA data).
Figure S9B: The gap statistic and its scree plot used to choose the optimal number of clusters
(mRNA and miRNA data).
Figure S9C: The gap statistic and its scree plot used to choose the optimal number of clusters
(mRNA and clinical data).
Figure S9D: The gap statistic and its scree plot used to choose the optimal number of clusters
(clinical data and combined data of mRNA and miRNA).
Figure S10: The best choice of the number of feature modules
[Figure S11 panels: "# of true signal features: 200", "# of true signal features: 400", "# of true signal features: 600"; x-axis: Proportion of Noise Features (x0 to x4)]
Figure S11: Simulation study shows robust true feature discovery in "Feature Fusion". The
x-axis represents multiplication levels of noise features. The y-axis represents average ARIs
from 100 simulations. Each panel is generated based on simulation scenarios with a different
number of true features (200, 400, and 600, respectively).
Figure S12: Immunomodulating drugs target overexpressed genes in module two
 | Gene expression, Batch 1 | Gene expression, Batch 2 | miRNA expression, Batch 1 | miRNA expression, Batch 2
Platform | Agilent Human GE 4x44K | Agilent Human GE 8x60K | Agilent Human microRNA array V2 | Agilent Human microRNA array V3
# of used samples | 160 | 311 | 91 | 228
# of raw transcripts | 41,000 | 42,405 | 961 | 1,368
# of overlapped transcripts | 15,966 | 15,966 | 937 | 937
# of selected transcripts | 4,258 | 4,258 | 438 | 438
Table S13: The description of mRNA and miRNA lung disease data
Variables | Con | Ord | Bin | Cat
Con | Spearman | -- | -- | --
Ord | Spearman | Spearman | -- | --
Bin | Point Biserial | Rank Biserial | Phi | --
Cat | Point Biserial extension | Rank Biserial extension | Cramer's V | Cramer's V
Table S14: Various correlation types depending on variable attributes (Con: continuous; Ord:
ordinal; Bin: binary; Cat: categorical)
Variable | Total (n=319) | Cluster A (n=76) | Cluster B (n=34) | Cluster C (n=11) | Cluster D (n=25) | Cluster E (n=43) | Cluster F (n=22) | Cluster G (n=18) | Cluster H (n=10) | Cluster I (n=80) | P-value (ANOVA)
Age, yrs | 62.7 | 65.7* | 63.6 | 58.9 | 58 | 55* | 59.8 | 63.2 | 63.7 | 66.1* | 0.00000491
Gender, % female | 44.2 | 39.5 | 41.2 | 36.4 | 64 | 65.1* | 59.1 | 61.1 | 10 | 30* | 0.000467
Body Mass Index, BMI | 28.9 | 28 | 28.4 | 26.6 | 29.5 | 27.5 | 31.3 | 31.9 | 28.9 | 29.8 | 0.0446
FEV1 % predicted | 61.5 | 48* | 46.3* | 67.8 | 60.1 | 64.3 | 66.2 | 78.2* | 67.5 | 73.4* | 1.78E-15
FVC % predicted | 69.5 | 72.4 | 69.8 | 76.7 | 73.7 | 69.7 | 64.8 | 71.4 | 62.2 | 65.8 | 0.0355
FEV1/FVC ratio | 0.89 | 0.653* | 0.656* | 0.858 | 0.8 | 0.93 | 1.03 | 1.1* | 1.09 | 1.12 | 1.63E-37
DLCO | 54.7 | 59.2 | 52.3 | 53.9 | 66 | 57.3 | 47.8 | 62.6 | 51.3 | 47* | 0.000816
Total lung capacity, mean | 5.33 | 6.55* | 6.94* | 6.31 | 5.37 | 4.87 | 3.96* | 4.1* | 4.73 | 4.19* | 4.23E-26
CT % emphysema | 7.18 | 14.4* | 17.4* | 16.2 | 0.745 | 1.88* | 0.105* | 0.899 | 0.728 | 1.01* | 8.09E-19
Lung reticular volume, ml | 283 | 63.8* | 72.6* | 158 | 181 | 198 | 294 | 397 | 536 | 662* | 9.22E-21
Diagnosis, % IPF | 34.8 | 1.32* | 2.94* | 36.4 | 8* | 9.3* | 40.9 | 72.2* | 50 | 90* | 1.96E-37
Diagnosis, % Emphysema | 18.5 | 43.4* | 29.4 | 27.3 | 24 | 14 | 4.55 | 0 | 0 | 0* | 1.19E-10
S15 Table: The demographic summary of clinical features in each sub-cluster
Features | Modules | Genes in module | Genes in DB | Genes in both | Total genes | P-value
hsa-miR-381 | 3 | 553 | 844 | 165 | 3596 | 0.0002
hsa-miR-23b | 2 | 216 | 884 | 32 | 3596 | 0.0004
hsa-miR-338-3p | 1 | 401 | 557 | 43 | 3596 | 0.0043
hsa-miR-17 | 1 | 401 | 877 | 75 | 3596 | 0.0045
hsa-miR-376c | 2 | 216 | 614 | 22 | 3596 | 0.0049
hsa-miR-139-5p | 2 | 216 | 576 | 21 | 3596 | 0.0073
hsa-miR-34c-5p | 2 | 216 | 586 | 48 | 3596 | 0.0175
hsa-miR-20a | 1 | 401 | 893 | 81 | 3596 | 0.0232
hsa-miR-92a | 1 | 401 | 539 | 45 | 3596 | 0.0258
hsa-miR-101 | 1 | 401 | 643 | 56 | 3596 | 0.0319
hsa-miR-487b | 3 | 553 | 101 | 8 | 3596 | 0.0351
hsa-miR-495 | 3 | 553 | 1152 | 198 | 3596 | 0.0422
S16 Table: Target gene enrichment analysis (via Fisher exact test) related to twelve
significant miRNA features, with the number of genes targeted in DB and selected in modules.
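The Fisher exact test on the overlap between a module and a miRNA target list can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' procedure: the function name `fisher_exact_p` is hypothetical, the source does not say whether the test was one-sided or two-sided (so both are offered), and the sketch assumes the "Genes in DB" counts are restricted to the same universe of "Total genes". No claim is made that it reproduces the tabulated P-values.

```python
from math import comb

def fisher_exact_p(total, in_module, in_db, in_both, alternative="two-sided"):
    """Fisher exact test for the overlap of two gene sets (hypothetical helper).

    total: size of the gene universe; in_module: genes in the module;
    in_db: genes targeted in the database; in_both: observed overlap.
    """
    upper = min(in_module, in_db)
    # Unnormalized hypergeometric weights for every possible overlap k,
    # kept as exact integers so comparisons are not affected by rounding.
    weights = [
        comb(in_db, k) * comb(total - in_db, in_module - k)
        for k in range(upper + 1)
    ]
    denom = comb(total, in_module)
    if alternative == "greater":
        # Probability of an overlap at least as large as the observed one.
        tail = sum(weights[in_both:])
    else:
        # Two-sided: sum over overlaps no more probable than the observed one.
        tail = sum(w for w in weights if w <= weights[in_both])
    return tail / denom

# Counts taken from the first row of the table (hsa-miR-381).
p_two_sided = fisher_exact_p(total=3596, in_module=553, in_db=844, in_both=165)
p_greater = fisher_exact_p(3596, 553, 844, 165, alternative="greater")
```

Exact integer arithmetic is used for the hypergeometric weights because the binomial coefficients involved are far too large for floating point; only the final ratio is converted to a float.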
Module | miRNA | Gene | Regression coefficient | P-value for testing association of miRNA with target mRNA | Coefficient of determination
2 | hsa-miR-139-5p | RPS6KA1 | -0.64 | 6.26E-23 | 0.39
2 | hsa-miR-139-5p | DPEP2 | -0.57 | 3.08E-19 | 0.34
2 | hsa-miR-139-5p | ITGAL | -0.57 | 3.10E-18 | 0.32
1 | hsa-miR-17 | SACS | -0.58 | 1.67E-19 | 0.34
1 | hsa-miR-17 | COL14A1 | -0.57 | 1.79E-18 | 0.32
1 | hsa-miR-17 | COL15A1 | -0.60 | 1.53E-17 | 0.31
1 | hsa-miR-17 | TNFRSF21 | -0.57 | 3.00E-17 | 0.30
1 | hsa-miR-338-3p | COL15A1 | -0.70 | 3.03E-33 | 0.52
1 | hsa-miR-338-3p | SMO | -0.64 | 2.18E-32 | 0.51
1 | hsa-miR-338-3p | RPL39L | -0.65 | 1.01E-29 | 0.48
1 | hsa-miR-338-3p | DDIT4L | -0.64 | 2.72E-28 | 0.46
1 | hsa-miR-338-3p | MANF | -0.63 | 3.13E-28 | 0.46
1 | hsa-miR-338-3p | LRRC17 | -0.62 | 1.91E-27 | 0.45
1 | hsa-miR-338-3p | APCDD1 | -0.63 | 9.36E-27 | 0.44
1 | hsa-miR-338-3p | TRAF5 | -0.59 | 2.75E-26 | 0.44
1 | hsa-miR-338-3p | NLGN2 | -0.58 | 3.87E-26 | 0.43
1 | hsa-miR-338-3p | LRIG3 | -0.58 | 3.67E-25 | 0.42
1 | hsa-miR-338-3p | PDK1 | -0.61 | 1.48E-24 | 0.41
1 | hsa-miR-338-3p | GREM2 | -0.56 | 1.31E-22 | 0.39
1 | hsa-miR-338-3p | ZNF436 | -0.57 | 2.28E-22 | 0.38
1 | hsa-miR-338-3p | KIAA0114 | -0.55 | 6.18E-22 | 0.38
1 | hsa-miR-338-3p | TMEM79 | -0.55 | 3.77E-21 | 0.36
1 | hsa-miR-338-3p | BOC | -0.55 | 2.41E-20 | 0.35
1 | hsa-miR-338-3p | LPPR4 | -0.56 | 6.50E-20 | 0.35
1 | hsa-miR-338-3p | SLC2A10 | -0.53 | 7.31E-19 | 0.33
1 | hsa-miR-338-3p | SDC1 | -0.53 | 1.04E-18 | 0.33
1 | hsa-miR-338-3p | BCL11A | -0.52 | 1.74E-18 | 0.32
1 | hsa-miR-338-3p | CTSE | -0.54 | 8.08E-18 | 0.31
1 | hsa-miR-338-3p | RCAN2 | -0.54 | 1.00E-17 | 0.31
1 | hsa-miR-338-3p | AHNAK2 | -0.50 | 1.22E-17 | 0.31
2 | hsa-miR-376c | WARS | -0.60 | 3.25E-19 | 0.34
2 | hsa-miR-376c | ITGAL | -0.60 | 5.09E-18 | 0.32
3 | hsa-miR-381 | ADRB2 | -0.69 | 1.69E-22 | 0.38
3 | hsa-miR-381 | VIPR1 | -0.62 | 2.22E-19 | 0.34
3 | hsa-miR-381 | RNF182 | -0.62 | 2.18E-18 | 0.32
3 | hsa-miR-487b | RNF182 | -0.71 | 7.95E-28 | 0.46
3 | hsa-miR-487b | DOCK4 | -0.70 | 1.75E-27 | 0.45
3 | hsa-miR-487b | TNS1 | -0.58 | 1.42E-17 | 0.31
3 | hsa-miR-495 | DENND3 | -0.74 | 1.81E-28 | 0.46
3 | hsa-miR-495 | ADRB2 | -0.70 | 6.93E-25 | 0.42
3 | hsa-miR-495 | RTKN2 | -0.70 | 1.88E-24 | 0.41
3 | hsa-miR-495 | KIAA0408 | -0.65 | 2.16E-23 | 0.40
3 | hsa-miR-495 | FUT1 | -0.66 | 1.08E-22 | 0.39
3 | hsa-miR-495 | CTNNBIP1 | -0.64 | 1.43E-22 | 0.39
3 | hsa-miR-495 | DOCK4 | -0.64 | 3.20E-22 | 0.38
3 | hsa-miR-495 | CTNND2 | -0.65 | 4.25E-22 | 0.38
S17 Table: Regression analysis on target miRNA features, and coefficient of determination
S18 Table: The top disease or functional annotations associated with genes in module two in
Cluster E patients
*P value is calculated by Fisher's exact test. A P value threshold of ≤ 7.29E-10 was
selected for statistical significance.
†The z-score predicts the direction of change for the function. An absolute z-score of ≥ 2 is
considered significant. A function is increased if the z-score is ≥ 2 and decreased if the z-score
is ≤ -2.
‡Predicted activation state is the predicted direction of change for the function based on the
regulation z-score. Increased indicates that the z-score is ≥ 2 and decreased indicates that the
z-score is ≤ -2.
[S19 Figure: panels A and B; cluster labels: Cluster 1, Cluster 2, Cluster 3]
S19 Figure: Basic consensus clustering using only gene expression data. We performed basic
unsupervised consensus clustering (using the R package "ConsensusClusterPlus"). Here we
identify three consensus clusters that are notably correlated. However, this consensus clustering
does not clearly characterize any cluster as an "intermediate" group, such as the sub-cluster E that
we identified (Figure 4). Therefore, basic consensus clustering hardly identifies a novel
sub-phenotype beyond the traditional disease definition.
References
1. Smyth GK, Speed T (2003) Normalization of cDNA microarray data. Methods 31: 265-273.
2. Qiu WL, Joe H (2006) Generation of random clusters with specified degree of separation.
Journal of Classification 23: 315-334.
3. Hartigan JA, Wong MA (1979) A K-means clustering algorithm. Applied Statistics 28: 100-108.
4. Witten D, Tibshirani R (2010) A Framework for Feature Selection in Clustering (vol 105, pg
713, 2010). Journal of the American Statistical Association 105: 1637-1637.
5. Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density
estimation. Journal of the American Statistical Association 97: 611-631.
6. Hubert L, Arabie P (1985) Comparing partitions. Journal of Classification 2: 193-218.
7. Shen R, Wang S, Mo Q (2013) Sparse integrative clustering of multiple omics data
sets. Annals of Applied Statistics 7: 269-294.
8. Wood SN (2003) Thin plate regression splines. J Roy Stat Soc B 65: 95-114.