Supplementary Information
A Network Biology Approach to Prostate Cancer
Ayla Ergün, Carolyn A. Lawrence, Michael A. Kohanski, Timothy A. Brennan &
James J. Collins
1 The MNI Algorithm
The MNI algorithm operates in two phases to determine significant genetic mediators of a
condition of interest, e.g., a disease. In phase one (network identification phase), a network is
derived from an $N \times M$ training set of microarray expression data, consisting of
measurements of steady-state expression ratios of $N$ transcripts in $M$ experiments. In phase
two (mediator determination phase), the trained regulatory network is used as a filter to
determine the genes affected by a test condition. For this study, the training set consisted of
1144 expression profiles from solid human tumor samples and cancerous cell lines. As test
profiles, we used 14 non-recurrent primary, 9 recurrent primary and 9 distant metastatic
prostate cancer samples from LaTulippe et al (2002). In Sections 1.1-1.4, we present a
summary of the MNI algorithm, which is described in greater detail by di Bernardo et al
(2005).
1.1 Model Structure
To predict the genetic mediators of a disease, the MNI algorithm first infers a model of
regulatory influences in a cell. The model relates changes in gene transcript concentrations to
each other, and it is not constrained by any a priori knowledge. The transcript synthesis rate
is modeled as a function of the influence of all other transcript concentrations and external
influences on the transcript i as follows:
$$\dot{y}_i = u_i \prod_j y_j^{n_{ij}} - d_i y_i, \tag{1}$$
where $y_i$ represents the concentration of transcript $i$, $n_{ij}$ represents the influence of
transcript $j$ on transcript $i$, $d_i$ represents the degradation rate of transcript $i$, and $u_i$
represents the net external influences on the transcription rate of transcript $i$.
Since concentrations and not transcription rates of mRNA are known, a simplifying
assumption is made that transcript concentrations are measured under steady-state conditions.
With this assumption, the model becomes:
$$\dot{y}_i = 0 = u_i \prod_j y_j^{n_{ij}} - d_i y_i. \tag{2}$$
We compute the measurements relative to a baseline, and with the assumption that the
degradation rate is the same for the baseline-level transcripts, we make the following
transformation:
$$\frac{y_i}{y_{ib}} = \frac{u_i}{u_{ib}} \prod_j \left( \frac{y_j}{y_{jb}} \right)^{n_{ij}}. \tag{3}$$
By taking the logarithm of both sides of Equation 3 and by substituting variables and
parameters, the model is reduced to a log-linear system of equations:
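$$\sum_j a_{ij} x_j = -p_i, \tag{4}$$
where
$$a_{ij} = \begin{cases} n_{ij}, & j \neq i \\ n_{ij} - 1, & j = i, \end{cases} \qquad x_j = \log_2\!\left( \frac{y_j}{y_{jb}} \right), \quad \text{and} \quad p_i = \log_2\!\left( \frac{u_i}{u_{ib}} \right).$$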
The model coefficients $a_{ij}$, which constitute the connectivity matrix $A$, represent the
influence of the concentration of transcript $j$ on the synthesis rate of transcript $i$. The
variables $x_j$ are the log-transformed expression-change ratios of each transcript $j$, and
constitute the columns of the training matrix, $X$. The variables $p_i$ are the net external
influences on transcript $i$ and constitute the rows of the matrix $P$. This external perturbations
matrix $P$ is used to determine the transcripts that are most inconsistent with the network,
and are therefore the most likely genetic mediators of a disease.
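As a quick numerical check of the log-linear form, the sketch below builds a hypothetical three-transcript network (the influence values `n` and the perturbation `p` are invented for illustration, not taken from the study), solves Equation 4 for the log ratios, and confirms that the result also satisfies the ratio form of Equation 3.

```python
import numpy as np

# Hypothetical 3-transcript example; every number here is illustrative only.
n = np.array([[0.0, 0.5, 0.0],
              [-0.3, 0.0, 0.2],
              [0.0, 0.4, 0.0]])    # n[i, j]: influence of transcript j on transcript i
p = np.array([1.0, 0.0, 0.0])      # log2 external influence; only transcript 0 is perturbed

# a_ij = n_ij for j != i and n_ij - 1 for j == i, i.e. A = n - I
A = n - np.eye(3)

# Log2 expression ratios x that satisfy sum_j a_ij x_j = -p_i
x = np.linalg.solve(A, -p)

# Ratio form: y_i / y_ib = (u_i / u_ib) * prod_j (y_j / y_jb) ** n_ij
ratio = 2.0 ** x
rhs = 2.0 ** p * np.prod(ratio[np.newaxis, :] ** n, axis=1)

print(np.allclose(ratio, rhs))     # both forms describe the same steady state
print(-A @ x)                      # recovers p from A and x
```

The last line is the operation used later as a filter: given a connectivity matrix and an expression profile, the external influences follow from a single matrix-vector product.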
1.2 Phase I: Network Identification
The first task of the MNI algorithm is to learn the network model coefficients $a_{ij}$. For the
algorithm, the training matrix $X$ is known while the connectivity matrix $A$ and the external
perturbations matrix $P$ are unknown. To estimate the network model $A$, with no prior
knowledge of $P$, the MNI algorithm uses a recursive strategy. The algorithm starts by using
a naïve model of the regulatory structure. Initially, $A$ is specified to account only for
self-degradation: $a_{ij} = -1$ for $i = j$ and $a_{ij} = 0$ otherwise. This begins the first
iteration, where $P$ is estimated directly from $A$ and $X$. An external influence is considered
significant if it satisfies:
$$|\hat{p}_{il}| > \tau \max_{1 \le l \le M} |\hat{p}_{il}|, \tag{5}$$
where $\tau$ is the significance threshold. We chose $\tau = 0.25$ for our study, i.e., an estimate of
the external influence on transcript $i$ is considered significant if its absolute value is greater
than 25% of the maximum absolute value of $\hat{p}_{il}$ over all experiments $l = 1, \ldots, M$. To
finish the iteration, experiments with insignificant perturbations are removed from the training
set, and a new connectivity matrix is determined, using linear regression to solve the following
equation for $a_{ij}$:
$$\sum_j a_{ij} x_j = -p_i. \tag{6}$$
This begins the second iteration, where $p_i$ is re-estimated using the newly calculated
$a_{ij}$ coefficients. This iterative estimation is repeated five times, or until convergence of the
variables.
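The opening step of the first iteration can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the naïve model is $A = -I$ (self-degradation only, so that the first estimate of $P$ equals $X$), reads the significance test of Equation 5 as a comparison of absolute values, and uses a hypothetical function name. The subsequent regression step is not shown.

```python
import numpy as np

def first_iteration_significance(X, threshold=0.25):
    """Estimate P from the naive model and flag significant external influences.

    X is an (N, M) array of log2 expression ratios (N transcripts, M experiments).
    Returns the estimate P_hat and a boolean (N, M) mask of significant entries.
    """
    n_transcripts = X.shape[0]
    A0 = -np.eye(n_transcripts)          # assumed naive model: a_ii = -1, a_ij = 0 otherwise
    P_hat = -A0 @ X                      # from sum_j a_ij x_j = -p_i

    # Per-transcript maximum absolute influence over all M experiments
    max_abs = np.abs(P_hat).max(axis=1, keepdims=True)
    significant = np.abs(P_hat) > threshold * max_abs
    return P_hat, significant


# Toy training matrix: 2 transcripts, 3 experiments (illustrative values)
X_toy = np.array([[2.0, 0.1, -1.0],
                  [0.0, 0.4, 0.05]])
P_hat, mask = first_iteration_significance(X_toy)
print(mask)
```

For the toy matrix, transcript 0 has a maximum absolute influence of 2.0, so only entries above 0.5 in magnitude are flagged; transcript 1 has a maximum of 0.4, so only entries above 0.1 are flagged.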
1.3 Phase II: Determination of Disease Mediators
Once a network has been estimated, transcripts that are significantly perturbed in the test
profile can be identified. A test expression profile, $x_{jc}$, which gives the expression of
transcript $j$ in response to a test condition $c$ for $j = 1, \ldots, N$, is tested against
the network to determine its external influences using the following equation:
$$\sum_j \hat{a}_{ij} x_{jc} = -\hat{p}_{ic}, \tag{7}$$
where $\hat{p}_{ic}$ represents the estimated external influence on each transcript $i$. The
significance of each perturbation is determined by calculating two z-scores based on the
perturbation size and estimation error:
$$z_{ic} = \frac{\hat{p}_{ic}}{\sigma_{ic}} \quad \text{(normal)} \tag{8.1}$$
$$z^{m}_{ic} = z_{ic} \cdot x_{ic} \quad \text{(modified)} \tag{8.2}$$
The modified z-score is designed to boost the likelihood of including genes with significant
changes in the test expression profile. In Equation 8.1, $\sigma_{ic}$ represents the standard
deviation of the perturbation value, which is calculated by applying propagation of error to
Equation 7, yielding:
$$\sigma_{ic}^2 = \sum_j \sum_k \sigma_{ijk}\, x_{jc} x_{kc} + \sum_j \hat{a}_{ij}^2 \operatorname{var}(x_{jc}), \tag{9}$$
where $\sigma_{ijk}$ are the elements of the covariance matrix of the estimated coefficients $\hat{a}_{ij}$.
Transcripts are ranked according to the normal and modified z-scores, and those with the
highest absolute z-scores are taken to be the most significant candidate disease mediators in
each case. The final list is whichever of the two lists (normal or modified) has the higher
mean z-score.
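The scoring step can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' code: the function name and argument layout are hypothetical, the coefficient covariances are supplied as an (N, N, N) array, and the modified score is taken to be the product of the normal z-score and the expression change, which is an assumed reading of Equation 8.2.

```python
import numpy as np

def perturbation_zscores(A_hat, cov_A, x_c, var_x):
    """Score the external influences on each transcript for one test profile.

    A_hat : (N, N) estimated connectivity matrix
    cov_A : (N, N, N) array, cov_A[i, j, k] = covariance of a_ij and a_ik
    x_c   : (N,) log2 expression ratios of the test profile
    var_x : (N,) measurement variance of each entry of x_c
    """
    # Estimated external influences: sum_j a_ij x_jc = -p_ic
    p_hat = -A_hat @ x_c

    # Propagation of error: coefficient uncertainty plus measurement noise
    var_p = np.einsum("ijk,j,k->i", cov_A, x_c, x_c) + (A_hat ** 2) @ var_x
    sigma = np.sqrt(var_p)

    z = p_hat / sigma          # normal z-score
    z_mod = z * x_c            # modified z-score (assumed to be a product)
    return p_hat, z, z_mod


# Toy example: naive model, no coefficient uncertainty (illustrative values)
p_hat, z, z_mod = perturbation_zscores(
    -np.eye(2), np.zeros((2, 2, 2)), np.array([2.0, -1.0]), np.array([0.25, 0.25])
)
ranking = np.argsort(-np.abs(z))   # transcripts ordered by absolute z-score
print(z, z_mod, ranking)
```

Supplying the covariances as a dense three-index array keeps the sketch short but would not scale to thousands of transcripts; in practice the computation would be carried out in the reduced space described in Section 1.4.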
1.4 Singular Value Decomposition
For the training data set, we used a training matrix of size $12600 \times 1144$. Because there
are far more transcripts than experiments, calculating $a_{ij}$ in Equation 6 is an
under-determined problem. In order to find a unique solution, we used a dimensional reduction
strategy based on singular value decomposition (SVD) to reduce the size of the training set.
Using SVD, the training matrix, $X$, is decomposed as:
$$X = U S V^T, \tag{10}$$
where $U$ is an $N \times M$ matrix, $S$ is a diagonal matrix of dimension $M \times M$ containing the
singular values of $X$, and $V$ is an $M \times M$ matrix containing the principal components of the
transcript expression profiles in columns. $Q$ principal components based on the largest
singular values are chosen ($Q < M$). The $Q$ profiles serve as the characteristic expression
profiles for the $N$ transcripts and describe most of the expression variation in the $N$
transcripts. Using the approach specified by Everitt and Dunn (2001), each singular value
was considered significant if its relative variance was greater than or equal to the threshold of
$0.7/n$, where $n = 1144$ is the number of experiments. The relative variance of a singular value
is calculated as the square of that singular value divided by the sum of the squares of all of
the singular values. Based on this approach, $Q$ was set to 62 for this study. Thus, for network
determination in this study, $X$ was redefined as a $62 \times 1144$ matrix of characteristic
(“metagene”) expression profiles. Using the characteristic profiles, $X$ is approximated as
follows:
$$X \approx U_Q S_Q V^T, \tag{11}$$
where $U_Q$ is an $N \times Q$ matrix and contains only the first $Q$ columns of $U$, and $S_Q$ is a
$Q \times M$ diagonal matrix of the largest $Q$ singular values. These matrices are used to
transform $X$ into the metagene space and back into the $N$-dimensional transcript space to
perform the network identification and genetic mediator determination.
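The reduction can be sketched with NumPy's SVD routine. The function name and return layout below are hypothetical, and projecting with $U_Q^T$ is one straightforward way to obtain the $Q \times M$ metagene matrix; the source does not spell out the exact transformation used.

```python
import numpy as np

def reduce_to_metagenes(X, factor=0.7):
    """Project an (N, M) training matrix onto its significant singular vectors.

    A singular value is kept when its relative variance (its square divided by
    the sum of all squared singular values) is at least factor / M, where M is
    the number of experiments.
    """
    n_experiments = X.shape[1]
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # singular values sorted descending
    rel_var = s ** 2 / np.sum(s ** 2)
    Q = int(np.sum(rel_var >= factor / n_experiments))

    U_Q = U[:, :Q]              # first Q columns of U
    X_meta = U_Q.T @ X          # (Q, M) characteristic ("metagene") profiles
    X_approx = U_Q @ X_meta     # rank-Q approximation back in transcript space
    return U_Q, X_meta, X_approx, Q


# Synthetic rank-2 matrix standing in for real expression data
rng = np.random.default_rng(0)
U0, _ = np.linalg.qr(rng.standard_normal((50, 2)))
V0, _ = np.linalg.qr(rng.standard_normal((10, 2)))
X_demo = U0 @ np.diag([10.0, 5.0]) @ V0.T
U_Q, X_meta, X_approx, Q = reduce_to_metagenes(X_demo)
print(Q, X_meta.shape)
```

Because `U_Q.T @ X` equals the first $Q$ rows of $S V^T$, multiplying by `U_Q` again reproduces the approximation in Equation 11.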
To improve the specificity of the algorithm, a “tournament” approach was adopted, in which
the MNI algorithm is run iteratively on progressively smaller subsets of the
transcripts. For this study, three tournaments were conducted, and after each tournament, we
kept the top third of the transcripts, i.e., those with the highest z-scores. A final tournament
was conducted by applying MNI to the 200 transcripts with the highest z-scores. This
procedure was repeated for each of the 14 non-recurrent primary, 9 recurrent primary and 9 distant
metastatic prostate cancer samples from LaTulippe et al (2002). We set to zero the z-scores
of transcripts that were not within the list of the top 200 transcripts identified as mediators
for a given sample. To identify a characteristic list of genes within each group (i.e.,
non-recurrent primary, recurrent primary and metastatic prostate cancer), the normal or modified
z-scores (chosen by the algorithm) were averaged and ranked across samples and across
transcripts for the corresponding genes.
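The tournament schedule for a single sample can be sketched as below. Here `score_fn` is a hypothetical stand-in for a full MNI run restricted to a subset of transcripts, and ranking by absolute z-score is an assumption carried over from Section 1.3; the names and defaults are illustrative only.

```python
import numpy as np

def tournament(score_fn, n_transcripts, rounds=3, keep_divisor=3, final_size=200):
    """Run successive rounds on shrinking transcript subsets.

    score_fn(indices) must return one z-score per transcript index, computed
    using only that subset (a placeholder for an MNI run in this sketch).
    Returns a length-n_transcripts array of final z-scores, with zeros for
    every transcript that did not reach the final round.
    """
    idx = np.arange(n_transcripts)
    for r in range(rounds):
        order = np.argsort(-np.abs(score_fn(idx)))     # best-scoring first
        n_keep = max(1, len(idx) // keep_divisor)      # keep the top third
        if r == rounds - 1:
            # The final round uses the top `final_size` transcripts, which
            # are a subset of the top third from this round.
            n_keep = min(n_keep, final_size)
        idx = idx[order[:n_keep]]

    z = np.zeros(n_transcripts)
    z[idx] = score_fn(idx)                             # final tournament
    return z


# Toy scorer: each transcript's score is simply its index
z_final = tournament(lambda idx: idx.astype(float), 8100)
print(np.count_nonzero(z_final))
```

With 12,600 transcripts this schedule would retain 4200, then 1400, then the top 200 for the final round. Per-group gene lists would then be obtained by averaging the returned arrays over the samples in each group.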
2 Supplementary Results
Several of the genes identified in the metastatic prostate cancer group, in addition to those
associated with the AR signaling pathway, have been identified as key genetic mediators of
prostate cancer (Supplementary Table 1). For example, AMACR, α-methylacyl-CoA
racemase, is an enzyme that is consistently upregulated in metastatic prostate cancer (Luo et
al, 2002; Rubin et al, 2002; Rubin et al, 2005). CDH17 is a member of the cadherin
superfamily, whose members mediate cell-cell adhesion and have been implicated in prostate
cancer metastasis and invasion (Tomita, 2000). VDR, the vitamin D receptor, is a member of the
steroid receptor family, and there are data indicating that the hormonal form of vitamin D
promotes the differentiation and inhibits the proliferation, invasiveness, and metastasis of
human prostatic cancer cells (Feldman et al, 2000). The biological actions of vitamin D are
primarily mediated by the vitamin D receptor, and gene polymorphisms in VDR have been
linked to a heightened risk of advanced prostate cancer (Stewart and Weigel, 2004; John et
al, 2005).
Several significant prostate cancer genes – 15-LOX-2, PLA2G2A, NPY, CRISP3, GDF15 –
were identified by MNI as potential genetic mediators for the non-recurrent primary prostate
cancer group alone (Supplementary Table 1). Expression of 15-LOX-2 (also known as
ALOX15B), arachidonate 15-lipoxygenase type 2, is higher in healthy prostates than in
prostate tumors, and appears to suppress cancer development (Bhatia et al, 2003). The
expression of PLA2G2A, phospholipase A2, is elevated in neoplastic prostatic tissue and
dysregulation of PLA2G2A may play a role in prostatic carcinogenesis (Jiang et al, 2002).
NPY, neuropeptide Y precursor, regulates the growth of prostate cancer cells (Magni and Motta,
2001). CRISP3, cysteine-rich secretory glycoprotein 3, a member of the cysteine-rich
secretory protein family, has been identified as a potential prostate biomarker (Kosari et al,
2002), and GDF15, growth differentiation factor 15 precursor, is associated with early
prostate carcinogenesis (Cheung et al, 2004).
References
Bhatia B, Maldonado CJ, Tang S, Chandra D, Klein RD, Chopra D, Shappell SB, Yang P,
Newman RA & Tang DG (2003) Subcellular localization and tumor-suppressive functions of
15-lipoxygenase 2 (15-LOX2) and its splice variants. J. Biol. Chem. 278: 25091-25100
Cheung PK, Woolcock B, Adomat H, Sutcliffe M, Bainbridge TC, Jones EC, Webber D,
Kinahan T, Sadar M, Gleave ME & Vielkind J (2004) Protein profiling of microdissected
prostate tissue links growth differentiation factor 15 to prostate carcinogenesis. Cancer Res.
64: 5929-5933
di Bernardo D, Thompson MJ, Gardner TS, Chobot SE, Eastwood EL, Wojtovich AP, Elliott
SJ, Schaus SE & Collins JJ (2005) Chemogenomic profiling on a genome-wide scale using
reverse-engineered gene networks. Nat. Biotechnol. 23: 377-383
Everitt BS & Dunn G (2001) Applied Multivariate Data Analysis. Arnold, London
Feldman D, Zhao XY & Krishnan AV (2000) Vitamin D and prostate cancer. Endocrinology
141: 5-9
Jiang J, Neubauer BL, Graff JR, Chedid M, Thomas JE, Roehm NW, Zhang S, Eckert GJ,
Koch MO, Eble JN & Cheng L (2002) Expression of group IIA secretory phospholipase A2
is elevated in prostatic intraepithelial neoplasia and adenocarcinoma. Am. J. Pathol. 160:
667-671
John EM, Schwartz GG, Koo J, Van Den Berg D & Ingles SA (2005) Sun exposure, vitamin
D receptor gene polymorphisms, and risk of advanced prostate cancer. Cancer Res. 65:
5470-5479
Kosari F, Asmann YW, Cheville JC & Vasmatzis G (2002) Cysteine-rich secretory protein-3:
a potential biomarker for prostate cancer. Cancer Epidemiol. Biomarkers Prev. 11:
1419-1426
LaTulippe E, Satagopan J, Smith A, Scher H, Scardino P, Reuter V & Gerald WL (2002)
Comprehensive gene expression analysis of prostate cancer reveals distinct transcriptional
programs associated with metastatic disease. Cancer Res. 62: 4499-4506
Luo J, Zha S, Gage WR, Dunn TA, Hicks JL, Bennett CJ, Ewing CM, Platz EA,
Ferdinandusse S, Wanders RJ, Trent JM, Isaacs WB & De Marzo AM (2002)
Alpha-methylacyl-CoA racemase: a new molecular marker for prostate cancer. Cancer Res. 62:
2220-2226
Magni P & Motta M (2001) Expression of neuropeptide Y receptors in human prostate
cancer cells. Ann. Oncol. 12: 27-29
Rubin MA, Bismar TA, Andren O, Mucci L, Kim R, Shen R, Ghosh D, Wei JT, Chinnaiyan
AM, Adami HO, Kantoff PW & Johansson JE (2005) Decreased α-methylacyl CoA
racemase expression in localized prostate cancer is associated with an increased rate of
biochemical recurrence and cancer-specific death. Cancer Epidemiol. Biomarkers Prev. 14:
1424-1432
Rubin MA, Zhou M, Dhanasekaran SM, Varambally S, Barrette TR, Sanda MG, Pienta KJ,
Ghosh D & Chinnaiyan AM (2002) Alpha-methylacyl coenzyme A racemase as a tissue
biomarker for prostate cancer. JAMA. 287: 1662-1670
Stewart LV & Weigel NL (2004) Vitamin D and prostate cancer. Exp. Biol. Med. 229: 277-284
Tomita K (2000) Cadherin switching in human prostate cancer progression. Cancer Res. 60:
3650-3654