A HYPERGRAPH-BASED LEARNING ALGORITHM FOR

A HYPERGRAPH-BASED LEARNING ALGORITHM FOR CLASSIFYING ARRAYCGH
DATA WITH SPATIAL PRIOR
Ze Tian, TaeHyun Hwang and Rui Kuang
Department of Computer Science and Engineering, University of Minnesota, Twin Cities,
200 Union Street SE, Minneapolis, MN 55455, USA,
Correspondence: [email protected]
ABSTRACT
Array-based comparative genomic hybridization (arrayCGH) has been used to detect DNA copy number variations at genome scale for molecular diagnosis and prognosis of cancer. A special property of arrayCGH data is that,
among the spot-intensity variables in the arrayCGH data,
there are spatial relations introduced by the layout of the
probes along the chromosomes. Standard classification algorithms are not capable of capturing the spatial relations
for accurate cancer classification or biomarker identification
from the arrayCGH data. We introduce a hypergraph based
learning algorithm to classify arrayCGH data with spatial
priors modeled as correlations among variables for cancer
classification and biomarker identification. In the experiments, we show that, by incorporating the spatial relations
among the spots as prior, our algorithm is more accurate
than other baseline algorithms on a bladder cancer arrayCGH data. Furthermore, some discriminative regions identified by our algorithm contain genomic elements that are
cancer-relavent.
1. INTRODUCTION
DNA copy-number variations (CNVs) are the events of amplification or deletion of a large segment of DNA on chromosomes. CNVs play an important role in tumorigenesis. Previous studies have shown a close relationship between the CNV patterns and cancer cell lineages [1, 2]. It is
believed that characterization of these DNA copy-number
changes is beneficial for both the basic understanding of
cancer and its diagnosis [3]. With the advent of arrayCGH
technology, high-resolution arrayCGH datasets were generated in many studies of cancers such as brain, prostate,
colon, pancreatic and lung cancers [4, 5, 6, 7]. The arrayCGH data was used to discriminate healthy patients from
cancer patients and classify patients of different cancer subtypes. Thus, arrayCGH data is considered as a new source
of biomarkers that provide important information of candidate cancer loci for the classification of patients and discovery of molecular mechanisms of cancers.
Previous studies [8] and [9] have used DNA copy numbers for cancer discriminant analysis. These methods ignore the spatial correlation among sampling spots in arrayCGH data, and thus suffer from two drawbacks: the accuracy is low and the classifier lacks interpretability. Recently
[10] proposed a SVM variant to take the spatial information into consideration for classification of CGH/arrayCGH
data. [10] proposed a supervised classification method which
is a variant of L1 -SVM. By incorporating the biological
specificities of DNA copy number variations along the genome
as prior knowledge, they introduced new constraints to L1 SVM, which leads to a new algorithm called fused SVM.
They demonstrated that the new algorithm improved both
the classification accuracy and the identification of discriminative regions of the genome on two cancer datasets.
In this paper, we introduce a hypergraph-based iterative learning algorithm called HyperPrior to integrate CNVs
with more general spatial prior information for robust cancer outcome prediction and biomarker identification. HyperPrior using spatial relations as a priori information is a
sibling of the method proposed by us in [11] for integrating gene expressions and protein-protein interactions. HyperPrior algorithm formulates an optimization problem as
learning labels and hyperedge weights together with the assignment of edge weights constrained by the neighbor relation between spot regions on chromosomes. The neighbor
relation is based on the fact that adjacent sampling spots
are more likely to be involved in the same DNA insertion
or deletion event. These properties of HyperPrior algorithm
promise to improve prediction accuracy and provide more
robust identification of discriminative spots of CNVs. Furthermore, the resulted weights on the spots can be used to
discover highly weighted DNA regions on chromosomes,
which might also suggest important gene amplification and
deletion events related to the genetic mechanism of cancer.
2. HYPERGRAPH AND SPATIAL PRIOR
A hypergraph is a special graph which contains hyperedges.
In a simple graph, each edge connects a pair of vertices,
Spot 1
Sample spot 1 spot 2 spot 3
S1 (+1) gain basal loss
S2 (+1) gain
gain basal
S3 (-1)
gain
gain basal
S4 (-1)
loss
loss
gain
S5 (0)
basal loss
loss
S6 (0)
loss basal loss
CNV profiles
Spot 2
…
…
…
…
…
…
…
−
pot 2
−
S3
Spot 3
Spot 1
Spot 2
Spot 3
Spot 1
Spot 2
Spot 3
S4
+
3
S2
Spot 2
Spot 1
+
S1
Spot 3
S5
Spatial prior
S6
Hypergraph built from CNV
profiles
Fig. 1. Learning model of HyperPrior
but in a hypergraph each edge can connect arbitrary number of vertices in the graph. Let V = {v1 , v2 , . . . , v|V | }
be a set of vertices and E = {e1 , e2 , . . . , e|E| } be a set
of edges defined on V : for any hyperedge e ∈ E, e =
(e) (e)
(e)
(e) (e)
(e)
{v1 , v2 , . . . , v|e| }, where {v1 , v2 , . . . , v|e| } is a subset of V . A hyperedge e and a vertex v are called incident
if v ∈ e. A non-negative real number (a weight) can be
assigned to each hyperedge by a function w (w can also
be defined as a vector variable and we will use both notations interchangeably). The vertex set V , hyperedge set
E and the weight function w fully defines a weighted hypergraph denoted by G(V, E, w). The incidence matrix H
for hypergraph G(V, E, w) is a |V | × |E| matrix with elements defined as h(v, e) = 1 (or a real positive value if H
is weighted) when v ∈ e and 0 P
otherwise. The degree of
a vertex v is defined as d(v) = e∈E h(v, e)w(e), which
is the (weighted) sum of the weights of the hyperedges incident with
P v. The degree of a hyperedge e is defined as
d(e) = v∈E h(v, e), which is the number of vertices incident with e. Fig 1 shows how a hypergraph is built from
an arrayCGH dataset. In arrayCGH data, each spot is assigned a log-ratio denoting how the corresponding DNA
copy number varies compared to the normal level. The
sign of the value represent either a ‘gain’(amplification) or
a ‘loss’ (deletion) of the DNA segment in the corresponding regions on the chromosomes. To represent both DNA
amplification event and deletion event at each spot, the logratio values of each spot are split into an amplification group
and a deletion group at the spot. Hypergraph is a natural model to represent this relation, i.e. we use two kinds
of hyperedges to differentiate the amplification and deletion
states of CNVs. For example, in Fig 1 we use S1 to S6 to
represent individual samples in the arracyCGH data. From
the arrayCGH profile on the left of the figure, sample S1,
S2 and S3 have ‘gain’ state in spot1 and sample S4 and S6
have ‘loss’ state in spot1. Thus, sample S1, S2 and S3 are
covered by an amplification hyperedge denoted by a solidedge ellipse and sample S4 and S6 are covered by a deletion
hyperedge denoted by a dash-edge ellipse in the figure. The
‘basal’ state is regarded as no event at the spot.
Let G(V, E, w) be a weighted hypergraph to model arrayCGH data: each patient sample is denoted by a vertex
v ∈ V and each hyperedge denotes one of the two CNV
states (gain/loss) of a spot. The incidence matrix H between V and E are defined by the CNV log-ratio values on
the samples. Suppose that the log-ratio value of a spot i of
sample u is L(u, i). If L(u, i) > 0 and the state is ‘gain’,
Hu,ig = L(u, i) and Hu,il = 0 where ig and il are the
indexes of the gain and loss states of spot i respectively; if
L(u, i) < 0 and the state is ‘loss’ in this spot, Hu,ig = 0 and
Hu,il = −L(u, i). Note, in this case H is a weighted incidence matrix. Given a hypergraph G(V, E, w), we define
a function y to assign initial labels to V in the hypergraph.
If a vertex v is in the positive patient group, y(v) = +1;
if it is in the negative patient group, y(v) = −1; if v is a
test sample, y(v) = 0. For normalization, in experiments
we set y(v) = 1/n1 for positive vertex and y(v) = −1/n2
for negative vertex, where n1 is the number of positive samples in training set and n2 is the number of negative samples in training set. To build a predictive model and identify
discriminative CNVs, the HyperPrior algorithm will learn a
weighting of all the hyperedges in the hypergraph built from
arrayCGH data. To avoid overfitting, we introduce a prior
on the weights to be learned by the HyperPrior algorithm.
In Fig 1, the amplification (deletion) states of adjacent spots
are connected in a prior graph as constraints on the weights
in the learning regularization framework of HyperPrior. For
example, amplification state of spot 1 is connected to that of
spot 2, and the one of spot 2 is connected to that of spot 3,
etc.
3. HYPERGRAPH-BASED LEARNING
For cancer outcome prediction, our goal is to find the correct labels for the unlabeled vertices of the test samples in
the hypergraph. Let f be the objective function (vector) of
labels to be learned. Intuitively, there are two criteria for
learning optimal f : 1) we want to assign the same label
to vertices that share many incidental hyperedges in common; 2) assignment of the labels should be similar to the
initial labeling y. For criteria 1), we define the following
cost function,
Ω(f, w) =
f (v) 2
1 X w(e) X f (u)
(p
−p
)
2
d(e) u,v∈e
d(u)
d(v)
e∈E
If the predicted labels on the vertices are consistent with the
incidences with the hyperedges, the value of Ω(f ) should
be minimized. For criteria 2), we directly measure the 2norm distance between the vectors of the predicted and the
original labels as follows,
X
||f − y||2 =
(f (u) − y(u))2
u∈V
To introduce the spatial relations of spots as prior knowledge into the hypergraph-based learning, we assume that
adjacent regions on the chromosomes should receive similar
weights on their associated hyperedges with the same states.
We define a binary indicator δi,j to capture the adjacency relations between a pair of hyperedge ei and ej . The indicator
δi,j = 1 if the two spots associated with ei and ej are adjacent on the chromosome and both ei and ej are ‘gain’ or
‘loss’ state hyperedges; otherwise 0. To assign weights to
hyperedges consistent with such prior knowledge, we define
the following cost function over the hyperedge weights,
Ψ(w) =
|E|
w(ej ) 2
w(ei )
1 X
−p
δi,j ( p
) ,
2 i,j=1
σ(ej )
σ(ei )
P|E|
where σ(ei ) = j=1 δi,j , which is the number of hyperedges adjacent to the hyperedge ei . Minimizing Ψ(w) ensures that hyperedges with the same state associated with
adjacent regions on chromosomes will get similarly weighted.
After the prior knowledge is introduced, our task is to
minimize the sum of the three cost terms, which is
minimize Ω(f, w) + µ||f − y||2 + ρΨ(w)
f,w
(1)
subject to
w(e)
for ∀e ∈ E
P ≥0
e∈E h(v, e)w(e) = d(v) for ∀v ∈ V.
where µ and
P ρ are positive real numbers. The intuition
of adding e∈E h(v, e)w(e) = d(v) as another constraint
is to maintain the hypergraph structure. This optimization
problem can be solved efficiently with an iterative algorithm
introduced by us in [11].
4. EXPERIMENTS
We tested HyperPrior algorithm on an arrayCGH dataset
published in [12]. This dataset contains arrayCGH profiles
of 57 bladder tumor samples. Following the data preprocess
procedures in [10], we removed the probes corresponding to
sexual chromosomes and then construct two types of tumor
classification problems: by grade (12 tumors of Grade 1 vs.
45 tumors of higher grades) or by stage (16 tumors of Stage
Ta vs. 32 tumors of Stage T2+). As a preprocessing step,
we performed a k-means clustering (k = 3) on the log ratio values of all spots and use the resultant three clusters to
define the ‘gain’, ‘loss’, ‘basal’ states. This preprocessing
step removes some noise in the data and the clustering result is very stable on the one-dimentional data. The standard
SVM with linear and RBF kernels, the original hypergraph
algorithm [13] and fused SVM [10] were used as baseline
algorithms. To assess the classification performance of all
methods, we performed a cross-validation by a leave-oneout (LOO) procedure for the two classification problems.
Fig. 3. Enriched biological functions by the 544 genes at
highly weighted chromosome regions at the significance
level p-value < 0.001. The enriched functions are sorted by
p-values calculated using the right-tailed Fisher Exact Test.
Table 1 shows the misclassification rates by the best parameters of all the methods. In both classification problems,
HyperPrior algorithm achieved the best classification accuracy, which suggests that the additional cost term introduced
from the spatial prior does help sample classification.
The weights assigned by HyperPrior algorithm with optimal parameters for grade classification are visualized in
Fig 2. The weights of hyperedges for ‘DNA amplification’ and ‘DNA deletion’ are plotted separately. It is evident in the plots that only scarce chromosomal regions are
highly weighted by HyperPrior algorithm. We analyzed
the biological functions of the genes located in the highly
weighted chromosome regions by Gene Ontology (GO) annotations and pathway analysis with Ingenuity. We investigated whether the genes involve over-represented GO categories and biological pathways that are related with bladder
cancer. We selected highly weighted chromosome regions
having weights higher than 20 by HyperPrior for amplification states and deletion states separately, and find genes
in these selected chromosome regions. These filtering process yields 169 genes with ‘amplification state’ and 375
genes with ‘deletion state’. With these genes together as input, Ingenuity identifies 12 enriched functions scoring a pvalue less than 1.0e-3. The enriched functions include posttranslation modification, antigen presentation, and cellular
movement shows strong consistency with those identified
by [14, 15, 16]. One of the interesting genes in the highly
weighted deletion chromosome regions is tumor protein p53
inducible protein 3 (TP53IP3) at chromosome 2. TP53IP3
is a transcriptional target of tumor suppressor protein p53
(TP53) and plays a role in TP53-mediated apoptosis. Recent
study showed that there is an association of TP53IP3 promoter polymorphism with higher-grade bladder cancer [17].
Low frequency of alternations at TP53 responsive TP53IP3
gene polymorphism is also known for its association with
lung carcinomas and breast cancer risk [18].
LOO errors
by grade
by stage
SVM (linear)
0.158
0.188
SVM (RBF)
0.158
0.188
L1-SVM
0.211
0.271
fused SVM
0.123
0.146
Hypergraph
0.193
0.188
HyperPrior
0.105
0.104
Table 1. This table shows the misclassification rates in the LOO cross-validation on the bladder cancer dataset with the grade
labeling and the stage labeling. For SVMs with linear kernel and RBF kernel, parameters C = {10−5 , 10−4 , . . . , 104 , 105 }
and σ = {10−5 , 10−4 , . . . , 104 , 105 } were tested. For hypergraph (and HyperPrior) algorithm, parameters α =
{0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99}, and ρ = {10−3 , 10−2 , . . . , 102 , 103 } (for HyperPrior only) were tested.
80
80
1
2
3
4
5
6
7
8
9
10
11
12 13 14 15 16 17 1819 20 21
22
1
40
20
0
2
3
4
5
6
7
8
9
10
11
12 13 14 15 16 17 1819 20 21
22
60
weights
weights
60
40
20
0
500
1000
spot locations
1500
2000
0
0
500
1000
spot locations
1500
2000
weights of DNA amplifications
weights of DNA deletions
Fig. 2. The figure plots separately the weights of the hyperedges of ‘amplification state’ and ‘deletion state’, assigned by
the HyperPrior algorithm with best choice of α and ρ parameters for the grade classification. The spots are ordered by their
locations on chromosomes and the corresponding weights are plotted in blue curves. Red lines represent the chromosome
separations.
5. CONCLUSIONS
In this article, we introduced a hypergraph-based method
for sample classification and biomarker selection in arrayCGH data. In the experiments, our algorithm performs well
in classification and identified interesting gene sets that are
possibly cancer-relevant. We plan to investigate further the
highly weighted chromosome regions and the associated genes
for new discoveries.
6. REFERENCES
[1] K. Jong, E. Marchiori, et al., “Cross-platform array comparative genomic hybridization meta-analysis separates hematopoietic and mesenchymal from epithelial tumors,” Nature Genetics, vol. 26, pp.
1499–1506, 2007.
[2] S. Myllykangas, T. Böhling, and S. Knuutila, “Specificity, selection
and significance of gene amplifications in cancer,” Semin Cancer
Biol., vol. 17, no. 1, pp. 42–55, 2007.
[3] J. R. Pollack, C. M. Perou, et al., “Genome-wide analysis of dna
copy-number changes using cdna microarrays,” Nature Genetics, vol.
23, pp. 41–46, 1999.
[4] J. Chaudhary and M. Schmidt, “The impact of genomic alterations
on the transcriptome: a prostate cancer cell line case study,” Chromosome Research, vol. 14, no. 5, pp. 567–586, 2006.
[5] F. Liu, P. J. Park, et al., “A genome-wide screen reveals functional
gene clusters in the cancer genome and identifies epha2 as a mitogen
in glioblastoma,” Cancer Res., vol. 66, no. 22, pp. 10815–10823,
2006.
[6] H. S. Phillips, S. Kharbanda, et al., “Molecular subclasses of highgrade glioma predict prognosis, delineate a pattern of disease progression, and resemble stages in neurogenesis,” Cancer Cell., vol. 9,
no. 3, pp. 157–173, 2006.
[7] J. C. M Pole, C. Courtay-Cahen, et al., “High-resolution analysis
of chromosome rearrangements on 8p in breast, colon and pancreatic cancer reveals a complex pattern of loss, gain and translocation,”
Oncogene, vol. 25, pp. 5693–5706, 2006.
[8] C. Jones, E. Ford, et al., “Molecular cytogenetic identification of
subgroups of grade iii invasive ductal breast carcinomas with different clinical outcomes,” Clin Cancer Res., vol. 10, pp. 5988–5997,
2004.
[9] S. F. Chin, Y. Wang, et al., “Using array-comparative genomic hybridization to define molecular portraits of primary breast cancers,”
Oncogene., vol. 26, no. 13, pp. 1959–1970, 2007.
[10] F. Rapaport, E. Barillot, and J. P. Vert, “Classification of arraycgh
data using fused svm,” Bioinformatics, vol. 24, pp. i375–i382, 2008.
[11] T. Hwang, Z. Tian, J. P. Kocher, and R. Kuang, “Learning on
weighted hypergraphs to integrate protein interactions and gene expressions for cancer outcome prediction,” in IEEE International
Conference on Data Mining, 2008.
[12] N. Stransky, C. Vallot, et al., “Regional copy numberindependent
deregulation of transcription in cancer,” Nature Genetics, vol. 38,
pp. 1386–1396, 2006.
[13] D. Zhou, J. Huang, and B. Schölkopf, “Learning with hypergraphs:
Clustering, classification, and embedding,” in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds., Cambridge, MA, 2007, pp. 1601–1608, MIT Press.
[14] M. R. Saban, H. L. Hellmich, et al., “Repeated bcg treatment of
mouse bladder selectively stimulates small gtpases and hla antigens
and inhibits single-spanning uroplakins,” BMC Cancer, vol. 7, no. 1,
pp. 204, 2007.
[15] P. A. Konstantinopoulos, M. V. Karamouzis, and A. G. Papavassiliou,
“Post-translational modifications and regulation of the ras superfamily of gtpases as anticancer targets,” Nat Rev Drug Discov, vol. 6, no.
7, pp. 541–555, 2007.
[16] S. C. Smith, B. Nicholson, et al., “Profiling bladder cancer organ site-specific metastasis identifies lamc2 as a novel biomarker of
hematogenous dissemination,” Am J Pathol, vol. 174, no. 2, pp. 371–
379, 2009.
[17] M. Ito, H. Nishiyama, et al., “Association of the pig3 promoter polymorphism with invasive bladder cancer in a japanese population,”
Jpn. J. Clin. Oncol., vol. 36, pp. 116–120, 2006.
[18] V. G. Gorgoulis, T. Liloglou, et al., “Absence of association with
cancer risk and low frequency of alterations at a p53 responsive pig3
gene polymorphism in breast and lung carcinomas,” Mutat. Res., vol.
556, pp. 143–150, 2004.