A HYPERGRAPH-BASED LEARNING ALGORITHM FOR CLASSIFYING ARRAYCGH DATA WITH SPATIAL PRIOR Ze Tian, TaeHyun Hwang and Rui Kuang Department of Computer Science and Engineering, University of Minnesota, Twin Cities, 200 Union Street SE, Minneapolis, MN 55455, USA, Correspondence: [email protected] ABSTRACT Array-based comparative genomic hybridization (arrayCGH) has been used to detect DNA copy number variations at genome scale for molecular diagnosis and prognosis of cancer. A special property of arrayCGH data is that, among the spot-intensity variables in the arrayCGH data, there are spatial relations introduced by the layout of the probes along the chromosomes. Standard classification algorithms are not capable of capturing the spatial relations for accurate cancer classification or biomarker identification from the arrayCGH data. We introduce a hypergraph based learning algorithm to classify arrayCGH data with spatial priors modeled as correlations among variables for cancer classification and biomarker identification. In the experiments, we show that, by incorporating the spatial relations among the spots as prior, our algorithm is more accurate than other baseline algorithms on a bladder cancer arrayCGH data. Furthermore, some discriminative regions identified by our algorithm contain genomic elements that are cancer-relavent. 1. INTRODUCTION DNA copy-number variations (CNVs) are the events of amplification or deletion of a large segment of DNA on chromosomes. CNVs play an important role in tumorigenesis. Previous studies have shown a close relationship between the CNV patterns and cancer cell lineages [1, 2]. It is believed that characterization of these DNA copy-number changes is beneficial for both the basic understanding of cancer and its diagnosis [3]. With the advent of arrayCGH technology, high-resolution arrayCGH datasets were generated in many studies of cancers such as brain, prostate, colon, pancreatic and lung cancers [4, 5, 6, 7]. The arrayCGH data was used to discriminate healthy patients from cancer patients and classify patients of different cancer subtypes. Thus, arrayCGH data is considered as a new source of biomarkers that provide important information of candidate cancer loci for the classification of patients and discovery of molecular mechanisms of cancers. Previous studies [8] and [9] have used DNA copy numbers for cancer discriminant analysis. These methods ignore the spatial correlation among sampling spots in arrayCGH data, and thus suffer from two drawbacks: the accuracy is low and the classifier lacks interpretability. Recently [10] proposed a SVM variant to take the spatial information into consideration for classification of CGH/arrayCGH data. [10] proposed a supervised classification method which is a variant of L1 -SVM. By incorporating the biological specificities of DNA copy number variations along the genome as prior knowledge, they introduced new constraints to L1 SVM, which leads to a new algorithm called fused SVM. They demonstrated that the new algorithm improved both the classification accuracy and the identification of discriminative regions of the genome on two cancer datasets. In this paper, we introduce a hypergraph-based iterative learning algorithm called HyperPrior to integrate CNVs with more general spatial prior information for robust cancer outcome prediction and biomarker identification. HyperPrior using spatial relations as a priori information is a sibling of the method proposed by us in [11] for integrating gene expressions and protein-protein interactions. HyperPrior algorithm formulates an optimization problem as learning labels and hyperedge weights together with the assignment of edge weights constrained by the neighbor relation between spot regions on chromosomes. The neighbor relation is based on the fact that adjacent sampling spots are more likely to be involved in the same DNA insertion or deletion event. These properties of HyperPrior algorithm promise to improve prediction accuracy and provide more robust identification of discriminative spots of CNVs. Furthermore, the resulted weights on the spots can be used to discover highly weighted DNA regions on chromosomes, which might also suggest important gene amplification and deletion events related to the genetic mechanism of cancer. 2. HYPERGRAPH AND SPATIAL PRIOR A hypergraph is a special graph which contains hyperedges. In a simple graph, each edge connects a pair of vertices, Spot 1 Sample spot 1 spot 2 spot 3 S1 (+1) gain basal loss S2 (+1) gain gain basal S3 (-1) gain gain basal S4 (-1) loss loss gain S5 (0) basal loss loss S6 (0) loss basal loss CNV profiles Spot 2 … … … … … … … − pot 2 − S3 Spot 3 Spot 1 Spot 2 Spot 3 Spot 1 Spot 2 Spot 3 S4 + 3 S2 Spot 2 Spot 1 + S1 Spot 3 S5 Spatial prior S6 Hypergraph built from CNV profiles Fig. 1. Learning model of HyperPrior but in a hypergraph each edge can connect arbitrary number of vertices in the graph. Let V = {v1 , v2 , . . . , v|V | } be a set of vertices and E = {e1 , e2 , . . . , e|E| } be a set of edges defined on V : for any hyperedge e ∈ E, e = (e) (e) (e) (e) (e) (e) {v1 , v2 , . . . , v|e| }, where {v1 , v2 , . . . , v|e| } is a subset of V . A hyperedge e and a vertex v are called incident if v ∈ e. A non-negative real number (a weight) can be assigned to each hyperedge by a function w (w can also be defined as a vector variable and we will use both notations interchangeably). The vertex set V , hyperedge set E and the weight function w fully defines a weighted hypergraph denoted by G(V, E, w). The incidence matrix H for hypergraph G(V, E, w) is a |V | × |E| matrix with elements defined as h(v, e) = 1 (or a real positive value if H is weighted) when v ∈ e and 0 P otherwise. The degree of a vertex v is defined as d(v) = e∈E h(v, e)w(e), which is the (weighted) sum of the weights of the hyperedges incident with P v. The degree of a hyperedge e is defined as d(e) = v∈E h(v, e), which is the number of vertices incident with e. Fig 1 shows how a hypergraph is built from an arrayCGH dataset. In arrayCGH data, each spot is assigned a log-ratio denoting how the corresponding DNA copy number varies compared to the normal level. The sign of the value represent either a ‘gain’(amplification) or a ‘loss’ (deletion) of the DNA segment in the corresponding regions on the chromosomes. To represent both DNA amplification event and deletion event at each spot, the logratio values of each spot are split into an amplification group and a deletion group at the spot. Hypergraph is a natural model to represent this relation, i.e. we use two kinds of hyperedges to differentiate the amplification and deletion states of CNVs. For example, in Fig 1 we use S1 to S6 to represent individual samples in the arracyCGH data. From the arrayCGH profile on the left of the figure, sample S1, S2 and S3 have ‘gain’ state in spot1 and sample S4 and S6 have ‘loss’ state in spot1. Thus, sample S1, S2 and S3 are covered by an amplification hyperedge denoted by a solidedge ellipse and sample S4 and S6 are covered by a deletion hyperedge denoted by a dash-edge ellipse in the figure. The ‘basal’ state is regarded as no event at the spot. Let G(V, E, w) be a weighted hypergraph to model arrayCGH data: each patient sample is denoted by a vertex v ∈ V and each hyperedge denotes one of the two CNV states (gain/loss) of a spot. The incidence matrix H between V and E are defined by the CNV log-ratio values on the samples. Suppose that the log-ratio value of a spot i of sample u is L(u, i). If L(u, i) > 0 and the state is ‘gain’, Hu,ig = L(u, i) and Hu,il = 0 where ig and il are the indexes of the gain and loss states of spot i respectively; if L(u, i) < 0 and the state is ‘loss’ in this spot, Hu,ig = 0 and Hu,il = −L(u, i). Note, in this case H is a weighted incidence matrix. Given a hypergraph G(V, E, w), we define a function y to assign initial labels to V in the hypergraph. If a vertex v is in the positive patient group, y(v) = +1; if it is in the negative patient group, y(v) = −1; if v is a test sample, y(v) = 0. For normalization, in experiments we set y(v) = 1/n1 for positive vertex and y(v) = −1/n2 for negative vertex, where n1 is the number of positive samples in training set and n2 is the number of negative samples in training set. To build a predictive model and identify discriminative CNVs, the HyperPrior algorithm will learn a weighting of all the hyperedges in the hypergraph built from arrayCGH data. To avoid overfitting, we introduce a prior on the weights to be learned by the HyperPrior algorithm. In Fig 1, the amplification (deletion) states of adjacent spots are connected in a prior graph as constraints on the weights in the learning regularization framework of HyperPrior. For example, amplification state of spot 1 is connected to that of spot 2, and the one of spot 2 is connected to that of spot 3, etc. 3. HYPERGRAPH-BASED LEARNING For cancer outcome prediction, our goal is to find the correct labels for the unlabeled vertices of the test samples in the hypergraph. Let f be the objective function (vector) of labels to be learned. Intuitively, there are two criteria for learning optimal f : 1) we want to assign the same label to vertices that share many incidental hyperedges in common; 2) assignment of the labels should be similar to the initial labeling y. For criteria 1), we define the following cost function, Ω(f, w) = f (v) 2 1 X w(e) X f (u) (p −p ) 2 d(e) u,v∈e d(u) d(v) e∈E If the predicted labels on the vertices are consistent with the incidences with the hyperedges, the value of Ω(f ) should be minimized. For criteria 2), we directly measure the 2norm distance between the vectors of the predicted and the original labels as follows, X ||f − y||2 = (f (u) − y(u))2 u∈V To introduce the spatial relations of spots as prior knowledge into the hypergraph-based learning, we assume that adjacent regions on the chromosomes should receive similar weights on their associated hyperedges with the same states. We define a binary indicator δi,j to capture the adjacency relations between a pair of hyperedge ei and ej . The indicator δi,j = 1 if the two spots associated with ei and ej are adjacent on the chromosome and both ei and ej are ‘gain’ or ‘loss’ state hyperedges; otherwise 0. To assign weights to hyperedges consistent with such prior knowledge, we define the following cost function over the hyperedge weights, Ψ(w) = |E| w(ej ) 2 w(ei ) 1 X −p δi,j ( p ) , 2 i,j=1 σ(ej ) σ(ei ) P|E| where σ(ei ) = j=1 δi,j , which is the number of hyperedges adjacent to the hyperedge ei . Minimizing Ψ(w) ensures that hyperedges with the same state associated with adjacent regions on chromosomes will get similarly weighted. After the prior knowledge is introduced, our task is to minimize the sum of the three cost terms, which is minimize Ω(f, w) + µ||f − y||2 + ρΨ(w) f,w (1) subject to w(e) for ∀e ∈ E P ≥0 e∈E h(v, e)w(e) = d(v) for ∀v ∈ V. where µ and P ρ are positive real numbers. The intuition of adding e∈E h(v, e)w(e) = d(v) as another constraint is to maintain the hypergraph structure. This optimization problem can be solved efficiently with an iterative algorithm introduced by us in [11]. 4. EXPERIMENTS We tested HyperPrior algorithm on an arrayCGH dataset published in [12]. This dataset contains arrayCGH profiles of 57 bladder tumor samples. Following the data preprocess procedures in [10], we removed the probes corresponding to sexual chromosomes and then construct two types of tumor classification problems: by grade (12 tumors of Grade 1 vs. 45 tumors of higher grades) or by stage (16 tumors of Stage Ta vs. 32 tumors of Stage T2+). As a preprocessing step, we performed a k-means clustering (k = 3) on the log ratio values of all spots and use the resultant three clusters to define the ‘gain’, ‘loss’, ‘basal’ states. This preprocessing step removes some noise in the data and the clustering result is very stable on the one-dimentional data. The standard SVM with linear and RBF kernels, the original hypergraph algorithm [13] and fused SVM [10] were used as baseline algorithms. To assess the classification performance of all methods, we performed a cross-validation by a leave-oneout (LOO) procedure for the two classification problems. Fig. 3. Enriched biological functions by the 544 genes at highly weighted chromosome regions at the significance level p-value < 0.001. The enriched functions are sorted by p-values calculated using the right-tailed Fisher Exact Test. Table 1 shows the misclassification rates by the best parameters of all the methods. In both classification problems, HyperPrior algorithm achieved the best classification accuracy, which suggests that the additional cost term introduced from the spatial prior does help sample classification. The weights assigned by HyperPrior algorithm with optimal parameters for grade classification are visualized in Fig 2. The weights of hyperedges for ‘DNA amplification’ and ‘DNA deletion’ are plotted separately. It is evident in the plots that only scarce chromosomal regions are highly weighted by HyperPrior algorithm. We analyzed the biological functions of the genes located in the highly weighted chromosome regions by Gene Ontology (GO) annotations and pathway analysis with Ingenuity. We investigated whether the genes involve over-represented GO categories and biological pathways that are related with bladder cancer. We selected highly weighted chromosome regions having weights higher than 20 by HyperPrior for amplification states and deletion states separately, and find genes in these selected chromosome regions. These filtering process yields 169 genes with ‘amplification state’ and 375 genes with ‘deletion state’. With these genes together as input, Ingenuity identifies 12 enriched functions scoring a pvalue less than 1.0e-3. The enriched functions include posttranslation modification, antigen presentation, and cellular movement shows strong consistency with those identified by [14, 15, 16]. One of the interesting genes in the highly weighted deletion chromosome regions is tumor protein p53 inducible protein 3 (TP53IP3) at chromosome 2. TP53IP3 is a transcriptional target of tumor suppressor protein p53 (TP53) and plays a role in TP53-mediated apoptosis. Recent study showed that there is an association of TP53IP3 promoter polymorphism with higher-grade bladder cancer [17]. Low frequency of alternations at TP53 responsive TP53IP3 gene polymorphism is also known for its association with lung carcinomas and breast cancer risk [18]. LOO errors by grade by stage SVM (linear) 0.158 0.188 SVM (RBF) 0.158 0.188 L1-SVM 0.211 0.271 fused SVM 0.123 0.146 Hypergraph 0.193 0.188 HyperPrior 0.105 0.104 Table 1. This table shows the misclassification rates in the LOO cross-validation on the bladder cancer dataset with the grade labeling and the stage labeling. For SVMs with linear kernel and RBF kernel, parameters C = {10−5 , 10−4 , . . . , 104 , 105 } and σ = {10−5 , 10−4 , . . . , 104 , 105 } were tested. For hypergraph (and HyperPrior) algorithm, parameters α = {0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99}, and ρ = {10−3 , 10−2 , . . . , 102 , 103 } (for HyperPrior only) were tested. 80 80 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1819 20 21 22 1 40 20 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1819 20 21 22 60 weights weights 60 40 20 0 500 1000 spot locations 1500 2000 0 0 500 1000 spot locations 1500 2000 weights of DNA amplifications weights of DNA deletions Fig. 2. The figure plots separately the weights of the hyperedges of ‘amplification state’ and ‘deletion state’, assigned by the HyperPrior algorithm with best choice of α and ρ parameters for the grade classification. The spots are ordered by their locations on chromosomes and the corresponding weights are plotted in blue curves. Red lines represent the chromosome separations. 5. CONCLUSIONS In this article, we introduced a hypergraph-based method for sample classification and biomarker selection in arrayCGH data. In the experiments, our algorithm performs well in classification and identified interesting gene sets that are possibly cancer-relevant. We plan to investigate further the highly weighted chromosome regions and the associated genes for new discoveries. 6. REFERENCES [1] K. Jong, E. Marchiori, et al., “Cross-platform array comparative genomic hybridization meta-analysis separates hematopoietic and mesenchymal from epithelial tumors,” Nature Genetics, vol. 26, pp. 1499–1506, 2007. [2] S. Myllykangas, T. Böhling, and S. Knuutila, “Specificity, selection and significance of gene amplifications in cancer,” Semin Cancer Biol., vol. 17, no. 1, pp. 42–55, 2007. [3] J. R. Pollack, C. M. Perou, et al., “Genome-wide analysis of dna copy-number changes using cdna microarrays,” Nature Genetics, vol. 23, pp. 41–46, 1999. [4] J. Chaudhary and M. Schmidt, “The impact of genomic alterations on the transcriptome: a prostate cancer cell line case study,” Chromosome Research, vol. 14, no. 5, pp. 567–586, 2006. [5] F. Liu, P. J. Park, et al., “A genome-wide screen reveals functional gene clusters in the cancer genome and identifies epha2 as a mitogen in glioblastoma,” Cancer Res., vol. 66, no. 22, pp. 10815–10823, 2006. [6] H. S. Phillips, S. Kharbanda, et al., “Molecular subclasses of highgrade glioma predict prognosis, delineate a pattern of disease progression, and resemble stages in neurogenesis,” Cancer Cell., vol. 9, no. 3, pp. 157–173, 2006. [7] J. C. M Pole, C. Courtay-Cahen, et al., “High-resolution analysis of chromosome rearrangements on 8p in breast, colon and pancreatic cancer reveals a complex pattern of loss, gain and translocation,” Oncogene, vol. 25, pp. 5693–5706, 2006. [8] C. Jones, E. Ford, et al., “Molecular cytogenetic identification of subgroups of grade iii invasive ductal breast carcinomas with different clinical outcomes,” Clin Cancer Res., vol. 10, pp. 5988–5997, 2004. [9] S. F. Chin, Y. Wang, et al., “Using array-comparative genomic hybridization to define molecular portraits of primary breast cancers,” Oncogene., vol. 26, no. 13, pp. 1959–1970, 2007. [10] F. Rapaport, E. Barillot, and J. P. Vert, “Classification of arraycgh data using fused svm,” Bioinformatics, vol. 24, pp. i375–i382, 2008. [11] T. Hwang, Z. Tian, J. P. Kocher, and R. Kuang, “Learning on weighted hypergraphs to integrate protein interactions and gene expressions for cancer outcome prediction,” in IEEE International Conference on Data Mining, 2008. [12] N. Stransky, C. Vallot, et al., “Regional copy numberindependent deregulation of transcription in cancer,” Nature Genetics, vol. 38, pp. 1386–1396, 2006. [13] D. Zhou, J. Huang, and B. Schölkopf, “Learning with hypergraphs: Clustering, classification, and embedding,” in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds., Cambridge, MA, 2007, pp. 1601–1608, MIT Press. [14] M. R. Saban, H. L. Hellmich, et al., “Repeated bcg treatment of mouse bladder selectively stimulates small gtpases and hla antigens and inhibits single-spanning uroplakins,” BMC Cancer, vol. 7, no. 1, pp. 204, 2007. [15] P. A. Konstantinopoulos, M. V. Karamouzis, and A. G. Papavassiliou, “Post-translational modifications and regulation of the ras superfamily of gtpases as anticancer targets,” Nat Rev Drug Discov, vol. 6, no. 7, pp. 541–555, 2007. [16] S. C. Smith, B. Nicholson, et al., “Profiling bladder cancer organ site-specific metastasis identifies lamc2 as a novel biomarker of hematogenous dissemination,” Am J Pathol, vol. 174, no. 2, pp. 371– 379, 2009. [17] M. Ito, H. Nishiyama, et al., “Association of the pig3 promoter polymorphism with invasive bladder cancer in a japanese population,” Jpn. J. Clin. Oncol., vol. 36, pp. 116–120, 2006. [18] V. G. Gorgoulis, T. Liloglou, et al., “Absence of association with cancer risk and low frequency of alterations at a p53 responsive pig3 gene polymorphism in breast and lung carcinomas,” Mutat. Res., vol. 556, pp. 143–150, 2004.
© Copyright 2025 Paperzz