PCA-CMI: Gene Regulatory Network Inference (direct regulation)
Causality as statistical dependency
Luonan Chen, Chinese Academy of Sciences
Zhang, et al., Bioinformatics 2011; Zhang, et al., Bioinformatics 2012; Zhang, et al., NAR 2014

1. Gene regulatory networks (GRNs) and their reconstruction

Transcription and regulation
Transcription is the process of creating a complementary RNA copy of a DNA sequence, the first step of gene expression. Transcription is regulated, or controlled, by transcription factors.

Gene regulatory networks
A gene regulatory network is the graph formed by a collection of DNA segments in a cell that interact with each other to govern gene expression.

Function of gene regulatory networks
• Gene regulatory networks explicitly represent the causality of biological processes.
• Gene regulatory networks are critical for understanding biological processes such as cellular differentiation, development, and disease mechanisms.

Reverse engineering of gene networks
Microarray → gene expression profiles → reverse engineering → reconstruction of the network.

2. Advances in GRN inference methods

Popular methods
There are many approaches based on gene expression data, e.g.
• Boolean networks
• Bayesian networks
• Differential equations
• Regression methods
• Linear programming
• Correlation methods (Pearson correlation coefficient, mutual information)

Strengths and weaknesses of different methods
Current models have many limitations (Bansal et al., Mol. Syst. Biol. 2007; Marbach et al., PNAS 2010; De Smet et al., Nat. Rev. Microbiol. 2010):
• Boolean networks: discrete model, time-consuming
• Bayesian networks: discrete model, time-consuming, cannot represent feedback loops (acyclic only)
• Regression methods: linear, not good for high dimension
• Linear programming: linear, not good for high dimension
• Differential equations: not good for high dimension
• Pearson correlation: good for high dimension, but not for nonlinear dependence
• Mutual information: good for nonlinear dependence, but the inferred networks are not sparse

3. PCA-CMI
PCA-CMI infers GRNs from gene expression data while accounting for both the nonlinear dependence and the topological structure of GRNs, by employing a path consistency algorithm (PCA) based on conditional mutual information (CMI). It is an iterative inference method.

Information theory
• For a discrete variable (gene) X, the entropy H(X) measures the average uncertainty of X:
  H(X) = -\sum_{x \in X} p(x) \log p(x),
  where p(x) is the probability of each discrete value x.
• The joint entropy H(X, Y) of X and Y is
  H(X, Y) = -\sum_{x \in X, y \in Y} p(x, y) \log p(x, y),
  where p(x, y) is the joint probability of x and y.

Conditional mutual information
• Mutual information (MI) measures the dependency between two variables. For X and Y,
  I(X, Y) = \sum_{x \in X, y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} = H(X) + H(Y) - H(X, Y).
• Conditional mutual information measures the conditional dependency between two variables (genes) given other variable(s). The CMI of X and Y given Z is
  I(X, Y \mid Z) = \sum_{x \in X, y \in Y, z \in Z} p(x, y, z) \log \frac{p(x, y \mid z)}{p(x \mid z)\, p(y \mid z)} = H(X, Z) + H(Y, Z) - H(Z) - H(X, Y, Z).

Entropy
• Entropy is estimated with a Gaussian kernel probability density estimator:
  P(X_i) = \frac{1}{N} \sum_{j=1}^{N} (2\pi)^{-n/2} |C|^{-1/2} \exp\!\left(-\frac{1}{2}(X_j - X_i)^T C^{-1} (X_j - X_i)\right),
  where C is the covariance matrix of variable X, and N and n are the numbers of samples and variables (genes) in C.
• The entropy of variable X then follows as
  H(X) = \log\!\left[(2\pi e)^{n/2} |C|^{1/2}\right] = \frac{1}{2} \log\!\left[(2\pi e)^{n} |C|\right].

Gaussian assumption
• Under the Gaussian assumption, mutual information and conditional mutual information can be expressed as
  I(X, Y) = \frac{1}{2} \log \frac{|C(X)| \cdot |C(Y)|}{|C(X, Y)|},
  I(X, Y \mid Z) = \frac{1}{2} \log \frac{|C(X, Z)| \cdot |C(Y, Z)|}{|C(Z)| \cdot |C(X, Y, Z)|}.
• When variables (genes) X and Y are independent, I(X, Y) = 0. Similarly, if X and Y are conditionally independent given Z, I(X, Y | Z) = 0.
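Under the Gaussian assumption, the MI and CMI formulas above reduce to ratios of covariance determinants. The following is a minimal NumPy sketch; the function names are ours for illustration, not part of the published software.

```python
import numpy as np

def gaussian_mi(data, x, y):
    # I(X,Y) = 1/2 * log(|C(X)| * |C(Y)| / |C(X,Y)|) for columns x, y
    # of a samples-by-genes matrix `data`.
    c = np.cov(data[:, [x, y]], rowvar=False)
    return 0.5 * np.log(c[0, 0] * c[1, 1] / np.linalg.det(c))

def gaussian_cmi(data, x, y, z):
    # I(X,Y|Z) = 1/2 * log(|C(X,Z)| * |C(Y,Z)| / (|C(Z)| * |C(X,Y,Z)|));
    # z is a list of conditioning gene (column) indices.
    def det_cov(cols):
        # atleast_2d handles the single-variable case, where np.cov
        # returns a 0-d array (the variance).
        return np.linalg.det(np.atleast_2d(np.cov(data[:, cols], rowvar=False)))
    z = list(z)
    return 0.5 * np.log(det_cov([x] + z) * det_cov([y] + z)
                        / (det_cov(z) * det_cov([x, y] + z)))
```

For two genes that are both driven by a third, `gaussian_mi` is large while `gaussian_cmi` given the common regulator is close to zero, which is exactly the signal PCA-CMI uses to remove indirect edges.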
Path consistency algorithm (PCA-CMI)

High-order correlation (partial correlation): correlation versus causation
Three 4-node examples (genes A, B, X, Y) illustrate why increasing orders of partial correlation are needed to distinguish direct from indirect interactions:
• Indirect relation, detected at order 1: Corr(X,Y)=1 at order 0, but Corr(X,Y|A)=0 at order 1.
• Indirect relation, detected only at order 2: Corr(X,Y)=1 and Corr(X,Y|A)=1, but Corr(X,Y|A,B)=0. Higher orders require more samples!
• Direct relation between X and Y: Corr(X,Y)=1, Corr(X,Y|A)=1, and Corr(X,Y|A,B)=1; the dependence survives all conditioning.
For a 4-node network, partial correlation up to order 2 is required, which needs at least 4 independent samples. In general, for an n-node network, partial correlation up to order n-2 is required to correctly identify direct interactions, but for a high-dimensional network it is usually impossible to obtain the inverse (covariance) matrix. Causality is thus assessed as statistical dependency that persists under conditioning.

Flowchart for PCA-CMI
PCA-CMI detects the true network even with small sample sizes.

Novelty of PCA-CMI
• PCA-CMI is based on a nonlinear measure of dependence rather than a linear one.
• PCA-CMI is able to distinguish direct (causal) interactions from indirect associations in the sense of statistical dependency.
• It avoids the matrix-singularity problem of full-order partial correlation.
• Under the Gaussian assumption, CMI is computed by a concise closed-form formula.

Practical package for genome-wide networks
• GPU-accelerated software by the CAS PICB group is available.
• It can construct a network with 20,000-40,000 genes.

Simpson's paradox
Networks inferred at different CMI orders from the same gene expression data illustrate Simpson's paradox. Edges G2-G4, G4-G5 and G2-G9 are false positives, while edge G4-G9 is a false negative; the false-positive edges G2-G4 and G4-G5 present in the zero-order network were successfully removed by PCA-CMI.
• A higher-order network has higher accuracy (ACC) and a lower false-positive rate (FPR) than a lower-order network.

Publications
Zhang, et al., Bioinformatics 2011; Zhang, et al., Bioinformatics 2012; Zhang, et al., NAR 2014
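The path consistency loop of PCA-CMI, as described above, can be sketched in NumPy: start from the complete graph and, at each order, delete an edge when the maximal CMI over conditioning sets drawn from the two genes' common neighbours falls below a threshold. This is a minimal illustration under the Gaussian assumption, not the published GPU implementation; the function names, default threshold, and neighbour-selection rule are our own simplifications.

```python
import itertools
import numpy as np

def cmi_gaussian(data, x, y, z=()):
    # I(X,Y|Z) under the Gaussian assumption, from covariance determinants;
    # with an empty z this is the plain mutual information I(X,Y).
    def det_cov(cols):
        return np.linalg.det(np.atleast_2d(np.cov(data[:, cols], rowvar=False)))
    z = list(z)
    if not z:
        return 0.5 * np.log(det_cov([x]) * det_cov([y]) / det_cov([x, y]))
    return 0.5 * np.log(det_cov([x] + z) * det_cov([y] + z)
                        / (det_cov(z) * det_cov([x, y] + z)))

def pca_cmi(data, theta=0.05, max_order=2):
    # Path consistency: start from the complete graph; at order L, delete
    # edge (x, y) when the maximal CMI over all L-subsets of their common
    # neighbours falls below the threshold theta.
    n_genes = data.shape[1]
    adj = ~np.eye(n_genes, dtype=bool)  # complete graph, no self-loops
    for order in range(max_order + 1):
        for x in range(n_genes):
            for y in range(x + 1, n_genes):
                if not adj[x, y]:
                    continue
                nbrs = [k for k in range(n_genes)
                        if k not in (x, y) and adj[x, k] and adj[y, k]]
                cmis = [cmi_gaussian(data, x, y, zset)
                        for zset in itertools.combinations(nbrs, order)]
                if cmis and max(cmis) < theta:
                    adj[x, y] = adj[y, x] = False
    return adj
```

On a toy fan-out network where gene A regulates both X and Y, the loop removes the indirect X-Y edge once conditioning on the common regulator drives the CMI below theta, while the two direct edges survive every conditioning set.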