Gene Regulatory Networks

PCA-CMI: Gene Regulatory Network Inference (direct regulation)
Causality vs. statistical dependency

Luonan Chen
Chinese Academy of Sciences

Zhang, et al., Bioinformatics 2011; Zhang, et al., Bioinformatics 2012; Zhang, et al., NAR, 2014
1. Gene Regulatory Network (GRN) and Reconstruction
Transcription and regulation
Transcription is the process of creating a complementary RNA copy of a DNA sequence, which leads to gene expression. Transcription is regulated or controlled by transcription factors.
Gene regulatory networks
A collection of DNA segments in a cell interact with each other and form a graph; this graph is the gene regulatory network.
Function of gene regulatory networks
• Gene regulatory networks explicitly represent the causality of biological processes.
• Gene regulatory networks are critical for understanding biological processes such as cellular differentiation, development, and disease mechanisms.
Reverse Engineering of Gene Networks
Microarray → gene expression profiles → reverse engineering → reconstruction of the network
2. Advances in GRN Inference Methods
Popular methods
There are many approaches based on gene expression, e.g.
• Boolean Networks
• Bayesian Networks
• Differential Equation
• Regression Method
• Linear Programming
• Correlation Methods
(Pearson Correlation Coefficient, Mutual Information)
Strengths and weaknesses of different methods
There are many problems with current models (Bansal et al., Mol. Syst. Biol., 2007; Marbach et al., PNAS, 2010; Smet et al., Nat. Rev. Microbiol., 2010).
• Boolean Networks (discrete model, time-consuming)
• Bayesian Networks (discrete model, time-consuming, cannot represent feedback loops)
• Regression Methods (linear, not good for high dimensions)
• Linear Programming (linear, not good for high dimensions)
• Differential Equations (not good for high dimensions)
• Pearson Correlation (good for high dimensions, but not for nonlinear relations)
• Mutual Information (good for nonlinear relations, but not sparse)
3. PCA-CMI
PCA-CMI infers GRNs from gene expression data while accounting for both the nonlinear dependence and the topological structure of GRNs, by employing the path consistency algorithm (PCA) based on conditional mutual information (CMI).
Nonlinear dependence
Iterative inference
Method (PCA-CMI)
Information theory
• For a discrete variable (gene) X, the entropy H(X) is the measure of the average uncertainty of X:
  $H(X) = -\sum_{x \in X} p(x) \log p(x)$,
  where p(x) is the probability of each discrete value x.
• The joint entropy H(X, Y) of X and Y is
  $H(X, Y) = -\sum_{x \in X,\, y \in Y} p(x, y) \log p(x, y)$,
  where p(x, y) is the joint probability of x and y.
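As a concrete illustration of these definitions, here is a minimal sketch in Python, assuming the gene expression values have already been discretized (e.g. by binning); the data and helper names are hypothetical.

```python
import numpy as np
from collections import Counter

def entropy(x):
    """H(X) = -sum p(x) log p(x), from empirical frequencies."""
    counts = np.array(list(Counter(x).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def joint_entropy(x, y):
    """H(X, Y) = -sum p(x, y) log p(x, y), from empirical joint frequencies."""
    counts = np.array(list(Counter(zip(x, y)).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

# Two discretized expression vectors over six samples (hypothetical data).
x = [0, 1, 1, 0, 2, 1]
y = [0, 1, 1, 0, 1, 1]
print(entropy(x), joint_entropy(x, y))
```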
Conditional mutual information
• Mutual information (MI) measures the dependency between two variables. For X and Y,
  $I(X, Y) = \sum_{x \in X,\, y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} = H(X) + H(Y) - H(X, Y)$.
• Conditional mutual information (CMI) measures the conditional dependency between two variables (genes) given other variable(s). The CMI of X and Y given Z is
  $I(X, Y \mid Z) = \sum_{x \in X,\, y \in Y,\, z \in Z} p(x, y, z) \log \frac{p(x, y \mid z)}{p(x \mid z)\, p(y \mid z)} = H(X, Z) + H(Y, Z) - H(Z) - H(X, Y, Z)$.
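The entropy decompositions above give MI and CMI directly from joint entropies; a minimal self-contained sketch (hypothetical helper names, discretized data as before):

```python
import numpy as np
from collections import Counter

def H(*variables):
    """Joint entropy of one or more discrete variables."""
    counts = np.array(list(Counter(zip(*variables)).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mutual_info(x, y):
    # I(X,Y) = H(X) + H(Y) - H(X,Y)
    return H(x) + H(y) - H(x, y)

def conditional_mutual_info(x, y, z):
    # I(X,Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)
    return H(x, z) + H(y, z) - H(z) - H(x, y, z)
```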
Entropy
• The probability density is estimated with a Gaussian kernel density estimator:
  $\hat{P}(X_i) = \frac{1}{N} \sum_{j=1}^{N} (2\pi)^{-n/2}\, |C|^{-1/2} \exp\!\left(-\tfrac{1}{2}(X_j - X_i)^{T} C^{-1} (X_j - X_i)\right)$,
  where C is the covariance matrix of variable X, and N and n are the numbers of samples and of variables (genes) in C.
• The entropy of variable X then follows as
  $H(X) = \log\!\left[(2\pi e)^{n/2} |C|^{1/2}\right] = \tfrac{1}{2} \log\!\left[(2\pi e)^{n} |C|\right]$.
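Under this Gaussian assumption the entropy depends only on the determinant of the sample covariance; a minimal sketch (function name hypothetical; samples in rows, genes in columns):

```python
import numpy as np

def gaussian_entropy(X):
    """H(X) = 0.5 * log((2*pi*e)^n * |C|) for an (N samples, n genes) array."""
    X = np.asarray(X, dtype=float)
    X = X.reshape(len(X), -1)                    # a 1-D vector becomes one column
    n = X.shape[1]
    C = np.atleast_2d(np.cov(X, rowvar=False))   # n x n sample covariance
    return 0.5 * np.log(((2 * np.pi * np.e) ** n) * np.linalg.det(C))
```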
Gaussian Assumption
• Under the Gaussian assumption, mutual information can be expressed in terms of covariance determinants:
  $I(X, Y) = \frac{1}{2} \log \frac{|C(X)| \cdot |C(Y)|}{|C(X, Y)|}$,
  $I(X, Y \mid Z) = \frac{1}{2} \log \frac{|C(X, Z)| \cdot |C(Y, Z)|}{|C(Z)| \cdot |C(X, Y, Z)|}$.
When the variables (genes) X and Y are independent, I(X, Y) = 0. Similarly, if X and Y are conditionally independent given Z, then I(X, Y | Z) = 0.
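A direct translation of these two formulas into code (a minimal sketch; x and y are 1-D expression vectors over samples, and z may hold one or several conditioning genes as columns; helper names are hypothetical):

```python
import numpy as np

def _logdet_cov(*cols):
    """log-determinant of the sample covariance of the stacked variables."""
    M = np.column_stack(cols)                     # shape (N samples, k variables)
    C = np.atleast_2d(np.cov(M, rowvar=False))
    return np.log(np.linalg.det(C))

def gaussian_mi(x, y):
    # I(X,Y) = 0.5 * log( |C(X)| * |C(Y)| / |C(X,Y)| )
    return 0.5 * (_logdet_cov(x) + _logdet_cov(y) - _logdet_cov(x, y))

def gaussian_cmi(x, y, z):
    # I(X,Y|Z) = 0.5 * log( |C(X,Z)| * |C(Y,Z)| / (|C(Z)| * |C(X,Y,Z)|) )
    return 0.5 * (_logdet_cov(x, z) + _logdet_cov(y, z)
                  - _logdet_cov(z) - _logdet_cov(x, y, z))
```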
Path consistency algorithm (PCA-CMI)
High-Order Correlation (Partial Correlation)
[Figure: three 4-node networks with nodes A, B, X, Y, compared by correlation order]
• Network 1: 0-order Corr(X,Y)=1; 1-order Corr(X,Y|A)=0.
• Network 2: 0-order Corr(X,Y)=1; 1-order Corr(X,Y|A)=1; 2-order Corr(X,Y|A,B)=0.
• Network 3: 0-order Corr(X,Y)=1; 1-order Corr(X,Y|A)=1; 2-order Corr(X,Y|A,B)=1.
For an n-node network, correlations of order up to n-2 are required to correctly identify the direct interactions.
But it is usually impossible to obtain the inverse matrix for a high-dimensional network.
Correlation ≠ causation
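To make the order-by-order conditioning concrete, here is a minimal sketch of partial correlation computed via regression residuals (the helper name and simulated data are hypothetical, not from the original slides):

```python
import numpy as np

def partial_corr(x, y, conditioning=()):
    """Corr(X, Y | conditioning): correlation after regressing out the given variables."""
    if conditioning:
        Z = np.column_stack([np.ones(len(x))] + list(conditioning))  # add intercept
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(x, y)[0, 1]

# A hypothetical common-cause structure A -> X and A -> Y:
# X and Y correlate at order 0, but the correlation vanishes given A (order 1).
rng = np.random.default_rng(0)
A = rng.normal(size=200)
X = A + 0.1 * rng.normal(size=200)
Y = A + 0.1 * rng.normal(size=200)
print(partial_corr(X, Y))        # close to 1
print(partial_corr(X, Y, (A,)))  # close to 0
```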
High-Order Correlation (Partial Correlation)
(Direct relation for X and Y?)
[Figure: 4-node network with nodes A, B, X, Y]
• 0-order correlation: Corr(X,Y)=1
• 1-order correlation: Corr(X,Y|A)=0

Causality vs. statistical dependency
(Direct relation for Y and X?)
[Figure: 4-node network with nodes A, B, X, Y]
• 0-order correlation: Corr(X,Y)=1
• 1-order correlation: Corr(X,Y|A)=1
• 2-order correlation: Corr(X,Y|A,B)=0
Higher-order conditioning requires more samples!

(Direct relation for Y and X?)
[Figure: 4-node network with nodes A, B, X, Y]
• 0-order correlation: Corr(X,Y)=1
• 1-order correlation: Corr(X,Y|A)=1
• 2-order correlation: Corr(X,Y|A,B)=1
For a 4-node network, correlations of order up to 2 are required, and at least 4 independent samples are needed.
Flowchart for PCA-CMI
PCA-CMI detects the true network even with small samples.
Zhang, et al., Bioinformatics 2011 ; Zhang, et al., Bioinformatics 2012 ; Zhang, et al., NAR, 2014
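For orientation, here is a minimal sketch of the path-consistency pruning loop driven by Gaussian CMI; this is not the authors' released implementation, and the threshold `theta` and the order cap are hypothetical choices.

```python
import numpy as np
from itertools import combinations

def _logdet_cov(*cols):
    M = np.column_stack(cols)
    C = np.atleast_2d(np.cov(M, rowvar=False))
    return np.log(np.linalg.det(C))

def cmi(x, y, z=None):
    """Gaussian (C)MI: I(X,Y) if z is None, otherwise I(X,Y|Z)."""
    if z is None:
        return 0.5 * (_logdet_cov(x) + _logdet_cov(y) - _logdet_cov(x, y))
    return 0.5 * (_logdet_cov(x, z) + _logdet_cov(y, z)
                  - _logdet_cov(z) - _logdet_cov(x, y, z))

def pca_cmi(expr, theta=0.03, max_order=3):
    """Prune a fully connected gene graph by increasing conditioning order.

    expr: (N samples, n genes) expression matrix; theta: edge-deletion
    threshold (hypothetical value). Returns a boolean adjacency matrix.
    """
    n = expr.shape[1]
    adj = ~np.eye(n, dtype=bool)                 # start from a fully connected graph
    # Order 0: drop edges whose mutual information is below the threshold.
    for i, j in combinations(range(n), 2):
        if cmi(expr[:, i], expr[:, j]) < theta:
            adj[i, j] = adj[j, i] = False
    # Order L >= 1: condition on L common neighbours; keep the edge only if
    # the maximal CMI over all size-L conditioning sets still exceeds theta.
    for L in range(1, max_order + 1):
        for i, j in combinations(range(n), 2):
            if not adj[i, j]:
                continue
            neigh = [k for k in range(n) if adj[i, k] and adj[j, k]]
            if len(neigh) < L:
                continue
            best = max(cmi(expr[:, i], expr[:, j], expr[:, list(ks)])
                       for ks in combinations(neigh, L))
            if best < theta:
                adj[i, j] = adj[j, i] = False
    return adj
```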
Novelty of PCA-CMI
• PCA-CMI is based on a nonlinear measure of dependence.
• PCA-CMI is able to distinguish direct or causal interactions from indirect associations in the sense of statistical dependency.
• It avoids the singularity of the covariance matrix.
• Under the Gaussian distribution, CMI is computed by a concise formula.
Zhang, et al., Bioinformatics 2011 ; Zhang, et al., Bioinformatics 2012 ; Zhang, et al., NAR, 2014
Practical Package for Genome-Wide Networks
• GPU-accelerated software from the CAS PICB group is available.
• It can construct a network with 20,000-40,000 genes.
Simpson’s paradox
[Figure: networks inferred with different CMI orders from gene expression data.]
Edges with red dotted lines G2-G4, G4-G5, and G2-G9 are false positives, while edge G4-G9 is a false negative. The false positive edges G2-G4 and G4-G5 in the zero-order network were successfully removed by PCA-CMI.
• A higher-order network has higher accuracy (ACC) and a lower false positive rate (FPR) than a lower-order network.
Publications
Zhang, et al., Bioinformatics 2011;
Zhang, et al., Bioinformatics 2012;
Zhang, et al., NAR, 2014