Module networks and dependency networks

Module networks
Colin Dewey
BMI/CS 576
www.biostat.wisc.edu/bmi576
Fall 2016
RECAP
• Probabilistic graphical models (PGMs) provide a natural
representation of molecular networks
• Bayesian networks are a type of PGM popular for
representing molecular networks
• Learning the structure of a Bayesian network is
computationally hard
• Sparse candidate algorithm: a heuristic algorithm for
learning Bayesian network structure
• There is often not enough data to accurately estimate
the full network structure
– Can instead estimate “features” of the network via
bootstrapping
What you should know
• Motivation of module networks
• What is a module network?
• Compare and contrast the module network
learning algorithm with other network inference
algorithms
– E.g. Sparse candidate
Module Networks
• Motivation:
– Most complex systems have too many variables
– Not enough data to robustly learn networks
– Large networks are hard to interpret
• Key idea: Group similarly behaving variables into
“modules” and learn the same parents and
parameters for each module
• Relevance to gene regulatory networks
– Genes that are co-expressed are likely regulated in
similar ways
Segal et al 2005
Definition of a module
• Statistical definition (specific to module
networks by Segal 2005)
– A set of random variables that share a statistical
model
• Biological definition of a module
– Set of genes that are co-expressed and co-regulated
An expression module
Set of genes that behave similarly across
conditions
[Figure: genes grouped into modules; labels: Genes, Modules]
Gasch & Eisen, 2002
Key questions of Module Networks
• How to represent the Conditional Probability
Distributions (CPD) for children?
– Regression Tree
• How to learn module networks?
Defining a Module Network
• A probabilistic graphical model over N random
variables
• Set of module variables M1, ..., MK
• Module assignment A that specifies the
module (1 to K) for each Xi
• CPD per module P(Mj | PaMj), where PaMj are the random
variable parents of module Mj
– Each variable Xi in Mj has the same conditional
distribution
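The definition above can be sketched as a small data structure. This is a minimal illustration, not code from the lecture: the class name `ModuleNetwork` and its fields are hypothetical, modules are numbered from 0 instead of 1, and the CPD itself is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ModuleNetwork:
    """Hypothetical container: K modules, an assignment A, parents per module."""
    num_modules: int
    assignment: Dict[str, int]  # variable name -> module index in 0..K-1
    parents: Dict[int, Set[str]] = field(default_factory=dict)  # module -> parent variables

    def members(self, module: int) -> List[str]:
        """Variables assigned to the module; they all share one CPD."""
        return sorted(v for v, m in self.assignment.items() if m == module)

    def parents_of_variable(self, variable: str) -> Set[str]:
        """A variable inherits the parents of the module it belongs to."""
        return self.parents.get(self.assignment[variable], set())


net = ModuleNetwork(
    num_modules=2,
    assignment={"X1": 0, "X2": 0, "X3": 1, "X4": 1},
    parents={1: {"X1"}},
)
print(net.members(1))                 # ['X3', 'X4']
print(net.parents_of_variable("X4"))  # {'X1'}
```

Storing parents per module rather than per variable is what makes every member of a module share the same parent set.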
Bayesian network vs Module network
Each variable takes three values: UP, DOWN, SAME
Bayesian network vs. Module network
• Bayesian network
– Different CPD per random variable
– Learning only requires a search for parents
• Module network
– CPD per module
• Same CPD for all random variables in the same module
– Learning requires parent search and module
membership assignment
Learning a Module Network
• Given a training dataset 𝒟 and a fixed
number of modules (K)
• Learn
– Module assignments A of each variable to a
module
– The parents of each module
Score of a Module network
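The score is the log marginal likelihood of the data 𝒟 given the module network structure 𝒮 and the module assignments 𝒜:
\[
\log P(\mathcal{D} \mid \mathcal{S}, \mathcal{A}) = \sum_{j=1}^{K} \log \int L_j\left(\mathrm{Pa}_{M_j}, \mathbf{X}^{j}, \theta_{M_j \mid \mathrm{Pa}_{M_j}} : \mathcal{D}\right) P\left(\theta_{M_j \mid \mathrm{Pa}_{M_j}}\right) \, d\theta_{M_j \mid \mathrm{Pa}_{M_j}}
\]
• K: number of modules
• L_j: likelihood of module j
• Pa_{M_j}: parents of module j
• X^j: random variables in module j
• θ_{M_j | Pa_{M_j}}: parameters of the CPD for module j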
Important: Marginal likelihood decomposes into sum of scores for each module
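The decomposition can be illustrated with a short sketch. Note the assumptions: instead of the Bayesian integral over parameters, `module_score` uses a BIC-style stand-in (maximized Gaussian log-likelihood minus a complexity penalty), modules have no parents, and all function names are hypothetical.

```python
import math


def gaussian_loglik(values):
    """Maximized Gaussian log-likelihood of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Floor the variance so a constant sample does not produce log(0).
    var = max(sum((v - mean) ** 2 for v in values) / n, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def module_score(values):
    """BIC-style stand-in for the per-module marginal likelihood (2 parameters)."""
    return gaussian_loglik(values) - 0.5 * 2 * math.log(len(values))


def network_score(data, assignment, num_modules):
    """Sum of per-module scores; data maps variable -> list of samples."""
    total = 0.0
    for j in range(num_modules):
        pooled = [x for var, m in assignment.items() if m == j for x in data[var]]
        if pooled:  # an empty module contributes nothing
            total += module_score(pooled)
    return total


data = {
    "X1": [1.0, 1.1, 0.9], "X2": [1.0, 1.2, 0.8],
    "X3": [5.0, 5.1, 4.9], "X4": [5.2, 4.8, 5.0],
}
good = {"X1": 0, "X2": 0, "X3": 1, "X4": 1}
bad = {"X1": 0, "X3": 0, "X2": 1, "X4": 1}
print(network_score(data, good, 2) > network_score(data, bad, 2))  # True
```

Because the total is a sum over modules, a change that touches only one module requires rescoring only that module.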
Module network learning algorithm
Greedy structure search
• Similar to Sparse Candidate algorithm
• Differences:
– Only consider add-edge or delete-edge operators
• reverse-edge not well-defined here
– Consider all n nodes (not just k candidates) as
potential parents for each module
• Not too computationally expensive because there are
only K modules (compared to n variables for a general Bayesian
network)
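A minimal sketch of the greedy search described above, using only add-edge and delete-edge moves and considering every variable as a potential parent of every module. It assumes a caller-supplied, decomposable `score_module(j, parent_set)` function; that function, the toy score, and all names here are hypothetical.

```python
def creates_cycle(parents, assignment, module, new_parent):
    """True if the edge new_parent -> module would close a cycle among modules."""
    source = assignment[new_parent]
    stack, seen = [module], set()
    while stack:  # walk the modules reachable from `module`
        u = stack.pop()
        if u == source:
            return True
        if u in seen:
            continue
        seen.add(u)
        stack.extend(m for m, ps in parents.items()
                     if any(assignment[p] == u for p in ps))
    return False


def greedy_parent_search(variables, assignment, num_modules, score_module, max_iters=100):
    """Hill-climb over module parent sets, applying the best single move each round."""
    parents = {j: set() for j in range(num_modules)}
    for _ in range(max_iters):
        best_gain, best_move = 0.0, None
        for j in range(num_modules):
            base = score_module(j, parents[j])
            for x in variables:
                if x in parents[j]:
                    candidate = parents[j] - {x}   # delete-edge
                elif not creates_cycle(parents, assignment, j, x):
                    candidate = parents[j] | {x}   # add-edge
                else:
                    continue
                gain = score_module(j, candidate) - base
                if gain > best_gain + 1e-12:
                    best_gain, best_move = gain, (j, candidate)
        if best_move is None:  # no move improves the score
            break
        parents[best_move[0]] = best_move[1]
    return parents


def toy_score(j, parent_set):
    """Hypothetical score: rewards X1 as a parent of module 1, penalizes each parent."""
    return (1.0 if j == 1 and "X1" in parent_set else 0.0) - 0.1 * len(parent_set)


assignment = {"X1": 0, "X2": 0, "X3": 1, "X4": 1}
result = greedy_parent_search(["X1", "X2", "X3", "X4"], assignment, 2, toy_score)
print(result)
```

The inner loops run over K modules times n variables, which is why scanning all n potential parents stays affordable when K is small.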
Module assignment search
• Happens in two places
• Module initialization
– Interpret as clustering of the random variables
• Module re-assignment
Module initialization as clustering of variables
for module network
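One way to picture initialization as clustering is the sketch below, which groups variables by their profiles across samples. The slides do not say which clustering method is used, so k-means appears here only as a stand-in; the function name and the deterministic seeding (first k variables in sorted order) are assumptions made for illustration.

```python
def kmeans_modules(profiles, k, iters=20):
    """Cluster variables into k initial modules.

    profiles maps variable name -> list of values across samples.
    Returns variable name -> module index.
    """
    names = sorted(profiles)
    centers = [list(profiles[n]) for n in names[:k]]  # deterministic seeds
    assignment = {}
    for _ in range(iters):
        # Assign each variable to the nearest center (squared Euclidean distance).
        new_assignment = {
            n: min(range(k),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(profiles[n], centers[j])))
            for n in names
        }
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(k):
            members = [profiles[n] for n in names if assignment[n] == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


profiles = {
    "X1": [1.0, 1.1, 0.9], "X2": [1.0, 1.2, 0.8],
    "X3": [5.0, 5.1, 4.9], "X4": [5.2, 4.8, 5.0],
}
init = kmeans_modules(profiles, 2)
print(init)
```

The cluster labels are arbitrary; only the grouping matters, and it serves as the starting module assignment that re-assignment later refines.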
Module re-assignment
• Must preserve the acyclic graph structure
• Must improve score
• Module re-assignment happens using a
sequential update procedure:
– Update only one variable at a time
– Compute the change in score from moving a variable from one
module to another while keeping the other
variables fixed
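The sequential update can be sketched as follows. This is an illustration under stated assumptions: `score` is any caller-supplied network score, `is_acyclic` is a hypothetical legality check that defaults to always allowing the move, and `within_module_score` is a toy score, not the marginal likelihood.

```python
def sequential_reassign(data, assignment, num_modules, score,
                        is_acyclic=lambda a: True, max_sweeps=10):
    """Move one variable at a time to the module that most improves the score."""
    assignment = dict(assignment)
    current = score(data, assignment)
    for _ in range(max_sweeps):
        changed = False
        for var in sorted(assignment):
            original = assignment[var]
            best_module, best_score = original, current
            for j in range(num_modules):
                if j == original:
                    continue
                assignment[var] = j  # tentative move, all other variables fixed
                if is_acyclic(assignment):
                    s = score(data, assignment)
                    if s > best_score + 1e-12:
                        best_module, best_score = j, s
            assignment[var] = best_module
            if best_module != original:
                current, changed = best_score, True
        if not changed:  # a full sweep with no move: stop
            break
    return assignment


def within_module_score(data, assignment):
    """Toy score: negative squared deviation of each profile from its module mean."""
    total = 0.0
    for j in set(assignment.values()):
        members = [data[v] for v, m in assignment.items() if m == j]
        center = [sum(col) / len(members) for col in zip(*members)]
        total -= sum((x - c) ** 2 for row in members for x, c in zip(row, center))
    return total


data = {
    "X1": [1.0, 1.1, 0.9], "X2": [1.0, 1.2, 0.8],
    "X3": [5.0, 5.1, 4.9], "X4": [5.2, 4.8, 5.0],
}
start = {"X1": 0, "X2": 1, "X3": 1, "X4": 1}
final = sequential_reassign(data, start, 2, within_module_score)
print(final)
```

For simplicity the sketch rescores the whole network after each tentative move; with a score that decomposes over modules, only the source and destination modules would need rescoring.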
Module re-assignment via sequential update
Representing the Conditional probability
distribution
• Xi are continuous variables
• How to represent the distribution of Xi given
the state of its parents?
• How to capture context-specific
dependencies?
• Module networks use a regression tree
Modeling the relationship between regulators and
targets
• Suppose we have a set of (8) genes that all have the same
activator/repressor binding sites in their upstream
regions
A regression tree
• A rooted binary tree T
• Each node in the tree is either an interior
node or a leaf node
• Interior nodes are labeled with a binary test
Xi < u, where u is a real number observed in the data
• Leaf nodes are associated with univariate
distributions of the child
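A regression tree of this kind can be sketched as a small recursive structure. The sketch assumes Gaussian leaves and the `Xi < u` form of the test; the class and function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Interior node if `variable` is set, otherwise a leaf with a Gaussian."""
    variable: Optional[str] = None    # parent variable tested at an interior node
    threshold: float = 0.0            # u in the test Xi < u
    if_true: Optional["Node"] = None
    if_false: Optional["Node"] = None
    mean: float = 0.0                 # leaf Gaussian mean for the child
    variance: float = 1.0             # leaf Gaussian variance for the child

    def is_leaf(self) -> bool:
        return self.variable is None


def find_leaf(node, parent_values):
    """Follow the binary tests down to the leaf selected by the parent values."""
    while not node.is_leaf():
        passed = parent_values[node.variable] < node.threshold
        node = node.if_true if passed else node.if_false
    return node


def log_density(tree, parent_values, child_value):
    """Log of the child's Gaussian density at the leaf chosen by its parents."""
    leaf = find_leaf(tree, parent_values)
    return -0.5 * (math.log(2 * math.pi * leaf.variance)
                   + (child_value - leaf.mean) ** 2 / leaf.variance)


tree = Node(variable="X1", threshold=0.5,
            if_true=Node(mean=-1.0), if_false=Node(mean=2.0))
print(find_leaf(tree, {"X1": 0.2}).mean)  # -1.0
print(find_leaf(tree, {"X1": 0.9}).mean)  # 2.0
```

Each root-to-leaf path is one context, so different parent configurations can select different distributions for the child.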
A regression tree to capture a CPD
e1, e2 are values seen in the data
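[Figure: regression tree for child X3 with parents X1 and X2. Interior nodes test X1 > e1 and X2 > e2, each with NO/YES branches; the remaining nodes are leaves.]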
Expression of gene represented by X3 modeled using Gaussians at each leaf node
An example regression tree for a Module
network
A very simple regression tree
[Figure: regression tree with tests X2 > e1 and X2 > e2 (NO/YES branches); labels: X2, X3, e1, e2]
Algorithm for growing a regression tree
• Input: dataset D, child variable Xi, candidate
parents Ci of Xi
• Output: Tree T
• Initialize T to a single leaf node, with parameters
estimated from all
samples of Xi
• While not converged
– For every leaf node l in T
• Find the candidate parent in Ci with the best split at l
• If the split improves the score
– Add two leaf nodes, i and j, below l
– Update the samples and parameters associated with l, i, and j
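The growing procedure above can be sketched as follows. The assumptions are loud ones: the score is a stand-in (maximized Gaussian log-likelihood minus a fixed per-leaf penalty) in place of the Bayesian score, tests have the form parent > threshold, and the function names, the dictionary-based node layout, and the default limits are all hypothetical.

```python
import math


def leaf_score(y, penalty=2.0):
    """Stand-in score: maximized Gaussian log-likelihood minus a per-leaf penalty."""
    n = len(y)
    mean = sum(y) / n
    var = max(sum((v - mean) ** 2 for v in y) / n, 1e-6)  # floor avoids log(0)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1) - penalty


def make_leaf(rows, child):
    """Leaf holding its samples and the Gaussian parameters of the child."""
    y = [r[child] for r in rows]
    mean = sum(y) / len(y)
    var = max(sum((v - mean) ** 2 for v in y) / len(y), 1e-6)
    return {"rows": rows, "mean": mean, "var": var}


def best_split(rows, child, candidates, min_leaf):
    """Best (gain, parent, threshold, no_rows, yes_rows), or None if no split is legal."""
    base = leaf_score([r[child] for r in rows])
    best = None
    for parent in candidates:
        for u in sorted({r[parent] for r in rows}):  # thresholds seen in the data
            yes = [r for r in rows if r[parent] > u]
            no = [r for r in rows if r[parent] <= u]
            if len(yes) < min_leaf or len(no) < min_leaf:
                continue
            gain = (leaf_score([r[child] for r in yes])
                    + leaf_score([r[child] for r in no]) - base)
            if best is None or gain > best[0]:
                best = (gain, parent, u, no, yes)
    return best


def grow_tree(rows, child, candidates, min_leaf=2, max_depth=3):
    """Split leaves while a split improves the score; each leaf is visited once."""
    root = make_leaf(rows, child)
    frontier = [(root, 0)]
    while frontier:
        node, depth = frontier.pop()
        if depth >= max_depth:
            continue
        split = best_split(node["rows"], child, candidates, min_leaf)
        if split is None or split[0] <= 0:
            continue  # no improving split: the node stays a leaf
        _, parent, u, no, yes = split
        node.update(test=(parent, u), no=make_leaf(no, child), yes=make_leaf(yes, child))
        frontier += [(node["no"], depth + 1), (node["yes"], depth + 1)]
    return root


rows = [
    {"X1": 0.1, "X2": 0.3, "X3": -1.0}, {"X1": 0.2, "X2": 0.9, "X3": -1.1},
    {"X1": 0.3, "X2": 0.5, "X3": -0.9}, {"X1": 0.8, "X2": 0.2, "X3": 2.0},
    {"X1": 0.9, "X2": 0.7, "X3": 2.1}, {"X1": 0.7, "X2": 0.6, "X3": 1.9},
]
tree = grow_tree(rows, "X3", ["X1", "X2"])
print(tree["test"])  # ('X1', 0.3)
```

In this toy data X3 depends only on X1, so the tree splits once on X1 and never uses X2.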
Learning a regression tree
• Assume we are searching for the parents of a
variable X3 and it already has two parents X1 and X2
• X4 will be considered using “split” operations of
existing leaf nodes
[Figure: a tree with tests X1 > e1 and X2 > e2 (NO/YES branches) and leaves N1, N2, N3, shown next to the same tree after a leaf is split on X4 > e3, which adds leaves N4 and N5. Nl: Gaussian associated with leaf l.]
Convergence in regression tree
• Depth of tree
• Improvement in score
• Maximum number of parents
• Minimum number of samples per leaf node
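The four criteria can be combined into a single check that decides whether a proposed split is allowed. This is only a sketch: the class `StopRules`, the function `may_split`, and every default value are hypothetical, since the slides give no thresholds.

```python
from dataclasses import dataclass


@dataclass
class StopRules:
    """Hypothetical limits; the numeric defaults are placeholders."""
    max_depth: int = 4
    min_gain: float = 0.0
    max_parents: int = 3
    min_samples_per_leaf: int = 5


def may_split(rules, depth, gain, parents_in_tree, new_parent, n_no, n_yes):
    """True if splitting a leaf at `depth` on `new_parent` passes all four criteria."""
    if depth >= rules.max_depth:
        return False  # depth of tree
    if gain <= rules.min_gain:
        return False  # improvement in score
    if new_parent not in parents_in_tree and len(parents_in_tree) >= rules.max_parents:
        return False  # maximum number of parents
    return min(n_no, n_yes) >= rules.min_samples_per_leaf  # samples per leaf


rules = StopRules()
print(may_split(rules, 1, 2.5, {"X1"}, "X4", 10, 12))                # True
print(may_split(rules, 1, 2.5, {"X1", "X2", "X3"}, "X4", 10, 12))    # False
```

Tree growth has converged when no leaf has a candidate split that passes the check.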
Assessing the value of using Module Networks
• Using simulated data
– Generate data from a known module network
– Known module network was in turn learned from real data
• 10 modules, 500 variables
– Evaluate using
• Test data likelihood
• Recovery of true parent-child relationships in learned module
network
• Using gene expression data
– External validation of modules (Gene ontology, motif
enrichment)
– Cross-check with literature
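For the simulated-data evaluation, recovery of true parent-child relationships can be quantified by comparing edge sets. The slides do not name a metric, so precision and recall are an assumption here, and the function name and example edges are hypothetical.

```python
def edge_recovery(true_edges, learned_edges):
    """Precision and recall of learned (parent, module) edges against the true ones."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    hits = len(true_edges & learned_edges)
    precision = hits / len(learned_edges) if learned_edges else 0.0
    recall = hits / len(true_edges) if true_edges else 0.0
    return precision, recall


# Edges are written as (parent variable, module index).
true_edges = {("X1", 1), ("X2", 1), ("X5", 2), ("X7", 3)}
learned_edges = {("X1", 1), ("X5", 2), ("X9", 2)}
precision, recall = edge_recovery(true_edges, learned_edges)
print(precision, recall)
```

Here two of the three learned edges are correct and two of the four true edges are found.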
Test data likelihood
10 modules is best for
almost all training data set
sizes
Each line type represents a training data set size
Recovery of graph structure
Application of Module networks to yeast
expression data
Segal et al, Regev, Pe’er, Gasch 2005
Module networks perform better than a
simple Bayesian network
Gain in test data
likelihood over Bayesian
network using expression
data
The Respiration and Carbon Module
Regulation
tree
Global View
of Modules
• modules for common processes
often share common
– regulators
– binding site motifs
Summary
• Module networks
– A type of Bayesian network
– Identifies modules (sets of similarly behaving random
variables) and learns parents for each module
– Conditional probability distributions capture “rules” of
regulatory relationships
– Learning requires inferring parent->module
relationships and module assignments
• In practice, they give more realistic networks than
Bayesian networks