Module networks
Colin Dewey
BMI/CS 576
www.biostat.wisc.edu/bmi576
[email protected]
Fall 2016

RECAP
• Probabilistic graphical models (PGMs) provide a natural representation of molecular networks
• Bayesian networks are a type of PGM popular for representing molecular networks
• Learning the structure of a Bayesian network is computationally hard
• Sparse candidate algorithm: a heuristic algorithm for learning Bayesian network structure
• There is often not enough data to accurately estimate the full network structure
  – Can instead estimate “features” of the network via bootstrapping

What you should know
• Motivation of module networks
• What is a module network?
• Compare and contrast the module network learning algorithm with other network inference algorithms
  – E.g., Sparse candidate

Module Networks
• Motivation:
  – Most complex systems have too many variables
  – Not enough data to robustly learn networks
  – Large networks are hard to interpret
• Key idea: Group similarly behaving variables into “modules” and learn the same parents and parameters for each module
• Relevance to gene regulatory networks
  – Genes that are co-expressed are likely regulated in similar ways
Segal et al 2005

Definition of a module
• Statistical definition (specific to module networks by Segal 2005)
  – A set of random variables that share a statistical model
• Biological definition of a module
  – Set of genes that are co-expressed and co-regulated

An expression module
Set of genes that behave similarly across conditions
[Figure: gene expression data with genes grouped into modules. Gasch & Eisen, 2002]

Key questions of Module Networks
• How to represent the Conditional Probability Distributions (CPDs) for children?
  – Regression tree
• How to learn module networks?

Defining a Module Network
• A probabilistic graphical model over N random variables X1, ..., XN
• Set of module variables M1, ..., MK
• Module assignments A that specify the module (1 to K) for each Xi
• CPD per module P(Mj | PaMj), where PaMj are the random variable parents of module Mj
  – Each variable Xi in Mj has the same conditional distribution

Bayesian network vs Module network
Each variable takes three values: UP, DOWN, SAME
[Figure: the same variables drawn as a Bayesian network and as a module network]

Bayesian network vs. Module network
• Bayesian network
  – Different CPD per random variable
  – Learning requires only a search for parents
• Module network
  – CPD per module
    • Same CPD for all random variables in the same module
  – Learning requires parent search and module membership assignment

Learning a Module Network
• Given: a training dataset and a fixed number of modules (K)
• Learn:
  – Module assignments A of each variable to a module
  – The parents of each module

Score of a Module network
\log P(\mathcal{D} \mid \mathcal{S}, \mathcal{A}) = \log \prod_{j=1}^{K} \int L_j\left(\mathrm{Pa}_{M_j}, \mathbf{X}^{j}, \theta_{M_j \mid \mathrm{Pa}_{M_j}} : \mathcal{D}\right) P\left(\theta_{M_j \mid \mathrm{Pa}_{M_j}}\right) d\theta_{M_j \mid \mathrm{Pa}_{M_j}}
• \mathcal{D}: data
• \mathcal{S}: module network structure
• \mathcal{A}: module assignments
• K: number of modules
• L_j: likelihood of module j
• \mathrm{Pa}_{M_j}: parents of module j
• \mathbf{X}^{j}: random variables in module j
• \theta_{M_j \mid \mathrm{Pa}_{M_j}}: parameters of the CPD for module j
Important: the marginal likelihood decomposes into a sum of scores, one per module (the log of the product over modules is a sum of per-module log terms)

Module network learning algorithm

Greedy structure search
• Similar to the Sparse Candidate algorithm
• Differences:
  – Only consider add-edge or delete-edge operators
    • reverse-edge is not well-defined here
  – Consider all N variables (not just k candidates) as possible parents for each module
    • Not too computationally expensive because there are only K modules (compared to N variables in a general Bayesian network)

Module assignment search
• Happens in two places
• Module initialization
  – Interpret as clustering of the random variables
• Module re-assignment

Module initialization as clustering of variables for module network
[Figure]

Module re-assignment
• Must preserve the acyclic graph structure
• Must improve the score
• Module re-assignment happens using a sequential update procedure:
  – Update only one variable at a time
  – Compute the change in score of moving a variable from one module to another while keeping the other variables fixed

Module re-assignment via sequential update
[Figure]

Representing the Conditional probability distribution
• Xi are continuous variables
• How to represent the distribution of Xi given the state of its parents?
• How to capture context-specific dependencies?
• Module networks use a regression tree

Modeling the relationship between regulators and targets
• Suppose we have a set of genes (8 here) that all have the same activator/repressor binding sites in their upstream regions

A regression tree
• A rooted binary tree T
• Each node in the tree is either an interior node or a leaf node
• Interior nodes are labeled with a binary test Xi < u, where u is a real number observed in the data
• Leaf nodes are associated with univariate distributions of the child

A regression tree to capture a CPD
[Figure: X1 and X2 are parents of X3. Interior nodes test X1 > e1 and X2 > e2, each with NO and YES branches ending in leaves. e1, e2 are values seen in the data.]
Expression of the gene represented by X3 is modeled using Gaussians at each leaf node

An example regression tree for a Module network
[Figure]

A very simple regression tree
[Figure: X3 modeled as a function of X2, with tests X2 > e1 and X2 > e2 and NO/YES branches]

Algorithm for growing a regression tree
• Input: dataset D, child variable Xi, candidate parents Ci of Xi
• Output: Tree T
• Initialize T to a leaf node, estimated from all samples of Xi
• While not converged
  – For every leaf node l in T
    • Find the candidate parent in Ci with the best split at l
    • If the split improves the score
      – Add two leaf nodes, i and j, below l
      – Update the samples and parameters associated with the new leaf nodes i and j

Learning a regression tree
• Assume we are searching for the parents of a variable X3 and it already has two parents, X1 and X2
• X4 will be considered using “split” operations on existing leaf nodes
[Figure: a tree with tests X1 > e1 and X2 > e2 and leaves N1, N2, N3, shown before and after one leaf is split on X4 > e3 into new leaves N4 and N5. Nl: Gaussian associated with leaf l.]

Convergence in regression tree
• Depth of tree
• Improvement in score
• Maximum number of parents
• Minimum number of samples per leaf node

Assessing the value of using Module Networks
• Using simulated data
  – Generate data from a known module network
  – The known module network was in turn learned from real data
    • 10 modules, 500 variables
  – Evaluate using
    • Test data likelihood
    • Recovery of true parent-child relationships in the learned module network
• Using gene expression data
  – External validation of modules (Gene Ontology, motif enrichment)
  – Cross-check with literature

Test data likelihood
[Figure: test data likelihood; each line type represents a training data size]
10 modules is the best for almost all training data set sizes

Recovery of graph structure
[Figure]

Application of Module networks to yeast expression data
Segal et al, Regev, Pe’er, Gasch 2005

Module networks perform better than a simple Bayesian network
[Figure: gain in test data likelihood over a Bayesian network, using expression data]

The Respiration and Carbon Module
[Figure: regulation tree]

Global View of Modules
• Modules for common processes often share common
  – regulators
  – binding site motifs

Summary
• Module networks
  – A type of Bayesian network
  – Identify modules (sets of similarly behaving random variables) and learn parents for each module
  – Conditional probability distributions capture “rules” of regulatory relationships
  – Learning requires inferring parent->module relationships and module assignments
• In practice, module networks give more realistic networks than ordinary Bayesian networks
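
Worked sketch: the split step of regression tree growing
The "Find the best split at a leaf" step from the regression tree slides can be illustrated with a small Python sketch. This is a minimal illustration under stated assumptions, not the scoring used by Segal et al.: it scores a split by the gain in maximum-likelihood Gaussian log-likelihood instead of a Bayesian score, and the names `gaussian_loglik`, `best_split`, `min_leaf`, `min_gain`, and `min_var` are hypothetical. The test used is "parent > u" with u taken from the observed data, as in the slide figures.

```python
import math


def gaussian_loglik(values, min_var=1e-6):
    """Log-likelihood of values under a Gaussian fitted to them.

    min_var is an assumed variance floor that avoids log(0) on constant data.
    """
    n = len(values)
    mean = sum(values) / n
    ss = sum((v - mean) ** 2 for v in values)
    var = max(ss / n, min_var)
    return -0.5 * n * math.log(2 * math.pi * var) - ss / (2 * var)


def best_split(child, parents, min_leaf=2, min_gain=1.0):
    """Find the test 'parent > u' that most improves the leaf's score.

    child:   samples of the child variable that reach this leaf
    parents: dict mapping candidate parent name -> samples (same order)
    Returns (gain, parent_name, u), or None if no split gains more than
    min_gain (an assumed stand-in for "split improves score").
    """
    base = gaussian_loglik(child)
    best = None
    for name, values in parents.items():
        # Thresholds are observed values; the largest one is skipped
        # because it would leave the YES branch empty.
        for u in sorted(set(values))[:-1]:
            no = [c for c, v in zip(child, values) if v <= u]
            yes = [c for c, v in zip(child, values) if v > u]
            if len(no) < min_leaf or len(yes) < min_leaf:
                continue  # minimum number of samples per leaf node
            gain = gaussian_loglik(no) + gaussian_loglik(yes) - base
            if gain > min_gain and (best is None or gain > best[0]):
                best = (gain, name, u)
    return best


# Toy data (made up for illustration): X3 is low when X1 is low, high otherwise.
x3 = [-1.0, -1.1, -0.9, -1.0, 1.0, 1.1, 0.9, 1.0]
candidates = {
    "X1": [0.1, 0.2, 0.3, 0.4, 2.1, 2.2, 2.3, 2.4],
    "X2": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
}
gain, name, u = best_split(x3, candidates)
print(f"split on {name} > {u}")
```

Growing a full tree would repeat this search at every leaf until one of the convergence criteria (tree depth, improvement in score, number of parents, samples per leaf) stops it. In a module network, the samples at a leaf would pool the expression values of every gene in the module, which is what lets all genes in a module share one CPD.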