Overview of Bayesian Networks With Examples in R
(Scutari and Denis 2015)
Overview
• Theory
• Types of Bayesian networks
• Learning Bayesian networks
  • Structure learning
  • Parameter learning
• Using Bayesian networks
  • Queries
    • Conditional independence
    • Inference based on new evidence
      • Hard vs. soft evidence
      • Conditional probability vs. most likely outcome (a.k.a. maximum a posteriori)
      • Exact
      • Approximate
• R packages for Bayesian networks
• Case study: protein signaling network
Please install bnlearn in R: install.packages("bnlearn")
Theory
Bayesian networks (BNs)
• Represent a probability distribution as a probabilistic directed acyclic
graph (DAG)
• Graph = nodes and edges (arcs) denote variables and dependencies, respectively
• Directed = arrows represent the directions of relationships between nodes
• Acyclic = if you trace arrows with a pencil, you cannot traverse back to the same node
without picking up your pencil
• Probabilistic = each node has an associated probability that can be influenced by values
other nodes assume based on the structure of the graph
• The node at the tail of a connection is called the parent, and the node
at the head of the connection is called its child
• Ex. A  B: A is the parent, B is its child
Which is a BN?
[Figure: two candidate graphs, A and B]
Taken from http://www.cse.unsw.edu.au/~cs9417ml/Bayes/Pages/Bayesian_Networks_Definition.html
Factorization into local distributions
• Factorization of the joint distribution of all variables (global
distribution) into local distributions encoded by the DAG is:
1) intuitively appealing
2) reduces the number of variables involved in each computation, and thus the computational requirements, when using the BN for inference
3) increases the power of parameter learning, since each local distribution has far fewer parameters to estimate from the same data
• The dimensions of local distributions usually do not scale with the
size of the BN
• Each variable (node) depends only on its parents
• For the BN here:
P(A, S, E, O, R, T) = P(A) P(S) P(E | A, S) P(O | E) P(R | E) P(T | O, R)
• A and S are often referred to as root nodes
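• A minimal sketch of this DAG in R (assuming the survey network shown on this slide), which confirms that each node's local distribution involves only the node and its parents:
> library(bnlearn)
> dag <- model2network("[A][S][E|A:S][O|E][R|E][T|O:R]")
> parents(dag, "E")   # A and S
> parents(dag, "T")   # O and R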
Fundamental connections
1) Serial connection, e.g. A → E → O
2) Divergent connection, e.g. O ← E → R
3) Convergent connection, e.g. A → E ← S (also
referred to as a v-structure)
• The child of a convergent connection is often referred
to as a collider
• Only (immoral) v-structures uniquely define
probabilistic relationships
• Ex. The serial connection B → A → C and the divergent connection B ← A → C encode the same probabilistic relationships:
P(B) P(A|B) P(C|A) = P(A, B) P(C|A) = P(B|A) P(A) P(C|A)
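• In bnlearn, the v-structures of a DAG can be listed directly; a minimal sketch, again assuming the survey DAG:
> library(bnlearn)
> dag <- model2network("[A][S][E|A:S][O|E][R|E][T|O:R]")
> vstructs(dag)   # the unshielded colliders A -> E <- S and O -> T <- R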
d-separation
• “d” stands for dependence
• Defines conditional independencies/dependencies
• Determines whether a set of X variables is independent of another
set Y, given a third set Z
• Intuitively important because it reveals how variables are related
• Computationally important because it provides a means for efficient
inference
• Reduces the effective dimension of inference problems
d-separation
• Formal definition: If A, B, and C are three disjoint subsets of nodes in
a DAG G, then C is said to d-separate A from B if along every path
between a node in A and a node in B there is a node v satisfying one
of the following two conditions:
1) v has converging arcs (i.e. there are two arcs pointing to v from the adjacent
nodes in the path) and neither v nor any of its descendants (i.e. the nodes
that can be reached from v) are in C
or
2) v is in C and does not have converging arcs
d-separation practice
• In R:
> library(bnlearn)
> dag <- model2network("[A][S][E|A:S][O|E][R|E][T|O:R][Z|T]")
> dsep(bn = dag, x = "A", y = "O", z = "E")
[1] TRUE
• This says that A and O are conditionally independent given E
> dsep(bn = dag, x = "A", y = "S")
[1] TRUE
> dsep(bn = dag, x = "A", y = "S", z = "E")
[1] FALSE
• Conditioning on a collider or its descendants (Z) makes the
parent nodes dependent
• Intuitively, if we know E, then certain combinations of A and S are more
likely than others, so A and S become conditionally dependent
• Note that it is impossible for nodes directly linked by an edge
to be independent conditional on any other node
Equivalence class = CPDAG
• Two DAGs defined over the same set of
variables are equivalent if and only if they:
1) have the same skeleton (i.e. the same
underlying undirected graph)
and
2) the same v-structures
• Compelled edges: edges whose directions are
oriented in the equivalence class because
assuming the opposite direction would:
1) introduce new v-structures (and thus a different
DAG)
or
2) introduce cycles (and thus the resulting graph
would no longer be a DAG)
• Note that DAGs can be probabilistically
equivalent but encode very different causal
relationships!
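• The CPDAG of a DAG can be obtained with cpdag() in bnlearn; a minimal sketch with the survey DAG assumed above:
> library(bnlearn)
> dag <- model2network("[A][S][E|A:S][O|E][R|E][T|O:R]")
> cpdag(dag)   # compelled arcs stay directed, the remaining edges are undirected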
Markov blankets
• Evidence on the values of a node's parents, children,
and the other nodes sharing a child with it provides
information about that node
  • Inference is most powerful when considering
  all these nodes (due to the use of Bayes'
  theorem when querying)
• Markov blanket defines this set of nodes
and effectively d-separates a given node
from the rest of the graph
• Symmetry of Markov blankets
• If node A is in the Markov blanket of node B,
then B is in the Markov blanket of A
[Figure: the Markov blanket of node X9]
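• Markov blankets can be extracted with mb() in bnlearn; a minimal sketch with the survey DAG assumed earlier:
> library(bnlearn)
> dag <- model2network("[A][S][E|A:S][O|E][R|E][T|O:R]")
> mb(dag, node = "E")   # parents A, S and children O, R (O and R have no other parents)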
Beyond dependencies: causal inference
• While a directed graph seems to suggest causality, in reality additional criteria
must be met
  • Ex. The presence of a latent variable significantly altered the DAG used to
  represent the relationships between test scores
• Specially designed perturbation experiments can be employed to characterize
causal relationships
• Algorithms also exist that attempt to elucidate causal relationships from
observational data
  • Oftentimes, the "high p, small n" nature of the data results in subsets
  (equivalence classes) of possible causal networks
• "If conditional independence judgments are byproducts of stored causal
relationships, then tapping and representing those relationships directly
would be a more natural and more reliable way of expressing what we know
or believe about the world"
Types of BNs
Discrete BNs
• All variables contain discrete data
  • Ex. Multinomial distribution
• A = age: young, adult, or old
• S = gender: male or female
• E = education: high or uni
• R = residence: small or big

Conditional probability table of A
A:      young  adult  old
        0.30   0.50   0.20

Conditional probability table of E (given A and S)
Gender = M
E \ A   young  adult  old
high    0.75   0.72   0.88
uni     0.25   0.28   0.12
Gender = F
E \ A   young  adult  old
high    0.64   0.70   0.90
uni     0.36   0.30   0.10

Conditional probability table of R (given E)
R \ E   high   uni
small   0.25   0.20
big     0.75   0.80
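• A minimal sketch of how these CPTs could be encoded by hand with bnlearn's custom.fit; the marginal distribution of S (0.6/0.4) is a made-up placeholder, since it is not shown above:
> library(bnlearn)
> dag <- model2network("[A][S][E|A:S][R|E]")
> cpt.A <- matrix(c(0.30, 0.50, 0.20), ncol = 3,
    dimnames = list(NULL, c("young", "adult", "old")))
> cpt.S <- matrix(c(0.6, 0.4), ncol = 2, dimnames = list(NULL, c("M", "F")))  # placeholder values
> cpt.E <- array(c(0.75, 0.25, 0.72, 0.28, 0.88, 0.12,
    0.64, 0.36, 0.70, 0.30, 0.90, 0.10), dim = c(2, 3, 2),
    dimnames = list(E = c("high", "uni"), A = c("young", "adult", "old"), S = c("M", "F")))
> cpt.R <- array(c(0.25, 0.75, 0.20, 0.80), dim = c(2, 2),
    dimnames = list(R = c("small", "big"), E = c("high", "uni")))
> dfit <- custom.fit(dag, dist = list(A = cpt.A, S = cpt.S, E = cpt.E, R = cpt.R))
> dfit$E   # the conditional probability table of E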
Gaussian BNs (GBNs)
• Assumptions
• Each node follows a normal distribution
• Root nodes are described by the respective marginal distributions
• The conditioning effect of the parent nodes is given by an additive linear term in the
mean, and does not affect the variance
• In other words, each node has a variance that is specific to that node and does not depend on
the values of the parents
• The local distribution of each node can be equivalently expressed as a Gaussian
linear model which includes an intercept and the node’s parents as explanatory
variables, without any interaction terms
• Based on these assumptions, the joint distribution of all nodes (global
distribution) is multivariate normal
Gaussian BNs (GBNs)
E ~ N(50, 10²)
V | G, E ~ N(−10.35 + 0.5G + 0.77E, 5²)
W | V ~ N(15 + 0.7V, 5²)
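• A minimal sketch of how these local Gaussian distributions could be encoded with custom.fit; the marginal of G, N(50, 10²), is a made-up placeholder since it is not shown on this slide:
> library(bnlearn)
> dag <- model2network("[E][G][V|E:G][W|V]")
> dist <- list(E = list(coef = c("(Intercept)" = 50), sd = 10),
    G = list(coef = c("(Intercept)" = 50), sd = 10),   # placeholder marginal for G
    V = list(coef = c("(Intercept)" = -10.35, G = 0.5, E = 0.77), sd = 5),
    W = list(coef = c("(Intercept)" = 15, V = 0.7), sd = 5))
> gbn <- custom.fit(dag, dist = dist)
> gbn$V   # the local Gaussian linear model of V given G and E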
Hybrid BNs
• Contains both discrete and continuous variables
• One common class of hybrid BNs is conditional Gaussian BNs
  • Continuous variables cannot be parents of discrete variables
  • The Gaussian distribution of continuous variables is conditional on the
  configuration of its discrete parent(s)
  • In other words, the variable can have a unique linear model (i.e. mean,
  variance) for each configuration of its discrete parent(s)
• Ex. local distributions from different families:
  CL ~ Beta(3, 1)
  G1 | PR, CL ~ Pois(CL · g(PR))
  TR | G1 ~ Ber(logit⁻¹[(G1 − 5)/2.5])
Comparison of BNs
• Discrete BNs
• Local probability distributions can be plotted using
the bn.fit.barchart function from bnlearn
• The iss argument, which adds a weighted prior during
parameter learning with the bn.fit function from
bnlearn, only works with discrete data
• Discretization can produce better BNs than misspecified
continuous distributions, at the cost of coarse
approximations of the conditional probabilities
• GBNs
• Perform better than hybrid BNs when few
observations are available
• Greater accuracy than discretization for continuous
variables
• Computationally more efficient than hybrid BNs
• Hybrid BNs
• Greater flexibility
• No dedicated R package
• No structure learning
Learning BNs
Structure learning
• All structure learning methods boil down to three approaches:
1) Constraint-based
2) Score-based
3) Hybrid-based
1) Constraint-based
• Constraint-based algorithms rely on conditional independence tests
• All modern algorithms first learn Markov blankets
• Simplifies the identification of neighbors and in turn reduces computational complexity
• Symmetry of Markov blankets also leveraged
• Discrete BNs
• Tests are functions of observed frequencies
• GBNs
• Tests are functions of partial correlation coefficients
• For both cases:
• We are checking the independence of two sets of variables given a third set
• Null hypothesis is conditional independence
• Test statistics are utilized
• Functions in bnlearn include gs, iamb, fast.iamb, inter.iamb, mmpc, and
si.hiton.pc
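• A minimal sketch of constraint-based learning, using the learning.test data set shipped with bnlearn (not the survey data):
> library(bnlearn)
> data(learning.test)
> pdag <- gs(learning.test)          # Grow-Shrink: learns Markov blankets first
> pdag
> iamb(learning.test, test = "mi")   # a different algorithm and conditional independence test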
2) Score-based
• Candidate BNs are assigned a goodness-of-fit “network score” that heuristic
algorithms then attempt to maximize
• Due to the difficulty of assigning scores, only two options are common:
1) BDe (discrete case) / BGe (continuous case)
2) BIC
• Larger values = better fit
• Classes of heuristic algorithms include greedy search, genetic algorithms, and simulated
annealing
• Functions in bnlearn include hc and tabu
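• A minimal sketch of score-based learning with hill climbing, again on the learning.test data shipped with bnlearn:
> library(bnlearn)
> data(learning.test)
> dag.hc <- hc(learning.test, score = "bde", iss = 10)   # greedy search maximizing the BDe score
> score(dag.hc, learning.test, type = "bic")             # compare against the BIC score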
3) Hybrid-based
• Combine constraint-based and score-based algorithms to offset respective
weaknesses
• Two steps:
1) Restrict
• Constraint-based algorithms are utilized to reduce the set of candidate DAGs
2) Maximize
• Score-based algorithms are utilized to find the optimal DAG from the reduced set
• Functions in bnlearn include mmhc and rsmax2 where for rsmax2 you can specify
your own combinations of restrict and maximize algorithms
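• A minimal sketch of hybrid learning, again on learning.test:
> library(bnlearn)
> data(learning.test)
> dag.mmhc <- mmhc(learning.test)   # restrict with mmpc, then maximize with hill climbing
> dag.rs <- rsmax2(learning.test, restrict = "si.hiton.pc", maximize = "tabu")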
Parameter learning
• Once the structure of a DAG has been determined, the parameters
can be determined as well
• Two most common approaches are maximum likelihood estimation
and Bayesian estimation (not available for GBNs in bnlearn)
• Parameter estimates are based only on the subset of data spanning
the considered variable and its parents
• The bn.fit function from bnlearn will automatically determine the
type of data and fit parameters
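• A minimal sketch of parameter learning on learning.test, after a structure has been learned:
> library(bnlearn)
> data(learning.test)
> dag <- hc(learning.test)
> fit.mle <- bn.fit(dag, learning.test)                                # maximum likelihood (default)
> fit.bayes <- bn.fit(dag, learning.test, method = "bayes", iss = 10)  # Bayesian posterior estimates
> fit.mle$A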
Notes on learning
• Three learning techniques:
1) unsupervised, i.e. from the data set
2) supervised, i.e. from experts in the field of the phenomenon being studied
3) a combination of both
• The arguments blacklist and whitelist can be specified in structure learning
functions to force the absence and presence of specific edges, respectively (see the sketch after this list)
• For GBNs, you can easily replace parameter estimates with your own
regression fit
• Ex. The penalized package in R can be used to perform ridge, lasso, or elastic net
regression for biased coefficient estimates
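• A minimal sketch of blacklisting/whitelisting on learning.test; the arcs constrained here are arbitrary examples:
> library(bnlearn)
> data(learning.test)
> wl <- data.frame(from = "A", to = "B")                  # force the arc A -> B to be present
> bl <- data.frame(from = c("B", "C"), to = c("A", "A"))  # forbid arcs into A from B and C
> dag <- hc(learning.test, whitelist = wl, blacklist = bl)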
Using BNs
Querying
• Once a BN has been constructed, it can be used
• The term query is derived from computer science terminology and
means to ask questions
• Two main types of queries:
1) conditional independence
• Uses only the DAG structure to explain how variables are associated with one
another, i.e. d-separation
2) inference, a.k.a. probabilistic reasoning or belief updating
• Uses the local distributions
2) Inference
• Investigates the distribution of one or more variables under non-trivial
conditioning
• Variable(s) being conditioned on are the new evidence
• The probability of the variable(s) of interest are then re-evaluated
• Works in the framework of Bayesian statistics because it focuses on the
computation of posterior probabilities or densities
• Based on the basic principle of modifying the joint distributions of nodes to
incorporate a new piece of information
• Uses the fundamental properties of BNs in that only local distributions are
considered when computing posterior probabilities to reduce dimensionality
• The network structure and distributional assumptions of a BN are treated
as fixed when performing inference
Types of evidence
• Hard evidence
• Instantiation of one or more variables in the network
• Soft evidence
• New distribution for one or more variables in the network, i.e. a new set of
parameters
Types of queries
• Conditional probability
• Interested in the marginal posterior probability distribution of variables given
evidence on other variables
• Most likely outcome (a.k.a. maximum a posteriori)
• Interested in finding the configuration of the variables that have the highest
posterior probability (discrete) or maximum posterior density (continuous)
Types of inference
• Exact inference
• Repeated applications of Bayes’ theorem with local computations to obtain
exact probability values
• Feasible only for small or very simple graphs
• Approximate inference
• Monte Carlo simulations are used to sample from the global distribution and
thus estimate probability values
• Several approaches can be used for both random sampling and weighting
• There are functions in bnlearn to generate random observations and
calculate probability distributions given evidence using these
techniques
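• A minimal sketch of approximate inference with cpquery and cpdist, again on the learning.test data:
> library(bnlearn)
> data(learning.test)
> fitted <- bn.fit(hc(learning.test), learning.test)
> cpquery(fitted, event = (B == "b"), evidence = (A == "a"))   # estimate P(B = b | A = a)
> head(cpdist(fitted, nodes = "C", evidence = (A == "a")))     # random draws of C given A = a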
R packages for BNs
R packages
• Two categories:
1) those that implement structure and parameter learning
2) those that focus on parameter learning and inference
• Some packages of note:
• bnlearn (developed by the authors)
• deal
• Can handle conditional Gaussian BNs
• pcalg
• Focuses on causal inference (implements the PC algorithm)
• Other packages include catnet, gRbase, gRain, and rbmn
• Some of these packages augment bnlearn
Questions?
Case study: protein signaling
network
Overview
• Analysis published in Sachs, K., Perez, O., Pe'er, D., Lauffenburger,
D.A., Nolan, G.P. (2005). Causal Protein-Signaling Networks Derived
from Multiparameter Single-Cell Data. Science, 308(5721):523-529.
• Hypothesis: Machine learning for the automated derivation of a
protein signaling network will elucidate many of the traditionally
reported signaling relationships and predict novel causal pathways
• Methods
• Measure concentrations of pathway molecules in primary immune system
cells
• Perturbation experiments to confirm causality
The data
> sachs <- read.table("sachs.data.txt", header = TRUE)
> head(sachs)
   Raf   Mek   Plcg  PIP2  PIP3  Erk   Akt   PKA  PKC   P38   Jnk
1  26.4  13.2  8.82  18.3  58.8  6.61  17    414  17    44.9  40
2  35.9  16.5  12.3  16.8  8.13  18.6  32.5  352  3.37  16.5  61.5
3  59.4  44.1  14.6  10.2  13    14.9  32.5  403  11.4  31.9  19.5
4  73    82.8  23.1  13.5  1.29  5.83  11.8  528  13.7  28.6  23.1
5  33.7  19.8  5.19  9.73  24.8  21.1  46.1  305  4.66  25.7  81.3
6  18.8  3.75  17.6  22.1  10.9  11.9  25.7  610  13.7  49.1  57.8
• Continuous data
Data exploration
• Violations of the assumptions of GBNs
  • Highly skewed
  • Concentrations cluster around 0
  • Nonlinear correlations
  • Difficult for accurate structure learning
• What can we do?
  • Data transformations (log)
  • Hybrid network: specify an appropriate conditional distribution for each variable
    • Requires extensive prior knowledge of the signaling pathway
  • Discretize
[Figure: densities of Mek, P38, PIP2, and PIP3 along with normal distribution curves]
[Figure: concentration of PKA vs. concentration of PKC along with the fitted regression line]
Discretizing the data
• Information-preserving discretization algorithm introduced by
Hartemink (2001)
1) Discretizes each variable into a large number of intervals
• idisc argument = type of intervals
• ibreaks argument = number of intervals
2) Iterates over the variables and collapses, for each of them, the pair of
adjacent intervals that minimizes the loss of pairwise mutual information
• Basically does its best to reflect the dependence structure of the original data
> dsachs <- discretize(sachs, method = "hartemink", breaks = 3,
ibreaks = 60, idisc = "quantile")
• breaks = number of desired levels (“low”, “medium”, and “high”
concentrations)
Model averaging
• The quality of the structure learned from the data can be improved by
averaging multiple CPDAGs
• Bootstrap resampling as described in Friedman et al. (1999)
• "Perturb" the data
• The frequencies of the edges and of their directions are used as confidence measures
> boot <- boot.strength(dsachs, R = 500, algorithm = "hc",
algorithm.args = list(score = "bde", iss = 10))
• R = number of network structures
Model averaging results
> boot[boot$strength > 0.85 & boot$direction >= 0.5, ]
     from  to    strength  direction
1    Raf   Mek   1         0.518
23   Plcg  PIP2  1         0.509
24   Plcg  PIP3  1         0.519
34   PIP2  PIP3  1         0.508
56   Erk   Akt   1         0.559
57   Erk   PKA   0.984     0.568089
67   Akt   PKA   1         0.566
89   PKC   P38   1         0.508
90   PKC   Jnk   1         0.509
100  P38   Jnk   0.95      0.505263
• strength = frequency of the edge
• direction = frequency of the edge direction conditional on the edge's presence
• Many score-equivalent edges
  • This means the directions are not well established
> avg.boot <- averaged.network(boot, threshold = 0.85)
Note: your numbers may differ since no seed was set, but you should still have the same edges passing the threshold
The network
> avg.boot
• Network learned from the
discretized, observational data
• Since we are not confident in the
directions of any of the edges, we
remove them by constructing the
skeleton
> avg.boot <- skeleton(avg.boot)