:
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Bayesian Graphical models
One graph, two, and two hundred million...
Yuan Ji
Center for Biomedical Research Informatics
NorthShore University HealthSystem
Department of Health Studies
The University of Chicago
June 12, 2014
June 12, 2014 1
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Collaborator & References
Collaborators
UT Austin
U. of Louisville
U. of Chicago
U. of Chicago
Georgia Tech.
Yanxun Xu (postdoc), Peter Müller (Prof)
Riten Mitra (Assist. Prof)
Lorenzo Pesce and Computational Institute (HPC)
Kevin White and IGSB
Peng Qiu (Assist. Prof)
References
• ONE GRAPH :
Mitra et al. (2013, JASA); Telesca et al. (2012, JASA)
• DIFFERENTIAL GRAPHS :
Mitra et al. (2014a; 2014b) – multiplicity control
• TWO MILLION GRAPHS :
Zhu et al. (2014a, Nature Methods) ; Zhu et al. (2014b,
Submitted, PNAS.)
Website www.compgenome.org
:
June 12, 2014 2
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
ONE GRAPH
(Mitra et al., 2013; Telesca et al., 2013)
Single Graph:
June 12, 2014 3
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Bayesian Graphical Model – An overview
A class of Bayesian graphical hierarchical models
Bayesian paradigm:
Prior Pathways G0 + Data
Pathways G
| {z } → Posterior
|
{z
}
|
{z
}
Likelihood
Graphical prior
Posterior knowledge
Single Graph:
June 12, 2014 4
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Bayesian Graphical Model – An overview
A class of Bayesian graphical hierarchical models
Bayesian paradigm:
Prior Pathways G0 + Data
Pathways G
| {z } → Posterior
|
{z
}
|
{z
}
Likelihood
Graphical prior
Posterior knowledge
Graph is random Posterior distributions on graphs
False discovery control FDR is feasible based on posterior
probabilities.
Prior graph can be incorporated (e.g., consensus network from
KEGG, GeneGO, Ingenuity...)
Single Graph:
June 12, 2014 4
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
General structure
Bayesian paradigm:
Prior Pathways G0 + Data
Pathways G
| {z } → Posterior
|
{z
}
|
{z
}
Likelihood
Graphical prior
Posterior knowledge
Single Graph:
June 12, 2014 5
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
General structure
Bayesian paradigm:
Prior Pathways G0 + Data
Pathways G
| {z } → Posterior
|
{z
}
|
{z
}
Likelihood
Graphical prior
Posterior knowledge
Notation:
Y : observed expression data ygt , feature g , sample t
e: latent indicators egt for under-, over- and normal
expression
G: dependence structure (conditional independence)
c: strength of dependence
θ: other nuisance parameters
Model:
p(Y | θ, e)
p(e | G, c) p(c | G) p(G)
sampling model latent trinary dependence
and priors p(θ).
Single Graph:
June 12, 2014 5
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Probability Model – 1. Priors on random graph p(G)
Let G = (V , E ) denote a graph
E : set of nodes in the graph (features)
V : set of edges between pairs of nodes (edges between features)
Single Graph:
June 12, 2014 6
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Probability Model – 1. Priors on random graph p(G)
Let G = (V , E ) denote a graph
E : set of nodes in the graph (features)
V : set of edges between pairs of nodes (edges between features)
Prior on G
• Informative prior around G0 (consensus protein network):
p(G ) ∝ τ d(G ,G0 )
• Can deal with a graph with moderate size (say, 50 nodes)
• Need to have strong prior belief in G0
• Example: Cellular protein signaling pathways (Telesca et al.,
2012); Zodiac (Zhu et al., 2014)
Single Graph:
June 12, 2014 6
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Probability Model – 1. Priors on random graph p(G)
Let G = (V , E ) denote a graph
E : set of nodes in the graph (features)
V : set of edges between pairs of nodes (edges between features)
Prior on G
• Informative prior around G0 (consensus protein network):
p(G ) ∝ τ d(G ,G0 )
• Can deal with a graph with moderate size (say, 50 nodes)
• Need to have strong prior belief in G0
• Example: Cellular protein signaling pathways (Telesca et al.,
2012); Zodiac (Zhu et al., 2014)
• Vague prior when a prior network is not known: p(G) ∝ const
• Feasible only for graphs with relatively small size (e.g., 15
nodes), see Dobra et al. (2005)
• For histone modifications, little prior knowledge is known
about their dependence (Mitra et al. 2013)
Single Graph:
June 12, 2014 6
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Probability Model – 2. Joint prior of features presence given
the graph p(e | β, G)
Presence of features : Define {eit = 1} the presence indicator of
feature i in location t.
Joint distribution of e given G and β is defined as p(e | β, G).
Besag (1974) shows that any joint p(e | β, G ) can be written as
p(e | β, G ) = p(0 | β, G )
X
X
X
× exp
βi ei +
βij ei ej +
βijk ei ej ek + . . . + β1···m e1 · · · em ,
i
i<j
i<j<k
(1)
where βi1 ···ik is zero if and only if nodes i1 , . . . , ik do not form a
clique in the graph G .
Single Graph:
June 12, 2014 7
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Clique
A clique is a set of nodes of which all pairs in the set are connected.
Single Graph:
!"#$%$&'()*+$
,$&'()*+$
June 12, 2014 8
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Probability Model – 3. Sampling model p(y | e)
We model yit as random variable from a mixture distribution of
Poisson and Log-normals.
(
Poi(λi ) I (yit < ci )
eit = 0
p(yit | eit ) ∝
2
2
πi LN(µ1i , σ1i ) + (1 − πi )LN(µ2i , σ2i ) eit = 1
Single Graph:
(2)
June 12, 2014 9
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Probability Model – 3. Sampling model p(y | e)
We model yit as random variable from a mixture distribution of
Poisson and Log-normals.
(
Poi(λi ) I (yit < ci )
eit = 0
p(yit | eit ) ∝
2
2
πi LN(µ1i , σ1i ) + (1 − πi )LN(µ2i , σ2i ) eit = 1
(2)
The Poisson/log-normal mixture can be further replaced by
introducing a trinary indicator zit ∈ {−1, 0, 1} with
p(zit | eit = 0) = δ−1 (zit ) and
p(zit | eit = 1) = πi δ0 (zit ) + (1 − πi )δ1 (zit ). Then
Poi(λi ) I (yit < ci ) zit = −1
2)
p(yit | eit ) = LN(µ1i , σ1i
zit = 0
2
LN(µ2i , σ2i )
zit = 1
(3)
I
Single Graph:
June 12, 2014 9
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
A toy fit of the mixture model
0.4
0.0
0.2
Density of the Mixture p(y)
0.6
0.8
Histogram of the positive histone counts with density estimate
5
10
15
20
Histone Counts
Figure: Fit of a Poisson/lognormal mixture model to the count data of a
feature. The red (peaked) curve is the density of
0.5 × Pois(1)I (yit < 2) + 0.3 × LN(1, 0.4) + 0.2 × LN(2, 0.6). The June 12, 2014 10
Single Graph:
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Joint Posterior
Let θ be the parameter vector for the sampling model.
The joint posterior is given by
p(Y , z, e, θ, G ) ∝ p(Y | z, θ) p(z | e, θ) p(e | β, G ) p(θ) p(β | G ) p(G )
| {z }
| {z }
(3)
(1)
(4)
Single Graph:
June 12, 2014 11
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
MCMC and posterior inference
Posterior MCMC simulation proceeds by iterating over the following
transition probabilities:
[e | G , β, θ, Y ], [z | e, θ, Y ], [θ | z, Y ], [β | e, G ], [G | β, e]
Single Graph:
June 12, 2014 12
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
MCMC and posterior inference
Posterior MCMC simulation proceeds by iterating over the following
transition probabilities:
[e | G , β, θ, Y ], [z | e, θ, Y ], [θ | z, Y ], [β | e, G ], [G | β, e]
• Updating β and G involves evaluating
c(β, G ) = 1/p(0 | β, G ) =
X
e
exp
X
i
βi ei +
X
i<j
βij (ei − νi )(ej − νj )
(5)
Single Graph:
June 12, 2014 12
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
MCMC and posterior inference
Posterior MCMC simulation proceeds by iterating over the following
transition probabilities:
[e | G , β, θ, Y ], [z | e, θ, Y ], [θ | z, Y ], [β | e, G ], [G | β, e]
• Updating β and G involves evaluating
c(β, G ) = 1/p(0 | β, G ) =
X
e
exp
X
i
βi ei +
X
i<j
βij (ei − νi )(ej − νj )
(5)
step 1 Importance sampling to updated β (Chen and Shao, 1997;
Che, Shao and Ibrahim, 2000)
• Approximate the M-H ratio by importance sampling
step 2 With step 1 and reversible jump, updating G .
Single Graph:
June 12, 2014 12
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
CHIP-Seq Example
ChIP-Seq experiment for CD4 T Lymphocytes (Barski et al, 2007;
Wang et al., 2008)
HM count data [yit ] with 50,000 selected locations and 39 types of
HMs.
Posterior inference is based on P̂ij , the posterior probability of
including an edge {i, j}.
1. Edge selection is based on posterior expected FDR
to determine a cutoff c
P
i j [(1 − P̂ij )I (P̂ij > c)]
FDRc =
,
P
i,j I (P̂ij > c)
Single Graph:
so that edges with P̂ij > c are selected.
2. Type of interaction is based on
Pr (βij | βij 6= 0, y ) > 0.5
• Yes: positive
• No: n egative
June 12, 2014 13
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Results – 1: Point Estimate
17
3
7
16
5
12
15
1
6
11
2
9
8
4
10
13
14
Posterior inference for the ChIP-Seq data on 17 HMs under a uniform prior p(G ). The thickness of the
edges indicate the strength of the relationship and is a function of the posterior inclusion probabilities
P̂ij .
Single Graph:
June 12, 2014 14
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Results – 2: Variability Estimate
7
7
12
12
1
1
6
6
8
8
13
13
(a) 55
(b) 15
7
7
12
12
1
1
6
8
6
8
13
(c) 13
13
(d) 12
The four most frequent configurations (a through d) of a subgraph consisting of 4 edges. The posterior
probabilities (in percent) are given below each subgraph.
Single Graph:
June 12, 2014 15
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
DIFFERENTIAL GRAPHs (> 2 graphs)
(Mitra, Müller, Ji, 2014a; 2014b)
Two or More Graphs:
June 12, 2014 16
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Differential Networks of
Assume an informative prior graph G0 . Inference on two graphs G 1
and G 2 . Define δij = |Gij2 − Gij1 |.
G 1 | G0 ∼ U(G0 )
δij ∼ Ber(π), i < j
π ∼ Beta(a, b).
(6)
Together G 1 and δ implicitly define G 2 by
Gij2 = Gij1 (1 − δij ) + (1 − Gij1 )δij
for all edges {i, j} ∈ E0 .
We refer to (6) as the differential graph model , and refer to π as
the global probability of similarity.
Two or More Graphs:
June 12, 2014 17
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Differential Networks of
Assume an informative prior graph G0 . Inference on two graphs G 1
and G 2 . Define δij = |Gij2 − Gij1 |.
G 1 | G0 ∼ U(G0 )
δij ∼ Ber(π), i < j
π ∼ Beta(a, b).
(6)
Together G 1 and δ implicitly define G 2 by
Gij2 = Gij1 (1 − δij ) + (1 − Gij1 )δij
for all edges {i, j} ∈ E0 .
We refer to (6) as the differential graph model , and refer to π as
the global probability of similarity.
When informative prior graph G0 is not available, we assume
instead G 1 ∼ Um (V ), where Um (V ) denotes a uniform distribution
over all subgraphs G = (V , E ) of size |E | = m.
Two or More Graphs:
June 12, 2014 17
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Differential graphs
17
3
17
3
7
7
5
12
7
16
5
12
5
12
15
15
15
1
1
6
11
2
17
3
16
16
6
11
9
8
1
9
8
4
4
10
13
14
(a) Promoters (G 1 )
6
11
2
10
13
14
(b) Insulators (G 2 )
2
9
8
4
10
13
14
(c) Differences δij = |Gij1 − Gij2 |
Figure: Panels (a) through (c) show posterior estimated networks in two
regulatory regions and the posterior estimated differences between them.
The solid lines denote the edges present in promoters, but not in
insulators while dotted lines represent edges in insulators but not in
promoters.
Two or More Graphs:
June 12, 2014 18
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Extension to > 2 graphs
• A latent “baseline” graph G0 ;
• Multiple graph model: For graph G k , k = 1, 2, ...K ,
p(Gijk = 1 | Gij0 = 1) = p11
Two or More Graphs:
and p(Gijk = 1 | Gij0 = 0) = p10
p11 , p10 ∼ Beta(a1 , b1 )
p(Gij0 = 1) = p0 ;
p0 ∼ Beta(a0 , b0 )
June 12, 2014 19
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
200000000 GRAPHS
(Zhu et al., 2014a; 2014b)
Two hundred million graphs:
June 12, 2014 20
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Biological goal
Understand genetic interactions in cancer between different
genomics features of different genes
Two hundred million graphs:
June 12, 2014 21
Single Graph
Two or More Graphs
Two hundred million graphs:
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Zodiac: Blueprint
June 12, 2014 22
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
The Cancer Genome Atlas (TCGA)
• An NIH/NCI pilot project (cancergenome.nih.gov), cost
about $ 1 billion
• Multiple modalities (DNA, RNA, protein), whole genome,
multiple cancer types (>25), MATCHED samples!
• Data production near completion.
Two hundred million graphs:
June 12, 2014 23
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
TCGA-Assembler Retrieves Level-3 TCGA data
Two hundred million graphs:
June 12, 2014 24
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
TCGA-Assembler Produces Mega-Data
Two hundred million graphs:
June 12, 2014 25
Single Graph
Two or More Graphs
Two hundred million graphs:
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Bayesian Graphical Models
June 12, 2014 26
Single Graph
Two or More Graphs
Two hundred million graphs:
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
June 12, 2014 27
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA:
Bioinformatics – Big Data Analysis of TCGA
June 12, 2014 28
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA:
Bioinformatics – Big Data Analysis of TCGA
June 12, 2014 29
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
www.compgenome.org
Bioinformatics – Big Data Analysis of TCGA:
June 12, 2014 30
Single Graph
Two or More Graphs
Two hundred million graphs
Bioinformatics – Big Data Analysis of TCGA
Thank you!
Zodiac 2 – to be continued...
Bioinformatics – Big Data Analysis of TCGA:
June 12, 2014 31
© Copyright 2026 Paperzz