Count Matrix Factorization
Dimension reduction and data visualization
Ghislain DURIF
LBBE UMR 5558 – Université Claude Bernard Lyon 1
27th January 2016
ABS4NGS
Outline

1. Introduction
2. Count matrix factorization
3. Optimization
4. Conclusion
Introduction
Sequencing and count data
Microarray data (late 1990s): luminescence technologies quantify the relative
abundance of targeted nucleic sequences in a sample
→ continuous signal

[Image: "Microarray2" by Paphrag at English Wikipedia]
NGS data (since the mid-2000s): the abundance of each nucleic sequence is
determined by the number of reads mapping to a position in the genome
→ counts (non-negative integers)
NGS data analysis
Issues:
- High-dimensional data: require specific methods
- Counts: transforming the data to reuse methods developed for microarray data
  (continuous and reasonably Gaussian) is inappropriate (Bullard et al., 2010;
  Lee et al., 2013)
Reduce the dimension by compression, with an objective of visualization and/or
clustering: summarize the information contained within X and expose latent or
hidden structures.
[Figure: PCA on single-cell expression profiles, explained variance: 22.5% (PC1)
and 5.8% (PC2), Buganim et al. (2012)]
Count matrix factorization
Compression
Dimension reduction by compression
NGS data → count data in high dimension (examples: expression profiles, RNA-seq)

Issue: performing dimension reduction on non-Gaussian data; standard versus
data-specific approaches for zero-inflated single-cell data (Pierson et al., 2015)

→ compression on count data: a “Poissonian PCA”
Matrix factorization
X_{n×p} ≈ U_{n×K} V_{p×K}^T

Reduce the dimension by constructing K components or factors, with K ≪ p

Latent structure:
- U_{n×K}: coordinates or scores of the observations in the subspace
- V_{p×K}: contributions of the variables to the new axes
Main question: how to define the approximation “≈”?
First example: principal component analysis (PCA)
Based on the singular value decomposition (SVD): X = UΣV^T
Factorization of the matrix X (centered)
PCA scores: plot U_{·,1} versus U_{·,2}
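For concreteness, a minimal numpy sketch (ours; the toy data and all names are
illustrative) of PCA scores computed through the SVD of the centered matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.poisson(5.0, size=(100, 50)).astype(float)   # toy count matrix

    Xc = X - X.mean(axis=0)                  # center each variable (column)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    K = 2
    scores = U[:, :K] * s[:K]                # scores: U_{.,1:K} scaled by singular values
    assert np.allclose(scores, Xc @ Vt[:K].T)
    # plot scores[:, 0] versus scores[:, 1] for the PCA score plot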
Statistical model associated with PCA
The data are supposed to be Gaussian:

X_ij ∼ N( ∑_{k=1}^K u_ik v_jk , σ² ),  i.e.  X ∼ N(UV^T, Σ²)  with Σ² = diag(σ², …, σ²)

Maximizing the log-likelihood amounts to

argmin_{rk(UV^T) = K} ∑_{i=1}^n ∑_{j=1}^p ( x_ij − ∑_{k=1}^K u_ik v_jk )²

which corresponds to the SVD and therefore to PCA (Eckart and Young, 1936)
To remember
The statistical model associated with PCA assumes the data are Gaussian
→ not suitable for the compression of count data
Probabilistic PCA
(P_θ)_θ is a set of distributions in the exponential family

X_ij is supposed to follow the distribution P_{θ_ij} with parameter θ_ij

The matrix Ω_{n×p} gathers the parameters associated with the matrix X

Considering the X_ij's to be independent, the log-likelihood is:

log L = log P(X | Ω) = ∑_{i=1}^n ∑_{j=1}^p log P(x_ij | θ_ij)
Generalized PCA (Collins et al., 2001)
Compression on the parameter matrix: Ω = UV^T

Therefore θ_ij = ∑_{k=1}^K u_ik v_jk , hence the objective function:

argmin_{Ω = UV^T, rk(Ω) = K} − ∑_{i=1}^n ∑_{j=1}^p log P(x_ij | u_i, v_j)    (1)

with constraints on U and V so that Ω lies in the “good” space
Gamma-Poisson model
Poisson model
Suitable for count data (Srivastava and Chen, 2010; Witten, 2011)

X_ij (the number of reads for individual i and gene j) follows a Poisson
distribution with rate λ_ij:

X_ij ∼ Poisson(λ_ij)
Non-negative Matrix Factorization (NMF) by Lee and Seung (1999)
Probabilistic PCA with Poisson distribution
Factorization: Λ_{n×p} = [λ_ij]_ij = UV^T, i.e. λ_ij = ∑_{k=1}^K u_ik v_jk = u_i^T v_j

Log-likelihood:

L(X; U, V) = ∑_{i=1}^n ∑_{j=1}^p [ x_ij log(u_i^T v_j) − u_i^T v_j ]

Optimization under non-negativity constraints on the entries of U and V, as
sketched below
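A compact sketch (ours) of the classical multiplicative updates for NMF under the
Poisson/Kullback-Leibler loss (Lee and Seung, 2001), which preserve non-negativity
by construction:

    import numpy as np

    def poisson_nmf(X, K, n_iter=200, eps=1e-10, seed=0):
        """NMF with Poisson/KL loss via multiplicative updates (Lee and Seung, 2001).
        Returns U (n x K) and V (p x K) such that X ≈ U V^T."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        U = rng.random((n, K)) + eps
        V = rng.random((p, K)) + eps
        for _ in range(n_iter):
            R = X / (U @ V.T + eps)               # elementwise x_ij / lambda_ij
            U *= (R @ V) / (V.sum(axis=0) + eps)  # update scores
            R = X / (U @ V.T + eps)
            V *= (R.T @ U) / (U.sum(axis=0) + eps)  # update loadings
        return U, V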
Gamma-Poisson model
Limits of the Poisson model:
- Very constraining: a single parameter fixes the moments of order 1 and 2
  → lack of flexibility
- In particular, NGS data are often over-dispersed (i.e. variance > expectation)
  → the negative-binomial distribution is more appropriate (Anders and Huber, 2010;
  Bonafede et al., 2014)
Gamma-Poisson model
Bayesian approach: add a prior distribution on the Poisson parameter λ:

λ ∼ Gamma(α_1, α_2)
X | λ ∼ Poisson(λ)

with α_1, α_2 > 0 the parameters of the gamma distribution

Marginal distribution of the counts:

X ∼ NegBin( α_1 , α_2 / (α_2 + 1) )
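A quick simulation check (ours) that the Gamma-Poisson mixture has the stated
negative-binomial marginal; note that scipy's nbinom is parametrized by
(size, success probability):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    a1, a2 = 3.0, 0.5                         # gamma shape and rate
    lam = rng.gamma(a1, 1/a2, size=200_000)   # numpy gamma uses (shape, scale)
    x = rng.poisson(lam)                      # Gamma-Poisson mixture

    nb = stats.nbinom(a1, a2 / (a2 + 1))      # NegBin(alpha_1, alpha_2/(alpha_2+1))
    print(x.mean(), nb.mean())                # both ≈ a1/a2 = 6
    print(x.var(), nb.var())                  # both ≈ a1*(a2+1)/a2**2 = 18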
“Gamma-Poisson model with latent factors”
Conditional Poisson model:

X_ij | (U_ik, V_jk)_{k=1,…,K} ∼ Poisson(λ_ij)

Linear dependence of the Poisson parameters on the latent components:

λ_ij = ∑_{k=1}^K u_ik v_jk ,  i.e.  X_{n×p} | U_{n×K}, V_{p×K} ∼ Poisson(UV^T)

Priors on the hidden factors:

U_ik ∼ Gamma(α_{k,1}, α_{k,2})
V_jk ∼ Gamma(β_{k,1}, β_{k,2})
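Simulating from this model is straightforward; a sketch (ours), with arbitrary
hyperparameter values chosen for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, K = 100, 500, 5
    a1, a2 = 2.0, 1.0                       # gamma shape/rate for U (illustrative)
    b1, b2 = 2.0, 1.0                       # gamma shape/rate for V (illustrative)

    U = rng.gamma(a1, 1/a2, size=(n, K))    # U_ik ~ Gamma(a1, a2)
    V = rng.gamma(b1, 1/b2, size=(p, K))    # V_jk ~ Gamma(b1, b2)
    X = rng.poisson(U @ V.T)                # X_ij | U, V ~ Poisson(sum_k u_ik v_jk)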
Optimization
Strategies
Compression on counts: two families of strategies:

- “Optimization under constraints” approach → NMF
- “Model-based” approaches → more flexible and richer (over-dispersion,
  zero-inflation), but the EM algorithm runs into optimization issues
  → variational inference or MCMC
Target

Objective: from the complete likelihood, derive the posterior and then E[U | X]
(and E[V | X]); the output defines the components.

Two routes:
- Variational inference: E[U | X] approximated through a factorizable
  distribution q(…); output: approximate distribution and its moments
- MCMC: draw from the posterior to estimate E[U | X]; computationally expensive
Variational inference
Variational distribution
Approximate posterior (Hoffman et al., 2013)

E_X[log p(U, V | X)] is not tractable, hence p(U, V | X) is approximated by a
factorizable distribution q(U, V), called the variational distribution.

Approximation in the sense of the Kullback-Leibler divergence:

q(U, V) = argmin_{q̃} KL( q̃(U, V) ‖ p(U, V | X) )

→ the KL divergence quantifies the “proximity” between two probability distributions
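For intuition about the divergence being minimized, a small sketch (ours) of the
closed-form KL between two gamma distributions (shape/rate parametrization),
checked by Monte Carlo:

    import numpy as np
    from scipy.special import gammaln, digamma

    def kl_gamma(a1, b1, a2, b2):
        """KL( Gamma(a1, b1) || Gamma(a2, b2) ), shape/rate parametrization."""
        return ((a1 - a2) * digamma(a1) - gammaln(a1) + gammaln(a2)
                + a2 * (np.log(b1) - np.log(b2)) + a1 * (b2 - b1) / b1)

    rng = np.random.default_rng(0)
    a1, b1, a2, b2 = 3.0, 2.0, 1.5, 4.0
    x = rng.gamma(a1, 1/b1, size=500_000)           # samples from Gamma(a1, b1)
    log_ratio = ((a1 - a2) * np.log(x) - (b1 - b2) * x
                 + a1*np.log(b1) - gammaln(a1) - a2*np.log(b2) + gammaln(a2))
    print(log_ratio.mean(), kl_gamma(a1, b1, a2, b2))   # should agree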
Lower bound on marginal log-likelihood
Evidence Lower Bound (ELBO):

log p(X) = log ∫ p(X, U, V) dU dV
         = log ∫ q(U, V) [ p(X, U, V) / q(U, V) ] dU dV
         = log E_q[ p(X, U, V) / q(U, V) ]
         ≥ E_q[ log( p(X, U, V) / q(U, V) ) ]   (by Jensen's inequality)

and the right-hand side is the bound to maximize.

Objective to optimize (Hoffman et al., 2013):

J(q) = E_q[log p(X, U, V)] − E_q[log q(U, V)]
     = log p(X) + E_q[log p(U, V | X)] − E_q[log q(U, V)]
     = log p(X) − KL( q(U, V) ‖ p(U, V | X) )

where log p(X) is constant in q, so maximizing J(q) is equivalent to minimizing
KL( q(U, V) ‖ p(U, V | X) )
Constraints on the distribution q
“Mean-field variational family” (Hoffman et al., 2013)

q is assumed to factorize, each factor lying in the “proper” exponential family:

q(U, V) = ∏_{i=1}^n ∏_{k=1}^K q(u_ik | φ_ik) × ∏_{j=1}^p ∏_{k=1}^K q(v_jk | θ_jk)

Each q(u_ik | φ_ik) (resp. q(v_jk | θ_jk)) lies in the same exponential family as
the prior on U_ik (resp. V_jk)

Parametrization in the exponential family:

Prior:  p(u_ik | α_k) = h(u_ik) exp( α_k^T t(u_ik) − a_ℓ(α_k) )

Variational distribution:  q(u_ik | φ_ik) = h(u_ik) exp( φ_ik^T t(u_ik) − a_ℓ(φ_ik) )

The base measure h, the vector of sufficient statistics t and the log-normalizer
a_ℓ (written a_g for the V_jk's) are defined by the parametrization in the
exponential family
Conditionally conjugate model
Objective → maximize J(q) = E_q[log p(U, V, X)] − E_q[log q(U, V)]

Separate optimization (example with the φ_ik's):

J̃_q([φ_ik]_{n×K}) = ∑_{i,k} E_q[log p(u_ik | V, X)] − E_q[log q(u_ik | φ_ik)] + const

Complete conditional distribution

p(u_ik | V, X) is conjugate to the prior on U_ik:

p(u_ik | V, X) = h(u_ik) exp( η_ℓ(V, X)^T t(u_ik) − a_ℓ(η_ℓ(V, X)) )

η_ℓ(·) is the vector of parameters defined by the conjugacy relation (explicit in
the exponential family)

Similarly:

p(v_jk | U, X) = h(v_jk) exp( η_g(U, X)^T t(v_jk) − a_g(η_g(U, X)) )

h, t, a_ℓ (resp. a_g) are fixed by the prior on U_ik (resp. V_jk)
Coordinate ascent algorithm
Optimization

- Explicit criterion with respect to each parameter
- Closed-form stationary points where the gradient vanishes
- Iterative optimization through a fixed-point algorithm that reaches a local
  optimum (details in appendix), as sketched below

Extension: stochastic gradients (with sub-sampling) to improve the convergence speed:

θ^(t) = θ^(t−1) + ρ_t b_t(θ^(t−1))

where (ρ_t)_t is a step-size sequence that converges, but not too fast (in o(t^−1)
for instance), and b_t(θ^(t−1)) is a realization of a function B such that
E_q[B(θ)] = ∇J̃_q(θ)
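To make the fixed-point iteration concrete, a minimal sketch (ours) of
coordinate-ascent updates for the Gamma-Poisson factorization; it relies on the
classical latent-count augmentation x_ij = ∑_k z_ijk and is not necessarily the
author's exact derivation:

    import numpy as np
    from scipy.special import digamma

    def gap_cavi(X, K, a=(1.0, 1.0), b=(1.0, 1.0), n_iter=100, seed=0):
        """Coordinate-ascent variational inference for the Gamma-Poisson model,
        via the latent-count augmentation x_ij = sum_k z_ijk.
        Priors: U_ik ~ Gamma(a[0], a[1]), V_jk ~ Gamma(b[0], b[1]) (shape, rate);
        variational factors q(u_ik), q(v_jk) are gamma distributions."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        su = a[0] + rng.random((n, K)); ru = np.full(K, a[1])   # q(U) shapes/rates
        sv = b[0] + rng.random((p, K)); rv = np.full(K, b[1])   # q(V) shapes/rates
        for _ in range(n_iter):
            eu = np.exp(digamma(su) - np.log(ru))   # exp(E[log u_ik])
            ev = np.exp(digamma(sv) - np.log(rv))   # exp(E[log v_jk])
            # responsibilities phi_ijk ∝ exp(E[log u_ik] + E[log v_jk]) enter below
            su = a[0] + eu * ((X / (eu @ ev.T)) @ ev)
            ru = a[1] + (sv / rv).sum(axis=0)
            eu = np.exp(digamma(su) - np.log(ru))   # refresh after the U update
            sv = b[0] + ev * ((X / (eu @ ev.T)).T @ eu)
            rv = b[1] + (su / ru).sum(axis=0)
        return su / ru, sv / rv                     # posterior means E[U], E[V]

On counts simulated from the model above, the product of the returned means
recovers UV^T up to permutation and rescaling of the components.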
Results
Simulated data
Over-dispersed model for count data
Without zero-inflation
[Figure: log-likelihood (left) and normalized gap between successive estimates
(right) across iterations; simulations with n = 100, p = 50, K = 5]
KL divergence between Û V̂^T and UV^T

Symmetrized Kullback-Leibler divergence: symKL(A, B) = ½ ( KL(A‖B) + KL(B‖A) )

[Figure: symmetrized KL divergence between Û V̂^T and UV^T with the number of
components increasing from 1 to 10 (simulations with n = 100, p = 500, K = 5);
NMF1: ℓ2 loss, NMF2: Poisson loss]
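The slides do not spell out the divergence between two rate matrices; a natural
choice (assumed here) is the entrywise Poisson KL, summed over entries — a sketch:

    import numpy as np

    def poisson_kl(A, B):
        """Sum over entries of KL( Poisson(a_ij) || Poisson(b_ij) )."""
        return np.sum(A * np.log(A / B) - A + B)

    def sym_kl(A, B):
        return 0.5 * (poisson_kl(A, B) + poisson_kl(B, A))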
Explained variance by U_{n×K}

Percentage of variance explained by U:
trace( Var( U (U^T U)^{−1} U^T X ) ) / trace( Var(X) )

[Figure: variance explained by components 1 to 10 (simulations with n = 100,
p = 500, K = 5); SVD1: PCA on log(X + 1), SVD2: PCA on X]
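The criterion can be computed directly; a sketch (ours) following the slide's
formula, with the variance taken across observations (rows):

    import numpy as np

    def explained_variance(U, X):
        """trace(Var(U (U^T U)^{-1} U^T X)) / trace(Var(X))."""
        P = U @ np.linalg.solve(U.T @ U, U.T)        # projector onto span(U)
        num = np.trace(np.cov(P @ X, rowvar=False))  # variance of projected data
        den = np.trace(np.cov(X, rowvar=False))      # total variance
        return num / den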
Visualization
[Figure: component profiles after compression, i.e. columns of U (simulations
with n = 100, p = 500, K = 2)]
Conclusion
Interest of the model
Takes over-dispersion into account

Models the covariance structure and the dependencies within X

Principle of a “PCA on counts”: decomposition of the signal into a product of matrices
→ of interest for data visualization and clustering
Thanks for your attention
Bibliography I
Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data.
Genome biology, 11(10):R106.
Bonafede, E., Picard, F., and Viroli, C. (2014). Modelling overdispersion
heterogeneity in differential expression analysis using mixtures. pages 1–22.
Buganim, Y., Faddah, D. A., Cheng, A. W., Itskovich, E., Markoulaki, S., Ganz, K., Klemm,
S. L., van Oudenaarden, A., and Jaenisch, R. (2012). Single-Cell Expression Analyses during
Cellular Reprogramming Reveal an Early Stochastic and a Late Hierarchic Phase. Cell,
150(6):1209–1222.
Bullard, J. H., Purdom, E., Hansen, K. D., and Dudoit, S. (2010). Evaluation of statistical
methods for normalization and differential expression in mRNA-Seq experiments. BMC
bioinformatics, 11:94.
Collins, M., Dasgupta, S., and Schapire, R. E. (2001). A generalization of principal components
analysis to the exponential family. Advances in Neural Information Processing Systems, (1).
Eckart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank.
Psychometrika, I(3):211–218.
Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic Variational Inference.
Journal of Machine Learning Research, 14(2):1303–1347.
Bibliography II
Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix
factorization. Nature, 401(6755):788–91.
Lee, D. D. and Seung, H. S. (2001). Algorithms for Non-negative Matrix Factorization. In
Advances in Neural Information Processing Systems, volume 4029.
Lee, S., Chugh, P. E., Shen, H., Eberle, R., and Dittmer, D. P. (2013). Poisson factor models
with applications to non-normalized microRNA profiling. Bioinformatics (Oxford, England),
29(9):1105–11.
Maaten, L. V. D. and Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine
Learning Research, 9:2579–2605.
Pierson, E. and Yau, C. (2015). ZIFA: Dimensionality reduction for zero-inflated
single cell gene expression analysis. bioRxiv, pages 1–10.
Risso, D., Ngai, J., Speed, T. P., and Dudoit, S. (2014). Normalization of RNA-seq data using
factor analysis of control genes or samples. Nature Biotechnology, 32(9):1–10.
Srivastava, S. and Chen, L. (2010). A two-parameter generalized Poisson model to improve the
analysis of RNA-seq data. Nucleic Acids Research, 38(17):1–15.
Witten, D. M. (2011). Classification and clustering of sequencing data using a poisson model.
Annals of Applied Statistics, 5(4):2493–2518.
Appendix

Principal Component Analysis (PCA)
Linear transformation of X by projection onto a lower-dimensional subspace

Find orthogonal components T_k = X w_k (for k = 1, …, K) with maximum variance:

w_k = argmax_{w ∈ R^p, ‖w‖₂ = 1} Var(X w) = argmax_{w ∈ R^p, ‖w‖₂ = 1} w^T X^T X w

with X centered; the unit-norm constraint on w makes the argmax well defined
Solution

w_1, …, w_K are the first K dominant eigenvectors of X^T X

Singular value decomposition: X_{n×p} = Ũ Σ Ṽ^T
- Σ_{r×r} = diag(δ_1, …, δ_r): the (ordered) singular values of X, whose squares
  are the eigenvalues of X^T X (and of XX^T)
- Ũ_{n×r} (resp. Ṽ_{p×r}) is orthonormal, with columns the eigenvectors of XX^T
  (resp. X^T X)
- r is the rank of X

By definition of Ṽ, when X is centered, the new components are T_{n×K} = X Ṽ_{1:K},
i.e. T_{n×K} = Ũ_{1:K} Σ_{1:K} (therefore orthogonal), as checked below
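A quick numerical check (ours) of the identity T = X Ṽ_{1:K} = Ũ_{1:K} Σ_{1:K}
and of the orthogonality of the components:

    import numpy as np

    rng = np.random.default_rng(0)
    Xc = rng.normal(size=(100, 20))
    Xc -= Xc.mean(axis=0)                            # X centered

    Ut, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    K = 3
    T1 = Xc @ Vt[:K].T                               # T = X V_{1:K}
    T2 = Ut[:, :K] * s[:K]                           # T = U_{1:K} Sigma_{1:K}
    assert np.allclose(T1, T2)
    assert np.allclose(T1.T @ T1, np.diag(s[:K]**2)) # orthogonal components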
Geometrical interpretation of PCA
Matrix approximation

UV^T given by the SVD is the rank-K matrix minimizing the Frobenius distance to X
(Eckart and Young, 1936), i.e.

UV^T = argmin_{rk(M) = K} ‖X − M‖²_F

with the Frobenius norm defined by ‖A‖²_F = trace(A^T A)

This is equivalent to an ℓ2 approximation: by expanding ‖·‖²_F, the rows of UV^T
minimize (over the m_i's in R^p)

∑_{i=1}^n ‖x_i − m_i‖₂²

where the x_i's are the rows of X and the m_i's their respective projections onto
a lower-dimensional subspace of dimension K
EM algorithm
Latent variables: U = [U_ik]_{n×K} and V = [V_jk]_{p×K}

Optimization

Maximization of the expectation of the complete log-likelihood of the model given
the data:

argmax_{α,β} E_{U,V|X}[ log L(X, U, V; α, β) ]

E-step:

E_{U,V|X}[log L(X, U, V; α, β)] = E_{U,V|X}[log L(X | U, V)]   (1)
                                + E_{U,V|X}[log L(U, V; α, β)]  (2)

Issues

(1) The Poisson distribution depends on log(λ_ij) = log(∑_k u_ik v_jk), which
cannot be expanded
(2) The posterior is not explicit: E[U | X] and E[V | X] are unknown
→ the M-step is compromised
Detailed optimization for variational inference
Explicit objective function:

J̃_q([φ_ik]_{n×K}) = ∑_{i,k} E_q[ η_ℓ(V, X)^T t(u_ik) − a_ℓ(η_ℓ(V, X)) ]
                     − E_q[ φ_ik^T t(u_ik) − a_ℓ(φ_ik) ]

Gradient:

∇_{φ_ik} J̃_q = ∇²_{φ_ik} a_ℓ(φ_ik) ( E_q[η_ℓ(V, X)] − φ_ik )

because in the exponential family E[t(u_ik)] = ∇_{φ_ik} a_ℓ(φ_ik)

Solution, where the gradient vanishes (Hoffman et al., 2013):

φ_ik = E_q[η_ℓ(V, X)]

Similarly for V_jk: θ_jk = E_q[η_g(U, X)]

Hence the iterative optimization, for t = 1, 2, …:

φ_ik^(t) = E_{q^(t−1)}[η_ℓ(V, X)]   for all i and k
θ_jk^(t) = E_{q^(t−1)}[η_g(U, X)]   for all j and k
Simulated data
Simulations from the model

Design of U and V: controlled through the parameters of the gamma priors, α_k and β_k

Example in the observation space:
- U structured in blocks (each group of observations has entries > 0 on one
  component and ≈ 0 on the others): clustering structure → “easy” setting
- U non-structured (all entries > 0): no clear structure → “difficult” setting

Similarly with V in the variable space: easy versus difficult compression, as
sketched below
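A sketch (ours, with arbitrary gamma parameters) of the two designs —
block-structured versus unstructured U:

    import numpy as np

    rng = np.random.default_rng(0)
    n, K = 100, 2

    # "difficult": unstructured U, same gamma prior for all entries (> 0)
    U_flat = rng.gamma(2.0, 1.0, size=(n, K))

    # "easy": block structure — each group of observations loads on one component
    half = n // 2
    U_block = np.vstack([
        np.column_stack([rng.gamma(2.0, 2.0, half),    # group 1: > 0 on comp. 1
                         rng.gamma(0.1, 0.1, half)]),  # ≈ 0 on comp. 2
        np.column_stack([rng.gamma(0.1, 0.1, half),    # group 2: ≈ 0 on comp. 1
                         rng.gamma(2.0, 2.0, half)]),  # > 0 on comp. 2
    ])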