Tuning Pruning in Sparse Non-negative Matrix Factorization

Morten Mørup
Joint work with Lars Kai Hansen

DTU Informatics / Intelligent Signal Processing
Technical University of Denmark

EUSIPCO’09, 27 August 2009
Non-negative Matrix Factorization (NMF)
V ≈ WH,  with V ≥ 0, W ≥ 0, H ≥ 0
(Daniel D. Lee and Sebastian Seung, Nature 1999)
Gives a parts-based representation
(and as such also promotes sparse representations)
(Lee and Seung, 1999)
 Also named Positive Matrix Factorization (PMF)
(Paatero and Tapper, 1994)
Popularized due to a simple algorithmic procedure based on multiplicative updates
(Lee & Seung, 2001)
Roadmap: Some important challenges in NMF
 How to efficiently compute the decomposition
(NMF is a non-convex problem)
(first part of this talk)
A good starting point is not to use multiplicative updates
How to resolve the non-uniqueness of the decomposition
NMF only unique when data adequately spans the positive orthant (Donoho & Stodden, 2004)
[Figure: example data sets in 3-D (x, y, z); panels contrast the positive orthant with the convex hull spanned by the components]
 How to determine the number of components
(second part of this talk)
We will demonstrate that Automatic Relevance Determination
in Bayesian learning can address these challenges by tuning
the pruning in sparse NMF
Multiplicative updates
The multiplicative updates of (Lee & Seung, 2001) can be viewed as gradient descent with a particular, data-dependent choice of step size parameter (Salakhutdinov, Roweis, Ghahramani, 2004).
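For concreteness, a minimal NumPy sketch of the standard least-squares multiplicative updates (the function name nmf_mu is my own illustration, not the implementation used in the talk; eps guards against division by zero):

import numpy as np

def nmf_mu(V, K, n_iter=500, eps=1e-9, seed=0):
    # Least-squares NMF via the multiplicative updates of Lee & Seung (2001)
    rng = np.random.default_rng(seed)
    I, J = V.shape
    W = rng.random((I, K)) + eps
    H = rng.random((K, J)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis
    return W, H

Each update multiplies the current estimate by the ratio of the negative to the positive part of the gradient, so non-negativity is preserved automatically.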
Other common approaches for solving the NMF problem
Active set procedure
(Analytic closed-form solution within the active set for the LS error)
(Lawson and Hanson, 1974), (R. Bro and S. de Jong, 1997)
 Projected gradient
(C.-J. Lin 2007)
Note: multiplicative updates (MU) need not converge to an optimal solution!
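As a hedged illustration of a single projected-gradient step on H under the least-squares cost (Lin's method additionally selects the step size by a line search; the fixed step and the function name below are mine, used only as a sketch):

import numpy as np

def projected_gradient_step(V, W, H, step=1e-3):
    # One projected-gradient step on H for the cost 0.5 * ||V - W @ H||_F^2
    grad_H = W.T @ (W @ H - V)                 # gradient w.r.t. H
    return np.maximum(0.0, H - step * grad_H)  # gradient step, then project onto H >= 0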
Sparseness has been imposed to alleviate the
non-uniqueness of NMF
(P. Hoyer 2002, 2004), (J. Eggert and E. Körner 2004)
Sparseness is motivated by the principle of parsimony, i.e., forming the simplest account. As such, sparseness is also related to VARIMAX and to ML-ICA based on sparse priors.
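One common way to write such a sparsity-penalized objective, in the least-squares case and with the columns of W held at unit norm so the penalty cannot be circumvented by rescaling (my notation, a sketch rather than the slide's exact formula):

C(W, H) = ½ ‖V − WH‖²_F + λ Σ_{k,j} H_kj,   subject to ‖w_k‖₂ = 1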
Open problems for Sparse NMF (SNMF)
What is the adequate degree of sparsity λ to impose?
What is the adequate number of components K to model the data?
Both issues can be posed as the single problem of tuning the pruning in sparse NMF (SNMF): by imposing a component-wise sparsity penalty, the problems above boil down to determining each λ_k.
A sufficiently large λ_k results in the kth component being turned off (i.e., removed).
Bayesian Learning and the Principle of Parsimony
The explanation of any phenomenon
should make as few assumptions as
possible, eliminating those that make
no difference in the observable
predictions of the explanatory
hypothesis or theory.
William of Ockham
To get the posterior probability distribution,
multiply the prior probability distribution by
the likelihood function and then normalize
Thomas Bayes
David J.C. MacKay
Bayesian learning embodies Occam’s razor, i.e., complex models are penalized. The horizontal axis represents the space of possible data sets D. Bayes’ rule rewards models in proportion to how much they predicted the data that occurred. These predictions are quantified by a normalized probability distribution on D.
SNMF in a Bayesian formulation
Likelihood function
Prior
(In the hierarchical Bayesian framework, priors on λ can further be imposed)
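The likelihood and prior themselves appear only as images on the original slide; a plausible reconstruction consistent with the rest of the talk (LS/Gaussian or KL/Poisson likelihood and component-wise exponential sparsity priors; this exact parameterization is an assumption on my part) is:

Likelihood (LS / Gaussian noise):  p(V | W, H, σ²) = Π_ij N(V_ij ; (WH)_ij, σ²)
Likelihood (KL / Poisson noise):   p(V | W, H) = Π_ij Poisson(V_ij ; (WH)_ij)
Prior (component-wise sparsity):   p(H_kj | λ_k) = λ_k exp(−λ_k H_kj),  H_kj ≥ 0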
The log posterior for Sparse NMF is now given by
The contribution to the log posterior from the normalization constant of the priors makes it possible to learn the regularization strengths λ_k from the data
(This is also known as Automatic Relevance Determination (ARD))
Inserting this value of λ_k into the objective shows that ARD corresponds to a reweighted L0-norm optimization scheme on the component activations
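As a sketch of that step, assuming the exponential prior above on the J entries of row k of H:

log p(H_k· | λ_k) = J log λ_k − λ_k Σ_j H_kj
setting the derivative w.r.t. λ_k to zero gives  λ_k = J / Σ_j H_kj
substituting back, the penalty on component k becomes  J log(Σ_j H_kj) + const

Each component thus pays a logarithmic cost for its total activation: nearly unused components are driven all the way to zero (pruned), while active components pay only a slowly growing cost, which is the reweighted, L0-like behaviour referred to above.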
No closed-form solution exists for the posterior moments of W and H due to the non-negativity constraints and the use of non-conjugate priors.
The posterior distribution can be estimated by sampling approaches, cf. the previous talk by Mikkel Schmidt.
Point estimates of W and H can be obtained by maximum a posteriori (MAP) estimation, forming a regular sparse NMF optimization problem (a minimal sketch is given below).
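A minimal illustrative sketch of such a MAP scheme under the least-squares cost with exponential priors (the function name ard_snmf_ls and the exact update order are mine, not the authors' implementation):

import numpy as np

def ard_snmf_ls(V, K, n_iter=500, eps=1e-9, seed=0):
    # MAP sketch of ARD-regularized sparse NMF: least-squares cost,
    # exponential priors with component-wise rates lambda_k (noise variance taken as 1)
    rng = np.random.default_rng(seed)
    I, J = V.shape
    W = rng.random((I, K))
    W /= np.linalg.norm(W, axis=0, keepdims=True)          # unit-norm columns of W
    H = rng.random((K, J))
    lam = np.ones(K)
    for _ in range(n_iter):
        # Multiplicative update for H with an L1 penalty of strength lambda_k on row k
        H *= (W.T @ V) / (W.T @ W @ H + lam[:, None] + eps)
        # Multiplicative update for W, then re-normalize
        # (simplified; a principled treatment folds the norm constraint into the update)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        W /= np.linalg.norm(W, axis=0, keepdims=True) + eps
        # ARD step: the MAP value of each rate is set by the data
        lam = J / (H.sum(axis=1) + eps)
    return W, H, lam

Components whose rows of H shrink receive ever larger lambda_k and are pushed further towards zero, which is the pruning mechanism; pruned components can simply be removed at the end.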
Data results
Handwritten digits:
X: 256 pixels × 7291 digits
CBCL face database:
X: 361 pixels × 2429 faces
Wavelet-transformed EEG:
X: 64 channels × 122976 time-frequency bins
Analyzing X vs. Xᵀ
Handwritten digits (X):
X: 256 pixels × 7291 digits
SNMF has clustering-like properties
(as reported in Ding, 2005)
Handwritten digits (Xᵀ):
Xᵀ: 7291 digits × 256 pixels
SNMF gives a parts-based representation
(as reported in Lee & Seung, 1999)
Conclusion
Bayesian learning forms a simple framework for tuning the pruning in sparse NMF, thereby both establishing the model order and resolving the non-uniqueness of the NMF representation.
The likelihood function (i.e., KL (Poisson noise) vs. LS (Gaussian noise)) heavily impacted the extracted number of components. In comparison, a tensor decomposition study in (Mørup et al., Journal of Chemometrics, 2009) demonstrated that the choice of prior distribution has only a limited effect on the model order estimation.
Many other parameterizations of the prior, as well as approaches to parameter estimation, are conceivable. Nevertheless, Bayesian learning forms a promising framework for model order estimation as well as for resolving ambiguities in the NMF model through the tuning of the pruning.