Nonnegative matrix factorisation and applications in audio signal processing

Cédric Févotte
Laboratoire Lagrange, Nice

Machine Learning Crash Course
Genova, June 2015
Outline

Generalities
  Matrix factorisation models
  Nonnegative matrix factorisation
  Majorisation-minimisation algorithms

Audio examples
  Piano toy example
  Audio restoration
  Audio bandwidth extension
  Multichannel IS-NMF
Matrix factorisation models

Data often available in matrix form.

[Figures: example data matrices — features × samples (generic data),
movies × users (movie ratings), words × text documents (word counts),
frequencies × time (Fourier coefficients of an audio signal).]
Matrix factorisation models

data X ≈ dictionary W × activations H

Also known as dictionary learning, low-rank approximation, factor analysis,
latent semantic analysis.
Matrix factorisation models

- for dimensionality reduction (coding, low-dimensional embedding)
- for unmixing (source separation, latent topic discovery)
- for interpolation (collaborative filtering, image inpainting)
Nonnegative matrix factorisation

V ≈ W H, with data V (F features × N samples), dictionary W (F × K patterns)
and activations H (K × N).

- Data V and factors W, H have nonnegative entries.
- Nonnegativity of W ensures interpretability of the dictionary, because
  patterns w_k and samples v_n belong to the same space.
- Nonnegativity of H tends to produce part-based representations, because
  subtractive combinations are forbidden.

Early work by Paatero and Tapper (1994); landmark Nature paper by Lee and Seung (1999).
49 images among 2429 from MIT's CBCL face dataset

PCA dictionary with K = 25 (red pixels indicate negative values)

NMF dictionary with K = 25
Experiment reproduced from (Lee and Seung, 1999).
NMF for latent semantic analysis
(Lee and Seung, 1999; Hofmann, 1999)

Word-count matrix V (words × text documents) factorised as v_n ≈ W h_n:
each document is approximated as a nonnegative combination of semantic
features. The word-count matrix V is extremely sparse, which speeds up the
algorithm.

[Figure: NMF discovers semantic features of m = 30,991 articles from the
Grolier encyclopedia. The entry 'Constitution of the United States' is
reconstructed from a few features with themes such as government (president,
congress, senate, court, constitution), disease (symptoms, skin, infection),
plants (flowers, leaves, perennial), and materials (metal, glass, copper,
steel). Most frequent words in the entry: president (148), congress (124),
power (120), united (104), constitution (81), amendment (71), government (57),
law (49). Reproduced from (Lee and Seung, 1999).]
NMF for hyperspectral unmixing
(Berry, Browne, Langville, Pauca, and Plemmons, 2007)

[Figure: hyperspectral imaging concept; reproduced from (Bioucas-Dias et al., 2012).]
NMF for audio spectral unmixing
(Smaragdis and Brown, 2003)

All factors are positive-valued (X ≈ W H), so the resulting reconstruction is
additive.

[Figure: spectrogram of an input music passage (frequency vs time) decomposed
into four non-negative components; reproduced from (Smaragdis, 2013).]
Outline

Generalities
  Matrix factorisation models
  Nonnegative matrix factorisation
  Majorisation-minimisation algorithms

Audio examples
  Piano toy example
  Audio restoration
  Audio bandwidth extension
  Multichannel IS-NMF
NMF as a constrained minimisation problem

Minimise a measure of fit between V and WH, subject to nonnegativity:

    min_{W,H ≥ 0} D(V|WH) = Σ_{f,n} d([V]_{fn} | [WH]_{fn}),

where d(x|y) is a scalar cost function, e.g.,

- squared Euclidean distance (Paatero and Tapper, 1994; Lee and Seung, 2001)
- Kullback-Leibler divergence (Lee and Seung, 1999; Finesso and Spreij, 2006)
- Itakura-Saito divergence (Févotte, Bertin, and Durrieu, 2009)
- α-divergence (Cichocki et al., 2008)
- β-divergence (Cichocki et al., 2006; Févotte and Idier, 2011)
- Bregman divergences (Dhillon and Sra, 2005)
- and more in (Yang and Oja, 2011)

Regularisation terms are often added to D(V|WH) to promote sparsity,
smoothness, dynamics, etc.
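The β-divergence unifies several of the cost functions above: β = 2 gives the (halved) squared Euclidean distance, β = 1 the generalised Kullback-Leibler divergence, and β = 0 the Itakura-Saito divergence. A minimal NumPy sketch (the function name is ours; the limiting cases β = 0, 1 are hard-coded because the generic formula is undefined there):

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Sum of elementwise beta-divergences d(x | y) over all entries.

    beta = 2: (halved) squared Euclidean distance
    beta = 1: generalised Kullback-Leibler divergence
    beta = 0: Itakura-Saito divergence
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if beta == 2:
        return 0.5 * np.sum((x - y) ** 2)
    if beta == 1:
        return np.sum(x * np.log(x / y) - x + y)
    if beta == 0:
        return np.sum(x / y - np.log(x / y) - 1)
    # generic case, valid for beta not in {0, 1}
    return np.sum((x ** beta + (beta - 1) * y ** beta
                   - beta * x * y ** (beta - 1)) / (beta * (beta - 1)))
```

For every β the divergence is zero if and only if x = y entrywise, which is what makes it usable as a measure of fit.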
Common NMF algorithm design

- Block-coordinate update of H given W^(i−1), then of W given H^(i).
- Updates of W and H are equivalent by transposition:
      V ≈ WH  ⇔  V^T ≈ H^T W^T
- The objective function is separable in the columns of H (or the rows of W):
      D(V|WH) = Σ_n D(v_n | W h_n)
- We are essentially left with a nonnegative linear regression problem:
      min_{h ≥ 0} C(h) := D(v | Wh)

Numerous references in the image restoration literature, e.g., (Richardson,
1972; Lucy, 1974; Daube-Witherspoon and Muehllehner, 1986; De Pierro, 1993).
Majorisation-minimisation (MM)

Build G(h|h̃) such that G(h|h̃) ≥ C(h) and G(h̃|h̃) = C(h̃).
Optimise (iteratively) G(h|h̃) instead of C(h).

[Figure: animation showing the objective function C(h) together with the
successive auxiliary functions G(h|h^(0)), G(h|h^(1)), G(h|h^(2)), ...;
minimising each auxiliary function produces iterates h^(0), h^(1), h^(2), ...
that monotonically descend C(h) towards a local minimiser h*.]
Majorisation-minimisation (MM)

- Finding a good & workable local majorisation is the crucial point.
- For most of the divergences mentioned, the Jensen and tangent inequalities
  are usually enough.
- In many cases, MM leads to multiplicative algorithms such that

      h_k = h̃_k [ ∇⁻_{h_k} C(h̃) / ∇⁺_{h_k} C(h̃) ]^γ

  where
  - ∇_{h_k} C(h) = ∇⁺_{h_k} C(h) − ∇⁻_{h_k} C(h), and the two summands are
    nonnegative,
  - γ is a divergence-specific scalar exponent.

More details about MM in (Lee and Seung, 2001; Févotte and Idier, 2011;
Yang and Oja, 2011).
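For the generalised KL divergence, the ratio of the negative to the positive part of the gradient, with γ = 1, gives the classical multiplicative updates of Lee and Seung (2001). A minimal single-matrix sketch (plain NumPy, no regularisation; the function name and the small eps guard against division by zero are our assumptions):

```python
import numpy as np

def kl_nmf(V, K, n_iter=100, seed=0):
    """NMF with multiplicative updates for the generalised KL divergence
    (Lee and Seung, 2001): each factor is multiplied by the ratio of the
    negative to the positive part of the gradient (gamma = 1)."""
    rng = np.random.RandomState(seed)
    F, N = V.shape
    W = rng.rand(F, K) + 1e-3
    H = rng.rand(K, N) + 1e-3
    eps = 1e-12
    ones = np.ones_like(V)
    for _ in range(n_iter):
        # H update: negative gradient part W^T (V / WH), positive part W^T 1
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
        # W update: symmetric by transposition
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
    return W, H
```

Because each update minimises a majorising auxiliary function, the KL cost is nonincreasing along the iterations, and nonnegativity of W and H is preserved automatically by the multiplicative form.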
How to choose the right measure of fit?

- The squared Euclidean distance is a common default choice.
- It underlies a Gaussian additive noise model such that
      v_fn = [WH]_fn + ε_fn.
  This can generate negative values — not very natural for nonnegative data.
- Many other options exist.

Select the right divergence (for a specific problem) by
- comparing performances, given ground-truth data,
- assessing the ability to predict missing/unseen data (interpolation,
  cross-validation),
- probabilistic modelling:
      D(V|WH) = − log p(V|WH) + cst
How to choose the right measure of fit?

- Let V ∼ p(V|WH) such that E[V|WH] = WH.
- Then the following correspondences apply, with
      D(V|WH) = − log p(V|WH) + cst

  data support              distribution/noise     divergence           examples
  real-valued               additive Gaussian      squared Euclidean    many
  integer                   multinomial            Kullback-Leibler     word counts
  integer                   Poisson                generalised KL       photon counts
  nonnegative               multiplicative Gamma   Itakura-Saito        spectral data
  nonnegative (generally)   Tweedie                β-divergence         generalises above models
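The Poisson/generalised-KL row of the table can be checked numerically: the Poisson negative log-likelihood and the generalised KL divergence differ only by a term that depends on the data v, so they share the same minimiser in the model parameter. A small illustrative check (the value v = 7 and the grid are arbitrary):

```python
import numpy as np
from math import lgamma

v = 7.0
lams = np.linspace(0.5, 20.0, 1000)   # candidate model values [WH]_fn

# Poisson negative log-likelihood of v under mean lam:
# -log p(v | lam) = lam - v*log(lam) + log(v!)
nll = lams - v * np.log(lams) + lgamma(v + 1)

# Generalised KL divergence: d(v | lam) = v*log(v/lam) - v + lam
dkl = v * np.log(v / lams) - v + lams

# The two curves differ by a constant in lam, hence share their minimiser.
gap = nll - dkl
```

Both curves are minimised at lam = v, consistent with E[V|WH] = WH.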
Outline

Generalities
  Matrix factorisation models
  Nonnegative matrix factorisation
  Majorisation-minimisation algorithms

Audio examples
  Piano toy example
  Audio restoration
  Audio bandwidth extension
  Multichannel IS-NMF
Piano toy example

[Figure: score of the played notes (MIDI numbers: 61, 65, 68, 72);
three representations of the data.]
Piano toy example

IS-NMF on the power spectrogram with K = 8.

[Figure: dictionary W, coefficients H, and reconstructed components for
k = 1, ..., 8.]

Pitch estimates: 65.0, 68.0, 61.0, 72.0
(True values: 61, 65, 68, 72)
Piano toy example

KL-NMF on the magnitude spectrogram with K = 8.

[Figure: dictionary W, coefficients H, and reconstructed components for
k = 1, ..., 8.]

Pitch estimates: 65.2, 68.2, 61.0, 72.2, 56.2
(True values: 61, 65, 68, 72)
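The decomposition illustrated on the piano data can be sketched on a toy power "spectrogram": two synthetic spectral templates active over overlapping time ranges, factorised with IS-NMF. The signal construction and function name are illustrative assumptions; the updates are the standard exponent-1 multiplicative rules for β = 0 used in (Févotte, Bertin, and Durrieu, 2009):

```python
import numpy as np

def is_nmf(V, K, n_iter=300, seed=0):
    """Multiplicative updates for Itakura-Saito NMF (beta = 0)."""
    rng = np.random.RandomState(seed)
    F, N = V.shape
    W = rng.rand(F, K) + 1e-3
    H = rng.rand(K, N) + 1e-3
    eps = 1e-12
    for _ in range(n_iter):
        Vh = W @ H + eps
        H *= (W.T @ (V * Vh ** -2)) / (W.T @ Vh ** -1 + eps)
        Vh = W @ H + eps
        W *= ((V * Vh ** -2) @ H.T) / (Vh ** -1 @ H.T + eps)
    return W, H

# Toy power "spectrogram": two spectral templates with overlapping activations.
F, N = 40, 60
bins = np.arange(F)
w1 = np.exp(-0.5 * (bins - 10.0) ** 2)          # "note" centred on bin 10
w2 = np.exp(-0.5 * (bins - 25.0) ** 2)          # "note" centred on bin 25
h1 = (np.arange(N) < 40).astype(float)          # active in frames 0-39
h2 = (np.arange(N) >= 20).astype(float)         # active in frames 20-59
V = np.outer(w1, h1) + np.outer(w2, h2) + 1e-3  # small floor keeps V > 0

W, H = is_nmf(V, K=2)
peaks = sorted(np.argmax(W, axis=0))            # spectral peaks of learned templates
```

The IS divergence is scale-invariant per entry, which is why it is a natural fit for power spectrograms whose coefficients span a large dynamic range.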
Audio restoration
Louis Armstrong and His Hot Five

[Figure: log-power spectrogram and waveform (amplitude vs time, ~100 s) of the
original data.]
Audio restoration
Louis Armstrong and His Hot Five

Original mono = Accompaniment (comp. 1, 9) + Brass (comp. 2, 3, 5-8)
                + Trombone (comp. 4) + Noise (comp. 10)

Audio examples: original mono denoised; original denoised & upmixed to stereo.
Audio bandwidth extension
(Sun and Mazumder, 2013)

[Figure: full-band training samples V and band-limited samples Y; adapted from
(Sun and Mazumder, 2013).]
Audio bandwidth extension
(Sun and Mazumder, 2013)

AC/DC example:
- training data (Highway to Hell)
- band-limited data (Back in Black)
- bandwidth extended
- ground truth

Examples from http://statweb.stanford.edu/~dlsun/bandwidth.html, used with
permission from the author.
Multichannel IS-NMF
(Ozerov and Févotte, 2010)

[Figure: sources S, modelled by NMF as W H, pass through mixing system A with
additive noise on each channel to give the mixture X.]

Multichannel NMF problem: estimate W, H and A from X.

- Best scores on the underdetermined speech and music separation task at the
  Signal Separation Evaluation Campaign (SiSEC) 2008.
- IEEE Signal Processing Society 2014 Best Paper Award.
User-guided multichannel IS-NMF
(Ozerov, Févotte, Blouet, and Durrieu, 2011)

- The decomposition is guided by the operator: source activation time-codes
  are input to the separation system.
- Forced zeros are set in H when a source is silent.
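Because multiplicative updates only rescale entries of H, a zero entry stays zero forever: user-provided time-codes can therefore be enforced simply by zeroing the silent regions of H at initialisation. A single-channel KL sketch of this mechanism (the mask convention and function name are ours; the actual system of the paper is a multichannel, structured tensor factorisation):

```python
import numpy as np

def masked_kl_nmf(V, mask, n_iter=100, seed=0):
    """Single-channel KL-NMF where mask[k, n] = 0 forces component k to be
    silent in frame n. Multiplicative updates only rescale entries of H,
    so zeroing the initial H where mask is 0 keeps those entries at zero."""
    rng = np.random.RandomState(seed)
    F, N = V.shape
    K = mask.shape[0]
    W = rng.rand(F, K) + 1e-3
    H = (rng.rand(K, N) + 1e-3) * mask   # forced zeros from the time-codes
    eps = 1e-12
    ones = np.ones_like(V)
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
    return W, H
```

The forced zeros act as a hard structural constraint: no extra penalty term or projection step is needed, which is one reason multiplicative schemes are convenient for user-guided separation.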
References I
M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons. Algorithms and
applications for approximate nonnegative matrix factorization. Computational Statistics & Data
Analysis, 52(1):155–173, Sep. 2007.
J. M. Bioucas-Dias, A. Plaza, N. Dobigeon, M. Parente, Q. Du, P. Gader, and J. Chanussot.
Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches.
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(2):354–379,
2012.
A. Cichocki, R. Zdunek, and S. Amari. Csiszar’s divergences for non-negative matrix factorization:
Family of new algorithms. In Proc. International Conference on Independent Component Analysis
and Blind Signal Separation (ICA), pages 32–39, Charleston SC, USA, Mar. 2006.
A. Cichocki, H. Lee, Y.-D. Kim, and S. Choi. Non-negative matrix factorization with α-divergence.
Pattern Recognition Letters, 29(9):1433–1440, July 2008.
M. Daube-Witherspoon and G. Muehllehner. An iterative image space reconstruction algorithm suitable
for volume ECT. IEEE Transactions on Medical Imaging, 5(5):61–66, 1986. doi:
10.1109/TMI.1986.4307748.
A. R. De Pierro. On the relation between the ISRA and the EM algorithm for positron emission
tomography. IEEE Trans. Medical Imaging, 12(2):328–333, 1993. doi: 10.1109/42.232263.
I. S. Dhillon and S. Sra. Generalized nonnegative matrix approximations with Bregman divergences. In
Advances in Neural Information Processing Systems (NIPS), 2005.
C. Févotte and J. Idier. Algorithms for nonnegative matrix factorization with the beta-divergence.
Neural Computation, 23(9):2421–2456, Sep. 2011. doi: 10.1162/NECO_a_00168. URL
http://www.unice.fr/cfevotte/publications/journals/neco11.pdf.
References II
C. Févotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the Itakura-Saito
divergence. With application to music analysis. Neural Computation, 21(3):793–830, Mar. 2009. doi:
10.1162/neco.2008.04-08-771. URL
http://www.unice.fr/cfevotte/publications/journals/neco09_is-nmf.pdf.
L. Finesso and P. Spreij. Nonnegative matrix factorization and I-divergence alternating minimization.
Linear Algebra and its Applications, 416:270–287, 2006.
T. Hofmann. Probabilistic latent semantic indexing. In Proc. 22nd International Conference on Research
and Development in Information Retrieval (SIGIR), 1999. URL
http://www.cs.brown.edu/~th/papers/Hofmann-SIGIR99.pdf.
D. D. Lee and H. S. Seung. Learning the parts of objects with nonnegative matrix factorization. Nature,
401:788–791, 1999.
D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural and
Information Processing Systems 13, pages 556–562, 2001.
L. B. Lucy. An iterative technique for the rectification of observed distributions. Astronomical Journal,
79:745–754, 1974. doi: 10.1086/111605.
A. Ozerov and C. Févotte. Multichannel nonnegative matrix factorization in convolutive mixtures for
audio source separation. IEEE Transactions on Audio, Speech and Language Processing, 18(3):
550–563, Mar. 2010. doi: 10.1109/TASL.2009.2031510. URL
http://www.unice.fr/cfevotte/publications/journals/ieee_asl_multinmf.pdf.
A. Ozerov, C. Févotte, R. Blouet, and J.-L. Durrieu. Multichannel nonnegative tensor factorization with
structured constraints for user-guided audio source separation. In Proc. IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, May
2011. URL http://www.unice.fr/cfevotte/publications/proceedings/icassp11d.pdf.
References III
P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal
utilization of error estimates of data values. Environmetrics, 5:111–126, 1994.
W. H. Richardson. Bayesian-based iterative method of image restoration. Journal of the Optical Society
of America, 62:55–59, 1972.
P. Smaragdis. About this non-negative business. WASPAA keynote slides, 2013. URL
http://web.engr.illinois.edu/~paris/pubs/smaragdis-waspaa2013keynote.pdf.
P. Smaragdis and J. C. Brown. Non-negative matrix factorization for polyphonic music transcription. In
Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA’03),
Oct. 2003.
D. L. Sun and R. Mazumder. Non-negative matrix completion for bandwidth extension: a convex
optimization approach. In Proc. IEEE Workshop on Machine Learning and Signal Processing
(MLSP), 2013.
Z. Yang and E. Oja. Unified development of multiplicative algorithms for linear and quadratic
nonnegative matrix factorization. IEEE Transactions on Neural Networks, 22:1878 – 1891, Dec.
2011. doi: http://dx.doi.org/10.1109/TNN.2011.2170094.