Non-negative Matrix Factorization:
A possible way to learn sound dictionaries
Hiroki Asari
Tony Zador Lab
Watson School of Biological Sciences
Cold Spring Harbor Laboratory
August 22, 2005
Abstract
The auditory system detects sound signals and uses their temporal and frequency information for sound identification, sound localization, and sound source separation. Past studies have shown that hair cells in the cochlea respond to input sound signals in a frequency-dependent manner, but little is known about how sound information is conveyed to and processed within the auditory cortex.
Recently, B. A. Pearlmutter and A. M. Zador proposed an algorithm for monaural source separation under the assumption that the precise head-related transfer function (HRTF) and all the sound dictionary elements are known (ref.[1]). To apply this algorithm to real sound signals, I propose here that non-negative matrix factorization (NMF) applied to the spectrograms of sound signals can yield a set of sound dictionary elements.
When NMF was applied to solo music with an appropriate value for the rank of factorization, it extracted instrument-specific patterns of basis spectrograms, each with a peak frequency corresponding to a different note. Interestingly, the sound signals converted back from the obtained basis spectrograms sounded more or less like the corresponding instrument, which suggests that the obtained basis spectrograms, or basis sound elements, are a good candidate for the sound dictionary.
When NMF was applied to music played with several different instruments, the obtained basis sound elements could be categorized by hand into instrument-specific patterns. In addition, using the categorized elements, sound signals could be reconstructed corresponding to each instrument part of the original music. The fact that source separation can be done using the basis sound elements obtained by NMF suggests that NMF is a possible way to learn sound dictionaries from sound signals.
1 Introduction

1.1 Monaural source separation
We live in a world full of sounds arising from different combinations of sound sources. Thanks to the great feats of the brain, however, we can listen to only those sounds we pay attention to, ignoring all the other irrelevant ones. The brain appears to accomplish this complex task with the aid of binaural and monaural sound cues as well as visual cues. One of the most important cues is the filtering effect of the head, torso, and pinnae (the head-related transfer function, HRTF), which organisms presumably learn by experience. Theoretical work has shown that, given prior knowledge of the HRTF, monaural source separation can be done if we additionally know all the "sound dictionary" elements for all the sound sources (ref.[1]).
In this study, I addressed how we, or a computer, can learn such a huge sound dictionary. Because sound signals contain temporal and frequency information, they can be depicted in a two-dimensional plane as spectrograms. It is therefore reasonable to think that we can learn sound dictionaries by learning a representative subset of spectrograms for each sound source. Here I propose that an algorithm called non-negative matrix factorization can be used for this purpose.
1.2 Non-negative matrix factorization (NMF)
NMF is an algorithm that factorizes a data matrix under non-negativity constraints (ref.[2], [3]). Because the non-negativity constraints allow only additive combinations of bases, NMF gives a parts-based representation of the original dataset. In addition, compared to other factorization methods such as principal component analysis (PCA) or independent component analysis (ICA), the obtained bases are easier to interpret. Therefore, when NMF is applied to a set of spectrograms of music, each basis element would be expected to represent a different note of a different instrument, which turned out to be true.
2 Methods
All the programming and data analyses were done in Matlab.¹
2.1 Non-negative matrix factorization (NMF)
The original data matrix is given as an n × m matrix V, each column of which contains the n data values for one of the m spectrograms. The data matrix V is then approximated by NMF as

$$V \approx WH, \qquad (WH)_{i\mu} = \sum_{a=1}^{r} W_{ia} H_{a\mu},$$

where the rank of factorization, r, is chosen such that nm > nr + rm, so that W H compresses the original data V. The dimensions of the factors W and H are n × r and r × m, respectively. Each column of the matrix W contains one of the basis spectrograms, and the matrix H contains the coefficients for reconstructing the original data matrix V.
The cost function, F , to be maximized by NMF implementation is given by the next equation:
F =
Viµ log(W H)iµ − (W H)iµ
i,µ
proof: It is reasonable to assume that sound signals are made up of a sparse combination of basis elements at each discrete time point. This sparseness allows us to model each data point V_{iµ} as generated by a Poisson distribution with mean (W H)_{iµ}. The likelihood under the Poisson distribution is

$$P_{i\mu}(V \mid WH) = e^{-(WH)} \frac{(WH)^{V}}{V!},$$

where the indices are omitted to simplify the equation. Taking the logarithm of both sides gives

$$\log P_{i\mu} = V \log(WH) - WH - \log(V!).$$

Summing log P_{iµ} over i and µ, the log-likelihood that V is generated by W H is

$$\sum_{i,\mu} \log P_{i\mu} = \sum_{i,\mu} \left[ V_{i\mu} \log(WH)_{i\mu} - (WH)_{i\mu} - \log(V_{i\mu}!) \right].$$

Because the value of log(V_{iµ}!) is constant with respect to W and H, we can drop that term, which gives the cost function F as in equation (1). Q.E.D.
¹ Please contact me at [email protected] if you want to use the M-files for this analysis.
To maximize the cost function F , the following update rules are applied (ref.[2]).
$$H_{a\mu} \longleftarrow H_{a\mu} \sum_{i} W_{ia} \frac{V_{i\mu}}{(WH)_{i\mu}}, \qquad
W_{ia} \longleftarrow W_{ia} \sum_{\mu} H_{a\mu} \frac{V_{i\mu}}{(WH)_{i\mu}}, \qquad
W_{ia} \longleftarrow \frac{W_{ia}}{\sum_{j} W_{ja}}. \tag{2}$$
These update rules find a pair of W and H that gives the approximation V ≈ W H by converging to a local maximum of the cost function F. In practice, NMF is applied several times with different non-negative initial values for W and H to find the best approximation of the dataset V. The first two update rules preserve the non-negativity of W and H, and the third normalizes the columns of W (Σ_i W_{ia} = 1). The monotonic convergence of the cost function F under these update rules can be proved using the EM algorithm, as shown below (ref.[3],[4]).
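The analyses in this study were done in Matlab (M-files not reproduced here). Purely as an illustration, a minimal NumPy sketch of the update rules in equation (2) might look like the following; the function names, iteration count, and convergence monitoring are my own choices, not the original implementation.

```python
import numpy as np

def nmf(V, r, n_iter=500, seed=0, eps=1e-12):
    """Fit V (n x m, non-negative) ~ W (n x r) @ H (r x m) with the
    multiplicative updates of equation (2), which maximize the
    Poisson log-likelihood F of equation (1)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r))
    W /= W.sum(axis=0)                  # normalize columns: sum_i W_ia = 1
    H = rng.random((r, m))
    for _ in range(n_iter):
        H *= W.T @ (V / (W @ H + eps))  # H_au <- H_au sum_i W_ia V_iu/(WH)_iu
        W *= (V / (W @ H + eps)) @ H.T  # W_ia <- W_ia sum_u H_au V_iu/(WH)_iu
        W /= W.sum(axis=0)              # W_ia <- W_ia / sum_j W_ja
    return W, H

def cost_F(V, W, H, eps=1e-12):
    """Cost function F of equation (1), useful to monitor convergence."""
    WH = W @ H
    return np.sum(V * np.log(WH + eps) - WH)
```

In practice one would run this from several random seeds and keep the factorization with the largest F, as described above.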
Expectation-Maximization algorithm The Expectation-Maximization (EM) algorithm is a technique for solving maximum-likelihood problems with a two-step procedure. In the first, "expectation" step, an auxiliary function for the original cost function is formulated; in the second, "maximization" step, the auxiliary function is maximized, which is guaranteed to give a non-decreasing value of the original cost function. These two steps are repeated until the original cost function has converged to a local maximum. The key to the EM algorithm is therefore to find an appropriate auxiliary function.
Expectation step Let F be a cost function to be maximized. An auxiliary function G(h, h^t) for F(h) is defined as a function that satisfies

$$G(h, h^t) \le F(h), \qquad G(h, h) = F(h).$$
Maximization step If G is an auxiliary function for F, then F is non-decreasing under the update rule

$$h^{t+1} = \arg\max_{h} G(h, h^t).$$

proof: $F(h^{t+1}) \ge G(h^{t+1}, h^t) \ge G(h^t, h^t) = F(h^t).$ Q.E.D.
In fact, h^{t+1} need not maximize G exactly; it can be any choice for which G is non-decreasing (Fig.1).
[Figure 1: Maximizing the auxiliary function G guarantees a non-decreasing value of F. The sketch shows the curves F and G over h, with the points h_max, h^{t+1}, and h^t marked.]
Proof of the NMF update rules: The update rules in equation (2) can be proved with a modified version of the EM algorithm. Under the non-negativity constraints W ≥ 0 and H ≥ 0, define G_1(H, H^t) and G_2(H, H^t) as

$$G_1(H, H^t) = \sum_{i,\mu} \left[ V_{i\mu} \sum_{a} \frac{W_{ia} H^t_{a\mu}}{(W H^t)_{i\mu}} \log (W_{ia} H_{a\mu}) - (W H)_{i\mu} \right],$$

$$G_2(H, H^t) = - \sum_{i,\mu} V_{i\mu} \sum_{a} \frac{W_{ia} H^t_{a\mu}}{(W H^t)_{i\mu}} \log \frac{W_{ia} H_{a\mu}}{(W H)_{i\mu}}.$$

By fixing W in equation (1), the cost function F can be regarded as a function of H, satisfying the relation

$$F(H) = G_1(H, H^t) + G_2(H, H^t)$$

for any H^t, and in particular F(H) = G_1(H, H) + G_2(H, H). Then F(H) is non-decreasing if G_1(H, H^t) is maximized with respect to H, because the inequality G_2(H^{t+1}, H^t) ≥ G_2(H^t, H^t) holds² for any H^{t+1}. Therefore, by setting ∂G_1(H, H^t)/∂H = 0, the update rule for H can be derived as shown in equation (2). The update rule for W can be derived in a similar manner. Q.E.D.
2.2 Principal Component Analysis and Independent Component Analysis
Principal component analysis (PCA) and independent component analysis (ICA) are methods that simplify datasets by replacing a group of the original variables with new variables. Both PCA and ICA generate a set of new variables {γ_i : i = 1, ..., n}, and the original data x are decomposed into a linear combination of the new variables:

$$x = \sum_{i} a_i \gamma_i.$$

PCA chooses the basis set {γ_i : i = 1, ..., n} to be orthogonal, whereas ICA chooses it to be not orthogonal but statistically independent.
² Note: $\sum_i p_i \log p_i \ge \sum_i p_i \log q_i$ for any distributions with $\sum_i p_i = \sum_i q_i = 1$ (Gibbs' inequality).
Principal component analysis PCA was done using the built-in Matlab function svd (singular value decomposition, SVD). SVD decomposes a data matrix X into X = U S V^T with a diagonal matrix S and unitary matrices U and V; U S then corresponds to the principal components given by PCA.
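In Python, the same computation might look like the following sketch (note that conventional PCA first subtracts the mean of each variable, a step not mentioned in the text):

```python
import numpy as np

def pca_via_svd(X):
    """Principal components via SVD, X = U S V^T; returns U*S."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U * s   # U S: the principal components described above
```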
Independent component analysis ICA was done with the algorithm "FastICA" (ref.[5]). Independent components Y are derived by rotating the principal components U by an orthogonal matrix B, that is, Y = U B. The FastICA algorithm uses a fixed-point iteration to find B, with the fourth cumulant as the criterion for independence.
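A modern equivalent (not the original 1997 implementation used here) is scikit-learn's FastICA, where the 'cube' contrast function corresponds to the kurtosis-based (fourth-cumulant) criterion mentioned above. A hedged sketch:

```python
from sklearn.decomposition import FastICA

# X: data matrix with one observation per row
ica = FastICA(n_components=25, fun='cube', random_state=0)
Y = ica.fit_transform(X)   # independent components Y
```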
2.3 Preparation of various datasets for NMF analyses
Facial images Three hundred fifty-two facial images were downloaded from the F.A.C.E.S directory on the internal web site of Cold Spring Harbor Laboratory.³ All the images were trimmed and shrunk to 20 × 20 pixels.
Kanji Chinese characters A dataset of kanji Chinese characters was prepared by rasterizing the "MS gothic" font (30 points). The dataset contains images (30 × 30 pixels) of 1006 characters⁴ that students in Japan learn by the age of 12.
Images comprised of non-overlapping arbitrary bases Twenty-five non-overlapping arbitrary bases were generated in 25 × 25 pixel images from the two-dimensional normal distribution N[(µ_x, µ_y), σ] with mean (µ_x, µ_y) and standard deviation σ:

$$(\mu_x, \mu_y) = (5i - 2,\ 5j - 2), \quad (i, j = 1, 2, 3, 4, 5), \qquad \sigma = 2.5.$$

Each basis image was then normalized to sum to unity. Note that these bases were not completely non-overlapping, but they can reasonably be regarded as non-overlapping (Fig.2). Then, 1000 images were generated for the NMF analyses by randomly weighted additive combinations of the bases, as sketched below.
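A minimal NumPy sketch of this construction (function and variable names are mine):

```python
import numpy as np

def make_nonoverlapping_bases(size=25, sigma=2.5):
    """25 Gaussian bases on a 5x5 grid, each normalized to sum to unity."""
    y, x = np.mgrid[1:size + 1, 1:size + 1]
    cols = []
    for i in range(1, 6):
        for j in range(1, 6):
            mx, my = 5 * i - 2, 5 * j - 2
            g = np.exp(-((x - mx) ** 2 + (y - my) ** 2) / (2 * sigma ** 2))
            cols.append((g / g.sum()).ravel())
    return np.array(cols).T                 # W_ini: (25*25) x 25

rng = np.random.default_rng(0)
W_ini = make_nonoverlapping_bases()
H = rng.random((25, 1000))                  # random coefficients
V = W_ini @ H                               # m = 1000 generated images
```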
Images comprised of overlapping arbitrary bases To make overlapping arbitrary bases, two random points were chosen for each basis image within a two-dimensional plane (x, y) (0 ≤ x, y ≤ 25). A line segment was generated by connecting these two points, and the distance d between the line segment and each point of interest on the plane was calculated. The value at each point was then set to exp[−d/2], and the image was normalized to sum to unity. These steps were iterated to generate 25 overlapping arbitrary bases (Fig.3).

Datasets were generated by additive combinations of the bases with either random coefficients or random sparse coefficients, to test the effect of sparseness on the results (see the sketch below).
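A sketch of the overlapping bases and the two coefficient regimes. (The paragraph gives the plane as 0 ≤ x, y ≤ 25 while Fig.3 labels the images 20 × 20 pixels; the image size is left as a parameter here.)

```python
import numpy as np

def make_line_basis(size, rng):
    """One overlapping basis: exp[-d/2], with d the distance from each
    pixel to a random line segment, normalized to sum to unity."""
    p, q = rng.random((2, 2)) * size            # two random endpoints
    y, x = np.mgrid[0:size, 0:size]
    pts = np.stack([x.ravel(), y.ravel()], 1).astype(float)
    v = q - p
    t = np.clip((pts - p) @ v / (v @ v), 0.0, 1.0)
    d = np.linalg.norm(pts - (p + t[:, None] * v), axis=1)
    b = np.exp(-d / 2)
    return b / b.sum()

rng = np.random.default_rng(1)
W_ini = np.stack([make_line_basis(20, rng) for _ in range(25)], axis=1)
H0 = rng.random((25, 1000))                     # random coefficients
Hs = H0 * (rng.random((25, 1000)) > 0.2)        # ~20% of entries set to 0
V0, Vs = W_ini @ H0, W_ini @ Hs
```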
³ URL: http://intranet.cshl.edu/frame.html?link=faces (not open to the public). Face image data are instead available at the following web sites (Aug 22, 2005): CBCL database, http://cbcl.mit.edu/cbcl/software-datasets/FaceData2.html; ORL database, http://www.uk.research.att.com/facedatabase.html
⁴ National curriculum standards for elementary school, issued by the Ministry of Education, Culture, Sports, Science, and Technology (written in Japanese): http://www.mext.go.jp/b_menu/shuppan/sonota/990301b/990301d.htm. The list is available, for example, at AOZORA-BUNKO: http://www.aozora.gr.jp/kanji_table/kyouiku_list.zip
[Figure 2: An example of non-overlapping arbitrary bases W_ini (n = 25×25 pixels) and some of the generated dataset V = W_ini H (m = 1000 images; H, random coefficients).]
[Figure 3: An example of overlapping arbitrary bases W_ini (n = 20×20 pixels) and a schematic of how the datasets (m = 1000 each) were generated: V_0 = W_ini H_0 with random coefficients H_0, and V_s = W_ini H_s with sparse random coefficients H_s (20% of the entries of H_s equal 0).]
2.4 NMF analyses on sound signals
Music data were obtained from music CDs as WAVE-format sound files with sampling frequencies of 11.025–22.05 kHz. The sound signals were fragmented into short overlapping pieces (300 msec), and each piece was converted into a logarithmic-scale spectrogram (∆f = 1/12 octave, ∆t = 3 msec). Each spectrogram had about 6000 data points, and about 1000 spectrograms were used for the NMF analyses with a rank of factorization of r ∼ 100. The obtained basis spectrograms were characterized and categorized by their main peak frequency (first formant) or by the pattern of their peaks. In addition, the obtained basis spectrograms were converted back into sound signals (see below for the algorithm) to confirm whether the categorization of the spectrograms was reasonable. Then, according to the categorization, sound signals were reconstructed to test whether the reconstructed sounds corresponded to each instrument part of the original music.
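The exact spectrogram code (provided by C. Machens, see Acknowledgements) is not reproduced here; a rough NumPy/SciPy sketch of this preprocessing, with the analysis window length as an assumed parameter, might read:

```python
import numpy as np
from scipy.signal import stft

def log_freq_spectrogram(x, fs, f_min=55.0, n_octaves=6, dt=0.003):
    """Log-scaled spectrogram on a 1/12-octave frequency grid
    (df = 1/12 octave, dt = 3 msec), as described above."""
    nperseg = int(0.025 * fs)                   # 25-ms window (assumption)
    f, t, Z = stft(x, fs=fs, nperseg=nperseg,
                   noverlap=nperseg - int(dt * fs))
    power = np.abs(Z) ** 2
    centers = f_min * 2.0 ** (np.arange(12 * n_octaves) / 12.0)
    idx = np.abs(f[None, :] - centers[:, None]).argmin(axis=1)
    return np.log(power[idx, :] + 1e-12)

# each 300-ms fragment of the signal yields one spectrogram, which is
# flattened into one column of the data matrix V for NMF
```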
Algorithm to convert spectrograms back into sound signals First, the peak frequencies at each discrete time point of a target spectrogram were picked out. Second, the amplitudes at each peak frequency were interpolated across the discrete time points, so that sound signals with an arbitrary sampling frequency could be made. Then, sine waves whose phase was zero at time zero were generated for each peak frequency with the estimated amplitudes. By summing all the sine waves with their different frequencies, sound signals with the arbitrary sampling frequency were obtained from the spectrogram.
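A minimal sketch of this conversion, assuming the peak frequencies and their frame-by-frame amplitudes have already been extracted from the spectrogram:

```python
import numpy as np

def resynthesize(peaks, t_frames, fs=22050, duration=0.21):
    """peaks: list of (f0, amps) pairs, where amps gives the amplitude
    of peak frequency f0 at the discrete frame times t_frames.
    Returns the sum of zero-phase sine waves (cf. Fig.4)."""
    t = np.arange(int(duration * fs)) / fs
    x = np.zeros_like(t)
    for f0, amps in peaks:
        a = np.interp(t, t_frames, amps)      # interpolate the amplitudes
        x += a * np.sin(2 * np.pi * f0 * t)   # phase zero at time zero
    return x
```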
[Figure 4: Conversion from a spectrogram back into a sound signal. Open circles show example amplitude values for a specific peak frequency in a spectrogram at each discrete time point (0–10); the interpolation is shown as a dashed line, and the reconstructed sound signal for this peak frequency as a solid line. Axes: amplitude vs. time point.]
3 Results

In the first section, NMF was applied to various datasets, such as facial images, to characterize NMF. The results of the factorization by NMF were also compared with those of other methods, such as PCA, to validate the use of NMF. In the second section, NMF was applied to datasets of spectrograms obtained from music data. See the supplementary data for the sound signal files (WAVE format).
3.1 Characterization of NMF
Example 1: Facial images To reproduce the result of D. D. Lee and H. S. Seung (ref.[2]), I first applied NMF to m = 352 facial images, each consisting of n = 20 × 20 pixels, with a rank of factorization of r = 36. Although the number of original images in V [n × m] was about one-seventh of that used by Lee and Seung (m = 2429, n = 19 × 19), consistent results were obtained: the basis images W given by NMF (V ≈ W H) showed a parts-based representation, as shown in figure 5. A minimal usage sketch follows.
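Using the nmf sketch from section 2.1, this experiment would amount to something like the following (the preprocessing of the images into the matrix V is assumed):

```python
# V: 400 x 352 matrix; each column is one 20x20 face image
W, H = nmf(V, r=36)
basis_images = W.T.reshape(36, 20, 20)   # the r = 36 basis images of Fig.5
```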
[Figure 5: NMF finds an approximate factorization of the form V ≈ W H. The original dataset V (n = 20×20 pixels, m = 352 images) is compressed into W (r = 36 basis images) and the coefficients H; the coefficients H_j reconstruct the j-th image from the bases. Each of the obtained basis images in W shows distinct parts of faces. For example, the bases at (column, row) = (4, 3) and (2, 5) show "eyes," and those at (3, 2) and (5, 3) show "mouth."]
Example 2: Kanji Chinese characters As another example, I applied NMF to a dataset of Chinese characters. Because some Chinese characters consist of a combination of "left" and "right" parts, or of "top" and "bottom" parts, I expected NMF to extract those parts that are used frequently in the original dataset. The factorization with rank r = 25 showed that some basis elements indeed represented frequently used parts (Fig.6). In contrast, PCA and ICA failed to find any "meaningful" basis elements and instead learned a holistic representation (Fig.6, data not shown).
[Figure 6: NMF found "meaningful" parts while ICA failed. NMF and ICA were applied to the dataset of m = 1006 kanji Chinese characters (n = 30 × 30 pixels). Some of the basis images obtained by NMF (25 basis images) showed "meaningful" parts of the characters: 41 characters contain the 消-type part (e.g., 港海池酒流深泳…), 31 contain the 計-type part (e.g., 評記話語読詩談…), 31 contain the 村-type part (e.g., 村橋樹梅根桜枝…), and 26 contain the 細-type part (e.g., 絵終細紙線絹縮…). In contrast, ICA (25 basis images) learned a holistic representation.]
Example 3: Dataset comprised of non-overlapping arbitrary bases To confirm that NMF can really extract the basis images underlying a dataset, I prepared a set of 1000 images V, each generated by a randomly weighted (H) summation of the arbitrary non-overlapping bases W_ini (V = W_ini H; see the Methods section and Fig.2 for details). NMF was then applied to see whether the dataset V would be separated back into the non-overlapping bases W_ini. When the rank of factorization was chosen correctly, that is, equal to the true number of basis images W_ini (r = 25 in the case of Fig.2), the bases W obtained by NMF had a one-to-one correspondence to the original W_ini (Fig.7). When the rank was chosen far smaller or larger than the correct value, however, the obtained basis images showed quite different patterns from W_ini (data not shown). When the rank was chosen only a little smaller or larger than the correct value, some basis images were corrupted or redundant, but the others retained a good one-to-one correspondence to W_ini (data not shown). These results show that it is important to choose an appropriate value for the rank of factorization r, although I have not come up with a good criterion or theory for doing so (see Discussion).
[Figure 7: The original non-overlapping bases W_ini (25 basis images, n = 25×25 pixels) and those obtained by NMF, W_r (r = 25 basis images), from the dataset V = W_ini H (m = 1000; H, random coefficients). Note the one-to-one correspondence between W_ini and W_r; the numbering was done by hand.]
Example 4: Dataset comprised of overlapping arbitrary bases I then generated datasets by additive combinations of the arbitrary overlapping bases W_ini with either random coefficients (V_0 = W_ini H_0) or random sparse coefficients (V_s = W_ini H_s; see the Methods section and Fig.3 for details). The basis images obtained by NMF with the correct rank of factorization (r = 25) showed that NMF worked better when each image of the dataset was a sparse additive combination of the original bases W_ini (Fig.8). In contrast, PCA and ICA failed to break the dataset V_s back into the original bases W_ini, although some of the basis images obtained by ICA did have a one-to-one correspondence to W_ini (Fig.8). These results support the use of NMF for sound analyses, because sound signals are supposed to consist of a sparse combination of sound dictionary elements (ref.[1]).
[Figure 8: The original overlapping bases W_ini (n = 20×20 pixels) and those obtained by NMF (r = 25) from V_0 = W_ini H_0 (random coefficients) and from V_s = W_ini H_s (random sparse coefficients), as well as by PCA and ICA. The basis images obtained by NMF from V_s showed a good one-to-one correspondence to W_ini (except for #18), while the other methods failed to show good fidelity.]
3.2 NMF analyses on sound signals
Solo music Several sets of spectrograms of solo music (cello, violin, flute, piano, etc.) were used for NMF analyses. As shown in figure 9, when NMF was applied to cello solo music, the obtained basis images showed characteristic patterns for the cello, each representing a different peak frequency or "note." Moreover, when they were converted back into sound signals, they did sound like a cello (see Supplementary data). The same was true for other instruments such as violin and flute, although NMF did not work very well at extracting the characteristic patterns or "sounds" of piano solo music (data not shown). Interestingly, the basis spectrograms could be categorized into onset, continuous, and ending patterns for each instrument. In addition, because these basis spectrograms form the harmonics of each instrument (see Fig.9 for an example), and because they did sound like harmonics when converted back into sound signals (see Supplementary data), it can be said that NMF separated the sound signals into sound dictionary elements for each note of each instrument.
[Figure 9: Application of NMF to cello solo music. The original music is "J.S. Bach, Cello Suite No.1, Prelude" (∼2′27″, 22.05 kHz). The dataset consists of m = 983 log-scaled spectrograms (n = 6532; ∆f = 1/12 octave, ∆t = 3 msec) with a time range of 210 msec. The left two panels show example basis images obtained by NMF (r = 49), #12 and #25, both of which clearly show the characteristic pattern of a string instrument, i.e., many small resonance peaks in addition to the main peak frequency; #12 has an onset pattern and #25 a continuous pattern (axes: 55–3520 Hz vs. 200 msec). The right panel shows the power spectra of all the sound dictionaries over the whole time range of the spectrograms; each has a different peak frequency or "note," and together they form the harmonics of the cello sounds.]
Music played with several different instruments I then prepared datasets of music played with several different instruments, to attempt source separation based on the basis elements obtained by NMF. When NMF was applied to a quartet (flute, violin, viola, and violoncello), the obtained basis spectrograms (r = 64) could be categorized by hand into at least two groups. Spectrograms in one category showed a single peak frequency, a "flute" pattern, while those in the other category showed a strong peak frequency with several resonance frequencies, a "string-instrument" pattern (Fig.10). Based on this categorization, when only "flute"-pattern spectrograms were used to reconstruct sound signals, I obtained the flute part alone of the original music (see Supplementary data). Conversely, when the "string-instrument"-pattern spectrograms were used for the reconstruction, I obtained sound signals without the "flute" part (Fig.10; see Supplementary data). This shows that source separation can be done based on the sound dictionaries given by NMF; the same was true for other cases such as source separation between trumpet and pipe organ⁵, or between flute and harp⁶ (data not shown). A sketch of the reconstruction step follows.
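As a sketch of this categorize-and-reconstruct step (the index lists below are hypothetical placeholders; the actual categorization was done by hand):

```python
import numpy as np

# after W, H = nmf(V, r=64): split the columns of W by hand
flute_idx = np.array([31])                 # e.g., basis #32 ("flute" pattern)
string_idx = np.setdiff1d(np.arange(64), flute_idx)

V_fl = W[:, flute_idx] @ H[flute_idx, :]   # V_fl = W_fl H_fl
V_st = W[:, string_idx] @ H[string_idx, :] # V_st = W_st H_st
# each column of V_fl / V_st is then converted back into sound signals
# with the algorithm of section 2.4
```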
[Figure 10: Source separation based on the basis elements obtained by NMF. The original music is "W.A. Mozart, Flute Quartet in D, K.285, Rondeau" (∼1′39″, 22.05 kHz). The dataset consists of m = 659 log-scaled spectrograms (n = 6615; ∆f = 1/12 octave, ∆t = 3 msec) with a time range of 210 msec. The basis spectrograms W obtained by NMF (V ≈ W H, r = 64) were categorized by hand into two groups, W → W_fl + W_st, based on the pattern of peak frequencies; basis #32 is one of the "flute"-pattern bases W_fl, and basis #6 is one of the "string-instrument"-pattern bases W_st. Source separation was then done based on this categorization: V_fl = W_fl H_fl (flute part) versus V_st = W_st H_st (the other, string parts).]
⁵ J.S. Bach, Cantata "Jesu, Joy of Man's Desiring," BWV 147 (∼3′45″, 16 kHz) ⇒ log-scaled spectrogram dataset V [n = 6090 × m = 1427], rank of factorization r = 64
⁶ W.A. Mozart, Concerto for flute and harp in C, K.299: 3. Rondeau, Allegro (∼8′49″, 16 kHz) ⇒ log-scaled spectrogram dataset V [n = 6090 × m = 3527], rank of factorization r = 144
4 Discussion
I have described some characteristics of NMF in comparison with PCA and ICA, shown how NMF breaks down a set of spectrograms into basis elements, and shown that the basis elements of sound signals can be used for source separation of the original music pieces.
This work could contribute to solving the monaural source separation problem theoretically. The algorithm proposed for this problem by B. A. Pearlmutter and A. M. Zador requires two kinds of prior knowledge: the HRTF and the sound dictionaries (ref.[1]). The HRTF itself can be obtained numerically by careful measurements⁷, and biologically speaking, organisms would learn their own HRTF by experience with the aid of visual cues. The assumption that we know the HRTF is therefore quite reasonable. On the other hand, although we can identify piano or flute sounds as "piano" or "flute," it has not been addressed how we acquire this knowledge of the sound dictionaries. Here I used NMF to learn representative subsets of spectrograms for each instrument, and showed that it is possible to do source separation based on the bases obtained by NMF. Because this result suggests that NMF is a possible way to learn sound dictionaries, it would be very interesting to use these basis sound elements as part of the prior knowledge for the monaural source separation algorithm.
In this study, I used music pieces as sound signals because I could reasonably expect the elements of music to be notes. As expected, the basis elements obtained by NMF showed different notes for each instrument, and on the whole they formed the harmonics of the instruments. Now that we know NMF works on music, it would be very interesting to apply NMF to natural or vocal sounds to see what the basis sound elements of nature or of languages are. In addition, to test the biological meaning of this algorithm, it would be intriguing to record neural responses to the basis elements of sound signals. In the visual system, it is well known that rods and cones detect light of specific wavelengths, while neural responses in the visual cortex depend on the orientation of lines and edges and on their movement in specific directions (ref.[6]). The "receptive fields" of the auditory system, by contrast, have not been characterized very well. Considering that the sound dictionaries are obtained by a learning process such as NMF, and that our responses to familiar sounds are usually better than those to unfamiliar ones, I think it is possible that sound dictionaries obtained by NMF form "receptive fields" at some level of the auditory system.
4.1 The rank of factorization for NMF
A big problem for NMF is that the rank of factorization r is an arbitrary number, which is a classic problem in modeling in general. Although the rank of factorization is critical for the results, as shown here, there seems to be no good criterion for choosing an appropriate value of r. To compress a data matrix V [n × m] into W [n × r] · H [r × m], r should satisfy the inequality nm > nr + rm. This is the only constraint on r, but unfortunately it is not very helpful for determining an appropriate value.
In the case of music, because I expected the sound dictionary elements to be the notes of each instrument, I could estimate a value for r in advance, and it worked quite well. In other cases, such as natural sounds, it would be necessary to use cross-validation, for example with AIC (Akaike's information criterion) as an error criterion; one possible scheme is sketched below.
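As one possible cross-validation scheme (my own sketch, not part of the original analyses): hold out a fraction of the columns of V, fit W on the rest, re-fit only H on the held-out columns, and choose the r that maximizes their Poisson log-likelihood, i.e., the cost F of equation (1).

```python
import numpy as np

def heldout_F(V, r, frac=0.2, seed=0, eps=1e-12):
    """Score rank r by the held-out value of the cost F (equation (1))."""
    rng = np.random.default_rng(seed)
    test = rng.random(V.shape[1]) < frac
    W, _ = nmf(V[:, ~test], r)                 # nmf as in section 2.1
    H = rng.random((r, test.sum()))            # re-fit H with W fixed
    for _ in range(200):
        H *= W.T @ (V[:, test] / (W @ H + eps))
    WH = W @ H
    return np.sum(V[:, test] * np.log(WH + eps) - WH)

# e.g., pick the r maximizing the held-out score:
# scores = {r: heldout_F(V, r) for r in (25, 49, 64, 100, 144)}
```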
⁷ For example, see the CIPIC HRTF database: http://interface.cipic.ucdavis.edu/CIL_html/CIL_HRTF_database.htm
Acknowledgements
I thank Anthony Zador, Barak Pearlmutter, and all the members of the Zador laboratory for helpful discussions. I thank Christian Machens for allowing me to use his Matlab code for calculating logarithmic-scale spectrograms. This work was supported by the Farish-Gerry Fellowship of the Watson School of Biological Sciences.
References

[1] B. A. Pearlmutter and A. M. Zador. Monaural source separation using spectral cues. Submitted.

[2] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.

[3] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. Adv. Neural Info. Proc. Syst., 13:556–562, 2001.

[4] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.

[5] A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Comp., 9:1483–1492, 1997.

[6] T. N. Wiesel and D. H. Hubel. Spatial and chromatic interactions in the lateral geniculate body of the rhesus monkey. J. Neurophysiol., 29:1115–1156, 1966.
Supplementary data: list of WAVE files

This is a list of some example WAVE files for the NMF analyses.

Solo music (cello)
J.S. Bach, Cello Suite No.1, Prelude (∼2′27″, 22.05 kHz)
• cello-ori.wav : A part of the original sound signals
• cello-sp.wav : Sound signals for the dataset V (before NMF)
• cello-nmf.wav : Sound signals for the compressed data W H (after NMF)
• cello12.wav : Sound signals for the basis spectrogram #12
• cello25.wav : Sound signals for the basis spectrogram #25
• cello-dic.wav : Sound signals for all the basis spectrograms

Quartet
W.A. Mozart, Flute Quartet in D, K.285, Rondeau (∼1′39″, 22.05 kHz)
• 285-ori.wav : A part of the original sound signals
• 285-sp.wav : Sound signals for the dataset V (before NMF)
• 285-nmf.wav : Sound signals for the compressed data W H (after NMF)
• 285-6.wav : Sound signals for the basis spectrogram #6
• 285-32.wav : Sound signals for the basis spectrogram #32
• 285-dic.wav : Sound signals for all the basis spectrograms
• 285-fl.wav : Sound signals for only the "flute" part V_fl = W_fl H_fl
• 285-st.wav : Sound signals for the other string parts V_st = W_st H_st