IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING , VOL. 14, NO. 5, SEPTEMBER 2006
Separation of Synchronous Pitched Notes
by Spectral Filtering of Harmonics
Mark R. Every and John E. Szymanski
Abstract—This paper discusses the separation of two or more simultaneously excited pitched notes from a mono sound file into separate tracks. In fact, this is an intermediate stage in the longer-term
goal of separating out at least two interweaving melodies of different sound sources from a mono file. The approach is essentially
to filter the set of harmonics of each note from the mixed spectrum
in each time frame of audio. A major consideration has been the
separation of overlapping harmonics, and three filter designs are
proposed for splitting a spectral peak into its constituent partials
given the rough frequency and amplitude estimates of each partial
contained within. The overall quality of separation has been good
for mixes of up to seven orchestral notes and has been confirmed
by measured average signal-to-residual ratios of around 10–20 dB.
Index Terms—Music note separation, partial extraction, separation of overlapping harmonics.
I. INTRODUCTION
THIS paper presents a data-driven approach, based upon an analysis in the spectral domain, to separating multiple simultaneously excited pitched notes from a mono recording. An
attempt has been made to separate a mix of between two and
seven notes into the same number of tracks plus a residual. The
notes have approximately equal energies, are initially excited simultaneously and have almost the same duration. This research
is ultimately directed at separating a longer recording of an instrumental ensemble into its constituent instrumental parts or
melodic lines.
Potential applications of the aforesaid “mono-to-multitrack”
system are numerous. For example, classic recordings only
available in mono could be separated into individual instrumental parts, remastered track by track, and remixed again,
potentially even with new instruments. Alternatively, one
might want to remove a disturbing cough in a live recording,
and this might be achieved by separating the recording into
harmonic and residual components. Other applications exist in
the areas of effects processing, audio spatialization, restoration,
structured compression and coding, and music cataloguing and
retrieval.
The basic idea in this approach to the problem is that if the
pitch of each note in the mix is known in a particular time frame,
Manuscript received September 16, 2004; revised June 30, 2005. This work
was carried out when the authors were with the Department of Electronics, University of York, York, U.K. The associate editor coordinating the review of this
manuscript and approving it for publication was Dr. Gerald Schuller.
M. R. Every is with CVSSP, SEPS, University of Surrey, Guildford, GU2
7XH, U.K. (e-mail: [email protected]; [email protected]).
J. E. Szymanski is with the Media Engineering Research Group, Department
of Electronics, University of York, Heslington, York, YO10 5DD, U.K. (e-mail:
[email protected]).
Digital Object Identifier 10.1109/TSA.2005.858528
then it is possible to identify the harmonics of each note in the
spectrum, and then to construct comb-like filters to filter the set
of harmonics of each note out of the composite spectrum. Pitch-based separation techniques have already been applied to speech
separation and enhancement, for example, in [1]–[3], and [4]
reviews a wide range of approaches to sound segregation related
to auditory scene analysis. In [1], vocalic speech was separated
from a mix of two competing talkers. There, crosstalk arising
from spectral peaks that were shared by the two speakers was
identified as a cause of degradation in the separated waveforms.
This issue becomes even more important when one is attempting
to separate out more than two pitched sources, since one would
expect many more partials to be overlapping. Furthermore, as
the occurrence of overlapping partials is far more common in
music than in speech due to the tendency to play harmonically
related notes together, the treatment of overlapping harmonics
is highly relevant to music source separation.
The full separation task consists of 1) detecting all salient
spectral peaks while the spectrum typically contains some low-level broadband energy due to noise and spectral leakage, 2)
estimating note pitch trajectories over all time frames using a
multipitch estimator, 3) matching spectral peaks to note harmonics, and 4) constructing filters to remove the individual note
spectra from the mixed spectrum. A multipitch estimator is used
to estimate the pitch trajectory of each note in the mix over all
time frames. Although accuracy in the former stages is essential
to achieving a realistic separation, we prefer here to emphasise
the filtering stage, and, in particular, the problem of overlapping
harmonics, which we feel is the area that has been least explored.
To clarify what class of sounds these algorithms have been
applied to, by the term “pitched,” it is implied that a note has
a perceivable pitch, and it contains most of its energy in harmonics that are at roughly integer multiples of the fundamental
frequency. The algorithms have been tested on bassoon, cello,
B♭ clarinet, E♭ clarinet, flute, French horn, oboe, piano, saxophone, trombone, and violin samples.
II. METHOD
To begin with, the original mix x(t) = Σ_m x_m(t) of the individual notes is split into overlapping time frames, and a fast Fourier transform (FFT) is computed on the Hamming windowed data, F(k) = FFT{w(t) x(t)}, where w(t) is the window function. Time frames of 186 ms in length (FFT length N samples at a sampling rate f_s = 44.1 kHz) have been used with an 87.5% overlap. F(k) indicates the value of the complex FFT spectrum at frequency bin k, and f = k f_s / N is the corresponding frequency in Hertz.
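As a concrete illustration of this analysis front end, the framing and FFT stage might be sketched as follows (the 8192-sample frame length is our own choice, consistent with the stated 186 ms at 44.1 kHz; the function and variable names are illustrative):

```python
import numpy as np

def analysis_frames(x, frame_len, hop):
    """Split a mono signal into overlapping Hamming-windowed frames and
    return the complex FFT spectrum F(k) of every frame."""
    w = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([np.fft.fft(w * x[i * hop:i * hop + frame_len])
                     for i in range(n_frames)])

fs = 44100                          # sampling rate f_s in Hz
frame_len = 8192                    # ~186 ms at 44.1 kHz (our choice of N)
hop = frame_len // 8                # 87.5% overlap between frames
x = np.random.randn(fs)             # stand-in for the mixed note signal x(t)
F = analysis_frames(x, frame_len, hop)
freqs = np.arange(frame_len) * fs / frame_len   # f = k * f_s / N in Hertz
```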
Fig. 1. Thresholding and peak picking of the amplitude spectrum |F(k)| using the frequency-dependent threshold th · Ê(k).
A. Spectral Peak Identification
A reliable method for detecting spectral peaks was necessary for multipitch estimation and to locate harmonics in the spectrum. Peak detection was performed successfully at frequencies up to the Nyquist limit by thresholding |F(k)| with a frequency-dependent threshold th · Ê(k), where Ê(k) is the shape of the threshold and th is a frequency-independent threshold height. Peak picking was then performed on the resulting thresholded spectrum.

The reason for using a variable threshold is that the typical rolloff of harmonic amplitudes at higher frequencies often results in higher harmonics being too small to be detected by applying a constant threshold to |F(k)|. These higher harmonics are, however, very helpful for pitch estimation and are perceptually significant.
Ê(k) was arrived at in the following manner. The smoothed amplitude envelope E(k) was calculated by convolving |F(k)| with a normalized Hamming window of odd length. An odd-numbered window length was chosen for symmetry reasons, i.e., the calculation of E(k) then involves a weighted sum of terms |F(k′)| at an equal number of bins on either side of bin k. An alternative method for calculating the spectral envelope is the regularized calculation of the cepstrum coefficients [5]. The Hamming windowing method was preferred due to its computational efficiency and effectiveness, and the fact that the latter method involves the calculation of a matrix inversion which was sometimes found to be numerically unstable. Then, we define Ê(k) = E(k)^θ up to the Nyquist limit, where a suitable range for θ is [0.5, 1); a value in this range was used for the results given here. Smaller values of θ produce a flatter envelope, and this helps to avoid spurious peaks being detected in regions of low spectral amplitude. To satisfy scaling invariance, th must scale as the (1 − θ)th power of the overall signal amplitude. Fig. 1 shows the amplitude spectrum |F(k)| and the threshold th · Ê(k) for a mix of two violin notes with pitches A5 (880 Hz) and E6 (1319 Hz).
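A minimal sketch of this envelope-based threshold (the window length, θ, and th values below are illustrative choices of ours, not values from the text):

```python
import numpy as np

def adaptive_threshold(mag, win_len=101, theta=0.7, th=2.0):
    """Frequency-dependent threshold th * E(k)^theta, where E(k) is |F(k)|
    smoothed by a normalized Hamming window of odd length (odd so the
    weighted sum is symmetric about bin k)."""
    assert win_len % 2 == 1
    w = np.hamming(win_len)
    w /= w.sum()
    env = np.convolve(mag, w, mode='same')
    return th * env ** theta

# Bins whose amplitude exceeds the threshold are candidate peak regions.
mag = np.ones(1000)
mag[300] = 100.0                    # a lone spectral peak
detected = mag > adaptive_threshold(mag)
```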
Next, a search was made to find all local maxima in |F(k)| above the threshold. A frequency bin k was considered to be a peak maximum if

|F(k)| > v_i |F(k − i)|  and  |F(k)| > v_i |F(k + i)|,  i = 1, …, L_v    (1)

where each v_i is in the range (0, 1] and L_v is the length of the vector v. This peak picking algorithm incorporates the simplest case, v = [1], of checking whether the amplitude in each discrete Fourier transform (DFT) bin is larger than only its nearest neighbors, but can also be adapted to more noisy spectra by assigning a longer vector to v. The algorithm is not computationally expensive and is easy to implement, although a systematic comparison has not yet been made with other methods for peak-picking, such as sinusoidal modeling of the DFT spectrum [6].
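The test in (1) can be implemented directly; a sketch, with an illustrative choice of v:

```python
import numpy as np

def pick_peaks(mag, v=(1.0,)):
    """Return bins k with mag[k] > v[i-1] * mag[k -/+ i] for i = 1..len(v).
    v = (1.0,) is the simple nearest-neighbor test; a longer v makes the
    test usable on noisier spectra."""
    L = len(v)
    peaks = []
    for k in range(L, len(mag) - L):
        if all(mag[k] > v[i - 1] * mag[k - i] and mag[k] > v[i - 1] * mag[k + i]
               for i in range(1, L + 1)):
            peaks.append(k)
    return peaks

mag = np.array([0., 1., 0., 0., 5., 1., 0.2, 0., 3., 0.])
```

For this toy spectrum, `pick_peaks(mag)` finds the local maxima at bins 1, 4, and 8.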
Once the spectral peaks had been identified, a refinement of the center frequency of each DFT maximum was made to sub-frequency-bin resolution using a DFT frequency interpolator. At the same time, the peak amplitudes were refined using an amplitude interpolator. A number of interpolation methods were examined [7]: Quinn's first and second interpolators, Grandke's interpolator, the quadratic interpolator, the barycentric interpolator, and the DFT method implemented in the software package InSpect [8]. The accuracy of the various DFT interpolators depends on the type of windowing applied to the data. It was found that the DFT and Grandke's methods were both suitable frequency and amplitude interpolation methods for Hamming windowed data, although both methods are in fact more accurate for Hanning windowed data. The DFT method involves the calculation of two FFTs in each time frame, and is hence not as computationally efficient as
Grandke's method. However, the DFT method performed marginally better than Grandke's method in preliminary tests measuring the frequency and amplitude interpolation errors for a sinusoid in white noise, in which the SNR was varied up to 20 dB, and it was hence chosen as the preferred interpolator. Another method for estimating peak frequencies [9] follows from minimizing, in a least-squares sense, the difference between the observed spectrum and the first-order limited expansion of the Fourier transform of the window function around each peak.

Fig. 2. Estimated pitch trajectories of two synchronous flute notes played with vibrato, using the multipitch estimator. The reference lines show the transcribed note pitches (G5 = 784 Hz and A5 = 880 Hz).
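For comparison, the quadratic interpolator mentioned above can be sketched as follows; this is the standard log-amplitude parabolic fit, not the DFT or Grandke methods preferred in the text:

```python
import numpy as np

def quadratic_interp(mag, k):
    """Refine the peak at integer bin k by fitting a parabola through the
    log-amplitudes at bins k-1, k, k+1. Returns the fractional peak
    position in bins and the interpolated peak amplitude."""
    a, b, c = np.log(mag[k - 1]), np.log(mag[k]), np.log(mag[k + 1])
    d = 0.5 * (a - c) / (a - 2 * b + c)      # offset in (-0.5, 0.5)
    amp = np.exp(b - 0.25 * (a - c) * d)     # log-parabola value at the vertex
    return k + d, amp

# A Hamming-windowed sinusoid at a fractional bin frequency:
n = np.arange(1024)
mag = np.abs(np.fft.fft(np.hamming(1024) * np.cos(2 * np.pi * 100.3 * n / 1024)))
f_est, amp = quadratic_interp(mag, int(np.argmax(mag[:512])))
```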
B. Multipitch Estimation
Following spectral peak detection, and given a priori the number of notes in the mix, a multipitch estimator was designed and used to estimate the pitch trajectories of all notes in the mix. As there is typically some variation in pitch over the note duration, for example, due to vibrato, and since the filtering stage is sensitive to slight pitch variations, it was necessary to estimate the pitches of all notes in every time frame. Pitch estimates in individual frames were then combined in such a way as to form smooth note pitch trajectories while ignoring isolated and clearly incorrect pitch estimates. Fig. 2
shows the results of using the multipitch estimator to estimate
the pitch trajectories of two flute notes played with vibrato. The
multipitch estimator, while being moderately reliable for 2–3
synchronous notes, was markedly worse at higher polyphonies
as discussed in Section IV, and its implementation will not be
expanded upon since better note error rates have been reported
for another multipitch estimator [10].
C. Estimating the Harmonic Frequencies
Once the pitch trajectories of all notes were calculated, then,
in a particular time frame, each detected spectral peak could potentially be matched with any single note that contained a harmonic within a limited range about the peak center frequency.
A match was not made when more than one harmonic from different notes happened to exist within this range. In this case, we
refer to these as overlapping harmonics, and since their amplitude spectra are likely to be significantly overlapping, the resulting spectral content shared by the harmonics will be called
an overlapping peak. To clarify, although, in Fig. 3, two peaks
were detected in the peak detection stage, the term “overlapping
peak” refers to the entire peak shared by both harmonics. On the
other hand, a “nonoverlapping peak” is a spectral peak that is
within range of at most one harmonic. Overlapping peaks will
be discussed in Section II-E. Following the matching process
and an adequate treatment of overlapping peaks, a filter was designed for each note, whose effect when multiplied by the DFT
spectrum, was to remove from the spectrum the peaks that were
matched uniquely to harmonics of that note, and a portion of the
energy in any overlapping peaks that the unmatched harmonics
of this note may have contributed to. The result of filtering the
composite spectrum with this set of filters was a segmentation
of the mixed spectrum into several constituent spectra corresponding to each note, and a residual basically containing the
low-level noise envelope of the mixed spectrum and any inharmonic partials. For the time being, we consider nonoverlapping spectral peaks.
A spectral peak was matched to a note m if its frequency f_p was within a fractional range ±δ of any predicted harmonic frequency f̂_m^h, where f̂_m^h is the frequency of the hth harmonic of note m. Typically, δ was chosen to be in the range [0.01, 0.1]. If more than one peak was found within this range of f̂_m^h, the largest peak was matched with note m and the others ignored. An identity f̂_m^h = h f_m^0, where f_m^0 is the pitch of note m, was not used in the above expression for the harmonic frequencies for the following reasons. First, the deviation of harmonic frequencies from exact harmonicity can be quite significant, especially in piano notes, which will be discussed shortly. Second, any inaccuracy in a pitch estimate f_m^0 would be compounded when multiplying by h to find the hth harmonic frequency. Last, separation results were improved by using nonrigid estimates of the harmonic frequencies, and this is believed to be partly due to the better frequency localization of constituent harmonics in overlapping spectral peaks. The procedure for extrapolating estimates of the f̂_m^h was, first, to determine the frequency of the fundamental frequency component of note m. If there existed a nonoverlapping peak within range of f_m^0, then the first harmonic frequency was set
to the peak frequency f_p; otherwise, it was set to the pitch estimate f_m^0. The remaining harmonic frequencies of note m were calculated iteratively according to

f̂_m^{h+1} = ((h + 1)/h) f_p, if harmonic h was matched to a nonoverlapping peak at frequency f_p;  f̂_m^{h+1} = ((h + 1)/h) f̂_m^h, otherwise    (2)

where f̂_m^{h+1} is the predicted frequency of the (h + 1)th harmonic of note m. As (2) relies on knowing whether the peak is nonoverlapping, and this, in turn, depends on whether multiple harmonics of separate notes have been predicted in the local vicinity of this peak, this iterative process had to be performed concurrently for all pitches. The synchronization of these iterative processes was determined by always applying the next iteration of (2) to the note corresponding to the minimum of the set of current predictions {f̂_m^{h_m+1}}, where f̂_m^h means the hth harmonic of note m. To begin with, h_m = 1 for every note, and h_m is incremented with each iteration applied to note m.

We found that, as exact harmonicity was not enforced above, i.e., the predicted harmonic frequencies were allowed to shift slightly to coincide with nonoverlapping peak frequencies, the above method was flexible enough to detect partials slightly detuned from exact harmonicity. However, in the case of the piano, its partial frequencies are stretched rather more substantially than in most Western instruments, according to [11]

f_h = h f_0 (1 + B h²)^{1/2}.    (3)

This comes about by physical consideration of the piano string stiffness in the equation of motion for transverse waves in a vibrating bar. B is the inharmonicity coefficient, and, for a typical value of B in the middle register, the 17th partial would be shifted to about the frequency the 18th partial would have had, had the note been purely harmonic (B = 0). An iterative method was tested for producing better estimates of the predicted frequencies f̂_m^h in (2), based upon (3). A prediction can be made by forming a joint estimate of f_0 and B using a least-squares error minimization of (3) over the first h − 1 detected partials, and then substituting f_0 and B back into (3) to find f̂_m^h. This improved tracking mainly of higher piano partials, which become progressively further apart in frequency as h increases. The technique was not used, in general, however, because of the increase in computation time arising from the estimation of B. It predictably yielded estimates of B around zero when applied to other pitched orchestral instruments.

Fig. 3. Filtering of a spectral peak arising from two overlapping harmonics: a) construction of the filters H_m(k) using (5) and (6) is determined by the predicted harmonic frequencies f̂_1 and f̂_2 and predicted harmonic amplitudes Â_1 and Â_2; b) comparison of the filtered and original amplitude spectra of the individual harmonics.
D. Filtering of Nonoverlapping Harmonics
The width of a nonoverlapping spectral peak p, centered at the frequency bin k_p, was found by searching for the first minima in |F(k)| at frequency bins k_p^− and k_p^+ on opposite sides of the main lobe. If the peak was matched with note m, the amplitude of the filter H_m(k) was set to unity across the entire width of the peak, k_p^− ≤ k ≤ k_p^+. Appreciably better results were achieved using these variable width filter notches at the peak frequencies, rather than using fixed width filter notches. Thus, in the resynthesis stage, when H_m(k) is multiplied by the DFT spectrum, this has the effect of filtering the entire main lobe of the peak from the original mixed spectrum.

Since x(t) is real, it follows that the DFT spectrum is complex conjugate symmetric about the Nyquist frequency, i.e., F(N − k) = F*(k). Thus, frequency components above the Nyquist limit are easily removed from the mixed spectrum by using H_m(N − k) = H_m*(k). However, it is only in (10) that an imaginary component in H_m(k) actually exists.
An advantage of the approach used here as opposed to models
in which separated harmonics are synthesized using well-behaved sinusoids [10], [12]–[15] is that, since the amplitude of
the filter notches is unity across the width of all the nonoverlapping peaks matched to harmonics, then the residual contains
at most some traces of the detected harmonics due to sidelobes,
which do not tend to be noticeable. This also holds for overlapping peaks due to a normalization (5), which will be discussed in
Section II-E. In the case of sinusoidal models, if the harmonics
are not well modeled by sinusoids with slowly time-varying amplitude and frequency, and the residual is calculated by subtracting the set of sinusoids from the original waveform, then
there could be some leakage of the harmonics into the residual.
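The variable-width unity notch described above can be sketched as follows (the function and variable names are ours):

```python
import numpy as np

def unity_notch(mag, k_peak, H):
    """Set the filter H to unity across the full main lobe of the peak at
    bin k_peak: search outward from the maximum for the first local minima
    of mag on either side, and fill in between."""
    lo = k_peak
    while lo > 0 and mag[lo - 1] < mag[lo]:
        lo -= 1
    hi = k_peak
    while hi < len(mag) - 1 and mag[hi + 1] < mag[hi]:
        hi += 1
    H[lo:hi + 1] = 1.0
    return H
```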
EVERY AND SZYMANSKI: SEPARATION OF SYNCHRONOUS PITCHED NOTES
E. Filtering of Overlapping Harmonics
Previous approaches to separating overlapping partials in
both speech and music fields include those that rely on sinusoidal models [1], [10], [12]–[15], a perceptually motivated
smoothing of the amplitude spectrum of each source [12], linear
models for overtone amplitudes [15], spatial mixing models
[16], and a multistrategy approach [17]. Of the techniques
based upon a sinusoidal model, [1] iteratively subtracts larger amplitude partials to reveal partially hidden weaker partials; [12] and [15] iteratively estimate the phases and amplitudes of closely spaced sinusoids alternately with their frequencies; and [10], [13], and [14] use pre-estimated sinusoidal frequencies to calculate amplitude and phase estimates. In [1],
when two overlapping partials were closely spaced and of comparable amplitude, a linear amplitude interpolation between
neighboring harmonics was used to share the peak between the
two vocal sources. Amplitude modulation or beating resulting
from closely spaced sinusoids was used in [10] to resolve the
amplitude trajectories of closely spaced sinusoids. A multistrategy approach was employed in [17] for separating duet
signals. Beating was exploited if two partials were separated
by less than 25 Hz, and the duration of the overlap was longer
than two beat periods. When the overlap was shorter than two
periods, an amplitude interpolation method like the one in [1]
was used. When the partials were separated by between 25 and
50 Hz, a linear equations method was used to determine the
amplitudes of the two partials given the measured composite
spectrum, and partials separated by more than 50 Hz were
not considered to be overlapping. In [14], when the frequency
difference between intersecting sinusoids was less than 25 Hz,
a linear amplitude and cubic phase multiframe interpolation
method was used to interpolate the sinusoids between boundary
frames at which the amplitudes and phases of the individual
sinusoids could be resolved. Finally, a method was developed
in [16] for resolving overlapping partials across multiple time
frames, which combined spatial demixing techniques with
inference based on the fact that neighboring harmonics of a
single note usually have common amplitude and frequency
modulation characteristics. The method effectively estimates frequency masks similar to H_m(k) in the frequency regions where overlaps occur. The technique applies to additive mixing models in which a number of microphones in a room record different mixtures of the sources, where, in general, the number of microphones may be smaller than the number of sources.
Here, when dealing with overlapping harmonics, filters were designed to split the spectral content shared by a set of overlapping harmonics into the same number of parts using overlapping filters. Three filter designs are proposed for this purpose; the first two are alternative methods for partitioning the energy in a shared spectral peak, and the third uses a model of the sum of DFT spectra to recover the DFTs of the individual harmonics. The filter designs were all dependent on the extrapolated harmonic frequencies f̂_m^h, and it was also found beneficial to include some dependency on the predicted amplitudes Â_m^h of the harmonics. The Â_m^h were predicted using a simple linear interpolation between the amplitudes of the nearest harmonics of each pitch that were matched to nonoverlapping peaks.
1849
The first energy-based filter design for separating overlapping harmonics is

H_m(k) = Â_m exp(−(f_k − f̂_m)² / 2σ²),  k⁻ ≤ k ≤ k⁺    (4)

where f_k is the frequency in Hertz of bin k, σ is a width parameter, and k⁻ and k⁺ are the first minima in |F(k)| below and above the set of predicted harmonic frequencies. S is the set of the particular notes that contain a harmonic within the overlapping peak. For appearance, f̂_m^{h_m}, i.e., the predicted frequency of the harmonic of pitch m that constitutes a part of the overlapping peak, has been shortened to f̂_m, and likewise Â_m^{h_m} to Â_m. (4) is followed by a normalization to obtain

H̃_m(k) = H_m(k) / Σ_{m′∈S} H_{m′}(k).    (5)

The second energy-based filter design introduces a dependency on the Fourier transform W(f) of the continuous window function

H_m(k) = Â_m |W(f_k − f̂_m)|    (6)

and this is, again, normalized using (5) to obtain H̃_m(k). In practice, we approximate the continuous spectrum W(f) by the DFT of the zero-padded window function (a zero-padding factor of 64 has been used), and f_k − f̂_m is rounded to the nearest equivalent frequency bin. |W(f)| describes the shape of the window function in the spectral domain, which is a maximum at f = 0 and decreases as |f| increases. Fig. 3(a) illustrates the shape of the filters H̃_m(k) obtained using (5) and (6) applied to a spectral peak comprised of two harmonics. Fig. 3(b) compares the filtered amplitude spectra using these filters with the amplitude spectra of the original unmixed harmonics. This is, evidently, a good separation of the overlapping peak into its constituent harmonics.
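A sketch of the second energy-based design, i.e., (6) followed by the normalization (5), for two harmonics sharing one peak (the frame length, pitches, and amplitudes below are illustrative values of ours):

```python
import numpy as np

N, pad = 4096, 64
w = np.hamming(N)
W_dense = np.fft.fft(w, pad * N)    # zero-padded DFT approximating W(f)

def W_mag(d_bins):
    """|W(f)| at an offset of d_bins DFT bins, rounded to the nearest
    zero-padded bin (zero-padding factor 64, as in the text)."""
    return abs(W_dense[int(round(d_bins * pad)) % (pad * N)])

def overlap_masks(bins, f_hat, A_hat):
    """Filters H_m(k) = A_m * |W(f_k - f_m)| of (6), normalized per (5)
    so that the masks sum to one across the overlapping peak."""
    H = np.array([[A * W_mag(k - f) for k in bins]
                  for f, A in zip(f_hat, A_hat)])
    return H / H.sum(axis=0)

# Two harmonics predicted inside one peak (frequencies in bins):
masks = overlap_masks(np.arange(198, 206), f_hat=[200.4, 202.1], A_hat=[1.0, 0.5])
```

Each bin's energy is then shared between the two notes in proportion to the predicted amplitude times the window shape centered on each predicted harmonic frequency.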
The above two filter designs, (4) and (6), were proposed simply as a way of splitting the energy in an overlapping peak into its constituent parts in a way that reflects the predictions of the amplitudes and frequencies of the constituent harmonics. In ideal conditions in which these predictions are exact, and the peak arises from stationary sinusoids of nearly equal frequency and pre-estimated phase offset, it is possible to separate the overlapping peak almost exactly into its constituent parts using complex filters. Suppose we use the following signal model to describe a cluster of sinusoids at frequencies f_1, …, f_{N_o} giving rise to an overlapping peak in the DFT

x(t) = Σ_{i=1}^{N_o} A_i cos(2π f_i t + φ_i)    (7)
where t = 0 at the start of the current frame and the signal is assumed to be continuous over the duration of the frame. Then, it can be shown that the Fourier transform of the continuous signal multiplied by the continuous window function w(t) is a convolution of the individual Fourier transforms, and results in

F_w(f) = Σ_{i=1}^{N_o} (A_i/2) [e^{jφ_i} W(f − f_i) + e^{−jφ_i} W(f + f_i)].    (8)

Assuming the model is accurate, and apart from an arbitrary constant, F(k) is approximately equal to F_w(f) evaluated at the discrete frequency bins. For the moment, we are observing F(k) in a limited frequency range between the minima on opposite sides of the positive frequency peak, to which the second term in brackets in (8) has very little effect. Thus

F(k) ≈ Σ_{i=1}^{N_o} (A_i/2) e^{jφ_i} W(f_k − f_i).    (9)

Suppose that all the A_i, f_i, and φ_i are known. Then, it is possible to design a filter

H_i(k) = (A_i/2) e^{jφ_i} W(f_k − f_i) / Σ_{i′=1}^{N_o} (A_{i′}/2) e^{jφ_{i′}} W(f_k − f_{i′})    (10)

that, when multiplied by F(k), results in approximately the DFT of the windowed sinusoid i. One could correctly argue that, given A_i, f_i, and φ_i, it would be easier to compute the expected shape of the DFT of the windowed sinusoid and simply subtract it from the overlapping peak in F(k). However, any slight error in these parameters will result in an imperfect subtraction, and leakage of this sinusoid into the residual. A similar effect will occur if the original signal model was at all inaccurate, for instance, due to the sinusoid being only approximately stationary during the frame. The filter design of (10) performs a reasonably good separation when the parameter estimates are only approximately accurate, and as Σ_i H_i(k) = 1 across the width of the peak, any leakage of the peak into the residual is avoided.

Equation (10) is, in practice, applied to our set of overlapping harmonics by using the substitutions f_i → f̂_i and A_i → Â_i. However, we still need a way to predict the phases φ_i. Suppose we measure F(k) at N_o different frequency bins k_1, …, k_{N_o}, with equivalent frequencies in Hertz f_{k_1}, …, f_{k_{N_o}}, which are chosen to be the nearest frequency bins to f̂_1, …, f̂_{N_o} under the condition that no bin is used twice. Then, we obtain a set of independent linear equations which can be solved by a matrix inversion to find the complex amplitudes a_i = (A_i/2) e^{jφ_i}:

[F(k_1), …, F(k_{N_o})]ᵀ = W [a_1, …, a_{N_o}]ᵀ,  where W = [W(f_{k_p} − f̂_i)]_{p,i}.    (11)

Therefore,

[a_1, …, a_{N_o}]ᵀ = W⁻¹ [F(k_1), …, F(k_{N_o})]ᵀ.    (12)

Once again, W(f) is approximated by the DFT of the zero-padded window function.

Fig. 4. Filtering of an overlapping spectral peak arising from two windowed sinusoids: a) mixed peak is the sum of the peaks arising from the individual sinusoids; b) resulting peak corresponding to the first sinusoid after filtering [filtered a, b, and c used (4), (6), and (10), respectively]; c) as in b), but for the second sinusoid.
Fig. 4 compares the result of filtering a spectral peak arising
from two synthesized overlapping sinusoids, using the filters in
EVERY AND SZYMANSKI: SEPARATION OF SYNCHRONOUS PITCHED NOTES
Fig. 5. Filter shapes calculated using (5) and (6) and the amplitude spectrum of three violins with pitches A5 = 880 Hz, D♭6 = 1109 Hz, and E6 = 1319 Hz.
(4), (6), and (10). Whereas, in Fig. 3, the DFT amplitude was
shown, in Fig. 4, the real component of the DFT spectrum is
shown to illustrate that (10) was much better than (4) and (6) at
resolving the phases of the original sinusoids correctly. A similar observation was made for the imaginary spectra. In Fig. 4,
small random errors were added to the known sinusoidal amplitudes and frequencies in (4), (6), and (10) to simulate normal
conditions of operation in which these quantities would be estimated imperfectly from the data.
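The complex filter of (10)-(12) can be demonstrated on two synthetic windowed sinusoids; the sketch below (all parameter values are our own) recovers the complex amplitudes a_i = (A_i/2) e^{jφ_i} by the matrix inversion of (12) and then builds the normalized complex masks of (10):

```python
import numpy as np

N, pad = 2048, 64
n = np.arange(N)
w = np.hamming(N)
W_dense = np.fft.fft(w, pad * N)    # zero-padded DFT approximating W(f)

def W(d_bins):
    """W(f) at an offset of d_bins DFT bins (nearest zero-padded bin)."""
    return W_dense[int(round(d_bins * pad)) % (pad * N)]

# Two stationary sinusoids as in (7), forming one overlapping peak:
f = np.array([100.2, 101.8])        # frequencies in DFT bins
A = np.array([1.0, 0.6])            # amplitudes A_i
phi = np.array([0.7, -1.2])         # phases phi_i
x = sum(A[i] * np.cos(2 * np.pi * f[i] * n / N + phi[i]) for i in range(2))
F = np.fft.fft(w * x)

# (11)-(12): sample F(k) at the nearest distinct bins to each frequency and
# invert the window-spectrum matrix to recover a_i = (A_i/2) e^{j phi_i}.
k = np.round(f).astype(int)         # bins [100, 102]
M = np.array([[W(k[p] - f[i]) for i in range(2)] for p in range(2)])
a = np.linalg.solve(M, F[k])

# (10): complex masks that sum to one and split the peak.
bins = np.arange(98, 105)
num = np.array([[a[i] * W(b - f[i]) for b in bins] for i in range(2)])
H = num / num.sum(axis=0)
F_sep = H * F[bins]                 # approximate DFTs of the two sinusoids
```

Because the masks sum to one across the peak, nothing leaks into the residual even when the recovered parameters are slightly inaccurate.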
Finally, Fig. 5 shows the mixed amplitude spectrum of three
violin notes in a single time frame, and Fig. 6 illustrates its separation using (5) and (6) into three source spectra plus a residual
spectrum.
F. Resynthesis of Separated Notes

In any particular frame, the filtered spectrum for note m was obtained by multiplying F(k) by H_m(k). The time waveform of each separated note was synthesized by performing an inverse FFT of the corresponding filtered spectrum, then dividing by the original Hamming window used in the analysis, and then using an overlap-add method with triangular windows to interpolate the resulting time segments between frames. The residual time waveform was produced by subtracting the sum of the separated note waveforms from the original time signal.

III. FILTER PERFORMANCE

To evaluate the relative performance of the filters, they were applied in turn to the task of separating two overlapping sinusoids, with results shown in Fig. 7. The measurements were made as follows: two sinusoids with a random relative frequency difference in the range [0, 4] frequency bins, each having a random amplitude in the range [0, 1] and a random phase offset in the range [0, 2π), were added together to simulate a random overlapping peak. The robustness of the three filter designs as a function of the error in Â_i or f̂_i was then evaluated by substituting into (4), (6), and (10) either the correct amplitudes of both sinusoids and rough estimates of their frequencies, or vice versa. In the former case, the rough estimates of f̂_i were produced by adding to each known sinusoidal frequency a random frequency in the range [−r, r] bins. In the latter case, the rough estimates of Â_i were produced by multiplying each known sinusoidal amplitude by a random number in the range [1 − r, 1 + r]. We used as a measure of the error between the original and separated DFTs of the two sinusoids the quantity
R(r) = Σ_{i=1}^{2} Σ_k |H_i(k) F(k) − F_i(k)|² / Σ_k |F_i(k)|²    (13)

where F is the DFT of the windowed mixed sinusoids, F_i is the DFT of the windowed unmixed sinusoid i, and H_i is the filter for sinusoid i. Fig. 7 shows the average value of R(r) over many iterations for each value of r, as r is varied from 0 to 1. It is clear that the separation performance of all the filters decreases when the frequency and amplitude estimates decrease in accuracy, i.e., as r increases. The results reveal that when the frequency and amplitude estimates are accurate, the complex filter (10) is the most precise, but as these estimates become more inaccurate, (10) eventually becomes misleading and less robust than (4); (6) and (10) appear to have an advantage only when the errors in f̂ are less than 0.2 frequency bins. Equation (4) demonstrates the most stable behavior and is the most accurate method when the errors in the frequency estimates are large.
Fig. 6. Segmentation of the spectrum in Fig. 5 using the filters shown, into three harmonic sources and a residual (note different amplitude scales).
Fig. 7. Filter error R(r) when separating two overlapping sinusoids, as a function of the inaccuracy in a) the sinusoidal amplitude estimates Â and b) the sinusoidal frequency estimates f̂ [filtered a, b, and c correspond to (4), (6), and (10), respectively].

IV. RESULTS

The real sound examples used were all Western orchestral instrument note samples of length 2–8 s, 16 bits, and sampled at 44.1 kHz. All samples aside from the piano were recorded in an anechoic chamber, although it has been observed that adding small amounts of reverb diminishes the separation quality only slightly. These samples were scaled to have equal mean squared amplitude and summed to produce mixed note samples on which the separation algorithms were applied, allowing the direct comparison of the separated sounds and the original recordings. The most meaningful way to evaluate the performance of the separation algorithms would be based upon perceptual judgement, since the key aim of the algorithms is to achieve perceptually acceptable separation of the sources. Hence, some sound examples have been presented on the internet [18] so that the separated sounds can be compared directly with the unmixed and mixed originals. We also use a quantifiable measure, the signal-to-residual ratio (SRR), to evaluate the similarity between the time waveform x̂_m(t) of each separated note and its corresponding original x_m(t):

SRR_m = 10 log10 [ Σ_t x_m(t)² / Σ_t (x_m(t) − x̂_m(t))² ] dB.    (14)
The correct matching of the set of original notes to the set of
separated notes was achieved by swapping the order of the separated notes until the maximum of
was achieved
for each .
sources is
The mean signal-to-residual ratio (MSRR) over
defined as
(15)
and the average increase in the sum of SRRs is
(16)
is the mixed original signal. These quanwhere
tities measure how well the original sound has been separated
into individual notes, with larger values indicating better separation performance. No attempt has been made yet to split the
residual waveform any further, and so it is expected that for
mixes of notes containing large nonharmonic components, the
should decrease.
MSRR and
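The metrics in (14)–(16), together with the permutation matching of separated to original notes, can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names are ours, and the exhaustive search over permutations is only practical for the small numbers of notes considered here.

```python
import itertools
import numpy as np

def srr_db(x, x_hat):
    """SRR of (14): energy of the original note over the energy of the
    difference signal, in dB."""
    return 10.0 * np.log10(np.sum(x**2) / np.sum((x - x_hat)**2))

def match_and_score(originals, separated):
    """Permute the separated notes to maximize the summed SRR (the
    matching step described above), then return the per-note SRRs,
    the MSRR of (15), and the average SRR increase of (16)."""
    M = len(originals)
    mix = np.sum(originals, axis=0)  # mixed original signal x(t)
    best = None
    for perm in itertools.permutations(range(M)):
        srrs = [srr_db(originals[m], separated[p]) for m, p in enumerate(perm)]
        if best is None or sum(srrs) > sum(best):
            best = srrs
    msrr = np.mean(best)
    # SRR of each original against the unseparated mix, as the baseline in (16)
    delta = np.mean([best[m] - srr_db(originals[m], mix) for m in range(M)])
    return best, msrr, delta
```

Note that for perfect separation the residual energy is zero and the SRR diverges, so a practical implementation may add a small floor to the denominator.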
TABLE I
SRRs AND ΔSRR/M FOR THE SEPARATION OF 2–7 SYNCHRONOUS VIOLIN NOTES
A. Separating Harmonically Unrelated Notes

Table I shows the calculated SRRs and ΔSRR/M for sample mixes of between two and seven violin notes, using (4) and (5) to separate overlapping harmonics. Although the separation performances shown are very good, one must take into consideration that these examples mostly consist of notes that are not related to each other by harmonic intervals like major thirds, fourths, fifths, or octaves. The results should, thus, be interpreted as showing that the removal of nonoverlapping harmonics with unit-amplitude filters across the width of harmonic peaks is capable of producing high SRRs.

B. Separating Harmonically Related Notes

To measure the effectiveness of the three filter designs proposed for separating overlapping harmonics, three real sample mixes were produced consisting of multiple notes from different instruments, in which the pitches were deliberately chosen to have harmonic relationships, thus resulting in a higher proportion of overlapping harmonics. Three synthesized sound examples were also produced, and the results of the six test samples are shown in Table II.

For each sample, the MSRR and ΔSRR/M are given for the three filter designs in Section II-E and also using 1) no treatment of overlapping harmonics, 2) Parsons' method for splitting overlapping harmonics by interpolating between the amplitudes of neighboring harmonics [1], and 3) the nonlinear least-squares (NLS) method for estimating parameters of a model of closely spaced sinusoids [13]. For Parsons' method, a filter was used, followed by the normalization in (5). In [13], the NLS method is explained in the context of two overlapping harmonics, but the method can easily be generalized to more than two overlapping harmonics. Nonoverlapping harmonics were treated identically for all methods as described in Section II-D.

The first synthetic sample is a mix of three synthesized notes with fundamental frequencies of … Hz. The second synthesized sample is a sum of two notes: the first note has a constant pitch of 440 Hz and the second is a linear glissando between 400 and 480 Hz. In the last synthetic sample, the first note has a constant pitch of 300 Hz, and the second note has a frequency-modulated (FM) pitch centered on 450 Hz with FM amplitude 10 Hz and FM frequency 5 Hz to simulate vibrato. In all the synthetic examples, the first 20 harmonics are present with decreasing harmonic amplitudes, and the exact time-varying harmonic frequency and amplitude trajectories were provided to assess the performance of the filters in ideal conditions. The audio samples corresponding to Tables I and II are available at [18]. Fig. 8 shows an example of the original, separated, and residual spectrograms obtained when separating a mix of three notes (sample number … in Table II).
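Synthetic test notes of this kind can be generated by additive synthesis with a time-varying fundamental. The sketch below is our own construction, not the paper's synthesis code; in particular, the exact harmonic amplitude roll-off is not stated in the paper, so a 1/h decay is assumed here.

```python
import numpy as np

def harmonic_note(f0_of_t, sr=44100, n_harm=20):
    """Additively synthesize a harmonic note whose fundamental may vary
    over time. Harmonic h follows h times the instantaneous fundamental,
    with amplitude 1/h (an assumed roll-off)."""
    phase = 2 * np.pi * np.cumsum(f0_of_t) / sr  # integrate instantaneous f0
    x = np.zeros_like(f0_of_t)
    for h in range(1, n_harm + 1):
        x += np.sin(h * phase) / h
    return x

sr, dur = 44100, 2.0
t = np.arange(int(sr * dur)) / sr
# constant 300-Hz note plus a note whose pitch is centered on 450 Hz
# with 10-Hz FM amplitude and 5-Hz FM frequency, as in the third
# synthetic sample
f0_a = np.full(t.size, 300.0)
f0_b = 450.0 + 10.0 * np.sin(2 * np.pi * 5.0 * t)
mix = harmonic_note(f0_a, sr) + harmonic_note(f0_b, sr)
```

With 20 harmonics, the highest partial of the vibrato note stays below 9.2 kHz, well under the Nyquist frequency at 44.1 kHz.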
C. Average Separation Results

Finally, results are presented in Table III for the MSRR and ΔSRR/M for polyphonies of 2–5 instruments, each averaged over 100 random sample mixes, and using the same set of methods for separating overlapping harmonics as in Table II. Random mixes were constructed by firstly selecting a random set of unique instruments from a set of 11 orchestral instrument types (bassoon, cello, B♭ clarinet, E♭ clarinet, flute, French horn, oboe, piano, saxophone, trombone, and violin), and then selecting a note for each instrument randomly from within its complete pitch range. The individual notes were drawn from a set of 479 samples extending in pitch from A0 (27.5 Hz) to C8 (4186 Hz).
In initial studies, the multipitch estimator was used. It was
able to detect all pitches correctly in a random mix 53.2%,
37.7%, 17.4%, and 4.0% of the time for polyphonies of 2, 3, 4,
and 5, respectively, where a correct pitch estimate is defined as
one that is within 3%, i.e., half a semi-tone, of the known transcribed pitch. The mixes for which multipitch estimation was
unsuccessful tended to occur when there were strong harmonic
relationships between the constituent notes, and for these mixes
lower SRRs would be expected due to the greater likelihood
of overlapping harmonics. It was found that by averaging
SRRs over only those sample mixes for which the multipitch
estimator was correct, the average results were biased to “easy
mixtures,” and resulted in SRRs of approximately 5 dB higher
than those given in Table III. Thus, it was decided that instead,
the rough pitches would be provided in advance, so the results
in Table III show the average performance over all random
mixes. The multipitch estimator was used only to refine and
track each pitch over time within a limited pitch range around
the provided rough pitch.
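The half-semitone (roughly 3%) correctness criterion for pitch estimates can be expressed compactly on a logarithmic frequency scale. This is a small sketch; the function names are ours.

```python
import numpy as np

def pitch_correct(f_est, f_ref, tol_semitones=0.5):
    """True if the estimate is within half a semitone of the reference
    pitch, i.e., within a factor of 2**(0.5/12), about 2.9%."""
    return abs(12 * np.log2(f_est / f_ref)) <= tol_semitones

def all_pitches_correct(estimates, references):
    """A mix counts as correctly analyzed only if every reference pitch
    is matched by some estimate within the tolerance."""
    return all(any(pitch_correct(e, r) for e in estimates) for r in references)
```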
V. DISCUSSION
The violin separation in Table I showed that by multiplying a
mixed spectrum by a filter of unit amplitude across the width of
each peak containing a single harmonic, these harmonics could
be extracted very effectively. This was done with relatively little
computational expense in comparison to spectral subtraction
methods involving sinusoidal parameter estimation.
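The unit-amplitude filtering of nonoverlapping harmonics can be illustrated for a single frame by masking DFT bins around each predicted harmonic frequency. This is a simplified sketch under assumptions: peak widths are fixed here and no analysis window is applied, whereas the paper derives the peak widths from the windowed spectrum.

```python
import numpy as np

def extract_harmonics(frame, sr, f0, n_harm=10, half_width_hz=20.0):
    """Multiply one frame's spectrum by a unit-amplitude mask across the
    width of each predicted harmonic peak, then resynthesize by inverse
    DFT."""
    N = frame.size
    spec = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(N, 1.0 / sr)
    mask = np.zeros_like(freqs)
    for h in range(1, n_harm + 1):
        mask[np.abs(freqs - h * f0) <= half_width_hz] = 1.0
    return np.fft.irfft(spec * mask, N)
```

Applying this frame by frame with overlap-add yields the separated harmonic part of one note; bins claimed by two notes would instead need the overlapping-harmonic filters of Section II-E.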
Table II provides some insight into the performances of the
three filter designs in comparison to previous approaches to separating overlapping harmonics. In relation to Table I, one notices
an overall decrease in performance, which is not surprising due
to the relatively larger proportion of overlapping harmonics in
Table II.
In Table II, the two energy-based filter designs produced
overall the highest SRRs with real samples (samples {4}–{6}).
They also performed very well for the synthesized samples (samples {1}–{3}). Equation (10) was the best performer for sample
{1}. As the harmonic frequency and amplitude trajectories
were provided for the synthesized samples, this indicates that
TABLE II
SRRs AND ΔSRR/M FOR THE SEPARATION OF HARMONICALLY RELATED MIXED NOTES
under ideal conditions of stationary sinusoids, (10) performs the best, which is confirmed by observing Fig. 7 at low values of the estimation errors. The relatively low performance of Parsons' method can be
explained by the lack of frequency dependence in the filter design. Nevertheless, it provides some advantage over removing
overlapping harmonics from the separated notes altogether (the
first column of data in Table II). The low performance of the
NLS method is surprising, and upon further examination of the
separated spectra some insight was obtained. The low SRRs are
mostly due to a few instances in which the amplitudes of the
closely spaced sinusoids modeling an overlapping peak were
grossly overestimated. The explanation could be that there is
very little preventing a least-squares method from interpreting
a small overlapping peak as the addition of two sinusoids of
very large amplitude but nearly opposite phase, which destructively interfere. Although this interpretation may correspond
to a minimum least-squares error, given that the spectrum is
unlikely to be composed of stationary sinusoids, the small
overlapping peak would more likely be the addition of two
relatively low amplitude harmonics. The addition of a term into
the NLS estimation that penalizes joint amplitude estimates
with large variance might partially overcome this problem. As
the frequency and amplitude estimates of the closely spaced
sinusoids need to be optimized to find the least-squares fit, the
computational cost of the NLS method is relatively large. There
is also a risk that this optimization does not converge to the
global best fit.
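The failure mode described, two large, nearly opposite-phase sinusoids summing to a small peak, is easy to reproduce numerically. The values below are illustrative and not taken from the paper.

```python
import numpy as np

sr, N = 44100, 4096
t = np.arange(N) / sr
# two closely spaced sinusoids with large, equal amplitudes and exactly
# opposite phases: their sum beats slowly and stays small over a short
# analysis frame, yet each component alone has amplitude 10
a = 10.0 * np.sin(2 * np.pi * 440.0 * t)
b = 10.0 * np.sin(2 * np.pi * 440.2 * t + np.pi)
small = a + b
peak_ratio = np.max(np.abs(small)) / np.max(np.abs(a))
# peak_ratio is roughly an order of magnitude below 1: a least-squares
# fit has no way to rule out this large-amplitude interpretation of a
# small observed peak
```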
The worst overall SRRs in Table II were obtained for sample
{5}, which corresponds with a general observation that separation performance is worse for lower pitched samples. This is an
unavoidable consequence of using a fixed frequency resolution
transform, the DFT. It means that harmonic frequency estimates are worse, relative to the fundamental frequency, for lower pitched notes. Also, the ratio
of the width of harmonics to the spacing between harmonics becomes larger for lower pitches; hence, there is relatively more
overlapping spectral content for lower pitched note mixes.
Table III again validates the use of frequency-dependent filters by showing a consistent improvement in SRR over Parsons' method for the three proposed filter designs. Similarly to Table II, the two energy-based filter designs, (4) and (6), achieved the highest SRRs. The NLS method and (10) were both derived by assuming that the sinusoids in an overlapping peak are stationary within each time frame. Given that these methods performed worse than (4) and (6), and the fact that relatively long window lengths (186 ms) were used, it would be reasonable to conclude that this assumption is inaccurate for most real samples. Another contributing factor to the lower performance of (10) is demonstrated in Fig. 7; notably, (10) is not very robust to large errors in harmonic frequency and amplitude estimates. Overall, not only is (4) the most predictable with respect to errors in harmonic amplitude and frequency estimates, as shown in Fig. 7, but Table III shows that it also achieves the highest SRRs.

Fig. 8. Original spectrograms of a cello, soprano saxophone, and flute, and the spectrograms of the corresponding notes after separation (gray scales of all figures are equivalent).

TABLE III
AVERAGE MSRR AND ΔSRR/M FOR POLYPHONIES OF 2–5 INSTRUMENTS

The SRRs in Table III for (4) are about 7 dB higher than the average separation results reported in [15]. In [15], a larger selection of 26 different instruments was used, although pitches were restricted to between 65 and 2100 Hz, and the results reported were the average over clean mixes and mixes with additive pink noise. The cases in which the multipitch estimator failed were not accounted for in the average separation results in [15].

VI. CONCLUSION

Results have been presented for separating mixes of between two and seven synchronous notes from a mono track. Average SRRs of around 10–20 dB have been achieved, and mixes of two notes were separated with almost imperceptible differences between the separated and original notes. In some sample mixes, SRRs of up to 30 dB were obtained. With increasing numbers of notes in the mix, the separation quality predictably decreases, but mixes of up to seven notes have been separated with enough fidelity to easily allow the listener to match the separated and original notes correctly. This work has been extended [19] to separating synchronous note sequences or instrumental parts.
A product of the separation is the residual, which contains
the inharmonic partials and noise characteristics of the original
mix. This has been used elsewhere to make audible subtle attack
characteristics of piano notes, and may have application in the
synthesis of realistic instrument sounds. No attempt has been
made to split the residual into separate sources, and, thus, there
are audible artifacts when attempting to separate notes with significant noise or nonharmonic content.
A solution has been proposed to the problem of separating
overlapping harmonics from different notes, which is an important issue when attempting to separate musical sources. Three
filter designs have been developed for separating multiple harmonics from an overlapping spectral peak. The first two designs
are alternative methods for splitting the energy in the overlapping peak by using predictions of the frequencies and amplitudes of each harmonic constituting the peak. The third filter
design is complex and attempts to recover the DFTs of the individual unmixed harmonics. All of these filter designs resulted
in overall improvements to separation performance, and the first
energy-based filter design, (4), was overall the best performer.
The work has confirmed that via the use of a priori
model-based information, in the form of both prespecified
models of harmonic structures of pitched instruments and a
filtering methodology to handle overlapping spectral peaks,
it is possible to separate the harmonic structure of multiple
instruments from a mono recording to a high fidelity.
ACKNOWLEDGMENT

The authors would like to thank the three anonymous reviewers for their well-informed and helpful suggestions.

REFERENCES

[1] T. W. Parsons, "Separation of speech from interfering speech by means of harmonic selection," J. Acoust. Soc. Amer., vol. 60, no. 4, pp. 911–918, Oct. 1976.
[2] G. Hu and D. L. Wang, "Monaural speech segregation based on pitch tracking and amplitude modulation," IEEE Trans. Neural Netw., vol. 15, no. 5, pp. 1135–1150, Sep. 2004.
[3] L. Ottaviani and D. Rocchesso, "Separation of speech signal from complex auditory scenes," in Proc. COST G-6 Conf. Digital Audio Effects, Limerick, Ireland, Dec. 2001, pp. 87–90.
[4] M. Cooke and D. P. W. Ellis, "The auditory organization of speech and other sources in listeners and computational models," Speech Commun., vol. 35, no. 3–4, pp. 141–177, 2001.
[5] O. Cappé, J. Laroche, and E. Moulines, "Regularized estimation of cepstrum envelope from discrete frequency points," presented at the IEEE Workshop Applications of Signal Processing to Audio and Acoustics (WASPAA), Mohonk, NY, Oct. 1995.
[6] X. Rodet, "Musical sound signal analysis/synthesis: Sinusoidal + residual and elementary waveform models," presented at the IEEE Time-Frequency and Time-Scale Workshop. [Online]. Available: http://mediatheque.ircam.fr/articles/textes/Rodet97e/
[7] M. Donadio, "How to Interpolate Frequency Peaks," dspGuru, Iowegian Intern. Corp., revised May 1999. [Online]. Available: http://www.dspguru.com/howto/tech/peakfft2.htm
[8] M. Desainte-Catherine and S. Marchand, "High precision Fourier analysis of sounds using signal derivatives," J. Audio Eng. Soc., vol. 48, no. 7/8, pp. 654–667, Jul./Aug. 2000.
[9] P. Depalle and T. Hélie, "Extraction of spectral peak parameters using a short-time Fourier transform modeling and no sidelobe windows," presented at the IEEE Workshop Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, Oct. 1997.
[10] A. Klapuri, T. Virtanen, and J.-M. Holm, "Robust multipitch estimation for the analysis and manipulation of polyphonic musical signals," presented at the COST-G6 Conf. Digital Audio Effects, Verona, Italy, Dec. 2000.
[11] N. H. Fletcher and T. D. Rossing, The Physics of Musical Instruments, 2nd ed. New York: Springer-Verlag, 1998.
[12] T. Virtanen and A. Klapuri, "Separation of harmonic sounds using multipitch analysis and iterative parameter estimation," presented at the IEEE Workshop Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, Oct. 2001.
[13] T. Tolonen, "Methods for separation of harmonic sound sources using sinusoidal modeling," presented at the AES 106th Convention, Munich, Germany, May 1999.
[14] T. F. Quatieri and R. G. Danisewicz, "An approach to co-channel talker interference suppression using a sinusoidal model for speech," IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 1, pp. 56–69, Jan. 1990.
[15] T. Virtanen and A. Klapuri, "Separation of harmonic sounds using linear models for the overtone series," presented at the IEEE Int. Conf. Acoustics, Speech, Signal Processing, Orlando, FL, May 2002.
[16] H. Viste and G. Evangelista, "A method for separation of overlapping partials based on similarity of temporal envelopes in multi-channel mixtures," IEEE Trans. Audio, Speech, Lang. Process., to be published.
[17] R. C. Maher, "Evaluation of a method for separating digitized duet signals," J. Audio Eng. Soc., vol. 38, no. 12, pp. 956–979, Dec. 1990.
[18] M. R. Every and J. E. Szymanski. (2004, Jun.). Note Separation Demonstrations. [Online]. Available: http://www-users.york.ac.uk/~jes1/Separation1.html
[19] M. R. Every and J. E. Szymanski, "A spectral-filtering approach to music signal separation," presented at the 7th Int. Conf. Digital Audio Effects, Naples, Italy, Oct. 2004.

Mark R. Every received an Honours degree in physics from the University of the Witwatersrand, South Africa, in 1999. He is currently pursuing the M.S. degree in music technology at the University of York, York, U.K., on a British Commonwealth Scholarship.
He is currently an academic fellow at the Centre for Vision, Speech, and Signal Processing, University of Surrey, Guildford, U.K. His current research interests include music and audio signal processing, content description and extraction, and machine learning techniques.
John E. Szymanski received an Honours degree
in mathematics and the Ph.D. degree in theoretical
physics from the University of York, York, U.K., in
1980 and 1984, respectively.
He joined the Department of Electronics, University of York, in 1986, where he is currently a Senior
Lecturer within the Media Engineering Group. His
research interests include physical modelling, computational signal processing, inverse problems, and
optimization methods.