Automatic Timbre Mutation of Drum Loops
Jason A. Hockman
NYU Music Technology
Submitted in partial fulfillment of the requirements for the
Master of Music in Music Technology
In the Department of Music and the Performing Arts Professions
In The Steinhardt School
New York University
Advisors: Kenneth J. Peacock, Robert J. Rowe
2007/05/04
Acknowledgements
As this work is an abstraction of previous work, I must use this space to extend thanks to those people who have helped me in the development of the algorithm as it exists thus far. First and foremost, I must credit the knowledge and guidance of Dr. Juan P. Bello, whose tutoring in the use of the MATLAB coding environment and amassed knowledge of the subject material have aided me through the various phases of the project's design and implementation. In particular, Dr. Bello's previous work on onset detection (Bello et al., 2004) and rhythm modification of drum loops (Ravelli et al., 2007) made this project possible.
I must also thank the NYU Music Technology Program, which has brought me further into sound construction than I knew possible only two years ago. I had originally applied to the program thinking that I would continue working in the field of electronic dance music production; however, through the guidance of Kenneth Peacock, Robert Rowe, Dafna Naphtali, and Richard Boulanger, I have since expanded my scope to include research interests in Music Information Retrieval (MIR) and Digital Audio Effects (DAFX). The NYU Music Technology Forum, which generally meets on a bi-monthly basis, has been a constant source of inspiration for the project, and has provided essential feedback on the various decisions made throughout the algorithm's development.
I would also like to extend special thanks to Melissa Czajkowski, Liana Eagle and Ernest
Li, for their considerate critique and thoughtfulness throughout the project.
Finally, I must thank my parents for making such an opportunity possible.
Abstract
The presented algorithm demonstrates a method by which a percussive loop is automatically segmented into its constituent parts, which are then classified and appropriately resequenced to match the components of a second loop, which undergoes the same process. The two loops are then synthesized together, component by component, and the output is presented as 1) a unified .wav file, 2) individual slices (also output as .wav files), and 3) a MIDI file, for those musicians who desire to work with a traditional sequencer and sampler. The spectral mix between the two loops is defined by simple coefficient controls over both magnitude and phase values. The algorithm builds upon recent work by Ravelli, Bello, and Sandler (Automatic Rhythm Modification of Drum Loops, 2007), and is relevant to music composers and producers who wish to easily and efficiently create rhythmic and timbral variations of static loops.
CONTENTS

Acknowledgements
Abstract
Contents
1  Introduction
   1.1  The Early Years
   1.2  Innovation in Timbre
   1.3  Computer Approaches
   1.4  Algorithm Overview
2  Segmentation
   2.1  Onset Detection
   2.2  Complex-Domain Onset Detection
   2.3  Peak-Picking
3  Classification
   3.1  Introduction
   3.2  k-Means Classification
   3.3  Cluster Labeling
4  Resequencing
   4.1  Substitution Method
   4.2  Proportional Centroid Distance
   4.3  Outliers
5  Timbre Mutation
   5.1  Possible Methods
   5.2  Phase Vocoding Mutation
   5.3  User-Defined Mix Levels
   5.4  Window Size
6  Further Considerations
   6.1  MIDI
   6.2  Slicing and Combined Output
   6.3  Gaps and Clipping
   6.4  Omission of Timestretching Functionality
7  Conclusions
8  Appendix
   8.1  TimbreMut.m
   8.2  DetFuncA.m & DetFuncB.m
   8.3  LocalMaxima.m
   8.4  PhaseVoxMut.m
9  References
1. INTRODUCTION
“Rhythm imposes unanimity among the divergent”
- Yehudi Menuhin
Modern electronic musicians and producers benefit from the manipulation and processing of pre-recorded drum loops. Whether for commercial or personal purposes, musicians rely increasingly upon the timbres achieved through the use of large databases of drum loops. Oftentimes, however, these samples begin to sound "flat" or mundane to the ear, both sonically and rhythmically, and further modification is required to generate subtle nuance and uniqueness. This paper presents a method for the automatic timbre and rhythm mutation of two percussive loops. The process is the result of combined algorithms for segmentation, classification, resequencing, and synthesis. The aim of the project is to introduce a system by which a mono .wav file is systematically separated into its constituent events, which are then given a simple classification, resequenced to the rhythmic timing of another percussive .wav file, and finally morphed together in a cross-synthesis technique. The main goals of the project are thus 1) to provide a robust segmentation method capable of proper functionality given a large database of a diverse range of samples, while 2) minimizing the occurrence of class mismatches, such that a kick will not be synthesized with a non-kick. Interface goals are 1) overall timeliness of the algorithm, 2) efficiency and accuracy of response, and 3) sonic clarity. Timeliness is simply gauged by timing functions within the code; accuracy is determined through analysis with database samples; and sonic clarity is a subjective measure assessed by listener tests.
The remainder of the paper is presented as follows: after a brief history of loop-based music, and of technology specifically related to loop-based music production, Section 2 presents the detection algorithm and peak-picking process. Section 3 deals with classification and event labeling. Resequencing methodology is discussed in Section 4. The timbre mutation technique is presented in Section 5. The algorithm's output and further considerations are explained in Section 6, and conclusions follow in Section 7. The MATLAB code for the presented algorithm is printed in the Appendix, along with database information.
1.1 The Early Years *
The Sugar Hill Gang released their seminal hit "Rapper's Delight" in the summer of 1979; this was the world's first glimpse of a new form of urban music later defined as hip hop, and the first popular-culture hit to sample a drum riff from another band. Sylvia Robinson, Sugar Hill's producer, used a drum break (a term used synonymously with riff or loop, meaning a brief drum solo, often in backbeat form) from Chic's chart-topping "Good Times". Although not truly the first hip hop song (most recognize the Fatback Band's "King Tim III (Personality Jock)" as the original), "Rapper's Delight" heralded the beginning of an era of repetitive percussive patterns, spliced together from previously existing material, as a basic building block upon which to layer.
"Rapper's Delight" was nothing new to the residents of east coast tenements in 1979. In fact, there had been a long-standing tradition of privately owned sound-system and DJing culture within urban centers for the majority of the decade. Building upon disco DJ methods, Clive Campbell, aka Kool Herc, was the first to use a mixer to quickly "cut" between two copies of the same record to prolong the sonic climax of a break. Campbell, then Afrika Bambaataa and Grandmaster Flash with greater precision, soon began to layer the breaks by beatmatching (a technique made possible by the variable user-defined pitch control on performance turntables) and by cutting in edits, or phrases from different breaks. During beatmatching, a DJ matches the tempo of one record to that of another. Generally, this process is performed on percussive sections of rhythm-based music, often for the purpose of vertically aligning structural components by measure. By 1979, the use of breaks as a percussion element had become standard, not just at block parties, but as a production technique as well. By the end of that same summer, breakbeats had completely overtaken the backing tracks of top-40 R&B hits.
1.2 Innovation in Timbre
For the next few years, the timbres of hip-hop breaks remained unchanged from those of the recorded drum sounds of the previous generation of rhythm and blues hits. The early 1980s provided users with low-cost drum machines, namely Roland's TR-808, equipped with a flexible sequencer as well as duration and pitch control over a variety of drum timbres – a freedom which liberated artists from the confines of pre-recorded loop structures and legal issues. "Planet Rock" (1982) by Afrika Bambaataa demonstrated the possibilities of these new timbres, and with its release created a sub-genre of futurism that would be continued years later in the layered textures of European techno, then jungle and drum and bass.
* as explained by A. Light, "The History of Hip Hop", New York: Three Rivers Press, 1999
The need to remain innovative pushed producers to create a hybridization of electronic and live drum timbres in the mid-to-late 1980s, and as sampling technology became more affordable, the two could exist side by side on hardware samplers such as the 8-bit Ensoniq Mirage (1985), and later the 16-bit Akai MPC and E-Mu E-series samplers. On-board DSP processors on samplers from the early 1990s gave groups such as Eric B & Rakim and Public Enemy the signal-processing capability of blending samples through equalization or sideband compression, and more recently even convolution between samples (as seen on the E-mu E4 Platinum).
Layering drum loops provides musicians and producers with a simple yet effective method of creating more effective percussive events, as well as adding necessary variation to standard loop configurations which might otherwise sound commonplace and mundane. The blending of loops – in itself a complex process involving segmentation, compression, equalization, and amplitude control – has been sped up drastically by developments in graphical displays and processor speeds, yet is still a daunting task.
1.3 Computer Approaches
Modern advances in personal computing have created a new market for the laptop/desktop musician. Sampling and synthesis have, for the most part, moved into the software realm, alongside sequencing programs such as Pro Tools, Digital Performer, Cubase, and Logic. Programs such as Propellerheads Recycle (http://www.propellerheads.se/products/recycle) provide digital musicians with an alternative to hand-selecting the divisions of a loop: a user-set segmentation intensity level (0-100), which operates by time-domain transient detection coupled with the detection of sudden amplitude-envelope deviations. The unfortunate truth is that for most loops, the detection algorithm finds inaccurate onsets, and for those correctly chosen, places slice points several samples after the actual inception of the onsets, causing clipping at both the start and end of the slices. Generally, a user must then select a segmentation intensity which provides the closest-fitting onset schematic for each loop, and then systematically add new or remove false onsets, and verify the accuracy of each onset point. A useful function of the Recycle program, however, is its ability to output sequential individual slices and a MIDI (Musical Instrument Digital Interface) file for control purposes. As the majority of electronic musicians have not yet abandoned the MIDI platform, this practical feature is also available in the presented segmentation process.
FXpansion offers GURU (http://www.fxpansion.com/product-guru-main.php), a software groove box that provides factory-preset loop manipulation, as well as segmentation and resequencing of input loops. The slicing is quite accurate; however, the present implementation only allows the definition of singular note events, such that there can only be one kick, one snare, and one hat determined per input loop. The appearance of complexity within loop-based music is partially derived from subtle alterations between samples, such that there are several kicks, snares, and hats, as opposed to one. Still, GURU does offer several processing effects, such as LFOs and tape effects, for this kind of manipulation.
Ravelli et al. (2007) have created a method of segmentation and resequencing which provides the basis for this paper's presented algorithm [1, 2]. In their work, two mono drum loops are segmented and spectrally classified, and one loop, the "original", is resequenced to the rhythm of the "model" loop. Segmentation is executed through complex-domain onset detection [3], classification by the k-means clustering method, and resequencing via a substitution matrix. A more detailed explanation of this procedure can be found in [1] and [2]; however, as I incorporate similar detection and classification schemes in the presented algorithm, these will be reviewed in subsequent sections of this paper.
1.4 Algorithm Overview
The algorithm is comprised of three stages: 1) analysis, 2) resequencing, and 3) synthesis. The analysis stage can be seen as a sequence of two steps: i) onset detection and segmentation of two percussive samples, file A and file B, and ii) categorization via a simple classification iteration. Resequencing involves the matching of file B to file A. The synthesis stage involves i) phase vocoder mutation and ii) resynthesis (combined .wav output). Figure 1 presents a flow chart of the algorithm.
Figure 1: Timbre Mutation Overview
The TimbreMut.m algorithm is designed to function with a wide variety of standard drum loops; however, its analysis architecture allows it to remain relevant for non-traditional musical timbres, such as found sounds and sound effects. The algorithm may, for example, be used to blend the characteristics of one loop with another's, as an effect on a pre-existing drum loop, or simply to generate interesting non-rhythmic combinations of sounds.
2. SEGMENTATION
2.1 Onset Detection
The aim of onset detection is to provide a clear inception of an event, prior to the attack
portion of the event, at which the amplitude is elevated and may be measured. Bello et al.
(2005) define an onset as the specific moment chosen to denote the inception of a
temporally extended transient, or period during which the signal may be characterized as in
excitation . There are several methods by which this calculation may be made for
percussive signals, in either the time or frequency domain.
5
The most basic form of onset detection may be performed through analysis of amplitude elevations in the time domain. The success of this analysis scheme is limited to signals in which clear separation between loud note events exists. Softer onsets, such as those often found in the kick drum and hat/cymbal classes, are often ignored by these detection schemes, bypassed in favor of the more prevalent noisy characteristics of the snare drums.
For the purposes of the algorithm discussed here, a more robust detection method is needed, requiring an examination of the signal within the frequency domain. In (1), the incoming signal is parsed into individual equal-length frames, windowed, and input into a discrete Fourier transform [6]:
X(k) = \mathrm{DFT}[x(n)] = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N} \qquad (1)
The transform produces various spectral features that are assessed by different detection functions. Local energy, a detection function which is also possible in the temporal domain, is achieved spectrally through simple analysis of the magnitudes across the FFT bins:
LE(n) = \frac{1}{N} \sum_{k=-N/2}^{N/2} \left| X_k(n) \right|^2 \qquad (2)
Rodet and Jaillet (2001) present a more robust analysis devised by weighting the individual frequency bins, which results in increased sensitivity towards the higher frequencies, a characteristic which exploits the spectral changes of percussive signals well [7]:
HFC(n) = \frac{1}{N} \sum_{k=-N/2}^{N/2} \left| X_k(n) \right|^2 |k| \qquad (3)
Both local energy and HFC approaches often bypass softer or tonal onsets, instead opting
for louder, noisy onsets.
Alternatively, Bello and Sandler (2003) determine onsets through deviation of the phase outputs of the FFT. Analysis of the mean absolute phase deviation provides an efficient characterization of tonal onsets. Unlike the aforementioned magnitude-based calculations, the phase-based approach fares well in the detection of kick drums and tonal percussive sounds, regardless of amplitude level. In each STFT frame, phases are output and stored from each bin. During onset excitation, significant differentiation will be present due to high variance within the signal over a short period of time, resulting in large peaks, while steady sinusoidal points will demonstrate derivative values nearer to zero. In (4), the change in phase for frame n is calculated from the present frame ( \varphi_k(n) ) and the two prior frames [8]:

\Delta\varphi_k(n) = \varphi_k(n) - 2\varphi_k(n-1) + \varphi_k(n-2) \approx 0 \qquad (4)
Equation (5) demonstrates this measure as the mean of the absolute phase deviations across the N bins. Although the mean absolute phase deviation is a well-suited measure for tonal onsets, it does not represent noisy transients as well, as high-frequency behavior is not as easily tracked from frame to frame [8]:

\eta_p(n) = \frac{1}{N} \sum_{k=1}^{N} \left| \Delta\varphi_k(n) \right| \qquad (5)
2.2 Complex-Domain Onset Detection
Bello et al. (2004) describe a method by which both phase- and energy-based approaches may be combined, capable of extracting onsets from both high- and low-energy events, as well as events characterized by low or high frequencies. The method, fully explained in [3], creates present magnitude ( \left| X_k(n) \right| ) and target magnitude ( \left| \hat{X}_k(n) \right| ), and present phase ( \varphi_k(n) ) and target phase ( \hat{\varphi}_k(n) ) values from the STFT calculation. The present values are derived from (6):

X_k(n) = \left| X_k(n) \right| e^{j \varphi_k(n)} \qquad (6)

and the target calculations are derived from (7):

\hat{X}_k(n) = \left| \hat{X}_k(n) \right| e^{j \hat{\varphi}_k(n)} \qquad (7)

The target amplitude may then be understood as the previous frame's value, as in (8), and using the princarg phase-unwrapping function, the target phase may be obtained from analysis of the previous two frames (9).
\left| \hat{X}_k(n) \right| = \left| X_k(n-1) \right| \qquad (8)

\hat{\varphi}_k(n) = \mathrm{princarg}\left[ 2\varphi_k(n-1) - \varphi_k(n-2) \right] \qquad (9)

Then, bin stationarity is measured by a comparison of past to present STFT frames:

\Gamma_k(n) = \left\{ \left| \hat{X}_k(n) \right|^2 + \left| X_k(n) \right|^2 - 2 \left| \hat{X}_k(n) \right| \left| X_k(n) \right| \cos(\Delta\varphi_k(n)) \right\}^{1/2} \qquad (10)
Summing these values across the bins k, the full complex-domain onset detection function may then be represented by:

CDOD(n) = \sum_{k=1}^{K} \Gamma_k(n) \qquad (11)
This is the method chosen for the TimbreMut.m algorithm; its implementation is presented in Appendix 8.2, on lines 39-40 of DetFuncA.m, or alternatively on lines 37-38 of DetFuncB.m.
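For illustration, the detection function of equations (6)-(11) reduces to only a few lines. The following is a condensed sketch rather than the actual DetFuncA.m listing, assuming X is the complex STFT matrix of the input loop (N bins by M frames) and using a common one-line princarg definition:

princarg = @(p) mod(p + pi, 2*pi) - pi;  % wrap phase into [-pi, pi)
mag = abs(X); phi = angle(X);
CDOD = zeros(1, M);
for n = 3:M
    targetMag = mag(:, n-1);                              % eq (8)
    targetPhi = princarg(2*phi(:, n-1) - phi(:, n-2));    % eq (9)
    dphi      = princarg(phi(:, n) - targetPhi);
    gamma     = sqrt(targetMag.^2 + mag(:, n).^2 ...
                - 2*targetMag.*mag(:, n).*cos(dphi));     % eq (10)
    CDOD(n)   = sum(gamma);                               % eq (11)
end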
Through this method, onsets may be visualized by simply plotting the output vector onDet (see Appendix 8.2). The derivative of the detection function may then be taken to provide a precise point of event inception. Figure 2 provides a graphical display of three stages of the detection process: (a) wavread, (b) complex-domain onset detection, and (c) derivative of the detection function.
Figure 2: Onset Detection Preparation — (a) wavread of 10.wav; (b) complex-domain onset detection of 10.wav; (c) derivative of the complex-domain onset detection of 10.wav
2.3 Peak-Picking
Once the detection function has characterized the events, a peak-picking method is required to determine which of the available points are actually onsets. To ensure a more accurate decision during peak-picking, Ravelli et al. (2007) present the use of a difference function to streamline the data such that the output becomes even less noisy [1].
Another precursor to the actual peak-picking function is thresholding. The two methods available are fixed and adaptive thresholding. If the incoming signal were prepared so that it was always of comparable amplitude, by, say, a hard limiter, then a fixed threshold might be an ideal choice. However, as percussive events may be encountered across a wide range of styles and onset amplitudes, it is difficult to define a static amplitude position which satisfies all (or even many) percussive signals.
2.3.1 Adaptive Thresholding
Automatic threshold calculation is an essential characteristic of the algorithm, as it removes the constant threshold adjustments otherwise necessitated by each sample's amplitude deviations. Adaptive thresholding is performed through the creation of a mean absolute filter, which efficiently follows the shape of the detection vector, successfully eliminating floor noise while allowing local maxima to remain after peak-picking. The adaptive threshold used in TimbreMut.m, a variation of the threshold presented by Duxbury et al. in [9], is created as the sum of 1) the derivative of (11) multiplied by 0.6, and 2) the lower of either the floor threshold or the mean absolute derivative of (11). Fine-tuning of the algorithm demonstrated that a floor threshold of 20 was the most reliable value tested.

\delta_{adpt}(m) = 0.6 \cdot \mathrm{diff}(CDOD)(m) + \min\left( \delta_{floor}, \frac{\sum \left| \mathrm{diff}(CDOD) \right|}{N} \right) \qquad (12)
2.3.2 Local Maxima
As mentioned above, peak-picking is performed by selecting the local maxima within the vector, utilizing the find command to locate those points for which a) point n is greater than point n-1, b) point n is greater than point n+1, and c) point n is greater than the adaptive threshold at that point. LocalMaxima.m thus parses the data given by onDet, returning a truncated vector comprised only of the sample points at which onsets are initialized.
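As a minimal sketch (an illustration, not the LocalMaxima.m listing), the thresholding and peak-picking logic can be expressed as follows, assuming CDOD is the detection-function vector from Section 2.2:

d      = diff(CDOD);                        % difference function
thresh = 0.6*d + min(20, mean(abs(d)));     % adaptive threshold, eq (12)
n      = 2:length(d)-1;
idx    = find(d(n) > d(n-1) & d(n) > d(n+1) & d(n) > thresh(n));
onsets = n(idx);                            % frames at which onsets begin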
3. CLASSIFICATION
3.1 Introduction
Classification of the slices between onset points is intrinsic to the later matching stage of the algorithm, as it provides the necessary parsing of the output onsets vector into sub-groups defined by average spectral makeup. It should be mentioned here that the chosen classification system provides the basis on which later resequencing is performed (proportional centroid distance), but this will be explained in the following section.
3.2 k-Means Classification
Once the input signal has been analyzed for onset start times, DetFuncA.m and DetFuncB.m utilize k-Means (MacQueen, 1967), a simple classification implementation, to define the information between onset vector points as belonging to the family of kick drums (K), snare drums (S), or hats/cymbals (H). The goal of the k-Means classification procedure is to locate the mean vectors of a distribution, serving to create Voronoi cells that enclose the data by an ownership labeling (Duda et al., 2001) [4]. Through an effort to minimize a squared-error distance calculation (13), k-Means produces maximum-likelihood estimates of ownership [4]. In (13), the squared distance of each point x_j in set S_i from its prescribed mean \mu_i is calculated, and the results are in turn summed over all k clusters:

V = \sum_{i=1}^{k} \sum_{x_j \in S_i} \left\| x_j - \mu_i \right\|^2 \qquad (13)
During onset detection, the matrix BNC is created containing the first eight low-frequency STFT bins of a detected onset frame, along with the following seven frames for validity of frequency information. The eight bins are selected in the initialization of the DetFuncA.m and DetFuncB.m functions (Appendix 8.2) by the command 1:5:36, and represent frequencies between 0-344 Hz (as the sampling frequency is 44100 samples per second and the STFT window is 512 samples in length). These bins are chosen because this configuration provides an accurate representation of the three classes. Kick drums will generally be represented by large values in bins 1:3, while snares will demonstrate the most significant values in bins 5:8 (mostly in the eighth bin); hats, although present in the vector due to onset detection, will have a minimal presence across these low bins.
The means of these eight frames for each STFT bin are then calculated and input into an adapted version of Kmeans.m (Abonyi et al., 2004), part of MATLAB's Fuzzy Clustering Toolbox [11]. The k-Means method of classification was designed to analyze the characteristics of
data present in a feature space of a given number of dimensions. In this scenario, there are eight dimensions, given by the eight STFT bins extracted during onset detection. The data is represented in the feature space along with k centroids (centers of mass, or barycenters), whose initial locations are generated randomly. The k value chosen for the algorithm is 3, representing the three classes (K, S, H) to which each data point will be attributed. Through an iterative process, each data point assigns itself to a particular centroid, and each centroid moves to the center of the data set by which it is defined. The process continues until each point is defined and no centroid remains in motion. Through this iterative process each onset may be assigned to a particular centroid; however, because of the random placement of centroids, there can be no method of predictively labeling the centroids. Post-analysis labeling is therefore necessary.
Because the sample signals being evaluated are drum loops (or perhaps non-rhythmic percussive sounds), we can assume the sounds fall into basic categories, based on the design of the instruments used to create them. If the loops contain onsets from the three basic components of a drum kit – kick drum, snare drum, and hat/cymbal – then we can assume that the data will cluster into three main areas of the feature space. While k is assigned to three here, floor tom presence most often results in increased K (kick drum) onsets; simple tweaking of the k value from 3 to 4 will provide a fourth centroid which may be used for the definition of drum sounds. For most production loops, however, the k=3 determination is sufficient, as the occasional tom will not skew the data: it will generally exist as an outlier, and will not be chosen as an onset for resequencing unless another outlier exists in the opposing loop.
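A condensed sketch of the feature extraction and clustering follows; it substitutes MATLAB's built-in kmeans for the Fuzzy Clustering Toolbox version actually used, and the names stftMag and onsetFrame are illustrative assumptions:

bins  = 1:5:36;                            % eight low-frequency bins
feats = zeros(numOnsets, 8);
for i = 1:numOnsets
    frames      = stftMag(bins, onsetFrame(i):onsetFrame(i)+7);
    feats(i, :) = mean(frames, 2)';        % mean over eight frames per bin
end
labels = kmeans(feats, 3);                 % k = 3: the K, S, H classes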
3.3 Cluster Labeling
Amongst other data, the output of kmeans.m provides the locations of all three centroids in all eight dimensions (given by result.cluster.v), a 1-0 truth table representing the membership of every data point to each centroid (result.data.f), and a distance metric for each point within the dataset (result.data.d). The means of result.cluster.v are calculated, creating three values. In determining which centroid is associated with which instrument class, some assumptions must be made and defined in the code. As mentioned above, kick drums will center around bins 1:3, snares around 5:8, and hats will be uniformly neutral. Furthermore, the snare centroid will contain the most energy of the three, and may therefore be defined as the maximum of the mean centroids, with the hats as the minimum. Because k=3 (and the centroid indices sum to 1+2+3=6), the kick drum centroid is associated with the remainder from the simple equation:

Centroid_K = 6 - Centroid_S - Centroid_H \qquad (14)

where: Centroid_S = max(mean(result.cluster.v))
       Centroid_H = min(mean(result.cluster.v))
19
The label 'k', 's', or 'h' is then assigned to each onset in the data set, and lists of kicks, snares, and hats may then be exported. DetFuncA.m and DetFuncB.m define these outputs as output.list.kick, output.list.snare, and output.list.hat. A secondary list group, comprised of output.class.k, output.class.s, and output.class.h, provides TimbreMut.m with each data point's distance from its own class centroid, for resequencing purposes. Both DetFuncA.m and DetFuncB.m also return a general list of onsets in the vector output.onsetinfo.
Figure 3: k-Means Output and Categorization Determination — (a) three clusters generated and result.cluster.v; (b) determination of [S]; (c) determination of [H]; (d) determination of [K]
20
The plotting of data and centroid movement may be viewed in MATLAB's Figure window, and separation should be clearly visible between data groupings. In Figure 3(a), the kick centroid exists in the bottom left corner, and the data defined by its ownership is represented by blue [+]. Snares are represented as red [.], with the snare centroid located at the barycenter of these four points. The hat centroid is located on the lower right side of the graph; hat data points are represented by green [x]. Figure 3 also demonstrates the steps by which these determinations are made. Across the eight features, the snare centroid will demonstrate the highest average energy (Figure 3(b)), and hats the least (Figure 3(c)), as the bins analyzed are much too low to exhibit any significant energy levels. As kicks are the only other class remaining, they can be determined by simple subtraction: the three centroid indices total six (Figure 3(d)), and we simply subtract the determined values of snares and hats from this.
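In code, the labeling step of equation (14) reduces to a few lines; this sketch assumes result.cluster.v is arranged with one centroid per row:

cMeans = mean(result.cluster.v, 2);    % mean energy of each centroid
[dummy, cS] = max(cMeans);             % snare: highest mean energy
[dummy, cH] = min(cMeans);             % hat: lowest mean energy
cK = 6 - cS - cH;                      % kick: remaining index (1+2+3=6)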
4. RESEQUENCING
4.1 Substitution Method
Once DetFuncA.m and DetFuncB.m have completed and their outputs have been returned to TimbreMut.m, a simple matching is needed to resequence the 'B' file onset list to that of the 'A' file onsets. Work by Ravelli et al. (2007) in this area introduces a substitution matrix, as presented in Figure 4(a): a systematic procedure of assigning rewards and penalties to specific pairings, such that non-class pairings (such as an 'A' kick with a 'B' snare) are discouraged, while strong on-beat events such as kick drums on the first beat of a loop are more likely to be paired. Once rewards and penalties have been assessed for each pairing, a resequenced vector is created by navigating backwards through the matrix along the pattern that provides the optimal score (Figure 4(b)). This implementation excels by following the resequence pattern that provides the highest possible score, and thus the most likely fit.
Figure 4: (a) substitution matrix with penalties and rewards for the same loop; (b) Needleman-Wunsch-Sellers algorithm for matching two different loops (Ravelli et al., 2007)
Figure 4(b) demonstrates the resequencing of loop 'B' (LMLM) to loop 'A' (LMLLLM). On-beat (boldface) / off-beat (regular font) determination is performed by a separate beat-tracking analysis stage. As the presented method does not make such distinctions, it is not possible to utilize the substitution matrix as formulated in the Ravelli method. Any implementation of a substitution matrix would then require some method of differentiation between similarly-classed events. However, if these events are already parsed into separate vectors, the need for a substitution matrix is removed. Pairing is then only possible by a comparative analysis of some feature produced during classification. The following section presents the only such measure output by the classification scheme: a distance metric from a centroid.
4.2 Proportional Centroid Distance (PCD)
Sequence alignment of two loops relies upon the basic assumption that there is some discernible characteristic within each of the vectors being analyzed that can be used as the source of comparison. Spectral similarity can only bring us so far: once we have delineated between classes, there truly is no reliable source of frequency comparison between loops. A feature related to spectral similarity can be created, however, through a stored output of kmeans.m, namely the aforementioned output.class.k/s/h, which gives the distance to the ownership centroid per class. Unfortunately, centroid distances are not predictable; strong variation between in-class events, such as rimshots and brushes on a snare drum, may result in a large variation in centroid distances. It is therefore essential to adapt this data to a comparative form. The method undertaken here is to use the proportional centroid distance (PCD), which is achieved simply by dividing each value in a particular subset by its largest distance.
To prevent class mismatches, resequencing in TimbreMut.m is performed sequentially on the individual class lists (TimbreMut.m, lines 68-83), resulting in resequenced 'B' class lists (B_kicks_nu, B_snares_nu, B_hats_nu), which are then combined for the cross-synthesis technique in the following section. Although PCD is irrelevant to spectral comparison between 'A' and 'B' beyond the class delineation (i.e. kick versus snare), it can be seen to effectively associate onsets within loop 'B' with the onsets of 'A' that are of similar deviation from the idealized timbres of the classes, as represented by the centroids.
Figure 5: PCD pairing and Resequencing of A and B kicks
Figure 5 demonstrates the method of comparison PCD utilizes for resequencing 'B' onsets. Each vector of distances (listed as 'A' kicks and 'B' kicks) is first divided by its maximum component, resulting in components ranging between 0 and 1. This then provides a method of comparison between the two lists.
To find the most appropriate pairings, PCD compares each 'A' onset with every 'B' onset, locating the minimum absolute deviation between the two lists. In the case of A_kicks(1), the correct pairing becomes B_kicks(13), because the smallest calculated difference exists between 0.5683 and 0.7186. Continuing in this fashion results in a resequenced pattern of [13,9,5,1,5,4,5,1,9], defined by B_kicks_nu. The process is then repeated for the creation of B_snares_nu and B_hats_nu. PCD is an especially quick operation, generally executed in under 0.025 seconds; it provides reliable results for loops with standard instrumentation, and is able to provide proper pairing for ghost notes (stick bounces) and the occasional odd timbral event, such as a rimshot, so long as these exist in both loops being matched.
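A condensed sketch of PCD pairing for a single class follows (the full version appears in TimbreMut.m, lines 68-83); A_k and B_k are assumed to hold the centroid distances of the A and B kick onsets (output.class.k):

A_pcd = A_k / max(A_k);                % normalize distances to 0-1
B_pcd = B_k / max(B_k);
match = zeros(1, length(A_pcd));
for i = 1:length(A_pcd)
    [dev, match(i)] = min(abs(A_pcd(i) - B_pcd));  % nearest 'B' onset
end
B_kicks_nu = B_kicks(match);           % resequenced 'B' kick list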
One interesting issue created through the use of centroid distance is that directionality is completely ignored. As a comparison of spectral characteristics is not performed on intra-class slices, a sort of spectral circumference is generated, and comparisons are based only on distance from an idealized class structure (ideal kick drum, snare drum, or hat/cymbal). The result of this is neither good nor bad, as the spectral makeup of different loops cannot be assumed to be similar. For example, the composition of a solid down-beat kick in loop A could most closely resemble, spectrally, the off-beat syncopated kick of loop B, which would be an undesirable pairing. Instead, the distance metric is used without regard to spectral positioning within the circumference.
4.3 Outliers
It is important to keep in mind that unwanted results may occur if one of the loops contains a majority of non-standard sounds such as rimshots and relatively few full snare hits, as this will move the centroid towards the rimshot timbre, making it the normal sound. When paired with another loop that uses the full snare as its comparative centroid element, the full snare will be matched with a rimshot. Of course, the user of the algorithm should be aware that such a pairing is to be expected: there are only three centroids, and if a rimshot exists as the main source of timbral expression for the snare centroid, then it should come as no surprise that its timbre will be duly represented within the final output .wav files.
5. TIMBRE MUTATION
5.1 Possible Methods
There are several methods by which the slices of one loop 'A' may be combined with those of another, 'B'. Any chosen synthesis technique should provide a consistent quality measure, and do so quickly, as the algorithm must iterate several times. Other considerations include a variable quality control for faster results when choosing loops and setting timbre configurations.
Certain input characteristics, such as variation in file length and pitch height, also play a role in the decision as to which method should be incorporated. Spectral modeling synthesis (SMS), linear predictive coding (LPC), and phase vocoder morphing are all appropriate methods by which to implement repetitive cross-synthesis; however, SMS and LPC require longer analysis durations for high-quality output than the phase vocoder method – LPC, for instance, requires a large number of coefficients to carefully model the spectral envelope of a sound. It should be noted here that because specific frequency-blending characteristics are not present (the princarg function is not utilized), the phase vocoder method should not generally be described as a cross-synthesis technique for tonal mutation applications; for this application, however, the method proves to be quite appropriate. The function PhaseVoxMut.m (Appendix 8.4) provides analysis, transformation, and resynthesis functionality to the TimbreMut.m algorithm.
5.2 PhaseVoxMut.m
Analysis and resynthesis performed on the time-frequency model employed during phase vocoding allows simple modification of magnitude and phase values. In a block-by-block approach, samples are input into a sliding-time-reference FFT calculation, such that:

X(n, k) = \sum_{m=-\infty}^{\infty} x(m)\, h(n-m)\, e^{-j 2\pi m k / N} \qquad (15)

Here, h(n) is a sliding window whose length is the same as the frame size. It is important to note that in the block-by-block approach, phases are not unwrapped by the princarg function, and as such take values between 0 and 2π. Amplitude values fall between -1 and 1.
In resynthesis, the IFFT is applied to each STFT spectrum, and the signal is then reconstructed through the use of the overlap-add method (Arfib et al., 2002). The combined process, also known as the direct FFT/IFFT approach, is demonstrated below in Figure 6.
Figure 6: Overview of the Overlap-Add Phase Vocoder Method (adapted from Zolzer et al., 2002)
Provided that the sum of the overlapped STFT windows is unity, as seen in Figure 7, a perfectly reconstructed signal is possible [13].
Figure 7: Overlap-Add Summation Model (adapted from Zolzer et al., 2002)
Arfib et al. (2002) explain that mutation between two sounds may be performed through the direct FFT/IFFT method, whereby two sounds are analyzed by individual sliding FFT windows in tandem [12]. Within each STFT frame, the phase and magnitude values of each sound are extracted.
The phase components of sound A ( \varphi_A(n) ) and sound B ( \varphi_B(n) ) are combined:

\varphi_A(n) + \varphi_B(n) = \varphi_{comb}(n) \qquad (16)

and the same method is used for the determination of the overall amplitude values from the individual A ( \left| X_A(n) \right| ) and B ( \left| X_B(n) \right| ) values:

\left| X_A(n) \right| + \left| X_B(n) \right| = \left| X_{comb}(n) \right| \qquad (17)

At the end of each window calculation, the phase and magnitude components are then merged:

y(n) = \left| X_{comb}(n) \right| e^{j \varphi_{comb}(n)} \qquad (18)

And finally, the IFFT returns the spectrum to a time-domain representation of the signal:

output_{PVM} = \sum_{n} \mathrm{IFFT}(y(n)) \qquad (19)
Figure 8 demonstrates the process of phase vocoder mutation.
Figure 8: Phase Vocoder Mutation
In a recursive process, PhaseVoxMut.m is provided correlated 'A' and 'B' samples (i.e. A_kicks and B_kicks_nu); it analyzes their lengths and determines the longer sample, which becomes the new sample length. The frame-by-frame STFT values from the earlier analysis in DetFuncA.m and DetFuncB.m are not stored, and as such must be recalculated in PhaseVoxMut.m – this calculation is quite brief, and its separation from the previous processes is essential for the separation of magnitudes and phases, as well as for the variable quality control measure discussed previously in this section.
5.3 User-Defined Mix Levels
For innovative phase vocoder mutation, [13] recommend experimenting with adjustments to the process of adding magnitudes, r = r1 + r2, where r is the overall magnitude, and r1 and r2 are the 'A' sample and 'B' sample magnitudes (per frame), respectively. To this end, I have implemented two coefficients – denoted here λ for magnitude and θ for phase (the mag and phase arguments of TimbreMut.m) – so that the user may control the amounts of each of these values from the command line, resulting in a modified equation which now reads (PhaseVoxMut.m, line 51):

\left| X_{comb}(n) \right| = \left| X_A(n) \right| (1 - \lambda) + \left| X_B(n) \right| \lambda \qquad (20)

This same principle is applied for phase control, described by theta on line 52:

\varphi_{comb}(n) = \varphi_A(n)(1 - \theta) + \varphi_B(n)\, \theta \qquad (21)

The adjusted magnitudes and phases are then recombined as per equation (18) on line 53. This returns the imaginary and real values for the IFFT transformation, which returns the result of (18) to the time domain. PhaseVoxMut.m then outputs amplitude values, similar to the wavread input used to create the hybridization, and uses the two sample names to output a concatenated name.
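The per-frame arithmetic of equations (20), (21), and (18) is compact; the following sketch shows a single STFT frame, assuming XA and XB are the current complex frames of the paired 'A' and 'B' slices, with mix coefficients mag and phase as passed to TimbreMut.m:

magComb = abs(XA)*(1-mag)     + abs(XB)*mag;      % eq (20)
phiComb = angle(XA)*(1-phase) + angle(XB)*phase;  % eq (21)
Y = magComb .* exp(1i*phiComb);                   % eq (18)
y = real(ifft(Y));                                % back to the time domain
% successive frames are then overlap-added into the output buffer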
The phase vocoding method results in the separation of magnitudes and phases within the frequency domain. Morphing using the phase vocoder technique produces two such sets of magnitudes and phases, and allows for their individual manipulation in the frequency domain. At the command line, once the filenames are given, the magnitude and phase coefficients provide a 0.0-1.0 control over each parameter, such that a value of 0.2 would coincide with a 20% mix (80% 'A', 20% 'B'). For example, the command line may read (a representative call, following the function signature in Appendix 8.1):
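TimbreMut('clarknova4bbcutm.wav', 'eel.wav', '0.8', '0.2')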
which corresponds to 80% (0.8) 'eel.wav' magnitudes and 20% (0.2) 'eel.wav' phases. It is important to mention here that the decorrelation of magnitudes and phases results in seemingly random pairings of phases with magnitudes, and the larger the divide, the more pronounced the effect. The most obvious disconnect will result when the magnitude and phase values differ by 1.0, such that the values read either:
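TimbreMut('clarknova4bbcutm.wav', 'eel.wav', '0.0', '1.0')
or
TimbreMut('clarknova4bbcutm.wav', 'eel.wav', '1.0', '0.0')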
In both cases, cavernous phaser effects are created. In the first example, TimbreMut.m will return a .wav resequenced to the rhythm of the first loop, which utilizes the envelopes of the sounds of 'clarknova4bbcutm.wav' while returning the phase components (and thus the sounds) of 'eel.wav'. While remaining rhythmically consistent with the first example, the second case will contain the envelopes of 'eel.wav' while maintaining the phases of 'clarknova4bbcutm.wav' – an interesting effect, considering that this file has also provided the output file with its rhythmic component.
Both of the aforementioned settings are extreme variations on the design of the algorithm, resulting in phaser-like effects upon the output waveform. Standard usage, however, will generally keep the magnitude and phase values consistent with one another – in fact, the default values are set to 0.8 and 0.8, which results in a clean mix of 80% 'B' loop in both magnitudes and phases. Equal values throughout the scale of 0.0-1.0 will then provide predictably clear results, with correlated phase values. At the top of the scale, both coefficients may be set to 1.0:
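TimbreMut('clarknova4bbcutm.wav', 'eel.wav', '1.0', '1.0')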
This setting will result in a full replacement of both phases and magnitudes; the output of
the algorithm will thus be the timbre of the ‘eel.wav’ loop playing the rhythm from the
‘clarknova4bbcutm.wav’ sample.
5.4 Window Size
The final user-defined option within the TimbreMut.m algorithm is the choice of window size. The default size is 1024 samples; however, any power of two will work. The purpose of this control is to allow the user to quickly create a transformation for testing. Timbre morphing is often a difficult process to predict, even without user-defined options such as phase and magnitude amounts. Larger window sizes are effective for accurate frequency representation; however, they tend to create non-linearity within the input-output relationship. Therefore, to fine-tune the morphing parameters, the user has access to a larger window size, which will generate low-quality versions of the transformation prior to a higher-quality rendering. As the loss of frequency resolution is less critical than that of temporal resolution, a final, higher-quality version may be produced by using a window size of 512, 256, or even 128 samples.
6. FURTHER CONSIDERATIONS
6.1 MIDI
This algorithm is designed not only to produce new timbral effects, but also to perform autonomously as a traditional beat-slicing program such as Propellerheads Recycle. Of course, the addition of MIDI is not simply for this purpose, but more intrinsically because MIDI techniques are still widely used within the electronic music community. Specifically, audio manipulation within software sequencers such as Pro Tools, Digital Performer, Logic, and Cubase is often executed with the use of software samplers and MIDI control. It is for this express reason that TimbreMut.m exports individual .wav slices, and a .mid file for their control. DetFuncA.m exports an 'nmat' file which stores the sample points at which onsets occur, which is then used by TimbreMut.m to create a MIDI file on lines 171 to 172 (Appendix 8.1).
DetFuncA.m also contains code for MIDI file creation, the result of altered code from the MATLAB MIDI Toolbox [14]. Notes are assigned from 1 to the length of the onset vector (output.mMIDI.notes), and their durations (output.mMIDI.dur) are prepared as the sample-point difference between successive onsets. The createnmat.m command from the MIDI Toolbox is used here to create a notematrix, a pre-MIDI structure containing notes, durations, and note-on and note-off times. The notematrix is then output from DetFuncA.m as output.mMIDI.nmat. These files are returned to TimbreMut.m and configured as a renamed MIDI file there.
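A minimal sketch of this export follows; it assumes the MIDI Toolbox functions createnmat and writemidi [14] (signatures as I understand them), with onsetSamp holding the onset sample points, fs the sampling rate, and fn the combined filename:

notes = 1:length(onsetSamp)-1;           % one note number per onset
dur   = diff(onsetSamp) / fs;            % durations in seconds
nmat  = createnmat(notes, dur);          % notematrix (assumed signature)
writemidi(nmat, [fn '.mid']);            % write the renamed MIDI file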
DetFuncB.m does not contain a MIDI preparation block, nor is one necessary: the flow of information is such that the second loop is resequenced fully to the rhythm of the first, and thus the process will never require a second MIDI file. This was a purposeful decision to limit the outflow of information from the detection algorithms, thereby reducing analysis processing time significantly. The absence of this feature is the only difference between the two functions.
6.2 Slicing and Combined Output
PhaseVoxMut.m is called iteratively within TimbreMut.m (Appendix 8.1), on lines 99, 110,
125, 136, 151 and 162. The internal output of the command is, as mentioned above, an
individual slice as a .wav, named for the two input samples. PhaseVoxMut.m also returns
an amplitude vector to TimbreMut.m, which, in combination with the stored ‘notes’ vector,
is used for a final resequenced, combined .wav.
6.3 Gaps and Clipping
During rhythm resequencing, substantial slice-length deviation between 'A' and 'B' samples may result in either a gap between the end of one slice and the beginning of the next, or clipping if a slice has not completed its full playback. The solution undertaken to solve both of these problems is two-fold. Before either problem is solved, it must be understood that the PhaseVoxMut.m command will, and must, choose a specific sample length. The first step therefore labels each sample of a pairing as the minimum or the maximum. PhaseVoxMut.m always creates an empty vector of the length of the longer sample, regardless of whether it belongs to 'A' or 'B', creating a situation whereby an 'A' loop with shorter durations will always clip (so long as the mix is above 0.0 for at least the magnitude value). Once this has been done, a simple 100-sample fade (Appendix 8.1, lines 181-182), placed at the end of each prescribed MIDI note length, removes any non-linearity associated with the truncation of the note events, and passes unnoticed by listeners.
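The fade itself is a one-line ramp; this sketch assumes slice holds the synthesized samples as a column vector and L is the prescribed MIDI note length in samples:

fade = linspace(1, 0, 100)';             % 100-sample linear fade
slice(L-99:L) = slice(L-99:L) .* fade;   % removes the click at the cut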
6.4 Omission of Timestretching Functionality
The decision to omit a timestretching function was made only after attempting various techniques and after careful consideration of several unavoidable scenarios. Certain assumptions must be made in considering the timestretching option. First and foremost is that each loop will contain slices of different sample lengths (unless, of course, the user is manipulating a loop with itself simply for the slicing benefit of the algorithm). If the difference between sample lengths is greater than a certain listener-defined tasteful duration, then how might an algorithm containing only three centroids to define class (along with centroid distance) deal with length? If the first loop contains long-duration ride cymbals and the second only closed hats, should the algorithm really stretch the closed hat to the duration of the ride cymbal when pairing occurs? Second, and closely related to the first point, if this is permitted and timestretching is utilized, at which point does the sample difference become uncomfortable to the ear? Beyond transient smearing, stretching even the decay of kick drums often results in a slight ringing and a phasy tone, which many electronic musicians and producers might frown upon. More importantly, a high-quality timestretching algorithm is, in itself, another process which takes time to complete. For these reasons, I have chosen to avoid adding a timestretching block to the algorithm, and will instead rely on the user to perform any such manipulation prior to input.
7. CONCLUSIONS
In this paper I have presented a method by which two percussion loops may be automatically manipulated such that 1) the second loop is modified to the rhythm of the first, and 2) the loops are then systematically morphed together. The TimbreMut.m algorithm appears to be a useful tool for the segmentation and manipulation of percussive rhythm and timbre. It consists of: 1) a detection function which effectively locates the onsets of percussive events, 2) a classification stage which uses basic principles of logic and inference in a deterministic categorization of the events into one of three groups, 3) resequencing using a component from the classification stage, and finally, 4) a mutation/morphing element which relies on phase vocoder techniques. These stages have been designed with an emphasis on modularity, such that each component is comprised of block forms for easy access or redefinition via alternate functions.
Quantifying the robustness of the algorithm is difficult without proper sample licensing. The database on which I have performed my tests is a private collection of 355 recorded funk and jazz vinyl records/CDs, along with a second set of 30 prepared studio drum loops, comprised of more basic beat structures and less timbral complexity. To test the accuracy of the onset detection, TimbreMut.m was modified to output individual slices of the A loop alone. On studio drum loops, DetFuncA.m performed impeccably, locating onset points precisely each time. It should be noted that these loops have been created with the intention of further manipulation by electronic music composers, and as such have a moderate amount of separation between elements. Detection on the recorded funk loops was slightly less predictable, although no false onsets were found. The most repeated error was the detection being too accurate: if a percussionist played a kick and hat closely together, but their attacks were not indistinct, the function would see the two as individual onsets, resulting in a short slice. For simple slicing of a loop this is fine; however, in timbre morphing or full-replacement scenarios, the replaced event would not fit comfortably, and a 'chopped' effect occurs. To remedy this situation, a simple if-statement could be included in the synthesis block, whereby slices must be of a given duration to enter PhaseVoxMut.m, and are otherwise passed directly to the final output.
Listener tests have been optimistic towards the output of the algorithm when the PhaseVoxMut.m frame size is set to moderate- (512 samples) to high-quality (128 samples) settings. At lower-quality settings, the sound becomes "grainy" (at 1024 samples) or "noisy" (2048 samples and above). The most often-received criticism regarding the recombined waveform was that the individual slice decays were often cut short. As the onset points for the rhythm cannot be altered, and a timestretching function is not a presently entertained option, there is little that can be done to prevent this problem. The effect is, however, often a desirable sonic event within electronic dance music genres such as drum and bass or glitch. Nevertheless, electronic musicians were for the most part interested in the individual output .wav files (along with the accompanying .mid file), and in the possibilities of non-standard drum sounds for alternate percussion timbres. This group was eager to try the algorithm to generate new timbres quickly for compositions.
Potentially the most problematic area of the algorithm is the classification stage. The simple three-class categorization method works well for loops whose constituent events are comprised of timbres defined by kick drum (K), snare drum (S), or hat/cymbal (H). For the most part, then, k-Means relegates non-K, S, H timbres to a position relative to the K, S, H tessellation. For example, floor toms are seen as outliers to the K class, and bells or triangles will be perceived as hats. If the number of non-percussive events such as triangles predominates over the usage of standard hats in a loop, then k-Means will still iterate its centroid towards the barycenter of all events, making the idealized hat a composite of both the hats and the triangles. Therefore, the more non-hat elements encompassed by the H centroid, the more loosely the definition will fit the slices. Again, this proves to be an issue only during mutation or full replacement, as a triangle from one loop may be blended with a hat from the second. In operations with the database of funk loops, this situation was common, as live performances and solos often include a stylistic individuality marked by not only heightened rhythmic complexity, but timbral uniqueness as well. As TimbreMut.m has been designed to work more robustly with studio drum loops, the user must use her/his scrutiny in loop choice. In future implementations, a nested k-Means method may be incorporated, such that the first delineation would be between low, middle, and high events; subclasses could then be derived from a secondary k-Means calculation, whose (secondary) k value is generated from the Euclidean distance of each of these events. Another answer may be found in the k-Medoid function, whereby a centroid is not chosen; rather, the data point closest to the center is chosen as the center of the mass. Comparative analysis may then be used to determine outlier validity.
The PCD method of resequencing could also be made more robust against outliers through outer-limit thresholding, such that once the distances have been brought within a certain range, an adaptive thresholding function could selectively eliminate all data points beyond it.
Presently, TimbreMut.m is not intended to operate real-time, and its usage is intended for
processing and preparation of percussive loops, it is not required that it rely on user input
once initial calculations have begun. A future implementation may include real-time
resequencing with live audio, a feature provided by Nick Collins BBCut, for the
Supercollider language, but this would require the reworking of several components in the
algorithm’s architecture within another program.
In further implementations of the program, I would also like to incorporate a
multi-resolution analysis phase, similar to that found in [15]. This would allow both a
more accurate representation of onsets within each chosen frequency range and a
better-suited feature vector for the classification stage.
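A minimal sketch of such a multi-resolution front end, assuming three fixed bands and an
energy-based per-band detection function; all names and band edges here are illustrative,
not part of TimbreMut.m:

function D = MultiResDet(y,fs)
%Hedged sketch (cf. [15]): band-split the input, compute a detection
%function per band, then sum the per-band results after normalization.
N=512; hop=N/2; win=blackman(N);
[bl,al]=butter(4,200/(fs/2));           %low band (kick region)
[bm,am]=butter(4,[200 5000]/(fs/2));    %mid band (snare region)
[bh,ah]=butter(4,5000/(fs/2),'high');   %high band (hat/cymbal region)
bands={filter(bl,al,y) filter(bm,am,y) filter(bh,ah,y)};
D=0;
for b=1:3
    S=abs(spectrogram(bands{b},win,hop,N,fs));
    d=sum(max(diff(S,[],2),0),1);       %half-wave rectified spectral flux
    D=D+d/max(d+eps);                   %normalize per band, then sum
end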
All in all, however, the TimbreMut.m algorithm provides a fast method of morphing
between percussive loops, generating novel timbres from pre-recorded samples, and its
architecture provides a strong basis for further development.
8. APPENDIX
8.1 TimbreMut.m
function [] = TimbreMut(filename1,filename2,mag,phase,framesize);
%Jason A. Hockman
%NYU Music Technology
%This function, given two different mono drumbeats, will return individual
%blended conglomerate files made up of the like sounds of the two loops, a
%midi file of the same root name, as well as a resequencing of the two
%individual files, made to play to the rhythm of the first loop. All these
%outputs are then found in a folder, within the main MATLAB catalog, with
%a combined name.
%
% e.g. A) '10.wav'  B) 'Brockie.wav'  Folder name will be: '10Brockie'
%
cd('/Applications/MATLAB_SV71');%this should be set to main Matlab folder
%===========================
% ---- Onset Detection ----
%===========================
if (exist('mag') ~= 1) %optional args for spectral magnitude and phase mix
mag=0.8;
else
mag=str2num(mag);
end
if (exist('phase') ~=1)
phase=0.8;
else
phase=str2num(phase);
end
if (exist('framesize') ~=1)%opt args for framesize...useful to start high
framesize=1024;
else
framesize=str2num(framesize);
end
output1 = DetFuncA(filename1);% complex domain detection function for A
output2 = DetFuncB(filename2);% complex domain detection function for B
fs=output1.audio.SR;
%A_defs
A=output1.audio.file;
A_kicks=output1.list.kick;A_snares=output1.list.snare;
A_hats=output1.list.hat;
A_k=output1.class.k;A_s=output1.class.s;A_h=output1.class.h;
A_ON=output1.onsetinfo;dur=output1.mMIDI.dur;
%B_defs
B=output2.audio.file;
B_kicks=output2.list.kick;B_snares=output2.list.snare;
B_hats=output2.list.hat;
B_k=output2.class.k;B_s=output2.class.s;B_h=output2.class.h;
B_ON=output2.onsetinfo;
%=============================
% -------- Filename --------
%=============================
fn1=filename1(1:strfind(filename1,'.')-1);
fn2=filename2(1:strfind(filename2,'.')-1);
if length(fn1)>4;
fn1=fn1(1:4);
end
if length(fn2)>4;
fn2=fn2(1:4);
end
fn=[fn1 fn2];mkdir(fn);%used for creation of folder+filename
%===========================
% ----- P C D Pairing -----
%===========================
m=1;n=1;p=1; %this is performed via proportional centroid proximity
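%For each event in A, the loops below select the B event of the same
%class whose normalized centroid distance is nearest; the resulting
%index lists re-order B's slices to follow A's rhythm.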
for i=1:length(A_k);
[Y(:,m),A_feign_kick(:,m)]=min(abs((A_k(i)/max(A_k))-...
((B_k(1:length(B_k)))/max(B_k)))); m=m+1;
end
for i=1:length(A_s);
[Y(:,n),A_feign_snare(:,n)]=min(abs((A_s(i)/max(A_s))-...
((B_s(1:length(B_s)))/max(B_s)))); n=n+1;
end
for i=1:length(A_h);
[Y(:,p),A_feign_hat(:,p)]=min(abs((A_h(i)/max(A_h))-...
((B_h(1:length(B_h)))/max(B_h)))); p=p+1;
end
B_kicks_nu=B_kicks(A_feign_kick);
B_snares_nu=B_snares(A_feign_snare);
B_hats_nu=B_hats(A_feign_hat);
%===============================
% -- Phase Vocoding & Slices --
%===============================
E=zeros(max(output1.mMIDI.dur),length(A_ON)-1);
m=1;n=1;p=1;
for i=1:length(A_kicks);
if A_kicks(i)<length(A_ON)-1;
A_kick=A(A_ON(A_kicks(i)):A_ON(A_kicks(i)+1));
if B_kicks_nu(i)<length(B_ON)-1;
B_kick=B(B_ON(B_kicks_nu(i)):B_ON(B_kicks_nu(i)+1));
else
B_kick=B(B_ON(B_kicks_nu(i)):length(B));
end
j=A_kicks(m);
pvmout=PhaseVoxMut(A_kick, B_kick, ([fn num2str(j)]),...
fn,mag,phase,framesize);
E(1:length(pvmout.result),j)=pvmout.result;
else
A_kick=A((A_ON(A_kicks(i))):A_ON(end));
if B_kicks_nu(i)<length(B_ON)-1;
B_kick=B(B_ON(B_kicks_nu(i)):B_ON(B_kicks_nu(i)+1));
else
B_kick=B(B_ON(B_kicks_nu(i)):B_ON(end));
end
j=A_kicks(m);
pvmout=PhaseVoxMut(A_kick, B_kick, ([fn num2str(j)]),...
fn,mag,phase,framesize);
E(1:length(pvmout.result),j)=pvmout.result;
end
m=m+1;
end
for i=1:length(A_snares);
if A_snares(i)<length(A_ON)-1;
A_snare=A(A_ON(A_snares(i)):A_ON(A_snares(i)+1));
if B_snares_nu(i)<length(B_ON)-1;
B_snare=B(B_ON(B_snares_nu(i)):B_ON(B_snares_nu(i)+1));
else
B_snare=B(B_ON(B_snares_nu(i)):length(B));
end
j=A_snares(n);
pvmout=PhaseVoxMut(A_snare, B_snare, ([fn num2str(j)]),...
fn,mag,phase,framesize);
E(1:length(pvmout.result),j)=pvmout.result;
else
A_snare=A((A_ON(A_snares(i))):length(A));
if B_snares_nu(i)<length(B_ON)-1;
B_snare=B(B_ON(B_snares_nu(i)):B_ON(B_snares_nu(i)+1));
else
B_snare=B(B_ON(B_snares_nu(i)):length(B));
end
j=A_snares(n);
pvmout=PhaseVoxMut(A_snare, B_snare, ([fn num2str(j)]),...
fn,mag,phase,framesize);
E(1:length(pvmout.result),j)=pvmout.result;
end
n=n+1;
end
for i=1:length(A_hats);
if A_hats(i)<length(A_ON)-1;
A_hat=A(A_ON(A_hats(i)):A_ON(A_hats(i)+1));
if B_hats_nu(i)<length(B_ON)-1;
B_hat=B(B_ON(B_hats_nu(i)):B_ON(B_hats_nu(i)+1));
else
B_hat=B(B_ON(B_hats_nu(i)):length(B));
end
j=A_hats(p);
pvmout=PhaseVoxMut(A_hat, B_hat, ([fn num2str(j)]),...
fn,mag,phase,framesize);
E(1:length(pvmout.result),j)=pvmout.result;
else
A_hat=A((A_ON(A_hats(i))):length(A));
if B_hats_nu(i)<length(B_ON)-1;
B_hat=B(B_ON(B_hats_nu(i)):B_ON(B_hats_nu(i)+1));
else
B_hat=B(B_ON(B_hats_nu(i)):length(B));
end
j=A_hats(p);
pvmout=PhaseVoxMut(A_hat, B_hat, ([fn num2str(j)]),...
fn,mag,phase,framesize);
E(1:length(pvmout.result),j)=pvmout.result;
end
p=p+1;
end
%===========================
% -------- MIDI ---------
%===========================
MIDIfileName=[fn '.mid']; %additional functionality for ReCycle-like use
MIDIfile=writemidi(output1.mMIDI.nmat,MIDIfileName);
%=============================
% Resequenced Combined Output
%=============================
df=(-1:.01:0);df=abs(df)';%used for fade
notes=[1:length(A_ON)-1];start=A_ON(1:length(A_ON)-1);
buff=zeros(1,length(A));E=E';
for i=1:length(notes);
drumfade=[ones(dur(i)-101,1); df];%fade for clip prevention
buff(1,start(i):start(i)+dur(i)-1)=E(i,1:dur(i)).*drumfade';
end
%===========================
% ------ Resequence ------
%===========================
wavwrite(buff,fs,[fn 'full.wav']);
cd('/Applications/MATLAB_SV71');
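A typical invocation from the MATLAB prompt might look as follows; the loop names
echo the example in the header comment, and the numeric arguments are passed as
strings because the function converts them internally with str2num:

TimbreMut('10.wav','Brockie.wav','0.8','0.8','1024');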
8.2 DetFuncA.m & DetFuncB.m
function output=DetFuncA(filename);
%Jason A. Hockman
%NYU Music Technology
%This function, which is used to find the onsets of each event within a
%percussive audio file, is based on the research of Bello et al. It
%incorporates both phase (proper for kick detection) and STFT energy
%assessment for high freq elements (snares+hats). This function outputs
%onsets, their start times, as well as a MIDI file.
[y,fs]=wavread(filename);
output.audio.file=y;
output.audio.SR=fs;
% ===========================
% ---- Initializations ----
% ===========================
m=1;
v=1:5:36;
q=1;
thresh=20;
N=512;
hop=N/2;
win=blackman(N);
file_name=[];
tic
% ==========================================
% ---- Complex-Domain Onset Detection ----
% ==========================================
A=spectrogram(y,win,hop,N,fs);%complex STFT (phase is needed below)
pin=0;
pout=0;
pend=size(A,2)-2;%number of STFT frames, less look-ahead
while pin<pend
fta_prior=angle(A(1:hop,pin+2));%prior phase angle
fta_priorprior=angle(A(1:hop,pin+1));%phase angle two frames back
STFT_present=abs(A(1:hop,pin+3));%present STFT
STFT_prior=abs(A(1:hop,pin+2));%prior STFT
newthresh(:,m)=median(STFT_present);
d = princarg((2*fta_prior) - fta_priorprior);%phase unwrap
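%onDet below is the Euclidean distance between each observed spectral
%frame and its phase-predicted estimate (law of cosines in the complex
%plane), following the complex-domain method of Bello et al. [3]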
onDet(:,m)=(((STFT_prior).^2)+((STFT_present).^2)- ...
((2*STFT_prior).*(STFT_present).*(cos(d)))).^0.5;%OnDet
bnc(:,m)=(STFT_present(v));%storing 8 low-frequency fft bins (every 5th up to bin 36)
m=m+1;
pin=pin+1;
end
pk=[zeros(1,2) sum(onDet)];
nt=[zeros(1,2) (newthresh)];
peaks1=(diff(pk));%derivative of onsets
thresh = min(thresh, mean(abs(peaks1)));
adpk=pk.*0.6;
adthresh=diff(adpk);
peaks=[-20 LocalMaxima(peaks1,thresh,adthresh)];%peakpicking
for n = 2:length(peaks);
if ((peaks(n))-(peaks(n-1))) > 15
P1(q) = peaks(n);
q = q+1;
end
end
P1 = [P1 size(bnc,2)];
for k = 1:length(P1)-1
Mx(:,k) = mean(bnc(:,P1(k):min(P1(k)+7,P1(k+1)-1)),2);
end
ONSETS=(P1(1:end-1))*hop;
% ===========================
% ------- K-Means -------
% ===========================
data.X=Mx';
[N,n]=size(data.X);
data=clust_normalize(data,'range');%data normalization
plot(data.X(:,8),data.X(:,8),'.')
hold on
param.c=3;% # of Clusters
param.vis=1;% visualization parameter (set to 0 or 1)
param.val=2;
result=Kmeans(data,param);
hold on
plot(result.cluster.v(:,1),result.cluster.v(:,2),'ro')
result=validity(result,data,param);
result.validity;
%plot(result.cluster.v')
CS1=(result.cluster.v(1,:));
CS2=(result.cluster.v(2,:));
CS3=(result.cluster.v(3,:));
%means of centroids
SC1m=(sum(CS1))/8;
SC2m=(sum(CS2))/8;
SC3m=(sum(CS3))/8;
SCm=[SC1m SC2m SC3m];
[h,s]=max(SCm);%these 3 lines are defining the data
[n,h]=min(SCm);
k=6-s-h;
[output.list.kick,J,output.class.k]=find(result.data.d(:,k).*result.data.f(:,k));
[output.list.snare,J,output.class.s]=find(result.data.d(:,s).*result.data.f(:,s));
[output.list.hat,J,output.class.h]=find(result.data.d(:,h).*result.data.f(:,h));
m=1;
ON = [ONSETS length(y)];
for i=1:length(ON)-1
dur(m)=ON(i+1)-ON(i);
m=m+1;
end
output.onsetinfo=ON;
output.mMIDI.notes=(1:length(ONSETS))';
output.mMIDI.dur=dur;
dur=(dur/fs);
output.mMIDI.nmat=createnmat(output.mMIDI.notes,dur);
output.mMIDI.start=ONSETS';
output.mMIDI.finish=ON(2:length(ON))';
toc
8.3 LocalMaxima.m
function peaks = LocalMaxima(peaks1, thresh, adthresh);
% peaks = LocalMaxima(peaks1): return indices of elements of peaks1
% which are positive peaks
i = 2:length(peaks1)-1;
peaks = find((peaks1(i)>peaks1(i-1))&(peaks1(i)>peaks1(i+1))...
&(peaks1(i)>(thresh + adthresh(i))));
8.4 PhaseVoxMut.m
function pvmout = PhaseVoxMut(file1,file2,file3,file4,r_coef,t_coef,framesize);
%Jason A. Hockman
%NYU Music Technology
%This function, based on the design of a phase vocoder from the DAFx book,
%provides a simple method of combining the two alike classified sounds
%through spectral analysis and combination. The modification here which
%makes the process different, however, is a two tiered coefficient method,
%r_coef and t_coef, which control the spectral makeup and envelope amounts
%from A or B, respectively; providing user control.
A=file1;B=file2;
output=file3;
outputfldr=['/Applications/MATLAB_SV71/' file4];
%===========================
% ---- INITIALIZATIONS ----
%===========================
n1=framesize;fs=44100;
n2=n1;
N=8192;
win=hanningz(N);%analysis/synthesis window (length N, matching the grains below)
c=min(length(A),length(B));d=max(length(A),length(B));
X1=zeros(d,1);X2=zeros(d,1);
X1(1:length(A))=A; X2(1:length(B))=B;
L=min(length(X1),length(X2));
X1=[zeros(N,1);X1;zeros(N-mod(L,n1),1)]/max(abs(X1));
X2=[zeros(N,1);X2;zeros(N-mod(L,n1),1)]/max(abs(X2));
X_out=zeros(length(X1),1);
tic
%============================
% ------- Phase Vox -------
%============================
pin=40;
pout=0;
pend=length(X1)-2*N;
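%Each N-sample grain of A and B is windowed and transformed; their
%magnitude and phase spectra are cross-mixed by r_coef and t_coef,
%then the result is overlap-added back at hop size n1.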
while pin<pend
grain1 = X1(pin+1:pin+N).* win;
grain2 = X2(pin+1:pin+N).* win;
f1     = fft(fftshift(grain1));
r1     = abs(f1);
theta1 = angle(f1);
f2     = fft(fftshift(grain2));
r2     = abs(f2);
theta2 = angle(f2);
r      = (r1*(1-r_coef))+(r_coef*r2);
theta  = (theta1*(1-t_coef))+(t_coef*theta2);
ft     = (r.*exp(i*theta));
grain  = fftshift(real(ifft(ft))).*win; %resynthesis
X_out(pout+1:pout+N) = X_out(pout+1:pout+N) + grain;
pin    = pin + n1;
pout   = pout + n2;
end
toc
%===========================
% -------- Output --------
%===========================
cd(outputfldr);
X1 = X1(N+1:N+L);
X_out = X_out(N+1:N+L) / max(abs(X_out));
wavwrite(X_out, fs, output);
pvmout.result=X_out;
pvmout.c=c;
9. REFERENCES
[1] E. Ravelli, J.P. Bello, M. Sandler, “Automatic Rhythm Modification of Drum Loops,”
IEEE Signal Processing Letters, April 2007
[2] J.P. Bello, E. Ravelli, M. Sandler, “Drum Sound Analysis for the Manipulation of
Rhythm in Drum Loops,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal
Processing, vol. 5, May 2006, pp. 233 - 236.
[3] J.P. Bello, C. Duxbury, M. Davies, and M. Sandler, “On the Use of Phase and Energy
for Musical Onset Detection in the Complex Domain,” IEEE Signal Processing Letters,
vol. 11, no. 6, June 2004
[4] R.O. Duda, P.E. Hart, D.G. Stork, “Pattern Classification,” 2nd edition, New York,
John Wiley and Sons – Interscience
[5] J.P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, M.B. Sandler, “A Tutorial
on Onset Detection in Music Signals,” IEEE Transactions on Speech and Audio
Processing, vol.13, no.5, September 2005
[6] U. Zolzer, “Introduction” DAFx - Digital Audio Effects, Wiley and Sons, London,
2002
[7] X. Rodet and F. Jaillet, “Detection and Modeling of Fast Attack Transients” in
Proceedings of the International Computer Music Conference, 2001
[8] J.P. Bello and M. Sandler. “Phase-Based Note Onset Detection for Music Signals,” in
Proc. IEEE Int. Conference Acoustics, Speech, and Signal Processing (ICASSP-03), Hong
Kong, 2003
[9] C. Duxbury, J.P. Bello, M. Sandler, M. Davies, “Comparison Between Fixed and
Multiresolution Analysis for Onset Detection in Musical Signals” in the 7th Conf. on
Digital Audio Effects. Naples, Italy, October 2004
[10] J. MacQueen, “Some Methods for Classification and Analysis of Multivariate
Observations,” Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics
and Probability, vol. 1, 1967
[11] B. Balazsko, J. Abonyi, B. Feil, “Fuzzy Clustering and Data Analysis Toolbox”, 2004
[12] D. Arfib, F. Keiler, U. Zolzer, “Time Frequency Processing,” DAFx - Digital Audio
Effects, Wiley and Sons, London, 2002
[13] R.E. Crochiere, “A Weighted Overlap-Add Method of Short-Time Fourier
Analysis/Synthesis,” IEEE Transactions on Acoustics, Speech, and Signal Processing,
vol. 28, no. 1, 1980
[14] T. Eerola, P. Toiviainen, “MIDI Toolbox: MATLAB Tools for Music Research,”
Department of Music, University of Jyväskylä, Finland, 2004
[15] C. Duxbury, J.P. Bello, M. Sandler, and M. Davies, “A Comparison Between Fixed
and Multiresolution Analysis for Onset Detection in Music Signals” in the 7th Conference
on Digital Audio Effects. Naples, Italy, October 2004
[16] A. Light, “The Vibe History of Hip Hop,” New York, Three Rivers Press, 1999
[17] http://www.propellerheads.se/products/recycle/index.cmf
[18] http://www.fxpansion.com/product-guru-main.php