
NEW YORK UNIVERSITY

Computational Rhythm Similarity Development and Verification Through Deep Networks and Musically Motivated Analysis

by
Tlacael Esparza

Submitted in partial fulfillment of the requirements for the Master of Music in Music Technology in the Department of Music and Performing Arts Professions in the Steinhardt School, New York University

Advisor: Juan Bello

January 2014
Abstract

NEW YORK UNIVERSITY, Steinhardt
Master of Music
by Tlacael Esparza
In developing computational measures of rhythmic similarity in music, validation methods typically rely on proxy classification tasks on common datasets, equating rhythm similarity to genre. In this paper, a novel state-of-the-art system for rhythm similarity is proposed that leverages deep network architectures for feature learning and classification, using the standard approach of genre classification on a well-known dataset for validation. In addressing this method of validation, an extensive cross-disciplinary analysis of the performance of this system is undertaken. In addition to analyses through MIR, machine learning and statistical methods, a detailed study of both the results and the dataset is performed from a musicological perspective, delving into the musical, historical and cultural specifics that impact the system. Through this study, insights are gained in gauging the abilities of this measure of rhythm similarity beyond classification accuracy, as well as a deeper understanding of this system design and validation approach as a musically meaningful exercise.
Acknowledgements

I would like to thank Professor Juan Bello for his guidance, encouragement and dedication to my education, and Eric Humphrey, without whom I would have been lost in a deep network somewhere. Many people have helped me along the way with this work and I am very grateful for their time and generosity. These include Uri Nieto, Mary Farbood, Adriano Santos and Professor Larry Crook, as well as Carlos Silla and Alessandro Koerich for the Latin Music Dataset and their insights into the data collection process. And most importantly, thanks to my family and my fiancée, Ashley Reeb, for their unwavering emotional, spiritual, intellectual and financial support.
Contents

Abstract
Acknowledgements
List of Figures
List of Tables

1 Introduction

2 Explication and Literature Review
  2.1 Computational Music Similarity Measures
  2.2 Rhythm Similarity
      2.2.1 Onset Patterns
  2.3 Machine Learning
      2.3.1 Deep Networks for Feature Learning and Classification

3 Approach
  3.1 Onset Patterns Implementation
  3.2 Deep Network Implementation
      3.2.1 Applying the Deep Network
  3.3 Analytic Approach

4 System Configuration and Results
  4.1 Dataset
  4.2 Methodology
  4.3 OP Parameter Tuning
      4.3.1 Periodicity Resolution
      4.3.2 Frequency Resolution
  4.4 Deep Network Parameterization
      4.4.1 Layer Width
      4.4.2 Network Depth
  4.5 Optimal System Configuration

5 OP Rhythm Similarity Analysis
  5.1 Tempo Dependence
  5.2 Fine Grain Rhythmic Similarity

6 Dataset Observations
  6.1 Ground Truths
  6.2 Artist/Chronological Distribution
  6.3 Brazilian Skew

7 System Verification Issues
  7.1 Rhythm-Genre Connection
  7.2 Inter-genre Influence
  7.3 Rhythm as Genre Signifier

8 Conclusions

Bibliography
List of Figures

2.1 Extraction of Onset Patterns (OP) from the audio signal.
4.1 Effect of P on classification. The highest result is highlighted in blue, while significantly different results are in red.
4.2 Effect of F on classification. The highest result is highlighted in blue, while significantly different results are in red.
4.3 Mean comparison of ANOVA tests on network layer complexity in a 2-layer architecture show significantly lower results for small M.
4.4 Top: Progression from input to output shows increasingly compact genre representations. Bottom: Progression from input to output shows increasingly distant classes.
5.1 Gaussian modeled tempo distributions by genre in the LMD.
6.1 Left: OP of a modern Sertaneja track. Right: OP of a Tango recording from 1917.
6.2 Top: Geographical spread of Bolero vs. Brazilian genres. Bottom: Detail of geographical spread of Brazilian genres.
List of Tables

2.1 Summary of main approaches in the literature for computational rhythm similarity.
4.1 ANOVA results for classification scores with varying P values show that periodicity resolution is a significant factor.
4.2 Classification accuracies for different features on the LMD.
4.3 Classification accuracies by genre, ordered from highest classification score to lowest, show Brazilian genres generally performing worse than the rest.
4.4 Confusion matrix shows classification affinities between Sertaneja and several other genres.
4.5 Comparison of different classifiers on OP data. The proposed system outperforms all others by a margin of 2.23%.
5.1 Results of binary logistic regression, with classification success as dependent variable and BPM and density as inputs, show density is significant while BPM is not.
5.2 Hosmer & Lemeshow test shows BPM and density data to be poor predictors for classification success.
5.3 Feel breakdown by genre showing percentage of tracks in each genre that are swung.
5.4 Comparison of actual genre feel versus predicted genre feel for LMD classification results.
6.1 LMD infometrics.
Chapter 1
Introduction
A fundamental goal in the field of music information retrieval (MIR) is to extract musically meaningful information from digitized audio signals through computational methods. This, in its vagueness and breadth, describes most MIR tasks. In practice, and with the field still in its relative infancy, these tasks have often been simplified to extracting musical feature representations that highlight basic characteristics like pitch, harmony, melody, tempo, timbre, structure and rhythm, among others. With the assumption that complex musical features such as mood or genre are signified by sets of fundamental musical attributes, it is hoped that these more abstract characteristics can be identified through combinations of these methods [1, 2].
This work has many motivations as well as existing successful applications. Pitch and beat tracking algorithms have found widespread use in digital audio workstations such as Pro Tools and Ableton Live, enabling pitch and beat correction, tempo estimation, and time-stretching of recorded audio. These functionalities have been used to great effect in current popular music and are often audibly detectable, as with the music of the artist T-Pain, known for heavy use of auto-tuning software.
Beyond music production, these computational methods can be leveraged to analyze and annotate the ever-growing and intractably large collections of music that the digital age has enabled. With approximately 75,000 official album releases in 2010 alone as an indication of scale [3], and with digital transmission the primary means of maintaining and consuming this music, a computational approach to annotating and cataloging these collections is highly desirable. Indeed, new digital-era companies that serve streams of music to users on demand, such as Spotify, SoundCloud and Pandora, have begun to employ many MIR methods (and researchers) for genre detection, playlist generation and music recommendation, among other services.
A main objective of this thesis is to examine and further develop computational methods of measuring rhythm similarity in music signals. The importance of rhythm to music almost needs no mention. Under composer Edgard Varèse's generous definition of music as "organized sound", rhythm remains fundamental in that time is one of the few dimensions along which sound can be organized. With the MIR community's goal of fully parsing musical content through computational means, contending with rhythm is thus an important step. Combining previous research on rhythm from the MIR literature with advances in machine learning, this work presents a state-of-the-art system for measuring rhythm similarity.
In the hope of anchoring this abstracted computational process to its stated goal of extracting musically, and specifically rhythmically, meaningful information, this work makes a concerted effort, beyond what is common in the literature, to analyze not only the results but also the dataset used and the system's design from a multi-disciplinary perspective. Using a standard MIR verification scheme on a well-known dataset, the Latin Music Dataset [4], and through statistical analyses of dataset metadata, analysis of the dataset from a musical, cultural and historical perspective, and scrutiny of the basic assumptions built into the design, it is hoped that the system's musical relevance can be understood with greater clarity than the common classification-score measuring stick allows.
Further, with a great deal of personal interest and domain knowledge in the subject (rhythm) and the specific area of application for this work (Latin rhythms and music), it is hoped that this research will provide a useful and elucidating look into the analysis of computational rhythm similarity measures, and also act as encouragement to take on this level of scrutiny when developing computational methods for music applications in general.
Chapter 2
Explication and Literature Review
MIR research has largely followed a standard and persistent design model in developing novel methods for parsing music signals. This model comprises two steps: one, a hand-crafted feature extraction stage that aims to single out a given musical characteristic from an input signal; and two, a semantic interpretation or classification stage applying some function that maps this feature to a symbolic representation. This chapter surveys previous work on both of these system components as they relate to the current task of measuring rhythm similarity. Sections 2.1 and 2.2 summarize standard approaches to feature extraction and review the various attempts at characterizing rhythm through feature design, highlighting the development of Onset Patterns, which serve as a jumping-off point for further research. Section 2.3 reviews improvements to feature extraction methods through the use of sophisticated machine learning algorithms, pointing to a blurring of the distinction between feature extraction and classification and setting the direction for this research.
2.1 Computational Music Similarity Measures
Though some feature extraction methods produce easily interpretable, musically relevant representations of the signal directly, certain feature representations are imbued with musical meaning only as measures of similarity. For instance, the output of a tempo detection algorithm, which looks for strong sub-sonic frequencies, can be interpreted easily by looking for a global maximum in the representation, revealing a beats-per-minute value, a standard unit of tempo. Conversely, Mel-Frequency Cepstral Coefficients (MFCCs), widely used as a measure of timbre, are not musically interpretable on their own but,
paired with a distance metric, can be used to identify sounds based on distance to a
known example, a common application of which is instrument identification [5, 6].
In this paradigm of measuring similarity, musical facets can be seen as existing in some multi-dimensional space where similar representations are grouped closely together, and the feature extraction algorithm is a mathematical projection of a given signal into one of these spaces. Through this approach, a posteriori-defined properties such as rhythm and structure, esoteric qualities such as timbre, and complex characteristics such as mood can be inferred based on their distance to labelled examples in their respective feature spaces. In this way, classification supplies semantic meaning and a perceptually meaningful framework for analyzing these more complicated features. As an example, the quality of sound referred to as timbre is not easily defined, difficult to conceptualize, and its manifestations difficult to describe with language. However, agreeing that timbre is the feature that distinguishes the sound of one instrument from another, we can discuss timbre similarity through the task of matching multiple recordings of the same instrument, defining it in finite terms (i.e. the timbre of a flute vs. the timbre of a horn).
One of the major obstacles to this approach is the necessity of labeled datasets. The development, verification and interpretation of these algorithms rely on classification tasks over pre-labeled examples; without an example of flute timbre to match against, an unlabeled signal cannot be identified with this characteristic. Ideally, when developing new similarity features, the verification dataset suits the task well by representing the desired musical feature homogeneously within a given class, but datasets with similarity-based ground truths can be expensive and time-consuming to produce. To address this, the MIR community actively compiles and shares labeled datasets for these purposes; examples of widely used datasets include labeled audio of monophonic instruments (McGill University Master Samples), audio accompanied by various descriptive tags (Magnatagatune) and many datasets divided along genre membership (LMD, Ballroom Dance, Turkish Music).
In practice, however, this has often led to the use of datasets not created specifically for the similarity measure of concern, employing a proxy identification with some other, more easily identifiable characteristic. This includes the very common use of genre as a proxy for texture and rhythm similarity (which this thesis knowingly employs) [7-12], and cover song identification for harmonic and melodic similarity [13-15]. Implicit in this approach is the assumption that ground truths in these datasets correlate strongly enough with the musical characteristic being measured to provide meaningful classification results and system verification.
2.2 Rhythm Similarity
Rhythm is a complex and broad musical concept with varying definitions in different contexts. Though rhythm exists in music on various time-scales and can describe anything from the timing of a melody to textural shifts and large-scale events, in this paper (and in the MIR literature on the subject), rhythm is taken to refer to regularly repeating sound events in time on the musical measure level (approximately 2-8 seconds); that is, rhythm as those looping musical phrases that a percussionist or a bass player in a dance band might play.
In the MIR literature, analyzing rhythmic similarity is distinct from rhythm description or transcription tasks. Where the latter seek to transform a musical signal into symbolic annotations or describe it directly in some manner, the former is concerned only with isolating rhythm as invariant across different instances, often using highly abstracted representations. Although [16] provides a framework for understanding rhythm similarity with symbolic sequences, this kind of analysis is not applicable to the rapidly growing body of recorded audio for several reasons: the vast majority of audio recordings do not have this level of annotation; providing this information by hand is prohibitively time-consuming; and computational methods of annotation remain ineffective [17]. Hence, signal-based methods for rhythm analysis are highly desirable.
From a conceptual perspective, isolating this level of rhythm as an invariance requires removing pitch, tempo and timbre dependence, so that a rhythm played on two different instruments, using different pitches and at different speeds, will be recognized as the same. However, previous approaches tailor this list according to the intended application and sometimes include additional dependencies to be removed: phase, referring to the position of the start of a repeating rhythm; and temporal variance, referring to the evolution of a rhythm over longer time frames. Removing phase and temporal variance is a practical consideration specific to signal processing concerns; though a human can often easily recognize the beginning of a rhythmic pattern based on larger musical context, recognizing this computationally has been shown to be problematic [18], and when analyzing a signal there is no guarantee that its beginning will correspond to the beginning of a repeating rhythm. Similarly, for track-wise classification, temporal invariance works towards minimizing the effects of portions of audio where there is no discernible rhythm or where changes in rhythm are not representative of the track as a whole.
Aside from a handful of intuitively motivated rhythm similarity systems that extract unit rhythm representations and preserve phase by employing often complicated heuristics to deduce the beginning of a phrase [19-21], most designs remove phase and take a more abstracted approach. Though differing in important ways, they typically follow a common script: 1) calculate a novelty function from the signal, removing pitch-level frequencies and highlighting rhythmic content; 2) produce a periodicity and/or rhythmic decomposition of this novelty function by analyzing consecutive rhythm phrase-length windows (typically 4 to 12 seconds), capturing local rhythm content on this scale; 3) transform this local representation by warping, shifting, resampling or normalizing to remove tempo dependence; 4) aggregate local representations over time to produce a track-wise rhythm representation, removing temporal dependence.
Table 2.1 summarizes these four steps for each of the main approaches in the literature. In this table, "Effected Dimension" refers to the musical dimension that each stage acts to either preserve or remove. Though all of these methods remove pitch content in the novelty function calculation, the Scale Transform implemented in [11, 22, 23] and the Fluctuation and Onset Patterns implemented in [10, 11, 24] do preserve some level of timbre through multi-band novelty function representations. Most approaches produce local rhythm representations using the Auto-Correlation Function (ACF) or Discrete Fourier Transform (DFT). [12] notes that these functions are beneficial for preserving the sequential order of rhythmic events; as periodicity representations, in which only rhythm-level frequencies are coded, they also remove phase. Rhythm Patterns [9, 25] diverge from this approach by including, in addition to periodicity analysis, Inter-Onset Intervals (IOI), which encode the spaces between onsets in the novelty function, and Rhythm Patterns, which are bar-length representations of the novelty function. This is a robust approach but relies on unreliable heuristics for extracting the downbeat used to determine the Rhythm Pattern. All of the approaches make some effort to remove temporal variance through temporal aggregation over all frames.
Approach                                | Novelty           | Local Rhythm               | Rhythm Scaling/Morphing                  | Aggregation
Effected Dimension                      | Pitch/Timbre      | Phase                      | Tempo                                    | Temporal
Beat histogram [26-29]                  | single            | ACF                        | log-lag + shift detection, sub-sampling  | histogram
Rhythm patterns [9, 25]                 | single            | rhythm patterns, ACF, IOI  | bar-length normalization                 | k-means, histogram
Hybrid ACF/DFT [12, 30]                 | single            | DFT, ACF and hybrids       | resampling with local tempo              | mean
Scale transform [11, 22, 23]            | single, multiband | ACF, DFT                   | scale transform                          | mean
Fluctuation/Onset patterns [10, 11, 24] | multiband         | DFT                        | log-frequency + subsampling              | mean

Table 2.1: Summary of main approaches in the literature for computational rhythm similarity.
The biggest divergences among these designs can be seen in the various methods for removing tempo sensitivity from the representation. Noting that relative rhythmic structure can be compared more easily as a shift on a log scale than as a stretch on a linear scale, a log-lag mapping in the Beat Histogram [28, 29] or a log-frequency mapping in the Onset Pattern [10] allows for reduced sensitivity to tempo changes, where only large tempo differences remain noticeable. In [10, 29], the effect of tempo is further reduced by sub-sampling in the log-lag/frequency domain to produce a coarser representation. [28] employs a shift in the log-lag domain to obtain a fully tempo-insensitive representation, but this relies on determining the proper shift value, which is prone to errors. Subject to similar problems are the methods employed in calculating the Hybrid [12] and, as mentioned before, the Rhythm Pattern [9], which rely on determining tempo and bar boundaries for tempo normalization and bar-length pattern identification. The octave errors common to tempo estimation algorithms are problematic here, leading to inconsistencies in rhythm representations for these methods. [22, 23] offer a robust, fully tempo-invariant approach that takes the scale transform of the ACF, resulting in a scale-invariant version of the already shift-invariant ACF, obviating the need to determine a shift amount to correct for the shift introduced by log-lag mapping.
Though the Beat Histogram and Hybrid ACF/DFT, if applied successfully, do result in a fully pitch-, tempo-, timbre- and phase-invariant rhythm representation, these are less useful when tasked with measuring rhythm similarity in the context of general similarity in multi-instrumental recorded music. Indeed, with most of these methods performing verification through genre identification on standard dance-based datasets, better classification success has been obtained with the Onset Pattern [10, 11] and the Scale Transform [22, 23], which both preserve some level of timbre dependence through a multi-band representation. This makes sense when the question is not "are these rhythms the same?" but rather "do these two tracks sound similar from a rhythmic perspective?", where the listener looks not only for similar rhythms but for similar orchestrations of those rhythms. While the former might be more conceptually pure with respect to rhythm similarity, the latter is more amenable to a genre classification task and to use as a tool in measuring general music similarity.
It merits repeating here that nearly all of the rhythm similarity studies mentioned above employ genre identification as a verification method. Recalling the use of already-available datasets in lieu of ones tailored for the task (in this case, a dataset labeled according to a specifically defined understanding of rhythm similarity), this has been a common and generally accepted practice in rhythm similarity research. In using dance-based datasets (LMD [11], Ballroom Dance [9-12, 25], Turkish Music [22, 23]), the underlying assumption behind this practice is not only that a rhythm can be reliably associated with a specific genre, but also that a given genre has a representative rhythm, justifying a bijective mapping from one to the other.
2.2.1 Onset Patterns
Taking the perspective that a timbre-sensitive approach to rhythm similarity is desirable for application to multi-instrumental music signals, and noting the importance of reducing reliance on error-prone heuristics in the design, the Onset Pattern and the Scale Transform stand out as promising approaches. The primary difference between these two lies in their approach to tempo invariance: the Scale Transform achieves full tempo invariance, while the Onset Pattern is invariant only to local tempo changes.
As [11] effectively shows, tempo can be an important and identifying characteristic for certain genres. Although the motivation here is not genre identification, this suggests that tempo may also be important for the perception of rhythm similarity. If two songs have the same rhythm but tempos so different that they produce a different effect on the ear, this becomes a characteristic worth tracking. With this in mind, the Onset Pattern, which encodes only relatively large differences in tempo, is especially promising for further development as a general measure of rhythm similarity in music.
Onset Patterns (OP), as first described in [10] and refined in [11], are relatively straightforward to calculate and follow the signal pipeline mentioned above. As illustrated in Figure 2.1: 1) the signal is transformed to the time-frequency domain, processed to produce a novelty function through spectral flux, mean removal and half-wave rectification, and sub-sampled to produce log-spaced frequency sub-bands; 2) log2-frequency DFTs are applied to these sub-bands over 6-8 second windows to produce a periodicity representation; 3) each frame is subsampled in the frequency and periodicity dimensions to generalize the representation; and 4) frames are aggregated to produce a track-level representation. Not detailed in these steps, however, is the ordering of "pooling" stages, important to [10]'s design, that act to summarize multi-band information into a smaller representation; in particular, pooling occurs in the frequency dimension both before and after calculating periodicity. Also left out is a normalization step to correct for artifacts from the various log-compression and pooling steps. The justifications for these design choices, as well as the implementation of this normalization step, are left unclear in the original paper.
Figure 2.1: Extraction of Onset Patterns (OP) from the audio signal.
[11] refines this process by systematically testing different designs and parameters. Of particular note in its findings are the importance of window size in the periodicity calculation and the negligible effect of the specific ordering of pooling steps. With an 8-second window (versus 6 seconds in [10]), a single pooling stage can be applied at the end with no effect on overall efficacy. Through this exhaustive search, [11] was able to improve OP performance beyond the original design. However, these results are based on necessarily limited parameter testing, constrained by time and feasibility and largely reliant on ignoring possible interaction effects between parameter choices, highlighting the difficulties of optimizing feature extraction by hand.
2.3 Machine Learning
Until recently, MIR research has taken the approach of designing algorithms to extract some explicit musical feature, using simple data models and distance measures for verification against ground truths (e.g. [10, 11]'s use of K-Nearest-Neighbor models with a Euclidean distance on OP features). However, for more complex musical characteristics, some in the field are turning their focus away from feature design to more sophisticated classification models and machine learning algorithms such as support vector machines [31-33], multi-layer perceptrons [34-36] and, more recently, deep network architectures [37, 38].
With the standardization of many feature designs, such as chroma and MFCCs among many others, these more advanced machine learning methods have been used to squeeze performance from these features or to extract more complex characteristics from sets of features. In this line of thought, rather than relying on some specific feature extraction method, the task is couched in terms of a data classification problem, which allows learning algorithms to be leveraged to extract the relevant information based on a desired outcome.
2.3.1 Deep Networks for Feature Learning and Classification
[39] advocates giving learning algorithms, in particular deep network architectures, a more fundamental role in system development: with a sufficiently sophisticated learning algorithm, an optimally designed feature can be automatically extracted from a minimally processed input signal. This has the potential to solve several problems that have plagued MIR research for over a decade. Besides obviating the need to spend time rigorously testing algorithms in search of optimal designs and parameters, it has the potential, more importantly, to capture musical characteristics that would otherwise be too complex or abstruse to formulate within a feature extraction signal chain.
Hand-crafted algorithms are necessarily limited by our own perceptual and technical abilities, and an approach that relies on these alone to explore the function space of signal-to-feature mappings limits the range of possible solutions. As initially demonstrated for music information retrieval in [40], deep network architectures can be used to this end for their ability to model high-order functions through system depth. By cascading multiple layers of affine transformations and simple nonlinear functions, they allow for a system complexity sufficient to model abstract musical characteristics.
As [39] argues, using deep architectures to learn features for MIR follows naturally from the observation that many successful designed features in the literature can themselves be described in terms of deep architectures, combining multiple steps of affine transformations, nonlinearities and pooling. Taking the now standard calculation of MFCCs as an example, its steps include a Mel-scale filterbank transform and discrete cosine projection (affine transformations), and log-scaling and complex modulus (nonlinearities). From this perspective, the primary difference between feature designs is the choice of parameters. Further, given that these parameters can be optimized for a given task with deep networks, not only is it possible to learn better designs for features such as MFCCs, but this points to the prospect of learning better features altogether, unconstrained by the specifics of implementation. Here, the distinction between the two steps of the paradigm described above becomes obscured: step one is reduced to preparing the data for input to step two, a deep network in which each layer is a progressively more refined feature representation and the final output layer performs classification.
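To make this analogy concrete, the sketch below writes the MFCC pipeline as a cascade of "layers"; it is a simplified illustration under stated assumptions (the filterbank matrix is taken as given, and all names are illustrative), not the implementation used in any of the cited works.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_as_layers(power_spectra, mel_fb, n_coeffs=13):
    """MFCCs viewed as a fixed-parameter deep architecture.

    power_spectra: (frames, fft_bins) array of STFT power frames.
    mel_fb: (fft_bins, mel_bands) Mel filterbank matrix (assumed given).
    """
    mel_energies = power_spectra @ mel_fb        # affine transformation
    log_mel = np.log(mel_energies + 1e-10)       # nonlinearity (log-scaling)
    coeffs = dct(log_mel, type=2, norm='ortho')  # affine transformation (DCT)
    return coeffs[:, :n_coeffs]                  # "pooling" to low-order terms
```

Learning the feature, in this view, amounts to replacing the fixed matrices and nonlinearities above with trainable parameters.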
Deep architectures have found strong use in feature learning problems for machine vision [41-44], but there has been relatively little research into this approach within the MIR community. Although SVMs and other sophisticated learning algorithms, as mentioned above, have been used to improve classification rates for designed features, efforts to learn the features themselves have been few. The initial successful uses of deep networks for music retrieval tasks in [40] and [45] show that learned features outperform MFCCs for genre classification, and that sophisticated temporal pooling methods can be learned to incorporate multi-scale information for better results. Further use of deep networks in [38] shows that Convolutional Neural Networks, a specialized deep network architecture, can be successfully used for the task of chord recognition, extracting chord representations directly from several seconds of tiled pitch spectra. The positive results these approaches achieve are encouraging and justify further research into deep networks for feature design tasks such as rhythm similarity.
It is important to note that in these supervised learning schemes, the data used in training and classification plays a more fundamental role in feature design. With hand-crafted features, designs are based on some idealized concept of a given musical feature (e.g. tempo, timbre, pitch), and classification tasks serve merely as validation of the design. However, if the feature itself is learned in the process of supervised training of a classification model, it is necessarily shaped by the relationship between class labels and signal attributes in the training dataset. This is a positive characteristic of the approach since, as mentioned, it unhinges the perceived musical characteristic from a pre-determined algorithm, but it requires care and scrutiny when creating or using pre-existing datasets, as is common practice. Although research on unsupervised deep learning shows promise in reducing the reliance on large labeled datasets [46], this work only considers fully supervised methods.
Chapter 3
Approach
Based on the observations discussed in the previous chapter, this chapter presents a novel variation of the Onset Pattern approach. By treating the pooling and normalization stages of feature extraction as layers of a deep learning network, these stages can be optimized for the task of genre classification. In this way, the post-processing and pooling steps that are infeasible to optimize manually are learned as an extension of the Onset Pattern feature in this deep architecture context. Once trained, this transformation is applied independently to all track-wise onset patterns and the outputs are averaged over time, yielding a summary representation for an entire track.
3.1 Onset Patterns Implementation
OP calculation here generally follows the processes outlined in [10] and [11], but for this application the calculation is simplified by removing several post-processing steps. Operating on mono recordings sampled at 22050 Hz, log2-frequency DFTs are taken over 1024-sample windows with a hop size of 256 samples. Frequencies span six octaves beginning at 150 Hz. The frequency resolution of this transform is kept variable in order to test optimal resolution levels in later experiments. Multi-band novelty functions are generated by computing spectral flux, removing the mean and half-wave rectifying the result. From here, eight-second windows of these novelty functions are analyzed at 0.5-second intervals to extract a periodicity spectrum by applying another log2-DFT spanning five octaves beginning at 0.5 Hz. This corresponds to a beats-per-minute (BPM) range of 30 to 960 BPM, referred to here as the periodicity range. As with the log2-DFT used in the frequency multi-band calculation, the periodicity resolution is left as a variable. This gives a frame-matrix with dimensions (F, P), where F is the number of frequency bins and P is the number of periodicity bins.
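As a concrete reference, below is a minimal NumPy sketch of this OP computation under the stated parameters. It is an approximation: the log2-frequency transforms are emulated by pooling linear-frequency STFT bins into log-spaced bands and by sampling the novelty DFT at log-spaced periodicities, and band-edge and windowing details are simplified.

```python
import numpy as np

SR = 22050
N_FFT, HOP = 1024, 256              # ~46 ms analysis windows, ~11.6 ms hop
FRAME_RATE = SR / HOP               # novelty-function sample rate (~86 Hz)

def onset_patterns(x, n_freq_bands=30, n_period_bins=15):
    """Sketch of frame-wise Onset Patterns, each an (F, P) matrix."""
    # 1) Magnitude STFT.
    n_hops = (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * np.hanning(N_FFT)
                       for i in range(n_hops)])
    mag = np.abs(np.fft.rfft(frames, axis=1))

    # Pool FFT bins into log2-spaced bands: six octaves from 150 Hz.
    fft_freqs = np.fft.rfftfreq(N_FFT, 1.0 / SR)
    edges = 150.0 * 2.0 ** np.linspace(0, 6, n_freq_bands + 1)
    bands = np.stack([mag[:, (fft_freqs >= lo) & (fft_freqs < hi)].sum(axis=1)
                      for lo, hi in zip(edges[:-1], edges[1:])], axis=1)

    # Multi-band novelty: spectral flux, mean removal, half-wave rectification.
    flux = np.diff(bands, axis=0)
    novelty = np.maximum(flux - flux.mean(axis=0), 0.0)

    # 2) Periodicity spectrum over 8 s windows hopped every 0.5 s, sampled at
    #    log2-spaced periodicities: five octaves from 0.5 Hz (30-960 BPM).
    win, step = int(8 * FRAME_RATE), int(0.5 * FRAME_RATE)
    periods = 0.5 * 2.0 ** np.linspace(0, 5, n_period_bins)
    dft_freqs = np.fft.rfftfreq(win, 1.0 / FRAME_RATE)
    idx = [int(np.argmin(np.abs(dft_freqs - p))) for p in periods]
    ops = []
    for start in range(0, novelty.shape[0] - win, step):
        spec = np.abs(np.fft.rfft(novelty[start:start + win], axis=0))
        ops.append(spec[idx].T)     # one (F, P) frame-matrix per window
    return np.stack(ops)
```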
3.2 Deep Network Implementation
For feature learning and classification, this research makes heavy use of Eric Humphrey’s
in-development deep learning network Python libraries, informally presented in [47].
Formally, deep networks transform an input $Z_1$ into an output $Z_L$ through composition of nonlinear functions $f_l(\cdot|\theta_l)$, where $l \in [1, L]$ and $L$ indicates the total layer depth. For each layer, $Z_{l-1}$ is the input to function $f_l$ with parameters $\theta_l$. The network is composed of affine transformations, or fully-connected layers, where the outputs from one layer are distributed fully over the inputs to the next. Precisely:

    $F(Z_1|\Theta) = f_L(\ldots f_2(f_1(Z_1|\theta_1)|\theta_2)\ldots|\theta_L)$    (3.1)

where $F = [f_1, f_2, \ldots, f_L]$ is the set of layer functions, $\Theta = [\theta_1, \theta_2, \ldots, \theta_L]$ is the corresponding set of layer parameters, the output of one layer is passed as the input to the next as $f_l(Z_l) = Z_{l+1}$, and the overall depth of the network is given by $L$.
Each layer $f_l$ is a fully-connected, or affine, transformation, defined by the following:

    $f_l(Z_l|\theta_l) = h(W_l \cdot Z_l + b_l), \quad \theta_l = [W_l, b_l]$    (3.2)

Here, the input $Z_l$ is flattened to a column vector of length $N_l$ and the dot product is computed with a weight matrix $W_l$ of shape $(M_l, N_l)$, followed by an additive bias vector $b_l$ of length $M_l$. Note that an affine layer transforms an $N_l$-dimensional input to an $M_l$-dimensional output, $M_l$ being referred to as the width of the layer. The final operation is a point-wise nonlinearity, $h(\cdot)$, defined here as $\tanh(\cdot)$, which is bounded on $(-1, 1)$.
When used as a classification system, the first $L-1$ layers of a deep network can be viewed as feature extractors, and the last layer, $f_L$, is simply a linear classifier. The output can be forced to behave as a probability mass function for membership in a given class by making the length of $Z_L$ match the number of classes and by constraining the $L_1$-norm of the output to equal 1.
This probability mass function $P(\cdot)$ for an input $Z_1$ is achieved by applying the softmax operation $\sigma(\cdot)$ to the output of the network, $Z_L$, defined as follows:

    $P(Z_1|\Theta) = \sigma(Z_L) = \dfrac{\exp(Z_L)}{\sum_{m=1}^{M_L} \exp(Z_L[m])}$    (3.3)
In this supervised learning implementation, the output $Z_L$ of this final layer is used to make a prediction, where the most likely class is determined by $\operatorname{argmax}(P(Z_1|\Theta))$; with a provided target value $y$, this can be combined into a loss function.

With the network defined as a probability mass function for class membership, it can be trained by iteratively minimizing this loss function using the negative log-likelihood of the correct class over a set of $K$ observations:

    $\mathcal{L} = -\sum_{k=0}^{K} \log(P(X^k = Y^k|\Theta))$    (3.4)

where $X^k$ and $Y^k$ are the input data and corresponding class label, respectively, of the $k$-th observation.
This loss function can then be minimized through gradient descent, which iteratively searches for the minimum value of the loss function. Here, gradients are computed with $K > 1$, but much smaller than the total number of observations, by sampling data points from the training set and averaging the loss over the batch. Specifically, the update rule for $\Theta$ is defined as its difference with the gradient of the scalar loss $\mathcal{L}$ with respect to the parameters $\Theta$, weighted by the learning rate $\eta$, given by the following:

    $\Theta \leftarrow \Theta - \eta \cdot \dfrac{\partial \mathcal{L}}{\partial \Theta}$    (3.5)

$K = 100$ is used, where the observations are drawn uniformly from each class, i.e. 10 observations of each genre, with a constant learning rate of $\eta = 0.1$. Learning proceeds for 3k iterations without early stopping or model selection.
Note that all input data is preprocessed to have zero mean and unit variance before input to the network, using the mean and standard deviation calculated over all data points; this was shown to significantly improve system performance.
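For illustration, a minimal NumPy sketch of the two-layer case of Equations 3.1-3.5 is given below, with hand-derived gradients; the actual system uses the Theano-based libraries of [47], so the names and initialization scheme here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    # theta_l = [W_l, b_l]: small random weights, zero biases (assumed scheme).
    return rng.normal(0.0, 0.01, (n_out, n_in)), np.zeros(n_out)

def forward(X, params):
    # Eqs. 3.1-3.3: affine transform + tanh, then linear layer + softmax.
    (W1, b1), (W2, b2) = params
    Z2 = np.tanh(X @ W1.T + b1)                  # hidden layer, width M1
    A = Z2 @ W2.T + b2                           # linear classifier, width M2
    e = np.exp(A - A.max(axis=1, keepdims=True))
    return Z2, e / e.sum(axis=1, keepdims=True)  # hidden output, P(Z1|Theta)

def sgd_step(X, y, params, eta=0.1):
    # Eqs. 3.4-3.5: minibatch negative log-likelihood and gradient update.
    (W1, b1), (W2, b2) = params
    Z2, P = forward(X, params)
    K = len(y)
    dA = P.copy()
    dA[np.arange(K), y] -= 1.0                   # gradient at the softmax input
    dZ2 = (dA @ W2) * (1.0 - Z2 ** 2)            # backpropagate through tanh
    W2 -= eta * (dA.T @ Z2) / K
    b2 -= eta * dA.mean(axis=0)
    W1 -= eta * (dZ2.T @ X) / K
    b1 -= eta * dZ2.mean(axis=0)
    return -np.log(P[np.arange(K), y] + 1e-12).mean()
```

A run on this task would standardize the inputs, set `params = [init_layer(3600, 2048), init_layer(2048, 10)]`, and call `sgd_step` on class-balanced minibatches of K = 100 for 3k iterations.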
3.2.1 Applying the Deep Network
Unlike previous classification schemes for rhythm similarity methods, track-level aggregation is held off until after frame-wise classification. Here, the deep network is applied independently to a time series of onset patterns, producing a posteriorgram. Though alternative statistics could be explored, such as the median or maximum, mean aggregation is taken for each class prediction over the entire track.
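A sketch of this track-level aggregation, reusing the hypothetical `forward` function from the previous sketch:

```python
import numpy as np

def classify_track(track_ops, params):
    """Predict a track's class from its time series of onset patterns."""
    X = track_ops.reshape(len(track_ops), -1)  # flatten each (F, P) frame
    _, posteriorgram = forward(X, params)      # (frames, classes)
    # Mean-aggregate the frame-wise class predictions over the whole track.
    return int(np.argmax(posteriorgram.mean(axis=0)))
```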
3.3 Analytic Approach
Chapter 2 highlights two connected issues that have prompted the analysis and discussion approach taken in this research. The first concerns the practice of genre identification as a proxy task for rhythm similarity. As mentioned, genre classification is the de facto proxy task for verifying rhythm similarity measures, and the literature shows a dearth of: 1) in-depth analysis of the suitability of genre for the given feature; 2) informed explications of the assumptions made in system design; and 3) proper examination of classification results that fully takes into account the contents of the dataset used. The facile assumptions made in system verification, and the commonly accepted face-value interpretations of classification results, belie either a general lack of commitment to musical relevance among researchers or a naiveté about it. As explored in [48], and stated confidently enough to serve as its title, "Classification Accuracy is Not Enough".
The second issue concerns the effect of the dataset on learned features for rhythm similarity. As discussed at the end of Section 2.3.1, in a deep network features are learned from the provided labeled training examples. Hence, the feature's characteristics are molded by the class representations in the dataset. Though desirable when working with an ideal dataset for the task, in the case of this research, which uses genre membership as a proxy for rhythm similarity, there may be unintended (i.e. not rhythmically relevant) influences on the feature representation.
In an effort to better understand the musical significance of this rhythm similarity research beyond classification score, and in an attempt to account for these various factors, a multi-disciplinary approach is taken here to examine the results, the dataset and the system design. In addition to standard machine learning, MIR and statistical analysis methods, results are examined through rhythmic, musico-cultural and historical analyses, employing personal domain knowledge and borrowing heavily from the related musicological literature.
Chapter 4
System Configuration and Results
4.1 Dataset

In keeping with standard methods, a genre classification task is used to evaluate this measure of rhythm similarity, utilizing the well-known Latin Music Dataset (LMD). The LMD is a collection of Latin dance music comprising 3216 tracks (the original LMD has 3,227 recordings; duplicates and tracks too short in duration for analysis have been removed), split into 10 distinct genres: Axé, Bachata, Bolero, Forró, Gaúcha, Merengue, Pagode, Salsa, Sertaneja and Tango. The LMD is used here for several reasons: for this dance-based dataset, genre is assumed to serve as a good proxy for rhythm; the size of the LMD compares favorably to other, smaller dance-based datasets such as the Ballroom set, a requisite for supervised deep-learning tasks; and, perhaps most importantly, this research stems from a deeper interest in Latin music in general. Based on the idea that domain knowledge is important to the development and analysis of computational music similarity measures, personal knowledge of and interest in the subject is leveraged for the analyses in Chapters 5-7.

Though the LMD provides full recordings, many of the tracks are from live shows and contain non-musical segments (e.g. clapping, spoken introductions). To reduce this noise, only the middle 30 seconds of each track are used for analysis.

4.2 Methodology

The following experiments seek to identify the optimal system configuration for genre classification on the LMD. These experiments are broken into two parts: the first concerns the resolution of the OP, and the second concerns the complexity of the feature-learning stages of the network.

Source  | SS      | df | MS      | F    | Prob>F
Columns | 186.101 | 6  | 31.0168 | 9.21 | 3.11E-07
Error   | 212.231 | 63 | 3.3688  |      |
Total   | 398.332 | 69 |         |      |

Table 4.1: ANOVA results for classification scores with varying P values show that periodicity resolution is a significant factor.
For the OP, the best general feature space is desired: one that is maximally informative while avoiding over-representation, which can slow down and even hinder classification. Various OP resolutions are examined by testing values for frequency bins (F) and periodicity bins (P) as independent factors. Subsequent network tests seek to design a network that appropriately fits the complexity of the task. System complexity is determined by layer depth (L) and layer output size (M); several combinations of values for these parameters are examined.
For baseline classification, the system defined in Section 3.2 with a single-layer network is used, which is simply multi-class linear regression. This is the classifier used for all OP parameter tests. Scores for all classification tests are averaged over 10 cross-validated folds, stratified by genre.
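A sketch of this evaluation protocol using scikit-learn's stratified splitter; `train_and_score` is a hypothetical stand-in for training the network of Chapter 3 and returning test accuracy.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validated_accuracy(X, y, train_and_score, n_folds=10):
    """Mean accuracy over folds stratified by genre label."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    scores = [train_and_score(X[tr], y[tr], X[te], y[te])
              for tr, te in skf.split(X, y)]
    return float(np.mean(scores))
```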
4.3 OP Parameter Tuning
Initial tests begin on an OP with F = 30 and P = 25 based on results in [11], taking
the minimal dimensions that were shown to perform well.
4.3.1 Periodicity Resolution
Over the seven tested OP configurations, with P in the range [5, 100], P = 15 provides the best results. An analysis of variance (ANOVA) test on classification scores shows that periodicity resolution plays a significant role in the outcome, indicated by a Prob>F value less than 0.05, as can be seen in Table 4.1.

After applying a Tukey HSD adjustment, a comparison of means (Figure 4.1) presents a clear trend, with significantly lower scores for OPs with either too few or too many periodicity bins and the maximum classification rate obtained with P = 15. These tests, showing better results with fewer dimensions, differ from results in [11], but this disparity most likely arises from differences in data and classification strategy.
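An ANOVA of this kind can be reproduced with SciPy by treating the ten fold accuracies of each P setting as a group (seven groups of ten scores, matching the 69 total degrees of freedom in Table 4.1); the scores below are placeholders, not the experimental values.

```python
import numpy as np
from scipy.stats import f_oneway

# Placeholder fold accuracies: one group of ten scores per tested P value.
rng = np.random.default_rng(0)
scores_by_P = {P: rng.normal(86, 2, size=10) for P in (5, 10, 15, 25, 50, 75, 100)}

F_stat, p_value = f_oneway(*scores_by_P.values())
# A Prob>F (p_value) below 0.05 marks periodicity resolution as significant.
print(F_stat, p_value)
```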
Figure 4.1: Effect of P on classification (mean accuracies, %, for OP dimensions 30x5 to 30x100). The highest result is highlighted in blue, while significantly different results are in red.
4.3.2 Frequency Resolution
Setting P = 15 based on the above, F values are then tested in the range [18, 300]. An ANOVA test on these results shows a significant effect for this parameter, with Prob>F = 0.001, and, as can be seen in Figure 4.2, accuracy rates go up with higher frequency resolution, leveling out for F ≥ 240. Results in [11] show minor but statistically insignificant improvements from increasing the OP frequency resolution; this is consistent with results here for F ≤ 120 and does not preclude the higher scores seen for F > 120. Based on these tests, OPs are calculated going forward with F = 240 and P = 15.
Figure 4.2: Effect of F on classification (mean accuracies, %, for OP dimensions 18x15 to 300x15). The highest result is highlighted in blue, while significantly different results are in red.
4.4 Deep Network Parameterization
With optimal parameters for this feature set in place, the next step is finding the best network architecture for this data. Returning to the notation of Section 3.2, choices of layer width $M_l$, $l \leq L-1$, and network depth $L$ are explored here. Note that the input and output dimensionalities are fixed at $N_1 = 240 \times 15$ and $M_L = 10$, per the previous discussion and the number of classes in the dataset, respectively.
4.4.1 Layer Width
This parameter search begins with a two-layer network ($L = 2$), sweeping the width of the first layer, $M_1$, over increasing powers of 2 in the range [16, 8192]. Results demonstrate a performance pivot around $M_1 = 128$, achieving maximum accuracy at $M_1 = 2048$ but otherwise insignificant variation for $M_1 \geq 128$. An ANOVA on these results shows significance for this factor (Prob>F = 0.015), but Figure 4.3 indicates minimal impact for $M_1 \geq 128$.
Figure 4.3: Mean comparison of ANOVA tests on network layer complexity in a 2-layer architecture (hidden layer sizes 16 to 8192; mean accuracies, %) shows significantly lower results for small M.
4.4.2 Network Depth
Based on the above, deeper architectures are considered by setting $M_l = 2048$, $l \leq L-1$, and incrementally adding layers up to a maximum depth of $L = 6$. This fails to show any significant change in accuracy, with an ANOVA test revealing a Prob>F of 0.3684, greater than the null-hypothesis threshold of 0.05. Importantly, while only a limited number of interactions between depth and width are explored, independently varying $L$ or $M_l$ shows no significant difference provided $M_l \geq 128$, consistent with previous findings.
4.5 Optimal System Configuration
Further tests continue with a two-layer architecture ($L = 2$, $M_1 = 2048$) based on the parameters used for the best score in Figure 4.3, expressed completely by the following:

    $P(X_1|\Theta) = \sigma(f_2(f_1(X_1|\theta_1)|\theta_2))$    (4.1)

For clarity, the dimensionality of the first layer, $f_1$, is given by $(M_1 = 2048, N_1 = 3600)$, and that of the second by $(M_2 = 10, N_2 = 2048)$.
Feature                        | Accuracy (%)
LPQ (Texture Descriptors) [49] | 80.78
OP (Holzapfel) [11]            | 81.80
Mel Scale Zoning [50]          | 82.33
OP (Proposed)                  | 91.32

Table 4.2: Classification accuracies for different features on the LMD.
Genre     | Total Per Genre | Correctly Predicted | % Correct
Merengue  | 314             | 309                 | 98.41
Tango     | 407             | 400                 | 98.28
Bachata   | 312             | 304                 | 97.44
Pagode    | 306             | 288                 | 94.12
Salsa     | 309             | 286                 | 92.56
Axé       | 313             | 284                 | 90.73
Bolero    | 314             | 278                 | 88.54
Gaúcha    | 309             | 264                 | 85.44
Forró     | 312             | 260                 | 83.33
Sertaneja | 320             | 264                 | 82.50
Total     | 3216            | 2937                | 91.32

Table 4.3: Classification accuracies by genre, ordered from highest classification score to lowest, show Brazilian genres generally performing worse than the rest.
With this configuration, classification on the LMD yields a peak average score of 91.32%, surpassing previous attempts at genre classification on this dataset. Table 4.2 shows the proposed approach outperforming the others by a margin of more than 8%.

One trend immediately apparent in the results is a difficulty in classifying Brazilian genres. Table 4.3, with genre-wise classification accuracies ordered from highest to lowest, shows Axé, Gaúcha, Forró and Sertaneja, all Brazilian genres, occupying four of the five bottom slots. Also, looking at class-by-class confusions in Table 4.4, certain affinities between genres are apparent. The lowest-scoring genre, Sertaneja, has the majority of its false tags predicted as Bolero, but also many predicted as Gaúcha and Forró, while the next three lowest-performing classes, Gaúcha, Forró and Bolero, have most of their false tags predicted as Sertaneja. These trends in class confusions will be expanded on in subsequent chapters.
The increase in accuracy over previous attempts may be partially explained by differences in methodology (i.e. aggregation strategies, signal noise reduction, etc.), but the strength of this deep-network strategy for classification plays a significant role here.
                        Predicted Class
Actual Class | Ax  Ba  Bo  Fo  Ga  Me  Pa  Sa  Se  Ta
Ax           | 284 0   1   2   8   2   5   1   9   1
Ba           | 0   304 5   0   0   3   0   0   0   0
Bo           | 3   5   278 1   5   0   4   3   10  5
Fo           | 8   0   7   260 11  1   7   3   14  1
Ga           | 14  0   6   4   264 0   4   1   14  2
Me           | 3   0   0   0   1   309 0   0   1   0
Pa           | 6   0   6   1   1   1   288 0   2   1
Sa           | 1   2   7   0   4   3   2   286 2   2
Se           | 11  0   18  12  6   1   4   3   264 1
Ta           | 0   0   6   0   1   0   0   0   0   400

Table 4.4: Confusion matrix shows classification affinities between Sertaneja and several other genres.
Classifier | Validation Fold % | Training Fold %
LDA        | 89.06             | 98.06
SGD        | 89.09             | 94.95
Baseline   | 88.52             | 99.69
Deep-net   | 91.32             | 100

Table 4.5: Comparison of different classifiers on OP data. The proposed system outperforms all others by a margin of 2.23%.
Its effect can be seen in Table 4.5: comparing the proposed approach to simpler classification methods on the same OP input, the former outperforms the rest by a margin of 2%. With the added depth, the system is able to absorb the subsampling and normalization processes shown to be important in previous work on the OP [10, 11], boosting discriminative power in the intermediate layer before linear classification.
This boost can be visualized by calculating within-class and between-class scatter matrices for each output stage in the network (OP, intermediate layer, softmax output) and looking at the mean values of their diagonals, which serve as a scaled measure of variance. The aim of these measures is to visualize relative changes in the feature space at each level of the network, indicating how close the members of a given class are to each other (within-class scatter) and how far apart the centers of the classes are from each other (between-class scatter).

Figure 4.4 shows these values, plotted for each class from highest classification score to lowest, left to right. The ideal case sees a horizontal and relatively low-valued line for within-class measures, indicating class members close to each other in the layer's feature space, and a horizontal and relatively high-valued line for between-class measures, indicating classes well separated by distance.
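A sketch of these per-class scatter measures, assuming `Z` holds one layer's output per track and `y` the genre labels (both names illustrative):

```python
import numpy as np

def scatter_diag_means(Z, y):
    """Mean diagonal of within- and between-class scatter, per class."""
    grand_mean = Z.mean(axis=0)
    within, between = {}, {}
    for c in np.unique(y):
        Zc = Z[y == c]
        mu_c = Zc.mean(axis=0)
        # Within-class: scaled spread of class members around their mean.
        within[c] = float(((Zc - mu_c) ** 2).mean())
        # Between-class: squared distance of the class mean from the grand mean.
        between[c] = float(((mu_c - grand_mean) ** 2).mean())
    return within, between
```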
The increase in discriminative power that network depth brings can be seen by following the progression of the shape and relative values of these plots from input OP to intermediate layer to softmax output. In Figure 4.4 (top), the input line (red) is jagged and sits above the other lines in the plot, showing high scatter and unevenness across classes. The intermediate-layer line (green) shows a general reduction in scatter with increased horizontal smoothing, and the output line (blue) shows significantly lower scatter values with a smooth, shallow positive slope, reflecting the ordering of the classes from best-performing to worst.
Figure 4.4 (bottom) shows a similar trend in reverse, with increases from input to output showing greater separation between classes at each stage, ending with a smooth, negatively sloped output line. It is clear that these lines progress towards the ideal linearly separable case, with each class becoming more distinct at each stage. The increase in separability between the input OP and the intermediate layer explains the increased classification accuracy seen in Table 4.5.
As an aside, it is notable in these plots that Tango starts with the highest within-class scatter but also the highest between-class distance, while having the highest classification accuracy. So, despite the fact that representations of tracks in this genre are far from each other in the OP feature space, it seems that the network learns to identify Tango based on its distance from the rest of the classes. This facet of Tango as it is represented in this dataset is further investigated in Section 6.2.
Figure 4.4: Top: Progression from input to output shows increasingly compact genre representations. Bottom: Progression from input to output shows increasingly distant classes.
Chapter 5
OP Rhythm Similarity Analysis
Although classification scores on the LMD obtained with this approach are quite high compared with previous results, the genre-wise unevenness in classification warrants a closer look at what the contributing factors to these errors may be. More specifically, a deeper analysis aims to measure the effect of the characteristics of the OP as a measure of rhythm similarity, examine the contents and accuracy of the dataset, and address the foundational assumption of this verification approach: that genre can be used as a proxy for rhythm. In this and the following two chapters, each of these factors and their interactions are examined to elucidate both the success of this approach and its deficits as a system for tracking rhythm similarity.
5.1 Tempo Dependence
The OP preserves relative rhythmic structure across tempos, but does not adjust for shifts along the log-scale that result from large tempo changes [11, Fig. 4]. Given the importance of this factor for classification with the OP, demonstrated in [11], tempo data is gathered for the LMD in order to properly determine its effect on the results. The Echo Nest API [51] is used to calculate estimated beats-per-minute (BPM) values. These estimates are then hand-corrected for octave errors, either doubling or halving the estimate based on listening to the audio.
Figure 5.1 shows Gaussian-modeled tempo distributions for the LMD. From the plot, it is clear that there is some genre tempo-dependence in this dataset (compared with other common datasets, it shows more tempo dependence than the Turkish music dataset, though the dependence is not as stark as in the Ballroom dataset [11, Fig. 1]). Based on this distribution, a superficial view of the classifier output suggests a correlation
between success and tempo separation: Merengue is well separated by tempo and has
high classification success while the genres Bolero and Sertaneja have a lot of tempo
overlap as well as a high rate of mutual confusions (see Table 4.4). But, Figure 5.1 also
shows many of the genres to have a high degree of tempo overlap in the 150-200 BPM
range, contrasting with the generally high classification rates seen for these genres.
Figure 5.1: Gaussian-modeled tempo distributions by genre in the LMD.
As a means of more precisely quantifying any effects of tempo overlap on classification results, binary logistic regression analysis is performed on classification success using tempo and tempo density as predictors. Here, tempo density is a measure of how many out-of-class tracks are within a one-dimensional tempo-neighborhood of a track belonging to a given class. Density measurements for each track are estimated by sampling a Gaussian-smoothed tempo histogram of out-of-class tracks.
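A minimal sketch of this density estimate (the bin width, BPM range and smoothing bandwidth are illustrative assumptions, not the values used in this analysis):

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def tempo_density(track_bpm, out_of_class_bpms, sigma=5.0):
        """Out-of-class track density in a 1-D tempo neighborhood of track_bpm.

        Samples a Gaussian-smoothed histogram of out-of-class tempos at the
        track's own tempo; sigma is the smoothing bandwidth in BPM bins.
        """
        edges = np.arange(30, 331)                 # 1-BPM bins, 30-330 BPM
        hist, _ = np.histogram(out_of_class_bpms, bins=edges)
        smoothed = gaussian_filter1d(hist.astype(float), sigma=sigma)
        idx = np.clip(np.searchsorted(edges, track_bpm) - 1, 0, len(smoothed) - 1)
        return smoothed[idx]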
This analysis initially suggests a measurable effect of overlap on classification success. The results in Table 5.1 show that, while BPM is not a significant factor (its significance value exceeds the 0.05 threshold), density is inversely related to classification success with a significance value of 0.000. However, several goodness-of-fit tests show the model to have poor predictive power.

A Hosmer and Lemeshow goodness-of-fit test [52], Table 5.2, shows the data to be a poor fit, with statistical significance (Sig. = 0.000) indicating that the data diverges from the model.¹ Further evidence can be seen in two more goodness-of-fit tests, the Cox & Snell [55] and Nagelkerke [56] R² tests, which serve as pseudo-R² measures for logistic regression models. Both R² values (R² = 0.004 and 0.010, respectively) fall far below the minimum threshold of 0.05 that would indicate a predictive model.
¹ Although the Hosmer and Lemeshow test is known to be misleading for larger sample sizes [53], [54] provides a rule of thumb for adjusting the group-size parameter, which acts to standardize the test for varying dataset sizes. For tests on the LMD, with 3216 tracks, group size is calculated as 2 + 8(3216/1000)² ≈ 74.
Variable       Score    df   Sig.
Density       13.557     1   0.000
BPM            2.841     1   0.092
BPM*Density    9.526     1   0.002
Overall       13.952     3   0.003

Table 5.1: Results of binary logistic regression with classification success as the dependent variable and BPM and density as inputs; density is significant while BPM is not.
χ²         df   Sig.
118.6241   72   0.0004

Table 5.2: Hosmer & Lemeshow test shows BPM and density data to be poor predictors of classification success.
From these tests, it is concluded that tempo dependence in this dataset does not play a significant role in classification results.
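A minimal sketch of this regression and its goodness-of-fit checks using statsmodels (a hypothetical reconstruction; the data frame df of per-track BPM, density and 0/1 classification success, and its column names, are assumptions):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from scipy.stats import chi2

    def tempo_regression(df):
        """Logistic regression of classification success on BPM and density,
        with Hosmer-Lemeshow and pseudo-R-squared goodness-of-fit checks."""
        X = sm.add_constant(df[["bpm", "density"]])
        model = sm.Logit(df["success"], X).fit(disp=0)
        p_hat = model.predict(X)

        # Hosmer-Lemeshow: group tracks by predicted probability, then compare
        # observed and expected success counts per group.
        n_groups = 74  # group size adjusted for dataset size, per [54]
        bins = pd.qcut(p_hat, q=n_groups, duplicates="drop")
        obs = df["success"].groupby(bins).sum()
        exp = p_hat.groupby(bins).sum()
        n = p_hat.groupby(bins).count()
        hl_stat = (((obs - exp) ** 2) / (exp * (1 - exp / n))).sum()
        hl_sig = chi2.sf(hl_stat, df=len(n) - 2)

        # Cox & Snell and Nagelkerke pseudo-R-squared from the log-likelihoods.
        n_total = len(df)
        cox_snell = 1 - np.exp(2 * (model.llnull - model.llf) / n_total)
        nagelkerke = cox_snell / (1 - np.exp(2 * model.llnull / n_total))
        return model, hl_stat, hl_sig, cox_snell, nagelkerke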
5.2 Fine Grain Rhythmic Similarity
A facet of the OP not explored in previous work is its sensitivity to small rhythmic variations in how a rhythm is played. What musicians refer to as "feel" is often an integral part of the character of a rhythm or genre. This feel, as it will be referred to here, arises from minute variations in how each hit lands on the rhythmic grid. A useful dichotomy for discussing rhythmic feel in music is the designation of either "straight" or "swung". In a piece of metered music, a straight feel will have strict 8th-note subdivisions, dividing the distance between quarter notes evenly. With a swung feel, subdivisions cleave closer to an 8th-note-triplet grid, landing on the second 8th-note-triplet division of the beat, closer in time to each succeeding quarter note. Although the term "swing" is often used loosely, referring to a more esoteric quality in the music, this formal definition is applied here.
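To make this definition concrete, a small sketch generating straight and swung 8th-note onset times at a given tempo (an illustration of the definition above, not code from this work):

    import numpy as np

    def eighth_note_onsets(bpm, n_beats=4, swung=False):
        """Onset times (in seconds) of 8th notes over n_beats quarter notes.

        Straight: offbeats fall halfway between quarter notes (ratio 1/2).
        Swung: offbeats fall on the second 8th-note-triplet division (ratio 2/3),
        closer in time to the succeeding quarter note.
        """
        beat = 60.0 / bpm                          # quarter-note duration
        ratio = 2.0 / 3.0 if swung else 0.5
        downbeats = np.arange(n_beats) * beat
        return np.sort(np.concatenate([downbeats, downbeats + ratio * beat]))

    # At 120 BPM the straight offbeat lands at 0.25 s, the swung one at ~0.33 s.
    straight = eighth_note_onsets(120)
    swung = eighth_note_onsets(120, swung=True)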
A first step in examining the effect of feel on classification is to quantify how it is represented in the LMD. Using this swung/straight dichotomy, estimates of the percentage of tracks that are swung within each genre are gathered through listening tests. Given the LMD's size, identifying the feel of every track is prohibitively time-consuming, so random samples of 60 tracks per genre are taken, analyzed and used to extrapolate total percentages. Table 5.3, which presents the results of these measurements, shows the LMD to be noticeably divided between straight and swung tracks, with divisions for the most part following genre boundaries. As shown in the top half of the table, the Brazilian genres, with the exception
of Sertaneja, are mostly swung, while the genres shown at the bottom of the table are
almost entirely straight.
Brazilian music is marked by rhythms that pull between a straight 8th-note feel and a swung feel, landing between 8th-note and 8th-note-triplet subdivisions, confounding the straight/swung dichotomy and giving it a unique rhythmic identity.
The degree to which these tracks are swung varies. For instance, Forró presents many
fully swung rhythms while Pagode presents many rhythms that are only lightly swung,
at times closer to being straight. The other genres fit the dichotomy well, with most
tracks presenting a precise, straight feel.
However, before addressing this considerable connection between feel and genre in the data, it must be established whether feel information survives in the OP at all, given the low periodicity resolution used in its calculation. These differences would manifest in the OP not as shifts along the periodicity direction, as with tempo changes, but as minor changes in the relationships between periodicity peaks.
As a means of testing this, a binary classification experiment is performed using the OP on a toy dataset designed to isolate feel as a characteristic. The dataset consists of ten rhythm templates, composed using sequenced percussion samples, with a swung and a straight version of each template, rendered at tempos spanning 40 to 325 BPM in 5 BPM increments. This makes for a total of 1160 tagged examples. Note that this is an unrealistic tempo range for a single rhythm; although the LMD as a whole spans this range, each genre typically spans a range of 100-150 BPM. These rhythms were chosen based on common rhythms from each genre in the LMD so as to approximate its rhythmic complexity.
On this data, the system described in Section 4.5, with its output layer width altered for two classes (straight/swung), achieved an accuracy of 78.8%, averaged over ten folds. Considering the OP's partial tempo sensitivity and the extreme range of tempos in the toy dataset, this result shows relative success at differentiating these labels, indicating that the OP is indeed sensitive to rhythmic feel.
To see the effect of this sensitivity to feel on the classification results of Section 4.5, the confusions in Table 4.4 are examined from the perspective of rhythmic feel. Misclassified tracks are each labeled as straight or swung to allow comparison of each track's actual feel with the feel makeup of its predicted genre. Table 5.4 shows these comparisons for the highest instances of confusion between the genres Axé, Bolero, Forró, Gaúcha and Sertaneja. For ease of comparison, the results of Table 5.3 are summarized for these genres.
Genre       Percent Swung
Axé         84
Forró       98
Gaúcha      76
Pagode      82
Sertaneja   25
Bachata     00
Bolero      02
Merengue    00
Salsa       00
Tango       02

Table 5.3: Feel breakdown by genre showing the percentage of tracks in each genre that are swung.
Genre       % Genre   Confusion    Num          % Confusions
            Swung     Genre        Confusions   Swung
Bolero      02        Sertaneja    10           10
Sertaneja   25        Bolero       18           00
                      Axé          11           55
                      Forró        12           83
Gaúcha      76        Bolero       06           00
                      Sertaneja    14           36
                      Axé          14           79
Axé         84        Sertaneja    09           56
                      Gaúcha       08           88
Forró       98        Bolero       07           86
                      Sertaneja    14           93
                      Gaúcha       11           91
                      Axé          08           88

Table 5.4: Comparison of actual genre feel versus predicted genre feel for LMD classification results.
For each actual genre, the table shows the genres predicted by the classifier, the number of tracks in each genre prediction, and the percentage of those tracks that are swung.
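As an illustration of how such a comparison could be assembled from track-level labels (hypothetical column names and a pandas workflow assumed; not the original tooling):

    import pandas as pd

    def feel_confusion_summary(tracks: pd.DataFrame) -> pd.DataFrame:
        """Swing rates of misclassified tracks, grouped by (actual, predicted) genre.

        tracks: DataFrame with columns 'actual', 'predicted' and a boolean 'swung'.
        """
        errors = tracks[tracks["actual"] != tracks["predicted"]]
        summary = errors.groupby(["actual", "predicted"])["swung"].agg(
            num_confusions="count",
            pct_confusions_swung=lambda s: 100.0 * s.mean(),
        )
        return summary.reset_index()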
The swing percentages of these predicted tracks, when compared with the actual swing percentages of the predicted genres, reveal patterns indicative of rhythmic feel sensitivity. This pattern is easiest to see by examining the misclassified Gaúcha tracks.
Of these tracks, the ones classified as Bolero are all straight, matching the near-zero
swing percentage of that genre. The rest of these Gaúcha tracks similarly match the
feel breakdown of their predicted genre, with a 36% swung rate for the tracks predicted
as Sertaneja, loosely matching that genre’s swing percentage of 25%, and a 79% swung
rate for the tracks predicted as Axé, close to that genre’s swing percentage of 84%.
Misclassified Sertaneja and Axé tracks show this same pattern as well. Bolero and Forró
do not have significant straight/swung splits and so do not display this effect.
These trends support the claim that this classification is indeed tracking rhythmic feel in
this dataset. And given the low classification performance of the genres with the largest
straight/swung splits, this sensitivity to feel is likely a contributing factor to the errors
seen in the results. A caveat to this conclusion is that, from a musicological perspective,
it may be difficult to disentangle rhythmic feel from larger rhythmic characteristics,
leaving the possibility that this apparent sensitivity to feel in the context of this dataset
simply coincides with the system tracking general rhythmic similarity. But the results
of the controlled swing classification experiment along with anecdotal evidence from
classification results on the LMD support the case for rhythmic feel as a factor.
Cases where the only distinguishing characteristic between genres is feel exemplify this effect. For instance, the track "Coroção" by Sertaneja artists Guilherme & Santiago has a strong backbeat, emphasizing the 2nd and 4th beats of a 4/4 rhythm, which is
characteristic of the straight-feel tracks in this class, but it has a heavy swung feel and
has been tagged by the classifier as Forró, a genre that is almost entirely characterized
by swung rhythms. Rhythmically, it is closer to the straight Sertaneja tracks, but its
heavily swung beat triggers Forró. Another example is the Roberto Carlos song "Música Suave", a Bolero track tagged as Sertaneja. In this case, the two genres have many similarities (explored in more detail in the next chapter), and this track in particular, at 80 BPM and with a backbeat rhythm, is characteristic of both genres. But, as a track among many other similar Bolero tracks, its primary distinguishing characteristic is that it is swung, pushing it closer to Sertaneja, which has a greater percentage of swung tracks overall.
Chapter 6
Dataset Observations
These characteristics of the OP (partial tempo-dependence and sensitivity to rhythmic feel) go some way toward explaining the error rate seen in Table 4.3, but certain patterns in the results and direct insights from the classifier point back to the data itself. In this chapter, the Latin Music Dataset is examined in detail, looking at its construction and contents, to shed more light on its effect on the current system design and results.
6.1 Ground Truths
As described in [4], the paper that accompanied its publication, the authors made efforts to produce trustworthy ground truths for the LMD, relying on the expert knowledge of professional dance instructors with over ten years' experience teaching many of the dances associated with these genres. Further, they indicate that they labeled the dataset track by track, noting that labeling by artist or album would lead to poor genre boundaries. As a means of enforcing consistency in genre identification, they gave specific instructions to label each track according to how one would dance to it. This approach takes the view that the way in which a trained dancer might dance to a song is indicative of genre. Data collection also relied on the libraries of these dance instructors, from which most of the tracks come [4].
This approach produced a sizable and, one can argue, rhythmically relevant dataset.
The LMD is bigger than the Ballroom Dance dataset by a factor of 5, and undergirding
the assumption of dance-as-proxy-for-genre is a fundamental connection between dance
and rhythm. But the particular viewpoint of genre boundaries represented in the LMD
at times undermines the rhythmic relevance of the data for these purposes.
When examining the output of the classifier, and in particular the resulting confusion
matrix of Table 4.4, manifestations of this viewpoint are apparent. Of the 36 Bolero
tracks incorrectly tagged, five clearly stand out as not Boleros, but rather American
or British pop/rock songs by artists Linda Ronstadt, Frank Sinatra, Rick Astley, The
Bangles and Tina Turner. [4] describes Bolero as defined by ballads originating from Spanish folk dances, strongly associated with Mexico and featuring "Spanish vocals and a subtle percussion effect, usually implemented with guitar, conga or bongos" [4].¹
These outliers are ballads but share none of the other common characteristics of Bolero.
Although they account for less than 2% of the errors, they point to a larger trend in this class: the inclusion of tracks that are better described as soft rock. Although singer
José Feliciano is known to record Boleros, four of these misclassified tracks are songs of
his that are easily identified as soft-rock, having few of the characteristics of Bolero, and
there are many more like this in the dataset.
To frame the discussion of genre identification more precisely, the authors of [58], a study of rapid human genre recognition, provide some useful observations on genre
boundaries in music. As they note, the boundaries of a particular genre are often loosely
defined, transient and rarely universally agreed upon. A record label with commercial
interests in mind may define genre for marketing purposes, while an individual’s conception of genre relies on a necessarily limited personal experience and memory of music.
[58] further notes that: “Rightly or wrongly, an individual may associate a specific
musical feature with a genre”, using cues as shortcuts to genre identification.
In the LMD, those cues may be rhythmic, but only insofar as that rhythm fits a set of
physical movements. This is a viewpoint that places use over content in assigning genre.
A percussionist may listen for particular rhythms and instrumentations when asked to
identify a track as one genre or another, but a professional dance instructor may listen
instead for tempo, form and general feel.
For these outlier Bolero tracks, the cues for inclusion in this class are tempo and ballad song form, perhaps sufficient signifiers given the guidelines of the dataset. But from a
stricter, musicological perspective, the lack of instrumentation and rhythms common to the genre, together with the heavy backbeat rhythms they all share, would signal them as something else. In fact, these misclassifications are rather indicative of the system's success
at identifying rhythm similarity. All of these songs were predicted to be Sertaneja by
the classifier, a genre dominated by Americanized ballads, rather adeptly matching their
rhythms.
¹ Though [4] mentions its Spanish origin, Bolero, as represented in this dataset, has its stylistic roots in the Cuban Bolero, in duple metre, not to be confused with the traditional Spanish Bolero, in triple metre [57].
6.2 Artist/Chronological Distribution
Beyond issues of ground truth and genre definitions, there are some clear disparities in the makeup of the LMD. Table 6.1 shows some basic info-metrics on the data: although there is a relatively even distribution of tracks over genres, the number of artists per genre is uneven. Sertaneja and Tango are on the low end, with fewer than ten artists each. Although a few artists may well represent a genre stylistically, a narrow class representation can skew the data toward some unintended idiosyncrasy.

Here, Tango is an extreme example of this kind of class-wide coloration. Although the LMD does not specifically include metadata on recording dates, the genre labels themselves reveal chronological details. Axé, Gaúcha, and Sertaneja are all fairly modern, with most recordings made in the 1980s or later [59, p. 4, 220-224]. Contrast this with Tango and Bolero, which have both been around for more than a century [60, p. A-2, A-12], and a significant disparity in recording dates can be expected in this dataset.
This is not problematic for the task of measuring rhythm similarity per se, but with the way Tango is represented in this dataset, recording quality becomes a significant artifact. Though there are seven Tango artists in the dataset, the vast majority of the tracks come from a single artist: of 407 Tango tracks, 315 were recorded by Carlos Gardel between 1917 and 1935.² These recordings, some nearly a century old, have a very particular
spectral makeup that distinguishes them from the rest of the dataset. Figure 6.1 shows
a side-by-side comparison of the OP of a modern Sertaneja track and that of a Carlos
Gardel Tango recording dating from 1917. The Tango OP has no activity below 200 Hz or above 3 kHz. Contrasted with the wide frequency range of the OP on the left, it is probable that this disparity accounts for much of the accuracy seen for Tango classification. Rather than track rhythm, the classifier can easily identify Tango based on spectral content.
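A quick check for this kind of spectral artifact could measure the fraction of OP energy inside the observed band (a sketch under the assumption that an OP is stored as a matrix with one row per log-frequency band; the band edges are taken from the observation above):

    import numpy as np

    def in_band_fraction(op, band_freqs, lo=200.0, hi=3000.0):
        """Fraction of OP energy between lo and hi Hz.

        op:         (F, P) onset-pattern matrix, rows ordered by log-frequency band.
        band_freqs: (F,) center frequency of each band in Hz.
        Values near 1.0 flag recordings, like the early Tango tracks, with
        little or no energy outside this band.
        """
        total = op.sum()
        if total <= 0:
            return 0.0
        in_band = op[(band_freqs >= lo) & (band_freqs <= hi), :].sum()
        return in_band / total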
For this reason, a broad genre representation with a variety of artists is desirable. Bolero, though it matches Tango in age, is represented in the LMD by 99 different artists. This spread ensures a certain level of diversity in recording quality and protects the class from the kind of facile identification possible with Tango.
6.3 Brazilian Skew
The LMD was gathered in Paraná, Brazil, by dance instructors trained not only in Ballroom dance styles but also in Brazilian cultural dances [4]. So it is not just that, as mentioned before, this dataset represents the viewpoint of dancers, but more specifically of Brazilian dancers.
² The dates of the tracks were found in the track filenames.
Genre       # Artists   Tracks
Axé         37          313
Bachata     64          312
Bolero      99          314
Forró       27          312
Gaúcha      92          309
Merengue    96          314
Pagode      16          306
Salsa       54          309
Sertaneja    9          320
Tango        7          407

Table 6.1: LMD infometrics.
Figure 6.1: Left: OP of a modern Sertaneja track (Edson & Hudson, "Hei Voce Ai"). Right: OP of a Carlos Gardel Tango recording from 1917 ("Mi Noche Triste"). Axes: log-frequency (Hz) versus log-periodicity (BPM).
Referring back to the genre-recognition study [58], one of its hypotheses about the determination of genre is what the authors call "The Fisheye-Lens Effect". Essentially, it posits that people are more attuned to stylistic variations in music that is more familiar to them, defining other music by broad stereotypes to varying degrees [58]. Here, the dataset collectors had a general knowledge of all the genres as part of canonized ballroom dance traditions, yet the data skews Brazilian, highlighting their location and specialized knowledge of regional genres.
This skew can be seen at several levels. The data is split evenly between Hispano-American musical genres (i.e. those originating in countries colonized by Spain) and Brazilian ones. In general, the Hispano-American genres are older, internationally known and stylistically established (Salsa, Tango, Bolero, Merengue [60, p. A-2-12]), whereas most of the Brazilian genres came to commercial prominence in the 1980s or later (Axé, Gaúcha, Sertaneja) [61, p. 168], [59, p. 4, 220-224] and are all fairly regionally specific. Figure 6.2 shows the regional specificity of the Brazilian genres versus the international dispersion of Bolero.³ Comparing the genre Gaúcha, a recent and regionally specific genre, with Bolero, an internationally established genre dating back many decades, the disparity in scope and granularity is clear.
Referring back to Table 4.3 in Section 4.5, it is notable in this context that four of the five worst-performing genres are Brazilian. Also, as the confusion matrix in Table 4.4 shows, many of the highest instances of confusion are between these Brazilian genres. Given that the data skews Brazilian, these trends may be explained by the fisheye-lens effect described above. Because of genre familiarity and regional access (Sertaneja and Gaúcha are localized in and around Paraná, where the LMD was collected), these genres are represented in greater detail and include tracks that may diverge stylistically but contain some regionally understood cues that mark them as one genre or another, whereas the Hispano-American genres, being older and translated through various cultures, are represented more concisely, based on stereotypical cues for the genres.

Despite this argument's plausibility and supporting circumstantial evidence, it is abstruse and difficult to prove. A more informative explanation lies in a musicologically founded examination of these genres, explored in the next chapter.
³ Location data was only gathered on Bolero and the Brazilian genres.
Figure 6.2: Top: Geographical spread of Bolero vs. Brazilian genres. Bottom: Detail of geographical spread of Brazilian genres.
Chapter 7
System Verification Issues
If the proposed system is indeed tracking musically relevant rhythmic information, then a possible explanation for these confusions is the presence of stylistic overlaps in the data. Further, considering the amount of confusion among Brazilian genres, the implied rhythmic similarity between these tracks points to possible regional influences among these genres. This underscores a weakness in using genre as a proxy for rhythm in gauging the success of the system. With this in mind, this chapter explores the musicological aspects of these genres with the aim of evaluating these claims and, ultimately, the general success of this rhythm similarity system.
7.1 Rhythm-Genre Connection
Knowing that the LMD has been labeled based on associated dance styles, it is reasonable
to assume that rhythm would track well with genre. It is the rhythm of a given song that
informs the dancer’s pace and phrasing, among other things. But in this dataset, the
inclusion of genres not well defined by a specific rhythm undermines this relationship.
The genre Sertaneja is a prime example of this and, with its low classification rate,
highlights a divergence between the assumptions of the current validation process and
intended goals.
As the “country pop” genre of Brazil, Sertaneja is less defined by a particular rhythm
or dance than it is by a set of cultural signifiers, musical influences, instrumentation
and lyrical content. In one of the few comprehensive musicological examinations of
the genre, “River of Tears”, the author, Alexander Dent, discusses in great detail the
cultural and musical profile of Sertaneja and the music of Brazil’s interior. Sertaneja,
as a rural music with strong ties to Brazil’s Central-Southern culture, presents specific
racial influences, distinct from the other Brazilian genres in the LMD. As Dent writes of these rural genres: "they emphasize a different racial geography from that found in mainstream Brazilian musical production," having a history rooted less in the African Diaspora of the coastal cities than in the "early contact between Portuguese Catholics and Indians in the Central-Southern region" [62, p. 95]. This stands in contrast to more
rhythmically based Brazilian genres, like Axé, that maintain rhythmic traditions from
Africa [62].
But as a pop genre that intentionally sought to modernize rural music while claiming its
mantle, it has been open to, and indeed characterized by, its willingness to borrow and
incorporate new and foreign elements. With electric and electronic instruments, a rock drum set and a duo of male singers, its heavy borrowing from American country pop
acts is immediately apparent, but also, its connection to Mexican ballads and Boleros is
both historically rooted and sonically present in the music [61, p. 120].
Sertaneja has come to refer less to the musical traditions of rural Brazil than to rural
music generally, as it can be found worldwide [61, p. 133]. Further, this willingness to
borrow extends to the other popular musical forms of Brazil, particularly other rural
music, insofar as the borrowing, or as some charge, copying, works toward a narrative
within the context of rural life [61, p. 133].
One might argue that many of the other genres in the set should also be understood through their cultural attachments, but Sertaneja stands out against the other genres as fundamentally not dance-based. Take for example Salsa music, which is described in the
Grove Encyclopedia of Music as having a “distinctive feel ... based upon a foundation
of interlocking rhythmic ostinati” and they go on to notate what they refer to as its
“rhythmic foundation”[63]. Or Forró, described as “an urbanized northeastern-style
country dance music”, but also referring to a specific rhythm played between a triangle,
zabumba and surdo drum [64, p. 90]. It is these kinds of specific and characteristic
rhythms that this system aims to track.
Sertaneja has a less specific relationship to rhythm, not defined by it, but instead often
using it as a signifier within a larger musical-cultural narrative. Though the genre is
mostly represented here by American style ballads, its general characteristics place less
restraint on what rhythms may be included. It is probable that this largely explains why Sertaneja has the lowest classification success rate in the results. It is not that Sertaneja presents challenging rhythms for the system to understand, but that its stylistic tangents are rhythmically dissimilar and should not be expected to classify well.
7.2 Inter-genre Influence
The fact of inter-genre influence within Brazil looms large over fixed ideas of genre and
notions of musical boundaries. Although the recording industry often pushes conceptions
of genre for commercial purposes, among musicians, musical boundaries are there to be
crossed. And often, from the point of view of musicians at the forefront of musical
creation, these genre terms are already obsolete [58]. Within the Brazilian genres in this
dataset, two forces undermining genre boundaries come to light: regional and global
influences.
Regional influences are most apparent within Sertaneja and Gaúcha which are both
known to take rhythms and instruments from Forró, among others [59, p.224]. There
are several instances where the classifier identifies a Gaúcha or Sertaneja track as Forró,
and indeed the track features the distinctive triangle pattern, bass drum and rhythmic
feel specific to Forró, albeit embellished. (e.g. Alma Serrana’s “Xote Dos Milagres” and
Teodoro & Sampaio’s “O Tocador”, Gaúcha and Sertaneja respectively).
The influence of global popular music has a more dispersed effect here. Despite strong roots in more traditional Brazilian music practices, Axé, Forró, Gaúcha, Pagode and Sertaneja are all commercialized and modernized genres. With the exception of Forró and Pagode, which became popular in the 1940s and 1970s respectively, these genres crystallized into their popular commercial forms between the 1980s and 1990s. This timing coincides with the rise of globalized pop, largely centered around American acts. And as Brazil's music became more globally recognized, these genres, while maintaining their Brazilian identities, increasingly incorporated globalized pop instrumentations and musical cues [61, p. 168], [65, p. 30-31].
As noted before, Sertaneja has a strong affinity with American country music, which entails strong backbeats on nearly every track. As such, this rhythmic basis is strongly associated with Sertaneja by the classifier, and many instances of a strong backbeat in other genres trigger a Sertaneja classification. The many confusions between Bolero and Sertaneja are due not only to the inclusion of Bolero within Sertaneja as a musical influence, but also to this flattening effect of global influences on both genres. This is true for tracks in other genres predicted as Sertaneja as well; many contain a strong backbeat, indicating an influence of globalized popular rhythms, aligning them more strongly with Sertaneja despite musical and cultural cues that might indicate otherwise.
7.3 Rhythm as Genre Signifier
The fact that the confusion trends in the results largely match known genre influences indicates that this system is successful at isolating rhythms. But it also underscores the shortcomings of using genre identification to validate rhythmic similarity. Some of these genres might be better separated based on timbre or some other signifier. In the case of Sertaneja, timbre, lyrics and vocal style might prove useful in reducing errors and building invariance to cross-genre influences. The authors of [10] note this in their original paper on the OP and show that a hybrid feature set that includes MFCCs, a measure of timbre, provides modest increases in classification accuracy. But these considerations diverge from the focus here on rhythm-based analysis.
What is clear from the preceding analysis is that, despite the fact that the dataset used
in this work is dance-based, genre tags are often unsuitable for validation of rhythmic
similarity. The peak classification score of 91.32% is at once hurt by genres ill-defined with respect to rhythm, as with Sertaneja, and bolstered by artifacts such as the recording quality of Tango tracks. The classifier was able to produce rhythmically meaningful
groupings as well as provide insights into the data at hand, but a dataset divided along
stricter rhythmic lines might prove more useful for evaluating success in this task.
Chapter 8
Conclusions
In addressing computational rhythm similarity measures for music, this thesis took a
survey of current systems, developed a novel state-of-the-art approach, and presented
a detailed analysis of the results from both a numerical and musicological perspective.
It examined previous approaches to rhythm similarity and identified the onset pattern (OP) as having high potential for further development, owing to its high-dimensional, semi-tempo-invariant characteristics.
To improve results with the OP, this work then presented a new system architecture
that makes use of deep architectures for feature learning and classification. The OP was
stripped of several of the post-processing and pooling stages detailed in [10] and fed into a deep network with the aim of learning the best representation for this task. Using
the widely known Latin Music Dataset [4] for validation purposes, this approach proved
worthwhile, producing average classification scores of 91.32%, well above previously
obtained results on this data. With added network depth, the system was able to
absorb these post-processing and pooling stages that are infeasible to tune manually.
Following this system architecture discussion, a detailed cross-disciplinary look at the
results was taken with the aim of fully assessing the abilities of this approach to rhythm
similarity. Based on observations of the system characteristics, the classification output,
and contents of the dataset, this research identified three factors to examine: rhythmic
characteristics of the OP, dataset idiosyncrasies and basic assumptions built into this
system’s methodology.
It was shown through statistical testing on gathered metadata that, though the OP's partial tempo sensitivity does not play a significant role in the results, its sensitivity to fine-grain rhythmic variations does, grouping tracks, sometimes against class markers, based on rhythmic feel. Detailed examinations of the dataset highlighted features
that undermine the dataset’s use for the task of rhythm verification. These included
its perspective on genre that favors use over content (i.e. associated dance versus musical content), a regional skewing of the data towards Brazilian genres and class-wide
coloration of recordings arising from unevenness in representation (e.g. the recording
quality of Tango tracks).
Perhaps most importantly, this research underscored the tenuous connection between
rhythm and genre, an assumed connection that this approach’s verification strategy
depended on. Although genre proved generally useful as a proxy for rhythm in the
context of the dance-based LMD, it did not guarantee rhythmic homogeneity within
classes. This was best exemplified in the genre Sertaneja, defined less by rhythm than
by a mix of musical influences, song form, lyrical content, instrumentation and an array
of social cues. This analysis further highlighted the fact that genre definitions often fail
to account for inter-genre influences that can manifest as rhythmic cross-pollination and
showed that, despite class labels, the system was able to identify many of these rhythmic
influences across genres in the LMD.
Beyond presenting a state-of-the-art approach for measuring rhythm similarity in music,
this work hopes to emphasize the importance of this kind of in-depth analysis. By
examining the approach from the perspectives of classification scores, statistical analysis
and genre- and rhythm-specific musicological discussions, this effort was able to shed light
not only on the characteristics and performance of this system, but also bring a greater
understanding to the task in general. If the goal is to develop musically relevant methods
of analysis, then this kind of domain-specific discussion is highly constructive in going
beyond the question of numerical efficacy and deeper into an understanding of what is
being expressed musically by these systems.
This approach has been couched in the context of a deep-network architecture for feature learning and classification, but the current implementation leans heavily on a hand-designed feature extraction stage. Given the success of learning components of the feature representation, as shown here, future work should focus on folding more of these feature-design aspects into the learning process.
Bibliography
[1] Yajie Hu and Mitsunori Ogihara. Genre classification for million song dataset
using confidence-based classifiers combination. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 1083–1084. ACM, 2012.
[2] Lie Lu, Dan Liu, and Hong-Jiang Zhang. Automatic mood detection and tracking of
music audio signals. Audio, Speech, and Language Processing, IEEE Transactions
on, 14(1):5–18, 2006.
[3] Business Matters: 75,000 Albums Released In U.S. In 2010 – Down 22% From 2009. http://www.billboard.com/biz/articles/news/1179201/business-matters-75000-albums-released-in-us-in-2010-down-22-from-2009. Accessed: 2013-11-6.
[4] Carlos Nascimento Silla Jr, Alessandro L Koerich, and Celso AA Kaestner. The
latin music database. In ISMIR, pages 451–456, 2008.
[5] Juan José Burred, Axel Röbel, and Xavier Rodet. An accurate timbre model for
musical instruments and its application to classification. In Workshop on Learning
the Semantics of Audio Signals, Athens, Greece, 2006.
[6] Jeremiah D Deng, Christian Simmermacher, and Stephen Cranefield. A study on
feature analysis for musical instrument classification. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 38(2):429–438, 2008.
[7] Elias Pampalk, Arthur Flexer, and Gerhard Widmer. Improvements of audio-based
music similarity and genre classificaton. In ISMIR, volume 5, pages 634–637. London, UK, 2005.
[8] Jean-Julien Aucouturier and Francois Pachet. Finding songs that sound the same.
In Proc. of IEEE Benelux Workshop on Model based Processing and Coding of
Audio, pages 1–8, 2002.
[9] Simon Dixon, Fabien Gouyon, and Gerhard Widmer. Towards characterisation of
music via rhythmic patterns. In ISMIR, 2004.
[10] Tim Pohle, Dominik Schnitzer, Markus Schedl, Peter Knees, and Gerhard Widmer.
On rhythm and general music similarity. In ISMIR, pages 525–530, 2009.
[11] Andre Holzapfel, Arthur Flexer, and Gerhard Widmer. Improving tempo-sensitive
and tempo-robust descriptors for rhythmic similarity. In Proceedings of the 8th
Sound and Music Computing Conference. SMC’11, 2011.
[12] Geoffroy Peeters. Spectral and temporal periodicity representations of rhythm for
the automatic classification of music audio signal. Audio, Speech, and Language
Processing, IEEE Transactions on, 19(5):1242–1252, 2011.
[13] Emilia Gómez. Tonal description of music audio signals. Unpublished doctoral
dissertation, Universitat Pompeu Fabra, Barcelona, Spain, 2006.
[14] Joan Serra, Holger Kantz, Xavier Serra, and Ralph G Andrzejak. Predictability
of music descriptor time series and its application to cover song detection. Audio,
Speech, and Language Processing, IEEE Transactions on, 20(2):514–525, 2012.
[15] J. Salamon. Melody Extraction from Polyphonic Music Signals. PhD thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2013.
[16] Godfried Toussaint. The Geometry of Musical Rhythm: What Makes a “good”
Rhythm Good? CRC Press, 2013.
[17] Emmanouil Benetos, Simon Dixon, Dimitrios Giannoulis, Holger Kirchhoff, and
Anssi Klapuri. Automatic music transcription: Breaking the glass ceiling. In ISMIR,
pages 379–384, 2012.
[18] Fabien Gouyon and Simon Dixon. A review of automatic rhythm description systems. Computer music journal, 29(1):34–54, 2005.
[19] Emiru Tsunoo, Nobutaka Ono, and Shigeki Sagayama. Rhythm map: Extraction
of unit rhythmic patterns and analysis of rhythmic structure from music acoustic
signals. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE
International Conference on, pages 185–188. IEEE, 2009.
[20] Matthew Wright, W Andrew Schloss, and George Tzanetakis. Analyzing afro-cuban
rhythms using rotation-aware clave template matching with dynamic programming.
In ISMIR, pages 647–652, 2008.
[21] Hui Li Tan, Yongwei Zhu, Susanto Rahardja, and Lekha Chaisorn. Rhythm analysis
for personal and social music applications using drum loop patterns. In Multimedia
and Expo, 2009. ICME 2009. IEEE International Conference on, pages 1672–1675.
IEEE, 2009.
[22] Andre Holzapfel and Yannis Stylianou. A scale transform based method for rhythmic similarity of music. In Acoustics, Speech and Signal Processing, 2009. ICASSP
2009. IEEE International Conference on, pages 317–320. IEEE, 2009.
[23] André Holzapfel and Yannis Stylianou. Scale transform in rhythmic similarity of
music. Audio, Speech, and Language Processing, IEEE Transactions on, 19(1):
176–185, 2011.
[24] Elias Pampalk. A matlab toolbox to compute music similarity from audio. In
ISMIR, 2004.
[25] Fabien Gouyon. A computational approach to rhythm description-audio features for
the computation of rhythm periodicity functions and their use in tempo induction
and music content processing. 2005.
[26] Jonathan Foote and Shingo Uchihashi. The beat spectrum: A new approach to
rhythm analysis. In ICME, 2001.
[27] George Tzanetakis and Perry Cook. Musical genre classification of audio signals.
Speech and Audio Processing, IEEE transactions on, 10(5):293–302, 2002.
[28] Matthias Gruhne, Christian Dittmar, and Daniel Gaertner. Improving rhythmic
similarity computation by beat histogram transformations. In ISMIR, pages 177–
182, 2009.
[29] Jesper Højvang Jensen, Mads Græsbøll Christensen, and Søren Holdt Jensen. A
tempo-insensitive representation of rhythmic patterns. In Proc. of the 17th European
Signal Processing Conf.(EUSIPCO-09). Citeseer, 2009.
[30] Geoffroy Peeters. Rhythm classification using spectral rhythm patterns. In ISMIR,
pages 644–647, 2005.
[31] Sebastian Krey and Uwe Ligges. Svm based instrument and timbre classification.
In Classification as a Tool for Research, pages 759–766. Springer, 2010.
[32] Adam R Tindale, Ajay Kapur, George Tzanetakis, and Ichiro Fujinaga. Retrieval
of percussion gestures using timbre classification techniques. In ISMIR, 2004.
[33] Giulio Agostini, Maurizio Longari, and Emanuele Pollastri. Musical instrument
timbres classification with spectral features. EURASIP Journal on Applied Signal
Processing, 2003:5–14, 2003.
[34] Róisín Loughran, Jacqueline Walker, Michael O'Neill, and Marion O'Farrell. Musical instrument identification using principal component analysis and multi-layered
perceptrons. In Audio, Language and Image Processing, 2008. ICALIP 2008. International Conference on, pages 643–648. IEEE, 2008.
[35] Aliaksandr Paradzinets, Hadi Harb, and Liming Chen. Multiexpert system for
automatic music genre classification. Teknik Rapor, Ecole Centrale de Lyon, Departement MathInfo, 2009.
[36] Sander Dieleman, Philémon Brakel, and Benjamin Schrauwen. Audio-based music
classification with a pretrained convolutional network. 2011.
[37] Philippe Hamel, Sean Wood, and Douglas Eck. Automatic identification of instrument classes in polyphonic and poly-instrument audio. In ISMIR, pages 399–404,
2009.
[38] Eric J Humphrey and Juan P Bello. Rethinking automatic chord recognition with
convolutional neural networks. In Machine Learning and Applications (ICMLA),
2012 11th International Conference on, volume 2, pages 357–362. IEEE, 2012.
[39] Eric J Humphrey, Juan P Bello, and Yann LeCun. Feature learning and deep architectures: new directions for music informatics. Journal of Intelligent Information
Systems, pages 1–21, 2013.
[40] Philippe Hamel and Douglas Eck. Learning features from music audio with deep
belief networks. In ISMIR, pages 339–344. Utrecht, The Netherlands, 2010.
[41] Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large
vocabulary image annotation. In Proceedings of the International Joint Conference
on Artificial Intelligence (IJCAI), 2011.
[42] Clément Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learning
hierarchical features for scene labeling. 2013.
[43] Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. Large-Scale Kernel Machines, 34, 2007.
[44] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E
Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to
handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
[45] Philippe Hamel, Simon Lemieux, Yoshua Bengio, and Douglas Eck. Temporal
pooling and multiscale learning for automatic annotation and ranking of music
audio. In ISMIR, pages 729–734, 2011.
[46] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional
deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine
Learning, pages 609–616. ACM, 2009.
[47] Deep Learning Tutorial code repository. http://marl.smusic.nyu.edu/wordpress/projects/feature-learning-deep-architectures/deep-learning-tutorial/#source_code. Accessed: 2013-11-6.
[48] Bob L Sturm. Classification accuracy is not enough: On the evaluation of music
genre recognition systems. Journal of Intelligent Information Systems, 2013.
[49] Yandre Costa, Luiz Oliveira, Alessandro Koerich, and Fabien Gouyon. Music genre recognition using Gabor filters and LPQ texture descriptors. In Iberoamerican Congress on Pattern Recognition, Havana, Cuba, 2013.
[50] C.N. Silla, A.L. Koerich, and C.A.A. Kaestner. Feature selection in automatic music
genre classification. In Multimedia, 2008. ISM 2008. Tenth IEEE International
Symposium on, pages 39–44, 2008. doi: 10.1109/ISM.2008.54.
[51] The echonest api overview. http://developer.echonest.com/docs/v4. Accessed:
2013-11-19.
[52] David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. Applied Logistic Regression. Wiley, 2013.
[53] Andrew A Kramer and Jack E Zimmerman. Assessing the calibration of mortality benchmarks in critical care: The Hosmer-Lemeshow test revisited. Critical Care Medicine, 35(9):2052–2056, 2007.
[54] Prabasaj Paul, Michael L Pennell, and Stanley Lemeshow. Standardizing the power
of the Hosmer–Lemeshow goodness of fit test in large data sets. Statistics in Medicine,
32(1):67–80, 2013.
[55] David R Cox and EJ Snell. On test statistics calculated from residuals. Biometrika,
58(3):589–594, 1971.
[56] Nico JD Nagelkerke. A note on a general definition of the coefficient of determination. Biometrika, 78(3):691–692, 1991.
[57] Willi Kahl and Israel J. Katz. Bolero. Grove Music Online. Oxford University Press. URL http://www.oxfordmusiconline.com/subscriber/article/grove/music/03444.
[58] Robert O Gjerdingen and David Perrott. Scanning the dial: The rapid recognition
of music genres. Journal of New Music Research, 37(2):93–100, 2008.
[59] C. McGowan and R. Pessanha. The Brazilian Sound: Samba, Bossa Nova, and the
Popular Music of Brazil. Temple University Press, 1998. ISBN 9781566395458.
[60] R.D. Moore, J. Koegel, and W.A. Clark. Musics of Latin America. W W Norton &
Company Incorporated, 2012. ISBN 9780393929652. URL http://books.google.
com/books?id=txbKygAACAAJ.
[61] A. Dent. River of Tears: Country Music, Memory, and Modernity in Brazil. e-Duke
books scholarly collection. Duke University Press, 2009. ISBN 9780822391098.
[62] C.B. Henry. Let’s Make Some Noise: Axé and the African Roots of Brazilian
Popular Music. University Press of Mississippi, 2010. ISBN 9781604733341.
[63] Lise Waxer. Salsa. Grove Music Online. Oxford University Press. URL http://www.oxfordmusiconline.com/subscriber/article/grove/music/24410.
[64] L. Crook. Brazilian Music: Northeastern Traditions and the Heartbeat of a Modern
Nation. Number v. 1 in ABC-CLIO world music series. ABC-CLIO, 2005. ISBN
9781576072875. URL http://books.google.com/books?id=kSH9HQox_K8C.
[65] Charles A Perrone and Christopher Dunn. Brazilian popular music and globalization. Journal of Popular Music Studies, 14(2):163–165, 2002.