NEW YORK UNIVERSITY

Computational Rhythm Similarity Development and Verification Through Deep Networks and Musically Motivated Analysis

by Tlacael Esparza

Submitted in partial fulfillment of the requirements for the Master of Music in Music Technology in the Department of Music and Performing Arts Professions in The Steinhardt School, New York University.

Advisor: Juan Bello

January 2014

Abstract

Master of Music, The Steinhardt School, New York University
by Tlacael Esparza

In developing computational measures of rhythmic similarity in music, validation methods typically rely on proxy classification tasks on common datasets, equating rhythm similarity with genre. In this paper, a novel state-of-the-art system for rhythm similarity is proposed that leverages deep network architectures for feature learning and classification, using this standard approach of genre classification on a well-known dataset for validation. In addressing this method of validation, an extensive cross-disciplinary analysis of the performance of this system is undertaken. In addition to analyses through MIR, machine learning and statistical methods, a detailed study of both the results and the dataset is performed from a musicological perspective, delving into the musical, historical and cultural specifics that impact the system. Through this study, insights are gained in further gauging the abilities of this measure of rhythm similarity beyond classification accuracy, as well as a deeper understanding of this system design and validation approach as a musically meaningful exercise.

Acknowledgements

I would like to thank Professor Juan Bello for his guidance, encouragement and dedication to my education, and Eric Humphrey, without whom I would have been lost in a deep network somewhere. Many people have helped me along the way with this work and I am very grateful for their time and generosity. These include: Uri Nieto, Mary Farbood, Adriano Santos and Professor Larry Crook, as well as Carlos Silla and Alessandro Koerich for the Latin Music Dataset and their insights into the data collection process. And most importantly, thanks to my family and my fiancée, Ashley Reeb, for their unwavering emotional, spiritual, intellectual and financial support.

Contents

Abstract
Acknowledgements
List of Figures
List of Tables

1 Introduction

2 Explication and Literature Review
   2.1 Computational Music Similarity Measures
   2.2 Rhythm Similarity
       2.2.1 Onset Patterns
   2.3 Machine Learning
       2.3.1 Deep Networks for Feature Learning and Classification

3 Approach
   3.1 Onset Patterns Implementation
   3.2 Deep Network Implementation
       3.2.1 Applying the Deep Network
   3.3 Analytic Approach

4 System Configuration and Results
   4.1 Dataset
   4.2 Methodology
   4.3 OP Parameter Tuning
       4.3.1 Periodicity Resolution
       4.3.2 Frequency Resolution
   4.4 Deep Network Parameterization
       4.4.1 Layer Width
       4.4.2 Network Depth
   4.5 Optimal System Configuration
5 OP Rhythm Similarity Analysis
   5.1 Tempo Dependence
   5.2 Fine Grain Rhythmic Similarity

6 Dataset Observations
   6.1 Ground Truths
   6.2 Artist/Chronological Distribution
   6.3 Brazilian Skew

7 System Verification Issues
   7.1 Rhythm-Genre Connection
   7.2 Inter-genre Influence
   7.3 Rhythm as Genre Signifier

8 Conclusions

Bibliography

List of Figures

2.1 Extraction of Onset Patterns (OP) from the audio signal.
4.1 Effect of P on classification. The highest result is highlighted in blue, while significantly different results are in red.
4.2 Effect of F on classification. The highest result is highlighted in blue, while significantly different results are in red.
4.3 Mean comparison of ANOVA tests on network layer complexity in a 2-layer architecture shows significantly lower results for small M.
4.4 Top: Progression from input to output shows increasingly compact genre representations. Bottom: Progression from input to output shows increasingly distant classes.
5.1 Gaussian modeled tempo distributions by genre in the LMD.
6.1 Left: OP of a modern Sertaneja track. Right: OP of a Tango recording from 1917.
6.2 Top: Geographical spread of Bolero vs. Brazilian genres. Bottom: Detail of geographical spread of Brazilian genres.

List of Tables

2.1 Summary of main approaches in the literature for computational rhythm similarity.
4.1 ANOVA results for classification scores with varying P values show that periodicity resolution is a significant factor.
4.2 Classification accuracies for different features on the LMD.
4.3 Classification accuracies by genre, ordered from highest classification score to lowest, show Brazilian genres generally performing worse than the rest.
4.4 Confusion matrix shows classification affinities between Sertaneja and several other genres.
4.5 Comparison of different classifiers on OP data. The proposed system outperforms all others by a margin of 2.23%.
5.1 Results of binary logistic regression with classification success as the dependent variable and BPM and density as inputs show density is significant while BPM is not.
5.2 Hosmer & Lemeshow test shows BPM and density data to be poor predictors of classification success.
5.3 Feel breakdown by genre showing the percentage of tracks in each genre that are swung.
5.4 Comparison of actual genre feel versus predicted genre feel for LMD classification results.
6.1 LMD infometrics.

Chapter 1

Introduction

A fundamental goal in the field of music information retrieval (MIR) is to extract musically meaningful information from digitized audio signals through computational methods. This, in its vagueness and breadth, describes most MIR tasks. In practice, and with the field still in its relative infancy, these tasks have often reduced to extracting musical feature representations that highlight basic characteristics like pitch, harmony, melody, tempo, timbre, structure and rhythm, among others. With the assumption that complex musical features such as mood or genre are signified by sets of fundamental musical attributes, it is hoped that these more abstract characteristics can be identified through combinations of these methods [1, 2].

There are many motivations for this work, as well as current successful applications of it. Pitch and beat tracking algorithms have found widespread use in digital audio workstations such as Pro Tools and Ableton Live, enabling pitch and beat correction, tempo estimation, and time-stretching of recorded audio. These functionalities have been used to great effect in current popular music and are often audibly detectable, as with the music of the artist T-Pain, known for heavy use of auto-tuning software.

Beyond music production, these computational methods can be leveraged to analyze and annotate the ever-growing and intractably large collections of music that the digital age has enabled. With an estimated 75,000 official album releases in 2010 alone as an indication of scale [3], and with digital transmission the primary means of maintaining and consuming this music, a computational approach to annotating and cataloging these collections is highly desirable. Indeed, new digital-era companies that serve streams of music to users on demand, such as Spotify, SoundCloud and Pandora, have begun to employ many MIR methods (and researchers) for genre detection, playlist generation and music recommendation, among other services.

A main objective of this thesis is to examine and further develop computational methods of measuring rhythm similarity in music signals. The importance of rhythm to music almost needs no mention. Under composer Edgard Varèse's generous definition of music as "organized sound", rhythm remains fundamental in that time is one of the few dimensions along which sound can be organized. And so, with the goal of the MIR community to fully parse musical content through computational means, contending with rhythm is an important step in this endeavor. Combining previous research on rhythm from the MIR literature with advances in machine learning, this work presents a state-of-the-art system for measuring rhythm similarity.
In the hope of anchoring this abstracted computational process to its stated goal of extracting musically, and specifically rhythmically, meaningful information, this work makes a concerted effort, beyond what is common in the literature, to analyze not only the results but also the dataset used and the system's design from a multi-disciplinary perspective. Using a standard MIR verification scheme on a well-known dataset, the Latin Music Dataset [4], it is hoped that, through statistical analyses of dataset metadata, analysis of the dataset from a musical, cultural and historical perspective, and scrutiny of the basic assumptions built into the design, the system's musical relevance can be understood with greater clarity than the common classification-score measuring stick allows. Further, with a great deal of personal interest and domain knowledge in the subject (rhythm) and the specific area of application for this work (Latin rhythms/music), it is hoped that this research approach will provide a useful and elucidating look into the analysis of computational rhythm similarity measures, and also act as an encouragement to take on this level of scrutiny in developing computational methods for music applications in general.

Chapter 2

Explication and Literature Review

MIR research has largely followed a standard and persistent design model in developing novel methods for parsing music signals. This model comprises two steps: one, a hand-crafted feature extraction stage that aims to single out a given musical characteristic from an input signal; and two, a semantic interpretation or classification stage applying some function that allows the mapping of this feature to a symbolic representation. This chapter surveys previous work in developing both of these system components as they relate to the current task of measuring rhythm similarity. Sections 2.1 and 2.2 summarize standard approaches to feature extraction and review the various attempts at characterizing rhythm through feature design, highlighting the development of Onset Patterns, which serve as a jumping-off point for further research. Section 2.3 reviews improvements to feature extraction methods through the use of sophisticated machine learning algorithms, pointing to a blurring of the distinction between feature extraction and classification and setting the direction for this research.

2.1 Computational Music Similarity Measures

Though some feature extraction methods produce easily interpretable, musically relevant representations of the signal directly, certain feature representations are imbued with musical meaning only as measures of similarity. For instance, the output of a tempo detection algorithm, which looks for strong sub-sonic frequencies, can be interpreted easily by looking for a global maximum in the representation, revealing a beats-per-minute value, a standard unit of tempo. Conversely, Mel-Frequency Cepstral Coefficients (MFCCs), widely used as a measure of timbre, are not musically interpretable on their own but, paired with a distance metric, can be used to identify sounds based on distance to a known example, a common application of which is instrument identification [5, 6].
In this paradigm of measuring similarity, musical facets can be seen as existing in some multi-dimensional space where similar representations are grouped closely together, and the feature extraction algorithm is a mathematical projection of a given signal into one of these spaces. Through this approach, a posteriori-defined properties such as rhythm and structure, esoteric qualities such as timbre, and complex characteristics such as mood can be inferred based on their distance to labelled examples in their respective feature spaces. In this way, classification supplies semantic meaning and a perceptually meaningful framework for analyzing these more complicated features. As an example, the previously mentioned quality of sound referred to as timbre is not easily defined, difficult to conceptualize, and its manifestations are difficult to describe with language. However, agreeing that timbre is the feature that distinguishes the sound of one instrument from another, we can discuss timbre similarity through the task of matching multiple recordings of the same instrument, defining it in finite terms (i.e. the timbre of a flute vs. the timbre of a horn).

One of the major obstacles to this approach is the necessity of labeled datasets. The development, verification and interpretation of these algorithms rely on classification tasks on pre-labeled examples; without an example of flute timbre to match against, an unlabeled signal cannot be identified with this characteristic. Ideally, when developing new similarity features, the verification dataset suits the task well by representing the desired musical feature homogeneously within a given class, but datasets with feature-similarity-based ground-truths can be expensive and time-consuming to produce. To address this, the MIR community actively compiles and shares labeled datasets for these purposes; examples of widely used datasets include labeled audio of monophonic instruments (McGill University Master Samples), audio accompanied by various descriptive tags (Magnatagatune) and many datasets divided along genre membership (LMD, Ballroom Dance, Turkish Music). But in practice, this has often led to the use of datasets not created specifically for the given similarity measure of concern, employing a proxy identification with some other more easily identifiable characteristic. This includes the very common use of genre as a proxy for texture and rhythm similarity (which this thesis research employs knowingly) [7-12], and cover song identification for harmonic and melodic similarity [13-15]. Implicit in this approach is the assumption that ground-truths in these datasets correlate strongly enough with the musical characteristic being measured to provide meaningful classification results and system verification.
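As a minimal illustration of this similarity-as-distance paradigm, the sketch below matches an unlabeled feature vector to its nearest labelled example. The feature values and labels are invented for illustration; a real system would use full MFCC statistics and a carefully chosen distance metric.

```python
import numpy as np

def nearest_label(query, examples, labels):
    """Return the label of the labelled example closest to `query`.

    query    : (D,) feature vector, e.g. a track's mean MFCC vector
    examples : (N, D) matrix of labelled feature vectors
    labels   : length-N list of class labels (e.g. instrument names)
    """
    dists = np.linalg.norm(examples - query, axis=1)  # Euclidean distances
    return labels[int(np.argmin(dists))]

# Toy usage: identify an unlabeled sound by its nearest labelled neighbour.
examples = np.array([[1.0, 0.2], [0.1, 0.9]])  # invented "flute"/"horn" vectors
print(nearest_label(np.array([0.9, 0.3]), examples, ["flute", "horn"]))  # flute
```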
2.2 Rhythm Similarity

Rhythm is a complex and broad musical concept with varying definitions in different contexts. Though rhythm exists in music on various time-scales and can describe anything from the timing of a melody to textural shifts and large-scale events, in this paper (and in the MIR literature on the subject), rhythm is taken to refer to regularly repeating sound events in time on the musical measure level (approximately 2-8 seconds); that is, rhythm as those looping musical phrases that a percussionist or a bass player in a dance band might play.

In the MIR literature, analyzing rhythmic similarity is distinct from rhythm description or transcription tasks. Where the latter seek to transform a musical signal into symbolic annotations or describe it directly in some manner, the former is concerned only with isolating rhythm as invariant across different instances, often using highly abstracted representations. Although [16] provides a framework for understanding rhythm similarity with symbolic sequences, for the rapidly growing body of recorded audio this kind of analysis is not applicable for several reasons: the vast majority of audio recordings do not have this level of annotation; providing this information by hand is time-consuming; and computational methods of annotation remain ineffective [17]. Hence, signal-based methods for rhythm analysis are highly desirable.

From a conceptual perspective, isolating this level of rhythm as an invariance requires removing pitch, tempo and timbre dependence so that a rhythm played on two different instruments, using different pitches and at different speeds, will be recognized as the same. However, previous approaches tailor this list according to the intended application and sometimes include additional dependencies to be removed: phase, referring to the position of the start of a repeating rhythm; and temporal variance, referring to the evolution of a rhythm over longer time frames. Removing phase and temporal variance is a practical consideration specific to signal processing concerns; though a human can often easily recognize the beginning of a rhythmic pattern based on larger musical context, recognizing this computationally has been shown to be problematic [18], and when analyzing a signal there is no guarantee that the beginning of the signal will correspond to the beginning of a repeating rhythm. Similarly, for track-wise classification, temporal invariance works towards minimizing the effects of portions of audio where there is no discernible rhythm or where changes in rhythm are not representative of the track on the whole.

Aside from a handful of intuitively motivated rhythm similarity systems that extract unit rhythm representations and preserve phase by employing often complicated heuristics to deduce the beginning of a phrase [19-21], most designs remove phase and take a more abstracted approach. Though differing in important ways, they typically follow a common script: 1) calculate a novelty function from the signal, removing pitch-level frequencies and highlighting rhythmic content; 2) produce a periodicity and/or rhythmic decomposition of this novelty function by analyzing consecutive rhythm-phrase-length windows (typically 4 to 12 seconds), capturing local rhythm content on this scale; 3) transform this local representation by warping, shifting, resampling or normalizing to remove tempo dependence; 4) aggregate local representations over time to produce a track-wise rhythm representation, removing temporal dependence.

Table 2.1 shows a summary of these four steps for each of the main approaches in the literature. In this table, "Affected Dimension" refers to the musical dimension that each stage acts to either preserve or remove. Though all of these methods remove pitch content in the novelty function calculation, the Scale Transform implemented in [11, 22, 23] and the Fluctuation and Onset Patterns implemented in [10, 11, 24] do preserve some level of timbre through multi-band novelty function representations.
Most approaches produce local rhythm representations by using the Auto-Correlation Function (ACF) or Discrete Fourier Transform (DFT). [12] notes that these functions are beneficial for their preservation of the sequential order of rhythmic events, but, as periodicity representations in which only rhythm-level frequencies are coded, they also remove phase. Rhythm Patterns [9, 25] diverge from this approach by including, in addition to periodicity analysis, Inter-Onset Intervals (IOIs), which encode the spaces between onsets in the novelty function, and Rhythm Patterns, which are bar-length representations of the novelty function. This is a robust approach but relies on unreliable heuristics for extracting the downbeat used to determine the Rhythm Pattern. All of the approaches make some effort to remove temporal variance through temporal aggregation over all frames.

Approach (Affected dimension)           | Novelty (Pitch/Timbre) | Local Rhythm (Phase)      | Rhythm Scaling/Morphing (Tempo)         | Aggregation (Temporal)
Beat histogram [26-29]                  | single                 | ACF                       | log-lag + shift detection, sub-sampling | histogram
Rhythm patterns [9, 25]                 | single                 | rhythm patterns, ACF, IOI | bar-length normalization                | k-means, histogram
Hybrid ACF/DFT [12, 30]                 | single                 | DFT, ACF and hybrids      | resampling with local tempo             | mean
Scale transform [11, 22, 23]            | single, multiband      | ACF, DFT                  | scale transform                         | mean
Fluctuation/Onset patterns [10, 11, 24] | multiband              | DFT                       | log-frequency + subsampling             | mean

Table 2.1: Summary of main approaches in the literature for computational rhythm similarity.

The biggest divergences in these designs can be seen in the various methods for removing tempo sensitivity from the representation. Noting that relative rhythmic structure can be compared more easily as a shift on a log scale than as a stretch on a linear scale (doubling a rhythm's tempo halves all of its periodicities, which on a log2 axis is a uniform shift of one octave regardless of the rhythm), a log-lag mapping in the Beat Histogram [28, 29] or a log-frequency mapping in the Onset Pattern [10] allows for reduced sensitivity to tempo changes, where only large tempo differences are noticeable. In [10, 29], the effect of tempo is further reduced by sub-sampling in the log-lag/frequency domain to produce a coarser representation. [28] employs a shift in the log-lag domain to obtain a fully tempo-insensitive representation, but this relies on determining the proper shift value, which is prone to errors. Subject to similar problems are the methods employed in calculating the Hybrid [12] and, as mentioned before, Rhythm Pattern [9] representations, which rely on determining tempo and bar boundaries for tempo normalization and bar-length pattern identification. The octave errors common to tempo estimation algorithms are problematic here, leading to inconsistencies in rhythm representations for these methods. [22, 23] offers a robust, fully tempo-invariant approach that takes the scale transform of the ACF, resulting in a scale-invariant version of the already shift-invariant ACF and obviating the need to determine a shift amount to correct for the shift introduced by log-lag mapping.

Though the Beat Histogram and Hybrid ACF/DFT, if applied successfully, do result in fully pitch-, tempo-, timbre- and phase-invariant rhythm representations, these are less useful when tasked with measuring rhythm similarity in the context of general similarity in multi-instrumental recorded music.
Indeed, with most of these methods performing verifications through genre identification on standard dance-based datasets, better classification success has been obtained with the Onset Pattern [10, 11] and the Scale Transform [22, 23], which both preserve some level of timbre dependence through a multi-band representation. This makes sense when the question is not "are these rhythms the same?" but rather "do these two tracks sound similar from a rhythmic perspective?", where the listener looks not only for similar rhythms but for similar orchestrations of those rhythms. While the former might be more conceptually pure with respect to rhythm similarity, the latter is more amenable to a genre classification task and to use as a tool in measuring general music similarity.

It merits repeating here that nearly all of the rhythm similarity studies mentioned above employ genre identification as a verification method. Recalling the common use of already available datasets in lieu of ones tailored for the task (in this case, a dataset labeled according to a specifically defined understanding of rhythm similarity), this has been a common and generally accepted practice in rhythm similarity research. In using dance-based datasets (LMD [11], Ballroom Dance [9-12, 25], Turkish Music [22, 23]), the underlying assumption behind this practice is not only that a rhythm can be reliably associated with a specific genre, but also that a given genre has a representative rhythm, justifying a bijective mapping from one to the other.

2.2.1 Onset Patterns

Taking the perspective that a timbre-sensitive approach to rhythm similarity is desirable for application to multi-instrumental music signals, and noting the importance of reducing reliance on error-prone heuristics in the design, the Onset Pattern and the Scale Transform stand out as promising approaches. The primary difference between the two lies in their approach to tempo invariance, where the Scale Transform achieves full tempo invariance and the Onset Pattern shows invariance only for local tempo changes. As [11] effectively shows, tempo can be an important and identifying characteristic for certain genres. Although the motivation here is not genre identification, this suggests that perhaps tempo is also important for the perception of rhythm similarity. If two songs have the same rhythm but have tempos so different that they produce a different effect on the ear, this becomes a characteristic worth tracking. With this in mind, the Onset Pattern, which encodes only relatively large differences in tempo, is especially promising for further development as a general measure of rhythm similarity in music.

Onset Patterns (OPs), first described in [10] and refined in [11], are relatively straightforward to calculate and follow the signal pipeline mentioned above. As illustrated in Figure 2.1: 1) the signal is transformed to the time-frequency domain, processed to produce a novelty function through spectral flux, mean removal and half-wave rectification, and sub-sampled to produce log-spaced frequency sub-bands; 2) log2-frequency DFTs are applied to these sub-bands over 6-8 second windows to produce a periodicity representation; 3) each frame is subsampled in the frequency and periodicity dimensions to generalize the representation; and 4) frames are aggregated to produce a track-level representation.
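As a rough sketch of this four-step pipeline, the following assumes librosa is available and substitutes illustrative stand-ins for the published designs: a mel filterbank approximates the log-spaced sub-bands, and the log2 periodicity axis is obtained by interpolating an ordinary DFT. The exact parameters used in this work are given in Section 3.1.

```python
import numpy as np
import librosa

def onset_pattern(y, sr=22050, n_bands=30, n_period=25):
    # 1) Multi-band novelty: spectrogram -> log-spaced sub-bands ->
    #    spectral flux, mean removal and half-wave rectification.
    S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
    fb = librosa.filters.mel(sr=sr, n_fft=1024, n_mels=n_bands, fmin=150.0)
    bands = np.log1p(fb @ S)                              # (n_bands, n_frames)
    flux = np.diff(bands, axis=1)
    novelty = np.maximum(flux - flux.mean(axis=1, keepdims=True), 0.0)

    fps = sr / 256.0                                      # novelty frame rate
    win, hop = int(8.0 * fps), int(0.5 * fps)             # 8 s windows, 0.5 s hop
    lin_axis = np.fft.rfftfreq(win, d=1.0 / fps)          # linear periodicity (Hz)
    log_axis = 0.5 * 2.0 ** np.linspace(0, 5, n_period)   # 0.5-16 Hz, log2-spaced

    frames = []
    for start in range(0, novelty.shape[1] - win, hop):
        # 2)-3) Periodicity spectrum per band, resampled onto the log2 axis.
        P = np.abs(np.fft.rfft(novelty[:, start:start + win], axis=1))
        frames.append(np.stack([np.interp(log_axis, lin_axis, p) for p in P]))
    # 4) Aggregate frames over time into a track-level representation.
    return np.mean(frames, axis=0)                        # (n_bands, n_period)
```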
Not detailed in these steps, however, is the ordering of "pooling" stages, important to [10]'s design, which act to summarize multi-band information into a smaller representation. In particular, pooling occurs in the frequency dimension both before and after calculating periodicity. Also left out is a normalization step to correct for artifacts from the various log-compression and pooling steps. Justifications for these design choices, as well as the implementation of this normalization step, are left unclear in the original paper.

Figure 2.1: Extraction of Onset Patterns (OP) from the audio signal.

[11] refines this process by systematically testing different designs and parameters. Of particular note in its findings are the importance of window size in the periodicity calculation and the negligible effect of the specific ordering of pooling steps. With an 8-second-long window (versus 6 seconds in [10]), a single pooling stage can be applied at the end with no effect on overall efficacy. Through this exhaustive search, [11] was able to improve OP performance beyond the original design. However, these results are based on necessarily limited parameter testing, constrained by time and feasibility and largely reliant on ignoring possible effects of interaction between parameter choices, highlighting the difficulties in optimizing feature extraction.

2.3 Machine Learning

Until recently, MIR research has taken the approach of designing algorithms to extract some explicit musical feature, using simple data models and distance measures for verification against ground truths (e.g. [10, 11]'s use of a K-Nearest-Neighbor model with Euclidean distance on OP features). However, for more complex musical characteristics, some in the field are turning their focus away from feature design to more sophisticated classification models and machine learning algorithms such as support vector machines [31-33], multi-layer perceptrons [34-36] and, more recently, deep-network architectures [37, 38]. With the standardization of many feature designs, such as chroma and MFCCs among many others, these more advanced machine learning methods have been used to squeeze performance from these features or to extract more complex characteristics from sets of features. In this line of thought, rather than relying on some specific feature extraction method, the task is couched in terms of a data classification problem, which allows for leveraging learning algorithms to extract the relevant information based on a desired outcome.

2.3.1 Deep Networks for Feature Learning and Classification

[39] advocates giving learning algorithms, in particular deep network architectures, a more fundamental role in system development; with a sufficiently sophisticated learning algorithm, an optimally designed feature can be automatically extracted from a minimally processed input signal. This has the potential to solve several problems that have plagued MIR research for over a decade. Besides obviating the need to spend time rigorously testing algorithms in search of optimal designs and parameters, it has, more importantly, the potential to capture musical characteristics that would otherwise be too complex or abstruse to formulate within a feature extraction signal chain.

Hand-crafted algorithms are necessarily limited by our own perceptual and technical abilities, and an approach that relies on these alone to explore the function space of
signal-to-feature mappings limits the range of possible solutions. As initially demonstrated in [40] for music information retrieval, deep network architectures can be used to this end for their ability to model high-order functions through system depth. By cascading multiple affine transformation layers with simpler nonlinear functions, they allow for a system complexity sufficient to model abstract musical characteristics.

As [39] argues, using deep architectures to learn features for MIR follows naturally from the observation that many successful designed features in the literature can be described in terms of deep architectures themselves, combining multiple steps of affine transformations, nonlinearities and pooling. Taking the now-standard calculation of MFCCs as an example, its steps include the Mel-scale filterbank transform and discrete cosine projection (affine transformations), and log-scaling and the complex modulus (nonlinearities). Hence, from this perspective, the primary difference between feature designs is the choice of parameters. Further, given that these parameters can be optimized for a given task with deep networks, not only is it possible to learn better designs for features such as MFCCs, but this points to the prospect of learning better features altogether, unconstrained by the specifics of implementation. In the two-step paradigm described above, the distinction between feature extraction and classification becomes obscured: step one is reduced to preparing the data for input to step two, a deep network where each layer is a progressively more refined feature representation and the final output layer performs classification.

Deep architectures have found strong use in problems of feature learning for machine vision [41-44], but there has been relatively little research into this approach within the MIR community. Although SVMs, as well as the other more sophisticated learning algorithms mentioned above, have been used to improve classification rates for designed features, efforts to learn the features themselves have been few. The initial successful uses of deep networks for music retrieval tasks in [40] and [45] show that learned features outperform MFCCs for genre classification and that sophisticated temporal pooling methods can be learned to incorporate multi-scale information for better results. Further use of deep networks in [38] shows that Convolutional Neural Networks, a specialized deep network architecture, can be successfully used for the task of chord recognition by extracting chord representations directly from several seconds of tiled pitch spectra. The positive results these approaches achieve are encouraging and justify further research into deep networks for feature design tasks such as rhythm similarity.

It is important to note that in these supervised learning schemes, the data used in training and classification plays a more fundamentally important role in feature design. With hand-crafted features, designs are based on some idealized concept of a given musical feature (e.g. tempo, timbre, pitch), and classification tasks serve merely as validation of the design. However, if the feature itself is learned in the process of supervised training of a classification model, it is necessarily shaped by the relationship between class labels and signal attributes in the dataset used for training.
This is a positive characteristic of the approach since, as mentioned, it unhinges the perceived musical characteristic from a pre-determined algorithm, but it requires care and scrutiny when creating datasets or, as is common practice, using pre-existing ones. Although research in unsupervised deep learning networks shows promise in reducing the reliance on large labeled datasets [46], this work only considers fully-supervised methods.

Chapter 3

Approach

Based on the observations discussed in the previous chapter, this chapter presents a novel variation of the onset pattern approach. By treating the pooling and normalization stages of feature extraction as layers of a deep learning network, these stages can be optimized for the task of genre classification. In this way, the post-processing and pooling steps that are infeasible to optimize manually can be learned as an extension of the Onset Pattern feature in this deep architecture context. Once trained, this transformation is applied independently to all track-wise onset patterns and the outputs are averaged over time, yielding a summary representation for an entire track.

3.1 Onset Patterns Implementation

OP calculation here generally follows the processes outlined in [10] and [11], but for this application the calculation is simplified by removing several post-processing steps. Operating on mono recordings sampled at 22050 Hz, log2-frequency DFTs are taken over 1024-sample windows with a hop size of 256 samples. Frequencies span six octaves beginning at 150 Hz. The frequency resolution of this transform is kept variable to test optimal resolution levels in later experiments. Multi-band novelty functions are generated by computing spectral flux, removing the mean and half-wave rectifying the result. From here, eight-second-long windows of these novelty functions are analyzed at 0.5-second intervals to extract a periodicity spectrum by applying another log2-DFT spanning five octaves beginning at 0.5 Hz. This corresponds to a beats-per-minute (BPM) range of 30 to 960, referred to here as the periodicity range. As with the log2-DFT used in the frequency multi-band calculation, periodicity resolution is left as a variable. This gives a frame-matrix with dimensions (F, P), where F is the number of frequency bins and P is the number of periodicity bins.

3.2 Deep Network Implementation

For feature learning and classification, this research makes heavy use of Eric Humphrey's in-development deep learning network Python libraries, informally presented in [47]. Formally, deep networks transform an input Z_1 into an output Z_L through composition of nonlinear functions f_l(\cdot \mid \theta_l), where l \in \{1, \dots, L\} and L indicates total layer depth. Each function f_l takes input Z_l with parameters \theta_l. The network is composed of affine transformations, or fully-connected layers, where the outputs from one layer are distributed fully over the inputs to the next layer. Precisely:

    F(Z_1 \mid \Theta) = f_L(\dots f_2(f_1(Z_1 \mid \theta_1) \mid \theta_2) \dots \mid \theta_L)    (3.1)

where F = [f_1, f_2, \dots, f_L] is the set of layer functions, \Theta = [\theta_1, \theta_2, \dots, \theta_L] is the corresponding set of layer parameters, the output of one layer is passed as the input to the next as f_l(Z_l) = Z_{l+1}, and the overall depth of the network is given by L.
Each layer f_l is a fully-connected, or affine, transformation, defined by the following:

    f_l(Z_l \mid \theta_l) = h(W_l \cdot Z_l + b_l), \qquad \theta_l = [W_l, b_l]    (3.2)

Here, the input Z_l is flattened to a column vector of length N_l and the dot-product is computed with a weight matrix W_l of shape (M_l, N_l), followed by an additive vector bias term b_l of length M_l. Note that an affine layer transforms an N_l-dimensional input to an M_l-dimensional output, referred to as the width of the layer. The final operation is a point-wise nonlinearity, h(\cdot), defined here as \tanh(\cdot), which is bounded on (-1, 1).

When used as a classification system, the first L - 1 layers of a deep network can be viewed as feature extractors, and the last layer, f_L, is simply a linear classifier. This output can be forced to behave as a probability mass function for membership to a given class by making the length of Z_L match the number of classes and by constraining the L1-norm of the output to equal 1. This probability mass function P(\cdot) for an input Z_1 is achieved by applying the softmax operation \sigma to the output of the network, Z_L, defined as follows:

    P(Z_1 \mid \Theta) = \sigma(Z_L) = \frac{\exp(Z_L)}{\sum_{m=1}^{M_L} \exp(Z_L[m])}    (3.3)

In this supervised learning implementation, the output Z_L of this final layer is used to make a prediction, where the most likely class is determined by \arg\max(P(Z_1 \mid \Theta)); combined with a provided target value y, this yields a loss function. With the network defined as a probability mass function for class membership, it can be trained by iteratively minimizing this loss function using the negative log-likelihood of the correct class over a set of K observations:

    \mathcal{L} = -\sum_{k=1}^{K} \log P(X^k = Y^k \mid \Theta)    (3.4)

where X^k and Y^k are the input data and corresponding class label, respectively, of the k-th observation. This loss function can then be minimized through gradient descent, which iteratively searches for the minimum value of the loss function. Here, gradients are computed over mini-batches of K > 1 observations, much smaller than the total number of observations, by sampling data points from the training set and averaging the loss over the batch. Specifically, the update rule for \Theta is defined as its difference with the gradient of the scalar loss \mathcal{L} with respect to the parameters \Theta, weighted by the learning rate \eta:

    \Theta \leftarrow \Theta - \eta \, \frac{\partial \mathcal{L}}{\partial \Theta}    (3.5)

A batch size of K = 100 is used, where the observations are drawn uniformly from each class (i.e. 10 observations of each genre), with a constant learning rate of \eta = 0.1. Learning proceeds for 3k iterations without early stopping or model selection. Note that all input data is preprocessed, before input to the network, to have zero mean and unit variance; this is done by calculating the mean and standard deviation over all data points, and was shown to significantly improve system performance.

3.2.1 Applying the Deep Network

Unlike previous classification schemes for rhythm similarity methods, track-level aggregation is held off until after frame-wise classification. Here, the deep network is applied independently to a time-series of onset patterns, producing a posteriorgram. Though there are alternative statistics that could be explored, such as the median or maximum, mean-aggregation is taken for each class prediction over the entire track.
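The following NumPy sketch restates Equations (3.1)-(3.5) as executable code: tanh affine layers, a softmax output, and minibatch gradient descent on the negative log-likelihood. The initialization scheme is an illustrative placeholder, and this is not the implementation used here (the system is built on the libraries described in [47]).

```python
import numpy as np

class AffineNet:
    def __init__(self, dims, eta=0.1, seed=0):
        # dims = [N_1, M_1, ..., M_L], e.g. [3600, 2048, 10] for the final system
        rng = np.random.default_rng(seed)
        self.W = [rng.normal(0.0, 0.01, (m, n)) for n, m in zip(dims, dims[1:])]
        self.b = [np.zeros((m, 1)) for m in dims[1:]]
        self.eta = eta

    def forward(self, Z):                       # Z: (N_1, batch)
        self.Zs = [Z]
        for l, (W, b) in enumerate(zip(self.W, self.b)):
            A = W @ self.Zs[-1] + b             # affine transform, Eq. (3.2)
            self.Zs.append(A if l == len(self.W) - 1 else np.tanh(A))
        e = np.exp(self.Zs[-1] - self.Zs[-1].max(axis=0))
        return e / e.sum(axis=0)                # softmax, Eq. (3.3)

    def sgd_step(self, X, y):                   # y: integer class labels
        P = self.forward(X)
        G = P.copy()
        G[y, np.arange(X.shape[1])] -= 1.0      # gradient of the NLL, Eq. (3.4)
        G /= X.shape[1]
        for l in reversed(range(len(self.W))):
            gW, gb = G @ self.Zs[l].T, G.sum(axis=1, keepdims=True)
            if l > 0:                           # back-propagate through tanh
                G = (self.W[l].T @ G) * (1.0 - self.Zs[l] ** 2)
            self.W[l] -= self.eta * gW          # update rule, Eq. (3.5)
            self.b[l] -= self.eta * gb
```

Track-level prediction (Section 3.2.1) then amounts to calling forward() on each of a track's onset-pattern frames and averaging the resulting class probabilities.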
3.3 Analytic Approach

Chapter 2 highlights two connected issues that have prompted the analysis and discussion approach taken in this research. The first issue concerns the practice of genre identification as a proxy task for rhythm similarity. As mentioned, genre classification is the de facto proxy task for verifying rhythm similarity measures, and the literature remains short on: 1) in-depth analysis of the suitability of genre for the given feature; 2) informed explication of the assumptions made in system design; and 3) proper examination of classification results that fully takes into account the contents of the dataset used. The facile assumptions made in system verification and the commonly accepted face-value interpretations of classification results belie either a general lack of commitment to, or naiveté about, musical relevance among researchers, a point explored in [48] and stated confidently enough to serve as its title: "Classification Accuracy is Not Enough".

The second issue concerns the effect of the dataset on learned features in rhythm similarity. As discussed at the end of Section 2.3.1, in a deep network features are learned from the provided labeled training examples; hence, the feature's characteristics are molded by the class representations in the dataset. Though this is desirable when working with an ideal dataset for the task, in the case of this research, which uses genre membership as a proxy for rhythm similarity, there may be unintended (i.e. not rhythmically relevant) influences on the feature representation.

In an effort to better understand the musical significance of this rhythm similarity research beyond classification score, and in an attempt to account for these various factors, a multi-disciplinary approach is taken here to examine the results, the dataset and the system design. In addition to standard machine learning, MIR and statistical analysis methods, results are examined through rhythmic, musico-cultural and historical analyses, employing personal domain knowledge and borrowing heavily from the related musicological literature.

Chapter 4

System Configuration and Results

4.1 Dataset

In keeping with standard methods, a genre classification task is used to evaluate this measure of rhythm similarity, utilizing the well-known Latin Music Dataset (LMD). The LMD is a collection of Latin dance music comprising 3216 tracks¹, split into 10 distinct genres: Axé, Bachata, Bolero, Forró, Gaúcha, Merengue, Pagode, Salsa, Sertaneja and Tango. The LMD is used here for several reasons: for this dance-based dataset, genre is assumed to serve as a good proxy for rhythm; the size of the LMD compares favorably to other, smaller dance-based datasets such as the Ballroom set, a requisite for supervised deep-learning tasks; and, perhaps more importantly, this research stems from a deeper interest in Latin music in general. Based on the idea that domain knowledge is important to the development and analysis of computational music similarity measures, personal knowledge of and interest in the subject is leveraged for the analyses in Chapters 5-7.

¹ Though the original LMD has 3,227 total recordings, duplicates and tracks too short in duration for analysis have been removed.

Though the LMD provides full recordings, many of the tracks are from live shows and contain non-musical segments (e.g. clapping, spoken introductions). To reduce this noise, only the middle 30 seconds of each track are used for analysis.

4.2 Methodology

The following experiments seek to identify the optimal system configuration for genre classification on the LMD. These experiments are broken into two parts: the first concerns the resolution of the OP and the second concerns complexity in the feature-learning stages of the network. For the OP, the best general feature space is desired: one that is maximally informative while avoiding over-representation, which can slow down, and even hinder, classification. Various OP resolutions are examined by testing values for frequency bins (F) and periodicity bins (P) as independent factors. The subsequent network tests seek to design a network that appropriately fits the complexity of the task. System complexity is determined by layer depth (L) and layer output size (M); several combinations of values for these parameters are examined. For baseline classification, the system defined in Section 3.2 with a single-layer network is used, which is simply multi-class linear regression. This is the classifier used for all OP parameter tests. Scores for all classification tests are averaged over 10 cross-validated folds, stratified against genre.
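A sketch of this evaluation protocol, assuming scikit-learn is available; train_network and score are hypothetical stand-ins for the training and scoring steps described above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validated_accuracy(X, y, train_network, score, n_folds=10):
    """Average accuracy over `n_folds` folds stratified against genre labels `y`."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    accuracies = []
    for train_idx, test_idx in skf.split(X, y):
        model = train_network(X[train_idx], y[train_idx])   # e.g. the deep net
        accuracies.append(score(model, X[test_idx], y[test_idx]))
    return float(np.mean(accuracies))
```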
4.3 OP Parameter Tuning

Initial tests begin on an OP with F = 30 and P = 25, based on results in [11], taking the minimal dimensions that were shown to perform well.

4.3.1 Periodicity Resolution

Over the seven tested OP configurations, with P in the range [5, 100], P = 15 provides the best results. An analysis of variance (ANOVA) test on classification scores shows that periodicity resolution plays a significant role in the outcome, indicated by a Prob>F value less than 0.05, as can be seen in Table 4.1.

Source  | SS      | df | MS      | F    | Prob>F
Columns | 186.101 | 6  | 31.0168 | 9.21 | 3.11E-07
Error   | 212.231 | 63 | 3.3688  |      |
Total   | 398.332 | 69 |         |      |

Table 4.1: ANOVA results for classification scores with varying P values show that periodicity resolution is a significant factor.

After applying a Tukey HSD adjustment, a comparison of means (Figure 4.1) presents a clear trend, with significantly lower scores for OPs with either too few or too many periodicity bins and the maximum classification rate obtained with P = 15. These tests, showing better results with fewer dimensions, differ from the results in [11], but this disparity most likely arises from differences in data and classification strategy.

[Figure 4.1: Effect of P on classification, for OP dimensions 30x5 through 30x100 against mean accuracy (%). The highest result is highlighted in blue, while significantly different results are in red.]

4.3.2 Frequency Resolution

Setting P = 15 based on the above, F values are then tested in the range [18, 300]. An ANOVA test on these results shows a significant effect for this parameter, with Prob>F = 0.001, and, as can be seen in Figure 4.2, accuracy rates go up with higher frequency resolution, leveling out for F ≥ 240. Results in [11] show minor but statistically insignificant improvements from increasing the OP frequency resolution; this is consistent with the results here for F ≤ 120 and does not preclude the higher scores seen for F > 120. Based on these tests, going forward OPs are calculated setting F = 240 and P = 15.

[Figure 4.2: Effect of F on classification, for OP dimensions 18x15 through 300x15 against mean accuracy (%). The highest result is highlighted in blue, while significantly different results are in red.]
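The significance testing used throughout this chapter follows a standard recipe, sketched below assuming SciPy and statsmodels are available; scores maps each parameter setting to its ten per-fold accuracies, and the settings shown in the usage comment are illustrative.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def compare_settings(scores):
    """One-way ANOVA over per-fold accuracies, with Tukey HSD mean comparison."""
    groups = list(scores.values())
    F, p = f_oneway(*groups)          # Prob>F below 0.05 indicates significance
    data = np.concatenate(groups)
    labels = np.concatenate([[k] * len(v) for k, v in scores.items()])
    return F, p, pairwise_tukeyhsd(data, labels)

# e.g. compare_settings({5: folds_P5, 15: folds_P15, 100: folds_P100})
```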
4.4 Deep Network Parameterization

With optimal parameters for this feature set in place, the next step is finding the best network architecture for this data. Returning to the notation of Section 3.2, choices of layer width M_l (for l < L) and network depth L are explored. Note that the input and output dimensionality are fixed as N_1 = 240 × 15 and M_L = 10, due to the previous discussion and the number of classes in the dataset, respectively.

4.4.1 Layer Width

This parameter search begins with a two-layer network (L = 2), sweeping the width of the first layer, M_1, over increasing powers of 2 in the range [16, 8192]. Results demonstrate a performance pivot around M_1 = 128, achieving maximum accuracy at M_1 = 2048 but otherwise insignificant variation for M_1 ≥ 128. An ANOVA on these results shows significance for this factor (Prob>F = 0.015), but Figure 4.3 indicates minimal impact for M_1 ≥ 128.

[Figure 4.3: Mean comparison of ANOVA tests on network layer complexity in a 2-layer architecture, for hidden layer sizes 16 through 8192 against mean accuracy (%), shows significantly lower results for small M.]

4.4.2 Network Depth

Based on the above, deeper architectures are considered by setting M_l = 2048 (for l < L) and incrementally adding layers up to a maximum depth of L = 6. This fails to show any significant changes in accuracy, with an ANOVA test revealing a Prob>F of 0.3684, greater than the null-hypothesis threshold of 0.05. Importantly, while only a limited number of interactions between depth and width are explored, independently varying L or M_l shows no significant difference provided M_l ≥ 128, consistent with previous findings.

4.5 Optimal System Configuration

Further tests continue with a two-layer architecture (L = 2, M_1 = 2048), based on the parameters used for the best score in Figure 4.3, expressed completely by the following:

    P(X_1 \mid \Theta) = \sigma(f_2(f_1(X_1 \mid \theta_1) \mid \theta_2))    (4.1)

For clarity, the dimensionality of the first layer, f_1, is given by (M_1 = 2048, N_1 = 3600), and the dimensionality of the second by (M_2 = 10, N_2 = 2048).

With this configuration, classification on the LMD yielded a peak average score of 91.32%, which surpasses previous attempts at genre classification on this dataset. Table 4.2 shows the proposed approach outperforming the others by a margin of more than 8%.

Feature                        | Accuracy (%)
LPQ (Texture Descriptors) [49] | 80.78
OP (Holzapfel) [11]            | 81.80
Mel Scale Zoning [50]          | 82.33
OP (Proposed)                  | 91.32

Table 4.2: Classification accuracies for different features on the LMD.

One trend that is immediately apparent in the results is a difficulty in classifying Brazilian genres. Table 4.3, with genre-wise classification accuracies ordered from highest to lowest, shows Axé, Gaúcha, Forró and Sertaneja, all Brazilian genres, occupying four of the five bottom slots.

Genre     | Total Per Genre | Correctly Predicted | % Correct
Merengue  | 314  | 309  | 98.41
Tango     | 407  | 400  | 98.28
Bachata   | 312  | 304  | 97.44
Pagode    | 306  | 288  | 94.12
Salsa     | 309  | 286  | 92.56
Axé       | 313  | 284  | 90.73
Bolero    | 314  | 278  | 88.54
Gaúcha    | 309  | 264  | 85.44
Forró     | 312  | 260  | 83.33
Sertaneja | 320  | 264  | 82.50
Total     | 3216 | 2937 | 91.32

Table 4.3: Classification accuracies by genre, ordered from highest classification score to lowest, show Brazilian genres generally performing worse than the rest.

Also, when looking at class-by-class confusions, as shown in Table 4.4, certain affinities between genres are apparent. The lowest-scoring genre, Sertaneja, has the majority of its false tags predicted as Bolero, but also many predicted as Gaúcha and Forró, while the next three lowest-performing classes, Gaúcha, Forró and Bolero, have most of their false tags predicted as Sertaneja.
These trends in class confusions will be expanded on in subsequent chapters.

The increase in accuracy over previous attempts may be partially explained by differences in methodology (i.e. aggregation strategies, signal noise reduction, etc.), but the strength of this deep-network strategy for classification plays a significant role here. Its effect can be seen in Table 4.5: comparing the proposed approach to simpler classification methods on the same OP input, the former outperforms the rest by a margin of just over 2%.

Actual \ Predicted | Ax   Ba   Bo   Fo   Ga   Me   Pa   Sa   Se   Ta
Ax                 | 284    0    1    2    8    2    5    1    9    1
Ba                 |   0  304    5    0    0    3    0    0    0    0
Bo                 |   3    5  278    1    5    0    4    3   10    5
Fo                 |   8    0    7  260   11    1    7    3   14    1
Ga                 |  14    0    6    4  264    0    4    1   14    2
Me                 |   3    0    0    0    1  309    0    0    1    0
Pa                 |   6    0    6    1    1    1  288    0    2    1
Sa                 |   1    2    7    0    4    3    2  286    2    2
Se                 |  11    0   18   12    6    1    4    3  264    1
Ta                 |   0    0    6    0    1    0    0    0    0  400

Table 4.4: Confusion matrix (rows are actual classes, columns are predicted classes) shows classification affinities between Sertaneja and several other genres.

Classifier | Validation Fold % | Training Fold %
LDA        | 89.06 | 98.06
SGD        | 89.09 | 94.95
Baseline   | 88.52 | 99.69
Deep-net   | 91.32 | 100

Table 4.5: Comparison of different classifiers on OP data. The proposed system outperforms all others by a margin of 2.23%.

With the added depth, the system is able to absorb the subsampling and normalization processes shown to be important in previous work on the OP [10, 11], boosting discriminative power in the intermediate layer before linear classification. This boost can be visualized by calculating within-class and between-class scatter matrices for each output stage in the network (OP, intermediate layer, softmax output) and looking at the mean values of their diagonals, which serve as a scaled measure of variance. With these measures, the desire is to visualize relative changes in the feature space at each level of the network, indicating how close together the members of a given class are to each other (within-class scatter) and how far apart the centers of these classes are from each other (between-class scatter).
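A sketch of these scatter measures, assuming a matrix of layer outputs Z (one row per track) with genre labels y; per the description above, only the mean of each scatter matrix's diagonal is kept, computed here per class.

```python
import numpy as np

def scatter_diagonals(Z, y):
    """Mean diagonal of within- and between-class scatter, per class.

    Z : (n_tracks, n_dims) outputs of one network stage (OP, hidden, softmax)
    y : (n_tracks,) genre labels
    """
    mu = Z.mean(axis=0)                # global mean of the feature space
    within, between = [], []
    for c in np.unique(y):
        Zc = Z[y == c]
        muc = Zc.mean(axis=0)          # class center
        # Within-class: average squared spread of class members around muc.
        within.append(float(((Zc - muc) ** 2).mean()))
        # Between-class: squared distance of the class center from the
        # global mean, averaged over dimensions.
        between.append(float(((muc - mu) ** 2).mean()))
    return np.array(within), np.array(between)
```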
Figure 4.4 shows these values, plotted for each class from highest classification score to lowest, left to right. The ideal case sees a horizontal and relatively low-valued line for within-class measures, indicating class members close to each other in the layer's feature space, and a horizontal and relatively high line for between-class measures, indicating classes well separated by distance. The increase in discriminative power that network depth brings can be seen by following the progression of the shape and relative values of these plots from input OP to intermediate layer to softmax output. In Figure 4.4 (top), the input line (red) is jagged and sits above the other lines in the plot, showing high scatter and unevenness across classes; the intermediate-layer line (green) shows a general reduction in scatter with increased horizontal smoothing; and the output line (blue) shows significantly lower scatter values with a smooth, slow positive slope, reflecting the ordering of the classes from best-performing to worst. Figure 4.4 (bottom) shows a similar trend but in reverse, with increases from input to output showing greater separation between classes at each stage, ending with a low, negatively sloped, smooth output line. It is clear that these lines progress towards the ideal linearly separable case, with each class becoming more distinct at each stage. The increase in separability between the input OP and the intermediate layer explains the increased classification accuracy seen in Table 4.5.

As an aside, it is notable in these plots that Tango starts with the highest within-class scatter but also the highest between-class distance, while having among the highest classification accuracies. So, despite the fact that representations of tracks in this genre are far from each other in the OP feature space, it seems that the network learns to identify Tango based on its distance from the rest of the classes. This facet of Tango as it is represented in this dataset is further investigated in Section 6.2.

[Figure 4.4: Top: Progression from input to output shows increasingly compact genre representations. Bottom: Progression from input to output shows increasingly distant classes.]

Chapter 5

OP Rhythm Similarity Analysis

Although classification scores on the LMD obtained with this approach are quite high compared with previous results, the genre-wise unevenness in classification requires a closer look to see what the contributing factors to these errors may be. More specifically, deeper analysis aims to measure the effect of the characteristics of the OP as a measure of rhythm similarity, examine the contents and accuracy of the dataset, and address the foundational assumption of this verification approach: that genre can be used as a proxy for rhythm. In this and the following two chapters, each of these factors and their interactions are examined to elucidate both the successes of this approach and its deficits as a system for tracking rhythm similarity.

5.1 Tempo Dependence

The OP preserves relative rhythmic structure across tempos, but does not adjust for shifts along the log scale that result from large tempo changes [11, Fig. 4]. Given the importance of this factor for classification with the OP, demonstrated in [11], tempo data is gathered for the LMD in order to properly determine its effect on the results. The Echo Nest's API [51] is used to calculate estimated beats-per-minute (BPM) values. These estimates are then hand-corrected for octave errors, either doubling or halving the estimate based on listening to the audio.

Figure 5.1 shows Gaussian-modeled tempo distributions for the LMD. From the plot, it is clear that there is some genre tempo-dependence in this dataset (compared with other common datasets, it shows more tempo dependence than the Turkish music dataset, though this dependence is not as stark as in the Ballroom dataset [11, Fig. 1]). Based on this distribution, a superficial view of the classifier output suggests a correlation between success and tempo separation: Merengue is well separated by tempo and has high classification success, while the genres Bolero and Sertaneja have a great deal of tempo overlap as well as a high rate of mutual confusion (see Table 4.4). But Figure 5.1 also shows many of the genres to have a high degree of tempo overlap in the 150-200 BPM range, contrasting with the generally high classification rates seen for those genres.

[Figure 5.1: Gaussian modeled tempo distributions by genre in the LMD.]

As a means of more precisely quantifying any effects of tempo overlap on classification results, binary logistic regression analysis is performed on classification success using tempo and tempo density as predictors. Here, tempo density is a measure of how many out-of-class tracks are within a 1-D tempo neighborhood of a track belonging to a given class. Density measurements for each track are estimated by sampling a Gaussian-smoothed tempo histogram of out-of-class tracks.
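The density predictor can be sketched as follows, given per-track BPM estimates; the histogram bin width and smoothing bandwidth here are illustrative choices, not values taken from this analysis.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def tempo_density(bpm, genre, bins=np.arange(30, 331, 2), sigma=2.0):
    """For each track, sample a Gaussian-smoothed histogram of the tempos of
    all out-of-class tracks at the track's own BPM."""
    bpm, genre = np.asarray(bpm, dtype=float), np.asarray(genre)
    density = np.empty(len(bpm))
    for g in np.unique(genre):
        in_class = genre == g
        hist, edges = np.histogram(bpm[~in_class], bins=bins)
        smoothed = gaussian_filter1d(hist.astype(float), sigma)
        idx = np.clip(np.digitize(bpm[in_class], edges) - 1, 0, len(smoothed) - 1)
        density[in_class] = smoothed[idx]
    return density
```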
This analysis initially suggests a measurable effect of overlap on classification success. The results in Table 5.1 show that, while BPM is not a factor, exceeding the 0.05 threshold that indicates significance, density is inversely related to classification success with a significance value of 0.000. However, several goodness-of-fit tests show the data to have poor predictive power. A Hosmer and Lemeshow goodness-of-fit test [52] (Table 5.2) proves the data a poor fit, showing statistical significance (Sig. = 0.000) and indicating that the data diverge from the model.¹ Further evidence of this can be seen through two more goodness-of-fit tests, the Cox & Snell [55] and Nagelkerke [56] R² tests, which serve as pseudo-R² measures for logistic regression models. Both R² values (R² = 0.004 and 0.010, respectively) are far below the minimum threshold of 0.05 that would indicate a predictive model.

Variable       Score     df    Sig.
Density        13.557     1    0.000
BPM             2.841     1    0.092
BPM*Density     9.526     1    0.002
Overall        13.952     3    0.003

Table 5.1: Results of binary logistic regression with classification success as the dependent variable and BPM and density as inputs show that density is significant while BPM is not.

χ²         df    Sig.
118.6241   72    0.0004

Table 5.2: Hosmer & Lemeshow test shows BPM and density data to be poor predictors of classification success.

These tests conclude that tempo dependence in this dataset does not play a significant role in classification results.

¹ Although the Hosmer and Lemeshow test is known to be misleading for larger sample sizes [53], [54] provides a rule of thumb for adjusting the group-size parameter, which acts to standardize the test for datasets of varying size. For tests on the LMD, with 3216 tracks, group size is calculated as 2 + 8(3216/1000)² ≈ 74.

5.2 Fine-Grain Rhythmic Similarity

A facet of the OP not explored in previous work is its sensitivity to small variations in how a rhythm is played. What musicians refer to as "feel" is often an integral part of the character of a rhythm or genre. This feel, as it will be called here, arises from minute variations in how each hit lands on the rhythmic grid. A useful dichotomy for discussing rhythmic feel is the designation of either "straight" or "swung". In a piece of metered music, a straight feel has strict 8th-note subdivisions, dividing the distance between quarter notes evenly. With a swung feel, subdivisions cleave closer to an 8th-note-triplet grid, landing on the second 8th-note-triplet division of the beat, closer in time to each succeeding quarter note. Although the term "swing" is often used loosely, referring to a more esoteric quality in the music, this formal definition is applied here.

A first step in examining the effect of feel on classification is to quantify how it is represented in the LMD. Using this swung/straight dichotomy, estimates of the percentage of swung tracks within each genre are gathered through listening tests. Given the LMD's size, identifying the feel of every track is prohibitively time-consuming, so random samples of 60 tracks per genre are taken, analyzed and used to extrapolate overall percentages.
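Before turning to those measurements, the straight/swung dichotomy can be made concrete in onset-time terms. The sketch below is illustrative only (it is not the thesis's rendering code): in a straight grid the offbeat falls halfway through the beat, while in a swung grid it falls on the second 8th-note-triplet division, two thirds of the way through.

```python
import numpy as np

def eighth_note_grid(bpm, n_beats=8, swung=False):
    """Onset times (seconds) for quarter-note downbeats plus their
    8th-note offbeats, in either a straight or a swung feel."""
    beat = 60.0 / bpm
    downbeats = np.arange(n_beats) * beat
    # Straight: offbeat at 1/2 of the beat; swung: at 2/3 of the beat,
    # i.e. the second 8th-note-triplet division after the downbeat
    offbeats = downbeats + beat * (2.0 / 3.0 if swung else 0.5)
    return np.sort(np.concatenate([downbeats, offbeats]))
```

Rendering grids like these with percussion samples, over a range of tempos and rhythm templates, is the idea behind the toy dataset used later in this section.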
Table 5.3, with the results of these measurements, shows the LMD to be noticeably divided between straight and swung tracks, with divisions for the most part following genre boundaries.

Genre        Percent Swung
Axé               84
Forró             98
Gaúcha            76
Pagode            82
Sertaneja         25
Bachata           00
Bolero            02
Merengue          00
Salsa             00
Tango             02

Table 5.3: Feel breakdown by genre, showing the percentage of tracks in each genre that are swung.

As shown in the top half of the table, the Brazilian genres, with the exception of Sertaneja, are mostly swung, while the genres shown in the bottom half are almost entirely straight. Brazilian music is marked by rhythms that pull between a straight 8th-note feel and a swung feel, landing between 8th-note and 8th-note-triplet subdivisions, confounding the straight/swung dichotomy and giving the music a unique rhythmic identity. The degree to which these tracks are swung varies: Forró presents many fully swung rhythms, while Pagode presents many rhythms that are only lightly swung, at times closer to straight. The other genres fit the dichotomy well, with most tracks presenting a precise, straight feel.

However, before addressing this considerable connection between feel and genre in the data, it must be asked whether, given the low periodicity resolution used in calculating the OP, this feel information survives in the signal at all. These differences would manifest in the OP not as shifts along the periodicity axis, as with tempo changes, but as minor changes in the relationships between periodicity peaks. As a means of testing this, a binary classification experiment is performed using the OP on a toy dataset designed to isolate feel as a characteristic. The dataset consists of ten rhythm templates, composed using sequenced percussion samples, with a swung and a straight version of each template, rendered at tempos spanning 40 to 325 BPM in 5 BPM increments, for a total of 1160 tagged examples. Note that this is an unrealistic tempo range for a single rhythm; although the LMD as a whole spans this range, each genre typically spans a range of only 100-150 BPM. The rhythms were chosen based on common rhythms from each genre in the LMD so as to approximate its rhythmic complexity. On this data, the system described in Section 4.5, with the output layer width altered for two classes (straight/swung), achieved an accuracy of 78.8%, averaged over ten folds. Considering the OP's partial tempo-sensitivity and the extreme range of tempos in the toy dataset, this shows relative success at differentiating these labels, indicating that the OP is indeed sensitive to rhythmic feel.

To see the effect of this sensitivity on the classification results of Section 4.5, the confusions in Table 4.4 are examined from the perspective of rhythmic feel. Misclassified tracks are each labeled as straight or swung to allow comparison between a track's actual feel and the feel makeup of its predicted genre. Table 5.4 shows these comparisons for the highest instances of confusion between the genres Axé, Bolero, Forró, Gaúcha and Sertaneja; for ease of comparison, the results of Table 5.3 are summarized for these genres. For each actual genre, the table shows the genres predicted by the classifier, the
number of tracks in each prediction, and the percentage of those tracks that are swung.

Genre        % Genre Swung    Confusion Genre    Num Confusions    % Confusions Swung
Bolero            02          Sertaneja                10                  10
Sertaneja         25          Bolero                   18                  00
                              Axé                      11                  55
                              Forró                    12                  83
Gaúcha            76          Bolero                   06                  00
                              Sertaneja                14                  36
                              Axé                      14                  79
Axé               84          Sertaneja                09                  56
                              Gaúcha                   08                  88
Forró             98          Bolero                   07                  86
                              Sertaneja                14                  93
                              Gaúcha                   11                  91
                              Axé                       8                  88

Table 5.4: Comparison of actual genre feel versus predicted genre feel for LMD classification results.

The swing percentages of these predicted tracks, when compared with the actual swing percentages of the predicted genres, reveal patterns indicative of rhythmic feel sensitivity. The pattern is easiest to see in the misclassified Gaúcha tracks. Of these, the ones classified as Bolero are all straight, matching the near-zero swing percentage of that genre. The rest of the misclassified Gaúcha tracks similarly match the feel breakdown of their predicted genres, with a 36% swung rate for the tracks predicted as Sertaneja, loosely matching that genre's swing percentage of 25%, and a 79% swung rate for the tracks predicted as Axé, close to that genre's swing percentage of 84%. Misclassified Sertaneja and Axé tracks show the same pattern. Bolero and Forró do not have significant straight/swung splits and so do not display this effect. These trends support the claim that the classifier is indeed tracking rhythmic feel in this dataset, and given the low classification performance of the genres with the largest straight/swung splits, this sensitivity to feel is likely a contributing factor to the errors seen in the results.

A caveat to this conclusion is that, from a musicological perspective, it may be difficult to disentangle rhythmic feel from larger rhythmic characteristics, leaving the possibility that this apparent sensitivity to feel simply coincides with the system tracking general rhythmic similarity in this dataset. But the results of the controlled swing-classification experiment, along with anecdotal evidence from the classification results on the LMD, support the case for rhythmic feel as a factor. Cases where the only distinguishing characteristic between genres is feel exemplify this effect. For instance, the track "Coroção" by the Sertaneja artists Guilherme & Santiago has a strong backbeat, emphasizing the 2nd and 4th beats of a 4/4 rhythm, which is characteristic of the straight-feel tracks in this class; but it has a heavily swung feel and has been tagged by the classifier as Forró, a genre almost entirely characterized by swung rhythms. Rhythmically, it is closer to the straight Sertaneja tracks, but its heavily swung beat triggers Forró. Another example is the Roberto Carlos song "Música Suave", a Bolero track tagged as Sertaneja. These two genres have many similarities (explored in more detail in the next chapter), and this track in particular, at 80 BPM and with a backbeat rhythm, is characteristic of both. But as a track among many similar Bolero tracks, its primary distinguishing characteristic is that it is swung, pushing it closer to Sertaneja, which has a greater percentage of swung tracks overall.

Chapter 6: Dataset Observations

These characteristics of the OP (partial tempo-dependence and sensitivity to rhythmic feel) go some way towards explaining the error rate seen in Table 4.3, but certain patterns in the results and direct insights from the classifier point back to the data itself.
In this chapter, the Latin Music Dataset is examined in detail, looking at its construction and contents, to shed more light on its effect on the current system design and results.

6.1 Ground Truths

As described in [4], the paper that accompanied its publication, the authors took efforts to produce trustworthy ground truths for the LMD, relying on the expert knowledge of professional dance instructors with over ten years of experience teaching many of the dances associated with these genres. Further, they indicate that they labeled the dataset track by track, noting that labeling by artist or album would lead to poor genre boundaries. As a means of enforcing consistency in genre identification, they gave specific instructions to label each track according to how one would dance to it. This approach takes the view that the way a trained dancer might dance to a song is indicative of genre. Data collection also relied on the libraries of these dance instructors, from which most of the tracks come [4].

This approach produced a sizable and, one can argue, rhythmically relevant dataset. The LMD is larger than the Ballroom dataset by a factor of five, and undergirding the assumption of dance-as-proxy-for-genre is a fundamental connection between dance and rhythm. But the particular viewpoint of genre boundaries represented in the LMD at times undermines the rhythmic relevance of the data for these purposes.

When examining the output of the classifier, and in particular the resulting confusion matrix of Table 4.4, manifestations of this viewpoint are apparent. Of the 36 Bolero tracks incorrectly tagged, five clearly stand out as not Boleros but rather American or British pop/rock songs, by the artists Linda Ronstadt, Frank Sinatra, Rick Astley, The Bangles and Tina Turner. [4] describes Bolero as defined by ballads originating from Spanish folk dances, strongly associated with Mexico and featuring "Spanish vocals and a subtle percussion effect, usually implemented with guitar, conga or bongos" [4].¹ These outliers are ballads but share none of the other common characteristics of Bolero. Although they account for less than 2% of the errors, they point to a larger trend in this class: the inclusion of tracks better described as soft rock. Although the singer José Feliciano is known to record Boleros, four of these misclassified tracks are songs of his that are easily identified as soft rock, having few of the characteristics of Bolero, and there are many more like them in the dataset.

To frame the discussion of genre identification more precisely, the authors of [58], a study on rapid human recognition of genre, provide some useful observations on genre boundaries in music. As they note, the boundaries of a particular genre are often loosely defined, transient and rarely universally agreed upon. A record label with commercial interests in mind may define genre for marketing purposes, while an individual's conception of genre relies on a necessarily limited personal experience and memory of music. [58] further notes that, "Rightly or wrongly, an individual may associate a specific musical feature with a genre", using cues as shortcuts to genre identification. In the LMD, those cues may be rhythmic, but only insofar as the rhythm fits a set of physical movements. This is a viewpoint that places use over content in assigning genre.

¹ Though [4] mentions its Spanish origin, Bolero as represented in this dataset has its stylistic roots in the Cuban Bolero, in duple metre, not to be confused with the traditional Spanish Bolero, in triple metre [57].
A percussionist may listen for particular rhythms and instrumentations when asked to identify a track as one genre or another, but a professional dance instructor may listen instead for tempo, form and general feel. For these outlier Bolero tracks, the cue for inclusion in this class is tempo and ballad song-form, perhaps sufficient signifiers given the guidelines of the dataset. But from a stricter, musicological perspective, the lack of instrumentation and rhythms common to the genre, together with the heavy backbeat rhythms they all share, would signal these tracks as something else. In fact, these misclassifications are rather indicative of the system's success at identifying rhythm similarity: all of these songs were predicted to be Sertaneja by the classifier, a genre dominated by Americanized ballads, rather adeptly matching their rhythms.

6.2 Artist/Chronological Distribution

Beyond issues of ground truth and genre definitions, there are some clear disparities in the makeup of the LMD. Table 6.1 shows some basic infometrics on the data and, although there is a relatively even distribution of tracks over genres, there is an uneven number of artists per genre. Sertaneja and Tango are on the low end, with fewer than ten artists each. Although a few artists may well represent a genre stylistically, a narrow class representation can skew the data toward some unintended idiosyncrasy. Here, Tango is an extreme example of this kind of class-wide coloration.

Genre        # Artists    # Tracks
Axé              37          313
Bachata          64          312
Bolero           99          314
Forró            27          312
Gaúcha           92          309
Merengue         96          314
Pagode           16          306
Salsa            54          309
Sertaneja         9          320
Tango             7          407

Table 6.1: LMD infometrics.

Although the LMD does not specifically include metadata on recording dates, the genre labels themselves reveal chronological details. Axé, Gaúcha and Sertaneja are all fairly modern, with most recordings made in the 1980s or later [59, p.4,220-224]. Contrast this with Tango and Bolero, which have both been around for more than a century [60, p.A2-A12], and a significant disparity in recording dates can be expected in this dataset. This is not problematic for the task of measuring rhythm similarity per se, but with the way Tango is represented in this dataset, recording quality becomes a significant artifact. Though there are seven Tango artists in the dataset, the vast majority of the tracks come from a single artist: out of 407, 315 were recorded by Carlos Gardel between 1917 and 1935.² These recordings, some nearly a century old, have a very particular spectral makeup that distinguishes them from the rest of the dataset. Figure 6.1 shows a side-by-side comparison of the OP of a modern Sertaneja track and that of a Carlos Gardel Tango recording dating from 1917. The Tango OP has no activity below 200Hz or above 3kHz. Contrasted with the wide-frequency-range OP on the left, it is probable that this disparity accounts for much of the accuracy seen for Tango classification: rather than track rhythm, the classifier can easily identify Tango based on spectral content. For this reason, a broad genre representation with a variety of artists is desirable. Bolero, though it matches Tango in age, is represented in the LMD by 99 different artists. This spread ensures a certain level of diversity in recording quality and protects it from such facile identification as is possible with Tango.

Figure 6.1: Left: OP of a modern Sertaneja track (Edson & Hudson, "Hei Voce Ai"). Right: OP of a Tango recording from 1917 (Carlos Gardel, "Mi Noche Triste").

² The dates of the tracks were found in the track filenames.
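As a rough, illustrative check of this recording-quality artifact (a sketch, not part of the thesis pipeline; librosa is assumed and the file path is hypothetical), one can measure what fraction of a track's spectral energy lies inside the 200 Hz to 3 kHz band that bounds the old Gardel recordings:

```python
import numpy as np
import librosa

def band_energy_fraction(path, lo=200.0, hi=3000.0, n_fft=2048):
    """Fraction of total STFT energy inside [lo, hi] Hz. Band-limited
    transfers like the 1917-1935 Tango recordings should score near
    1.0; modern full-range productions substantially lower."""
    y, sr = librosa.load(path, sr=None, mono=True)
    S = np.abs(librosa.stft(y, n_fft=n_fft)) ** 2
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    in_band = (freqs >= lo) & (freqs <= hi)
    return S[in_band].sum() / S.sum()

# Hypothetical usage: a value near 1.0 flags a band-limited recording
# print(band_energy_fraction("gardel_mi_noche_triste.mp3"))
```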
6.3 Brazilian Skew

The LMD was gathered in Paraná, Brazil, by dance instructors trained not only in ballroom dance styles but also in Brazilian cultural dances [4]. So it is not just that, as mentioned before, this dataset represents the viewpoint of dancers, but more specifically of Brazilian dancers. Referring back to [58], a study on genre recognition, one of their hypotheses in the determination of genre is what they call "The Fisheye-Lens Effect". Essentially, it posits that people are more attuned to stylistic variations in music that is familiar to them, defining other music by broad stereotypes to varying degrees [58]. Here, the dataset collectors had a general knowledge of all the genres as part of canonized ballroom dance traditions, yet the data skews Brazilian, reflecting their location and specialized knowledge of regional genres.

This skew can be seen on several levels. The data is split evenly between Hispano-American musical genres (i.e. those originating in countries colonized by Spain) and Brazilian ones. In general, the Hispano-American genres are older, internationally known and stylistically established (Salsa, Tango, Bolero, Merengue [60, p.A2-12]), whereas most of the Brazilian genres came to commercial prominence in the 1980s or later (Axé, Gaúcha, Sertaneja) [61, p.168], [59, p.4,220-224] and are all fairly regionally specific. Figure 6.2 shows the regional specificity of the Brazilian genres versus the international dispersion of Bolero.³ Comparing the genre Gaúcha, a recent and regionally specific genre, with Bolero, an internationally established genre dating back many decades, the disparity in scope and granularity is clear.

Figure 6.2: Top: geographical spread of Bolero vs. Brazilian genres. Bottom: detail of the geographical spread of the Brazilian genres.

³ Location data was only gathered on Bolero and the Brazilian genres.

Referring back to Table 4.3 in Section 4.5, the fact that four of the five worst-performing genres are Brazilian is notable in this context. Also, as the confusion matrix in Table 4.4 shows, many of the highest instances of confusion are between these Brazilian genres. Given that the data is skewed Brazilian, a possible explanation for these trends is the fisheye-lens effect described above. Because of genre familiarity and regional access (Sertaneja and Gaúcha are localized in and around Paraná, where the LMD was collected), these genres are represented in greater detail and include tracks that may diverge stylistically but contain regionally understood cues that mark them as one genre or another, whereas the Hispano-American genres, being older and translated through various cultures, are represented more concisely, based on stereotypical cues. Despite this argument's plausibility and supporting circumstantial evidence, it is abstruse and difficult to prove. A more informative explanation lies in a musicologically founded examination of these genres, explored in the next chapter.
Chapter 7: System Verification Issues

Assuming that the proposed system is indeed tracking musically relevant rhythmic information, a possible explanation for these confusions is the presence of stylistic overlap in the data. Further, considering the amount of within-Brazilian-genre confusion, the implied rhythmic similarity between these tracks points to possible regional influences among these genres. This underscores a weakness in using genre as a proxy for rhythm in gauging the success of the system. With this in mind, this chapter explores the musicological aspects of these genres with the aim of evaluating these claims and, ultimately, the general success of this rhythm similarity system.

7.1 Rhythm-Genre Connection

Knowing that the LMD has been labeled based on associated dance styles, it is reasonable to assume that rhythm would track well with genre; it is the rhythm of a given song that informs the dancer's pace and phrasing, among other things. But in this dataset, the inclusion of genres not well defined by a specific rhythm undermines this relationship. The genre Sertaneja is a prime example of this and, with its low classification rate, highlights a divergence between the assumptions of the current validation process and its intended goals. As the "country pop" genre of Brazil, Sertaneja is less defined by a particular rhythm or dance than by a set of cultural signifiers, musical influences, instrumentation and lyrical content. In one of the few comprehensive musicological examinations of the genre, "River of Tears", the author, Alexander Dent, discusses in great detail the cultural and musical profile of Sertaneja and the music of Brazil's interior. Sertaneja, as a rural music with strong ties to Brazil's Central-Southern culture, presents specific racial influences, distinct from the other Brazilian genres in the LMD. As Dent writes of these rural genres: "they emphasize a different racial geography from that found in mainstream Brazilian musical production," having a history rooted less in the African diaspora of the coastal cities than in the "early contact between Portuguese Catholics and Indians in the Central-Southern region" [62, p. 95]. This stands in contrast to more rhythmically based Brazilian genres, like Axé, that maintain rhythmic traditions from Africa [62]. But as a pop genre that intentionally sought to modernize rural music while claiming its mantle, Sertaneja has been open to, and indeed characterized by, a willingness to borrow and incorporate new and foreign elements. With electric and electronic instruments, a rock drum set and a duo of male singers, its heavy borrowing from American country-pop acts is immediately apparent; its connection to Mexican ballads and Boleros is also both historically rooted and sonically present in the music [61, p. 120]. Sertaneja has come to refer less to the musical traditions of rural Brazil than to rural music generally, as it can be found worldwide [61, p. 133]. Further, this willingness to borrow extends to the other popular musical forms of Brazil, particularly other rural music, insofar as the borrowing, or as some charge, copying, works toward a narrative within the context of rural life [61, p. 133].
One might argue that many of the other genres in the set should also be understood through their cultural attachments, but Sertaneja stands out against the other genres as fundamentally not dance-based. Take for example Salsa, which is described in Grove Music Online as having a "distinctive feel ... based upon a foundation of interlocking rhythmic ostinati", the entry going on to notate what it refers to as the genre's "rhythmic foundation" [63]. Or Forró, described as "an urbanized northeastern-style country dance music", but also referring to a specific rhythm played between a triangle, zabumba and surdo drum [64, p. 90]. It is these kinds of specific and characteristic rhythms that this system aims to track. Sertaneja has a less specific relationship to rhythm, not defined by it but instead often using it as a signifier within a larger musical-cultural narrative. Though the genre is mostly represented here by American-style ballads, its general characteristics place little restraint on what rhythms may be included. This largely explains why Sertaneja has the lowest classification success rate in the results: it is not that Sertaneja presents challenging rhythms for the system to understand, but that its stylistic tangents are rhythmically dissimilar and should be expected not to classify well.

7.2 Inter-genre Influence

The fact of inter-genre influence within Brazil looms large over fixed ideas of genre and notions of musical boundaries. Although the recording industry often pushes conceptions of genre for commercial purposes, among musicians, musical boundaries are there to be crossed; often, from the point of view of musicians at the forefront of musical creation, these genre terms are already obsolete [58]. Within the Brazilian genres in this dataset, two forces undermining genre boundaries come to light: regional and global influences.

Regional influences are most apparent within Sertaneja and Gaúcha, which are both known to take rhythms and instruments from Forró, among others [59, p.224]. There are several instances where the classifier identifies a Gaúcha or Sertaneja track as Forró and the track indeed features the distinctive triangle pattern, bass drum and rhythmic feel specific to Forró, albeit embellished (e.g. Alma Serrana's "Xote Dos Milagres" and Teodoro & Sampaio's "O Tocador", Gaúcha and Sertaneja respectively).

The influence of global popular music has a more dispersed effect here. Despite strong roots in more traditional Brazilian music practices, Axé, Forró, Gaúcha, Pagode and Sertaneja are all commercialized and modernized genres. With the exception of Forró and Pagode, which became popular in the 1940s and 1970s respectively, these genres crystallized into their popular commercial forms in the 1980s and 1990s. This timing coincides with the rise of globalized pop, largely centered around American acts, and as Brazil's music became more globally recognized, these genres, while maintaining their Brazilian identities, increasingly incorporated globalized pop instrumentations and musical cues [61, p.168], [65, p.30-31]. As noted before, Sertaneja holds a strong affinity to American country music, which entails strong backbeats on nearly every track. As such, this rhythmic basis is strongly associated with Sertaneja by the classifier, and many instances of a strong backbeat in other genres trigger a Sertaneja classification.
The many confusions between Bolero and Sertaneja are due not only to the inclusion of Bolero within Sertaneja as a musical influence, but also to this flattening effect of global influences on both genres. The same holds for tracks in other genres predicted as Sertaneja; many contain a strong backbeat, indicating the influence of globalized popular rhythms and aligning them more strongly with Sertaneja despite musical and cultural cues that might indicate otherwise.

7.3 Rhythm as Genre Signifier

Given that the confusion trends in the results largely match known genre influences, the system appears successful at isolating rhythms. But this also underscores the shortcomings of using genre identification to validate rhythmic similarity. Some of these genres might be better separated based on timbre or some other signifier; in the case of Sertaneja, timbre, lyrics and vocal style might prove useful in reducing errors and building invariance to cross-genre influences. The authors of [10] note this in their original paper on the OP and show that a hybrid feature set including MFCCs, a measure of timbre, provides modest increases in classification. But these considerations diverge from the focus here on rhythm-based analysis.

What is clear from the preceding analysis is that, even though the dataset used in this work is dance-based, genre tags are often unsuitable for validating rhythmic similarity. The peak classification score of 91.32% is at once hurt by genres ill-defined with respect to rhythm, as with Sertaneja, and bolstered by artifacts such as the recording quality of the Tango tracks. The classifier was able to produce rhythmically meaningful groupings as well as provide insights into the data at hand, but a dataset divided along stricter rhythmic lines might prove more useful for evaluating success in this task.

Chapter 8: Conclusions

In addressing computational rhythm similarity measures for music, this thesis surveyed current systems, developed a novel state-of-the-art approach, and presented a detailed analysis of the results from both a numerical and a musicological perspective. It examined previous approaches to rhythm similarity and identified the onset pattern (OP) as having high potential for further development, owing to its high-dimensional, semi-tempo-invariant characteristics. To improve results with the OP, this work then presented a new system architecture that makes use of deep architectures for feature learning and classification. The OP was stripped of several of the post-processing and pooling stages detailed in [10] and fed into a deep network with the aim of learning the best representation for this task. Using the widely known Latin Music Dataset [4] for validation, this approach proved worthwhile, producing average classification scores of 91.32%, well above previously obtained results on this data. With added network depth, the system was able to absorb post-processing and pooling stages that are infeasible to tune manually. Following this system architecture discussion, a detailed cross-disciplinary look at the results was taken with the aim of fully assessing the abilities of this approach to rhythm similarity.
Based on observations of the system characteristics, the classification output and the contents of the dataset, this research identified three factors to examine: rhythmic characteristics of the OP, dataset idiosyncrasies, and basic assumptions built into the system's methodology. It was shown through statistical testing on gathered metadata that, though the OP's partial tempo sensitivity does not play a significant role in the results, its sensitivity to fine-grain rhythmic variations does, grouping tracks, sometimes against class markers, based on rhythmic feel. Detailed examination of the dataset highlighted features that undermine its use for the task of rhythm verification. These included its perspective on genre that favors use over content (i.e. associated dance versus musical content), a regional skewing of the data towards Brazilian genres, and class-wide coloration of recordings arising from unevenness in representation (e.g. the recording quality of the Tango tracks).

Perhaps most importantly, this research underscored the tenuous connection between rhythm and genre, an assumed connection on which this approach's verification strategy depended. Although genre proved generally useful as a proxy for rhythm in the context of the dance-based LMD, it did not guarantee rhythmic homogeneity within classes. This was best exemplified by the genre Sertaneja, defined less by rhythm than by a mix of musical influences, song form, lyrical content, instrumentation and an array of social cues. The analysis further highlighted the fact that genre definitions often fail to account for inter-genre influences that can manifest as rhythmic cross-pollination, and showed that, despite class labels, the system was able to identify many of these rhythmic influences across genres in the LMD.

Beyond presenting a state-of-the-art approach for measuring rhythm similarity in music, this work hopes to emphasize the importance of this kind of in-depth analysis. By examining the approach from the perspectives of classification scores, statistical analysis, and genre- and rhythm-specific musicological discussion, this effort was able to shed light not only on the characteristics and performance of this system, but also to bring a greater understanding to the task in general. If the goal is to develop musically relevant methods of analysis, then this kind of domain-specific discussion is highly constructive in going beyond the question of numerical efficacy and deeper into an understanding of what is being expressed musically by these systems.

This approach has been couched in the context of a deep-network architecture for feature learning and classification, but the current implementation leans heavily on a hand-designed feature extraction stage. Given the success of learning components of the feature representation, as shown here, future work should focus on folding more of these feature-design aspects into the learning process.

Bibliography

[1] Yajie Hu and Mitsunori Ogihara. Genre classification for million song dataset using confidence-based classifiers combination. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1083–1084. ACM, 2012.

[2] Lie Lu, Dan Liu, and Hong-Jiang Zhang. Automatic mood detection and tracking of music audio signals. Audio, Speech, and Language Processing, IEEE Transactions on, 14(1):5–18, 2006.

[3] Business Matters: 75,000 Albums Released In U.S. In 2010 – Down 22% From 2009.
http://www.billboard.com/biz/articles/news/1179201/business-matters-75000-albums-released-in-us-in-2010-down-22-from-2009. Accessed: 2013-11-6.

[4] Carlos Nascimento Silla Jr, Alessandro L Koerich, and Celso AA Kaestner. The Latin Music Database. In ISMIR, pages 451–456, 2008.

[5] Juan José Burred, Axel Röbel, and Xavier Rodet. An accurate timbre model for musical instruments and its application to classification. In Workshop on Learning the Semantics of Audio Signals, Athens, Greece, 2006.

[6] Jeremiah D Deng, Christian Simmermacher, and Stephen Cranefield. A study on feature analysis for musical instrument classification. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 38(2):429–438, 2008.

[7] Elias Pampalk, Arthur Flexer, and Gerhard Widmer. Improvements of audio-based music similarity and genre classification. In ISMIR, volume 5, pages 634–637. London, UK, 2005.

[8] Jean-Julien Aucouturier and Francois Pachet. Finding songs that sound the same. In Proc. of IEEE Benelux Workshop on Model Based Processing and Coding of Audio, pages 1–8, 2002.

[9] Simon Dixon, Fabien Gouyon, and Gerhard Widmer. Towards characterisation of music via rhythmic patterns. In ISMIR, 2004.

[10] Tim Pohle, Dominik Schnitzer, Markus Schedl, Peter Knees, and Gerhard Widmer. On rhythm and general music similarity. In ISMIR, pages 525–530, 2009.

[11] Andre Holzapfel, Arthur Flexer, and Gerhard Widmer. Improving tempo-sensitive and tempo-robust descriptors for rhythmic similarity. In Proceedings of the 8th Sound and Music Computing Conference, SMC'11, 2011.

[12] Geoffroy Peeters. Spectral and temporal periodicity representations of rhythm for the automatic classification of music audio signal. Audio, Speech, and Language Processing, IEEE Transactions on, 19(5):1242–1252, 2011.

[13] Emilia Gómez. Tonal description of music audio signals. Unpublished doctoral dissertation, Universitat Pompeu Fabra, Barcelona, Spain, 2006.

[14] Joan Serra, Holger Kantz, Xavier Serra, and Ralph G Andrzejak. Predictability of music descriptor time series and its application to cover song detection. Audio, Speech, and Language Processing, IEEE Transactions on, 20(2):514–525, 2012.

[15] J. Salamon. Melody Extraction from Polyphonic Music Signals. PhD thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2013.

[16] Godfried Toussaint. The Geometry of Musical Rhythm: What Makes a "Good" Rhythm Good? CRC Press, 2013.

[17] Emmanouil Benetos, Simon Dixon, Dimitrios Giannoulis, Holger Kirchhoff, and Anssi Klapuri. Automatic music transcription: Breaking the glass ceiling. In ISMIR, pages 379–384, 2012.

[18] Fabien Gouyon and Simon Dixon. A review of automatic rhythm description systems. Computer Music Journal, 29(1):34–54, 2005.

[19] Emiru Tsunoo, Nobutaka Ono, and Shigeki Sagayama. Rhythm map: Extraction of unit rhythmic patterns and analysis of rhythmic structure from music acoustic signals. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 185–188. IEEE, 2009.

[20] Matthew Wright, W Andrew Schloss, and George Tzanetakis. Analyzing Afro-Cuban rhythms using rotation-aware clave template matching with dynamic programming. In ISMIR, pages 647–652, 2008.

[21] Hui Li Tan, Yongwei Zhu, Susanto Rahardja, and Lekha Chaisorn. Rhythm analysis for personal and social music applications using drum loop patterns. In Multimedia and Expo, 2009. ICME 2009. IEEE International Conference on, pages 1672–1675. IEEE, 2009.
[22] Andre Holzapfel and Yannis Stylianou. A scale transform based method for rhythmic similarity of music. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 317–320. IEEE, 2009.

[23] André Holzapfel and Yannis Stylianou. Scale transform in rhythmic similarity of music. Audio, Speech, and Language Processing, IEEE Transactions on, 19(1):176–185, 2011.

[24] Elias Pampalk. A Matlab toolbox to compute music similarity from audio. In ISMIR, 2004.

[25] Fabien Gouyon. A computational approach to rhythm description: audio features for the computation of rhythm periodicity functions and their use in tempo induction and music content processing. 2005.

[26] Jonathan Foote and Shingo Uchihashi. The beat spectrum: A new approach to rhythm analysis. In ICME, 2001.

[27] George Tzanetakis and Perry Cook. Musical genre classification of audio signals. Speech and Audio Processing, IEEE Transactions on, 10(5):293–302, 2002.

[28] Matthias Gruhne, Christian Dittmar, and Daniel Gaertner. Improving rhythmic similarity computation by beat histogram transformations. In ISMIR, pages 177–182, 2009.

[29] Jesper Højvang Jensen, Mads Græsbøll Christensen, and Søren Holdt Jensen. A tempo-insensitive representation of rhythmic patterns. In Proc. of the 17th European Signal Processing Conf. (EUSIPCO-09). Citeseer, 2009.

[30] Geoffroy Peeters. Rhythm classification using spectral rhythm patterns. In ISMIR, pages 644–647, 2005.

[31] Sebastian Krey and Uwe Ligges. SVM based instrument and timbre classification. In Classification as a Tool for Research, pages 759–766. Springer, 2010.

[32] Adam R Tindale, Ajay Kapur, George Tzanetakis, and Ichiro Fujinaga. Retrieval of percussion gestures using timbre classification techniques. In ISMIR, 2004.

[33] Giulio Agostini, Maurizio Longari, and Emanuele Pollastri. Musical instrument timbres classification with spectral features. EURASIP Journal on Applied Signal Processing, 2003:5–14, 2003.

[34] Róisín Loughran, Jacqueline Walker, Michael O'Neill, and Marion O'Farrell. Musical instrument identification using principal component analysis and multi-layered perceptrons. In Audio, Language and Image Processing, 2008. ICALIP 2008. International Conference on, pages 643–648. IEEE, 2008.

[35] Aliaksandr Paradzinets, Hadi Harb, and Liming Chen. Multiexpert system for automatic music genre classification. Technical report, Ecole Centrale de Lyon, Departement MathInfo, 2009.

[36] Sander Dieleman, Philémon Brakel, and Benjamin Schrauwen. Audio-based music classification with a pretrained convolutional network. 2011.

[37] Philippe Hamel, Sean Wood, and Douglas Eck. Automatic identification of instrument classes in polyphonic and poly-instrument audio. In ISMIR, pages 399–404, 2009.

[38] Eric J Humphrey and Juan P Bello. Rethinking automatic chord recognition with convolutional neural networks. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 2, pages 357–362. IEEE, 2012.

[39] Eric J Humphrey, Juan P Bello, and Yann LeCun. Feature learning and deep architectures: new directions for music informatics. Journal of Intelligent Information Systems, pages 1–21, 2013.

[40] Philippe Hamel and Douglas Eck. Learning features from music audio with deep belief networks. In ISMIR, pages 339–344. Utrecht, The Netherlands, 2010.

[41] Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annotation.
In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2011.

[42] Clément Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. 2013.

[43] Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. Large-Scale Kernel Machines, 34, 2007.

[44] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

[45] Philippe Hamel, Simon Lemieux, Yoshua Bengio, and Douglas Eck. Temporal pooling and multiscale learning for automatic annotation and ranking of music audio. In ISMIR, pages 729–734, 2011.

[46] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 609–616. ACM, 2009.

[47] Deep Learning Tutorial code repository. http://marl.smusic.nyu.edu/wordpress/projects/feature-learning-deep-architectures/deep-learning-tutorial/#source_code. Accessed: 2013-11-6.

[48] Bob L Sturm. Classification accuracy is not enough: On the evaluation of music genre recognition systems. Journal of Intelligent Information Systems, 2013.

[49] Yandre Costa, Luiz Oliveira, Alessandro Koerich, and Fabien Gouyon. Music genre recognition using Gabor filters and LPQ texture descriptors. In Iberoamerican Congress on Pattern Recognition, Havana, Cuba, 2013.

[50] C.N. Silla, A.L. Koerich, and C.A.A. Kaestner. Feature selection in automatic music genre classification. In Multimedia, 2008. ISM 2008. Tenth IEEE International Symposium on, pages 39–44, 2008. doi: 10.1109/ISM.2008.54.

[51] The Echo Nest API overview. http://developer.echonest.com/docs/v4. Accessed: 2013-11-19.

[52] David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. Applied Logistic Regression. Wiley, 2013.

[53] Andrew A Kramer and Jack E Zimmerman. Assessing the calibration of mortality benchmarks in critical care: The Hosmer-Lemeshow test revisited. Critical Care Medicine, 35(9):2052–2056, 2007.

[54] Prabasaj Paul, Michael L Pennell, and Stanley Lemeshow. Standardizing the power of the Hosmer–Lemeshow goodness of fit test in large data sets. Statistics in Medicine, 32(1):67–80, 2013.

[55] David R Cox and EJ Snell. On test statistics calculated from residuals. Biometrika, 58(3):589–594, 1971.

[56] Nico JD Nagelkerke. A note on a general definition of the coefficient of determination. Biometrika, 78(3):691–692, 1991.

[57] Willi Kahl and Israel J. Katz. Bolero. Grove Music Online, Oxford Music Online. Oxford University Press. URL http://www.oxfordmusiconline.com/subscriber/article/grove/music/03444.

[58] Robert O Gjerdingen and David Perrott. Scanning the dial: The rapid recognition of music genres. Journal of New Music Research, 37(2):93–100, 2008.

[59] C. McGowan and R. Pessanha. The Brazilian Sound: Samba, Bossa Nova, and the Popular Music of Brazil. Temple University Press, 1998. ISBN 9781566395458.

[60] R.D. Moore, J. Koegel, and W.A. Clark. Musics of Latin America. W W Norton & Company Incorporated, 2012. ISBN 9780393929652. URL http://books.google.com/books?id=txbKygAACAAJ.

[61] A. Dent. River of Tears: Country Music, Memory, and Modernity in Brazil. e-Duke Books Scholarly Collection. Duke University Press, 2009. ISBN 9780822391098.

[62] C.B. Henry.
Let's Make Some Noise: Axé and the African Roots of Brazilian Popular Music. University Press of Mississippi, 2010. ISBN 9781604733341.

[63] Lise Waxer. Salsa. Grove Music Online, Oxford Music Online. Oxford University Press. URL http://www.oxfordmusiconline.com/subscriber/article/grove/music/24410.

[64] L. Crook. Brazilian Music: Northeastern Traditions and the Heartbeat of a Modern Nation. Number v. 1 in ABC-CLIO World Music Series. ABC-CLIO, 2005. ISBN 9781576072875. URL http://books.google.com/books?id=kSH9HQox_K8C.

[65] Charles A Perrone and Christopher Dunn. Brazilian popular music and globalization. Journal of Popular Music Studies, 14(2):163–165, 2002.