Analyzing the Relationship Among Audio Labels Using the Hubert-Arabie Adjusted Rand Index

Kwan Kim

Submitted in partial fulfillment of the requirements for the Master of Music in Music Technology in the Department of Music and Performing Arts Professions in The Steinhardt School, New York University

Advisor: Dr. Juan P. Bello
Reader: Dr. Kenneth Peacock
Date: 2012/12/11

© Copyright 2012 Kwan Kim

Abstract

With the advent of advanced technology and instant access to the Internet, music databases have grown rapidly, requiring more efficient ways of organizing and providing access to music. A number of automatic classification algorithms have been proposed in the field of music information retrieval (MIR) by means of supervised learning, in which ground truth labels are imperative. The goal of this study is to analyze the statistical relationship among audio labels such as era, emotion, genre, instrument, and origin, using the Million Song Dataset and the Hubert-Arabie adjusted Rand index, in order to observe whether there is a significant correlation between these labels. It is found that cluster validation is low among audio labels, which implies no strong correlation and insufficient co-occurrence between these labels when describing songs.

Acknowledgements

I would like to thank everyone involved in completing this thesis. I especially send my deepest gratitude to my advisor, Juan P. Bello, for keeping me motivated. His critiques and insights consistently pushed me to become a better student. I also thank Mary Farbood for being such a friendly mentor. It was a pleasure to work as her assistant for the past year and a half. I thank the rest of the NYU faculty for providing the opportunity and an excellent program of study. Lastly, I thank my family and wife for their support and love.

Contents

List of Figures
List of Tables
1 Introduction
2 Literature Review
  2.1 Music Information Retrieval
  2.2 Automatic Classification
    2.2.1 Genre
    2.2.2 Emotion
3 Methodology
  3.1 Data Statistics
  3.2 Filtering
    3.2.1 1st Filtering
    3.2.2 2nd Filtering
    3.2.3 3rd Filtering
      3.2.3.1 Co-occurrence
      3.2.3.2 Hierarchical Structure
    3.2.4 4th Filtering
      3.2.4.1 Term Frequency
    3.2.5 5th Filtering
  3.3 Audio Labels
    3.3.1 Era
    3.3.2 Emotion
    3.3.3 Genre
    3.3.4 Instrument
    3.3.5 Origins
  3.4 Audio Features
    3.4.1 k-means Clustering Algorithm
    3.4.2 Feature Matrix
    3.4.3 Feature Scale
    3.4.4 Feature Clusters
  3.5 Hubert-Arabie adjusted Rand Index
4 Evaluation and Discussion
  4.1 K vs. ARI_HA
  4.2 Hubert-Arabie adjusted Rand Index (revisited)
  4.3 Cluster Structure Analysis
    4.3.1 Neighboring Clusters vs. Distant Clusters
    4.3.2 Correlated Terms vs. Uncorrelated Terms
5 Conclusion and Future Work
References

List of Figures

1.1 System Diagram of a Generic Automatic Classification Model
2.1 System Diagram of a Genre Classification Model
2.2 System Diagram of a Music Emotion Recognition Model
2.3 Thayer's 2-Dimensional Emotion Plane (19)
3.1 Clusters
3.2 Co-occurrence - same level
3.3 Co-occurrence - different level
3.4 Hierarchical Structure (Terms)
3.5 Intersection of Labels
3.6 Era Histogram
3.7 Emotion Histogram
3.8 Genre Histogram
3.9 Instrument Histogram
3.10 Origin Histogram
3.11 Elbow Method
3.12 Content-based Cluster Histogram
4.1 K vs. ARI_HA
4.2 Co-occurrence between feature clusters and era clusters
4.3 Co-occurrence between feature clusters and emotion clusters
4.4 Co-occurrence between feature clusters and genre clusters
4.5 Co-occurrence between feature clusters and instrument clusters
4.6 Co-occurrence between feature clusters and origin clusters
4.7 Co-occurrence between era clusters and feature clusters
4.8 Co-occurrence between emotion clusters and feature clusters
4.9 Co-occurrence between genre clusters and feature clusters
4.10 Co-occurrence between instrument clusters and feature clusters
4.11 Co-occurrence between origin clusters and feature clusters

List of Tables

3.1 Overall Data Statistics
3.2 Field List
3.3 Terms
3.4 Labels
3.5 Clusters
3.6 Hierarchical Structure (Clusters)
3.7 Hierarchical Structure (µ and σ)
3.8 Mutually Exclusive Clusters
3.9 Filtered Dataset
3.10 Era Statistics
3.11 Emotion Terms
3.12 Genre Terms
3.13 Instrument Statistics
3.14 Origin Terms
3.15 Audio Features
3.16 Cluster Statistics
3.17 2 x 2 Contingency Table
4.1 ARI_HA
4.2 Term Co-occurrence
4.3 Term Co-occurrence
4.4 Optimal Cluster Validation
4.5 Self-similarity Matrix
4.6 Neighboring Clusters
4.7 Distant Clusters
4.8 Term Correlation
4.9 Term Correlation
4.10 Term Vectors
4.11 Label Cluster Distance
4.12 Label Cluster Distance

Chapter 1

Introduction

In the 21st century, we live in a world where instant access to countless music databases is granted. Online music stores such as the iTunes Store and online music streaming services such as Pandora provide millions of songs from artists all over the world. As music databases have grown rapidly with the advent of advanced technology and the Internet, much more efficient ways of organizing and finding music are required. One of the main tasks in the field of music information retrieval (MIR) is to generate a computational model for the classification of audio data such that it is faster and easier to search for and listen to music. A number of researchers have proposed methods to categorize music into different classifications such as genres, emotions, activities, or artists (1, 2, 3, 4). This automated classification would then let us search for audio data based on their labels – e.g., when we search for "sad" music, the audio emotion classification model returns songs with the label "sad." Regardless of the type of classification model, there is a generic approach to this problem, as outlined in figure 1.1 – extracting audio features, obtaining labels, and computing the parameters to generate a model by means of a supervised machine learning technique.
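The following is a minimal sketch of that generic pipeline, assuming the audio features and labels are already available as arrays; the feature matrix, label vector, and classifier choice are illustrative stand-ins and are not part of any dataset or method described in this thesis.

```python
# Minimal sketch of the generic pipeline in Figure 1.1 (hypothetical feature
# matrix X and label vector y; the classifier choice is illustrative only).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def train_label_classifier(X, y):
    """Fit a supervised model mapping audio features to labels."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = SVC(kernel="rbf")       # any classifier could stand in here
    model.fit(X_train, y_train)     # the 'compute the parameters' step
    print("held-out accuracy:", model.score(X_test, y_test))
    return model

# stand-in data: 500 songs, 30-dimensional features, 4 hypothetical genre labels
X = np.random.randn(500, 30)
y = np.random.randint(0, 4, size=500)
train_label_classifier(X, y)
```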
When utilizing a supervised learning technique to construct a classification model, however, it is imperative that ground truth labels are provided. Obtaining labels involves human subjects, which makes the process expensive and inefficient. In certain cases, the number of labels is bound to be insufficient, making it even harder to collect data. As a result, researchers have used semi-supervised learning methods, in which unlabeled data is combined with labeled data during the training process in order to improve performance (5). However, this approach is also limited to the situation where the data has only one type of label – e.g., if a dataset is labeled by genre, it is possible to construct a genre classification model; however, it is not possible to create a mood classification model without knowing a priori the correspondence between genre and mood labels. This causes a problem when a dataset has only one type of label and needs to be classified into a different label class. It would be much more efficient and less expensive if there existed a statistical correspondence among different audio labels, since it would then be easy to predict a different label class from the same dataset. Therefore, the goal of this study is to define the statistical relationship among different audio labels such as genre, emotion, era, origin, and instrument, using the Million Song Dataset (6), applying an unsupervised learning technique, i.e. the k-means algorithm, and calculating the Hubert-Arabie adjusted Rand (ARI_HA) index (7).

The outline of this thesis is as follows: a literature review of previous MIR studies on automatic classification models is provided in Chapter 2. The detailed methodology and data analysis are given in Chapter 3. Based on the results obtained in Chapter 3, possible interpretations of the data are discussed in Chapter 4. Finally, concluding remarks and future work are laid out in Chapter 5.

Figure 1.1: System Diagram of a Generic Automatic Classification Model - Labels are used only in the supervised learning case

Chapter 2

Literature Review

2.1 Music Information Retrieval

There are many ways to categorize music. One of the traditional ways is by its metadata, such as the name of the song, artist, or album, which is known as tag-based or text-based categorization (8). As music databases have grown virtually countless, more efficient ways to query and retrieve music are required. As opposed to tag-based query and retrieval, which only enables us to retrieve songs that we have a priori information about, content-based query and retrieval allows us to find songs in different ways - e.g., it allows us to find songs similar in musical context or structure, and it can also recommend songs based on musical labels such as emotion. Music information retrieval (MIR) is a widely and rapidly growing research topic in the multimedia processing industry, which aims at extending the understanding and usefulness of music data through the research, development, and application of computational approaches and tools. As a novel way of retrieving songs or creating a playlist, researchers have come up with a number of classification methods using different labels such as genre, emotion, or cover song (1, 2, 3, 4), so that each classification model can retrieve a song based on its label or musical similarities. These methods are different from the tag-based method since audio features are extracted and analyzed prior to constructing a computational model.
Therefore, a retrieved song is based on the content of the audio, not on its metadata.

2.2 Automatic Classification

In previous studies, most audio classification models are based on supervised learning methods, in which musical labels such as genre or emotion are required (1, 2, 3, 4). Using labels along with well-defined high-dimensional musical features, learning algorithms are trained on the data to find possible relationships between the features and a label, so that for any given unknown (test) data, the model can correctly recognize the label.

2.2.1 Genre

Tzanetakis et al. (1, 9) are among the earliest researchers who worked on automatic genre classification. Instead of requiring the musical genre of a song to be assigned manually, an automatic genre classification model generates a genre label for a given song after comparing its musical features with the model. In (1, 9) the authors used three feature sets, which describe timbral texture, rhythmic content, and pitch content, respectively. Features such as spectral centroid, rolloff, flux, zero crossing rate, and MFCCs (10) were extracted to construct a feature vector that describes the timbral texture of music. An automatic beat detection algorithm (4, 11) was used to calculate the rhythmic structure of music and produce a feature vector that describes rhythmic content. Lastly, pitch detection techniques (12, 13, 14) were used to construct a pitch content feature vector. Figure 2.1 represents the system overview of the automatic genre classification model described in (1, 9).

Figure 2.1: System Diagram of a Genre Classification Model - Gaussian Mixture Model (GMM) is used as a classifier

2.2.2 Emotion

In 2006, the work of L. Lu et al. (15) was one of only a few studies that provided an in-depth analysis of mood detection and tracking of music signals using acoustic features extracted directly from the audio waveform, instead of using MIDI or symbolic representations. Although it has been an active research topic, researchers have consistently faced the same problem with the quantification of music emotion due to its inherently subjective nature. Recent studies have sought ways to minimize the inconsistency among labels. Skowronek et al. (16) paid close attention to the material collection process. They obtained a large number of labelled data from 12 subjects and accounted for only those in agreement with one another in order to exclude the ambiguous ones. In (17), the authors created a collaborative game that collects dynamic (time-varying) labels of music mood from two players and ensures that the players cross-check each other's labels in order to build a consensus.

Defining mood classes is not an easy task. There have been mainly two approaches to defining mood: categorical and continuous. In (15) mood labels are classified into adjectives such as happy, angry, sad, or sleepy. However, the authors in (18) defined mood as a continuous regression problem, as described in figure 2.2, and mapped emotion onto the two-dimensional Thayer's plane (19) shown in figure 2.3. Recent studies focus on multi-modal classification using both lyrics and audio content to quantify music emotion (20, 21, 22), on dynamic music emotion modeling (23, 24), or on unsupervised learning approaches for mood recognition (25).
Figure 2.2: System Diagram of a Music Emotion Recognition Model - Arousal and Valence are two independent regressors

Figure 2.3: Thayer's 2-Dimensional Emotion Plane (19) - Each axis is used as an independent regressor

Chapter 3

Methodology

Previous studies have constructed automatic classification models using a relationship between audio features and one type of label (e.g. genre or mood). As stated in Chapter 1, however, if a statistical relationship among several audio labels can be defined, it could reduce the cost of constructing automatic classification models. In order to address this problem, two things are needed:

1. Big music data with multiple labels: the Million Song Dataset (6)
2. A cluster validation method: the Hubert-Arabie adjusted Rand index

A large dataset is required to minimize the bias and noisiness of labels. Since labels are acquired from users, a small amount of music data would lead to large variance among labels and thus large error. A cluster validation method is required to compare sets of clusters created by different labels, hence the Hubert-Arabie adjusted Rand index.

3.1 Data Statistics

The Million Song Dataset (6) consists of a million files in HDF5 format, from which various information can be retrieved, including metadata such as the name of the artist, the title of the song, or tags (terms), and musical features such as chroma, tempo, loudness, mode, or key. Table 3.1 shows the overall statistics of the dataset and table 3.2 shows a list of fields available in the files of the dataset.

No. | Type                           | Total
1   | Songs                          | 1,000,000
2   | Data                           | 273 GB
3   | Unique Artists                 | 44,745
4   | Unique Terms                   | 7,643
5   | Artists with at least one term | 43,943
6   | Identified Cover Songs         | 18,196

Table 3.1: Overall Data Statistics - Statistics of the Million Song Dataset

3.2 Filtering

LabROSA, the distributor of the Million Song Dataset, also provides all the necessary functions to access and manipulate the data from Matlab. The 'HDF5 Song File Reader' function converts .h5 files into a Matlab object, which can be further used to extract labels using the 'get artist terms' function and features using the 'get segments pitches', 'get tempo', and 'get loudness' functions. The labels are therefore used to create several sets of clusters, while the audio features are used to form another set of clusters. Figure 3.1 illustrates the different sets of clusters. Although it would be ideal to take account of all million songs, due to the noisiness of the data, the dataset must undergo the following filtering process to get rid of unnecessary songs:

1. All terms are categorized into one of 5 label classes.
2. Create a set of clusters based on each label class.
3. Find the hierarchical structure of each label class.
4. Make each set of clusters mutually exclusive.
5. Songs that contain at least one term from all of the five label classes are retrieved.

Figure 3.1: Clusters - Several sets of clusters can be made using labels and audio features

3.2.1 1st Filtering

As shown in table 3.1, there are 7,643 unique terms that describe songs in the dataset. Some examples of these terms are shown in table 3.3. These unique terms have to be filtered so that meaningless terms are ignored. In other words, five label classes are chosen so that each term can be categorized into one of the following five labels: era, emotion, genre, instrument, and origin. In doing so, any term that cannot be described by one of those labels is dropped. Table 3.4 shows the total number of terms that belong to each label.
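A minimal sketch of this first filtering step is given below, assuming each raw Echo Nest term is checked against per-class keyword vocabularies; the tiny vocabularies here are illustrative stand-ins for the full term lists summarized in Table 3.4, not the actual lists used in this study.

```python
# Sketch of the 1st filtering: assign each raw term to one or more of the five
# label classes, or drop it entirely. Vocabularies are illustrative stand-ins.
LABEL_VOCAB = {
    "era":        ["00s", "60s", "80s", "90s"],
    "emotion":    ["chill", "sad", "romantic"],
    "genre":      ["alternative", "jazz", "rock", "metal"],
    "instrument": ["piano", "guitar", "synth"],
    "origin":     ["british", "german", "texas"],
}

def categorize_term(term):
    """Return (label_class, base_term) pairs found inside a raw term, e.g.
    '00s alternative' -> [('era', '00s'), ('genre', 'alternative')]."""
    hits = []
    for label_class, vocab in LABEL_VOCAB.items():
        for base in vocab:
            if base in term:
                hits.append((label_class, base))
    return hits  # an empty list means the term is dropped

print(categorize_term("00s alternative"))  # [('era', '00s'), ('genre', 'alternative')]
print(categorize_term("grammy winner"))    # [] -> ignored
```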
Note the small number of terms in each label category compared to the original 7,643 unique terms. This is because many terms cross-reference each other. For example, '00s', '00s alternative', and '00s country' all count as unique terms, but they are all represented as '00s' under the 'era' label class. Similarly, 'alternative jazz', 'alternative rock', 'alternative r & b', and 'alternative metal' are simply 'alternative', 'jazz', 'rock', 'r & b', and 'metal' under the 'genre' label class.

Field Name                 | Type           | Description
analysis sample rate       | float          | sample rate of the audio used
artist familiarity         | float          | algorithmic estimation
artist hotttnesss          | float          | algorithmic estimation
artist id                  | string         | Echo Nest ID
artist name                | string         | artist name
artist terms               | array string   | Echo Nest tags
artist terms freq          | array float    | Echo Nest tags freqs
audio md5                  | string         | audio hash code
bars confidence            | array float    | confidence measure
bars start                 | array float    | beginning of bars
beats confidence           | array float    | confidence measure
beats start                | array float    | result of beat tracking
danceability               | float          | algorithmic estimation
duration                   | float          | in seconds
energy                     | float          | energy from listener perspective
key                        | int            | key the song is in
key confidence             | float          | confidence measure
loudness                   | float          | overall loudness in dB
mode                       | int            | major or minor
mode confidence            | float          | confidence measure
release                    | string         | album name
sections confidence        | array float    | confidence measure
sections start             | array float    | largest grouping in a song
segments confidence        | array float    | confidence measure
segments loudness max      | array float    | max dB value
segments loudness max time | array float    | time of max dB value
segments loudness start    | array float    | dB value at onset
segments pitches           | 2D array float | chroma feature
segments start             | array float    | musical events
segments timbre            | 2D array float | texture features
similar artists            | array string   | Echo Nest artist IDs
song hotttnesss            | float          | algorithmic estimation
song id                    | string         | Echo Nest song ID
tempo                      | float          | estimated tempo in BPM
time signature             | int            | estimate of number of beats/bar
time signature confidence  | float          | confidence measure
title                      | string         | song title
track id                   | string         | Echo Nest track ID

Table 3.2: Field List - A list of fields available in the files of the dataset

No.  | Terms
1    | '00s'
2    | '00s alternative'
3    | '00s country'
...  | ...
3112 | 'gp worldwide'
3113 | 'grammy winner'
3114 | 'gramophone'
...  | ...
5787 | 'punky reggae'
5788 | 'pure black metal'
5789 | 'pure grunge'
...  | ...

Table 3.3: Terms - Examples of terms (tags)

Label      | Total
era        | 17
emotion    | 96
genre      | 436
instrument | 78
origin     | 635

Table 3.4: Labels - Number of terms belonging to each label class

In this way, the total number of terms in each category is reduced and it is still possible to search for songs without using repetitive terms. For example, a song carrying the 'alternative jazz' term can be found with both the 'alternative' and 'jazz' keywords, instead of 'alternative jazz'. In addition, composite terms such as 'alternative jazz' or 'ambient electronics' are not included, since they are at the lowest hierarchical level and the number of elements that belong to such clusters is small.

3.2.2 2nd Filtering

After all unique terms are filtered into one of the five label classes, each term belonging to each label class is regarded as a cluster, as shown in table 3.5. Note that it is still not certain that all terms are truly representative as independent clusters, as it must be taken into account that there are a few hierarchical layers among terms – e.g.
'piano' and 'electric piano' might not be at the same level of hierarchy in the 'instrument' label class. In order to account for differences in layers, the co-occurrence between each pair of clusters is calculated, as explained in the next section.

Label      | Example Clusters
era        | '00s' '1700s' '1910s' '19th century'
emotion    | 'angry' 'chill' 'energetic' 'horror' 'mellow'
genre      | 'ambient' 'blues' 'crossover' 'dark' 'electronic' 'opera'
instrument | 'accordion' 'banjo' 'clarinet' 'horn' 'ukelele' 'laptop'
origin     | 'african' 'belgian' 'dallas' 'hongkong' 'moroccan'

Table 3.5: Clusters - Each term forms a cluster within each label class

3.2.3 3rd Filtering

3.2.3.1 Co-occurrence

Within a single label class, there are a number of different terms, each of which could possibly represent an individual cluster. However, while certain terms inherently possess a clear meaning, some do not – e.g. in the 'genre' label class, the distinctions between 'dark metal' and 'death metal' or 'acid metal' and 'acid funk' might not be obvious. In order to avoid ambiguity among clusters, the co-occurrences of two clusters are measured. The co-occurrences of a pair of clusters can be calculated as follows:

$$cooc_{a,b} = \frac{\mathrm{intersect}(a, b)}{\mathrm{intersect}(a, a)}, \qquad cooc_{b,a} = \frac{\mathrm{intersect}(a, b)}{\mathrm{intersect}(b, b)} \tag{3.1}$$

where intersect(i, j) counts the number of elements in both cluster i and cluster j. Therefore, if both clusters have high or low co-occurrence values, it implies that there is a large or small overlap between the clusters, while if only one of the two clusters has a high value and the other has a low value, it implies that one cluster is a subset of the other, as illustrated in figures 3.2 and 3.3. Also note that if one cluster is a subset of the other, it implies that they are not at the same hierarchical level.

Figure 3.2: Co-occurrence - same level - (a) small overlap between two clusters; (b) large overlap between two clusters

Figure 3.3: Co-occurrence - different level - (a) Cluster B is a subset of Cluster A; (b) Cluster A has a relatively larger number of elements than Cluster B, most of which belong to the intersection

Therefore, a threshold is set such that if (cooc_{a,b} > 0.9 and cooc_{b,a} < 0.1) or (cooc_{a,b} < 0.1 and cooc_{b,a} > 0.9), then cluster A is a subset of cluster B or vice versa. If neither condition is met, the two clusters are at the same hierarchical level. In doing so, the layers of the hierarchy can be retrieved.

3.2.3.2 Hierarchical Structure

After obtaining co-occurrence values for all pairs of clusters, the structure of the clusters in each label class can be known. Table 3.6 shows the hierarchical structure of each label class and figure 3.4 shows some of the terms at different hierarchical levels.

Label      | 1st Layer | 2nd Layer | 3rd Layer | Total
era        | 3         | 14        | empty     | 17
emotion    | 3         | 93        | empty     | 96
genre      | 27        | 127       | 274       | 428
instrument | 18        | 72        | empty     | 90
origin     | 13        | 135       | 487       | 635

Table 3.6: Hierarchical Structure (Clusters) - Total number of clusters at different layers in each label class

Figure 3.4: Hierarchical Structure (Terms) - Examples of terms at different layers in each label class

The structure correlates well with intuition, with more general terms such as 'bass' or 'guitar' at a higher level, while terms such as 'acoustic bass' or 'classical guitar' are at a lower level. The number of songs in each cluster also matches intuition well. Terms at a high level of the hierarchy have a large number of songs, while relatively few songs belong to terms at a low level.
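A minimal sketch of the co-occurrence test of Equation 3.1 and the subset threshold used above is shown below, assuming each cluster is represented as a set of track IDs; the toy 'piano' and 'electric piano' sets are hypothetical.

```python
# Sketch of Equation 3.1 and the 0.9/0.1 subset rule, assuming each cluster is a
# Python set of track IDs.
def cooccurrence(a, b):
    """cooc_{a,b} = |a ∩ b| / |a|,  cooc_{b,a} = |a ∩ b| / |b|."""
    inter = len(a & b)
    return inter / len(a), inter / len(b)

def hierarchy_relation(a, b, hi=0.9, lo=0.1):
    """Classify a pair of clusters as subset / superset / same hierarchical level."""
    cooc_ab, cooc_ba = cooccurrence(a, b)
    if cooc_ab > hi and cooc_ba < lo:
        return "a is a subset of b"
    if cooc_ab < lo and cooc_ba > hi:
        return "b is a subset of a"
    return "same hierarchical level"

piano = set(range(1000))                 # toy track-ID sets
electric_piano = set(range(950, 1000))   # entirely contained in 'piano'
print(hierarchy_relation(electric_piano, piano))  # 'a is a subset of b'
```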
Now that the structure of the clusters for each label class is known, it must be carefully decided which layer should be used, as there is a tradeoff between the number of clusters and the number of songs belonging to each cluster: a higher layer has a small total number of clusters, but each cluster contains a sufficient number of songs, and vice versa. In order to make a logical decision, three different thresholds are set: the number of clusters, N, the mean, µ, and the standard deviation, σ, of all levels are calculated and shown in table 3.7. The rationale is that each layer within a label class must have enough clusters and that each cluster must contain a sufficient number of songs, while the variance of the distribution is as small as possible. The author defined the three thresholds as follows:

N > 5
µ > 5,000
σ as small as possible

The 1st layer from the instrument class and the 2nd layer from the era, emotion, genre, and origin label classes are selected, as shown in table 3.7.

Label      | 1st Layer µ (N), σ   | 2nd Layer µ (N), σ  | 3rd Layer µ (N), σ
era        | 59,524 (3), 47,835   | 21,686 (14), 43,263 | empty
emotion    | 23,905 (3), 15,677   | 5,736 (93), 19,871  | empty
genre      | 35,839 (27), 79,370  | 5,744 (127), 14,307 | 2,421 (274), 6,816
instrument | 39,744 (18), 76,272  | 998 (72), 2,464     | empty
origin     | 69,452 (13), 141,440 | 8,804 (135), 23,383 | 644 (487), 2,008

Table 3.7: Hierarchical Structure (µ and σ) - The mean and the standard deviation for each layer. The number in parentheses denotes the number of clusters. The selected layers are the 1st layer for instrument and the 2nd layer for era, emotion, genre, and origin.

3.2.4 4th Filtering

3.2.4.1 Term Frequency

After finding the structure of the clusters and selecting the layer in the previous section, all the clusters within the same layer must become mutually exclusive, leaving no overlapping elements among clusters. Therefore, after finding the intersections among clusters, it needs to be decided to which cluster each overlapping element should belong. In order to resolve conflicts among multiple clusters, the frequency of terms is retrieved for every single element via the provided function 'get artist terms freq'. For every element within an intersection, the term frequency values are compared, and whichever term has the higher value takes the element, while the other loses it. In this way, the total number of clusters is reduced via merging and all the clusters become mutually exclusive. Table 3.8 indicates the total number of songs in each label class.

Label      | # of Songs
era        | 387,977
emotion    | 394,860
genre      | 700,778
instrument | 384,509
origin     | 871,631

Table 3.8: Mutually Exclusive Clusters - Total number of songs in mutually exclusive clusters

3.2.5 5th Filtering

Since most songs are given multiple terms, they might belong to several label classes – e.g. a song with the '00s' and 'alternative jazz' terms belongs to both the 'era' and 'genre' label classes. Therefore, after obtaining the indexes of the songs that belong to each category, the intersection of these indexes is retrieved so that only songs carrying a term from each of the five label classes are considered. This process is illustrated in figure 3.5. Finally, the total number of clusters in each label class and the total number of songs used in the study after all filtering processes are shown in table 3.9.
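A minimal sketch of the 4th and 5th filtering steps is given below, assuming per-song dictionaries of term frequencies and per-class cluster assignments; these data structures are assumptions for illustration rather than the dataset's actual API.

```python
# Sketch of the 4th filtering (term-frequency tie-breaking) and the 5th filtering
# (keeping only songs assigned a cluster in all five label classes).
def resolve_cluster(song_term_freq, candidate_terms):
    """A song falling into several clusters of one label class is kept only in the
    cluster whose term has the highest frequency for that song."""
    hits = [t for t in candidate_terms if t in song_term_freq]
    if not hits:
        return None
    return max(hits, key=lambda t: song_term_freq[t])

def intersect_label_classes(assignments):
    """`assignments` maps label class -> {song_id: cluster}; keep only song IDs
    present in every label class."""
    keep = set.intersection(*(set(a) for a in assignments.values()))
    return {label: {s: a[s] for s in keep} for label, a in assignments.items()}

song = {"piano": 0.9, "electric piano": 0.4}           # hypothetical term frequencies
print(resolve_cluster(song, ["piano", "electric piano"]))  # 'piano'
```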
Figure 3.5: Intersection of Labels - Songs that belong to all five label classes are chosen

         | Songs     | Era | Emotion | Genre | Instrument | Origin
Original | 1,000,000 | 14  | 91      | 122   | 17         | 99
Filtered | 41,269    | 7   | 34      | 44    | 7          | 33

Table 3.9: Filtered Dataset - Total number of songs and clusters after filtering

3.3 Audio Labels

3.3.1 Era

After all the filtering processes, 7 clusters are selected for the era label class. Terms such as '16th century' or '21st century' as well as '30s' and '40s' are successfully removed via merging and the hierarchy. Table 3.10 and figure 3.6 show the statistics of the remaining terms. Note that the distribution is negatively skewed, which is intuitive, because more songs exist in recorded format in later decades than in the early 20th century due to advances in recording technology. It also makes sense that the '80s' cluster contains the most songs, because people use the term '80s' to describe '80s' rock or '80s' music more often than they use '00s' to describe '00s' music or '00s' pop.

Era          | # of songs
50s          | 661
60s          | 3,525
70s          | 2,826
80s          | 17,359
90s          | 9,555
00s          | 6,111
20th century | 1,232
Total        | 41,269

Table 3.10: Era Statistics - Number of songs belonging to each era cluster

Figure 3.6: Era Histogram - Distribution of songs based on era label

3.3.2 Emotion

There are a total of 34 clusters in the emotion label class, which are shown in table 3.11. Note the uneven distribution of songs in the emotion label class shown in figure 3.7. Clusters such as 'beautiful', 'chill', and 'romantic' together make up about one third of the total songs, while relatively few songs belong to clusters such as 'evil', 'haunting', and 'uplifting'.

Emotion: 'beautiful', 'brutal', 'calming', 'chill', 'energetic', 'evil', 'gore', 'grim', 'happy', 'harsh', 'haunting', 'horror', 'humorous', 'hypnotic', 'intense', 'inspirational', 'light', 'loud', 'melancholia', 'mellow', 'moody', 'obscure', 'patriotic', 'peace', 'relax', 'romantic', 'sad', 'sexy', 'strange', 'trippy', 'uplifting', 'wicked', 'wistful', 'witty'

Table 3.11: Emotion Terms - All the emotion terms

Figure 3.7: Emotion Histogram - Distribution of songs based on emotion label

3.3.3 Genre

A total of 44 genre clusters are created and shown in table 3.12, and their distribution is shown in figure 3.8. Also note that certain genre terms such as 'hip hop', 'indie', and 'wave' have more songs than others like 'emo' or 'melodic'.

Genre: 'alternative', 'ambient', 'ballade', 'blues', 'british', 'christian', 'classic', 'country', 'dance', 'dub', 'electronic', 'eurodance', 'hard style', 'hip hop', 'indie', 'industrial', 'instrumental', 'lounge', 'modern', 'neo', 'new', 'noise', 'nu', 'old', 'opera', 'orchestra', 'post', 'power', 'progressive', 'r&b', 'rag', 'salsa', 'smooth', 'soft', 'soundtrack', 'swing', 'synth pop', 'techno', 'thrash', 'tribal', 'urban', 'waltz', 'wave', 'zouk'

Table 3.12: Genre Terms - All the genre terms

Figure 3.8: Genre Histogram - Distribution of songs based on genre label

3.3.4 Instrument

There are only 7 instrument clusters after the filtering processes. The name of each cluster and the number of songs belonging to the corresponding cluster are given in table 3.13.
The values make perfect sense, as 'guitar', 'piano', and 'synth' have many songs in their clusters while relatively few songs belong to 'saxophone' and 'violin'. Figure 3.9 shows the histogram of the instrument clusters.

Instrument  | # of songs
'bass'      | 2,444
'drum'      | 5,103
'guitar'    | 9,731
'piano'     | 5,667
'saxophone' | 1,340
'synth'     | 16,662
'violin'    | 322

Table 3.13: Instrument Statistics - Number of songs belonging to each instrument cluster

Figure 3.9: Instrument Histogram - Distribution of songs based on instrument label

3.3.5 Origins

There are 33 different origin clusters, as laid out in table 3.14. Note that clusters such as 'american', 'british', 'dc', and 'german' have a large number of songs, while clusters such as 'new orleans', 'suomi', or 'texas' consist of relatively few songs. Also note that the terms 'american' and 'texas' both appear as independent clusters, while it seems intuitive that 'texas' should be a subset of 'american'. This is because, when describing a song with an origin label, certain songs are more specifically described as 'texas' rather than 'american' or 'united states' – e.g. country music. Finally, the statistics of the origin label class are shown in figure 3.10.

Origin: 'african', 'american', 'belgium', 'british', 'canada', 'cuba', 'dc', 'east coast', 'england', 'german', 'ireland', 'israel', 'italian', 'japanese', 'los angeles', 'massachusetts', 'mexico', 'nederland', 'new orleans', 'new york', 'norway', 'poland', 'roma', 'russia', 'scotland', 'southern', 'spain', 'suomi', 'sweden', 'tennessee', 'texas', 'united states', 'west coast'

Table 3.14: Origin Terms - All the origin terms

Figure 3.10: Origin Histogram - Distribution of songs based on origin label

3.4 Audio Features

Audio features are extracted in order to construct feature clusters with a clustering algorithm, using provided functions such as 'get segments timbre' or 'get segments pitches'. Table 3.15 shows the list of extracted features. It takes about 30 ms to extract a feature from one song, which would make a total of about 8 hours for a million songs. However, since only 41,269 songs are used, the computation time is reduced to less than an hour.

No. | Feature         | Function
1   | Chroma          | 'get segments pitches'
2   | Texture         | 'get segments timbre'
3   | Tempo           | 'get tempo'
4   | Key             | 'get key'
5   | Key Confidence  | 'get key confidence'
6   | Loudness        | 'get loudness'
7   | Mode            | 'get mode'
8   | Mode Confidence | 'get mode confidence'

Table 3.15: Audio Features - Several audio features are extracted via their respective functions

3.4.1 k-means Clustering Algorithm

Content-based clusters can be constructed with a clustering algorithm, an unsupervised learning method that does not require pre-labeled data and uses only features to construct clusters of similar data points. There are several variants of clustering algorithms, such as k-means, k-median, centroid-based, or single-linkage (26, 27, 28). In this study, the k-means clustering algorithm is used to build the automatic clusters. The basic structure of the algorithm is defined in the following steps (29, 30):

1. Define a similarity measurement metric, d (e.g. Euclidean, Manhattan, etc.).
2. Randomly initialize k centroids, µ_k.
3. For all data points x, find the µ_k that returns the minimum d.
4. Find C_k, the cluster that includes the set of points assigned to µ_k.
5. Recalculate µ_k for every C_k.
6. Repeat steps 3 through 5 until convergence.
7. Repeat steps 2 through 6 multiple times to avoid local minima.

The author used the (squared) Euclidean distance as the similarity measurement metric, d, and computed the centroid mean of each cluster as follows:

$$d^{(i)} := \|x^{(i)} - \mu_k\|^2 \tag{3.2}$$

$$\mu_k := \frac{1}{|C_k|} \sum_{i \in C_k} x^{(i)} \tag{3.3}$$

where x^{(i)} is the position of the i-th point. C_k is constructed by finding the c^{(i)} that minimizes (3.2), where c^{(i)} is the index of the centroid closest to x^{(i)}. In other words, a point belongs to the cluster for which the Euclidean distance between the point and the cluster centroid is minimal.

3.4.2 Feature Matrix

Using the extracted audio features such as chroma, timbre, key, key confidence, mode, mode confidence, tempo, and loudness, a feature matrix F of size I x J is constructed, where I is the total number of points (= 41,269) and J is the total number of features (= 30 – i.e. both the chroma and timbre features are averaged across time, resulting in 12 x 1 dimensions each per point). Therefore, the cost function of the algorithm is:

$$\frac{1}{I} \sum_{i=1}^{I} d^{(i)} \tag{3.4}$$

and the optimization objective is to minimize (3.4).

3.4.3 Feature Scale

Feature scaling is necessary as each feature vector has a different range and therefore needs to be normalized for equal weighting. The author used mean/standard-deviation scaling for each feature f_j as follows:

$$\hat{f}_j = \frac{f_j - \mu_{f_j}}{\sigma_{f_j}} \tag{3.5}$$

3.4.4 Feature Clusters

It is often arbitrary what the correct number K should be, and there is no algorithm that leads to a definitive answer. However, the elbow method is often used to determine the number of clusters, K. Figure 3.11 shows a plot of the cost function for different K. Either K = 8 or K = 10 marks the elbow of the plot and is a possible candidate for the number of clusters. In this study, K = 10 is chosen.

Figure 3.11: Elbow Method - K = 8 or K = 10 is a possible number of clusters

After choosing the value of K, the structure of the clusters is found and shown in figure 3.12.

Cluster | # of Songs | Cluster | # of Songs
1       | 4,349      | 6       | 3,933
2       | 4,172      | 7       | 2,544
3       | 4,128      | 8       | 5,149
4       | 4,475      | 9       | 2,436
5       | 5,866      | 10      | 4,217

Table 3.16: Cluster Statistics - The number of songs within each cluster

Figure 3.12: Content-based Cluster Histogram - Distribution of songs based on audio features

3.5 Hubert-Arabie adjusted Rand Index

After obtaining six sets of clusters – i.e. five from labels and one from audio features – the relationship between any pair of cluster sets can be found by calculating the Hubert-Arabie adjusted Rand (ARI_HA) index (7, 31). The ARI_HA index quantifies cluster validation by comparing the generated clusters with the original structure of the data. Therefore, by comparing two different sets of clusters, the correlation between the two clusterings can be drawn. The ARI_HA index is measured as:

$$\mathrm{ARI}_{HA} = \frac{\binom{N}{2}(a + d) - [(a + b)(a + c) + (c + d)(b + d)]}{\binom{N}{2}^2 - [(a + b)(a + c) + (c + d)(b + d)]} \tag{3.6}$$

where N is the total number of data points and a, b, c, d represent four different types of pairs.
Let A and B be two sets of clusters and P and Q be the number of clusters in each set; then a, b, c, and d are defined as follows:

a : pairs of elements in the same group in both A and B
b : pairs of elements in the same group in B but in different groups in A
c : pairs of elements in the same group in A but in different groups in B
d : pairs of elements in different groups in both A and B

which can be described by the contingency table shown in table 3.17. This leads to the computation of a, b, c, and d as follows:

$$a = \frac{\sum_{p=1}^{P}\sum_{q=1}^{Q} t_{pq}^2 - N}{2} \tag{3.7}$$

$$b = \frac{\sum_{p=1}^{P} t_{p+}^2 - \sum_{p=1}^{P}\sum_{q=1}^{Q} t_{pq}^2}{2} \tag{3.8}$$

$$c = \frac{\sum_{q=1}^{Q} t_{+q}^2 - \sum_{p=1}^{P}\sum_{q=1}^{Q} t_{pq}^2}{2} \tag{3.9}$$

$$d = \frac{\sum_{p=1}^{P}\sum_{q=1}^{Q} t_{pq}^2 + N^2 - \sum_{p=1}^{P} t_{p+}^2 - \sum_{q=1}^{Q} t_{+q}^2}{2} \tag{3.10}$$

where t_{pq}, t_{p+}, and t_{+q} denote the total number of elements belonging to both the p-th and q-th clusters, the total number of elements belonging to the p-th cluster, and the total number of elements belonging to the q-th cluster, respectively. ARI_HA = 1 means perfect cluster recovery, while values greater than 0.9, 0.8, and 0.65 mean excellent, good, and moderate recovery, respectively (7).

                           | B: pair in same group | B: pair in different group
A: pair in same group      | a                     | c
A: pair in different group | b                     | d

Table 3.17: 2 x 2 Contingency Table - 2 x 2 contingency table that describes the four different types of pairs: a, b, c, d

Chapter 4

Evaluation and Discussion

ARI_HA is calculated for all pairs of cluster sets and shown in table 4.1.

           | Features | Era    | Emotion | Genre  | Instrument | Origin
Features   | 1        | 0.0145 | 0.0066  | 0.0404 | 0.0289     | 0.0139
Era        | 0.0145   | 1      | 0.0399  | 0.0823 | 0.1010     | 0.0315
Emotion    | 0.0066   | 0.0399 | 1       | 0.1267 | 0.0390     | 0.0961
Genre      | 0.0404   | 0.0823 | 0.1267  | 1      | 0.0833     | 0.0843
Instrument | 0.0289   | 0.1010 | 0.0390  | 0.0833 | 1          | 0.0418
Origin     | 0.0139   | 0.0315 | 0.0961  | 0.0843 | 0.0418     | 1

Table 4.1: ARI_HA - Cluster validation calculated with the Hubert-Arabie adjusted Rand Index

It is observed from Table 4.1 that the cluster validation between any pair of cluster sets is overall very low, with the highest correlation between emotion and genre at 12.67% and the lowest among the label classes between origin and era at 3.15%. Although all the validation values are too low to draw a relationship between any pair of audio labels, it is still interesting to observe that emotion and genre are the most correlated among them, indicating that there are common emotion annotations for certain genres. In order to observe the relationship between emotion and genre more closely, the number of intersections between each term from both label classes is calculated, and the maximum intersection for each term is shown in tables 4.2 and 4.3.
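A minimal sketch of how such a term-by-term tabulation could be computed is shown below, assuming each label class is available as a mapping from term to the set of track IDs in that cluster; the toy genre and emotion sets are hypothetical.

```python
# Sketch of the term-co-occurrence analysis behind Tables 4.2 and 4.3, assuming
# each label class is a dict mapping term -> set of track IDs.
def most_cooccurring(source_clusters, target_clusters):
    """For each source term, return the target term with the largest intersection."""
    best = {}
    for s_term, s_ids in source_clusters.items():
        counts = {t_term: len(s_ids & t_ids) for t_term, t_ids in target_clusters.items()}
        t_term = max(counts, key=counts.get)
        best[s_term] = (t_term, counts[t_term])
    return best

genre = {"indie": {1, 2, 3, 4}, "hip hop": {5, 6, 7}}      # toy track-ID sets
emotion = {"beautiful": {1, 2, 9}, "chill": {3, 5, 6, 7}}
print(most_cooccurring(genre, emotion))
# {'indie': ('beautiful', 2), 'hip hop': ('chill', 3)}
```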
Genre → most co-occurring emotion (intersection count): alternative → beautiful (159); ambient → chill (92); ballade → beautiful (161); blues → energetic (204); british → beautiful (119); christian → inspirational (214); classic → romantic (392); country → romantic (90); dance → chill (74); dub → chill (99); electronic → chill (509); eurodance → uplifting (52); hard style → gore (94); hip hop → chill (3002); instrumental → beautiful (156); industrial → romantic (44); indie → chill (2167); lounge → beautiful (58); modern → chill (70); neo → chill (169); new → chill (200); noise → beautiful (121); nu → chill (44); old → beautiful (121); orchestra → beautiful (117); opera → romantic (116); post → chill (164); power → melancholia (160); progressive → chill (827); r&b → chill (102); rag → chill (182); soundtrack → chill (1042); salsa → chill (173); smooth → chill (1584); soft → mellow (536); swing → mellow (60); synth pop → melancholia (132); techno → happy (960); thrash → peace (134); tribal → brutal (99); urban → beautiful (154); waltz → romantic (49); wave → romantic (3448); zouk → beautiful (100)

Table 4.2: Term Co-occurrence - The most common emotion term for each genre term

It is observed that, because of the disproportionate distribution among emotion terms, most genre labels share the same few emotion terms, such as 'beautiful', 'chill', and 'romantic'. On the other hand, as the distribution of genre terms is more 'flat', the emotion terms are spread over many different genre terms. However, note that the co-occurrence between an emotion label and a genre label does not correlate well with intuition, as can be observed from table 4.3 – e.g. 'beautiful' & 'indie', 'happy' & 'hip hop', 'uplifting' & 'progressive' – which is indicative of the low cluster validation rate. It also indicates that people use only a limited vocabulary to describe the emotional aspect of a song, regardless of the genre of the given song.

Emotion → most co-occurring genre (intersection count): beautiful → indie (883); brutal → tribal (99); calming → synth pop (118); chill → hip hop (3002); energetic → wave (276); evil → indie (72); gore → hard style (94); grim → hip hop (659); happy → hip hop (1472); harsh → noise (28); haunting → electronic (110); horror → wave (370); humorous → salsa (96); hypnotic → smooth (93); intense → rag (70); inspirational → christian (214); light → soft (118); loud → christian (119); melancholia → indie (325); mellow → soft (536); moody → alternative (88); obscure → new (50); patriotic → hip hop (30); peace → thrash (134); relax → smooth (230); romantic → wave (3448); sad → indie (161); sexy → hip hop (752); strange → progressive (76); trippy → progressive (67); uplifting → progressive (79); wicked → hip hop (99); wistful → classic (121); witty → progressive (104)

Table 4.3: Term Co-occurrence - The most common genre term for each emotion term

Although it seems intuitive and expected that the correlations between audio labels turn out to be low, it is quite surprising that the cluster validations between the audio features and each label are also low. In order to understand why this is the case, a number of post-processing steps are proposed.

4.1 K vs. ARI_HA

In section 3.4.4, the number of clusters, K, was chosen based on the elbow method. This K does not necessarily generate optimal validation rates, and therefore a K vs. ARI_HA plot is drawn to find the K that maximizes the validation rates for each set of clusters. Figure 4.1 shows the pattern of ARI_HA for each label class as K changes.
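A minimal sketch of such a sweep is given below, assuming the standardized 41,269 x 30 feature matrix and integer-encoded cluster assignments per label class; scikit-learn's adjusted_rand_score follows the Hubert and Arabie chance correction, so it matches the quantity in Equation 3.6. The toy data at the bottom is a random stand-in.

```python
# Sketch of the K-sweep behind Figure 4.1: re-run k-means for each K and compare
# the feature clustering against each label clustering with ARI_HA.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def ari_vs_k(X, label_clusterings, k_values):
    X = StandardScaler().fit_transform(X)        # mean/std scaling, as in Equation 3.5
    scores = {label: [] for label in label_clusterings}
    for k in k_values:
        feat = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        for label, y in label_clusterings.items():
            scores[label].append(adjusted_rand_score(y, feat))
    return scores

# stand-in data: 200 songs, 30 features, one 'era'-like labelling with 7 clusters
X = np.random.randn(200, 30)
labels = {"era": np.random.randint(0, 7, size=200)}
print(ari_vs_k(X, labels, k_values=[5, 10, 15]))
```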
It turns out that the sum of ARI_HA over the label classes is maximum when K = 5, which is therefore used as the number of feature clusters in the following analysis.

Figure 4.1: K vs. ARI_HA - ARI_HA is maximum when K = 5

4.2 Hubert-Arabie adjusted Rand Index (revisited)

Using the result from the previous section (K = 5), ARI_HA is re-calculated for each label class and shown in table 4.4.

                    | Era    | Emotion | Genre  | Instrument | Origin
Features (original) | 0.0145 | 0.0066  | 0.0404 | 0.0289     | 0.0139
Features (K = 5)    | 0.0228 | 0.0069  | 0.0439 | 0.0402     | 0.0160

Table 4.4: Optimal Cluster Validation - Optimal ARI_HA calculated for each label class

4.3 Cluster Structure Analysis

Now that the optimal K and ARI_HA values are found, the reason for such low cluster validation rates needs to be discussed. In order to do so, the structure of the feature clusters needs to be known by calculating the Euclidean distance between the cluster centroids. Table 4.5 shows the Euclidean distances between the centroids of the clusters. Note that the centroids of clusters 1 and 2 have the minimum distance while those of clusters 3 and 4 have the maximum distance, indicating the most similar and most dissimilar clusters, respectively.

Cluster | 1      | 2      | 3      | 4      | 5
1       | 0      | 0.0860 | 0.1638 | 0.1042 | 0.1061
2       | 0.0860 | 0      | 0.1975 | 0.0963 | 0.1086
3       | 0.1638 | 0.1975 | 0      | 0.2310 | 0.1368
4       | 0.1042 | 0.0963 | 0.2310 | 0      | 0.1810
5       | 0.1061 | 0.1086 | 0.1368 | 0.1810 | 0

Table 4.5: Self-similarity matrix - The distances between each pair of clusters

4.3.1 Neighboring Clusters vs. Distant Clusters

In order to observe the detailed structure of the clusters, the co-occurrence between feature clusters and label clusters is calculated and the four most co-occurring clusters are returned. In other words, for each feature cluster 1 through 5, the four most intersecting clusters from each label class are calculated and shown in figures 4.2 – 4.6. Note that due to the uneven distribution of songs within each label class, the cluster that contains the largest number of songs, such as '80s' in the era label, 'chill' in emotion, 'hip hop' in genre, 'synth' in instrument, and 'dc' in origin, appears frequently across all five feature clusters. In fact, the '80s' and 'chill' clusters appear as the most co-occurring clusters with all five feature clusters.
Figure 4.2: Co-occurrence between feature clusters and era clusters - The four most co-occurring era clusters are returned for each feature cluster

Figure 4.3: Co-occurrence between feature clusters and emotion clusters - The four most co-occurring emotion clusters are returned for each feature cluster

Figure 4.4: Co-occurrence between feature clusters and genre clusters - The four most co-occurring genre clusters are returned for each feature cluster

Figure 4.5: Co-occurrence between feature clusters and instrument clusters - The four most co-occurring instrument clusters are returned for each feature cluster

Figure 4.6: Co-occurrence between feature clusters and origin clusters - The four most co-occurring origin clusters are returned for each feature cluster

Knowing that the distance between clusters 1 and 2 is the minimum and the distance between clusters 3 and 4 is the maximum, it can also be observed from figures 4.2 – 4.6 that the co-occurring terms within clusters 1 and 2 are similar, while those within clusters 3 and 4 are quite dissimilar, as shown in tables 4.6 and 4.7, indicating that neighboring feature clusters share similar label clusters, while distant feature clusters do not.
Cluster 1                           | Cluster 2
(80s, 90s, 00s, 60s)                | (80s, 90s, 00s, 70s)
(chill, beautiful, romantic, happy) | (chill, romantic, beautiful, happy)
(hip hop, wave, smooth, indie)      | (wave, indie, hip hop, soft)
(synth, guitar, drum, piano)        | (synth, guitar, piano, drum)
(dc, american, german, british)     | (dc, british, roma, german)

Table 4.6: Neighboring Clusters - Clusters with minimum Euclidean distance share similar label clusters

Cluster 3                            | Cluster 4
(80s, 20th century, 60s, 90s)        | (80s, 90s, 00s, 70s)
(chill, romantic, beautiful, mellow) | (chill, happy, romantic, sexy)
(soundtrack, smooth, classic, indie) | (hip hop, wave, techno, progressive)
(piano, guitar, synth, drum)         | (synth, drum, guitar, bass)
(american, roma, german, los angeles)| (dc, british, german, roma)

Table 4.7: Distant Clusters - Clusters with maximum Euclidean distance have dissimilar label clusters

4.3.2 Correlated Terms vs. Uncorrelated Terms

Considering the opposite case, the author selected the four largest clusters from each label class and calculated their co-occurrence with every feature cluster, as shown in figures 4.7 – 4.11. In order to observe whether highly correlated label clusters can also be characterized by feature clusters, table 4.8 shows a summary of the most correlated terms for the four largest clusters of each label class, whereas table 4.9 shows the least correlated terms for the same clusters. Using the histograms from figures 4.7 – 4.11, a 5-dimensional vector can be created for each term by finding the ratio of each feature cluster – e.g. the vector for the '80s' term is (Cluster 1, Cluster 2, Cluster 3, Cluster 4, Cluster 5) = (0.701, 1, 0.196, 0.727, 0.601). Using the same method, a total of 41 vectors are retrieved, one for every term appearing in tables 4.8 and 4.9, and shown in table 4.10. Using the relationships from tables 4.8 and 4.9 and the vectors in table 4.10, the Euclidean distance between each pair of vectors is calculated and shown in tables 4.11 and 4.12. As the average distances indicate, highly correlated terms share a similar combination of feature clusters, whereas lowly correlated terms do not.
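A minimal sketch of this term-vector comparison is given below. It assumes each term's raw co-occurrence counts with the five feature clusters are available and are normalized so that the largest entry is 1, which is consistent with the example vector given above for '80s'; the exact scaling of the distances tabulated in the thesis is not reproduced here, and the plain Euclidean distance is used instead.

```python
# Sketch of the term-vector construction behind Table 4.10 and the pairwise
# comparison behind Tables 4.11 and 4.12.
import numpy as np

def term_vector(counts):
    """counts: co-occurrence of one term with feature clusters 1..5."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.max()   # largest component becomes 1, e.g. '80s' -> (0.701, 1, ...)

def term_distance(counts_a, counts_b):
    """Euclidean distance between the normalized vectors of two terms."""
    return float(np.linalg.norm(term_vector(counts_a) - term_vector(counts_b)))

# toy co-occurrence counts for two hypothetical terms across the five feature clusters
print(term_distance([700, 1000, 200, 730, 600], [820, 660, 190, 1000, 580]))
```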
Figure 4.7: Co-occurrence between era clusters and feature clusters - Intersecting feature clusters are returned for the four largest era clusters

Figure 4.8: Co-occurrence between emotion clusters and feature clusters - Intersecting feature clusters are returned for the four largest emotion clusters

Figure 4.9: Co-occurrence between genre clusters and feature clusters - Intersecting feature clusters are returned for the four largest genre clusters

Figure 4.10: Co-occurrence between instrument clusters and feature clusters - Intersecting feature clusters are returned for the four largest instrument clusters

Figure 4.11: Co-occurrence between origin clusters and feature clusters - Intersecting feature clusters are returned for the four largest origin clusters

Most correlated terms, by label class (each of the four largest clusters followed by its most correlated terms from the other label classes):
Era: 80s → chill, wave, synth, dc; 90s → chill, hip hop, synth, dc; 00s → chill, indie, synth, british; 60s → beautiful, smooth, guitar, american
Emotion: chill → 80s, hip hop, synth, british; romantic → 80s, wave, synth, roma; beautiful → 80s, indie, guitar, american; happy → 80s, hip hop, synth, dc
Genre: hip hop → 90s, chill, synth, dc; wave → 80s, romantic, synth, roma; indie → 00s, chill, synth, british; smooth → 80s, chill, guitar, american
Instrument: synth → 80s, chill, wave, dc; guitar → 80s, chill, indie, american; piano → 80s, beautiful, soundtrack, american; drum → 90s, chill, hip hop, dc
Origin: dc → 80s, happy, hip hop, synth; british → 80s, chill, indie, synth; american → 00s, chill, smooth, guitar; german → 80s, chill, hip hop, synth

Table 4.8: Term Correlation - The most highly correlated terms for the four largest clusters from each label class
cuba 20th century gore violin england 20th century evil ambient cuba 20th century calming ambient saxophone 90s romantic wave guitar british evil ambient violin cuba 00s alternative saxophone belgium 50s calming violin belgium 50s calming dance belgium 50s brutal blues saxophone 00s beautiful indie piano american calming ambient saxophone east coast 50s ambient violin african 50s calming saxophone cuba 50s calming ambient african 50s brutal ballade violin 60s happy smooth drum german Table 4.9: Term Correlation - The least correlated terms for the four largest clusters from each label class 45 calming alternative violin belgium 50s alternative saxophone african 20th century brutal violin african 50s calming ambient cuba 20th century calming alternative violin 4.3 Cluster Structure Analysis Term 80s 90s 00s 60s chill romantic beautiful happy hip hop wave indie smooth soundtrack synth guitar piano drum dc british american german roma 50s 20th century brutal evil calming gore ambient alternative ballad blues country dance violin saxophone african belgium cuba east coast england Vectors (0.701, 1, 0.196, 0.727, 0.601) (0.827, 0.659, 0.185, 1, 0.582) (0.996, 0.910, 0.271, 0.938, 1) (0.834, 0.648, 0.519, 0.257, 1) (1, 0.651, 0.394, 0.749, 0.781) (0.598, 1, 0.410, 0.646, 0.681) (0.594, 0.775, 0.322, 0.315, 1) (0.522, 0.559, 0.039, 1, 0.208) (0.829, 0.486, 0.048, 1, 0.231) (0.486, 1, 0.046, 0.617, 0.373) (0.494, 0.988, 0.074, 0.266, 1) (1, 0.309, 0.463, 0.215, 0.714) (0.722, 0.665, 1, 0.269, 0.763) (0.752, 1, 0.093, 0.866, 0.629) (0.621, 0.946, 0.238, 0.424, 1) (0.526, 0.500, 1, 0.222, 0.741) (0.778, 0.465, 0.089, 1, 0.281) (0.571, 0.773, 0.07, 1, 0.373) (0.622, 1, 0.130, 0.623, 0.687) (0.964, 0.594, 0.543, 0.492, 1) (1, 0.646, 0.434, 0.758, 0.701) (0.708, 0.567, 1, 0.489, 0.635) (0.823, 0.129, 0.635, 0.149, 1) (0.154, 0.047, 1, 0.118, 0.346) (0.535, 0.807, 0.193, 1, 0.790) (0.744, 0.744, 0.308, 0.231, 1) (0.563, 0.469, 0.109, 0.141, 1) (0.388, 1, 0, 0.6, 0.235) (0.250, 0.568, 0.159, 0.114, 1) (0.414, 1, 0.019, 0.543, 0.432) (0.128, 1, 0.009, 0.284, 0.413) (0.899, 1, 0.203, 0.529, 0.544) (0.139, 0.181, 0.042, 0.028, 1) (0.277, 0.185, 0, 1, 0.108) (0.190, 0.177, 0.658, 0.013, 1) (1, 0.325, 0.265, 0.262, 0.567) (1, 0.644, 0.137, 0.824, 0.537) (0.659, 0.319, 0.055, 1, 0.110) (0.469, 0.047, 0, 1, 0.031) (0.618, 0.056, 0.101, 0.034, 1) (0.587, 1, 0.239, 0.635, 0.819) Table 4.10: Term Vectors - 5-dimensional vectors are created for all the terms appearing in tables 4.8 and 4.9, using histograms as in figures 4.7 – 4.11. 
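Given the vectors in table 4.10, the pairwise comparison reported in tables 4.11 and 4.12 can be sketched as follows. This is only an illustration: the tabulated values are consistent with the Euclidean distance between two term vectors divided by the number of feature clusters (five), but that scaling is inferred from the published numbers rather than stated in the text, so the helper below treats it as an assumption.

import numpy as np

# Term vectors copied from table 4.10 (only the ones needed for this example).
vectors = {
    "80s":   np.array([0.701, 1.000, 0.196, 0.727, 0.601]),
    "synth": np.array([0.752, 1.000, 0.093, 0.866, 0.629]),
    "cuba":  np.array([0.469, 0.047, 0.000, 1.000, 0.031]),
}

def term_distance(a, b, scale=5.0):
    # Euclidean distance between two term vectors; dividing by the number of
    # feature clusters is an assumption made to match the scale of the values
    # reported in tables 4.11 and 4.12.
    return float(np.linalg.norm(vectors[a] - vectors[b])) / scale

print(round(term_distance("80s", "synth"), 4))  # ~0.0365, a highly correlated pair (table 4.11)
print(round(term_distance("80s", "cuba"), 4))   # ~0.2366, a weakly correlated pair (table 4.12)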
Correlated   Most correlated terms (distance)
80s          chill (0.1064), wave (0.0729), synth (0.0365), dc (0.0918)
00s          chill (0.0815), indie (0.1730), synth (0.0984), british (0.1208)
chill        80s (0.1064), hip hop (0.1472), synth (0.1115), british (0.1198)
beautiful    80s (0.1276), indie (0.0691), guitar (0.0442), american (0.1000)
hip hop      90s (0.0829), chill (0.1472), synth (0.1339), dc (0.0824)
indie        00s (0.1730), chill (0.1736), synth (0.1503), british (0.0990)
synth        80s (0.0365), chill (0.1115), wave (0.0896), dc (0.0821)
piano        80s (0.2192), beautiful (0.1569), soundtrack (0.0523), american (0.1483)
dc           80s (0.0918), happy (0.0553), hip hop (0.0824), synth (0.0821)
american     00s (0.1223), chill (0.0750), smooth (0.0995), guitar (0.1165)
average      0.1068
90s          chill (0.0840), hip hop (0.0829), synth (0.0776), dc (0.0738)
60s          beautiful (0.0681), smooth (0.0957), guitar (0.0982), american (0.0550)
romantic     80s (0.0527), wave (0.0981), synth (0.0837), roma (0.1516)
happy        80s (0.1386), hip hop (0.0633), synth (0.1335), dc (0.0553)
wave         80s (0.0729), romantic (0.0981), synth (0.0896), roma (0.2220)
smooth       80s (0.1911), chill (0.1283), guitar (0.1704), american (0.0995)
guitar       80s (0.1024), chill (0.1278), indie (0.0528), american (0.1165)
drum         90s (0.0748), chill (0.1400), hip hop (0.0170), dc (0.0766)
british      80s (0.0339), chill (0.1198), indie (0.0990), synth (0.0568)
german       80s (0.1063), chill (0.0180), hip hop (0.1390), synth (0.1131)
average      0.0970

Table 4.11: Label Cluster Distance - Euclidean distance between each pair of most correlated terms; each average is taken over the ten rows above it

Correlated   Least correlated terms (distance)
80s          evil (0.1393), country (0.2575), violin (0.2699), cuba (0.2366)
00s          calming (0.2043), ambient (0.2337), saxophone (0.1987), east coast (0.2622)
chill        50s (0.1755), ballade (0.2351), violin (0.2482), cuba (0.2390)
beautiful    50s (0.1543), ambient (0.0955), violin (0.1703), african (0.1661)
hip hop      20th century (0.3063), gore (0.1576), violin (0.3125), england (0.1831)
indie        50s (0.2168), calming (0.1079), saxophone (0.1918), cuba (0.3079)
synth        20th century (0.3303), evil (0.1616), ambient (0.2141), cuba (0.2343)
piano        50s (0.1314), calming (0.1865), ambient (0.1862), african (0.2362)
dc           20th century (0.3062), calming (0.2214), ambient (0.2307), saxophone (0.2005)
american     50s (0.1204), brutal (0.1617), ballade (0.2479), violin (0.2016)
average      0.2110
90s          evil (0.1784), ambient (0.2282), violin (0.2836), cuba (0.1834)
60s          calming (0.1072), alternative (0.1954), violin (0.1692), belgium (0.2606)
romantic     00s (0.1221), alternative (0.1019), saxophone (0.1787), belgium (0.2044)
happy        50s (0.2816), alternative (0.1365), saxophone (0.2008), african (0.1240)
wave         50s (0.2706), calming (0.1910), violin (0.2755), belgium (0.1685)
smooth       20th century (0.2207), brutal (0.2153), violin (0.1827), african (0.1576)
guitar       50s (0.1941), calming (0.1145), dance (0.2743), belgium (0.2492)
drum         50s (0.2572), calming (0.2282), ambient (0.1619), cuba (0.1167)
british      50s (0.2347), brutal (0.0898), blues (0.0667), saxophone (0.1745)
german       20th century (0.2779), calming (0.1787), alternative (0.1743), violin (0.2507)
average      0.1920

Table 4.12: Label Cluster Distance - Euclidean distance between each pair of least correlated terms; each average is taken over the ten rows above it

Chapter 5

Conclusion and Future Work

The goal of this study was to analyze the relationship among audio labels using a large dataset and a cluster validation method. As observed in the previous chapter, the validation rates among audio labels turn out to be low, implying no statistically significant correlation. In other words, songs described by a certain instrument are not distinctively characterized as originating from a certain place, arousing a specific emotion, or belonging to a certain genre.
It is still interesting, however, that the emotion and genre labels show the highest correlation, implying that certain genres do co-occur with specific emotion terms in some cases. Regarding the low cluster validation rates between the audio feature clusters and the label classes, it was found that the extracted features do not characterize songs along any single label class, since certain label clusters co-occur with all five feature clusters. Nevertheless, a correlation between feature clusters and label clusters does exist: neighboring feature clusters tend to contain similar label clusters, and highly correlated label clusters share a similar combination of feature clusters. For future work, it would be meaningful to find out which combination of feature clusters or label clusters maximizes cluster validation, which could lead to a better understanding of the relationship between features and labels.
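The future-work direction of searching for a combination of label clusterings that maximizes cluster validation could be prototyped along the following lines. This is a minimal sketch, not the method of this study: the function and the toy data are invented for illustration, combining label classes by taking the Cartesian product of their cluster assignments is just one possible choice of joint partition, and scikit-learn's adjusted_rand_score is used here as an implementation of the Hubert-Arabie adjusted Rand index.

from itertools import combinations
from sklearn.metrics import adjusted_rand_score  # Hubert-Arabie adjusted Rand index

def best_label_combination(feature_clusters, label_clusterings, max_size=2):
    # Exhaustively score subsets of label classes against the content-based
    # feature clusters, combining classes via the Cartesian product of their
    # per-song cluster assignments (an assumed way of forming a joint partition).
    best = (None, -1.0)
    names = list(label_clusterings)
    for size in range(1, max_size + 1):
        for combo in combinations(names, size):
            joint = list(zip(*(label_clusterings[n] for n in combo)))
            score = adjusted_rand_score(feature_clusters, [str(t) for t in joint])
            if score > best[1]:
                best = (combo, score)
    return best

# Toy usage with made-up cluster assignments for five songs.
feature = [0, 0, 1, 1, 2]
labels = {"era": [1, 1, 2, 2, 3], "genre": [0, 1, 1, 2, 2]}
print(best_label_combination(feature, labels))  # -> (('era',), 1.0) on this toy data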