
Analyzing the Relationship Among Audio Labels Using
Hubert-Arabie adjusted Rand Index
Kwan Kim
Submitted in partial fulfillment of the requirements for the
Master of Music in Music Technology
in the Department of Music and Performing Arts Professions
in The Steinhardt School
New York University
Advisor: Dr. Juan P. Bello
Reader: Dr. Kenneth Peacock
Date: 2012/12/11
© Copyright 2012
Kwan Kim
Abstract
With the advent of advanced technology and instant access to the Internet, music databases have grown rapidly, requiring more efficient ways of organizing and providing access to music. A number of automatic classification algorithms have been proposed in the field of music information retrieval (MIR) by means of supervised learning, in which ground truth labels are imperative. The goal of this study is to analyze the statistical relationship among audio labels such as era, emotion, genre, instrument, and origin, using the Million Song Dataset and the Hubert-Arabie adjusted Rand Index, in order to observe whether there is a significant correlation between these labels. It is found that cluster validation is low among audio labels, which implies no strong correlation and little co-occurrence between these labels when describing songs.
Acknowledgements
I would like to thank everyone involved in completing this thesis. I especially send my deepest gratitude to my advisor, Juan P. Bello, for keeping me motivated. His critiques and insights consistently pushed me to become a better student. I also thank Mary Farbood for being such a friendly mentor. It was a pleasure to work as her assistant for the past year and a half. I thank the rest of the NYU faculty for providing the opportunity and an excellent program of study. Lastly, I thank my family and my wife for their support and love.
Contents

List of Figures
List of Tables
1 Introduction
2 Literature Review
  2.1 Music Information Retrieval
  2.2 Automatic Classification
    2.2.1 Genre
    2.2.2 Emotion
3 Methodology
  3.1 Data Statistics
  3.2 Filtering
    3.2.1 1st Filtering
    3.2.2 2nd Filtering
    3.2.3 3rd Filtering
      3.2.3.1 Co-occurrence
      3.2.3.2 Hierarchical Structure
    3.2.4 4th Filtering
      3.2.4.1 Term Frequency
    3.2.5 5th Filtering
  3.3 Audio Labels
    3.3.1 Era
    3.3.2 Emotion
    3.3.3 Genre
    3.3.4 Instrument
    3.3.5 Origins
  3.4 Audio Features
    3.4.1 k-means Clustering Algorithm
    3.4.2 Feature Matrix
    3.4.3 Feature Scale
    3.4.4 Feature Clusters
  3.5 Hubert-Arabie adjusted Rand Index
4 Evaluation and Discussion
  4.1 K vs. ARIHA
  4.2 Hubert-Arabie adjusted Rand Index (revisited)
  4.3 Cluster Structure Analysis
    4.3.1 Neighboring Clusters vs. Distant Clusters
    4.3.2 Correlated Terms vs. Uncorrelated Terms
5 Conclusion and Future Work
References
List of Figures

1.1 System Diagram of a Generic Automatic Classification Model
2.1 System Diagram of a Genre Classification Model
2.2 System Diagram of a music emotion recognition model
2.3 Thayer's 2-Dimensional Emotion Plane (19)
3.1 Clusters
3.2 Co-occurrence - same level
3.3 Co-occurrence - different level
3.4 Hierarchical Structure (Terms)
3.5 Intersection of Labels
3.6 Era Histogram
3.7 Emotion Histogram
3.8 Genre Histogram
3.9 Instrument Histogram
3.10 Origin Histogram
3.11 Elbow Method
3.12 Content-based Cluster Histogram
4.1 K vs. ARIHA
4.2 Co-occurrence between feature clusters and era clusters
4.3 Co-occurrence between feature clusters and emotion clusters
4.4 Co-occurrence between feature clusters and genre clusters
4.5 Co-occurrence between feature clusters and instrument clusters
4.6 Co-occurrence between feature clusters and origin clusters
4.7 Co-occurrence between era clusters and feature clusters
4.8 Co-occurrence between emotion clusters and feature clusters
4.9 Co-occurrence between genre clusters and feature clusters
4.10 Co-occurrence between instrument clusters and feature clusters
4.11 Co-occurrence between origin clusters and feature clusters
List of Tables

3.1 Overall Data Statistics
3.2 Field List
3.3 Terms
3.4 Labels
3.5 Clusters
3.6 Hierarchical Structure (Clusters)
3.7 Hierarchical Structure (µ and σ)
3.8 Mutually Exclusive Clusters
3.9 Filtered Dataset
3.10 Era Statistics
3.11 Emotion Terms
3.12 Genre Terms
3.13 Instrument Statistics
3.14 Origin Terms
3.15 Audio Features
3.16 Cluster Statistics
3.17 2 x 2 Contingency Table
4.1 ARIHA
4.2 Term Cooccurrence
4.3 Term Cooccurrence
4.4 Optimal Cluster Validation
4.5 Self-similarity matrix
4.6 Neighboring Clusters
4.7 Distant Clusters
4.8 Term Correlation
4.9 Term Correlation
4.10 Term Vectors
4.11 Label Cluster Distance
4.12 Label Cluster Distance
Chapter 1
Introduction
In the 21st century, we live in a world where instant access to a countless number of music recordings is granted. Online music stores such as the iTunes Store and online music streaming services such as Pandora provide millions of songs from artists all over the world. As music databases have grown rapidly with the advent of advanced technology and the Internet, much more efficient ways of organizing and finding music are required. One of the main tasks in the field of music information retrieval (MIR) is to generate a computational model for the classification of audio data such that it is faster and easier to search for and listen to music. A number of researchers have proposed methods to categorize music into different classifications such as genres, emotions, activities, or artists (1, 2, 3, 4). This automated classification would then let us search for audio data based on their labels – e.g., when we search for “sad” music, an audio emotion classification model returns songs with the label “sad.” Regardless of the type of classification model, there is a generic approach to this problem as outlined in figure 1.1 – extracting audio features, obtaining labels, and computing the parameters of a model by means of a supervised machine learning technique.
When utilizing a supervised learning technique to construct a classification model, however, it is imperative that ground truth labels are provided. Obtaining labels involves
human subjects, which makes the process expensive and inefficient. In certain cases, the number of labels is bound to be insufficient, making it even harder to collect data. As a result, researchers have used semi-supervised learning methods, in which unlabeled data is combined with labeled data during the training process in order to improve performance (5). However, this method is also limited to situations where the data have only one type of label – e.g., if a dataset is labeled by genre, it is possible to construct a genre classification model; however, it is not possible to create a mood classification model without knowing an a priori correspondence between genre and mood labels. This causes a problem when a dataset has only one type of label and needs to be classified into a different label class. It would be much more efficient and less expensive if there existed a statistical correspondence among different audio labels that made it possible to predict a different label class from the same dataset.
Therefore, the goal of this study is to define a statistical relationship among different audio labels such as genres, emotions, era, origin, and instruments, using The Million Song Dataset (6), applying an unsupervised learning technique, i.e. the k-means algorithm, and calculating the Hubert-Arabie adjusted Rand (ARIHA) index (7).
The outline of this thesis is organized as follows: Chapter 2 reviews previous MIR studies on automatic classification models. The detailed methodology and data analysis are given in Chapter 3. Based on the results obtained in Chapter 3, possible interpretations of the data are discussed in Chapter 4. Finally, concluding remarks and future work are laid out in Chapter 5.
Figure 1.1: System Diagram of a Generic Automatic Classification Model - Labels are used only in the supervised learning case
Chapter 2
Literature Review
2.1 Music Information Retrieval
There are many ways to categorize music. One of the traditional ways to categorize music is by its metadata, such as the name of the song, artist, or album, which is known as tag-based or text-based categorization (8). As music databases have grown virtually countless, more efficient ways to query and retrieve music are required. As opposed to tag-based query and retrieval, which only retrieves songs that we have a priori information about, content-based query and retrieval allows us to find songs in different ways - e.g., it allows us to find songs that are similar in musical context or structure, and it can also recommend songs based on musical labels such as emotion.
Music information retrieval (MIR) is a widely and rapidly growing research topic in the multimedia processing industry, which aims at extending the understanding and usefulness of music data through the research, development and application of computational approaches and tools. As a novel way of retrieving songs or creating a playlist, researchers have come up with a number of classification methods using different labels such as genre, emotion, or cover song (1, 2, 3, 4), so that each classification model can retrieve a song based on its label or musical similarities. These methods are different from the tag-based method since audio features are extracted and analyzed prior to constructing a computational model. Therefore, a retrieved song is based on the content of the audio, not on its metadata.
2.2 Automatic Classification
In previous studies, most audio classification models have been based on supervised learning methods, in which musical labels such as genre or emotion are required (1, 2, 3, 4). Using labels along with well-defined high-dimensional musical features, learning algorithms are trained on the data to find possible relationships between the features and a label, so that for any given unknown (test) data, the model can correctly recognize the label.
2.2.1 Genre
Tzanetakis et al. (1, 9) are among the earliest researchers who worked on automatic genre classification. Instead of manually assigning a musical genre to a song, an automatic genre classification model generates a genre label for a given song after comparing its musical features with the model. In (1, 9) the authors used three feature sets, each of which describes timbral texture, rhythmic content, and pitch content, respectively. Features such as spectral centroid, rolloff, flux, zero crossing rate, and MFCC (10) were extracted to construct a feature vector that describes the timbral texture of music. An automatic beat detection algorithm (4, 11) was used to calculate the rhythmic structure of music and to build a feature vector that describes rhythmic content. Lastly, pitch detection techniques (12, 13, 14) were used to construct a pitch content feature vector. Figure 2.1 represents the system overview of the automatic genre classification model described in (1, 9).
Figure 2.1: System Diagram of a Genre Classification Model - Gaussian Mixture
Model (GMM) is used as a classifier
2.2.2 Emotion
In 2006, the work of L. Lu et al. (15) was one of a few studies that provided an in-depth analysis of mood detection and tracking of music signals using acoustic features extracted directly from the audio waveform, instead of using MIDI or symbolic representations. Although it has been an active research topic, researchers have consistently faced the same problem with the quantification of music emotion due to its inherently subjective nature. Recent studies have sought ways to minimize the inconsistency among labels. Skowronek et al. (16) paid close attention to the material collection process. They obtained a large number of labelled data from 12 subjects and kept only those labels in agreement with one another in order to exclude the ambiguous ones. In (17), the authors created a collaborative game that collects dynamic (time-varying) labels of music mood from two players and ensures that the players cross-check each other's labels in order to build a consensus.
Defining mood classes is not an easy task. There have been mainly two approaches to defining mood: categorical and continuous. In (15) mood labels are classified into adjectives such as happy, angry, sad, or sleepy. However, the authors in (18) defined mood as a continuous regression problem, as described in figure 2.2, and mapped emotion onto the two-dimensional Thayer's Plane (19) shown in figure 2.3.
Recent studies focus on multi-modal classification using both lyrics and audio content to quantify music emotion (20, 21, 22), on dynamic music emotion modeling (23, 24), or on unsupervised learning approaches to mood recognition (25).
Figure 2.2: System Diagram of a music emotion recognition model - Arousal and
Valence are two independent regressors
Figure 2.3: Thayer’s 2-Dimensional Emotion Plane (19) - Each axis is used as an
independent regressor
Chapter 3
Methodology
Previous studies have constructed automatic classification models using a relationship between audio features and one type of label (e.g. genre or mood). As stated in chapter 1, however, if a statistical relationship among several audio labels is defined, it could reduce the cost of constructing automatic classification models. In order to solve the problem, two things are needed:
1. Big music data with multiple labels: The Million Song Dataset (6)
2. Cluster validation method: Hubert-Arabie adjusted Rand Index
A large dataset is required to minimize bias and noisiness of labels. Since labels are acquired from users, a small amount of music data would lead to large variance among labels and thus large error. A cluster validation method is required to compare sets of clusters created by different labels, hence the Hubert-Arabie adjusted Rand index.
3.1 Data Statistics
The Million Song Dataset (6) consists of a million files in HDF5 format, from which various information can be retrieved, including metadata such as the name of the artist, the title of the song, or tags (terms), and musical features such as chroma, tempo, loudness, mode, or key. Table 3.1 shows the overall statistics of the dataset and table 3.2 shows a list of fields available in the files of the dataset.
No.   Type                             Total
1     Songs                            1,000,000
2     Data                             273 GB
3     Unique Artists                   44,745
4     Unique Terms                     7,643
5     Artists with at least one term   43,943
6     Identified Cover Song            18,196

Table 3.1: Overall Data Statistics - Statistics of The Million Song Dataset
3.2 Filtering
LabROSA, the distributor of The Million Song Dataset, also provides all the necessary functions to access and manipulate the data from Matlab. The 'HDF5 Song File Reader' function converts .h5 files into a Matlab object, which can be further used to extract labels using the 'get artist terms' function and features using the 'get segments pitches', 'get tempo', and 'get loudness' functions. Labels are thus used to create several sets of clusters, while audio features are used to form another set of clusters. Figure 3.1 indicates the different sets of clusters.
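The thesis accesses the data through LabROSA's Matlab wrappers. Purely as an illustration, the following Python/h5py sketch performs the equivalent extraction for a single file; the HDF5 paths ('/metadata/artist_terms', '/analysis/segments_pitches', etc.) follow the publicly documented MSD file layout and are assumptions here, not functions provided by the thesis, so they should be double-checked against an actual file.

```python
# A minimal sketch (not the author's Matlab code) of pulling the labels and
# features used in this study out of one Million Song Dataset .h5 file.
import h5py
import numpy as np

def read_song(path):
    with h5py.File(path, "r") as f:
        # Echo Nest tags and their frequencies (used for the label clusters)
        terms = [t.decode("utf-8") for t in f["/metadata/artist_terms"][:]]
        term_freqs = f["/metadata/artist_terms_freq"][:]

        # Segment-level features, averaged over time (see Section 3.4.2)
        chroma = f["/analysis/segments_pitches"][:].mean(axis=0)   # 12-d
        timbre = f["/analysis/segments_timbre"][:].mean(axis=0)    # 12-d

        # Song-level scalars live in the compound 'songs' table
        songs = f["/analysis/songs"][0]
        scalars = np.array([songs["tempo"], songs["loudness"],
                            songs["key"], songs["key_confidence"],
                            songs["mode"], songs["mode_confidence"]])

    # 30-dimensional feature vector: 12 chroma + 12 timbre + 6 scalars
    return terms, term_freqs, np.concatenate([chroma, timbre, scalars])
```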
Although it would be ideal to take account of all million songs, due to the noisiness of the data, the dataset must undergo the following filtering process to get rid of unnecessary songs:
1. All terms are categorized into one of 5 label classes.
2. Create a set of clusters based on each label class.
3. Find the hierarchical structure of each label class.
4. Make each set of clusters mutually exclusive.
5. Songs that contain at least one term from all of the five label classes are retrieved.
Figure 3.1: Clusters - Several sets of clusters can be made using labels and audio features
3.2.1 1st Filtering
As shown in table 3.1, there are 7,643 unique terms that describe songs in the dataset. Some examples of these terms are shown in table 3.3. These unique terms have to be filtered so that meaningless terms are ignored. In other words, five label classes are chosen so that each term can be categorized into one of the following five labels: era, emotion, genre, instrument, and origin. In doing so, any term that cannot be described as one of those labels is dropped. Table 3.4 shows the total number of terms that belong to each label. Note the small number of terms in each label category compared to the original 7,643 unique terms. This is because many terms cross-reference each other. For example, '00s', '00s alternative', and '00s country' all count as unique terms, but they are all represented as '00s' under the 'era' label class.
Field Name                    Type             Description
analysis sample rate          float            sample rate of the audio used
artist familiarity            float            algorithmic estimation
artist hotnesss               float            algorithmic estimation
artist id                     string           Echo Nest ID
artist name                   string           artist name
artist terms                  array string     Echo Nest tags
artist terms freq             array float      Echo Nest tags freqs
audio md5                     string           audio hash code
bars confidence               array float      confidence measure
bars start                    array float      beginning of bars
beats confidence              array float      confidence measure
beats start                   array float      result of beat tracking
danceability                  float            algorithmic estimation
duration                      float            in seconds
energy                        float            energy from listener perspective
key                           int              key the song is in
key confidence                float            confidence measure
loudness                      float            overall loudness in dB
mode                          int              major or minor
mode confidence               float            confidence measure
release                       string           album name
sections confidence           array float      confidence measure
sections start                array float      largest grouping in a song
segments confidence           array float      confidence measure
segments loudness max         array float      max dB value
segments loudness max time    array float      time of max dB value
segments loudness max start   array float      dB value at onset
segments pitches              2D array float   chroma feature
segments start                array float      musical events
segments timbre               2D array float   texture features
similar artist                array string     Echo Nest artist IDs
song hotttnesss               float            algorithmic estimation
song id                       string           Echo Nest song ID
tempo                         float            estimated tempo in BPM
time signature                int              estimate of number of beats/bar
time signature confidence     float            confidence measure
title                         string           song title
track id                      string           Echo Nest track ID

Table 3.2: Field List - A list of fields available in the files of the dataset
Similarly, 'alternative jazz', 'alternative rock', 'alternative r & b', and 'alternative metal' are simply 'alternative', 'jazz', 'rock', 'r & b', and 'metal' under the 'genre' label class.
No.    Terms
1      '00s'
2      '00s alternative'
3      '00s country'
...    ...
3112   'gp worldwide'
3113   'grammy winner'
3114   'gramophone'
...    ...
5787   'punky reggae'
5788   'pure black metal'
5789   'pure grunge'
...    ...

Table 3.3: Terms - Examples of terms (tags)
Label        Total
era          17
emotion      96
genre        436
instrument   78
origin       635

Table 3.4: Labels - Number of terms belonging to each label class
In this way, the total number of terms in each category is reduced and it is still possible to search for songs without using repetitive terms. For example, a song carrying the term 'alternative jazz' can be found with both the 'alternative' and 'jazz' keywords instead of 'alternative jazz'. In addition, composite terms such as 'alternative jazz' or 'ambient electronics' are not included, since they sit at the lowest hierarchical level and the clusters they would form contain only a few elements.
3.2.2 2nd Filtering
After all unique terms are filtered into one of the five label classes, each term belonging to a label class is regarded as a cluster, as shown in table 3.5. Note that it is still not certain that all terms are truly representative as independent clusters, as it must be taken into account that there are a few hierarchical layers among terms – i.e. the 'piano' and 'electric piano' terms might not be at the same level of the hierarchy in the 'instrument' label class. In order to account for differences in layers, the co-occurrence between each pair of clusters is calculated, as explained in the next section.
Label        Clusters
era          '00s' '1700s' '1910s' '19th century'
emotion      'angry' 'chill' 'energetic' 'horror' 'mellow'
genre        'ambient' 'blues' 'crossover' 'dark' 'electronic' 'opera'
instrument   'accordion' 'banjo' 'clarinet' 'horn' 'ukelele' 'laptop'
origin       'african' 'belgian' 'dallas' 'hongkong' 'moroccan'

Table 3.5: Clusters - Each term forms a cluster within each label class
3.2.3 3rd Filtering
3.2.3.1 Co-occurrence
Within a single label class, there are a number of different terms, each of which could possibly represent an individual cluster. However, while certain terms inherently possess a clear meaning, some do not – e.g. in the 'genre' label class, the distinctions between 'dark metal' and 'death metal' or 'acid metal' and 'acid funk' might not be obvious. In order to avoid ambiguity among clusters, the co-occurrences of two clusters are measured. The co-occurrences of a pair of clusters can be easily calculated as follows:

\[
\mathrm{cooc}_{a,b} = \frac{\mathrm{intersect}(a,b)}{\mathrm{intersect}(a,a)}, \qquad
\mathrm{cooc}_{b,a} = \frac{\mathrm{intersect}(a,b)}{\mathrm{intersect}(b,b)}
\tag{3.1}
\]
where intersect(i, j) counts the number of elements that belong to both cluster i and cluster j. Therefore, if both clusters have high or low co-occurrence values, it implies that there is a large or small overlap between the clusters, while if only one of the two clusters has a high value and the other a low value, it implies that one cluster is a subset of the other, as illustrated in figures 3.2 and 3.3. Also note that if one cluster is a subset of the other, it implies that they are not at the same hierarchical level.
Figure 3.2: Co-occurrence - same level - (a) small overlap between two clusters; (b) large overlap between two clusters
Figure 3.3: Co-occurrence - different level - (a) Cluster B is a subset of Cluster A; (b) Cluster A has a relatively larger number of elements than Cluster B, most of which belong to the intersection
Therefore, a threshold is set such that if (cooc_{a,b} > 0.9 and cooc_{b,a} < 0.1) or (cooc_{a,b} < 0.1 and cooc_{b,a} > 0.9), then cluster A is a subset of cluster B or vice versa. If neither condition is met, the two clusters are at the same hierarchical level. In doing so, the layers of the hierarchy can be retrieved.
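As a minimal sketch of equation (3.1) and this subset test, assuming each cluster is simply held as a set of track IDs (the names below are illustrative, not the thesis's Matlab code):

```python
# Minimal sketch of equation (3.1) and the 0.9/0.1 subset test, assuming each
# cluster is a Python set of track IDs.  Names are illustrative.
def cooc(a: set, b: set) -> float:
    """Co-occurrence of cluster a with cluster b, eq. (3.1)."""
    return len(a & b) / len(a)

def relation(a: set, b: set) -> str:
    """Classify a pair of clusters using the thresholds from Section 3.2.3.1."""
    ab, ba = cooc(a, b), cooc(b, a)
    if ab > 0.9 and ba < 0.1:
        return "a is a subset of b"      # a sits one layer below b
    if ab < 0.1 and ba > 0.9:
        return "b is a subset of a"
    return "same hierarchical level"

# toy example: 'electric piano' songs are almost all 'piano' songs
piano = set(range(0, 1000))
electric_piano = set(range(0, 50))
print(relation(electric_piano, piano))   # -> "a is a subset of b"
```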
3.2.3.2 Hierarchical Structure
After obtaining co-occurrence values for all pairs of clusters, the structure of the clusters in each label class can be known. Table 3.6 shows the hierarchical structure of each label class and figure 3.4 shows some of the terms at different hierarchical levels.
Label        1st Layer   2nd Layer   3rd Layer   Total
era          3           14          empty       17
emotion      3           93          empty       96
genre        27          127         274         428
instrument   18          72          empty       90
origin       13          135         487         635

Table 3.6: Hierarchical Structure (Clusters) - Total number of clusters at different layers in each label class
Figure 3.4: Hierarchical Structure (Terms) - Examples of terms at different layers
in each label class
The structure correlates well with intuition, with more general terms such as 'bass' or 'guitar' at the higher level, while terms such as 'acoustic bass' or 'classical guitar' sit at a lower level. The number of songs in each cluster also matches intuition well: terms at the high level of the hierarchy have a large number of songs, while relatively few songs belong to terms at the low level. Now that the structure of clusters for each label class is known, it must be carefully decided which layer should be used, as there is a tradeoff between the number of clusters and the number of songs belonging to each cluster: a higher layer has a small total number of clusters but each cluster contains a sufficient number of songs, and vice versa.
In order to make a logical decision, three different thresholds are set: the number of clusters, N, the mean, µ, and the standard deviation, σ, of all levels are calculated and shown in table 3.7. The rationale is that each layer within a label class must have a sufficient number of clusters and that each cluster must contain a sufficient number of songs, while the variance of the distribution is as small as possible. The author defined the three thresholds as follows:
N > 5
µ > 5,000
σ as small as possible
The 1st layer of the instrument class and the 2nd layer of the era, emotion, genre, and origin label classes are selected, as shown in table 3.7.
Label        1st Layer µ (N)   σ         2nd Layer µ (N)   σ        3rd Layer µ (N)   σ
era          59,524 (3)        47,835    21,686 (14)       43,263   empty             empty
emotion      23,905 (3)        15,677    5,736 (93)        19,871   empty             empty
genre        35,839 (27)       79,370    5,744 (127)       14,307   2,421 (274)       6,816
instrument   39,744 (18)       76,272    998 (72)          2,464    empty             empty
origin       69,452 (13)       141,440   8,804 (135)       23,383   644 (487)         2,008

Table 3.7: Hierarchical Structure (µ and σ) - The mean and the standard deviation for each layer. The number in parentheses denotes the number of clusters. The selected layers are the 2nd layer for era, emotion, genre, and origin, and the 1st layer for instrument.
3.2.4 4th Filtering
3.2.4.1 Term Frequency
After finding the structure of the clusters and selecting a layer in the previous section, all the clusters within the same layer must become mutually exclusive, leaving no overlapping elements among clusters. Therefore, after finding the intersections among clusters, it needs to be decided to which cluster each shared element should belong. In order to resolve conflicts among multiple clusters, the frequency of terms is retrieved for every single element via the provided function 'get artist terms freq'. For every element within an intersection, the term frequency values are compared and the term with the higher value takes the element, while the other loses it. In this way, the total number of clusters is reduced via merging and all the clusters become mutually exclusive. Table 3.8 indicates the total number of songs in each label class.
Label        # of Songs
era          387,977
emotion      394,860
genre        700,778
instrument   384,509
origin       871,631

Table 3.8: Mutually Exclusive Clusters - Total number of songs in mutually exclusive clusters
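As a rough illustration of this tie-breaking rule (assuming, hypothetically, that each label class is held as a dict mapping a term to a set of track IDs and that each track's term frequencies are available), the following sketch reassigns every shared track to the term with the higher frequency:

```python
# Illustrative sketch of the term-frequency tie-break from Section 3.2.4.1.
# `clusters` maps term -> set of track IDs within one label class;
# `term_freq[track][term]` is the Echo Nest term frequency for that track.
# Both containers are hypothetical stand-ins for the thesis's Matlab data.
from itertools import combinations

def make_mutually_exclusive(clusters: dict, term_freq: dict) -> dict:
    out = {term: set(ids) for term, ids in clusters.items()}
    for t1, t2 in combinations(clusters, 2):
        for track in clusters[t1] & clusters[t2]:
            # the term with the higher frequency keeps the track
            loser = t2 if term_freq[track].get(t1, 0) >= term_freq[track].get(t2, 0) else t1
            out[loser].discard(track)
    return out

# toy example
clusters = {"piano": {1, 2, 3}, "guitar": {3, 4}}
term_freq = {3: {"piano": 0.9, "guitar": 0.4}}
print(make_mutually_exclusive(clusters, term_freq))
# {'piano': {1, 2, 3}, 'guitar': {4}}
```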
3.2.5 5th Filtering
Since most songs are given multiple terms, they may belong to several label classes – e.g. a song with the terms '00s' and 'alternative jazz' belongs to both the 'era' and 'genre' label classes. Therefore, after obtaining the indexes of the songs that belong to each category, the intersection of these indexes is retrieved so that only songs carrying a term from each of the five label classes are considered. This process is illustrated in figure 3.5. Finally, the total number of clusters in each label class and the total number of songs used in the study after all filtering processes are shown in table 3.9.
Figure 3.5: Intersection of Labels - Songs that belong to all five label classes are
chosen
           Songs       Era   Emotion   Genre   Instrument   Origin
Original   1,000,000   14    91        122     17           99
Filtered   41,269      7     34        44      7            33

Table 3.9: Filtered Dataset - Total number of songs and clusters after filtering
3.3 Audio Labels
3.3.1 Era
After all the filtering processes, 7 clusters remain for the era label class. Terms such as '16th century' or '21th century' as well as '30s' and '40s' are successfully removed via merging and the hierarchy. Table 3.10 and figure 3.6 show the statistics of the remaining terms. Note that the distribution is negatively skewed, which is intuitive, because more songs exist in recorded format from later decades than from the early 20th century due to advances in recording technology. It also makes sense that the '80s' cluster contains the most songs, because people use the term '80s' to describe '80s rock' or '80s music' more often than they use '00s' for '00s music' or '00s pop'.
Era          50s   60s     70s     80s      90s     00s     20th    Total
# of songs   661   3,525   2,826   17,359   9,555   6,111   1,232   41,269

Table 3.10: Era Statistics - Number of songs belonging to each era cluster
Figure 3.6: Era Histogram - Distribution of songs based on era label
3.3.2 Emotion
There are a total of 34 clusters in the emotion label class, which are shown in table 3.11. Note the uneven distribution of songs in the emotion label class, shown in figure 3.7. Clusters such as 'beautiful', 'chill', and 'romantic' together account for about one third of the total songs, while relatively few songs belong to clusters such as 'evil', 'haunting', and 'uplifting'.
Emotion
'beautiful' 'brutal' 'calming' 'chill' 'energetic'
'evil' 'gore' 'grim' 'happy' 'harsh'
'haunting' 'horror' 'humorous' 'hypnotic' 'intense'
'inspirational' 'light' 'loud' 'melancholia' 'mellow'
'moody' 'obscure' 'patriotic' 'peace' 'relax'
'romantic' 'sad' 'sexy' 'strange' 'trippy'
'uplifting' 'wicked' 'wistful' 'witty'

Table 3.11: Emotion Terms - all the emotion terms.
Figure 3.7: Emotion Histogram - Distribution of songs based on emotion label
3.3.3 Genre
A total of 44 genre clusters are created; they are shown in table 3.12 and their distribution is shown in figure 3.8. Also note that certain genre terms such as 'hip hop', 'indie', and 'wave' have more songs than others such as 'emo' or 'melodic'.
Genre
'alternative' 'ambient' 'ballade' 'blues' 'british'
'christian' 'classic' 'country' 'dance' 'dub'
'electronic' 'eurodance' 'hard style' 'hip hop' 'instrumental'
'industrial' 'indie' 'lounge' 'modern' 'neo'
'new' 'noise' 'nu' 'old' 'orchestra'
'opera' 'post' 'power' 'progressive' 'r&b'
'rag' 'soundtrack' 'salsa' 'smooth' 'soft'
'swing' 'synth pop' 'techno' 'thrash' 'tribal'
'urban' 'waltz' 'wave' 'zouk'

Table 3.12: Genre Terms - all the genre terms.
Figure 3.8: Genre Histogram - Distribution of songs based on genre label
3.3.4 Instrument
There are only 7 instrument clusters after the filtering processes. The name of each cluster and the number of songs belonging to it are given in table 3.13. The values make intuitive sense, as 'guitar', 'piano', and 'synth' have many songs in their clusters, while relatively few songs belong to 'saxophone' and 'violin'. Figure 3.9 shows the histogram of the instrument clusters.
Instrument   'bass'   'drum'   'guitar'   'piano'   'saxophone'   'violin'   'synth'
# of songs   2,444    5,103    9,731      5,667     1,340         322        16,662

Table 3.13: Instrument Statistics - Number of songs belonging to each instrument cluster
Figure 3.9: Instrument Histogram - Distribution of songs based on instrument label
3.3.5 Origins
There are 33 different origin clusters, as laid out in table 3.14. Note that clusters such as 'american', 'british', 'dc', and 'german' have a large number of songs, while clusters such as 'new orleans', 'suomi', or 'texas' consist of relatively few songs. Also note that the terms 'american' and 'texas' both appear as independent clusters, even though it seems intuitive that 'texas' should be a subset of 'american'. This is because, when describing a song with an origin label, certain songs are more specifically described by 'texas' than by 'american' or 'united states' – e.g. country music. Finally, the statistics of the origin label class are shown in figure 3.10.
Origin
'african' 'american' 'belgium' 'british' 'canada'
'cuba' 'dc' 'east coast' 'england' 'german'
'ireland' 'israel' 'italian' 'japanese' 'los angeles'
'massachusetts' 'mexico' 'nederland' 'new york' 'norway'
'new orleans' 'poland' 'roma' 'russia' 'scotland'
'southern' 'spain' 'suomi' 'sweden' 'tennessee'
'texas' 'united states' 'west coast'

Table 3.14: Origin Terms - all the origin terms.
Figure 3.10: Origin Histogram - Distribution of songs based on origin label
3.4 Audio Features
Audio features are extracted in order to construct feature clusters with a clustering algorithm, using provided functions such as 'get segments timbre' or 'get segments pitches'. Table 3.15 shows a list of the extracted features. It takes about 30 ms to extract the features from one song, which amounts to roughly 8 hours for a million songs. However, since only 41,269 songs are used, the computation time is reduced to less than an hour.
No.   Feature           Function
1     Chroma            'get segments pitches'
2     Texture           'get segments timbre'
3     Tempo             'get tempo'
4     Key               'get key'
5     Key Confidence    'get key confidence'
6     Loudness          'get loudness'
7     Mode              'get mode'
8     Mode Confidence   'get mode confidence'

Table 3.15: Audio Features - Several audio features are extracted via their respective functions
3.4.1 k-means Clustering Algorithm
Content-based clusters can be constructed with a clustering algorithm, an unsupervised learning method that does not require pre-labeled data and uses only features to construct clusters of similar data points. There are several variants of clustering algorithms, such as k-means, k-median, centroid-based, or single-linkage (26, 27, 28). In this study, the k-means clustering algorithm is used to build the automatic clusters. The basic structure of the algorithm is defined by the following steps (29, 30):
1. Define a similarity measurement metric, d (e.g. Euclidean, Manhattan, etc.).
2. Randomly initialize k centroids, µ_k.
3. For all data points x, find the µ_k that returns the minimum d.
4. Find C_k, the cluster that includes the set of points assigned to µ_k.
5. Recalculate µ_k for every C_k.
6. Repeat steps 3 through 5 until convergence.
7. Repeat steps 2 through 6 multiple times to avoid local minima.
The author used the (squared) Euclidean distance as the similarity measurement metric, d, and computed the centroid mean of each cluster as:

\[
d^{(i)} := \lVert x^{(i)} - \mu_k \rVert^2
\tag{3.2}
\]

\[
\mu_k := \frac{1}{|C_k|} \sum_{i \in C_k} x^{(i)}
\tag{3.3}
\]

where x^{(i)} is the position of the i-th point. C_k is constructed by finding c^{(i)}, the index of the centroid closest to x^{(i)}, i.e. the one that minimizes the distance in (3.2). In other words, a point belongs to the cluster whose centroid is at the minimum Euclidean distance from it. A minimal sketch of this procedure is given below.
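The following is a minimal NumPy sketch of steps 1–7 above, with random data standing in for the 41,269 x 30 feature matrix; it illustrates the standard algorithm and is not the author's Matlab implementation.

```python
# Minimal NumPy sketch of the k-means steps listed above.  Random data stands
# in for the 41,269 x 30 feature matrix used in the thesis.
import numpy as np

def kmeans(X, k, n_iter=100, n_restarts=5, seed=0):
    rng = np.random.default_rng(seed)
    best_cost, best_centroids, best_assign = np.inf, None, None
    for _ in range(n_restarts):                              # step 7
        centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 2
        for _ in range(n_iter):                              # step 6
            # step 3: squared Euclidean distance to each centroid, eq. (3.2)
            d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            assign = d.argmin(axis=1)                        # step 4: build C_k
            # step 5: recompute each centroid as the mean of its cluster, eq. (3.3)
            new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                            else centroids[j] for j in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        cost = d.min(axis=1).mean()                          # cost, eq. (3.4)
        if cost < best_cost:
            best_cost, best_centroids, best_assign = cost, centroids, assign
    return best_cost, best_centroids, best_assign

X = np.random.default_rng(1).normal(size=(500, 30))          # toy stand-in data
cost, centroids, labels = kmeans(X, k=10)
print(round(cost, 3), np.bincount(labels))
```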
3.4.2 Feature Matrix
Using the extracted audio features such as chroma, timbre, key, key confidence, mode, mode confidence, tempo, and loudness, a feature matrix F of size I x J is constructed, where I is the total number of points (= 41,269) and J is the total number of features (= 30; both the chroma and timbre features are averaged across time, each resulting in a 12 x 1 vector per point). Therefore, the cost function of the algorithm is:

\[
\frac{1}{I} \sum_{i=1}^{I} d^{(i)}
\tag{3.4}
\]
and the optimization objective is to minimize (3.4).
3.4.3 Feature Scale
Feature scaling is necessary because each feature lies in a different range and therefore needs to be normalized so that all features are weighted equally. The author used mean/standard-deviation (z-score) scaling for each feature f_j:

\[
\hat{f}_j = \frac{f_j - \mu_{f_j}}{\sigma_{f_j}}
\tag{3.5}
\]
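As a small illustration of Sections 3.4.2 and 3.4.3 together, the sketch below stacks per-song feature vectors into F and applies equation (3.5); the `read_song` reader mentioned in the comment is the hypothetical sketch from Section 3.2, and any extractor returning a 30-d vector would do.

```python
# Sketch of assembling and z-scoring the I x 30 feature matrix (Sections 3.4.2
# and 3.4.3).  `read_song` is the hypothetical reader sketched earlier; here we
# just use random stand-in vectors.
import numpy as np

def build_feature_matrix(song_vectors):
    """Stack per-song 30-d vectors into F (I x J) and apply eq. (3.5)."""
    F = np.vstack(song_vectors)                 # I x 30
    mu = F.mean(axis=0)                         # per-feature mean
    sigma = F.std(axis=0)
    sigma[sigma == 0] = 1.0                     # guard against constant features
    return (F - mu) / sigma                     # z-scored features

# toy usage with random stand-in vectors
vectors = [np.random.default_rng(i).normal(size=30) for i in range(100)]
F_scaled = build_feature_matrix(vectors)
print(F_scaled.shape, F_scaled.mean(axis=0).round(6)[:3])
```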
3.4.4 Feature Clusters
It is often unclear what the correct value of K should be, and there is no algorithm that leads to a definitive answer. However, the elbow method is often used to determine the number of clusters, K. Figure 3.11 shows a plot of the cost function for different values of K. Either K = 8 or K = 10 marks the elbow of the plot and is a possible candidate for the number of clusters. In this study, K = 10 is chosen.
Figure 3.11: Elbow Method - K = 8 or K = 10 is a possible number of clusters
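As an illustration of the elbow method (not the thesis's Matlab code), the sketch below sweeps K with scikit-learn's KMeans and plots the within-cluster sum of squares, which is proportional to the cost in equation (3.4); random data stands in for the scaled feature matrix.

```python
# Sketch of the elbow method using scikit-learn's KMeans.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(2).normal(size=(500, 30))   # stand-in for F (41,269 x 30)
ks = list(range(2, 51))
costs = [KMeans(n_clusters=k, n_init=5, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, costs, marker="o")
plt.xlabel("K")
plt.ylabel("cost (within-cluster sum of squares)")
plt.title("Elbow Method")
plt.show()
```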
After choosing the value of K, the structure of the clusters is found; it is shown in table 3.16 and figure 3.12.
Cluster      1       2       3       4       5       6       7       8       9       10
# of Songs   4,349   4,172   4,128   4,475   5,866   3,933   2,544   5,149   2,436   4,217

Table 3.16: Cluster Statistics - The number of songs within each feature cluster
Figure 3.12: Content-based Cluster Histogram - Distribution of songs based on audio features
3.5 Hubert-Arabie adjusted Rand Index
After obtaining six sets of clusters – i.e. five from labels and one from audio features – the relationship between any pair of cluster sets can be found by calculating the Hubert-Arabie adjusted Rand (ARIHA) index (7, 31). The ARIHA index quantifies cluster validation by comparing the generated clusters with the original structure of the data. Therefore, by comparing two different sets of clusters, the correlation between them can be drawn. The ARIHA index is measured as:

\[
\mathrm{ARI}_{HA} = \frac{\binom{N}{2}(a+d) - \left[(a+b)(a+c) + (c+d)(b+d)\right]}
{\binom{N}{2}^2 - \left[(a+b)(a+c) + (c+d)(b+d)\right]}
\tag{3.6}
\]
where N is the total number of data points and a, b, c, d represent four different types of pairs. Let A and B be two sets of clusters and P and Q be the number of clusters in each set; then a, b, c, and d are defined as follows:
a: pairs of elements placed in the same cluster in both A and B
b: pairs of elements in the same cluster in B but in different clusters in A
c: pairs of elements in the same cluster in A but in different clusters in B
d: pairs of elements in different clusters in both A and B
These four types of pairs can be described by the contingency table shown in table 3.17. This leads to the computation of a, b, c, and d as follows:
\[
a = \frac{\sum_{p=1}^{P}\sum_{q=1}^{Q} t_{pq}^2 - N}{2}
\tag{3.7}
\]

\[
b = \frac{\sum_{p=1}^{P} t_{p+}^2 - \sum_{p=1}^{P}\sum_{q=1}^{Q} t_{pq}^2}{2}
\tag{3.8}
\]

\[
c = \frac{\sum_{q=1}^{Q} t_{+q}^2 - \sum_{p=1}^{P}\sum_{q=1}^{Q} t_{pq}^2}{2}
\tag{3.9}
\]

\[
d = \frac{\sum_{p=1}^{P}\sum_{q=1}^{Q} t_{pq}^2 + N^2 - \sum_{p=1}^{P} t_{p+}^2 - \sum_{q=1}^{Q} t_{+q}^2}{2}
\tag{3.10}
\]
where t_{pq}, t_{p+}, and t_{+q} denote the number of elements belonging to both the p-th and q-th clusters, the total number of elements belonging to the p-th cluster, and the total number of elements belonging to the q-th cluster, respectively. ARIHA = 1 means perfect cluster recovery, while values greater than 0.9, 0.8, and 0.65 mean excellent, good, and moderate recovery, respectively (7).
                             A: pair in same group   A: pair in different group
B: pair in same group                 a                          b
B: pair in different group            c                          d

Table 3.17: 2 x 2 Contingency Table - 2 x 2 contingency table that describes the four different types of pairs: a, b, c, d
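A minimal sketch of equations (3.6)–(3.10), assuming the two clusterings are given as integer label arrays of equal length; it is illustrative only, not the thesis's implementation.

```python
# Minimal sketch of ARI_HA, eqs. (3.6)-(3.10), for two clusterings given as
# integer label arrays of equal length.
import numpy as np

def ari_ha(labels_a, labels_b):
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    N = len(labels_a)
    # contingency table t_pq: co-membership counts between the two clusterings
    t = np.zeros((labels_a.max() + 1, labels_b.max() + 1))
    for p, q in zip(labels_a, labels_b):
        t[p, q] += 1
    sum_pq = (t ** 2).sum()
    sum_p = (t.sum(axis=1) ** 2).sum()        # squared row sums
    sum_q = (t.sum(axis=0) ** 2).sum()        # squared column sums
    a = (sum_pq - N) / 2                       # eq. (3.7)
    b = (sum_p - sum_pq) / 2                   # eq. (3.8)
    c = (sum_q - sum_pq) / 2                   # eq. (3.9)
    d = (sum_pq + N ** 2 - sum_p - sum_q) / 2  # eq. (3.10)
    n2 = N * (N - 1) / 2                       # binomial(N, 2)
    cross = (a + b) * (a + c) + (c + d) * (b + d)
    return (n2 * (a + d) - cross) / (n2 ** 2 - cross)   # eq. (3.6)

# identical partitions (up to relabeling) give perfect recovery, ARI_HA = 1
print(ari_ha([0, 0, 1, 1, 2], [1, 1, 0, 0, 2]))   # -> 1.0
```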
Chapter 4
Evaluation and Discussion
ARIHA is calculated for all pairs of cluster sets and shown in table 4.1.

             Features   Era      Emotion   Genre    Instrument   Origin
Features     1          0.0145   0.0066    0.0404   0.0289       0.0139
Era          0.0145     1        0.0399    0.0823   0.1010       0.0315
Emotion      0.0066     0.0399   1         0.1267   0.0390       0.0961
Genre        0.0404     0.0823   0.1267    1        0.0833       0.0843
Instrument   0.0289     0.1010   0.0390    0.0833   1            0.0418
Origin       0.0139     0.0315   0.0961    0.0843   0.0418       1

Table 4.1: ARIHA - Cluster validation calculated with the Hubert-Arabie adjusted Rand Index
It is observed from Table 4.1 that the cluster validation between any pair of cluster sets is overall very low, with the highest correlation between emotion and genre at 12.67% and the lowest among label pairs between origin and era at 3.15%. Although all the validation values are too low to draw a relationship between any pair of audio labels, it is still interesting to observe that emotion and genre are the most correlated among them, indicating that there are common emotion annotations for certain genres. In order to observe the relationship between emotion and genre more closely, the number of intersections between each term from both label classes is calculated and the maximum intersection for each term is shown in tables 4.2 and 4.3.
Genre            Intersection   Emotion
'alternative'    159            'beautiful'
'ambient'        92             'chill'
'ballade'        161            'beautiful'
'blues'          204            'energetic'
'british'        119            'beautiful'
'christian'      214            'inspirational'
'classic'        392            'romantic'
'country'        90             'romantic'
'dance'          74             'chill'
'dub'            99             'chill'
'electronic'     509            'chill'
'eurodance'      52             'uplifting'
'hard style'     94             'gore'
'hip hop'        3002           'chill'
'instrumental'   156            'beautiful'
'industrial'     44             'romantic'
'indie'          2167           'chill'
'lounge'         58             'beautiful'
'modern'         70             'chill'
'neo'            169            'chill'
'new'            200            'chill'
'noise'          121            'beautiful'
'nu'             44             'chill'
'old'            121            'beautiful'
'orchestra'      117            'beautiful'
'opera'          116            'romantic'
'post'           164            'chill'
'power'          160            'melancholia'
'progressive'    827            'chill'
'r&b'            102            'chill'
'rag'            182            'chill'
'soundtrack'     1042           'chill'
'chill'          173            'chill'
'smooth'         1584           'chill'
'soft'           536            'mellow'
'swing'          60             'mellow'
'synth pop'      132            'melancholia'
'techno'         960            'happy'
'thrash'         134            'peace'
'tribal'         99             'brutal'
'urban'          154            'beautiful'
'waltz'          49             'romantic'
'wave'           3448           'romantic'
'zouk'           100            'beautiful'

Table 4.2: Term Cooccurrence - The most common emotion term for each genre term is observed
It is observed that, because of the disproportionate distribution among emotion terms, most genre labels share the same few emotion terms, such as 'beautiful', 'chill', and 'romantic'. On the other hand, as the distribution of genre terms is flatter, many emotion terms pair with different genre terms. However, note that the co-occurrence between an emotion label and a genre label does not correlate well with intuition, as can be observed from table 4.3 – e.g. 'beautiful' & 'indie', 'happy' & 'hip hop', 'uplifting' & 'progressive' – which is indicative of the low cluster validation rate.
Emotion           Intersection   Genre
'beautiful'       883            'indie'
'brutal'          99             'tribal'
'calming'         118            'synthpop'
'chill'           3002           'hip hop'
'energetic'       276            'wave'
'evil'            72             'indie'
'gore'            94             'hardstyle'
'grim'            659            'hip hop'
'happy'           1472           'hip hop'
'harsh'           28             'noise'
'haunting'        110            'electronic'
'horror'          370            'wave'
'humorous'        96             'salsa'
'hypnotic'        93             'smooth'
'intense'         70             'rag'
'inspirational'   214            'christian'
'light'           118            'soft'
'loud'            119            'christian'
'melancholia'     325            'indie'
'mellow'          536            'soft'
'moody'           88             'alternative'
'obscure'         50             'new'
'patriotic'       30             'hip hop'
'peace'           134            'thrash'
'relax'           230            'smooth'
'romantic'        3448           'wave'
'sad'             161            'indie'
'sexy'            752            'hip hop'
'strange'         76             'progressive'
'trippy'          67             'progressive'
'uplifting'       79             'progressive'
'wicked'          99             'hip hop'
'wistful'         121            'classic'
'witty'           104            'progressive'

Table 4.3: Term Cooccurrence - The most common genre term for each emotion term is observed
It also indicates that people use only a limited vocabulary to describe the emotional aspect of a song regardless of its genre.
Although it seems intuitive and expected that the correlations between audio labels
turn out to be low, it is quite surprising that the cluster validations between audio
features and each label are also low. In order to understand why this is the case, a
number of post-processing steps are proposed.
4.1 K vs. ARIHA
In section 3.4.4, the number of clusters, K, was chosen based on the elbow method. This K does not necessarily generate optimal validation rates, and therefore a K vs. ARIHA plot is drawn to find the K that maximizes the validation rates for each set of clusters. Figure 4.1 shows the pattern of ARIHA for each label class as K changes. It turns out that the sum of ARIHA is maximum when K = 5, the maximum number of feature clusters.
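As an illustration of this sweep (not the thesis's Matlab code), the sketch below varies K, re-clusters the features, and scores each label partition with scikit-learn's adjusted_rand_score, which implements the Hubert-Arabie adjusted Rand index; the random arrays stand in for the real feature matrix and the five label assignments.

```python
# Sketch of the K vs. ARI_HA sweep: cluster the features for each K and score
# the result against each label partition.  Random data stands in for the
# scaled feature matrix and the five label cluster assignments.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score   # Hubert-Arabie ARI

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 30))                    # stand-in feature matrix
labels = {name: rng.integers(0, n, size=1000)      # stand-in label partitions
          for name, n in [("era", 7), ("emotion", 34), ("genre", 44),
                          ("instrument", 7), ("origin", 33)]}

best_k, best_sum = None, -np.inf
for k in range(2, 51):
    feat = KMeans(n_clusters=k, n_init=5, random_state=0).fit_predict(X)
    total = sum(adjusted_rand_score(y, feat) for y in labels.values())
    if total > best_sum:
        best_k, best_sum = k, total
print("K with maximum summed ARI_HA:", best_k)
```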
Figure 4.1: K vs. ARIHA - ARIHA is maximum when K = 5
4.2 Hubert-Arabie adjusted Rand Index (revisited)
Using the result from the previous section (K = 5), ARIHA is re-calculated for each label class and shown in table 4.4.
                      Era      Emotion   Genre    Instrument   Origin
Features (original)   0.0145   0.0066    0.0404   0.0289       0.0139
Features (K = 5)      0.0228   0.0069    0.0439   0.0402       0.0160

Table 4.4: Optimal Cluster Validation - Optimal ARIHA calculated for each label class
4.3 Cluster Structure Analysis
Now that the optimal K and ARIHA values are found, the reason for such low cluster validation rates needs to be discussed. In order to do so, the structure of the feature clusters needs to be known, which is done by calculating the Euclidean distance between the centroids of the clusters. Table 4.5 shows these distances. Note that the centroids of clusters 1 and 2 have the minimum distance, while those of clusters 3 and 4 have the maximum distance, indicating the most similar and most dissimilar clusters, respectively.
Cluster   1        2        3        4        5
1         0        0.0860   0.1638   0.1042   0.1061
2         0.0860   0        0.1975   0.0963   0.1086
3         0.1638   0.1975   0        0.2310   0.1368
4         0.1042   0.0963   0.2310   0        0.1810
5         0.1061   0.1086   0.1368   0.1810   0

Table 4.5: Self-similarity matrix - The distances between each pair of cluster centroids
4.3.1 Neighboring Clusters vs. Distant Clusters
In order to observe the detailed structure of the clusters, the co-occurrence between feature clusters and label clusters is calculated and the four most co-occurring clusters are returned. In other words, for each feature cluster 1 through 5, the four most intersecting clusters from each label class are calculated and shown in figures 4.2 – 4.6 (a sketch of this computation is given below).
Note that, due to the uneven distribution of songs within each label class, the cluster that contains the largest number of songs – '80s' in era, 'chill' in emotion, 'hip hop' in genre, 'synth' in instrument, and 'dc' in origin – appears frequently across all five feature clusters. In fact, the '80s' and 'chill' clusters appear as the most co-occurring cluster with all five feature clusters.
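The following is a minimal sketch of how the top-four intersecting label clusters per feature cluster might be computed; both inputs are hypothetical stand-ins (an integer feature-cluster assignment and a dict of label term to track-ID set), not the thesis's data structures.

```python
# Sketch of returning the four most intersecting label clusters for each
# feature cluster (figures 4.2-4.6).
import numpy as np

def top_cooccurring(feature_assign, label_clusters, top=4):
    result = {}
    for k in np.unique(feature_assign):
        members = set(np.flatnonzero(feature_assign == k))
        counts = {term: len(members & ids) for term, ids in label_clusters.items()}
        result[int(k)] = sorted(counts, key=counts.get, reverse=True)[:top]
    return result

# toy usage
rng = np.random.default_rng(4)
feature_assign = rng.integers(0, 5, size=200)
label_clusters = {t: set(rng.choice(200, size=60, replace=False)) for t in
                  ["80s", "90s", "00s", "60s", "70s", "50s", "20th century"]}
print(top_cooccurring(feature_assign, label_clusters))
```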
Figure 4.2: Co-occurrence between feature clusters and era clusters - The four most co-occurring era clusters are returned for each feature cluster
Figure 4.3: Co-occurrence between feature clusters and emotion clusters - The four most co-occurring emotion clusters are returned for each feature cluster
Figure 4.4: Co-occurrence between feature clusters and genre clusters - The four most co-occurring genre clusters are returned for each feature cluster
Figure 4.5: Co-occurrence between feature clusters and instrument clusters - The four most co-occurring instrument clusters are returned for each feature cluster
Figure 4.6: Co-occurrence between feature clusters and origin clusters - The four most co-occurring origin clusters are returned for each feature cluster
Knowing that the distance between clusters 1 and 2 is the minimum and the distance between clusters 3 and 4 is the maximum, it can also be observed from figures 4.2 – 4.6 that the co-occurring terms for clusters 1 and 2 are similar, while those for clusters 3 and 4 are quite dissimilar, as shown in tables 4.6 and 4.7. This indicates that neighboring feature clusters share similar label clusters, while distant feature clusters do not.
Cluster 1                             vs   Cluster 2
(80s, 90s, 00s, 60s)                       (80s, 90s, 00s, 70s)
(chill, beautiful, romantic, happy)        (chill, romantic, beautiful, happy)
(hip hop, wave, smooth, indie)             (wave, indie, hip hop, soft)
(synth, guitar, drum, piano)               (synth, guitar, piano, drum)
(dc, american, german, british)            (dc, british, roma, german)

Table 4.6: Neighboring Clusters - Clusters with minimum Euclidean distance share similar label clusters
Cluster 3                               vs   Cluster 4
(80s, 20th century, 60s, 90s)                (80s, 90s, 00s, 70s)
(chill, romantic, beautiful, mellow)         (chill, happy, romantic, sexy)
(soundtrack, smooth, classic, indie)         (hip hop, wave, techno, progressive)
(piano, guitar, synth, drum)                 (synth, drum, guitar, bass)
(american, roma, german, los angeles)        (dc, british, german, roma)

Table 4.7: Distant Clusters - Clusters with maximum Euclidean distance have dissimilar label clusters
4.3.2 Correlated Terms vs. Uncorrelated Terms
Considering the opposite case, the author selected the four largest clusters from each label class and calculated their co-occurrence with every feature cluster, as shown in figures 4.7 – 4.11. In order to observe whether highly correlated label clusters can also be characterized by feature clusters, table 4.8 summarizes the most correlated terms for the four largest clusters of each label class, whereas table 4.9 shows the least correlated terms for the same clusters.
Using the histograms from figures 4.7 – 4.11, a 5-dimensional vector can be created for each term by taking the ratio of each feature cluster – e.g. the vector for the '80s' term is (Cluster 1, Cluster 2, Cluster 3, Cluster 4, Cluster 5) = (0.701, 1, 0.196, 0.727, 0.601). Using the same method, a total of 41 vectors are retrieved, one for every term in tables 4.8 and 4.9, and shown in table 4.10.
Using the relationships from tables 4.8 and 4.9 and the vectors in table 4.10, the Euclidean distance between each pair of vectors is calculated and shown in tables 4.11 and 4.12. As the average distances indicate, highly correlated terms share a similar combination of feature clusters, whereas weakly correlated terms do not. A sketch of this computation is given below.
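As a minimal sketch of building these term vectors and comparing them, assuming the per-feature-cluster co-occurrence counts are available; the toy counts below are chosen only so that their ratios match the published '80s' and '90s' vectors from table 4.10.

```python
# Sketch of building the 5-d term vectors of table 4.10 and comparing them by
# Euclidean distance.  `counts[term]` holds the term's intersection with each
# of the 5 feature clusters (toy values, not thesis data).
import numpy as np

def term_vector(counts):
    """Normalize co-occurrence counts by their maximum, as in table 4.10."""
    v = np.asarray(counts, dtype=float)
    return v / v.max()

counts = {"80s": [701, 1000, 196, 727, 601],   # ratios match the published vectors
          "90s": [827, 659, 185, 1000, 582]}
vectors = {t: term_vector(c) for t, c in counts.items()}
dist = np.linalg.norm(vectors["80s"] - vectors["90s"])
print(np.round(vectors["80s"], 3), round(float(dist), 3))
```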
Figure 4.7: Co-occurrence between era clusters and feature clusters - Intersecting feature clusters are returned for the four largest era clusters
Figure 4.8: Co-occurrence between emotion clusters and feature clusters - Intersecting feature clusters are returned for the four largest emotion clusters
Figure 4.9: Co-occurrence between genre clusters and feature clusters - Intersecting feature clusters are returned for the four largest genre clusters
[Figure omitted: four histograms, one per instrument cluster (synth, guitar, drum, piano), showing the number of songs (# of songs) shared with feature clusters 1 – 5]
Figure 4.10: Co-occurrence between instrument clusters and feature clusters - Intersecting feature clusters are returned for the four largest instrument clusters
[Figure omitted: four histograms, one per origin cluster (dc, british, german, american), showing the number of songs (# of songs) shared with feature clusters 1 – 5]
Figure 4.11: Co-occurrence between origin clusters and feature clusters - Intersecting feature clusters are returned for the four largest origin clusters
Era clusters
  80s:        chill, wave, synth, dc
  90s:        chill, hip hop, synth, dc
  60s:        beautiful, smooth, guitar, american
  00s:        chill, indie, synth, british
Emotion clusters
  chill:      80s, hip hop, synth, british
  romantic:   80s, wave, synth, roma
  happy:      80s, hip hop, synth, dc
  beautiful:  80s, indie, guitar, american
Genre clusters
  hip hop:    90s, chill, synth, dc
  wave:       80s, romantic, synth, roma
  smooth:     80s, chill, guitar, american
  indie:      00s, chill, synth, british
Instrument clusters
  synth:      80s, chill, wave, dc
  guitar:     80s, chill, indie, american
  drum:       90s, chill, hip hop, dc
  piano:      80s, beautiful, soundtrack, american
Origin clusters
  dc:         80s, happy, hip hop, synth
  british:    80s, chill, indie, synth
  german:     80s, chill, hip hop, synth
  american:   00s, chill, smooth, guitar
Table 4.8: Term Correlation - The highest correlated terms (one per remaining label class) for the four largest clusters from each label class
Era clusters
  80s:        evil, country, violin, cuba
  90s:        evil, ambient, violin, cuba
  60s:        calming, alternative, violin, belgium
  00s:        calming, ambient, saxophone, east coast
Emotion clusters
  chill:      50s, ballade, violin, cuba
  romantic:   00s, alternative, saxophone, belgium
  happy:      50s, alternative, saxophone, african
  beautiful:  50s, ambient, violin, african
Genre clusters
  hip hop:    20th century, gore, violin, england
  wave:       50s, calming, violin, belgium
  smooth:     20th century, brutal, violin, african
  indie:      50s, calming, saxophone, cuba
Instrument clusters
  synth:      20th century, evil, ambient, cuba
  guitar:     50s, calming, dance, belgium
  drum:       50s, calming, ambient, cuba
  piano:      50s, calming, ambient, african
Origin clusters
  dc:         20th century, calming, ambient, saxophone
  british:    50s, brutal, blues, saxophone
  german:     20th century, calming, alternative, violin
  american:   50s, brutal, ballade, violin
Table 4.9: Term Correlation - The least correlated terms (one per remaining label class) for the four largest clusters from each label class
Term           Vector (Cluster 1, Cluster 2, Cluster 3, Cluster 4, Cluster 5)
80s            (0.701, 1, 0.196, 0.727, 0.601)
90s            (0.827, 0.659, 0.185, 1, 0.582)
00s            (0.996, 0.910, 0.271, 0.938, 1)
60s            (0.834, 0.648, 0.519, 0.257, 1)
chill          (1, 0.651, 0.394, 0.749, 0.781)
romantic       (0.598, 1, 0.410, 0.646, 0.681)
beautiful      (0.594, 0.775, 0.322, 0.315, 1)
happy          (0.522, 0.559, 0.039, 1, 0.208)
hip hop        (0.829, 0.486, 0.048, 1, 0.231)
wave           (0.486, 1, 0.046, 0.617, 0.373)
indie          (0.494, 0.988, 0.074, 0.266, 1)
smooth         (1, 0.309, 0.463, 0.215, 0.714)
soundtrack     (0.722, 0.665, 1, 0.269, 0.763)
synth          (0.752, 1, 0.093, 0.866, 0.629)
guitar         (0.621, 0.946, 0.238, 0.424, 1)
piano          (0.526, 0.500, 1, 0.222, 0.741)
drum           (0.778, 0.465, 0.089, 1, 0.281)
dc             (0.571, 0.773, 0.07, 1, 0.373)
british        (0.622, 1, 0.130, 0.623, 0.687)
american       (0.964, 0.594, 0.543, 0.492, 1)
german         (1, 0.646, 0.434, 0.758, 0.701)
roma           (0.708, 0.567, 1, 0.489, 0.635)
50s            (0.823, 0.129, 0.635, 0.149, 1)
20th century   (0.154, 0.047, 1, 0.118, 0.346)
brutal         (0.535, 0.807, 0.193, 1, 0.790)
evil           (0.744, 0.744, 0.308, 0.231, 1)
calming        (0.563, 0.469, 0.109, 0.141, 1)
gore           (0.388, 1, 0, 0.6, 0.235)
ambient        (0.250, 0.568, 0.159, 0.114, 1)
alternative    (0.414, 1, 0.019, 0.543, 0.432)
ballad         (0.128, 1, 0.009, 0.284, 0.413)
blues          (0.899, 1, 0.203, 0.529, 0.544)
country        (0.139, 0.181, 0.042, 0.028, 1)
dance          (0.277, 0.185, 0, 1, 0.108)
violin         (0.190, 0.177, 0.658, 0.013, 1)
saxophone      (1, 0.325, 0.265, 0.262, 0.567)
african        (1, 0.644, 0.137, 0.824, 0.537)
belgium        (0.659, 0.319, 0.055, 1, 0.110)
cuba           (0.469, 0.047, 0, 1, 0.031)
east coast     (0.618, 0.056, 0.101, 0.034, 1)
england        (0.587, 1, 0.239, 0.635, 0.819)
Table 4.10: Term Vectors - 5-dimensional vectors are created for all the terms appearing in tables 4.8 and 4.9, using the histograms in figures 4.7 – 4.11
Cluster      Most correlated terms (distance)
80s          chill (0.1064), wave (0.0729), synth (0.0365), dc (0.0918)
00s          chill (0.0815), indie (0.1730), synth (0.0984), british (0.1208)
chill        80s (0.1064), hip hop (0.1472), synth (0.1115), british (0.1198)
beautiful    80s (0.1276), indie (0.0691), guitar (0.0442), american (0.1000)
hip hop      90s (0.0829), chill (0.1472), synth (0.1339), dc (0.0824)
indie        00s (0.1730), chill (0.1736), synth (0.1503), british (0.0990)
synth        80s (0.0365), chill (0.1115), wave (0.0896), dc (0.0821)
piano        80s (0.2192), beautiful (0.1569), soundtrack (0.0523), american (0.1483)
dc           80s (0.0918), happy (0.0553), hip hop (0.0824), synth (0.0821)
american     00s (0.1223), chill (0.0750), smooth (0.0995), guitar (0.1165)
90s          chill (0.0840), hip hop (0.0829), synth (0.0776), dc (0.0738)
60s          beautiful (0.0681), smooth (0.0957), guitar (0.0982), american (0.0550)
romantic     80s (0.0527), wave (0.0981), synth (0.0837), roma (0.1516)
happy        80s (0.1386), hip hop (0.0633), synth (0.1335), dc (0.0553)
wave         80s (0.0729), romantic (0.0981), synth (0.0896), roma (0.2220)
smooth       80s (0.1911), chill (0.1283), guitar (0.1704), american (0.0995)
guitar       80s (0.1024), chill (0.1278), indie (0.0528), american (0.1165)
drum         90s (0.0748), chill (0.1400), hip hop (0.0170), dc (0.0766)
british      80s (0.0339), chill (0.1198), indie (0.0990), synth (0.0568)
german       80s (0.1063), chill (0.0180), hip hop (0.1390), synth (0.1131)
Average distance: 0.1068 (first ten rows), 0.0970 (last ten rows)
Table 4.11: Label Cluster Distance - Euclidean distance between each pair of most correlated terms
Cluster      Least correlated terms (distance)
80s          evil (0.1393), country (0.2575), violin (0.2699), cuba (0.2366)
00s          calming (0.2043), ambient (0.2337), saxophone (0.1987), east coast (0.2622)
chill        50s (0.1755), ballade (0.2351), violin (0.2482), cuba (0.2390)
beautiful    50s (0.1543), ambient (0.0955), violin (0.1703), african (0.1661)
hip hop      20th century (0.3063), gore (0.1576), violin (0.3125), england (0.1831)
indie        50s (0.2168), calming (0.1079), saxophone (0.1918), cuba (0.3079)
synth        20th century (0.3303), evil (0.1616), ambient (0.2141), cuba (0.2343)
piano        50s (0.1314), calming (0.1865), ambient (0.1862), african (0.2362)
dc           20th century (0.3062), calming (0.2214), ambient (0.2307), saxophone (0.2005)
american     50s (0.1204), brutal (0.1617), ballade (0.2479), violin (0.2016)
90s          evil (0.1784), ambient (0.2282), violin (0.2836), cuba (0.1834)
60s          calming (0.1072), alternative (0.1954), violin (0.1692), belgium (0.2606)
romantic     00s (0.1221), alternative (0.1019), saxophone (0.1787), belgium (0.2044)
happy        50s (0.2816), alternative (0.1365), saxophone (0.2008), african (0.1240)
wave         50s (0.2706), calming (0.1910), violin (0.2755), belgium (0.1685)
smooth       20th century (0.2207), brutal (0.2153), violin (0.1827), african (0.1576)
guitar       50s (0.1941), calming (0.1145), dance (0.2743), belgium (0.2492)
drum         50s (0.2572), calming (0.2282), ambient (0.1619), cuba (0.1167)
british      50s (0.2347), brutal (0.0898), blues (0.0667), saxophone (0.1745)
german       20th century (0.2779), calming (0.1787), alternative (0.1743), violin (0.2507)
Average distance: 0.2110 (first ten rows), 0.1920 (last ten rows)
Table 4.12: Label Cluster Distance - Euclidean distance between each pair of least correlated terms
Chapter 5
Conclusion and Future Work
The goal of this study was to analyze the relationship among audio labels using a large dataset and a cluster validation method. As observed in the previous chapter, the validation rates among audio labels turn out to be low, implying no statistically significant correlation. In other words, songs described by a certain instrument are not distinctively characterized as originating from a certain place, arousing a specific emotion, or belonging to a certain genre. It is still interesting, however, that emotion and genre labels have the highest correlation, implying that certain genres do co-occur with specific emotion terms in some cases.
Regarding the low cluster validation rates between audio feature clusters and label classes, it was found that the extracted features do not successfully characterize songs on the basis of any single label class, since certain label clusters co-occur with all five feature clusters. Nevertheless, some correlation between feature clusters and label clusters does exist: neighboring feature clusters tend to contain similar label clusters, and highly correlated label clusters also share similar combinations of feature clusters. For future work, it would be meaningful to find out which combination of feature clusters or label clusters maximizes cluster validation, which could lead to a better understanding of the relationship between features and labels.
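As an illustration of this proposed future work, the sketch below scores every pair of label classes with the Hubert-Arabie adjusted Rand index (available in scikit-learn as adjusted_rand_score) and reports the pair with the highest agreement; the partitions shown are hypothetical placeholders for the cluster assignments obtained from the Million Song Dataset.

# Sketch of scoring label-class pairs with the Hubert-Arabie adjusted Rand
# index; the partitions below are hypothetical placeholders.
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

partitions = {
    "genre":   [0, 0, 1, 1, 2, 2],
    "emotion": [0, 0, 1, 1, 2, 1],
    "origin":  [2, 0, 1, 0, 2, 1],
}

scores = {
    (first, second): adjusted_rand_score(partitions[first], partitions[second])
    for first, second in combinations(partitions, 2)
}
for pair, score in scores.items():
    print(pair, round(score, 3))
print("highest agreement:", max(scores, key=scores.get))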