Audio Engineering Society Convention Paper
Presented at the 115th Convention, 2003 October 10–13, New York, NY, USA

This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

Automatic classification of large musical instrument databases using hierarchical classifiers with inertia ratio maximization

Geoffroy Peeters (IRCAM, 1 pl. Igor Stravinsky, 75004 Paris, France)
Correspondence should be addressed to Geoffroy Peeters ([email protected])

ABSTRACT
This paper addresses the problem of classifying large databases of musical instrument sounds. An efficient algorithm, called IRMFSP, is proposed for selecting the signal features most appropriate for a given classification task; it is based on the maximization of the ratio of the between-class inertia to the total inertia, combined with a step-wise orthogonalization of the feature space. Several classifiers - flat gaussian, flat KNN, hierarchical gaussian, hierarchical KNN and decision-tree classifiers - are compared on the task of large-database classification. Particular attention is paid to the case where the classification system is trained on one database and used for the classification of another database, possibly recorded in completely different conditions. The highest recognition rates are obtained with the hierarchical gaussian and KNN classifiers. The organization of the instrument classes is studied through an MDS analysis derived from the acoustic features of the sounds.

1. INTRODUCTION

During the last decades, sound classification has been the subject of many research efforts [27] [3] [17] [29]. However, few of them address the generalization of the sound-source recognition system, i.e. its applicability to several instances of the same source, possibly recorded in different conditions and with various instrument manufacturers and players. In this context, Martin [16] reports only a 39% recognition rate for individual instruments (76% for instrument families), using the output of a log-lag correlogram for 14 different instruments. Eronen [4] reports a 35% (77%) recognition rate using mainly MFCCs and some other features for 16 different instruments.

Sound classification systems rely on the extraction of a set of signal features (such as energy, spectral centroid, ...) from the signal. This set is then used to perform classification according to a given taxonomy. The taxonomy is defined by a set of textual attributes describing the properties of the sound, such as its source (speaker gender, music genre, sound-effects class, instrument name...) or its perception (bright, dark...). The choice of features depends on the targeted application (speech/music/noise discrimination, speaker identification, sound-effects recognition, musical instrument recognition).
The most appropriate set of features can be selected a priori - using prior knowledge of the discriminative power of each feature for the given task - or a posteriori, by including an automatic feature selection algorithm in the system. In our system, in order to cover a large set of potential taxonomies, we have implemented a large set of features, which is then filtered automatically by a feature selection algorithm.

Because sound is a phenomenon that changes over time, features are computed over time (frame-by-frame analysis). The set of temporal features can be used directly for classification [3], or the temporal evolution of the features can be modeled. This modeling can be done before the modeling of the classes (using mean, standard deviation, derivative values, modulation or polynomial representations [29]) or during the modeling of the classes (using, for example, a Hidden Markov Model [30]). In our system, temporal modeling is done before class modeling.

The last major difference between classification systems concerns the choice of model used to represent the classes of the taxonomy (multi-dimensional gaussian, gaussian mixture, KNN, NN, decision tree, SVM...).

System performance is generally evaluated, after training on a subset of a database, on the rest of that database. However, since most of the time a single database contains a single instance of each instrument (the same instrument played by the same player in the same recording conditions), this kind of evaluation does not prove that the system is applicable to sounds which do not belong to the database. In particular, the system may fail to recognize sounds recorded in completely different conditions. In this paper we evaluate such performances.

2. FEATURE EXTRACTION

Many different types of signal features have been proposed for the task of sound recognition, coming from the speech recognition community, from previous studies on musical instrument sound classification [27] [3] [17] [29] [13] and from the results of psycho-acoustical studies [14] [24]. In order to cover a large set of potential taxonomies, a large set of features has been implemented, including features related to the

• temporal shape of the signal (attack time, temporal increase/decrease, effective duration),
• harmonic content (harmonic/noise ratio, odd-to-even and tristimulus harmonic energy ratios, harmonic deviation),
• spectral shape (centroid, spread, skewness, kurtosis, slope, roll-off frequency, variation),
• perceptual properties (relative specific loudness, sharpness, spread, roughness, fluctuation strength),
• Mel-Frequency Cepstral Coefficients (plus Delta and DeltaDelta coefficients), auto-correlation coefficients, zero-crossing rate, as well as some MPEG-7 Low Level Audio Descriptors (spectral flatness and crest factors [22]).

See [12] for a review.

3. FEATURE SELECTION

Using a high number of features for classification can cause several problems: 1) poor classification results, because some features are irrelevant for the given task; 2) over-fitting of the model to the training set (this is especially true when using, without care, data-reduction techniques such as Linear Discriminant Analysis); 3) models that are difficult for humans to interpret. For this reason, feature selection algorithms attempt to detect a minimal set of features that are 1) informative with respect to the classes and 2) non-redundant with respect to each other.
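As an illustration of the extraction stage described in section 2, the following Python sketch computes one frame-wise feature (the spectral centroid) and summarizes its temporal evolution with a few statistics before class modeling. This is a minimal sketch, not the author's implementation: the frame and hop sizes, the choice of the centroid as the example feature and the mean/std/slope summary are illustrative assumptions.

```python
# Sketch of frame-by-frame feature extraction followed by temporal modeling.
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def spectral_centroid(frames, sr):
    spec = np.abs(np.fft.rfft(frames, axis=1))             # magnitude spectrum per frame
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)   # bin frequencies in Hz
    return (spec * freqs).sum(axis=1) / (spec.sum(axis=1) + 1e-12)

def temporal_model(track):
    # Summarize the temporal evolution of one feature before class modeling.
    slope = np.polyfit(np.arange(len(track)), track, 1)[0]
    return {"mean": track.mean(), "std": track.std(), "slope": slope}

sr = 44100
x = np.random.randn(sr)                      # stand-in for one instrument sound
print(temporal_model(spectral_centroid(frame_signal(x), sr)))
```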
3.1. Inertia Ratio Maximization using Feature Space Projection (IRMFSP)

Feature selection algorithms (FSA) can take three main forms (see [21]):

• embedded: the FSA is part of the classifier;
• filter: the FSA is distinct from the classifier and applied before it;
• wrapper: the FSA makes use of the classification results.

The FSA we propose belongs to the filter techniques. Considering a gaussian classifier, the first criterion for feature selection can be expressed as follows: "feature values for sounds belonging to a specific class should be separated from the values for all the other classes". If this is not the case, the gaussian pdfs will overlap and class confusion will increase. Mathematically, this amounts to looking for features that maximize the ratio r of the between-class inertia B to the total inertia T. For a specific feature f_i, r is defined as

r = \frac{B}{T} = \frac{\sum_{k=1}^{K} \frac{N_k}{N} (m_{i,k} - m_i)(m_{i,k} - m_i)'}{\frac{1}{N} \sum_{n=1}^{N} (f_{i,n} - m_i)(f_{i,n} - m_i)'}   (1)

where N is the total number of data, N_k is the number of data belonging to class k, m_i is the center of gravity of feature f_i over the whole data set, and m_{i,k} is the center of gravity of feature f_i for the data belonging to class k. A feature f_i with a high value of r is therefore a feature for which the classes are well separated with respect to their within-class spread.

The second criterion should take into account the fact that a feature with a high value of r may bring the same information as an already selected feature and is therefore redundant. While other FSAs, such as CFS [10] (see footnote 1), use a weight based on the correlation between the candidate feature and the already selected features, the IRMFSP algorithm applies an orthogonalization step after the selection of each new feature f_i. If we denote by F the feature space (the space where each axis represents a feature), by f_i the last selected feature and by g_i its normalized form (g_i = f_i / ||f_i||), we project F on g_i and keep

f'_j = f_j - (f_j \cdot g_i)\, g_i, \quad \forall j \in F   (2)

This process (ratio maximization followed by space projection) is repeated until the gain of adding a new feature becomes too small. The gain is measured by the ratio of r_l, obtained at the l-th iteration, to r_1, obtained at the first iteration. A stopping criterion of t = r_l / r_1 < 0.01 has been chosen.

Fig. 1 illustrates the result of the IRMFSP algorithm for a two-class taxonomy: the separation between sustained and non-sustained sounds. Sounds are represented along the first three selected dimensions: temporal decrease (1st dim), spectral centroid (2nd dim) and temporal increase (3rd dim). In part 6.2, the CFS and IRMFSP algorithms are compared.

Footnote 1: In the CFS algorithm (Correlation-based Feature Selection), the information brought by a specific feature is computed using the symmetrical uncertainty (normalized mutual information) between the discretized features and the classes. The second criterion (feature independence) is taken into account by selecting a new feature only if its cumulated correlation with the already selected features is not too large.

4. FEATURE TRANSFORMATION

In the following, two feature transformation algorithms (FTA) are considered.
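The following is a minimal sketch of the IRMFSP loop described above, not the author's code: compute the ratio r of equation (1) for every feature, keep the maximizing feature, orthogonalize the remaining feature vectors against it as in equation (2), and stop when r_l / r_1 falls below t. The numerical guards and the per-feature (scalar) form of the inertias are assumptions.

```python
import numpy as np

def irmfsp(F, y, t=0.01, max_feats=20):
    """F: (N, D) feature matrix, y: (N,) class labels. Returns selected feature indices."""
    F = F.astype(float).copy()
    classes = np.unique(y)
    selected, r_first = [], None
    for _ in range(max_feats):
        m = F.mean(axis=0)                                   # centers of gravity m_i
        B = np.zeros(F.shape[1])
        for k in classes:
            Fk = F[y == k]
            B += (len(Fk) / len(F)) * (Fk.mean(axis=0) - m) ** 2   # between-class inertia
        T = ((F - m) ** 2).mean(axis=0) + 1e-12              # total inertia
        r = B / T                                            # ratio of eq. (1), per feature
        if selected:
            r[selected] = -np.inf                            # never re-select a feature
        j = int(np.argmax(r))
        if r_first is None:
            r_first = r[j]
        elif r[j] / r_first < t:                             # stopping criterion r_l / r_1 < t
            break
        selected.append(j)
        g = F[:, j] / (np.linalg.norm(F[:, j]) + 1e-12)      # normalized selected feature g_i
        F = F - np.outer(g, g @ F)                           # eq. (2): project out g_i from all features
    return selected
```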
[Fig. 1: First three dimensions selected by the IRMFSP algorithm for the sustained / non-sustained sounds taxonomy (scatter plot of the sounds along the temporal decrease, spectral centroid and temporal increase dimensions).]

4.1. Box-Cox Transformation

Classification models based on gaussian distributions make the underlying assumption that the modeled data (in our case the signal features) follow a gaussian probability density function (pdf). This is rarely verified by features extracted from the signal. Therefore a first FTA, a non-linear transformation known as the "Box-Cox transformation" [2], can be applied to each feature individually in order to make its pdf fit a gaussian pdf as closely as possible. The set of considered non-linear functions, parameterized by λ, is defined as

f_\lambda(x) = \begin{cases} \frac{x^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(x) & \text{if } \lambda = 0 \end{cases}   (3)

For a specific value of λ, the gaussianity of f_λ(x) is measured by the correlation factor between the percent point function (ppf, the inverse of the cumulative distribution) of f_λ(x) and the theoretical ppf of a gaussian function. For each feature x, we find the best non-linear function (the best value of λ), defined as the one with the largest gaussianity.
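A small sketch of this step using SciPy: for a given feature, pick the λ that maximizes the correlation of a normal probability plot (close in spirit to the ppf-correlation criterion described above) and apply the transform of equation (3). The synthetic lognormal feature is only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
feature = rng.lognormal(mean=0.0, sigma=0.8, size=500)   # a clearly non-gaussian feature

# lambda maximizing the Pearson correlation of the normal probability plot
lam = stats.boxcox_normmax(feature, method='pearsonr')
transformed = stats.boxcox(feature, lmbda=lam)           # eq. (3) with the chosen lambda

# Compare gaussianity before / after via the probability-plot correlation coefficient
_, (_, _, r_before) = stats.probplot(feature, dist="norm")
_, (_, _, r_after) = stats.probplot(transformed, dist="norm")
print(f"lambda={lam:.2f}  correlation before={r_before:.3f}  after={r_after:.3f}")
```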
4.2. Linear Discriminant Analysis

The second FTA is Linear Discriminant Analysis (LDA), which was proposed by [17] in the context of musical instrument sound classification and evaluated successfully in our previous classifier [25]. LDA finds linear combinations of the features that maximize the discrimination between classes. From the initial feature space F (or a selected feature space F'), a new feature space G of smaller dimension than F is obtained. In our current classification system, LDA (when performed) is used between the feature selection algorithm and the class modeling (see Fig. 2).

[Fig. 2: Classification system flowchart: feature extraction, temporal modeling, feature transformation (Box-Cox), feature selection (IRMFSP), feature transformation (LDA), class modeling.]

5. CLASS MODELING

Among the various existing classifiers (multi-dimensional gaussian, gaussian mixture, KNN, NN, decision tree, SVM...) (see [12] for a review), only the gaussian and KNN classifiers (and their hierarchical formulations) and decision-tree classifiers have been considered.

5.1. Flat Classifiers

5.1.1. Flat gaussian classifier (F-GC)

A flat gaussian classifier models each class k by a multi-dimensional gaussian pdf. The parameters of the pdf (mean µ_k and covariance matrix Σ_k) are estimated by maximum likelihood from the selected features of the sounds belonging to class k. The term "flat" is used because all classes are considered on the same level. In order to evaluate the probability that a new sound belongs to class k, Bayes' formula is used:

p(k|f) = \frac{p(f|k)\, p(k)}{p(f)} = \frac{p(f|k)\, p(k)}{\sum_{k} p(f|k)\, p(k)}   (4)

where p(k) is the prior probability of observing class k, p(f) is the distribution of the feature vector f, and p(f|k) is the conditional probability of observing the feature vector given class k (the estimated gaussian pdf). The training and evaluation process of a flat gaussian classifier system is illustrated in Fig. 3.

[Fig. 3: Flat gaussian classifier. Training: feature selection, feature transformation (LDA matrix), gaussian pdf parameter estimation for each class. Evaluation: use of the selected features, application of the LDA matrix, evaluation of Bayes' formula for each class.]

5.1.2. Flat KNN classifier (F-KNN)

K Nearest Neighbors (KNN) is one of the most straightforward algorithms for data classification. KNN is an instance-based algorithm: the positions of the training data in the feature space F (or in the selected feature space F') are simply stored, without modeling, along with the corresponding classes. For an input sound located in F, the K closest training data (the K nearest neighbors) are found, and the majority class among these K neighbors is assigned to the input sound. A Euclidean distance is commonly used to find the K closest data; the weighting of the axes of the space F (the weighting of the features) can therefore change which data are closest. In the rest of this study, when KNN classifiers are used, the weighting of the axes is done implicitly since KNN is applied to the output space of the LDA transformation (LDA finds the optimal weights for the axes of the feature space G). The number of nearest neighbors, K, also plays an important role in the result. In the rest of this study, the results are given for K=10, which yielded the best results in our case.

5.2. Hierarchical Classifiers

5.2.1. Hierarchical gaussian classifier (H-GC)

A hierarchical gaussian classifier is a tree of flat gaussian classifiers: each node of the tree is a flat gaussian classifier with its own feature selection (IRMFSP), its own LDA and its own gaussian pdfs. Hierarchical classifiers were used by [17] for the classification of 14 instruments (derived from the McGill Sound Library), using a hierarchical KNN classifier and Fisher multiple discriminant analysis combined with a gaussian classifier. During training, only the subset of sounds belonging to the classes of the current node is used (for example, the bowed-string node is trained using only bowed-string sounds, the brass node using only brass sounds). During evaluation, the maximum local probability p(k|f) at each node decides which branch of the tree to follow; the process is pursued until a leaf of the tree is reached. Contrary to binary trees, the construction of the tree structure of an H-GC is supervised and requires prior knowledge of the class organization (the oboe belongs to the double-reed family, which belongs to the sustained sounds).

Advantages of hierarchical gaussian classifiers (H-GC) over flat gaussian classifiers (F-GC):

Learning facilities: learning an H-GC (feature selection and gaussian pdf parameter estimation) is easier, since it is easier to characterize the differences within a small subset of classes (learning the differences between brass instruments only is easier than learning them across the whole set of classes).

Reduced class confusion: in an F-GC, all classes are represented on the same level and are thus neighbors in the same multi-dimensional feature space. Therefore, annoying class confusions, such as confusing an "oboe" sound with a "harp" sound, are likely to occur. In an H-GC, because of the hierarchy and the high recognition rate at the higher levels of the tree (such as the non-sustained/sustained node), this kind of confusion is unlikely to occur.
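A minimal sketch of the hierarchical gaussian classification described in sections 5.1.1 and 5.2.1: each node stores one gaussian per child branch, fit on that node's own selected features, and the evaluation follows the branch with the highest posterior of equation (4) down to a leaf. The Node structure, the covariance regularization term and the feature-index handling are illustrative assumptions, not the author's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

class Node:
    def __init__(self, feat_idx, children):
        self.feat_idx = list(feat_idx)   # features selected for this node (e.g. by IRMFSP)
        self.children = children         # dict: branch label -> Node or leaf name (str)
        self.models, self.priors = {}, {}

    def fit(self, F, branch_labels):
        """F: (N, D) features of the sounds reaching this node; branch_labels: branch of each sound."""
        for k in self.children:
            Fk = F[branch_labels == k][:, self.feat_idx]
            cov = np.cov(Fk, rowvar=False) + 1e-6 * np.eye(len(self.feat_idx))  # regularized covariance
            self.models[k] = multivariate_normal(Fk.mean(axis=0), cov)
            self.priors[k] = np.mean(branch_labels == k)

    def classify(self, f):
        """Follow the branch with the highest posterior p(k|f) (eq. 4), down to a leaf."""
        post = {k: self.models[k].pdf(f[self.feat_idx]) * self.priors[k] for k in self.children}
        best = max(post, key=post.get)
        child = self.children[best]
        return child.classify(f) if isinstance(child, Node) else child

# A tree such as Node(root_feats, {"sust": Node(...), "nonsust": Node(...)}) would be
# built from the (known) taxonomy and fit top-down, each node on its own subset of sounds.
```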
The training and evaluation process of a hierarchical gaussian classifier system is illustrated in Fig. 4; the gray/white box connected to each node of the tree is the same as the one of Fig. 3.

[Fig. 4: Hierarchical classifier (tree of nodes, each containing the training/evaluation box of Fig. 3).]

5.2.2. Hierarchical KNN classifier (H-KNN)

In a hierarchical KNN, at each level of the tree only the locally selected features and the locally considered classes are taken into account for training. The training and evaluation process of a hierarchical KNN classifier system is also illustrated in Fig. 4.

5.3. Decision Tree Classifiers

Decision trees operate by asking the data a sequence of questions in which the next question depends on the answer to the current one. Since they are based on questions, they can operate on both numerical and non-numerical data.

5.3.1. Binary Entropy Reduction Tree (BERT)

A binary tree recursively decomposes the set of data into two subsets so that each subset belongs as much as possible to a single class. The decomposition is driven by a split criterion. In the binary tree considered here, the split criterion operates on a single variable: it decides, with respect to the feature value, which branch of the tree to follow (if feature < threshold, take the left branch; if feature ≥ threshold, take the right branch). In order to automatically determine the best feature and the best threshold value at each node, mutual information and binary entropy are used, as in [5]. The mutual information between a feature X, for a threshold t, and the classes C can be expressed as

I(X, C) = H_2(X|t) - H_2(X|C, t)   (5)

where H_2(X|C, t) is the binary entropy given the classes C and the threshold t,

H_2(X|C, t) = \sum_{k} p(C_k)\, H_2(X|C_k, t)   (6)

and H_2(X|t) is the binary entropy given the threshold t,

H_2(X|t) = -p(x)\log_2 p(x) - (1 - p(x))\log_2 (1 - p(x))   (7)

where p is the probability that x < t and (1 - p) the probability that x ≥ t. The best feature and the best threshold value at each node are the ones for which I(X, C) is maximum.

Pre-pruning of the tree: the tree construction is stopped when the gain of adding a new split is too small. The stopping criterion used in [5] is the mutual information weighted by the local mass inside the current node j:

\frac{N_j}{N}\, I_j(X, C)   (8)

In part 6.2, the results obtained with our Binary Entropy Reduction Tree (BERT) are compared with those of two other widely used decision-tree algorithms, C4.5 [26] and the Partial Decision Tree (PART) [7].
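The split criterion of equations (5)-(7) can be sketched as follows: for a candidate feature and threshold, compute the mutual information between the binary outcome x < t and the classes, and keep the threshold that maximizes it. The choice of candidate thresholds (midpoints between sorted feature values) is an assumption, not a detail given in the paper.

```python
import numpy as np

def h2(p):
    """Binary entropy of a probability p, eq. (7)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def mutual_information(x, y, t):
    """I(X, C) of eq. (5) for feature values x, class labels y and threshold t."""
    h_x = h2(np.mean(x < t))                                  # H2(X|t)
    h_xc = 0.0
    for k in np.unique(y):
        xk = x[y == k]
        h_xc += (len(xk) / len(x)) * h2(np.mean(xk < t))      # sum_k p(C_k) H2(X|C_k, t), eq. (6)
    return h_x - h_xc

def best_split(x, y):
    """Return the threshold maximizing I(X, C) for one feature."""
    xs = np.unique(x)
    thresholds = (xs[:-1] + xs[1:]) / 2                       # candidate thresholds (assumption)
    scores = [mutual_information(x, y, t) for t in thresholds]
    i = int(np.argmax(scores))
    return thresholds[i], scores[i]
```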
6. EVALUATION

6.1. Methodology

6.1.1. Evaluation process

For the evaluation of the models, three methods have been used.

The first evaluation method is a random 66%/33% partition of the database: 66% of the sounds of each class are randomly selected to train the system, and the evaluation is performed on the remaining 33%. In this case, the result is given as the mean value over 50 random sets.

The second and third evaluation methods were proposed by Livshin [15] for the evaluation of large-database classification, and especially for testing the applicability of a system trained on one database to the recognition of another database.

The second method, called O2O (One to One), uses each database in turn to train the system and measures the recognition rate on each of the remaining databases. If we denote the databases A, B and C, the training is performed on A and used for the evaluation of B and C; then the training is performed on B and used for the evaluation of A and C; and so on.

The third method, called LODO (Leave One Database Out), uses all databases for training except one, which is used for the evaluation. Each database is left out in turn: the training is performed on A+B and used for the evaluation of C; then the training is performed on A+C and used for the evaluation of B; and so on.
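A minimal sketch of the LODO procedure: train on all databases but one, evaluate on the held-out one, and repeat for every database. The `databases` dict of (features, labels) pairs and the generic fit/predict classifier interface are illustrative assumptions.

```python
import numpy as np

def lodo(databases, make_classifier):
    """databases: dict name -> (X, y). Returns per-database and mean recognition rates."""
    rates = {}
    for held_out in databases:
        # train on every database except the held-out one
        X_tr = np.vstack([databases[d][0] for d in databases if d != held_out])
        y_tr = np.concatenate([databases[d][1] for d in databases if d != held_out])
        X_te, y_te = databases[held_out]
        clf = make_classifier()
        clf.fit(X_tr, y_tr)
        rates[held_out] = np.mean(clf.predict(X_te) == y_te)
    return rates, np.mean(list(rates.values()))

# e.g. rates, mean_rate = lodo(dbs, lambda: KNeighborsClassifier(n_neighbors=10))
```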
6.1.2. Taxonomy used

The instrument taxonomy used in the experiments is represented in Fig. 5.

[Fig. 5: Instrument taxonomy used for the experiment.]

In the following experiments we consider taxonomies at three different levels:

1. a 2-class taxonomy, sustained / non-sustained sounds, called T1 in the following;
2. a 7-class taxonomy corresponding to the instrument families: struck strings, plucked strings, pizzicato strings, bowed strings, brass, air reeds, single/double reeds, called T2 in the following;
3. a 27-class taxonomy corresponding to the instrument names: piano, guitar/harp, pizzicato violin/viola/cello/double-bass, bowed violin/viola/cello/double-bass, trumpet/cornet/trombone/French horn/tuba, flute/piccolo/recorder, oboe/bassoon/English horn/clarinet/accordion/alto sax/soprano sax/tenor sax, called T3 in the following.

This taxonomy is of course subject to discussion, especially the piano, which is here supposed to belong to the non-sustained family, and the inclusion of all the saxophones in the same family as the oboe.

6.1.3. Test set

Six different databases were used for the evaluation of the models:

• the Ircam Studio OnLine database [1] (1323 sounds, 16 instruments),
• the Iowa University database [8] (816 sounds, 12 instruments),
• the McGill University database [23] (585 sounds, 23 instruments),
• sounds extracted from the Microsoft "Musical Instruments" CD-ROM [19] (216 sounds, 20 instruments),
• two commercial databases, the Pro (532 sounds, 20 instruments) and Vi (691 sounds, 18 instruments) databases,

for a total of 4163 sounds. It is important to note that a large pitch range has been considered for each instrument (4 octaves on average). Conversely, not all the sounds from each database have been used: in order to limit the number of classes, the muted sounds, the martelé/staccato sounds and some more specific playing styles have not been considered. The instrument distribution of each database is depicted in Fig. 6.

[Fig. 6: Instrument distribution of the six databases.]

6.2. Results

6.2.1. Comparison of feature selection algorithms

In Table 1, we compare the results of our previous classification system [25] (based on Linear Discriminant Analysis applied to the whole set of features, combined with a flat gaussian classifier) with the results obtained with the flat gaussian classifier applied directly (without feature transformation) to the output of the two feature selection algorithms, CFS and IRMFSP. The results are given for the Studio OnLine database for taxonomies T1, T2 and T3. Evaluation is performed using the 66%/33% paradigm with 50 random sets.

Table 1: Comparison of feature selection algorithms in terms of recognition rate, mean (standard deviation)

                                    T1           T2           T3
  LDA                               96           89           86
  CFS (Weka)                        99.0 (0.5)   93.2 (0.8)   60.8 (12.9)
  IRMFSP (t=0.01, max. 20 feat.)    99.2 (0.4)   95.8 (1.2)   95.1 (1.2)

Discussion: comparing the result obtained with our previous classifier (LDA) with the result obtained with the IRMFSP algorithm, we see that a good feature selection algorithm not only reduces the number of features but also increases the recognition rate. Comparing the results obtained with CFS and IRMFSP, we see that IRMFSP performs better than CFS at T3. Since the number of classes is larger at T3, the number of required features is also larger and feature redundancy is more likely to occur; CFS fails at T3, perhaps because of this potentially high feature redundancy.

6.2.2. Comparison of classification algorithms for cross-database classification

In Table 2, we compare the recognition rates obtained with the O2O evaluation method for the flat gaussian (F-GC) and hierarchical gaussian (H-GC) classifiers. The results are given as mean values over the 30 (6×5) O2O experiments (six databases). Feature transformation algorithms (Box-Cox and LDA) are not used here, considering that the number of data inside each single database is too small for a correct estimation of the FTA parameters. Features have been selected using the IRMFSP algorithm with a stopping criterion of t ≤ 0.01 and a maximum of 10 features per node.

Table 2: Comparison of flat and hierarchical gaussian classifiers using the O2O methodology

           T1   T2   T3
  F-GC     89   57   30
  H-GC     93   63   38

Discussion: compared with the results of Table 1, we see that a good result obtained with a flat gaussian classifier using the 66%/33% paradigm on a single database does not prove the applicability of the system to the recognition of another database (30% with F-GC at the T3 level). This is partly explained by the fact that each database contains a single instance of each instrument (the same instrument played by the same player in the same recording conditions). The system therefore mainly learns the instance of the instrument rather than the instrument itself, and is unable to recognize another instance of it. The results obtained with H-GC are higher than those obtained with F-GC (38% at the T3 level). This can be partly explained by the fact that, in an H-GC, the lower levels of the tree benefit from the classification results of the higher levels. Since the number of instances used for training at the higher levels is larger (at the T2 level, each family is composed of several instruments, hence several instances of the family), the training of the higher levels generalizes better and the lower levels benefit from this.
Not indicated here are the recognition rates of each individual O2O experiment. These results show that when the training is performed on the Vi, McGill or Pro database, the model is applicable to the recognition of most of the other databases. On the other hand, when the training is performed on the Iowa database, the model is poorly applicable to the other databases.

6.2.3. Comparison of classification algorithms for large database classification

In order to increase the number of instrument instances available for modeling, several databases can be combined, as in the LODO evaluation method. In Table 3, we compare the recognition rates obtained with the LODO evaluation method for the

• flat classifiers: flat gaussian (F-GC) and flat KNN (F-KNN),
• hierarchical classifiers: hierarchical gaussian (H-GC) and hierarchical KNN (H-KNN),
• decision-tree classifiers: Binary Entropy Reduction Tree (BERT), C4.5 and PART.

The results are given as mean values over the six left-out databases. For the flat and hierarchical classifiers (F-GC, F-KNN, H-GC and H-KNN), features have been selected using the IRMFSP algorithm with a stopping criterion of t ≤ 0.01 and a maximum of 40 features per node. For F-KNN and H-KNN, LDA has been applied at each node in order to maximize class separation and to obtain a proper weighting of the KNN axes.

Table 3: Comparison of flat, hierarchical and decision-tree classifiers using the LODO methodology

                        T1   T2   T3
  F-GC                  98   78   55
  F-GC (BC+LDA)         99   81   54
  F-KNN (K=10, LDA)     99   77   51
  H-GC                  98   80   57
  H-GC (BC+LDA)         99   85   64
  H-KNN (K=10, LDA)     99   84   64
  BERT                  95   65   42
  C4.5                  -    65   48
  PART                  -    71   42

Comparing O2O and LODO results: as expected, the recognition rate increases with the number of instances of each instrument used for training (for F-GC at the T3 level: 30% using O2O versus 53% using LODO; for H-GC at the T3 level: 38% using O2O versus 57% using LODO).

Comparing flat and hierarchical classifiers: the best results are obtained with the hierarchical classifiers, both H-GC and H-KNN. Table 3 also shows the effect of applying the feature transformation algorithms (Box-Cox and LDA) for both F-GC and H-GC; in the case of H-GC, it increases the recognition rate from 57% to 64%. It is commonly held that, among classifiers, KNN provides the highest recognition rates. In our case, however, H-KNN and H-GC (when combined with feature transformation) provide very similar results: H-KNN gives T1=99%, T2=84%, T3=64%, and H-GC gives T1=99%, T2=85%, T3=64%.

Decision-tree algorithms: using decision-tree classifiers surprisingly yields poor results, even when post-pruning techniques (such as those of C4.5) are used. This is surprising considering the high recognition rates obtained by [11] for the task of unpitched percussion sound recognition. It tends to favor the use of "smooth" classifiers (based on probability) over "hard" classifiers (based on Boolean boundaries) for the task of musical instrument sound recognition. Among the tested decision-tree classifiers, the best results were obtained with the Partial Decision Tree algorithm at the T2 level (T2=71%) and with C4.5 at the T3 level (T3=48%).

6.3. Instrument Class Similarity

For the training of the hierarchical classifiers, the construction of the tree structure is supervised and based on prior knowledge of class proximity (for example, the violin is close to the viola but far from the piano). It is therefore interesting to verify whether the assumed structure used in the experiments (see Fig. 5) corresponds to a "natural" organization of the sound classes. Several approaches can be considered in order to check the assumed structure, such as the analysis of the class distribution among the leaves of a decision tree.
Herrera proposed in [11] an interesting method for estimating similarities and differences between instrument classes in the case of unpitched percussion sounds. This method estimates a two-dimensional map by multi-dimensional scaling analysis of a similarity matrix computed between class parameters.

Multi-Dimensional Scaling (MDS) represents a set of data, observed through their dissimilarities, in a low-dimensional space such that the distances between the data are preserved as much as possible. MDS has been used to represent the underlying perceptual dimensions of musical instrument sounds in a low-dimensional space [20] [9] [18]. In these studies, listeners were asked for dissimilarity judgements on pairs of sounds, and MDS was then used to represent the stimuli in a lower-dimensional space. In [18], a three-dimensional space was found for musical instrument sounds, with the three axes assigned to the attack time, the brightness and the spectral flux of the sounds. In [11], the MDS representation is derived from the acoustic features (signal features) instead of dissimilarity judgements. A similar approach is followed here for the case of musical instrument sounds.

Our classification system has been trained using a flat gaussian classifier (without any assumption about class proximity) and the whole set of databases. The result of this training is a representation of each instrument class in terms of acoustic parameters (a mean vector and a covariance matrix for each class). The "between-groups F-matrix" is computed from the class parameters and used as an index of similarity between classes. An MDS analysis (using Kruskal's STRESS formula 1 scaling method) is then performed on this similarity matrix.
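This class-similarity analysis can be sketched as follows with scikit-learn: build a between-class dissimilarity matrix and embed it in a low-dimensional space with metric MDS. Here the dissimilarity is simply the Euclidean distance between class centroids, a stand-in for the between-groups F-matrix and the Kruskal STRESS-1 scaling used in the paper.

```python
import numpy as np
from sklearn.manifold import MDS

def class_map(F, y, n_dims=3):
    """F: (N, D) features, y: (N,) class labels. Returns a dict class -> low-dim coordinates."""
    classes = np.unique(y)
    means = np.stack([F[y == k].mean(axis=0) for k in classes])
    # pairwise distances between class centroids used as dissimilarities (illustrative choice)
    diss = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=-1)
    mds = MDS(n_components=n_dims, dissimilarity='precomputed', random_state=0)
    coords = mds.fit_transform(diss)
    return dict(zip(classes, coords))
```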
The result of the MDS analysis is a three-dimensional space represented in Fig. 7. The instrument-name abbreviations used in Fig. 7 are explained in Table 4.

[Fig. 7: Multi-dimensional scaling solution for musical instrument sounds: two different angles of view of the 3-dimensional map.]

Since this low-dimensional space is supposed to preserve (as much as possible) the similarity between the various instrument classes, it should allow identifying a possible class organization. Dimension 1 separates the non-sustained sounds, on the negative values (PIAN, GUI, HARP, VLNP, VLAP, CELLP, DBLP), from the sustained sounds, on the positive values; it therefore seems to be associated with both the attack time and the decrease time. Dimension 2 could be associated with brightness, since it separates some dark sounds (TUBB, BSN, TBTB, FHOR) from some bright sounds (PICC, CLA, FLTU), although some sounds, such as the DBL, contradict this assumption. Dimension 3 remains unexplained, except that it separates the bowed strings (VLN, VLA, CELL, DBL) from the other instruments and could therefore be related to the amount of modulation of the sounds.

Several clusters are observed in Fig. 7: the bowed-string sounds (VLN, VLA, CELL, DBL), the brass sounds (TBTB, FHOR, TUBB, with the exception of TRPU) and the non-sustained sounds (PIAN, GUI, HARP, VLNP, VLAP, CELLP, DBLP). Another cluster appears in the center of the space, containing a mix of single/double reeds and brass instruments (SAXSO, SAXAL, SAXTE, ACC, EHOR, CORN). From this analysis, it appears that the assumed class structure is only partly supported by the MDS map: only the non-sustained, brass and bowed-string families appear as clusters.

7. CONCLUSION

In this paper we investigated the classification of large musical instrument databases. We proposed a new feature selection algorithm based on the maximization of the ratio of the between-class inertia to the total inertia, and compared it successfully with the widely used CFS algorithm. We compared various classifiers: gaussian and KNN classifiers, their hierarchical counterparts, and various decision-tree algorithms. The highest recognition rates were obtained with the hierarchical gaussian and KNN classifiers. This tends to favor the use of "smooth" classifiers (based on probability, like the gaussian classifier) over "hard" classifiers (based on Boolean boundaries, like the decision-tree classifiers) for the task of musical instrument sound recognition.

In order to validate the class hierarchy used in the experiments, we studied the organization of the classes through an MDS analysis using an acoustic feature representation of the instrument classes. This study leads to the conclusion that the non-sustained, bowed-string and brass instrument families form clusters in the acoustic feature space, while the remaining families (the reed families) are at best sparsely grouped. This is also verified by the analysis of the confusion matrix.

The recognition rates obtained with our system (64% for 23 instruments, 85% for instrument families) should be compared with the results reported by previous studies: Martin reports 39% for 14 instruments and 76% for instrument families; Eronen reports 35% for 16 instruments and 77% for instrument families. The increased recognition rates obtained in the present study can be mainly attributed to the use of new signal features.

APPENDIX

In Fig. 8, we present the main features selected by the IRMFSP algorithm at each node of the H-GC tree. In Fig. 9, we present the mean confusion matrix (expressed in percent of the sounds of the original class) over the 6 experiments of the LODO evaluation method. The last column of the figure gives the total number of sounds used for each instrument class. Clearly visible in the matrix is the low confusion between sustained and non-sustained sounds. The largest confusions occur inside each instrument family (the viola is recognized at 37% as a cello, the violin at 14% as a viola and 16% as a cello, the French horn at 23% as a tuba, the cornet at 47% as a trumpet, the English horn at 49% as an oboe, and the oboe at 20% as a clarinet).
Note that the classes with the smallest recognition rates (the cornet at 30% and the English horn at 12%) are also the classes for which the training set was the smallest (53 cornet sounds and 41 English-horn sounds). More surprising are the confusions inside the non-sustained sounds (the piano recognized as a guitar or a harp, the guitar recognized as a pizzicato cello). Cross-family confusions, such as the trombone recognized at 12% as a bassoon, the recorder recognized at 10% as a clarinet or the clarinet recognized at 23% as a flute, can be explained perceptually: since a large pitch range has been considered for each instrument, the timbre of a single instrument can change drastically.

Table 4: Instrument name abbreviations used in Fig. 7

  PIAN: Piano, GUI: Guitar, HARP: Harp, VLNP: Violin pizz., VLAP: Viola pizz., CELLP: Cello pizz., DBLP: Double-bass pizz., VLN: Violin, VLA: Viola, CELL: Cello, DBL: Double-bass, TRPU: Trumpet, CORN: Cornet, TBTB: Trombone, FHOR: French horn, TUBB: Tuba, FLTU: Flute, PICC: Piccolo, RECO: Recorder, CLA: Clarinet, SAXTE: Tenor sax, SAXAL: Alto sax, SAXSO: Soprano sax, ACC: Accordion, OBOE: Oboe, BSN: Bassoon, EHOR: English horn.

ACKNOWLEDGEMENT

Part of this work was conducted in the context of the European I.S.T. project CUIDADO [28] (http://www.cuidado.mu). The results obtained with the CFS algorithm, C4.5 and PART were produced with the open-source software Weka (http://www.cs.waikato.ac.nz/ml/weka) [6]. Thanks to Thomas Helie, Perfecto Herrera, Arie Livshin and Xavier Rodet for fruitful discussions.

8. REFERENCES

[1] G. Ballet. Studio OnLine, 1998.
[2] G. Box and D. Cox. An analysis of transformations. Journal of the Royal Statistical Society, pages 211–252, 1964.
[3] J. Brown. Computer identification of musical instruments using pattern recognition with cepstral coefficients as features. JASA, 105(3):1933–1941, 1999.
[4] A. Eronen. Comparison of features for musical instrument recognition. In WASPAA (IEEE Workshop on Applications of Signal Processing to Audio and Acoustics), New York, USA, 2001.
[5] J. Foote. Decision-Tree Probability Modeling for HMM Speech Recognition. PhD thesis, Brown University, 1994.
[6] E. Frank, L. Trigg, M. Hall, and R. Kirkby. Weka: Waikato Environment for Knowledge Analysis, 1999–2000.
[7] E. Frank and I. H. Witten. Generating accurate rule sets without global optimization. In Fifteenth International Conference on Machine Learning, pages 144–151, 1998.
[8] L. Fritts. University of Iowa Musical Instrument Samples, 1997.
[9] J. M. Grey and J. W. Gordon. Perceptual effects of spectral modifications on musical timbres. JASA, 63(5):1493–1500, 1978.
[10] M. Hall. Feature selection for discrete and numeric class machine learning. Technical report, 1999.
[11] P. Herrera, A. Dehamel, and F. Gouyon. Automatic labeling of unpitched percussion sounds. In AES 114th Convention, Amsterdam, The Netherlands, 2003.
[12] P. Herrera, G. Peeters, and S. Dubnov. Automatic classification of musical instrument sounds. Journal of New Music Research, 2003.
[13] K. Jensen and K. Arnspang. Binary decision tree classification of musical sounds. In ICMC, Beijing, China, 1999.
[14] J. Krimphoff, S. McAdams, and S. Winsberg. Caractérisation du timbre des sons complexes. II: Analyses acoustiques et quantification psychophysique. Journal de Physique, 4:625–628, 1994.
[15] A. Livshin, G. Peeters, and X. Rodet. Studies and improvements in automatic classification of musical sound samples. Submitted to ICMC, Singapore, 2003.
[16] K. Martin. Sound Source Recognition: A Theory and Computational Model. PhD thesis, MIT, 1999.
[17] K. Martin and Y. Kim. Instrument identification: a pattern-recognition approach (paper 2pMU9). In 136th Meeting of the Acoustical Society of America, 1998.
[18] S. McAdams, S. Winsberg, S. Donnadieu, G. De Soete, and J. Krimphoff. Perceptual scaling of synthesized musical timbres: common dimensions, specificities and latent subject classes. Psychological Research, 58:177–192, 1995.
[19] Microsoft. Musical Instruments CD-ROM.
[20] J. R. Miller and E. C. Carterette. Perceptual space for musical structures. JASA, 58:711–720, 1975.
[21] L. Molina, L. Belanche, and A. Nebot. Feature selection algorithms: a survey and experimental evaluation. In International Conference on Data Mining, Maebashi City, Japan, 2002.
[22] MPEG-7. Information Technology - Multimedia Content Description Interface - Part 4: Audio, 2002.
[23] F. Opolko and J. Wapnick. McGill University Master Samples CD-ROM for SampleCell, Volume 1, 1991.
[24] G. Peeters, S. McAdams, and P. Herrera. Instrument sound description in the context of MPEG-7. In ICMC, Berlin, Germany, 2000.
[25] G. Peeters and X. Rodet. Automatically selecting signal descriptors for sound classification. In ICMC, Göteborg, Sweden, 2002.
[26] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[27] E. Scheirer and M. Slaney. Construction and evaluation of a robust multifeature speech/music discriminator. In ICASSP, Munich, Germany, 1997.
[28] H. Vinet, P. Herrera, and F. Pachet. The CUIDADO project. In ISMIR, Paris, France, 2002.
[29] E. Wold, T. Blum, D. Keislar, and J. Wheaton. Classification, search and retrieval of audio. In B. Furht, editor, CRC Handbook of Multimedia Computing, pages 207–226. CRC Press, Boca Raton, FL, 1999.
[30] H. Zhang. Heuristic approach for generic audio data segmentation and annotation. In ACM Multimedia, Orlando, Florida, 1999.
[Fig. 8: Main features selected by the IRMFSP algorithm at each node of the H-GC tree. The figure lists, per node, temporal features (temporal increase, temporal decrease, temporal centroid, log-attack time), spectral shape features (centroid, spread, skewness, kurtosis, slope, decrease, variation, and their standard deviations), sharpness, harmonic deviation, tristimulus, noisiness, and selected MFCC and autocorrelation coefficients.]

[Fig. 9: Overall confusion matrix (expressed in percent of the sounds of the original class) for the LODO evaluation method. Thin lines separate the instrument families; thick lines separate the sustained and non-sustained sounds.]