International Journal of Advanced Intelligence, Volume 9, Number 1, pp. 63-75, March 2017.
© AIA International Advanced Information Institute

Development of Singing-by-Onomatopoeia corpus for Query-by-Singing Music Information Retrieval system

Motoyuki Suzuki
Faculty of Information Science and Technology, Osaka Institute of Technology
1–79–1, Kitayama, Hirakata, Osaka, 573–0196, JAPAN
[email protected]

Akimitsu Hisaoka
Faculty of Information Science and Technology, Osaka Institute of Technology
1–79–1, Kitayama, Hirakata, Osaka, 573–0196, JAPAN
[email protected]

Received (15 Sep. 2016)
Revised (15 Feb. 2017)

A Query-by-Singing Music Information Retrieval (MIR) system has an advantage over a Query-by-Humming MIR system because it can use lyrical content in addition to melodic information. However, it cannot be used for instrumental music, which has no lyrics. Such music is often represented using onomatopoeia. There are many kinds of onomatopoeia, and they are related to characteristics of sound such as tone and timbre. If such relationships can be identified, then onomatopoeia can be used as retrieval keys instead of lyrics. In order to investigate the relationship between musical sound and onomatopoeia, we constructed a new singing voice corpus. Forty-nine participants sang using onomatopoeia, and phonetic transcriptions of the data were made by hand. A total of 452 audio samples and corresponding transcriptions were obtained. In addition to constructing the corpus, several basic analyses were carried out. First, we defined the "word" unit of onomatopoeia by using a statistical language model and compiled a list of 492 such words. "Ta" and "la" were found to be the most commonly used, but the variation in the usage of onomatopoeia among the singers was very wide. Some singers used many kinds of onomatopoeia, while others used only a few words for any type of music. Some of the musical instruments were associated with a particular onomatopoeia, but these relationships were not strongly defined.

Keywords: Onomatopoeia; Query-by-Singing; Music Information Retrieval System.

1. Introduction

In recent years, it has become possible to store a huge number of songs on a small music device such as an iPod or an MP3 player. Users are able to listen to many songs randomly, but it is difficult to search for and retrieve a specific song because most small players do not have a convenient input device such as a keyboard, large touch pad, or mouse. Therefore, a user cannot easily input retrieval keys (song title, singer name, part of the lyrics, etc.).

In order to solve this problem, several content-based music information retrieval (MIR) systems have been proposed. Many of the traditional content-based MIR systems1,2,3 use a humming sample as the retrieval key in a procedure known as Query-by-Humming (QbH). QbH-MIR systems extract melodic information, which consists of the tone and length of notes, from the input humming and use it as the retrieval key. These systems cannot achieve high performance because of errors in extracting the melodic information. Query-by-Singing (QbS) MIR systems4,5 have also been proposed. These systems accept a singing voice as input and extract lyrical content in addition to melodic information by using speech recognition technology6. In general, these systems show higher performance than QbH-MIR systems because both lyrical and melodic information can be used for retrieving the required music piece.
However, the QbS-MIR system cannot use lyrical content for retrieving instrumental music. In general, a user sings instrumental music by humming. In that case, what kinds of phones are usually used? Meaningless words such as "ta," "la," "cha," and "tang" are commonly used. These are called onomatopoeia, and there are many such expressions in humming. Each onomatopoeia relates to characteristics of sound. For example, both "pong" and "ping" express a piano sound, but "ping" expresses a higher tone than "pong." "Dong" expresses a loud drum sound, while "ton" expresses a soft drum sound. If such a relationship between an onomatopoeia and a sound characteristic can be found, then this information can be used to distinguish two pieces of music that have a similar melody but a different tone because they are played on different musical instruments.

Several studies have used onomatopoeia as an MIR query, but only in restricted situations. Ishihara et al.7 extracted only the length of notes from onomatopoeia expressions because their system accepted only one onomatopoeia ("la") and its variations ("laa," "laaa," and so on). Other studies8,9 also used onomatopoeia as input, but these systems dealt only with percussion.

In this paper, we investigate relationships between onomatopoeia and sound characteristics in general music. To do so, a singing-by-onomatopoeia corpus is constructed. We collect many audio clips of singers singing instrumental music with onomatopoeia and make phonetic transcriptions of all the data. Subsequently, some basic relationships are investigated using statistical techniques.

This corpus can also be used for developing a system that automatically constructs an onomatopoeia database for an MIR system. The input of such a system is a piece of music, which is converted into a corresponding onomatopoeia sequence. The system can be realized by using automatic speech recognition techniques. Srinivasamurthy et al.10 proposed a similar method for translating percussion music into syllable expressions; this method can be applied to general music by using the corpus.

2. Relationship between onomatopoeia and simple sound

There are several studies11,12 investigating relationships between onomatopoeia and simple sounds in Japanese. First, the tone of a sound mainly corresponds to the vowel. The vowel /o/ is used for low sounds (below 1 kHz), /i/ is used for high sounds (above 2 kHz), and the vowel /a/ is used for the middle range (between 1 kHz and 2 kHz). Second, the length of the sound is expressed by the ending of the onomatopoeia. A long vowel (e.g., /a:/) or a long vowel followed by a syllabic nasal (e.g., /i:N/) is frequently used for a long sound. On the other hand, an onomatopoeia representing a very short sound ends with a geminate consonant (/Q/). A sound of medium length is represented by a short vowel followed by a syllabic nasal (e.g., /iN/). When a short sound is played twice, or two similar sounds are played in succession, the structure of the corresponding onomatopoeia becomes /cvtv/, where /c/ is a consonant and /v/ is a vowel. The two vowels in this structure are the same, and the second consonant is fixed to /t/ (e.g., /kata/, /koto/). In such a case, when the second sound is quieter or has a higher tone, /r/ is used instead of the second consonant /t/ (/cvrv/). On the other hand, when the second sound is louder, a syllabic nasal (/N/) follows the onomatopoeia (/cvtvN/). Moreover, when the second sound has a longer duration, the second vowel becomes a long vowel (/cvtv:N/). When similar sounds are played repeatedly, the second consonant and vowel are repeated (e.g., /katata/, /pororo/).
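As a rough illustration of these reported tendencies, the following Python sketch maps a simple sound, described by its dominant frequency and duration, to a candidate vowel and ending. The frequency bands follow the rules above, but the duration thresholds and the helper names (suggest_vowel, suggest_ending) are our own illustrative assumptions and are not taken from the cited studies.

```python
def suggest_vowel(freq_hz):
    """Choose a vowel from the dominant frequency, following the reported tendency."""
    if freq_hz < 1000:
        return "o"   # low sounds tend to be expressed with /o/
    if freq_hz > 2000:
        return "i"   # high sounds tend to be expressed with /i/
    return "a"       # the middle range (1-2 kHz) tends to use /a/


def suggest_ending(duration_s):
    """Choose an ending that reflects the length of the sound (thresholds are assumed)."""
    if duration_s < 0.1:
        return "Q"   # a very short sound ends with a geminate consonant
    if duration_s < 0.5:
        return "N"   # a medium-length sound ends with a syllabic nasal
    return ":N"      # a long sound uses a long vowel, often with a syllabic nasal


# Example: a 1.5 kHz tone lasting 0.3 s yields a /taN/-like candidate.
print("t" + suggest_vowel(1500) + suggest_ending(0.3))  # -> taN
```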
As can be inferred from the above, some simple sounds can be translated into onomatopoeia. However, we do not know what kinds of onomatopoeia should be used for a complex sound such as music. How are differences among musical instruments represented using onomatopoeia? Is there any individuality in the usage of onomatopoeia? In order to answer these questions, a singing-by-onomatopoeia corpus is needed.

3. Construction of the corpus

3.1. Design policy of the corpus

The purpose of constructing the corpus is to develop a new QbS-MIR system. The rules followed in the construction are stated below:

• The input music has no lyrics. Specifically, classical music samples are used.
• The length of the input is set to about 10 seconds because the input voice to the MIR system is typically short.
• The singing is without any accompaniment. The singer is allowed to listen to the music in advance but has to sing from memory.

After recording, phonetic transcriptions of all data are given by hand.

3.2. Target songs

Twenty popular classical music pieces were used as target songs. Table 1 shows the list of classical music used here. Some of the music was played by an orchestra, while other pieces were played on string instruments, the piano, and so on. Four representative periods (each of which was about 10 ∼ 20 seconds) were extracted from each piece. A total of 80 musical periods were used for recording.

Table 1. Music list used in the database

Music title               Composer
The Marriage of Figaro    W. A. Mozart
Sabre Dance               A. Khachaturian
The Trout                 F. Schubert
Carmen                    G. Bizet
The Nutcracker            P. I. Tchaikovsky
The Planets               G. Holst
Wedding March             F. Mendelssohn
Dance of the Knights      S. Prokofiev
Triumphal March           G. Verdi
Heroic Polonaise          F. Chopin
Revolutionary Étude       F. Chopin
For Elise                 L. Beethoven
Canon                     J. Pachelbel
Radetzky March            J. Strauss I
Air on the G string       J. S. Bach
Hungarian Dances          J. Brahms
Je te veux                E. Satie
Turkish March             W. A. Mozart
The Four Seasons          A. Vivaldi
Swan Lake                 P. I. Tchaikovsky

3.3. Recording procedure

A singer was asked to sing onomatopoeia without any reference sounds in order to generate input resembling that of a QbS-MIR system. Details of the recording procedure are provided next.

(1) A singer listens to a target musical period through headphones and thinks about how to sing the music using onomatopoeia. The singer should memorize the music in order to sing it without any reference sounds ("a cappella"). The music can be listened to repeatedly until the singer memorizes it.
(2) The singer sings the music with onomatopoeia. If the singer cannot memorize the music, or cannot think of appropriate onomatopoeia, then the singer can pass on that piece of music.

Thirty-nine males and 10 females participated as singers. Each singer attempted to sing 10 musical periods, but some periods were passed because the singer could not memorize them. A total of 452 songs were recorded; on average, each singer sang 9.2 songs. On the other hand, the number of singers per music piece was not uniform: the maximum number of singers for a piece was 29 and the minimum was 17.
3.4. Making transcriptions

After recording the songs, transcriptions were generated manually for all the audio clips. The sung onomatopoeia were transcribed into a sequence of Japanese syllables by an operator. For some of the data it was difficult to determine the onomatopoeia because of ambiguous pronunciation; in these cases, the operator chose the Japanese syllable most similar to the sound. Six operators, who were different from the singers, carried out the transcription.

4. Definition of "words"

In order to analyze the relationship between onomatopoeia and music, all transcriptions should be split into "words." It is difficult to define the scope of a "word" in onomatopoeia. For example, for the transcription "talilalilalilang," which splitting pattern is the most appropriate: "tali lali lali lang," "ta li la li la li lang," or "talilali lalilang"? In this section, we define "words" for onomatopoeia sequences from a statistical viewpoint.

4.1. Overview of the algorithm

A word in a natural language is a cluster of phones that appears frequently in sentences. Our objective is to define a minimum unit (corresponding to the phone) for onomatopoeia and to find clusters appearing frequently in onomatopoeia sequences.

The minimum unit of onomatopoeia is defined based on the Japanese syllable. Specifically, it is defined as /[c]v[Q][N]/, where "c" denotes a consonant, "v" denotes a vowel (including long vowels), "Q" denotes a geminate consonant, and "N" denotes a syllabic nasal. The brackets indicate that the phoneme can be omitted. We also employ a single /N/ as a unit. Some singers sang music using only /N/; this is similar to humming, and its transcription becomes /NNNNN.../. In general, /N/ does not form an independent Japanese syllable, but in this study /N/ is also used as one of the minimum units of onomatopoeia.

For example, the transcription "talilalilalilang" is split into "ta li la li la li lang" according to the definition of the minimum unit. After splitting, clusters that appear repeatedly can be used to define a "word." In this case, "lila" or "lali" can be employed as the "word" because each of them is repeated twice in the example.

This problem setting is similar to unsupervised word segmentation from a letter sequence or a phonetic transcription. Several methods have been proposed to address this problem, and we employed the Bayesian unsupervised word segmentation model using the Hierarchical Pitman-Yor language model (HPYLM)13,14. In this method, it is assumed that a set of onomatopoeia sequences X is generated by a statistical language model G, which has a vocabulary and statistical parameters and generates word sequences. X is generated by concatenating the word sequences. P(G) is the prior probability of G, and the joint probability P(X, G) is calculated by using Eq. (1).

P(X, G) = P(X|G)P(G)   (1)

The method finds the structure and parameters of G that maximize P(X, G) for the given onomatopoeia sequences X. This means that over-fitting may occur when the number of onomatopoeia sequences is small; in that case, the estimated words tend to become longer because there is less variation in the sequences.
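Before the statistical segmentation is applied, each transcription is split into the minimum units defined above. To make that definition concrete, the following Python sketch splits a romanized transcription into /[c]v[Q][N]/ units (plus a bare /N/) with a regular expression. The consonant inventory and the function name split_into_units are illustrative assumptions; the actual tooling used for the corpus is not described at this level of detail.

```python
import re

# One minimum unit: an optional consonant cluster, a vowel (possibly long, marked ":"),
# then an optional geminate /Q/ and an optional syllabic nasal /N/.
# A bare /N/ is also allowed as a unit (humming-like singing).
UNIT = re.compile(r"(?:[kgsztdnhbpmyrwlcj]+[aiueo]:?Q?N?|N)")

def split_into_units(transcription):
    """Split a romanized onomatopoeia transcription into minimum units."""
    return UNIT.findall(transcription)

print(split_into_units("tariraNrariraN"))  # ['ta', 'ri', 'raN', 'ra', 'ri', 'raN']
print(split_into_units("ta:taQtaN"))       # ['ta:', 'taQ', 'taN']
print(split_into_units("NNNN"))            # ['N', 'N', 'N', 'N']
```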
4.2. Statistics of obtained "words"

We applied the word segmentation tool15 to all 452 transcriptions. A vocabulary of 492 words was obtained, and all the transcriptions were split into 13,602 words. Table 2 shows the ten most frequent words. These words are simple and commonly used for onomatopoeia singing. Eight of them contain the vowel /a/ or /a:/, and eight contain the consonant /t/ or /r/. Hence, it can be concluded that /ta/ and /ra/ are the most popular representations in Japanese onomatopoeia singing.

Table 2. Frequently occurring words

Word      Frequency
/ta/      1579
/ra/      1098
/ra:/      622
/tʃa/      611
/taN/      521
/ta:/      479
/te/       405
/pa/       368
/ru/       319
/rara/     305

Of the total words, 138 consisted of two syllables and 120 consisted of only one syllable. About 80% of the words had fewer than five syllables. On the other hand, 16 words had more than nine syllables, and the longest word consisted of 35 syllables. These longer words were not reasonable and were generated due to over-fitting. Figure 1 shows the vocabulary size per word length.

Fig. 1. Histogram of the number of different words per number of syllables (x-axis: number of syllables; y-axis: vocabulary size).

5. Analysis

5.1. Variation of onomatopoeia within a singer

The usage of onomatopoeia words differs among singers. Some singers use a variety of words for a piece of music, while others use only a few types of words. We checked the deviation in word selection by calculating the entropy for each singer. The entropy is calculated from a unigram (the frequency probability distribution over words) and reflects how skewed the distribution is. Mathematically, the entropy E can be calculated by using Eq. (2).

E = −Σ_{w∈W} P(w) log2 P(w)   (2)

where W denotes the set of words (vocabulary) and P(w) denotes the frequency probability of the word w. As can be seen from the equation, if all words are used with equal probability, then the entropy reaches its maximum value of E = log2(N), where N denotes the number of words. On the other hand, if only one word is used, then the entropy is equal to 0, which is the minimum value. A smaller entropy implies that the distribution is more biased.

Figure 2 shows the number of singers per entropy value.

Fig. 2. Histogram of the number of singers per entropy value (x-axis: entropy; y-axis: number of singers).

The maximum entropy was 5.04 and the minimum was 2.34. Many singers showed similar entropy; in fact, the entropies of 28 singers were within the range of 4.1 to 4.6. However, several singers showed very low entropy. Seventeen singers showed entropy lower than 4.0, and the entropy values of four singers were lower than 2.8. These singers used only a few words. For example, the singer with the lowest entropy used only 10 words, and 82% of the onomatopoeia tokens were covered by only three words.

5.2. Similarity among singers

The entropy indicates the degree of word variation. However, two singers with the same entropy do not necessarily use the same words. In order to investigate how similar one singer's word set is to another's, the distance between singers was calculated. The distance is defined as the Bhattacharyya distance between two unigrams. After calculating the distances between all pairs of singers, all the singers were mapped into a three-dimensional space by using the Multi-Dimensional Scaling (MDS) method.
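As a small illustration of the two measures used above, the following Python sketch computes the unigram entropy of a singer (Eq. (2)) and the Bhattacharyya distance between two singers' unigram distributions. The word lists and function names are hypothetical, and the MDS step (available in common statistics libraries) is omitted.

```python
from collections import Counter
from math import log, log2, sqrt

def unigram(words):
    """Frequency probability distribution P(w) estimated from a list of words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def entropy(p):
    """Unigram entropy E = -sum_w P(w) log2 P(w), as in Eq. (2)."""
    return -sum(q * log2(q) for q in p.values())

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance -ln(BC) between two unigram distributions."""
    bc = sum(sqrt(p[w] * q[w]) for w in set(p) & set(q))  # Bhattacharyya coefficient
    return -log(bc) if bc > 0 else float("inf")

# Hypothetical word sequences of two singers
singer_a = ["ta", "ta", "ra", "taN", "ta", "ra:"]
singer_b = ["ra", "ra", "ra:", "ra", "raQ", "ra"]
pa, pb = unigram(singer_a), unigram(singer_b)
print(entropy(pa), entropy(pb), bhattacharyya_distance(pa, pb))
```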
Figure 3 shows the three-dimensional representation of all singers. As observed, there are four clusters (R1 ∼ R4) and nine solitary singers. These nine singers used unusual words compared with the other singers. For example, man10 did not use /ta/, /ra/, or any onomatopoeia containing them; he frequently used /tʃa/ and its variations (about 80% of his singing data). Man11 used only /ra/ and its variations (long vowel, added /Q/, and so on). The entropies of these two singers were 2.50 and 2.34, respectively. Some of the other singers (man15, man26, and woman10) used very long words such as /babababa:Nbaba:N/ and /tukutuQtukutuQtuku/; no other singer used such words.

Fig. 3. 3D plot of all singers.

We also analyzed the statistical characteristics of the four groups, as shown in Table 3. In this table, the "Group entropy" was calculated using all the transcription data in the same group, and the "Individual entropy" is the average of the individual entropies, each of which was calculated using the transcription data of one singer. If both entropies are similar, all the singers in the group used a similar onomatopoeia word set. On the other hand, if the group entropy is larger than the individual entropy, then the singers in the group used different word sets.

Table 3. Statistics for the four groups

Group   #Singers   Vocabulary size   Group entropy   Individual entropy
R1      11         224               4.64            4.46
R2       7         105               4.13            3.37
R3      10         166               5.24            4.09
R4      12         267               6.06            4.32

In group R2, both entropies were small and the vocabulary size was also small. This means that all the singers in the group used a similar, small vocabulary. Singers in group R1 also used a similar vocabulary, but the vocabulary size was larger than that of group R2. On the other hand, singers in groups R3 and R4 used vocabularies that differed from each other, because the group entropy was larger than the individual entropy.

5.3. Analysis focused on musical instruments

The usage of onomatopoeia is strongly related to the music, and the musical instrument playing the main melody is especially important. In order to investigate the relationship between onomatopoeia and musical instruments, we calculate the entropy for each music piece and investigate the similarity between pieces using the same analysis techniques as in Sections 5.1 and 5.2.

Table 4 shows the entropy calculated for each music piece.

Table 4. Entropy for each music piece

Music title               Entropy   Main instruments
Dance of the Knights      6.04      Violin, horn
Radetzky March            5.59      Violin
Revolutionary Étude       5.43      Piano
Sabre Dance               5.42      Xylophone, trumpet
The Marriage of Figaro    5.33      Violin
Carmen                    5.32      Flute, oboe
Turkish March             5.21      Piano
Hungarian Dances          5.02      Violin, flute
The Trout                 4.97      Violin
The Nutcracker            4.96      Flute
The Planets               4.92      Violin
Je te veux                4.92      Oboe
Wedding March             4.74      Trumpet
Canon                     4.61      Violin
Heroic Polonaise          4.43      Piano
Swan Lake                 4.38      Oboe
The Four Seasons          4.34      Violin
For Elise                 4.23      Piano
Triumphal March           4.19      Trumpet
Air on the G string       4.17      Violin

From this table, the pieces with higher entropy are played by many musical instruments and/or at a fast tempo, whereas the pieces with lower entropy are played by a single instrument and/or at a slow tempo. In pieces with a slow tempo, the same onomatopoeia (e.g., /ta:/, /ra:/) was used repeatedly (e.g., /ra:ra:ra:/). However, faster music was sung using various onomatopoeia, because repeating the same onomatopoeia quickly is difficult to verbalize. For example, the onomatopoeia /tarariraraNrariraN/ is easier to recite quickly than /tatatatatatatata/.
Thus, it can be concluded that a music piece with a fast tempo is sung using a variety of onomatopoeia words.

In order to investigate the relationship between musical instruments and onomatopoeia, all the music pieces were plotted in a three-dimensional space using MDS. Figure 4 shows the resulting plot. In this figure, each colored circle corresponds to a music piece, and the color denotes the main musical instrument (violin, piano, trumpet, and so on). Some pieces are represented by two circles because a single main instrument cannot be determined. The number located near each circle denotes the tempo (BPM).

Fig. 4. Three-dimensional plot of the music pieces.

It can be observed that the locations of the pieces can be roughly classified on the basis of the main musical instrument. The piano pieces are located in the right side of the bottom half, the trumpet pieces are located on the left side, and the flute pieces are located in the center. These results imply that the onomatopoeia are strongly related to the main musical instrument. The piano timbre was frequently represented by the "la," "ta," and "cha" groups (/ra/, /ra:/, /ra:N/, etc.), and the trumpet was represented by the "pa" and "la" groups. However, the pieces played on the violin were widely distributed in the space. We will investigate the differences among such pieces and analyze the relationship between the violin timbre and the onomatopoeia representation.

6. Conclusion

The Query-by-Singing MIR system has an advantage over the Query-by-Humming MIR system, but it cannot use lyrics to retrieve instrumental music. Such pieces are often sung using onomatopoeia. There are many kinds of onomatopoeia, and they are related to sound characteristics such as tone, length, and timbre. If a piece of music can be converted into a sequence of onomatopoeia, then the onomatopoeia can be used as retrieval keys instead of lyrical content.

In order to investigate the relationship between sounds and onomatopoeia, we constructed a new singing voice corpus. Instrumental pieces were sung with onomatopoeia by 49 singers, and phonetic transcriptions were generated manually. Overall, 452 data samples were recorded. In addition to constructing the corpus, several basic analyses were carried out. First, we defined the "word" unit of onomatopoeia by using a statistical language model and obtained 492 words. "Ta" and "la" were the most commonly used, but the variation in the usage of onomatopoeia among singers was very wide. Some singers used many kinds of onomatopoeia, while others used only a few words for any type of music. Some of the musical instruments were associated with a particular onomatopoeia, but these relationships were not well-defined.

This analysis was based on 452 transcriptions. In future work, we plan to use a larger number of transcriptions. After that, we would like to investigate the following issues from the viewpoint of the transcribed sung data:

• Connectivity of onomatopoeia (bigrams, trigrams, and beyond)
• A wider variety of musical instruments (beyond the violin and the piano)
• The relationship between a singer's native language and onomatopoeia

Acknowledgement

A part of this work was supported by JSPS KAKENHI Grant Number 25330140.

References

1. N. Kosugi, Y. Nishihara, T. Sakata, M. Yamamoto, and K. Kushima, "A Practical Query-By-Humming System for a Large Music Database," in Proc. ACM Multimedia 2000, 2000, pp. 333–342.
2. A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith, "Query by humming: Musical information retrieval in an audio database," in Proc. ACM Multimedia, 1995, pp. 231–236.
3. B. Liu, Y. Wu, and Y. Li, "A Linear Hidden Markov Model for Music Information Retrieval Based on Humming," in Proc. ICASSP 2003, vol. V, 2003, pp. 533–536.
4. M. Suzuki, T. Hosoya, A. Ito, and S. Makino, "Music information retrieval from a singing voice using lyrics and melody information," EURASIP Journal on Advances in Signal Processing, vol. 2007, Article ID 38727, 8 pages, 2007, doi:10.1155/2007/38727.
5. ——, "Music information retrieval from a singing voice based on verification of recognized hypotheses," in Proc. ISMIR, 2006, pp. 168–171.
6. T. Hosoya, M. Suzuki, A. Ito, and S. Makino, "Lyrics recognition from a singing voice based on finite state automaton for music information retrieval," in Proc. ISMIR, 2005, pp. 532–535.
7. K. Ishihara, F. Kimura, and A. Maeda, "Music retrieval using onomatopoeic query," in Proc. World Congress on Engineering and Computer Science (WCECS), 2013, pp. 437–442.
8. T. Masui, "Music composition by onomatopoeia," in IFIP Advances in Information and Communication Technology, 2002, pp. 297–304.
9. T. Nakano, J. Ogata, M. Goto, and Y. Hiraga, "A drum pattern retrieval method by voice percussion," in Proc. ISMIR, 2004, pp. 550–553.
10. A. Srinivasamurthy, R. C. Repetto, H. Sundar, and X. Serra, "Transcription and recognition of syllable based percussion patterns: the case of Beijing opera," in Proc. ISMIR, 2014, pp. 431–436.
11. K. Tanaka, K. Matsubara, and T. Sato, "Study of onomatopoeia expressing strange sounds: Cases of impulse sounds and beat sounds," Transactions of the Japan Society of Mechanical Engineers, Ser. C, vol. 61, no. 592, pp. 4730–4735, 1995 (in Japanese).
12. K. Hiyane, N. Sawabe, and J. Iio, "Study of spectrum structure of short-time sounds and its onomatopoeia expression," The Institute of Electronics, Information and Communication Engineers, Technical Report of IEICE SP97-125, 1998 (in Japanese).
13. D. Mochihashi, T. Yamada, and N. Ueda, "Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling," in Proc. ACL, 2009, pp. 100–108.
14. G. Neubig, M. Mimura, S. Mori, and T. Kawahara, "Learning a language model from continuous speech," in Proc. INTERSPEECH, 2010, pp. 1053–1056.
15. G. Neubig, latticelm. [Online]. Available: http://www.phontron.com/latticelm/

Motoyuki Suzuki
He received the B.E., M.E., and Ph.D. degrees from Tohoku University, Sendai, Japan, in 1993, 1995, and 2004, respectively. From 1996, he worked at Tohoku University as a Research Associate. From 2006 to 2007, he was a Visiting Researcher at the University of Edinburgh, UK. In 2008 he became an Associate Professor at the University of Tokushima, and he currently works at the Osaka Institute of Technology, Osaka, Japan. He has been engaged in spoken language processing, music information retrieval, and pattern recognition using statistical modeling. He is a member of the IEICE, IPSJ, ASJ, and JSAI.

Akimitsu Hisaoka
He received the B.E. degree from the Osaka Institute of Technology, Osaka, Japan, in 2016. He studied music information retrieval and statistical approaches to natural language processing as an undergraduate.