Characterizing Music Dynamics for Improvisation

S. H. Srinivasan
Applied Research Group, Satyam Computer Services Ltd, Bangalore
SH [email protected]

Abstract

Characterizing music dynamics is important for music generation, music analogies, music retrieval, music improvisation, etc. Though there has been some work on modeling music, most of it does not consider the temporal or dynamical aspects of music. In this paper we show that it is possible to characterize music dynamics using linear prediction. We provide a technique for music improvisation using this model. The improvisation scheme is then applied to the generation of background music for videos.

1. Introduction

Music is one of the most compelling forms of audio. Music has no language barrier, though music appreciation has cultural bias [7]. Music fragments are required in several applications like animation, games, etc. [13]. Music synthesis requires the specification of parameters like pitch, duration, and beats. Music has several aspects like melody, harmony, and tempo. Tempo is usually specified informally by the user as slow, fast, etc., or is obvious from the application scenario. The harmonic aspects, though complex, can be handled by postprocessing techniques like "harmonizing melodies". Hence melody is the most important and most complex aspect of music synthesis.

Melody is usually thought of in terms of melodic contours: the variation of pitch with time. The term itself shows that temporal or dynamic aspects are important. Most current models ignore the temporal aspects, for example by considering pitch histograms [11]. In this paper we characterize pitch dynamics using linear prediction. We also interpret music improvisation in the framework of linear prediction. Finally, the problem of generating background music for videos from a given melody is posed as an improvisation problem.

(Caveat: in music terminology, "dynamics" refers to amplitude or volume. We have taken the liberty of using the term dynamics in its usual technical sense; we will use the word amplitude for sound intensity.)

This paper is organized as follows. Section 2 discusses current work in music synthesis. Section 3 introduces the LPC model for pitch dynamics. Section 4 provides an interpretation of music improvisation and shows how it can be performed using the LPC model. Section 5 uses the ideas of the previous sections for music generation for videos. The paper closes with a discussion (Section 6).

2. Music synthesis

In this section we consider several types of music synthesis: task-based and rule-based synthesis, music analogies, and music improvisation.

Task-based music synthesis synthesizes music for a given data set or task. Music synthesis has been performed in several domains like proteins [4], network traffic [8], etc. In these applications, music acts like a "sound oscilloscope": a signal which reflects changes in the original data. The listener can use the auditory pattern recognition capability of the human brain to gain some insight into the data being "auralized". Since the music has to remain faithful to the data, it may not be aesthetically sound.

Most music synthesis programs are rule-based [3]. The rules of "good music", like chord progressions, are codified and used in music generation. Identifying rules is time-consuming and not always possible. Hence techniques for synthesizing music based on examples are needed.
There are two techniques for example-based music generation: music analogies and improvisation. Music analogy, as formulated in [10], has two inputs: an exact pitch profile A and an approximate pitch profile B. B is changed to match the style of A. This formulation is based on matching pitch profiles. There are other formulations as well: the analogy proposed in [12] does not consider pitch variations; only tempo and amplitudes are considered.

Improvisation has fewer constraints and is more subjective. Given a music clip A, the goal is to improvise on A using the "theme" of A. This has several interpretations. In [11], for example, pitch (class) histograms are calculated from about 10 minutes of training data (warmup). Then, during improvisation, notes are generated from the appropriate components. But "Melodically, ... previously improvised notes do not affect future notes (and vice versa)."

[Figure 1. Pitch contour of the Mozart Viennese Sonatina. The chords were reduced to their roots for calculating the pitch profile. The note durations are ignored for this plot; the notes were considered to be of equal duration. The X-axis is the note number and the Y-axis is the MIDI pitch number.]

3. Music dynamics

Consider figure 1, which shows the pitch profile of the Mozart Viennese Sonatina No. 1 in C major, Movement 3. (The music clips used in this study were obtained from the KernScores site, http://kern.humdrum.net/, in MIDI format.) We want to capture the movement of pitch in time. As mentioned before, atemporal histograms have been used for this. In our previous work [10], we used a Haar approximation to capture the average profile.

This paper is based on the following observation: each pitch depends on the previous pitch values. We capture this dependence through linear prediction. Let $s_1, s_2, \ldots$ be a time series. We predict $s_n$ from $p$ of the previous values. Let $\hat{s}_n$ be the predicted value:

$$\hat{s}_n = \sum_{k=1}^{p} a_k s_{n-k}$$

The coefficients $a_k$ are chosen to minimize the LPC residual

$$e_n = s_n - \hat{s}_n.$$

(Frame-based) LPC has been used extensively in speech coding, where the error signal $e_n$ takes the following forms [9]:

- periodic impulse train: for voiced signals like vowels, 'b', 'd', 'g', etc.;
- single impulse: for unvoiced plosives like 'p', 't', 'k', etc.;
- white noise: for unvoiced fricatives like 'f', 's', etc.

Figure 2 shows the residual for the pitch profile of figure 1. (Similar residuals have been observed for other music clips.) Single impulses and white-noise-like signals are present in the residual. The impulses indicate the beginning of a new phrase; hence it is possible to detect music phrase boundaries from the residual.

[Figure 2. Second-order LPC residual of the Mozart clip. Single impulses and white-noise-like signals are present; the prediction error is close to zero for most notes.]

The above representation is invertible: it is possible to synthesize $s_n$ from $a_k$ and $e_n$. We will use this property, but with a different $e_n$ sequence.

Choosing the model order

The LP coefficients depend on the value of $p$, called the model order. In speech coding, the model order is fixed (around 10). It is possible to arrive at the optimal model order as follows: we calculate the cumulative error $\sum_n e_n^2$ as a function of $p$, and the value of $p$ which gives the minimum error is taken as the optimal model order.
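To make this concrete, the following is a minimal sketch of the analysis in this section: a least-squares fit of order-$p$ prediction coefficients to a pitch series, plus the cumulative-error scan over model orders. It is an illustration under our own assumptions; the least-squares fitting method, the function names, and the toy pitch contour are not from the paper.

```python
import numpy as np

def lpc(s, p):
    """Fit order-p linear prediction coefficients to the series s by
    least squares and return (coefficients, residual)."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    # Row i of X holds the p values preceding s[i + p], so that
    # X @ a predicts y = s[p:], i.e. s_n ~ sum_k a_k * s_{n-k}.
    X = np.column_stack([s[p - k:n - k] for k in range(1, p + 1)])
    y = s[p:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ a
    return a, residual

def optimal_order(s, max_p=10):
    """Pick the model order with the smallest cumulative squared
    residual, as suggested above."""
    errors = {p: np.sum(lpc(s, p)[1] ** 2) for p in range(1, max_p + 1)}
    return min(errors, key=errors.get)

# Example: a toy pitch contour given as MIDI note numbers (assumed data).
pitch = [60, 62, 64, 65, 67, 65, 64, 62, 60, 62, 64, 64]
p = optimal_order(pitch, max_p=4)
a, e = lpc(pitch, p)
print(p, a, e)
```

Least squares is only one way to fit the coefficients; autocorrelation-based methods such as the Levinson-Durbin recursion, standard in speech coding, could be substituted.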
4. Improvisation

Most improvisation models use randomness or genetic algorithms [5, 6, 11]. Purely random signals are not music. Hence the key to successful improvisation is controlled randomness: the model should capture aspects of the music, and randomness should be introduced at an appropriate place. Another major shortcoming of most existing improvisation models is that they do not consider temporal aspects.

From our LPC-based analysis of pitch contours (pitch time series), we conclude the following.

1. The LP coefficients capture the underlying dynamics.
2. The residual (which is random except for the impulse occurrences) provides the novel part.

So we improvise on a melody by keeping the same LPC coefficients and changing the residual depending on the application. The residual can be replaced by a random signal which is controlled appropriately. In this paper, we perturb the original LPC residual with small, normally distributed random values:

$$e'_n = e_n + N(0, \sigma^2)$$

where $\sigma$ controls the level of noise. Since the original error distribution is itself normal, this perturbation does not affect the fundamental character of the residual. The signal synthesized from the residual $e'_n$ and the original LP coefficients provides the improvisation contour. Figure 3 shows the results of improvisation; a code sketch follows.

[Figure 3. LPC-based improvisation. The red (gray) curve is the original pitch contour; the blue (dark) curves are the pitch contours for LPC-based improvisation. The top panel uses an order-1 LPC approximation, the bottom panel an order-2 approximation. Order 1 has lower error than order 2, which is reflected in the figure: the order-1 contour follows the original more closely.]
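Below is a minimal sketch of this perturbation scheme, reusing the least-squares LPC fit from the earlier sketch. The value of sigma, the rounding back to integer MIDI pitches, and seeding the synthesis with the first p original notes are our assumptions, not the paper's.

```python
import numpy as np

def lpc(s, p):
    """Least-squares LPC fit (same helper as the Section 3 sketch)."""
    s = np.asarray(s, dtype=float)
    X = np.column_stack([s[p - k:len(s) - k] for k in range(1, p + 1)])
    a, *_ = np.linalg.lstsq(X, s[p:], rcond=None)
    return a, s[p:] - X @ a

def synthesize(a, residual, seed_notes):
    """Run the predictor forward, driven by the given residual.
    seed_notes supplies the first p samples (an assumed choice)."""
    p, out = len(a), list(seed_notes)
    for n, err in enumerate(residual):
        pred = sum(a[k] * out[p + n - 1 - k] for k in range(p))
        out.append(pred + err)
    return np.rint(out).astype(int)   # snap back to MIDI pitch numbers

def improvise(s, p=2, sigma=0.5, seed=None):
    """Perturb the residual with N(0, sigma^2) noise and resynthesize
    with the original LP coefficients."""
    rng = np.random.default_rng(seed)
    a, e = lpc(s, p)
    e_perturbed = e + rng.normal(0.0, sigma, size=e.shape)
    return synthesize(a, e_perturbed, s[:p])

improvised = improvise([60, 62, 64, 65, 67, 65, 64, 62, 60], p=2, sigma=0.5)
```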
5. Music generation for videos

Home videos lack compelling audio content; adding audio makes them more "presentable". In [10] we explored synthesizing music based on video content. The audio-video feature match is obtained using the principles of computational media aesthetics [14]. It is easy to see that music tempo and video motion activity should be matched for aesthetic reasons. It is less obvious that the pitch values (of the music) and the hue values (of the video) should also be matched. The hue values are nearly constant within video shots; see the top profile of figure 4 for an example. If the pitch profile is made directly proportional to hue, it will be piecewise constant, resulting in monotonous music. To produce pleasing music, rule-based and analogy-based techniques were proposed in [10]. In this section we show that improvisation can be used to produce aesthetically pleasing music.

In the application scenario, the user provides a video and a music clip (in MIDI format) to be mixed. Systems like Muvee [1] mix audio and video files with no matching; we want to adapt the music to the video content. It is easy to scale the tempo of the music according to the video motion activity, but the hue profile of the video and the pitch profile of the music may not be matched. We use improvisation on the music so that the hue profile and the pitch profile are matched.

The definition of improvisation is slightly different in this case. In the usual improvisation models, new music is produced from an example using inspiration (in humans) or randomness (in computers). Here, the video content provides the basis for "inspiration". This is very easy to do in our LPC model. As mentioned before, LPC analysis produces a residual signal. We replace the original residual by a signal which depends on the hue profile of the video. If $h$ is the hue profile, we let

$$e_n = f(h_n)$$

where $f(\cdot)$ is a monotonic function which adjusts the dynamic range appropriately. A pitch profile is synthesized using the LPC coefficients of the original clip and the hue-dependent residual; a sketch of this mapping appears at the end of this section. Figure 4 shows the computation of the pitch profile. The pitch contour is converted to music by

1. substituting pitch values with the corresponding chords of the original music, and
2. scaling the tempo of the synthesized music to the motion activity of the video.

[Figure 4. Music synthesis for videos. The top panel shows the normalized, mean-subtracted hue profile of a video shot. This is taken as the LPC residual, together with the coefficients from the LPC analysis of the Mozart clip. The synthesized pitch profile is shown in red (gray) in the bottom panel; the original pitch profile is shown in blue. When the hue is close to its mean, the synthesized contour is close to the original; when the hue deviates from the mean, the synthesized contour also deviates, maintaining the general shape.]
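Here is the promised sketch of the hue-driven synthesis, under the same assumptions as the earlier sketches. The linear rescaling used for $f(\cdot)$, the toy hue values, and mapping the hue range onto the original residual's dynamic range are all illustrative choices, not the paper's implementation.

```python
import numpy as np

def lpc(s, p):  # least-squares LPC fit, as in the earlier sketches
    s = np.asarray(s, dtype=float)
    X = np.column_stack([s[p - k:len(s) - k] for k in range(1, p + 1)])
    a, *_ = np.linalg.lstsq(X, s[p:], rcond=None)
    return a, s[p:] - X @ a

def hue_to_residual(h, e):
    """One monotonic choice of f(.): rescale the hue profile onto the
    dynamic range of the original residual e."""
    h = np.asarray(h, dtype=float)
    h01 = (h - h.min()) / (h.max() - h.min() + 1e-9)  # normalize to [0, 1]
    return e.min() + h01 * (e.max() - e.min())

def music_for_video(s, h, p=2):
    """Synthesize a pitch contour whose residual follows the hue."""
    a, e = lpc(s, p)                  # coefficients of the source melody
    out = list(s[:p])                 # seed with the first p notes (assumed)
    for n, err in enumerate(hue_to_residual(h, e)):
        pred = sum(a[k] * out[p + n - 1 - k] for k in range(p))
        out.append(pred + err)
    return np.rint(out).astype(int)

# Hue is nearly constant within shots and jumps at shot boundaries.
hue = [0.2] * 10 + [0.7] * 10 + [0.4] * 10
contour = music_for_video([60, 62, 64, 65, 67, 65, 64, 62, 60], hue, p=2)
```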
6. Discussion

The central ideas of this paper are:

1. Pitch dynamics, or the "melodic theme" of music, can be captured using low-order LPC.
2. The LPC residual (the "innovation") captures the variation from the theme.

We have used this idea of residual-as-innovation for improvisation and for hue-dependent music improvisation. The work reported here can be extended in several directions. The LPC model can be used for content-based music retrieval. Its use in improvisation can be enhanced in two ways: the LPC model can be applied to other parameters like note duration and note amplitude, and, while we have not modified the LP coefficients in this paper, it is possible to change them to obtain related pitch dynamics.

A central difficulty of work of this type is the evaluation of quality. While compression quality can be measured fairly objectively using human subjects, it is more difficult to evaluate music quality, since background, training, and "taste" influence the judgements (apart from factors like age and sex). At the same time, music is an important media element and audio-video interactions are important [2]; hence studies of this nature are valuable. But common public databases and evaluation procedures need to be standardized before meaningful comparisons can be made. Unfortunately, the study of music has not attracted the same level of attention and resources as audio and video. The main reason for this may be the perception that music is more an "art" than a "science".

References

[1] Muvee Technologies. http://www.muvee.com.
[2] M. Chion. Audio-Vision: Sound on Screen. Columbia University Press, 1994.
[3] D. Cope. An expert system for computer-assisted composition. Computer Music Journal, 1987.
[4] DNA & Protein Music, Aug. 2002. http://linkage.rockefeller.edu/wli/dna_corr/music.html.
[5] S. Fels and J. Manzolli. Interactive, evolutionary textured sound composition. In Eurographics Workshop on Multimedia, 2001.
[6] G. Papadopoulos and G. Wiggins. A genetic algorithm for the generation of jazz melodies. In STeP, 1998.
[7] R. Parncutt. Harmony: A Psychoacoustical Approach. Springer-Verlag, 1989.
[8] L. D. Paulson. Researcher puts the network to music. IEEE Computer, page 23, Mar. 2003.
[9] T. F. Quatieri. Discrete-Time Speech Signal Processing: Principles and Practice. Prentice Hall, 2002.
[10] S. H. Srinivasan, M. Gajanan, and M. Kankanhalli. Music synthesis for home videos. Submitted, 2003.
[11] B. Thom. Unsupervised learning and interactive jazz/blues improvisation. In AAAI, 2000.
[12] A. Tobudic and G. Widmer. Playing Mozart phrase by phrase. In International Conference on Case-Based Reasoning, 2003.
[13] C. Yu. Computer generated music composition, 1996. http://www.oz.net/~cyu/Thesis.html.
[14] H. Zettl. Sight, Sound, Motion: Applied Media Aesthetics. Wadsworth, 1998.