
Characterizing music dynamics for improvisation
S H Srinivasan
Applied Research Group
Satyam Computer Services Ltd, Bangalore
SH [email protected]
Abstract
Characterizing music dynamics is important for music generation, music analogies, music retrieval, music improvisation, etc. Though there has been some work on modeling music, most of it does not consider the temporal or dynamical aspects of music. In this paper we show that it is possible to characterize music dynamics using linear prediction. We provide a technique for music improvisation using this model. The improvisation scheme is then applied to the generation of background music for videos.
1. Introduction
Music is one of the most compelling forms of audio. Music has no language barrier, though music appreciation has cultural biases [7]. Music fragments are required in several applications like animation, games, etc. [13]. Music synthesis requires the specification of parameters like pitch, duration, and beats.
Music has several aspects like melody, harmony, and tempo. Tempo is usually specified informally by the user (slow, fast, etc.) or is obvious from the application scenario. The harmonic aspects, though complex, can be handled by post-processing techniques like "harmonizing melodies". Hence melody is the most important and most complex aspect of music synthesis. Melody is usually thought of in terms of melodic contours, that is, the variation of pitch with time. The term itself shows that temporal or dynamic aspects are important. Yet most current models ignore these temporal aspects, for example by considering only pitch histograms [11].
In this paper we characterize pitch dynamics using linear prediction. We also interpret music improvisation in the framework of linear prediction. Finally, the problem of generating background music for videos from a given melody is posed as an improvisation problem. (Caveat: in music terminology, dynamics refers to amplitude or volume. We have taken the liberty of using the term dynamics in its usual technical sense; we will use the word amplitude for
sound intensity.)
This paper is organized as follows. Section 2 discusses current work in music synthesis. Section 3 introduces the LPC model for pitch dynamics. Section 4 provides an interpretation of music improvisation and shows how it can be performed using the LPC model. Section 5 uses these ideas for music generation for videos. The paper closes with a discussion (Section 6).
2. Music synthesis
In this section, we consider several types of music synthesis: task-based and rule-based music synthesis, music
analogies, and music improvisation.
Task-based music synthesis synthesizes music for given data or a given task. Music synthesis has been performed in several domains like proteins [4], network traffic [8], etc. In these applications, music acts like a "sound oscilloscope": a signal which reflects changes in the original data. The listener can use the auditory pattern-recognition capability of the human brain to get some insight into the "auralized" data. Since the music has to remain faithful to the data, it may not be aesthetically pleasing.
Most music synthesis programs are rule-based [3]. The rules of "good music", such as chord progressions, are codified and used in music generation. Identifying rules is time-consuming and not always possible. Hence techniques for synthesizing music based on examples are needed. There are two techniques for example-based music generation: music analogies and improvisation.
Music analogy, as formulated in [10], has two inputs: an exact pitch profile A and an approximate pitch profile B. B is changed to match the style of A. This formulation is based on matching pitch profiles; there are other formulations as well. The analogy proposed in [12] does not consider pitch variations: only tempo and amplitude are considered.
Improvisation has fewer constraints and is more subjective. Given a music clip A, the goal is to improvise on A
using the “theme” of A. This has several interpretations.
Figure 1. Pitch contour of the Mozart Viennese Sonatina. The chords were reduced to their roots for calculating the pitch profile, and note durations were ignored for this plot (all notes were treated as having equal duration). The X-axis is the note number and the Y-axis is the MIDI pitch number.
In [11], for example, pitch (class) histograms are calculated from about 10 minutes of training data (warmup). During improvisation, notes are then generated from the appropriate components. But "melodically, … previously improvised notes do not affect future notes (and vice versa)."
Figure 2. Second-order LPC residual of the Mozart clip. Single impulses and white-noise-like segments (labeled "Impulses" and "Noise" in the plot) are present; the prediction error is close to zero for most notes.
(Frame-based) LPC has been used extensively in speech coding, where the error signal $e_n$ takes the following forms [9]:

- periodic impulse train: for voiced signals like vowels and 'b', 'd', 'g';
- single impulse: for unvoiced plosives like 'p', 't', 'k';
- white noise: for unvoiced fricatives like 'f', 's'.
3. Music dynamics
Consider Figure 1, which shows the pitch profile of Mozart's Viennese Sonatina No. 1 in C major, Movement 3.¹ We want to capture the movement of pitch in time. As mentioned before, atemporal histograms have been used for this. In our previous work [10], we used a Haar approximation to capture the average profile.
This paper is based on the following observation: each pitch depends on the previous pitch values. We capture this dependence through linear prediction. Let $s_1, s_2, \ldots$ be a time series. In this time series, we predict $s_n$ using $p$ of the previous values. Let $\hat{s}_n$ be the predicted value:
$$\hat{s}_n = \sum_{k=1}^{p} a_k \, s_{n-k}$$
The coefficients $a_k$ are determined to minimize the energy of the LPC residual
$$e_n = s_n - \hat{s}_n .$$
¹The music clips used in this study were obtained from the KernScores site (http://kern.humdrum.net/) in MIDI format.
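To make the model concrete, the following is a minimal sketch (in Python with NumPy) of fitting the LP coefficients to a pitch contour and computing the residual. The function names and the least-squares fitting choice are ours; the paper does not specify an implementation.

import numpy as np

def lpc_coefficients(s, p):
    # Fit order-p LP coefficients a_1..a_p by least squares, so that
    # s[n] is approximated by sum_k a_k * s[n-k].
    s = np.asarray(s, dtype=float)
    # Row n of X holds the p samples preceding s[p + n], most recent first.
    X = np.column_stack([s[p - k:len(s) - k] for k in range(1, p + 1)])
    y = s[p:]
    a, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return a

def lpc_residual(s, a):
    # Residual e[n] = s[n] - s_hat[n] for n >= p (zero for the first p notes).
    s = np.asarray(s, dtype=float)
    p = len(a)
    e = np.zeros_like(s)
    for n in range(p, len(s)):
        e[n] = s[n] - np.dot(a, s[n - p:n][::-1])
    return e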
Figure 2 shows the residual for the pitch profile of Figure 1. (Similar residuals have been observed for other music clips.) Single impulses and white-noise-like signals are present in the residual.
The impulses in the residual signal indicate the beginning of a new phrase. Hence it is possible to detect music phrase boundaries from the residual.
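As an illustration, a simple thresholding rule can flag these impulses. The three-standard-deviation threshold below is our assumption, not a method from the paper:

import numpy as np

def phrase_boundaries(e, k=3.0):
    # Flag notes whose residual magnitude exceeds k standard deviations;
    # these are taken as candidate phrase beginnings.
    e = np.asarray(e, dtype=float)
    return np.flatnonzero(np.abs(e) > k * e.std())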
The above representation is invertible: it is possible to synthesize $s_n$ from $a_k$ and $e_n$. We will use this property, but with a different $e_n$ sequence.
Choosing the model order
The LP coefficients depend on the value of $p$, called the model order. In speech coding, the model order is fixed (around 10). It is possible to arrive at the optimal model order as follows. We calculate the cumulative error $\sum_n e_n^2$ as a function of $p$; the value of $p$ which gives the minimum error is taken as the optimal model order.
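A sketch of this search, reusing the lpc_coefficients and lpc_residual helpers from the earlier sketch. The per-sample normalization is our addition, so that different orders are compared over comparable prediction spans:

import numpy as np

def optimal_order(s, max_p=10):
    # Sweep candidate orders and keep the one with the smallest
    # squared residual, normalized per predicted note.
    best_p, best_err = 1, np.inf
    for p in range(1, max_p + 1):
        a = lpc_coefficients(s, p)
        e = lpc_residual(s, a)
        err = np.mean(e[p:] ** 2)
        if err < best_err:
            best_p, best_err = p, err
    return best_p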
4. Improvisation
Most improvisation models use randomness or genetic algorithms [5] [6] [11]. Purely random signals are not music.
Figure 3. LPC-based improvisation. The red (gray) curve is the original pitch contour; the blue (dark) curves are the pitch contours of the LPC-based improvisations. The top panel is based on an order-1 LPC approximation, the bottom panel on an order-2 approximation. Order 1 has lower error than order 2, which is reflected in the figure: the order-1 contour follows the original more closely.

Figure 4. Music synthesis for videos. The top panel shows the normalized, mean-subtracted hue profile of one video shot. This profile is taken as the LPC residual, together with the coefficients obtained from the LPC analysis of the Mozart clip. In the bottom panel, the synthesized pitch profile is shown in red (gray) and the original in blue. When the hue is close to its mean, the synthesized contour is close to the original; when the hue deviates from the mean, the synthesized contour deviates as well, maintaining the general shape.
Hence the key to successful improvisation is controlled randomness: the model should capture the regularities of the music, and randomness should be introduced in an appropriate place. Another major shortcoming of most existing models of improvisation is that they do not consider temporal aspects.
From our LPC-based analysis of pitch contours (or pitch
time series), we conclude the following.
1. The LP coefficients capture the underlying dynamics.
2. The residual (which is random except for the impulse
occurrences) provides the novel part.
So we improvise on a melody by keeping the same LP coefficients and changing the residual depending on the application: the residual is replaced by a random signal which is controlled appropriately.
In this paper, we perturb the original LPC residual with small, normally distributed random values:
$$\tilde{e}_n = e_n + \nu_n, \qquad \nu_n \sim \mathcal{N}(0, \sigma^2),$$
where $\sigma$ controls the level of the noise. Since the original error distribution is itself normal, this perturbation does not affect the fundamental character of the residual. The signal synthesized from the residual $\tilde{e}_n$ and the original LP coefficients provides the improvised contour.
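A minimal sketch of this improvisation loop under the definitions above. The synthesis recursion inverts the LPC analysis; the value of sigma and the rounding of the synthesized contour to integer MIDI pitches are illustrative choices of ours:

import numpy as np

def synthesize(a, e, seed):
    # Invert the LPC analysis: s[n] = sum_k a_k * s[n-k] + e[n],
    # seeded with the first p pitches of the original contour.
    p = len(a)
    s = np.empty(len(e))
    s[:p] = seed[:p]
    for n in range(p, len(e)):
        s[n] = np.dot(a, s[n - p:n][::-1]) + e[n]
    return s

def improvise(s, a, e, sigma=0.5, rng=None):
    # Keep the LP coefficients, perturb the residual with Gaussian
    # noise, and resynthesize the pitch contour.
    rng = np.random.default_rng() if rng is None else rng
    e_tilde = e + rng.normal(0.0, sigma, size=len(e))
    contour = synthesize(a, e_tilde, seed=s)
    return np.rint(contour).astype(int)  # snap to MIDI pitch numbers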
Figure 3 shows the results of improvisation.
5. Music generation for videos
Home videos lack compelling audio content, and adding audio makes videos more "presentable". In [10] we explored synthesizing music based on video content. The audio-video feature match is obtained using the principles of computational media aesthetics [14]. It is easy to see that music tempo and video motion activity should be matched for aesthetic reasons. It is less obvious that the pitch values (of the music) and the hue values (of the video) should also be matched. Hue values are nearly constant inside video shots; see the top profile of Figure 4 for an example hue profile. If the pitch profile is made directly proportional to hue, it will be piecewise constant, resulting in monotonous music. To produce pleasing music, rule-based and analogy-based techniques were proposed in [10].
In this section we show that improvisation can be used to produce aesthetically pleasing music. In the application scenario, the user provides a video and a music clip (in MIDI format) to be mixed. Systems like Muvee [1] mix audio and video files with no matching; we want to adapt the music to the video content. It is easy to scale the tempo of the music according to the video motion activity, but the hue profile of the video and the pitch profile of the music may not be matched. We therefore use improvisation on the music so that the hue profile and the pitch profile are matched. The definition of improvisation is slightly different in this case. In the usual improvisation models, new music is produced from an example using inspiration (in humans) or randomness (in computers). Here, the video content provides the basis for "inspiration". This is very easy to do in our LPC model. As mentioned before, the LPC analysis produces a residual signal. We replace the original residual by a signal which depends on the hue profile of the video. If $h$ is the hue profile, we let
$$e_n = f(h_n),$$
where $f(\cdot)$ is a monotonic function which adjusts the dynamic range appropriately. A pitch profile is then synthesized using the LPC coefficients of the original clip and the hue-dependent residual.
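A sketch of the hue-driven residual under these definitions. The linear choice of $f$ (a simple gain that matches the residual's dynamic range) is our assumption, and synthesize() is the recursion from the previous sketch:

import numpy as np

def hue_residual(h, e_ref):
    # Map the hue profile to a residual: mean-subtract, then apply a
    # linear monotonic f that matches the reference residual's range.
    h = np.asarray(h, dtype=float)
    h = h - h.mean()
    peak = max(np.max(np.abs(h)), 1e-9)
    gain = np.max(np.abs(e_ref)) / peak
    return gain * h  # e_n = f(h_n) with f linear

# Example: pitch = synthesize(a_clip, hue_residual(h, e_clip), seed=s_clip)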
Figure 4 shows the computation of the pitch profile. The pitch contour is converted to music by

1. substituting pitch values by the corresponding chords of the original music, and

2. scaling the tempo of the synthesized music to match the motion activity in the video,

as sketched below.
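An illustrative sketch of these two steps. The chord lookup table, the tempo-scaling rule, and all names here are our assumptions about one possible realization, since the paper does not detail them:

def to_chords(contour, chord_of_root):
    # Step 1: replace each synthesized root pitch with the chord
    # (a tuple of MIDI pitches) that the original music used for it.
    # Roots absent from the table fall back to a single note.
    return [chord_of_root.get(int(p), (int(p),)) for p in contour]

def scale_tempo(base_bpm, motion_activity, low=0.75, high=1.5):
    # Step 2: map normalized motion activity in [0, 1] to a tempo
    # multiplier, so faster scenes get faster music.
    return base_bpm * (low + (high - low) * float(motion_activity))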
6. Discussion
The central ideas of this paper are:

1. Pitch dynamics, or the "melodic theme" of music, can be captured using low-order LPC.

2. The LPC residual (the "innovation") captures the variation from the "theme".
We have used this idea of the residual as innovation for both random improvisation and hue-dependent music improvisation.
The work reported here can be extended in several directions. The LPC model can be used for content-based music retrieval. Its use in improvisation can be enhanced in the following ways. The LPC model can be applied to other parameters such as note duration and note amplitude. Further, in this paper we have not modified the LP coefficients; it is possible to change them to obtain related pitch dynamics.
A central difficulty of work of this type is the evaluation of quality. While compression quality can be measured relatively objectively using human subjects, it is more difficult to evaluate music quality, since background, training, and "taste" influence the judgements (apart from factors like age and sex).
At the same time, music is an important media element and audio-video interactions are important [2]. Hence studies of this nature are valuable. But common public databases and evaluation procedures need to be standardized before meaningful comparisons can be made. Unfortunately, the study of music has not attracted the same level of attention and resources as audio and video. The main reason for this may be the perception that music is more an "art" than a "science".
References
[1] Muvee Technologies. http://www.muvee.com.
[2] M. Chion. Audio-Vision: Sound on Screen. Columbia University Press, 1994.
[3] D. Cope. An expert system for computer-assisted composition. Computer Music Journal, 1987.
[4] DNA & Protein Music, Aug. 2002. http://linkage.rockefeller.edu/wli/dna_corr/music.html.
[5] S. Fels and J. Manzolli. Interactive, evolutionary textured sound composition. In Eurographics Workshop on Multimedia, 2001.
[6] G. Papadopoulos and G. Wiggins. A genetic algorithm for the generation of jazz melodies. In STeP, 1998.
[7] R. Parncutt. Harmony: A Psychoacoustical Approach. Springer-Verlag, 1989.
[8] L. D. Paulson. Researcher puts the network to music. IEEE Computer, page 23, Mar. 2003.
[9] T. F. Quatieri. Discrete-Time Speech Signal Processing: Principles and Practice. Prentice Hall, 2002.
[10] S. H. Srinivasan, M. Gajanan, and M. Kankanhalli. Music synthesis for home videos. Submitted, 2003.
[11] B. Thom. Unsupervised learning and interactive jazz/blues improvisation. In AAAI, 2000.
[12] A. Tobudic and G. Widmer. Playing Mozart phrase by phrase. In International Conference on Case-Based Reasoning, 2003.
[13] C. Yu. Computer Generated Music Composition, 1996. http://www.oz.net/~cyu/Thesis.html.
[14] H. Zettl. Sight, Sound, Motion: Applied Media Aesthetics. Wadsworth, 1998.