
A Multi-tiered Approach for Analyzing Expressive Timing in Music Performance
Panayotis Mavromatis
Department of Music and Performing Arts,
New York University,
35 West 4th St., Suite 777,
New York, NY 10012, USA
[email protected]
http://theory.smusic.nyu.edu/pm/
Abstract. This paper presents a method for analyzing expressive timing data from music performances. The goal is to uncover rules that explain a performer’s systematic timing manipulations in terms of structural features of the music, such as form, harmonic progression, texture, and rhythm. A multi-tiered approach is adopted, in which one first identifies a continuous tempo curve by performing non-linear regression on the durations of performed time spans at all levels of the metric hierarchy. Once the effect of tempo has been factored out, subsequent tiers of analysis examine how the performed subdivision of each metric layer (e.g., the quarter note) typically deviates from an even rendering of the next-lowest layer (e.g., two equal eighth notes) as a function of time. Structural features in the music are identified that contribute to a performer’s tempo fluctuations and metric deviations.
1 Introduction
The study of expressive musical performance has been the subject of experimental as well as computational research [1,2]. It is generally acknowledged that
expressive timing—a performer’s deviations from an exact temporal rendering
of the score—is an important component of musical expression. By manipulating timing, a performer is able to communicate musical structure and shape a
listener’s experience of the music. This paper presents a method for analyzing expressive timing data, extracted through audio analysis of recorded performances.
The purpose is to uncover rules that explain a performer’s systematic timing
manipulations in terms of structural features such as form, harmonic progression,
texture, and rhythm.
A fundamental assumption of this analysis is that a performer controls a hierarchically structured metrical cycle of measure, beat, and subdivision levels [3,4].
At each point in time, the performer’s mental clock fires at a given tempo, which
is evidenced by the cumulative effect of all the levels in the metrical cycle. The
performer’s clock rate as a function of time is represented by a tempo curve.
Identifying this curve forms a natural first tier of analysis.
Once the effect of tempo has been factored out, it is possible to examine
how the performance rendering of subdivisions between adjacent metrical layers
(e.g., the subdivision of a quarter note into two eighth notes) deviates from the
corresponding exact duration ratios (e.g., 0.5 / 0.5). In the subsequent tiers of
analysis, systematic deviations of this type are identified at each level in the
metric hierarchy.
As justification for the proposed multi-tiered approach, we first note that
it is supported by informal musical discourse: terms such as ritardando and
accelerando typically refer to the first tier of expressive timing, whereas terms
such as rubato, notes inégales, or “swing” most commonly represent deviations
in the subsequent tiers.
More to the point, it appears that, in principle at least, a skilled performer
can control each tier independently. For instance, a performer may be asked to
manipulate the tempo of a performance, while maintaining even metric subdivisions. Conversely, the performer may be requested to perform at a steady tempo,
while producing various types of uneven metric subdivisions. Moreover, these
uneven subdivisions can be executed independently at any particular metric
level—up to a certain depth—while maintaining even timing at higher metrical
levels.
One of the challenges of expressive performance research is to understand
the cognitive mechanisms that underlie expert expressive rendering of a musical
score. In line with current views on cognitive modeling, it is natural to seek
modular rules that specialize in responding to specific features of the musical
structure (such as metric accent) by shaping the expression in specific ways
(such as lengthening the strong half of a subdivision). This modularity requirement poses challenges for any analytical approach to expressive performance
data, which must identify and isolate the effect of individual rules from their
surrounding context, where other rules may be simultaneously contributing to
expressive deviations. It is in this spirit that the present analysis is offered; it
represents work aiming towards a complete rule system for expressive timing.
We believe that the multi-tiered analytical approach proposed in this paper can
help identify and isolate the right ingredients in this complex multi-faceted manifestation of expert musical skill.
2 Related Previous Research
Several studies have focused on modeling specific aspects of expressive performance, such as rubato [5] or the final ritardando (see [6] for a review). In addition,
some research groups have aimed for comprehensive models that integrate many
different components of performance expression. As expected, timing plays a
central role in such models.
An important early attempt at an integrated model was the work of Eric
Clarke [7]. Clarke proposed nine generative rules to explain expressive deviations in terms of the performed piece’s structural features, such as grouping and
meter. These rules were derived from measurements of piano performances in
experimental studies by Clarke and collaborators. Another important contribution was the KTH model by Sundberg and his group [8]. This model represents a
synthetic approach, where expression rules were formulated by querying expert
performers as to their expressive deviation practices.
The approach proposed in the present paper is inspired in part by the work of
Gerhard Widmer and his collaborators [9,10,2], perhaps the most sophisticated
proposal to date for an integrated model of expressive performance. Widmer’s
group applied machine learning techniques to analyze measurements of expressive performance by skilled musicians.
Most relevant to our approach is the fact that Widmer employed a two-tiered
data analytic model, in which local note-to-note expressive deviations were separated from the more global expressive shaping of grouping units, such as phrases.
Following earlier research [11], Widmer hypothesized that each grouping unit in
the music contributes a parabolically shaped accelerando-ritardando component
to the performance’s tempo curve. The overall tempo curve is assumed to be the
product of all such contributions coming from each grouping unit.
The first tier of Widmer’s analysis consisted of identifying the parabolic coefficients corresponding to each unit of grouping. The process started from the
highest grouping level and proceeded to the lowest. At each level, the coefficients were identified by least-squares fitting. After each level’s contribution was
factored out, the analysis was repeated at the next lowest level, until all levels of grouping were accounted for. The residual timing deviations were then
attributed to local note-to-note expressive timing rules, which were extracted
from the data via a machine learning algorithm.
The present work extends and modifies Widmer’s approach in two different
ways. First, we do not make the assumption that grouping is the only factor
contributing to the shape of the tempo curve. Instead, we consider sources of additional contributions, such as texture and the tonal/formal function of phrases
and sections. As we will see, there is indeed evidence that such factors come into
play in determining a performance’s tempo fluctuations.
The second difference between Widmer’s approach and ours is that, rather
than examining a single layer of low-level residual timing deviations, we analyze
separately the deviations at each subdivision level in the metric hierarchy. As
we will see, there is evidence that this separation could lead to simpler, more
modular rules. At the same time, this allows us to develop rules that are specific
to the absolute time scale of metric subdivision, as measured in seconds. Indeed,
different time scales of pulsation can have different cognitive properties, as evidenced by several experimental studies, which are nicely summarized in [4] (see
especially Chapter 2).
3 Tempo Curve Calculation Using a Non-Parametric Regression Model
If we give up a specific functional dependence of the tempo curve on grouping,
as implemented by the fitting of parabolic segments, we must consider the most
general options for calculating the tempo curve from the timing data. This naturally leads to a non-parametric regression analysis, which does not assume a
specific functional form for the tempo curve.
For the purposes of this study, we found it most flexible to use a non-linear
regression model based on radial basis functions. The technique was first proposed in [12,13], and is a particular instance of density estimation using Parzen
windows [14]. In its simplest form, the process can be illustrated as follows:
Let $\{x_i : i = 1, \ldots, N\}$ be a set of values for the independent variable $X$, and let $\{y_i : i = 1, \ldots, N\}$ be the corresponding values of the dependent variable $Y$, so that $(x_i, y_i)$ are the coordinates of the $i$-th point in the data set. Then the regression curve $y(x)$ obtained from the above data set is given by

$$y(x) = \frac{\sum_{i=1}^{N} y_i \exp\left[-(x - x_i)^2 / 2\sigma^2\right]}{\sum_{i=1}^{N} \exp\left[-(x - x_i)^2 / 2\sigma^2\right]}$$
under the assumption of a Gaussian Parzen window. This expression, calculated
through the Parzen density estimation formula, has a simple interpretation: it
tells us that the predicted value of y at point x is equal to a weighted sum of the
yi observed at each xi . The weights are determined by the distance of each xi
from x, and decay rapidly with that distance, according to a Gaussian function.
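For concreteness, here is a minimal sketch of this estimator in Python with NumPy; the function name and array layout are illustrative, not part of the original method.

```python
import numpy as np

def parzen_regression(x_query, x_data, y_data, sigma):
    """Evaluate the Gaussian-kernel regression curve y(x) above at the
    points x_query, given observations (x_data[i], y_data[i])."""
    x_query = np.asarray(x_query, dtype=float)
    x_data = np.asarray(x_data, dtype=float)
    # Squared distances between every query point and every data point.
    d2 = (x_query[:, None] - x_data[None, :]) ** 2
    w = np.exp(-d2 / (2.0 * sigma**2))      # Gaussian Parzen weights
    # Weighted average of the observed y_i, one value per query point.
    return (w @ np.asarray(y_data, dtype=float)) / w.sum(axis=1)
```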
The standard deviation σ of the Gaussian Parzen window is also known as the window width, and can be viewed as a smoothing parameter. Thus, the regression is formally equivalent to a Gaussian-weighted moving-average filter. However, the window width is not set a priori, but is inferred from
the data. Indeed, a central problem of this regression analysis is to determine
the right value of σ: If the latter is too large, the regression curve becomes too
coarse to capture the meaningful fluctuations in the data. Conversely, when σ
is too small, the regression curve displays over-fitting, i.e., it captures random
noise fluctuations in the data and is a poor predictive model. A simple, yet
effective way to determine the appropriate value of σ is through a form of N-fold
cross-validation. This is effected by minimizing a cost function that represents a
least-squares error on the cross-validation training sets (see [13] for more details).
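One plausible realization of this procedure, under the assumption that each fold holds out a single data point (the limiting case of N-fold cross-validation), is sketched below using parzen_regression from the sketch above; see [13] for the exact formulation.

```python
import numpy as np

def cv_cost(sigma, x, y):
    """Leave-one-out squared-error cost for a candidate window width:
    predict each y[i] from all other points and accumulate the error.
    x and y are NumPy arrays of equal length."""
    n = len(x)
    cost = 0.0
    for i in range(n):
        keep = np.arange(n) != i            # hold out point i
        pred = parzen_regression(x[i:i + 1], x[keep], y[keep], sigma)[0]
        cost += (y[i] - pred) ** 2
    return cost

def best_sigma(x, y, candidates):
    """Pick the candidate width minimizing the cross-validation cost."""
    return min(candidates, key=lambda s: cv_cost(s, x, y))
```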
The starting point for our analysis is a performance’s set of inter-onset durations. These are extracted from the audio recording using Tristan Jehan’s Echo
Nest API. The latter is a programming toolkit for digital audio analysis that
contains a tool for automatic note onset detection.¹ Depending on the value of a
resolution parameter, the algorithm can miss a real note onset (if the resolution
is too low), or detect a spurious one (e.g. caused by reverb, if the resolution is
too high). There is no single optimal resolution, and so it is generally safest to
perform onset detection using a relatively high value, to ensure that no notes
have been missed. As a result, any spurious onsets detected by the algorithm
must be filtered out manually by listening. The algorithm produces the time of
each onset in seconds, correct to four decimal places, from which the inter-onset
duration values can be calculated at the same precision.
¹ See http://developer.echonest.com/pages/overview (last visited March 2009).
Since each note’s inter-onset duration reflects not only the local tempo, but
also the note’s nominal duration value (e.g., quarter-note, eighth note, etc.),
we must normalize each of the raw inter-onset durations by dividing it by the
corresponding note’s nominal value, where a whole note equals 1.0, a quarter
note equals 0.25, etc. This way, each normalized inter-onset duration is a consistent indicator of the local tempo: its value reflects the whole-note duration
corresponding to the tempo at that specific point in time. Our solution is essentially equivalent to Widmer’s representation of his timing data using percentage
deviations instead of absolute durations, but has the added advantage that it
keeps track of absolute tempo information, and not just its relation to some
average. The normalized inter-onset durations for each performance were used
as data presented to the non-linear regression model, in order to obtain that
performance’s tempo curve.
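As a sketch, assuming the onset times and the corresponding nominal note values (in whole-note units) have already been aligned with the score, the normalization amounts to a single element-wise division; all values and array names below are invented for illustration.

```python
import numpy as np

# Onset times in seconds (spurious onsets already removed by ear) and
# the nominal value of each note in whole-note units (quarter = 0.25,
# eighth = 0.125, ...), read from the score.
onsets = np.array([0.0000, 0.2512, 0.5101, 0.7633, 1.0310])
nominal = np.array([0.125, 0.125, 0.125, 0.125])   # four eighth notes

ioi = np.diff(onsets)         # raw inter-onset durations (seconds)
tempo_data = ioi / nominal    # seconds per whole note at each onset

# tempo_data is the dependent variable handed to the regression model;
# its regression curve against score position is the tempo curve.
```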
Figure 1 shows the application of the above analysis to a recording of Bach’s
F minor prelude, BWV 881, from the Well-Tempered Clavier, Book 2. The piece
is performed on the harpsichord by an expert, and is recorded on a commercially
available CD. This performance will be used as an illustration throughout the
paper. In Figure 1, the data points corresponding to the normalized inter-onset
durations are shown in grey. The tempo curve derived from the regression is
shown in black.
The performances of three contrasting Bach preludes (BWV 845, 863, 881)
were analyzed, each of them performed by two different harpsichordists. The
most salient factors shaping the tempo curve appear to be
– an initial small accelerando;
– a pronounced final ritardando;
– less pronounced, but consistent ritardandi leading to important cadences,
with magnitude usually reflecting the cadence’s hierarchical depth in the
Schenkerian sense;
– small but measurable contrasts in tempo to highlight sections marked off by
distinctive texture or tonal function (e.g., extended dominant pedal).
Once the effect of tempo is factored out, one can examine the lengthening or
shortening of individual measures with respect to their neighbors, in response to
specific features of the music. This individual manipulation of measure lengths
is distinct from overall tempo change, and can be represented in a graph such as
that of Fig. 2. Identified variations of this type include lengthening a measure
that
– begins a hypermetric pair;
– effects tonal arrival or resolution of a dissonant chord;
– contains unexpected material, such as a highly chromatic chord in a diatonic
context.
One intriguing feature of the non-parametric regression analysis is that the
optimal Parzen window width σ leading to each tempo curve emerges out of the
regression analysis through the process of cross-validation. The absolute value of
σ is usually in the range of 2–4 seconds (2.0591 secs for the curve of Fig. 1). It is
an open question whether this value may hold some special significance, either
in terms of tempo, or the structure of the piece, or even in terms of psychological
properties of time perception and production.
4 The Hierarchy of Metric Deviations
The performed subdivision of each metric layer (e.g., quarter note) typically
deviates from an even rendering of the next lowest layer (e.g., two equal eighth
notes) as a function of time. This information can be represented in a graph
such as that of Figures 3 and 4. The nature of such deviations varies with metric
depth. They are often embedded in a small amount of random noise, which
reflects limits in the perception and production of exact rhythmic ratios [15].
However, some systematic variations are noteworthy. For instance, a consistent
lengthening of the metrically strongest half in a two-fold subdivision highlights
its stronger metric position through agogic accent. This is in line with findings
reported in many other approaches [7,8,9]. In our analysis, such specialized rules
are generally arrived at by inspection, and are subsequently confirmed using
standard statistical tests. The possibility of employing some machine-learning
classifier to uncover such rules algorithmically, in a manner akin to [9], is currently under investigation.
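As an illustration, the following sketch computes such duple-subdivision ratios from inter-onset durations, expressed so that an even rendering gives 0.5 : 0.5 (the 1.06 : 0.94 of Fig. 3 is the same information scaled so that an even rendering is 1 : 1). It assumes an unbroken run of duple subdivisions; aligning real data to the score requires more care.

```python
import numpy as np

def duple_subdivision_ratios(ioi):
    """Pair consecutive inter-onset durations (the two halves of each
    beat) and normalize each pair by the beat's total duration."""
    n = len(ioi) // 2 * 2                  # drop an unpaired trailing span
    first, second = ioi[0:n:2], ioi[1:n:2]
    total = first + second
    return np.stack([first / total, second / total], axis=1)

# Invented example: a slightly lengthened strong half on the first beat.
print(duple_subdivision_ratios(np.array([0.265, 0.235, 0.250, 0.250])))
# -> [[0.53 0.47]
#     [0.50 0.50]]
```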
It is perhaps most remarkable that, even though deviation from exact subdivision is free to vary on a point-by-point basis, the deviations observed in
performance often vary smoothly over extended time spans, which typically correspond to formal units such as phrases (see Figs. 3 and 4). This suggests
that manipulation of subdivisions is not always controlled on a pulse-by-pulse
basis, which might impose excessive demands on real-time processing. Instead,
it is shaped by broader gestures in a performer’s motor programs, coordinated
so as to reinforce communication of musical structure. We would like to suggest
that our multi-tiered analysis, which separates each layer of subdivision in the
metric hierarchy, makes it easier for such patterns to be identified.
It should be added that the same non-parametric regression technique that
is used to construct the tempo curve has been applied to subdivision timing
data such as those of Figs. 3 and 4, in order to extract the underlying envelope.
Once that envelope is identified, one can seek rules that cause the subdivisions
of particular pulses to deviate from the overall envelope. Such deviations can be
typically attributed to the need to project some type of accent.
5 Conclusions and Future Directions
The present paper proposed a data analytic method that aims to uncover rules
linking musical structure to specific expressive timing gestures in music performance. Several links were suggested between musical structure and expressive
timing at one or several tiers in a hierarchy. The description of structure-to-timing associations remains to some extent qualitative at this stage. This could
perhaps be partly attributed to an inevitable element of unpredictability that
may exist from performance to performance, even for the same player under
different circumstances. However, given the present analysis, there are many
ways to explore the possibility of precise quantitative relations between musical
structure and expressive timing deviations.
For instance, correlations between structural features of the music and specific
expressive deviations can be established by (i) annotating the score with a large
number of potentially relevant features, some of them objectively identifiable
(e.g., location of cadences), and some requiring annotations by independent musical experts; (ii) seeking correlations between the above features and expressive
deviation gestures such as peaks in the tempo curve, lengthened measures, or
lengthened beats. Such features can be tabulated in contingency tables, to which
standard statistical tests can be applied.
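As a sketch of step (ii), a 2×2 contingency table can be tested for association with a standard chi-squared test; the feature pairing and all counts below are invented purely for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: measure begins a hypermetric pair (yes / no).
# Columns: measure markedly lengthened (yes / no).  Counts are invented.
table = np.array([[18,  6],
                  [ 9, 31]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # a small p suggests association
```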
Another analytical approach might involve modeling the exact shape of the
tempo curve, seeking a quantitative predictive model of tempo fluctuations as
a function of specific musical features. This would entail (i) quantification of
all the possibly relevant features as continuous functions of time [16], and (ii)
complex regression analysis to identify features that are the best predictors of the
tempo curve. We are currently exploring certain multivariate time-series models
that could lead to such quantitative relations. As for the expressive subdivisions
within each layer of the metric hierarchy, they can be effectively modeled as they
unfold in time using the technique of Hidden Markov Models.
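As one possible realization of this last idea, a Gaussian HMM over the per-beat subdivision ratios could be fit with an off-the-shelf package such as hmmlearn; this is an assumption about tooling, not the model used in the present work.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # third-party package, one option

# Per-beat strong-half fractions, e.g. the first column produced by
# duple_subdivision_ratios above.  Values are invented for illustration.
ratios = np.array([[0.53], [0.52], [0.50], [0.48], [0.47], [0.51], [0.53]])

# Hidden states might capture regimes such as "lengthened strong half",
# "even", and "shortened strong half" as they unfold in time.
model = GaussianHMM(n_components=3, covariance_type="diag", n_iter=100)
model.fit(ratios)
print(model.predict(ratios))  # most likely regime at each beat
```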
Fig. 1. Tempo curve for a recorded performance of Bach’s F minor prelude, BWV 881 (WTC, Book 2), plotted against normalized
inter-onset durations for each note in the piece. The x-axis represents position in the notated score using measure numbers (e.g., 2.5 is
the middle of the second measure). The y-axis represents local tempo, measured by the duration of a whole note in seconds.
Fig. 2. Graph showing how individual measure durations deviate from the duration corresponding to the performed tempo at each point in
time. The x-axis represents position in the notated score using measure numbers (see Fig. 1). The y-axis represents individual measure
durations in seconds. Therefore, a high spike represents a markedly lengthened measure. The graph comes from the same performance
as that of Fig. 1.
Fig. 3. Graph showing the performed subdivision of each quarter note into two eighth notes. As before, the x-axis represents position in
the notated score using measure numbers (see Fig. 1). The y-axis represents the duration ratio of each subdivision. E.g., the first quarter
note in the Figure is subdivided into two eighth-note time spans of duration ratio 1.06 : 0.94. The measurements are taken from the
same performance as that of Figures 1–2. Figure 3 focuses on mm. 1–23. The complete piece is represented in Fig. 4.
Fig. 4. The measurements presented in Fig. 3, now shown over a larger time scale that encompasses the entire piece. The axes carry the
same meaning as those of Fig. 3. From this graph, one can clearly observe how the deviations from exact subdivisions vary smoothly in
magnitude over time, as witnessed by the smooth envelope to the curve. These smooth variations are shaped by gestures (lumps in the
envelope) corresponding to meaningful formal units such as phrases.
References
1. Gabrielsson, A.: Music Performance. In: Deutsch, D. (ed.) The Psychology of Music, 2nd edn. Academic Press, San Diego (1999)
2. Widmer, G., Goebl, W.: Computational Models of Expressive Music Performance:
The State of the Art. Journal of New Music Research 33, 203–216 (2004)
3. Lerdahl, F., Jackendoff, R.S.: A Generative Theory of Tonal Music. MIT Press,
Cambridge (1983)
4. London, J.: Hearing in Time: Psychological Aspects of Musical Meter. Oxford
University Press, Oxford (2004)
5. Todd, N.P.M.: A Computational Model of Rubato. Contemporary Music Review 3,
69–88 (1989)
6. Honing, H.: Computational Modeling of Music Cognition: A Case Study on Model
Selection. Music Perception 23, 365–376 (2006)
7. Clarke, E.F.: Generative Principles in Music Performance. In: Sloboda, J.A. (ed.)
Generative Processes in Music: The Psychology of Performance, Improvisation,
and Composition. Clarendon Press, Oxford (1988)
8. Friberg, A.: A Quantitative Rule System for Musical Performance. Doctoral dissertation, Royal Institute of Technology, Stockholm (1995)
9. Widmer, G.: Machine Discoveries: A Few Simple, Robust Local Expression Principles. Journal of New Music Research 31, 37–50 (2002)
10. Widmer, G., Tobudic, A.: Playing Mozart by Analogy: Learning Multi-Level Timing and Dynamics Strategies. Journal of New Music Research 32, 259–268 (2003)
11. Todd, N.P.M.: A Model of Expressive Timing in Tonal Music. Music Perception 3,
33–58 (1985)
12. Specht, D.F.: A General Regression Neural Network. IEEE Transactions on Neural
Networks 2, 568–576 (1991)
13. Specht, D.F.: Probabilistic and General Regression Neural Networks. In: Chen,
C.H. (ed.) Fuzzy Logic and Neural Network Handbook. McGraw-Hill, New York
(1996)
14. Parzen, E.: On Estimation of a Probability Density Function and Mode. Annals of
Mathematical Statistics 33, 1065–1076 (1962)
15. Clarke, E.F.: Rhythm and Timing in Music. In: Deutsch, D. (ed.) The Psychology
of Music, 2nd edn. Academic Press, San Diego (1999)
16. Farbood, M.M.: A Quantitative, Parametric Model of Musical Tension. Doctoral dissertation, Massachusetts Institute of Technology, Cambridge (2006)