Iterative Multi Scale Dynamic Time Warping

Computer Graphics Technical Reports
CG-2006/1
Arno Zinke
Universität Bonn. [email protected]
Dessislava Mayer
Institut für Informatik II
Universität Bonn
D-53117 Bonn, Germany
© Universität Bonn 2006
ISSN 1610-8892
Iterative Multi Scale Dynamic Time Warping
A. Zinke and D. Mayer
Institut für Informatik II, Universität Bonn, Germany
Email: [email protected]
Abstract
Dynamic Time Warping (DTW) is a frequently used technique for the optimal alignment of sequences with respect
to given constraints. The main disadvantages of DTW are its time and memory complexity. We present a
novel iterative scheme which can significantly improve the DTW performance with respect to computation time
and memory requirements in case of very large sequences. In contrast to previous iterative approaches which were
designed for clustering time series with respect to shape, our method is suitable for precise alignments for a wide
range of different features and similarity measures.
Keywords: Dynamic time warping, Multi Scale DTW
1. Introduction
Dynamic time warping (DTW) is a frequently used technique for the optimal alignment of sequences with respect
to given constraints. It has been successfully applied to
many realms such as data mining, speech processing and
most recently to motion morphing and music alignment
([SRS03, TE03, KP99, KG03]). The basic idea of DTW is
to compare two sequences (which can be seen as signals)
with respect to certain features (like for instance shape or
local frequency content) and to find an optimal alignment
of both sequences by stretching them with respect to time
(Fig.1 Right). For the comparison a distance or similarity
measure is needed.
One major drawback of conventional dynamic time warping is that it is much too slow and memory consuming
for aligning large sequences, due to its quadratic time and
memory complexity. We present a novel approach called
Iterative Multi-Scale Dynamic Time Warping (IMDTW),
significantly speeding up the classic algorithm and decreasing memory requirements for DTW alignments of large signals.
It combines the two ideas of
• restricting the path candidates and
• iteratively exploiting the DTW at different resolutions
(Fig.4).
In contrast to many previous approaches, IMDTW can be applied to a wide range of different features and measures and is not restricted to a single special case such as the shape alignment of signals. To check whether a "feature-measure combination" may be suitable for IMDTW, we furthermore introduce a "stability" criterion. Although this criterion is not a sufficient condition, our tests indicated that, if it is satisfied, IMDTW gives virtually the same alignments as classical DTW.
Moreover, our approach can be applied in the case of less constrained alignments, where other methods typically fail (see Fig.2 Right).
Since music alignment is a very important field of application for classical DTW, it will serve as an interesting example for IMDTW.
2. Related Work
Several techniques were proposed to improve the complexity of DTW at least for the average case. Global path constraints for restricting the warping path to a certain area were
introduced by [SC71, Ita75]. All these constraints basically
limit the distance from the DTW path to the main diagonal.
Such constraints may slightly speed up the DTW computation and can help to prevent "pathological warpings". On
the other hand, however, they heavily restrict the number
of candidates for the optimal path which may cause undesired alignment results (Fig.2 Right). Interesting approaches
for speeding up the DTW algorithm were developed in the
realm of data mining. Especially for classification and clustering of large collections of time series, the classic DTW approach becomes almost impractical with respect to computation time. The Euclidean distance measure, or some extension or modification thereof, is widely used for similarity calculations in conjunction with DTW to align signals
with respect to similar shapes. Such alignments typically do
not change drastically in case of a piecewise approximation
of the data. One can therefore take advantage of this fact and apply DTW at a lower time resolution without losing much information about the similarity.
A piecewise linear representation of a series was proposed
by Keogh et al. (SDTW, [KP99]). In a subsequent paper ([CKHP02]), Chu et al. introduced another reduced-dimensionality representation called Piecewise Aggregate Approximation (PAA), which can basically be seen as a downsampled version of the original data. Applying DTW to such
modified data is called Piecewise Dynamic Time Warping
(PDTW). Chu et al. presented an iterative scheme for performing DTW named Iterative Deepening Dynamic Time
Warping (IDDTW) for classifying and clustering time series
of large databases with respect to a given query. The basic
idea is to calculate a PAA representation of the query and
each data set for different time scales and to apply a PDTW
at these modified series starting at a very low resolution. At
every iteration step a probabilistic algorithm together with a
user specified tolerance parameter decides whether to apply
the PDTW to a higher PAA resolution or to keep the current
PDTW approximation. SDTW, PDTW and IDDTW were
not designed for precise alignments, but can drastically reduce the computation time for classification and clustering
of massive data sets, although the worst-case time and memory complexity remains $O(n \cdot n')$, with $n$ and $n'$ being the numbers of frames of the two series.
Especially in the realm of music alignment, DTW is commonly used in conjunction with non-Euclidean measures in the frequency domain, since most qualitative information about music comes from spectral analysis and various other nontrivial features (for example rhythm), but not from the actual envelope of the signal in the time domain. Music recordings are typically stored in a Pulse Code Modulation format (PCM format), a time-dependent encoding of air pressure fluctuations (volume) due to instruments or voice. Such encodings
can be seen as discrete time dependent signals. Thus, DTW
can be used to align music signals with respect to similarity measures in the spectral domain. Unfortunately methods
like SDTW, PDTW and IDDTW for instance, change the
frequency content of a signal. The piecewise approximation
is adequate with respect to Euclidean distance measures for
time domain shape alignments, but may be unsuitable, for
instance, in case of frequency domain features, since higher
frequency detail is wiped out. Furthermore, a rough low-resolution DTW may be sufficient for classification purposes (which was the intention of SDTW, PDTW and IDDTW), but
unacceptable for precise alignments.
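The effect of a piecewise approximation on frequency content can be illustrated with a minimal sketch. The function name paa and both test signals are hypothetical and chosen only for illustration; the sketch assumes that the series length is divisible by the number of segments.

```python
def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of each of `segments` equal chunks."""
    if len(series) % segments != 0:
        raise ValueError("this sketch assumes len(series) is divisible by segments")
    w = len(series) // segments  # samples averaged into one PAA value
    return [sum(series[k * w:(k + 1) * w]) / w for k in range(segments)]

slow = [1.0, 3.0, 2.0, 4.0, 10.0, 12.0, 0.0, 2.0]  # slowly varying shape
fast = [1.0, -1.0] * 4                             # fastest possible oscillation

slow_paa = paa(slow, 4)  # [2.0, 3.0, 11.0, 1.0]: the coarse shape survives
fast_paa = paa(fast, 4)  # [0.0, 0.0, 0.0, 0.0]: the oscillation is averaged away
```

The second signal becomes indistinguishable from silence after the approximation, which is why such representations are problematic for features that depend on high-frequency content.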
3. Dynamic Time Warping for Signal Alignment (DTW)
Since classical Dynamic Time Warping is explained in many
previous publications, we only give a brief and informal description of the basic work flow in context of signal alignment.
Given two signals $S$ of length $l$ and $S'$ of length $l'$, the alignment is typically a three-stage process:
In the first step the two signals are divided into sequences $\{f_1, \ldots, f_n\}$ of $n$ and $\{f'_1, \ldots, f'_{n'}\}$ of $n'$ frames, respectively, where each frame consists of $m$ samples. These frames may overlap by a constant number of $o$ samples (Fig.2 Left). The relationship between signal length $l$, overlap $o$, frame width $m$ and frame number $n$ is given by:
$$ n = \frac{l - o}{m - o} \tag{1} $$
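As a small worked example of Eq. 1, the following sketch splits a signal into overlapping frames and counts them. The helper split_into_frames is a hypothetical name, and the sketch assumes that $l - o$ is divisible by $m - o$, so that no incomplete frame remains at the end.

```python
def split_into_frames(signal, m, o):
    """Split `signal` into frames of m samples that overlap by o samples."""
    if not 0 <= o < m:
        raise ValueError("overlap must satisfy 0 <= o < m")
    hop = m - o  # distance between the starts of consecutive frames
    return [signal[s:s + m] for s in range(0, len(signal) - m + 1, hop)]

signal = list(range(20))                  # l = 20 samples
frames = split_into_frames(signal, m=8, o=4)
n = len(frames)                           # (l - o) / (m - o) = (20 - 4) / (8 - 4) = 4
```

Consecutive frames share exactly o samples: the last four samples of one frame are the first four of the next.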
In a second step, one feature vector $v_i$ per frame $i$ is extracted. In the case of music alignment $v_i$ often contains spectral information of a frame, which is commonly obtained through short-time Fourier analysis. Finally, a DTW algorithm calculates the optimal alignment for both feature vector sequences (Fig.1) with respect to a given similarity measure. To this end the Local Distance Matrix (LDM) is computed. It stores the pairwise similarity (distance) of every feature vector of the first signal to every feature vector of the second signal. Based on the LDM an optimal mapping between $S$ and $S'$ has to be found, which aligns every feature vector (and thereby frame) of $S$ ($S'$) to at least one feature vector (frame) of $S'$ ($S$). A mapping is basically defined by a warping path $WP$, which can be seen as a contiguous set of LDM elements: $WP := \{w_1, \ldots, w_m\}$. The optimal warping path (DTW path) is then given by the path with minimal cumulative cost $\sum_{k=1}^{m} w_k$ (see Fig.1). Several "global and local constraints", such as pairwise aligning the first frames of both signals as well as the last frames, may further restrict the set of all possible alignments.
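The workflow described above can be sketched in a few lines. This is a minimal illustration and not the implementation used for the experiments: it assumes the common step pattern of unit steps in the horizontal, vertical and diagonal direction, aligns first to first and last to last frame, and uses scalar features with an absolute-difference measure. All function names are hypothetical.

```python
def local_distance_matrix(fa, fb, dist):
    """LDM: pairwise distance of every feature of fa to every feature of fb."""
    return [[dist(x, y) for y in fb] for x in fa]

def dtw(ldm):
    """Return (cumulative cost, warping path) of the optimal DTW path."""
    n, n2 = len(ldm), len(ldm[0])
    inf = float("inf")
    acc = [[inf] * n2 for _ in range(n)]   # accumulated cost matrix
    acc[0][0] = ldm[0][0]
    for i in range(n):
        for j in range(n2):
            if i == 0 and j == 0:
                continue
            best = min(
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
            )
            acc[i][j] = ldm[i][j] + best
    # Backtrack from the last cell to the first along cheapest predecessors.
    i, j = n - 1, n2 - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[n - 1][n2 - 1], path

features_a = [0, 1, 2, 3]
features_b = [0, 1, 1, 2, 3]               # same shape, one value repeated
ldm = local_distance_matrix(features_a, features_b, lambda x, y: abs(x - y))
cost, path = dtw(ldm)
# cost == 0 and path == [(0, 0), (1, 1), (1, 2), (2, 3), (3, 4)]
```

Both the accumulated cost matrix and the LDM have $n \cdot n'$ entries, which is the quadratic time and memory cost discussed in the introduction.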
4. Iterative Multi-Scale Dynamic Time Warping and Multi-Scale Feature Extraction
From now on we assume (for simplicity) that the overlap o has zero length. Hence, the number of frames n (which can be
seen as the "resolution" of the alignment) only depends on
the frame width m.
To check whether a "feature-measure combination" may be suitable for IMDTW, we now introduce a "stability" criterion. This criterion ensures that the basic structure of an LDM does not change abruptly if the frame number decreases. Although this criterion is not a sufficient condition, we found that if it is satisfied, IMDTW will very likely give the same alignments as classical DTW.
Figure 1: Left: An optimal DTW path (red) as a result of comparing each feature vector of signal one to each feature vector of signal two with respect to a given measure. Right: A final alignment for two signals based on an optimal DTW path.
Figure 2: Left: A signal (black curve) of l samples is subdivided into frames of m samples each, which overlap by o samples. Right: Both the Sakoe-Chiba Band (left) and the Itakura Parallelogram (right) fail if the warping path (black) is not contained in them.
4.1. Stability
Let $M_{[i_1:i_2],[j_1:j_2]}$ denote the set of all matrix coefficients $m_{ij}$ within a rectangular area of a matrix $M$ (a sub matrix), characterized by indexes for rows and columns ranging from $i_1$ to $i_2$ and from $j_1$ to $j_2$, respectively (Fig.3):
$$ M_{[i_1:i_2],[j_1:j_2]} := \left( m_{ij} \right)_{i_1 \le i \le i_2,\; j_1 \le j \le j_2} \tag{2} $$
Let furthermore $\mathrm{Submats}(M)$ be the set of all sub matrices of a matrix $M$ with $M \in \mathrm{Mat}(n \times n')$:
$$ \mathrm{Submats}(M) := \left\{ M_{[i_1:i_2],[j_1:j_2]} \right\}_{1 \le i_1, i_2 \le n,\; 1 \le j_1, j_2 \le n'} \tag{3} $$
The average of all matrix coefficients of a sub matrix $M_{[i_1:i_2],[j_1:j_2]} \in \mathbb{R}^{(i_2 - i_1 + 1) \times (j_2 - j_1 + 1)}$ is denoted $\mathrm{Avg}\, M_{[i_1:i_2],[j_1:j_2]}$, hence:
$$ \mathrm{Avg}\, M_{[i_1:i_2],[j_1:j_2]} := \frac{1}{(i_2 - i_1 + 1) \cdot (j_2 - j_1 + 1)} \sum_{i=i_1}^{i_2} \sum_{j=j_1}^{j_2} m_{ij} \tag{4} $$
The complexity of DTW could basically be reduced by applying it to a lower-dimensional representation of the signal. However, this would imply a modification of the original signals. Since some features (such as high-frequency features in the frequency domain) are very sensitive to such modifications, we chose another way: instead of subsampling the signals $S$ and $S'$ and subdividing them into fixed frames of $m$ samples each (as for PDTW), we reduce the complexity of the DTW by scaling the frame width $m$ by a factor of α > 1 without modifying $S$ and $S'$. This of course
requires feature functions and similarity measures that differ
from the originals, which were defined for frames of width
m. For every scaled frame width a specific feature extraction
function and a corresponding measure is needed. In a sense,
these functions extract features at multiple time resolutions.
Let $F_i$ be a member of a family $F = \{F_1, \ldots, F_i, \ldots, F_k\}$ of feature functions extracting a feature vector of length $L(i)$ from a given frame $f$ consisting of $i$ samples. Let furthermore $D_j$ denote a member of a family $D = \{D_1, \ldots, D_j, \ldots, D_k\}$ of similarity measures comparing two feature vectors of length $j$ each.
Figure 3: Intuitive meaning of $M_{[i_1:i_2],[j_1:j_2]}$ and of "stability" of a family of measures with respect to a family of feature extraction functions. The two images show LDMs at different resolutions obtained with a stable measure. The values of the coefficients range from one (white) to zero (black). Note the similar overall structure of both matrices. The lower resolution LDM (left) looks like a low-pass filtered version of the right one, which has $\alpha^2$ times more coefficients. The sub matrices marked in both matrices have the same aspect ratios, relative size and position and differ only by a scale factor of α.
Given a set of real-valued signals $\mathcal{S}$ and two signals $S, S' \in \mathcal{S}$ subdivided into $n$ and $n'$ frames of $m$ samples each, the LDM can be constructed with respect to $F_m$ and $D_{L(m)}$. To increase the number of frames $n$ ($n'$) by a factor of α, the frame width has to be decreased by the same factor and the corresponding higher resolution LDM has to be computed with respect to $F_{m/\alpha}$ and $D_{L(m/\alpha)}$ ($\alpha, m \in \mathbb{N}$ and $m \bmod \alpha = 0$).
In this context we call a family of measures $D$ stable with respect to a family of feature functions $F$ for a set of signals $\mathcal{S}$, if the average of every sub matrix of the LDM scales linearly with respect to scaling of the frame width:
$$ \forall S, S' \in \mathcal{S}:\ \exists x, y \in \mathbb{R}:\ \forall\, \mathrm{LDM}^{m}_{[i_1:i_2],[j_1:j_2]} \in \mathrm{Submats}(\mathrm{LDM}^{m}): \quad \mathrm{Avg}\, \mathrm{LDM}^{m}_{[i_1:i_2],[j_1:j_2]} = x \cdot \mathrm{Avg}\, \mathrm{LDM}^{m/\alpha}_{[\alpha \cdot i_1:\alpha \cdot i_2],[\alpha \cdot j_1:\alpha \cdot j_2]} + y \tag{5} $$
Here $\mathrm{LDM}^{m} = \mathrm{LDM}(S, S', m, F_m, D_{L(m)})$ denotes the LDM of the two signals for a given frame width $m$ with respect to the corresponding feature function and measure. Intuitively, the definition of "stable" means that the structure of a local distance matrix does not change abruptly if the frame width is scaled by a factor α (the "LDM resolution" decreases) and the features are extracted from this bigger frame. The lower resolution LDM is a low-pass filtered (averaged) version of the higher resolution matrix (Fig.3). This assumption holds at least approximately for a wide range of similarity measures (features) such as Euclidean distance (shape of the signals), Short Time Fourier distance (Short Time Fourier transform of the signals), and per-frame-averaged derivatives and integrals (shape of the signals), because they transform linearly with respect to frame width scaling. We will not give a proof here, but the basic idea can be summed up as follows:
• Represent the signal in each frame as a superposition of α components, one for each of α disjoint and equally sized sub frames. Each component equals the original signal within its sub frame and is zero outside (in some cases a strict restriction of each component to only one sub frame may be more suitable).
• The linearity of the LDM computation with respect to such a superposition has to be shown.
Our tests indicated that if the stability assumption is satisfied, the geometrical shape of the optimal DTW path at a lower resolution LDM is very likely similar to the DTW path of a higher resolution LDM and only lacks higher resolution detail. But since our definition of stability is not sufficient for "geometrically stable" warping paths, there are (pathological) counter-examples (Fig.6). Note that the basic idea behind the definition of "the stability of a measure" can easily be extended to $\alpha \in \mathbb{R}^+$, non-zero overlaps and cases where $m \bmod \alpha = 0$ is not satisfied.
4.2. Iterative Multi-Scale Dynamic Time Warping (IMDTW)
Usually one is interested in stable DTW-based alignments of signals with respect to a set of given features. The quality of an alignment heavily depends on the choice of features and similarity measures used to construct the LDM. If the overall shape of the optimal DTW path is very sensitive to slight changes of the frame width in general, the measures or features may be inadequate. In this paper, we focus on features and similarity measures that do not suffer from such problems and which are stable according to 4.1. Then the shape of the optimal DTW path at a lower resolution is approximately a "downsampled" version of the path constructed at a higher resolution. Ideally the path could first be constructed at a very low resolution by applying the classical DTW algorithm and then be refined iteratively, by just accounting for aligned frames along the optimal DTW path of the lower resolution and performing the DTW only between those frames at a higher resolution. In a sense the low resolution alignment restricts the search area of the higher resolution DTW in the next iteration step. This involves a scaling of the previously calculated "path area" by a factor of $\alpha^2$ to match the new resolution. The process can be repeated until a desired resolution is reached. We call this scheme Iterative Multi-Scale Dynamic Time Warping (IMDTW) (Fig.4, Fig.7).
Figure 4: Iterative refinement of the optimal path during IMDTW. The left image shows the resolution of the LDM in the first iteration step, where all coefficients have to be calculated. The optimal path (red) and a tube-like neighborhood (orange) are scaled "to match the LDM resolution" for the next step (middle image). The geometrical shape of both the scaled previous path and its neighborhood defines the coefficients which have to be calculated in the next iteration step, and the new path has to lie within this yellow-marked area. The process continues until a final resolution is reached (right image). LDM coefficients that are not computed at all are marked grey.
To avoid artifacts and to reduce the error rate, an additional neighborhood close to the path can be taken into account as well. For simplicity we chose a "tube" of constant width (orthogonal to the path) of p frames around the optimal path. At lower resolutions the relative amount of matrix coefficients (compared to the total size of the LDM) which has to be calculated is higher than at higher resolutions, allowing more flexible refinements of the alignment. The higher the resolution gets, the more the path is restricted to a certain shape. Note that since the tube-like neighborhood is added at every iteration step, the search area of a higher resolution DTW is not necessarily restricted to the previous search area. This helps to prevent the alignment from getting stuck in a strong local minimum.
Thus, the IMDTW process can be adjusted by three additional parameters compared to classical DTW: a "tube width" p, the desired number of iterations k, and a scale factor α. For more details see Figure xx.
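The scheme can be sketched as follows. This is a simplified illustration, not the implementation evaluated in Section 5, and it makes several assumptions: zero overlap, signal lengths divisible by the largest frame width m · α^k, one full DTW pass at the lowest resolution followed by k refinement passes, and a tube that is approximated by a square neighborhood of p cells added after the path has been upscaled. The feature function (per-frame mean) and measure (absolute difference) are hypothetical placeholders for a stable feature-measure combination.

```python
def frames(signal, width):
    """Non-overlapping frames of `width` samples (overlap o = 0)."""
    return [signal[s:s + width] for s in range(0, len(signal) - width + 1, width)]

def constrained_dtw(fa, fb, dist, allowed):
    """DTW over feature sequences fa, fb restricted to the cells in `allowed`."""
    acc, back = {}, {}
    for cell in sorted(allowed):            # row-major: predecessors come first
        i, j = cell
        d = dist(fa[i], fb[j])              # LDM coefficient, computed on demand
        if cell == (0, 0):
            acc[cell] = d
            continue
        preds = [q for q in ((i - 1, j - 1), (i - 1, j), (i, j - 1)) if q in acc]
        if preds:                           # skip cells unreachable in the region
            best = min(preds, key=acc.get)
            acc[cell] = d + acc[best]
            back[cell] = best
    end = (len(fa) - 1, len(fb) - 1)
    path = [end]
    while path[-1] != (0, 0):
        path.append(back[path[-1]])
    return acc[end], path[::-1]

def imdtw(a, b, m, alpha, k, p, feature, dist):
    """Coarse-to-fine DTW: one full pass at width m*alpha**k, then k refinements."""
    coarsest = m * alpha ** k
    if len(a) % coarsest or len(b) % coarsest:
        raise ValueError("sketch assumes signal lengths divisible by m * alpha**k")
    path = None
    for level in range(k, -1, -1):
        width = m * alpha ** level
        fa = [feature(f) for f in frames(a, width)]
        fb = [feature(f) for f in frames(b, width)]
        if path is None:                    # lowest resolution: full LDM
            allowed = {(i, j) for i in range(len(fa)) for j in range(len(fb))}
        else:                               # upscale previous path, add the tube
            allowed = set()
            for ci, cj in path:
                for i in range(alpha * ci - p, alpha * ci + alpha + p):
                    for j in range(alpha * cj - p, alpha * cj + alpha + p):
                        if 0 <= i < len(fa) and 0 <= j < len(fb):
                            allowed.add((i, j))
        cost, path = constrained_dtw(fa, fb, dist, allowed)
    return cost, path

def mean(frame):
    return sum(frame) / len(frame)

def absdiff(u, v):
    return abs(u - v)

a = [float(t % 7) for t in range(64)]
b = [a[(t * t) // 63] for t in range(64)]   # hypothetical non-linearly warped copy

cost, path = imdtw(a, b, m=2, alpha=2, k=2, p=1, feature=mean, dist=absdiff)
full_cost, _ = constrained_dtw(
    [mean(f) for f in frames(a, 2)], [mean(f) for f in frames(b, 2)], absdiff,
    {(i, j) for i in range(32) for j in range(32)})
```

Because the refined search is restricted to a subset of the LDM, the IMDTW cost can never be lower than the cost of classical DTW at the final resolution (full_cost above); if p is chosen at least as large as the frame count, the restriction vanishes and both coincide.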
4.2.1. Time and Memory Complexity
Classical DTW algorithms have both time and memory complexity of $O(n \cdot n')$, with $n$ and $n'$ denoting the number of frames of the two signals.
The time complexity of the IMDTW algorithm depends
on two aspects:
• the way the lowest resolution is chosen and
• the costs of feature extraction (time complexity $T_f$, memory complexity $S_f$) and of calculating the similarity measure.
Assume w.l.o.g. that two signals $S$ and $S'$ of the same length of $l$ samples are given and let us suppose an alignment for $n = l/m$ frames of width $m$ each. If $T_f \notin O(l^2)$, IMDTW makes no sense because the total time complexity is then dominated by the feature extraction. For a feature extraction time complexity better than $O(l^2)$, there exist basically two strategies for applying the IMDTW algorithm:
• a fixed number of frames $n_f = l/(\beta m)$ (with $\beta = \alpha^c$, $c \in \mathbb{N}$, $l \bmod (\beta m) = 0$) for the first iteration step (lowest resolution) regardless of $l$,
• or a number of frames chosen proportional to $l$.
The time complexity of IMDTW ($T_{IMDTW}(l)$) does not change compared to classical DTW if the lowest LDM resolution is chosen proportional to the signal size, but IMDTW can still speed up the calculation by a constant factor. If a fixed $n_f$ (which does not depend on $l$) is used instead, the time complexity $T_{IMDTW}(l)$ changes. In the worst case, the optimal DTW path of the $i$-th iteration step has a length of $2n_i$ coefficients, where $n_i$ is the corresponding frame count (Fig.5 Left). The iteration process stops after $k$ iterations, when the frame number for the final alignment is reached. Since a "tube" neighborhood of fixed width $p$ is added at every iteration step, one has to calculate up to $2n_i \cdot \alpha \cdot p$ matrix coefficients per iteration. The full LDM is computed only once, at the lowest resolution of $n_f$ frames. Assuming a time complexity for the similarity measure being proportional to the number of samples of a frame, and with $m_i$ denoting the frame width for the $i$-th iteration step, one obtains:
$$ \begin{aligned} T_{IMDTW}(l) &= O\Big( n_f \cdot n_f + \sum_{i=1}^{c-1} m_i \cdot 2 n_i \cdot \alpha \cdot p \Big) + T_f(l) \\ &= O\Big( \sum_{i=1}^{(\log_\alpha l/(n_f m)) - 1} m_i \cdot \frac{2l}{m_i} \cdot \alpha\, p \Big) + T_f(l) \\ &= O(l \log l) + T_f(l) \end{aligned} \tag{6} $$
This estimation is based on an overlap of zero length.
The memory complexity of IMDTW ($S_{IMDTW}(l)$) depends on the width of the neighborhood $p$, the final resolution of the alignment and the complexity of both similarity measure and feature extraction. If the memory complexity of the measures depends only linearly on the frame width, and with $S_f(l)$ denoting the memory complexity for feature extraction, one obtains for $S_{IMDTW}(l)$:
$$ S_{IMDTW}(l) = O(2 n \cdot \alpha \cdot p) + S_f(l) = O(l) + S_f(l) \tag{7} $$
If the number of frames of the lowest resolution is chosen proportional to $l$, the memory complexity does not change, but nevertheless the memory requirements may drop drastically.
Note that both space and time complexities derived in this section remain the same for overlaps being different from zero and for two signals of different lengths.
4.2.2. Spectral Features
In spectral feature extraction, the Short Time Fourier Transform (STFT) is commonly used to analyze the frequency content of every frame. Hence, if only STFT is used for extracting features, the signal is windowed with respect to each frame and the resulting spectrum (or a part of it) becomes a feature vector. The family of feature functions $F$ introduced in the previous section then consists of STFTs for different frame widths. A typical family of similarity measures $D$ is given by the normalized dot products of two feature vectors ("cosine measure"), which has a complexity of O(n).
The STFT has a time complexity of $O(l \cdot \log l)$ and has to be executed $k = c - 1 = (\log_\alpha l/(n_f m)) - 1$ times during IMDTW for a fixed number $n_f$. Hence, the overall IMDTW time complexity becomes:
$$ T_{STFT\text{-}IMDTW}(l) = O(l \log l) + O(l \log l \cdot k) = O(l \log^2 l) \tag{8} $$
Since both STFT and cosine measure require O(l) memory, the overall memory complexity remains O(l) according to ....
4.3. IMDTW with PAA
In the last section we introduced an IMDTW approach which is suitable for a wide range of different features and measures. For restricted classes of alignments, such as shape alignment in the time domain or alignment with respect to the low-frequency content of signals, the IMDTW approach may be modified to improve the performance. The basic idea for reducing the complexity is to resample the signal to a lower resolution instead of scaling the frame width. In this case the IMDTW algorithm can be applied in a similar way to multi-resolution PAA (or other reduced-dimensionality) representations of signals, while keeping the frame width constant. In this context the "stability property" must be redefined with respect to different scales of a signal (instead of different scales of the window width).
5. Results
Although there exist several feature extraction methods that
are simpler and have a better time complexity, we ran systematic tests of the IMDTW algorithm only in conjunction
with STFT-features as described in the section before, since
STFT (or at least some variant) is commonly used for music
alignment. Even though STFT does not perfectly satisfy the
stability assumption (due to windowing artifacts), the quality of the IMDTW-alignment is still sufficient for many real
world purposes. We used the freely available FFTW3 library
for all STFT related computations, which performed very
well in case of large signal lengths.
Since our intention was not to find an optimal measure
for music alignment, but to analyze the effectiveness of
IMDTW, our tests were focusing on a class of alignments,
where the alignment with respect to STFT is almost stable
and not very sensitive to scaling of the frame width.
To judge the quality of an IMDTW alignment compared to a DTW alignment, we first computed the original DTW alignment path and measured both the relative geometrical distance of the IMDTW alignment path to the original and the relative difference in the costs of the two paths (Fig. xx). Furthermore, the total computation time, the total time for computing the STFT, and the ratio between the number of all matrix coefficients calculated during the IMDTW process, weighted by their corresponding window lengths, and the number of LDM coefficients of the final resolution were determined for each setup (Fig.5 Right). This ratio can be seen as an overall benchmark for the computational complexity excluding feature extraction.
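The relative geometrical error of two paths is described in the caption of Fig.5 as the ratio of the area enclosed by the paths to the total area of the matrix. One possible discretization is sketched below; it is an assumption made for illustration and not necessarily the one used for the measurements. It treats each path as a staircase function (first column visited in every row), sums the absolute column differences over all rows, and divides by the matrix size. It assumes that both paths visit every row, which holds for DTW paths built from unit steps.

```python
def relative_geometric_error(path_a, path_b, n_rows, n_cols):
    """Approximate enclosed area between two warping paths / total matrix area."""
    def first_column_per_row(path):
        cols = {}
        for i, j in path:
            cols[i] = min(j, cols.get(i, j))
        return cols

    cols_a = first_column_per_row(path_a)
    cols_b = first_column_per_row(path_b)
    area = sum(abs(cols_a[i] - cols_b[i]) for i in range(n_rows))
    return area / (n_rows * n_cols)

diagonal = [(0, 0), (1, 1), (2, 2), (3, 3)]
detour = [(0, 0), (0, 1), (0, 2), (1, 3), (2, 3), (3, 3)]
error = relative_geometric_error(diagonal, detour, n_rows=4, n_cols=4)
# rows differ by 0, 2, 1 and 0 columns, so error == 3 / 16
```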
The first test set consisted of ten different pop songs. For each of those songs we prepared two locally tempo-distorted (pitch-preserving) versions and aligned them to the original (Fig.7). We analyzed the alignment for different combinations of scale factors α, iteration counts k, neighborhood widths p and for different frame counts. The optimal alignment was very stable with respect to frame width scaling due to a very sharp and low-cost minimum for the optimal alignment. Hence, the geometrical error of the IMDTW alignment compared to the DTW alignment was very small in the case of small neighborhoods. For the same reason the cost error was sometimes relatively large, even for small differences in the optimal path geometry. For sharp DTW alignments, IMDTW tends to get very close to the final shape of the alignment path with very little computational effort, whereas the minimization of the relative cost error needs a bigger p and more time. An error-free and significantly faster alignment (by roughly one order of magnitude for about 10000 frames) was always possible with a fixed setting for all songs, even for off-diagonal DTW paths, where classic acceleration techniques like Sakoe-Chiba Bands or Itakura Parallelograms fail. Some further results are presented in Figure 8.
Figure 5: Left: Worst case alignment with respect to the length of an optimal path (red). Right: The relative geometrical error of two DTW paths (red and blue curves) is given by the ratio of the area enclosed by the paths (green) and the total area of the matrix.
Figure 6: A counter-example: The "transformation" of the LDM is stable with respect to def. 5, but the shapes of the optimal paths (yellow dots) differ for two resolutions of the LDM. The right image shows an LDM with twice the resolution compared to the left.
In a second test, we calculated the IMDTW-alignment for
every pair of those ten different pop songs. We found that
typically many warping paths had costs very close (or equal)
to the costs of the optimal alignment, indicating that simple
STFT is not adequate for aligning different songs. As a result, the geometric shape of the alignment path was not very
stable with respect to frame scaling. Nevertheless, both geometrical and cost errors dropped to zero if the tube width was
enlarged and a significantly faster alignment was possible
even in this unfavorable case. For more detailed information
see Figure 9.
All tests indicated that higher iteration counts do not automatically result in better performance. Due to the computational overhead (especially of complex feature extraction methods), the starting resolution has to be chosen carefully. The larger the width p of the neighborhood around the optimal path was chosen, the less the results of IMDTW and classical DTW differed. Even for p = 9 both relative errors were below five percent. Interestingly, we found no strict general relationship between final matrix size, relative errors and p. It seems that the width p needed to achieve a certain error level depends more on the quality of an alignment than on the actual number of frames. We would have liked to run tests with more frames, but this would have exceeded our memory capacities in the case of classical DTW. However, due to the quadratic time complexity of the classical DTW algorithm, it can be expected that the speed gain of IMDTW will increase with the frame count.
6. Conclusion and Future Work
In this paper we described a technique for effectively exploiting DTW for large signals for a wide range of different features and similarity measures. Our approach combines the ideas of restricting the path candidates (as with Sakoe-Chiba bands) and iterative multi-resolution DTW. The method is especially suitable for alignments approximately satisfying the stability criterion given in section x.x. Furthermore it can be easily integrated into existing DTW frameworks and needs only three simple additional parameters (compared to classical DTW).
Further research will show how IMDTW performs in case
of more complicated features and measures and how the idea
of multi-resolution feature analysis can be applied to other
techniques.
References
[CKHP02] CHU S., KEOGH E., HART D., PAZZANI M.: Iterative deepening dynamic time warping for time series. In SIAM International Conference on Data Mining (2002).
[Ita75] ITAKURA F.: Line spectrum representation of linear predictive coefficients of speech signals. In J. Acoust. Soc. Amer. (1975), vol. 57, p. S35.
[KG03] KOVAR L., GLEICHER M.: Flexible automatic motion blending with registration curves, 2003.
[KP99] KEOGH E., PAZZANI M.: Scaling up dynamic time warping to massive datasets. In 3rd European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'99) (Prague, Czech Republic, 1999), Zytkow J. M., Rauch J., (Eds.), vol. 1704, Springer, pp. 1–11.
[SC71] SAKOE H., CHIBA S.: A Dynamic Programming Approach to Continuous Speech Recognition. In Proc. Intl. Congress on Acoustics (Budapest, Hungary, 1971).
[SRS03] SOULEZ F., RODET X., SCHWARZ D.: Improving polyphonic and poly-instrumental music to score alignment. In 4th International Symposium on Music Information Retrieval ISMIR-03 (2003).
[TE03] TURETSKY R. J., ELLIS D. P. W.: Ground-truth transcriptions of real music from force-aligned midi syntheses. In 4th International Symposium on Music Information Retrieval ISMIR-03 (2003).
Figure 7: Left: Example (first test): IMDTW alignment (LDM) of two music signals, where the first signal is a piecewise tempo
distorted version of the second. Note that black regions are not computed at all.
Figure 8: In a first test we ran several IMDTW alignments, where the first music signal was a piecewise tempo distorted version
of the second. We averaged the results of 25 individual alignments. For feature extraction STFT was used (m = 4096 for the
final resolution; relative overlap was 0.8, thus $o = 0.8 \cdot m_i$). The final LDMs consisted of about 10000 by 10000
coefficients obtained from normalized dot product ("cosine measure") of two feature vectors. The leftmost column shows results
for a scale factor α = 2, the middle column for α = 4 and the right column for α = 8. The first row presents both relative cost
and geometrical errors (yellow and blue). The middle row shows plots of the relative computation times (compared to classical
DTW, which was 297s). The yellow bars represent the time for feature extraction (STFT), the blue bars the total time for both
STFT plus time warping. The last row shows the relative number of all LDM coefficients computed during IMDTW compared to
the coefficient number of the final LDM in the highest resolution and the ratio of the number of all matrix coefficients calculated
during the IMDTW-process weighted by their corresponding window lengths and the matrix coefficient number of the final
resolution (yellow and blue). Note that these two ratios can be seen as a benchmark for the complexity for "IMDTW with signal
scaling (PAA)" and "IMDTW with frame width scaling" respectively, excluding feature extraction costs. All results are plotted
with respect to different iteration counts k and path widths (3-257). The iteration count increases from the left to the right.
Figure 9: In a second test we ran 25 alignments of different pop songs (Queen...) and averaged the results. The figures have
the same meaning as described in Fig.8.