COMPUTER GRAPHICS TECHNICAL REPORTS CG-2006/1

Iterative Multi Scale Dynamic Time Warping

Arno Zinke, Universität Bonn, [email protected]
Dessislava Mayer
Institut für Informatik II, Universität Bonn, D-53117 Bonn, Germany

© Universität Bonn 2006, ISSN 1610-8892

Abstract

Dynamic Time Warping (DTW) is a frequently used technique for the optimal alignment of sequences with respect to given constraints. The main disadvantages of DTW are its time and memory complexity. We present a novel iterative scheme which can significantly improve DTW performance with respect to computation time and memory requirements in the case of very large sequences. In contrast to previous iterative approaches, which were designed for clustering time series with respect to shape, our method is suitable for precise alignments for a wide range of different features and similarity measures.

Keywords: Dynamic time warping, Multi Scale DTW

1. Introduction

Dynamic time warping (DTW) is a frequently used technique for the optimal alignment of sequences with respect to given constraints. It has been successfully applied in many areas such as data mining, speech processing and, most recently, motion morphing and music alignment [SRS03, TE03, KP99, KG03]. The basic idea of DTW is to compare two sequences (which can be seen as signals) with respect to certain features (for instance shape or local frequency content) and to find an optimal alignment of both sequences by stretching them with respect to time (Fig. 1, right). For the comparison a distance or similarity measure is needed. One major drawback of conventional dynamic time warping is that it is much too slow and memory consuming for aligning large sequences, due to its quadratic time and memory complexity.
We present a novel approach called Iterative Multi-Scale Dynamic Time Warping (IMDTW), which significantly speeds up the classic algorithm and decreases the memory requirements for DTW alignments of large signals. It combines the two ideas of

• restricting the path candidates and
• iteratively exploiting the DTW at different resolutions (Fig. 4).

In contrast to many previous approaches, IMDTW can be applied to a wide range of different features and measures and is not restricted to one special case such as the shape alignment of signals. To check whether a "feature-measure combination" may be suitable for IMDTW we furthermore introduce a "stability" criterion. Although this criterion is not sufficient, our tests indicate that if it is satisfied, IMDTW gives virtually the same alignments as classical DTW. Moreover, our approach can be applied in the case of less constrained alignments, where other methods typically fail (see Fig. 2, right). Since music alignment is a very important field of application for classical DTW, it will serve as an example for IMDTW.

2. Related Work

Several techniques have been proposed to improve the complexity of DTW, at least for the average case. Global path constraints for restricting the warping path to a certain area were introduced by [SC71, Ita75]. All these constraints basically limit the distance from the DTW path to the main diagonal. Such constraints may slightly speed up the DTW computation and can help to prevent "pathological warpings". On the other hand, they heavily restrict the number of candidates for the optimal path, which may cause undesired alignment results (Fig. 2, right). Interesting approaches for speeding up the DTW algorithm were developed in the realm of data mining. Especially for classification and clustering of large collections of time series, the classic DTW approach becomes almost impracticable with respect to computation time.
The Euclidean distance measure, or some extension or modification thereof, is widely used for similarity calculations in conjunction with DTW to align signals with respect to similar shapes. Such alignments typically do not change drastically under a piecewise approximation of the data. One can take advantage of this fact and apply DTW at a lower time resolution without losing much information about the similarity. A piecewise linear representation of a series was proposed by Keogh et al. (SDTW, [KP99]). In a subsequent paper [CKHP02], Chu et al. introduced another reduced dimensionality representation called Piecewise Aggregate Approximation (PAA), which can basically be seen as a down-sampled version of the original data. Applying DTW to such modified data is called Piecewise Dynamic Time Warping (PDTW). Chu et al. presented an iterative scheme for performing DTW named Iterative Deepening Dynamic Time Warping (IDDTW) for classifying and clustering time series of large databases with respect to a given query. The basic idea is to calculate a PAA representation of the query and of each data set for different time scales and to apply PDTW to these modified series, starting at a very low resolution. At every iteration step a probabilistic algorithm together with a user-specified tolerance parameter decides whether to apply PDTW at a higher PAA resolution or to keep the current PDTW approximation. SDTW, PDTW and IDDTW were not designed for precise alignments, but can drastically reduce the computation time for classification and clustering of massive data sets, although the worst case time and memory complexity remains O(n · n′), with n and n′ being the numbers of frames of the two series.
Especially in the realm of music alignment, DTW is commonly used in conjunction with non-Euclidean measures in the frequency domain, since most qualitative information about music comes from spectral analysis in the frequency domain and various other nontrivial features (for example rhythm), but not from the actual envelope of the signal in the time domain. Music recordings are typically stored in a Pulse Code Modulation (PCM) format, a time-dependent encoding of air pressure fluctuations (volume) due to instruments or voice. Such encodings can be seen as discrete time-dependent signals. Thus, DTW can be used to align music signals with respect to similarity measures in the spectral domain. Unfortunately, methods like SDTW, PDTW and IDDTW change the frequency content of a signal. The piecewise approximation is adequate with respect to Euclidean distance measures for time domain shape alignments, but may be unsuitable, for instance, in the case of frequency domain features, since higher frequency detail is wiped out. Furthermore, a rough low-resolution DTW may be sufficient for classification purposes (which was the intention of SDTW, PDTW and IDDTW), but unacceptable for precise alignments.

3. Dynamic Time Warping for Signal Alignment (DTW)

Since classical Dynamic Time Warping is explained in many previous publications, we only give a brief and informal description of the basic work flow in the context of signal alignment. Given two signals S of length l and S′ of length l′, the alignment is typically a three stage process: In the first step the two signals are divided into sequences $\{f_1, \dots, f_n\}$ of n and $\{f'_1, \dots, f'_{n'}\}$ of n′ frames, respectively, where each frame consists of m samples. These frames may overlap by a constant number of o samples (Fig. 2, left).
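The framing step can be illustrated with a minimal Python sketch. The function name `split_into_frames` is hypothetical, and the sketch assumes that an incomplete frame at the end of the signal is simply dropped, which the text does not specify. The frame count it uses is n = (l − o)/(m − o), rounded down.

```python
def split_into_frames(signal, m, o=0):
    """Split `signal` into frames of m samples that overlap by o samples.

    Assumption of this sketch: a ragged tail shorter than one frame is dropped.
    """
    if not 0 <= o < m:
        raise ValueError("overlap o must satisfy 0 <= o < m")
    hop = m - o                          # distance between frame starts
    n = (len(signal) - o) // hop         # frame count, rounded down
    return [signal[i * hop:i * hop + m] for i in range(n)]


frames = split_into_frames(list(range(10)), m=4, o=2)
```

With l = 10, m = 4 and o = 2 this yields four frames, each sharing two samples with its predecessor.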
The relationship between signal length l, overlap o, frame width m and frame number n is given by:

$$n = \frac{l - o}{m - o} \tag{1}$$

In a second step, one feature vector $v_i$ per frame i is extracted. In the case of music alignment $v_i$ often contains spectral information of a frame, which is commonly obtained through short-time Fourier analysis. Finally, a DTW algorithm calculates the optimal alignment for both feature vector sequences (Fig. 1) with respect to a given similarity measure. To this end the Local Distance Matrix (LDM) is computed. It stores the pairwise similarity (distance) of every feature vector of the first signal to every feature vector of the second signal. Based on the LDM an optimal mapping between S and S′ has to be found, which aligns every feature vector (and thereby frame) of S (S′) to at least one feature vector (frame) of S′ (S). A mapping is basically defined by a warping path WP, which can be seen as a contiguous set of LDM elements: $WP := \{w_1, \dots, w_m\}$. The optimal warping path (DTW path) is then given by the path with minimal cumulative cost $\sum_{k=1}^{m} w_k$ (see Fig. 1). Several "global and local constraints", like pairwise aligning both the first frames of the signals as well as the last frames, may further restrict the set of all possible alignments.

Figure 1: Left: An optimal DTW path (red) as a result of comparing each feature vector of signal one to each feature vector of signal two with respect to a given measure. Right: A final alignment for two signals based on an optimal DTW path.

Figure 2: Left: A signal (black curve) of l samples is subdivided into frames of m samples each, which overlap by o samples. Right: Both the Sakoe-Chiba band (left) and the Itakura parallelogram (right) fail if the warping path (black) is not contained in the band or parallelogram.

4. Iterative Multi-Scale Dynamic Time Warping and Multi-Scale Feature Extraction

From now on we assume (for simplicity) an overlap o of zero length. Hence, the number of frames n (which can be seen as the "resolution" of the alignment) depends only on the frame width m. To check whether a "feature-measure combination" may be suitable for IMDTW we now introduce a "stability" criterion. This criterion ensures that the basic structure of an LDM does not change abruptly if the frame number decreases. Although this criterion is not sufficient, we found that if it is satisfied, IMDTW will very likely give the same alignments as classical DTW.

4.1. Stability

Let $M_{[i_1:i_2],[j_1:j_2]}$ denote the set of all matrix coefficients $m_{ij}$ within a rectangular area of a matrix M (a sub-matrix), characterized by indexes for rows and columns ranging from $i_1$ to $i_2$ and from $j_1$ to $j_2$, respectively (Fig. 3):

$$M_{[i_1:i_2],[j_1:j_2]} := \left(m_{ij}\right)_{i_1 \le i \le i_2,\; j_1 \le j \le j_2} \tag{2}$$

Let furthermore Submats(M) be the set of all sub-matrices of a matrix M with $M \in \mathrm{Mat}(n \times n')$:

$$\mathrm{Submats}(M) := \left\{ M_{[i_1:i_2],[j_1:j_2]} \right\}_{1 \le i_1, i_2 \le n,\; 1 \le j_1, j_2 \le n'} \tag{3}$$

The average of all matrix coefficients of a sub-matrix $M_{[i_1:i_2],[j_1:j_2]} \in \mathbb{R}^{(i_2 - i_1 + 1) \times (j_2 - j_1 + 1)}$ is denoted $\mathrm{Avg}\, M_{[i_1:i_2],[j_1:j_2]}$, hence:

$$\mathrm{Avg}\, M_{[i_1:i_2],[j_1:j_2]} := \frac{1}{(i_2 - i_1 + 1) \cdot (j_2 - j_1 + 1)} \sum_{i=i_1}^{i_2} \sum_{j=j_1}^{j_2} m_{ij} \tag{4}$$

The complexity of DTW could basically be reduced by applying it to a lower dimensional representation of the signal. However, this would imply a modification of the original signals. Since there are features (like high frequency features in the frequency domain) that are very sensitive to such modifications, we chose another way: Instead of subsampling the signals S and S′ and subdividing them into fixed frames of m samples each (as for PDTW), we reduce the complexity of the DTW by scaling the frame width m by a factor of α > 1 without modifying S and S′. This of course requires feature functions and similarity measures that differ from the originals, which were defined for frames of width m.
For every scaled frame width a specific feature extraction function and a corresponding measure are needed. In a sense, these functions extract features at multiple time resolutions. Let $F_i$ be a member of a family $F = \{F_1, \dots, F_i, \dots, F_k\}$ of feature functions extracting a feature vector of length L(i) from a given frame f consisting of i samples. Let furthermore $D_j$ denote a member of a family $D = \{D_1, \dots, D_j, \dots, D_k\}$ of similarity measures comparing two feature vectors of length j each. Given a set of real-valued signals $\mathcal{S}$ and two signals $S, S' \in \mathcal{S}$ subdivided into n and n′ frames of m samples each, the LDM can be constructed with respect to $F_m$ and $D_{L(m)}$. To increase the number of frames n (n′) by a factor of α, the frame width has to be decreased by the same factor and the corresponding higher resolution LDM has to be computed with respect to $F_{m/\alpha}$ and $D_{L(m/\alpha)}$ (α, m ∈ ℕ and m mod α = 0).

Figure 3: Intuitive meaning of $M_{[i_1:i_2],[j_1:j_2]}$ and of "stability" of a family of measures with respect to a family of feature extraction functions. The two images show LDM matrices at different resolutions obtained with a stable measure. The values of the coefficients range from one (white) to zero (black). Note the similar overall structure of both matrices. The lower resolution LDM (left) looks like a low-pass filtered version of the right one, which has α² times more coefficients. The sub-matrices marked in both matrices have the same aspect ratios, relative size and position and differ only by a scale factor of α.
In this context we call a family of measures D stable with respect to a family of feature functions F for a set of signals $\mathcal{S}$ if the average of all sub-matrices of the LDM scales linearly with respect to scaling of the frame width:

$$\forall S, S' \in \mathcal{S} : \exists x, y \in \mathbb{R} : \forall\, \mathrm{LDM}^{m}_{[i_1:i_2],[j_1:j_2]} \in \mathrm{Submats}(\mathrm{LDM}^{m}) : \quad \mathrm{Avg}\, \mathrm{LDM}^{m}_{[i_1:i_2],[j_1:j_2]} = x \cdot \mathrm{Avg}\, \mathrm{LDM}^{m/\alpha}_{[\alpha \cdot i_1 : \alpha \cdot i_2],[\alpha \cdot j_1 : \alpha \cdot j_2]} + y \tag{5}$$

Here $\mathrm{LDM}^{m} = \mathrm{LDM}(S, S', m, F_m, D_{L(m)})$ denotes the LDM of the two signals for a given frame width m with respect to the corresponding feature function and measure. Intuitively, "stable" means that the structure of a local distance matrix does not change abruptly if the frame width is scaled by a factor α (the "LDM resolution" decreases) and the features are extracted from this bigger frame. The lower resolution LDM is a low-pass filtered (averaged) version of the higher resolution matrix (Fig. 3). This assumption holds at least approximately for a wide range of similarity measures (features) like Euclidean distance (shape of the signals), Short Time Fourier distance (Short Time Fourier transform of the signals), and per-frame averaged derivatives and integrals (shape of the signals), because they transform linearly with respect to frame width scaling. We will not give a proof here, but the basic idea can be summed up as follows:

• Represent the signal in each frame by a superposition of its decomposition into α components belonging to α disjoint and equally sized sub-frames. Thus one obtains α components, each equal to the original signal within a certain sub-frame and zero outside (for some cases a strict restriction of each component to only one sub-frame may be more suitable).
• The linearity of the LDM computation with respect to such a superposition has to be shown.
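One way to probe the stability criterion (5) numerically is sketched below in Python. The helper names `block_average` and `fit_stability` are hypothetical. The sketch assumes that the low-resolution coefficient (i, j) corresponds to the α × α block of the high-resolution LDM starting at (α·i, α·j), and that both LDMs are given as lists of rows. Since averages over larger sub-matrices are means of such block averages, fitting x and y on single coefficients and inspecting the residual is a reasonable first check, not a proof of stability.

```python
def block_average(ldm, alpha):
    """Average non-overlapping alpha x alpha blocks of a matrix (list of rows)."""
    rows, cols = len(ldm) // alpha, len(ldm[0]) // alpha
    return [[sum(ldm[i * alpha + a][j * alpha + b]
                 for a in range(alpha) for b in range(alpha)) / alpha ** 2
             for j in range(cols)]
            for i in range(rows)]


def fit_stability(ldm_low, ldm_high, alpha):
    """Least-squares fit  ldm_low ~ x * block_average(ldm_high) + y.

    Returns (x, y, max_abs_residual); a small residual suggests that the
    measure behaves stably for this pair of signals.
    """
    u = [v for row in block_average(ldm_high, alpha) for v in row]
    t = [v for row in ldm_low for v in row]
    if len(u) != len(t):
        raise ValueError("LDM sizes are not related by the factor alpha")
    mean_u, mean_t = sum(u) / len(u), sum(t) / len(t)
    var_u = sum((a - mean_u) ** 2 for a in u)
    cov = sum((a - mean_u) * (b - mean_t) for a, b in zip(u, t))
    x = cov / var_u if var_u else 0.0
    y = mean_t - x * mean_u
    residual = max(abs(b - (x * a + y)) for a, b in zip(u, t))
    return x, y, residual
```

In practice the two LDMs would come from the same pair of signals at frame widths m and m/α; the tolerance on the residual is left to the user.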
Our tests indicate that if the stability assumption is satisfied, the geometrical shape of the optimal DTW path in a lower resolution LDM is very likely similar to the DTW path of a higher resolution LDM and lacks only the higher resolution detail. But since our definition of stability is not sufficient for "geometrically stable" warping paths, there are (pathological) counter-examples (Fig. 6). Note that the basic idea behind the definition of "the stability of a measure" can easily be extended to α ∈ ℝ⁺, non-zero overlaps and cases where m mod α = 0 is not satisfied.

4.2. Iterative Multi-Scale Dynamic Time Warping (IMDTW)

Usually one is interested in stable DTW-based alignments of signals with respect to a set of given features. The quality of an alignment heavily depends on the choice of features and similarity measures used to construct the LDM. If the overall shape of the optimal DTW path is very sensitive to slight changes of the frame width in general, the measures or features may be inadequate.

Figure 4: Iterative refinement of the optimal path during IMDTW. The left image shows the resolution of the LDM in the first iteration step, where all coefficients have to be calculated. The optimal path (red) and a tube-like neighborhood (orange) are scaled "to match the LDM resolution" for the next step (middle image). The geometrical shape of both the scaled previous path and the neighborhood defines the coefficients which have to be calculated in the next iteration step, and the new path has to lie within this yellow-marked area. The process continues until a final resolution is reached (right image). If an LDM coefficient is not computed at all it is marked grey.

In this paper, we focus on features/similarity measures that do not suffer from such problems and which are stable according to Section 4.1. Then the shape of the optimal DTW path at a lower resolution is approximately a "downsampled" version of the path constructed at a higher resolution.
Ideally the path could first be constructed at a very low resolution by applying the classical DTW algorithm and then be refined iteratively, by accounting only for aligned frames along the optimal DTW path of the lower resolution and performing the DTW only between those frames at a higher resolution. In a sense the low resolution alignment restricts the search area for the higher resolution DTW of the next iteration step. This involves a scaling of the previously calculated "path area" by a factor of α² to match the new resolution. The process can be repeated until a desired resolution is reached. We call this scheme Iterative Multi-Scale Dynamic Time Warping (IMDTW) (Fig. 4, Fig. 7).

To avoid artifacts and to reduce the error rate, an additional neighborhood close to the path can be taken into account, too. For simplicity we chose a "tube" of constant width (orthogonal to the path) of p frames around the optimal path. At lower resolutions the relative amount of matrix coefficients (compared to the total size of the LDM) which has to be calculated is lower than at higher resolutions, allowing more flexible refinements of the alignment. The higher the resolution gets, the more the path is restricted to a certain shape. Note that since the tube-like neighborhood is added at every iteration step, the search area of a higher resolution DTW need not be restricted to the previous search area. This helps to prevent the alignment from getting stuck in a strong local minimum. Thus, the IMDTW process can be adjusted by three additional parameters compared to classical DTW: a "tube width" p, the desired number of iterations k, and a scale factor α. For more details see Figure xx.

4.2.1. Time and Memory Complexity

Classical DTW algorithms have both time and memory complexity of O(n · n′), with n and n′ denoting the number of frames of the two signals.
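The refinement scheme of Section 4.2 can be sketched in Python as follows; restricting each pass to the scaled path plus its neighborhood is what avoids the O(n · n′) cost of the classical algorithm at the finer levels. This is a minimal illustration under stated assumptions, not the authors' implementation: `cost_at(level, i, j)` and `sizes` are hypothetical interfaces standing in for the multi-scale feature extraction and measure, each level is assumed to be exactly α times finer than the previous one, and the tube is approximated by a square neighborhood of p cells rather than a band orthogonal to the path.

```python
import math


def dtw_in_region(cost, n, n2, allowed=None):
    """DTW from cell (0, 0) to (n-1, n2-1) using only cells in `allowed`.

    `allowed=None` means the full matrix (classical DTW). Local costs are
    evaluated lazily, so cells outside the region are never computed.
    """
    cells = sorted(allowed) if allowed is not None else [
        (i, j) for i in range(n) for j in range(n2)]
    acc = {}                                   # cumulative cost per reached cell
    for i, j in cells:                         # row-major order: predecessors first
        if (i, j) == (0, 0):
            best = 0.0
        else:
            best = min(acc.get((i - 1, j - 1), math.inf),
                       acc.get((i - 1, j), math.inf),
                       acc.get((i, j - 1), math.inf))
        if best < math.inf:
            acc[(i, j)] = cost(i, j) + best
    path = [(n - 1, n2 - 1)]                   # backtrack along cheapest predecessors
    while path[-1] != (0, 0):
        i, j = path[-1]
        path.append(min(((i - 1, j - 1), (i - 1, j), (i, j - 1)),
                        key=lambda c: acc.get(c, math.inf)))
    path.reverse()
    return path, acc[(n - 1, n2 - 1)]


def imdtw(cost_at, sizes, alpha, p):
    """Iterative multi-scale DTW sketch.

    cost_at(level, i, j): local distance at resolution `level` (hypothetical).
    sizes[level] = (n, n2): frame counts, growing by `alpha` per level.
    p: half-width (in cells) of the square neighborhood added around the path.
    """
    n, n2 = sizes[0]
    path, total = dtw_in_region(lambda i, j: cost_at(0, i, j), n, n2)
    for level in range(1, len(sizes)):
        n, n2 = sizes[level]
        allowed = set()
        for i, j in path:                      # scale each path cell and dilate by p
            for a in range(i * alpha - p, (i + 1) * alpha + p):
                for b in range(j * alpha - p, (j + 1) * alpha + p):
                    if 0 <= a < n and 0 <= b < n2:
                        allowed.add((a, b))
        path, total = dtw_in_region(
            lambda i, j, lv=level: cost_at(lv, i, j), n, n2, allowed)
    return path, total
```

The number of iterations k corresponds to `len(sizes) - 1`. Because the cumulative costs are kept in a dictionary keyed by the allowed cells only, memory grows with the size of the tube rather than with the full matrix.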
The time complexity of the IMDTW algorithm depends on two aspects:

• the way the lowest resolution is chosen and
• the costs of feature extraction (time complexity $T_f$, memory complexity $S_f$) and of calculating the similarity measure.

Assume w.l.o.g. that two signals S and S′ of the same length of l samples are given, and suppose an alignment for n = l/m frames of width m each is sought. If $T_f \notin O(l^2)$, IMDTW makes no sense because the total time complexity is then dominated by the feature extraction. For a feature extraction time complexity better than $O(l^2)$, there exist basically two strategies for applying the IMDTW algorithm:

• a fixed number of frames $n_f = l/(\beta m)$ (with $\beta = \alpha^c$, c ∈ ℕ, l mod (βm) = 0) for the first iteration step (lowest resolution) regardless of l,
• or a number of frames chosen proportional to l.

The time complexity of IMDTW ($T_{IMDTW}(l)$) does not change compared to classical DTW if the lowest LDM resolution is chosen proportional to the signal size, but IMDTW can still speed up the calculation by a constant factor. If a fixed $n_f$ (which does not depend on l) is used instead, the time complexity $T_{IMDTW}(l)$ changes. In the worst case, the optimal DTW path of the i-th iteration step has a length of $2 n_i$ coefficients, where $n_i$ is the corresponding frame count (Fig. 5, left). The iteration process stops after k iterations, when the frame number for the final alignment is reached. Since a "tube" neighborhood of fixed width p is added at every iteration step, one has to calculate up to $2 n_i \cdot \alpha \cdot p$ matrix coefficients per iteration. The total LDM is computed only once, at the lowest resolution of $n_f$ frames. Assuming a time complexity for the similarity measure proportional to the number of samples of a frame, and with $m_i$ denoting the frame width for the i-th iteration step, one obtains:

$$\begin{aligned} T_{IMDTW}(l) &= O\!\left(n_f \cdot n_f + \sum_{i=1}^{c-1} m_i \cdot 2 n_i \cdot \alpha \cdot p\right) + T_f(l) \\ &= O\!\left(\sum_{i=1}^{\log_\alpha (l/(n_f m)) - 1} m_i \cdot \frac{2l}{m_i} \cdot \alpha \cdot p\right) + T_f(l) \\ &= O(l \log l) + T_f(l) \end{aligned} \tag{6}$$

This estimation is based on an overlap of zero length. The memory complexity of IMDTW ($S_{IMDTW}(l)$) depends on the width of the neighborhood p, the final resolution of the alignment and the complexity of both similarity measure and feature extraction. If the memory complexity of the measures depends only linearly on the frame width, and with $S_f(l)$ denoting the memory complexity for feature extraction, one obtains for $S_{IMDTW}(l)$:

$$\begin{aligned} S_{IMDTW}(l) &= O(2 n \cdot \alpha \cdot p) + S_f(l) \\ &= O(l) + S_f(l) \end{aligned} \tag{7}$$

If the number of frames of the lowest resolution is chosen proportional to l, the memory complexity does not change, but the memory requirements may nevertheless drop drastically. Note that both space and time complexities derived in this section remain the same for non-zero overlaps and for two signals of different lengths.

4.2.2. Spectral Features

In spectral feature extraction, the Short Time Fourier Transform (STFT) is commonly used to analyze the frequency content of every frame. Hence, if only the STFT is used for extracting features, the signal is windowed with respect to each frame and the resulting spectrum (or a part of it) becomes a feature vector. The family of feature functions F introduced in the previous section then consists of STFTs for different frame widths. A typical family of similarity measures D is given by the normalized dot products of two feature vectors ("cosine measure"), which has a complexity of O(n). The STFT has a time complexity of O(l · log l) and has to be executed $k = c - 1 = \log_\alpha (l/(n_f m)) - 1$ times during IMDTW for a fixed number $n_f$. Hence, the overall IMDTW time complexity becomes:

$$\begin{aligned} T_{STFT\text{-}IMDTW}(l) &= O(l \log l) + O(l \log l \cdot k) \\ &= O(l \log^2 l) \end{aligned} \tag{8}$$

Since both STFT and cosine measure require O(l) memory, the overall memory complexity remains O(l) according to ....

4.3. IMDTW with PAA

In the last section we introduced an IMDTW approach which is suitable for a wide range of different features and measures. For restricted classes of alignments, like shape alignment in the time domain or alignment with respect to the low frequency content of signals, the IMDTW approach may be modified to improve the performance. The basic idea for reducing the complexity is to resample the signal to a lower resolution instead of scaling the frame width. In this case the IMDTW algorithm can be applied in a similar way to multi-resolution PAA (or other reduced dimensionality) representations of signals, while keeping the frame width constant. In this context the "stability property" must be redefined with respect to different scales of a signal (instead of different scales of the window width).

5. Results

Although there exist several feature extraction methods that are simpler and have a better time complexity, we ran systematic tests of the IMDTW algorithm only in conjunction with STFT features as described in Section 4.2.2, since STFT (or at least some variant) is commonly used for music alignment. Even though STFT does not perfectly satisfy the stability assumption (due to windowing artifacts), the quality of the IMDTW alignment is still sufficient for many real world purposes. We used the freely available FFTW3 library for all STFT related computations, which performed very well for large signal lengths. Since our intention was not to find an optimal measure for music alignment, but to analyze the effectiveness of IMDTW, our tests focused on a class of alignments where the alignment with respect to STFT is almost stable and not very sensitive to scaling of the frame width.
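The kind of feature-measure combination used in these tests, windowed spectra compared by the cosine measure, can be sketched in Python as follows. The function names are hypothetical. The Hann window and the conversion of the similarity into a distance via 1 − similarity are assumptions of this sketch, and the naive O(m²) DFT is used only to keep the example self-contained; it stands in for an FFT library such as FFTW3.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the Hann-windowed DFT of one frame (bins 0 .. m/2)."""
    m = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * t / m))
                for t, x in enumerate(frame)]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / m)
                    for t in range(m)))
            for k in range(m // 2 + 1)]


def cosine_similarity(u, v):
    """Normalized dot product of two feature vectors ("cosine measure")."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def local_distance_matrix(frames_a, frames_b):
    """LDM with entries 1 - cosine similarity (assumed distance convention)."""
    feats_a = [magnitude_spectrum(f) for f in frames_a]
    feats_b = [magnitude_spectrum(f) for f in frames_b]
    return [[1.0 - cosine_similarity(u, v) for v in feats_b] for u in feats_a]
```

Frames with the same spectral content give entries close to zero, while frames whose spectra do not overlap give entries close to one.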
To judge the quality of an IMDTW alignment compared to a DTW alignment, we first computed the original DTW alignment path and measured both the relative geometrical distance of the IMDTW alignment path to the original and the relative difference in the costs of the two paths (Fig. xx). Furthermore, the total computation time, the total time for computing the STFT, and the ratio between the number of all matrix coefficients calculated during the IMDTW process (weighted by their corresponding window lengths) and the number of LDM coefficients of the final resolution were estimated for each setup (Fig. 5, right). This ratio can be seen as an overall benchmark for the computational complexity excluding feature extraction. The first test set consisted of ten different pop songs. For each of those songs we prepared two locally tempo distorted (pitch preserving) versions and aligned them to the original (Fig. 7). We analyzed the alignment for different combinations of scale factors α, iteration counts k, neighborhood widths p and for different frame counts.

Figure 5: Left: Worst case alignment with respect to the length of an optimal path (red). Right: The relative geometrical error of two DTW paths (red and blue curves) is given by the ratio of the area enclosed by the paths (green) and the total area of the matrix.

Figure 6: A counter-example: The "transformation" of the LDM is stable with respect to def. (5), but the shapes of the optimal paths (yellow dots) differ for two resolutions of the LDM. The right image shows an LDM with twice the resolution of the left one.

The optimal alignment was very stable with respect to frame width scaling due to a very sharp and low cost minimum for the optimal alignment. Hence, the geometrical error of the IMDTW alignment compared to the DTW alignment was very small in the case of small neighborhoods. For the same reason the cost error was sometimes relatively big, even for small differences in the optimal path geometry.
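The two error measures used above can be sketched in Python as follows. The function names are hypothetical, and the enclosed area is approximated discretely by comparing, row by row, the last column each path visits; the exact area computation used for the reported results is not specified in the text.

```python
def relative_geometric_error(path_a, path_b, n, n2):
    """Approximate area enclosed by two warping paths, relative to n * n2.

    Paths are lists of (row, column) cells of an n x n2 LDM. Each valid
    warping path visits every row, so the per-row lookup is always defined.
    """
    def last_column_per_row(path):
        col = {}
        for i, j in path:
            col[i] = max(j, col.get(i, j))
        return col

    col_a = last_column_per_row(path_a)
    col_b = last_column_per_row(path_b)
    area = sum(abs(col_a[i] - col_b[i]) for i in range(n))
    return area / (n * n2)


def relative_cost_error(cost_approx, cost_opt):
    """Relative difference between an approximate and the optimal path cost."""
    if cost_opt == 0:
        return 0.0 if cost_approx == 0 else float("inf")
    return (cost_approx - cost_opt) / cost_opt
```

Two identical paths give a geometric error of zero, and a path hugging the border of a square matrix differs from the diagonal by roughly half the matrix area.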
For sharp DTW alignments, IMDTW tends to get very close to the final shape of the alignment path with very little computational effort, whereas the minimization of the relative cost error needs a bigger p and more time. An error-free and significantly faster (by roughly one order of magnitude for about 10000 frames) alignment was always possible with a fixed setting for all songs, even for off-diagonal DTW paths, where classic acceleration techniques like Sakoe-Chiba bands or Itakura parallelograms fail. Some further results are presented in Figure 8.

In a second test, we calculated the IMDTW alignment for every pair of those ten different pop songs. We found that typically many warping paths had costs very close (or equal) to the costs of the optimal alignment, indicating that simple STFT is not adequate for aligning different songs. As a result, the geometric shape of the alignment path was not very stable with respect to frame scaling. Nevertheless, both geometrical and cost errors dropped to zero if the tube width was enlarged, and a significantly faster alignment was possible even in this unfavorable case. For more detailed information see Figure 9.

All tests indicated that higher iteration counts do not automatically result in better performance. Due to the computational overhead (especially with complex feature extraction methods), the starting resolution has to be chosen carefully. The larger the width p of the neighborhood around the optimal path was chosen, the less the results of IMDTW and classical DTW differed. Even for p = 9 both relative errors were below five percent. Interestingly, we found no strict general relationship between final matrix size, relative errors and p. It seems that the width p needed to achieve a certain error level depends more on the quality of an alignment than on the actual number of frames. We would have run tests with more frames, but this would have exceeded our memory capacity in the case of classical DTW.
However, due to the quadratic time complexity of the classical DTW algorithm it can be expected that the speed gain of IMDTW will increase with the frame count.

6. Conclusion and Future Work

In this paper we described a technique for effectively exploiting DTW for large signals for a wide range of different features and similarity measures. Our approach combines the ideas of restricting the path candidates (like Sakoe-Chiba bands) and iterative multi-resolution DTW. The method is especially suitable for alignments approximately satisfying the stability criterion given in section x.x. Furthermore, it can easily be integrated into existing DTW frameworks and needs only three simple additional parameters (compared to classical DTW). Further research will show how IMDTW performs in the case of more complicated features and measures and how the idea of multi-resolution feature analysis can be applied to other techniques.

References

[CKHP02] CHU S., KEOGH E., HART D., PAZZANI M.: Iterative deepening dynamic time warping for time series. In SIAM International Conference on Data Mining (2002).

[Ita75] ITAKURA F.: Line spectrum representation of linear predictive coefficients of speech signals. In J. Acoust. Soc. Amer. (1975), vol. 57, p. S35.

[KG03] KOVAR L., GLEICHER M.: Flexible automatic motion blending with registration curves, 2003.

[KP99] KEOGH E., PAZZANI M.: Scaling up dynamic time warping to massive datasets. In 3rd European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'99) (Prague, Czech Republic, 1999), Zytkow J. M., Rauch J., (Eds.), vol. 1704, Springer, pp. 1–11.

[SC71] SAKOE H., CHIBA S.: A Dynamic Programming Approach to Continuous Speech Recognition. In Proc. Intl. Congress on Acoustics (Budapest, Hungary, 1971).

[SRS03] SOULEZ F., RODET X., SCHWARZ D.: Improving polyphonic and poly-instrumental music to score alignment. In 4th International Symposium on Music Information Retrieval ISMIR-03 (2003).
[TE03] TURETSKY R. J., ELLIS D. P. W.: Ground-truth transcriptions of real music from force-aligned midi syntheses. In 4th International Symposium on Music Information Retrieval ISMIR-03 (2003).

Figure 7: Left: Example (first test): IMDTW alignment (LDM) of two music signals, where the first signal is a piecewise tempo distorted version of the second. Note that black regions are not computed at all.

Figure 8: In a first test we ran several IMDTW alignments, where the first music signal was a piecewise tempo distorted version of the second. We averaged the results of 25 individual alignments. For feature extraction STFT was used (m = 4096 for the final resolution; relative overlap was 0.8, thus $o = 0.8 \cdot m_i$). The final LDM matrices consisted of about 10000 by 10000 coefficients obtained from the normalized dot product ("cosine measure") of two feature vectors. The leftmost column shows results for a scale factor α = 2, the middle column for α = 4 and the right column for α = 8. The first row presents both relative cost and geometrical errors (yellow and blue). The middle row shows plots of the relative computation times (compared to classical DTW, which took 297 s). The yellow bars represent the time for feature extraction (STFT), the blue bars the total time for both STFT and time warping. The last row shows the relative number of all LDM coefficients computed during IMDTW compared to the coefficient number of the final LDM in the highest resolution, and the ratio of the number of all matrix coefficients calculated during the IMDTW process, weighted by their corresponding window lengths, to the matrix coefficient number of the final resolution (yellow and blue). Note that these two ratios can be seen as a benchmark for the complexity of "IMDTW with signal scaling (PAA)" and "IMDTW with frame width scaling" respectively, excluding feature extraction costs. All results are plotted with respect to different iteration counts k and path widths (3-257). The iteration count increases from left to right.

Figure 9: In a second test we ran 25 alignments of different pop songs (Queen...) and averaged the results. The figures have the same meaning as described in Fig. 8.