Constrained Motif Discovery Yasser Mohammad and Toyoaki Nishida Abstract— The goal of motif discovery algorithms is to efficiently find unknown recurring patterns in time series. Most available algorithms cannot utilize domain knowledge in any way which results in quadratic or at least sub-quadratic time and space complexity. For large time series datasets for which domain knowledge can be available this is a severe limitation. In this paper we define the Constrained Motif Discovery problem which enables utilization of domain knowledge into the motif discovery process. We also show that most unconstrained motif discovery problems be converted into constrained motif discovery problem using a change point detection algorithm. We provide two algorithms for solving this problem and compare their performance to state-of-the-art motif discovery algorithms on a large set of synthetic time series. The proposed algorithms can provide linear time and constant space complexity. The proposed algorithms provided four to ten folds increase in speed compared to two state of the art motif discovery algorithms without loss of accuracy and provided better noise robustness in high noise levels. I. INTRODUCTION The research in unsupervised motif discovery have led to many techniques including the PROJECTIONS algorithm [1], PERUSE [2], Gemoda [3] among many others ([4]). With the exception of Gemoda which is quadratic in time and space complexities, these algorithms aim to achieve sub-quadratic time complexity by first looking for candidate motif stems using some heuristic method and then doing an exhaustive motif detection instead of motif discovery which is linear in time. The most used method for finding these stems is the PROJECTIONS algorithm [1] which requires discritization of the data using the SAX [5] algorithm. A major drawback of all methods relying on discritization of the data is the need to specify a word length and a vocabulary size that are difficult to decide in real world situations. Another problem of most of the previously proposed techniques is the need to specify an exact or at least roughly correct motif length. A third problem is deciding when to stop searching for new motifs ([4] suggested using density estimation for this purpose). One common problem to all the algorithms based on the PROJECTIONS algorithm [4], [1] is the need to construct and keep the collision matrix which is in general quadratic in the length of the SAX word describing the time series. Given that the optimal word size can be Yasser Mohammad is with the Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University [email protected] Toyoaki Nishida is with the Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University [email protected] The first author would like to thank the Egyptian Ministry of Higher Education for supporting his PhD study during which this research took place very short for short motifs in signals with high frequency components, the size of this matrix can grow quadratic with the length of the time series. Catalano et al. [6] suggested a very efficient algorithm for locating variable length patterns in data series using random sampling that allows it to run in linear time and constant memory. The main problem of this approach is that it relies on random sampling which can lead to poor performance for long time series with infrequent embedded motifs including long records of human activities. All of the methods proposed for motif discovery that we are aware of assume no prior knowledge of the probable locations of the motifs which leads to this explosion in the processing time or space needed. This paper introduces the Constrained Motif Discovery problem in which we try to find motifs utilizing information about their probable locations in the data series. This leads to a linear time algorithm with constant space complexity based on the user choice. The introduction of constraints also allows interactive implementations which can utilize the human eye as the ultimate pattern recognizer to achieve higher performance. II. C ONSTRAINED M OTIF D ISCOVERY P ROBLEM Definition 1: Time series A time series of length n (X n ) is defined as an ordered set of values x (t) where 1 ≤ t ≤ n and x (t) ∈ U . U is called the domain of X n . Definition 2: Distance Function A distance function d over a domain U is a function that implements the transformation: d y1 l1 , y2 l2 : U l1 × U l2 → <+ for any two time series y l1 and y l2 defined over the domain U . The transformation function be nonzero and transitive. Definition 3: Subsequence set Given a time series X n of length n, the subsequence set of lmin :lmax X of length range lmin : lmax (Xsub ) is defined as the l set of all unique time series xsub that satisfy: 1) lmin ≤ l ≤ lmax l 2) xsub (t) = x (t + i) for all 1 ≤ t ≤ l and some integer i≥0 Definition 4: Trivial Match Set Given a subsequence xsub of a time series X, the trivial match set of xsub (triv (xsub )) is defined as the set of all time subsequences Tsub that intersects the subsequence xsub in one or more points. Definition 5: Motif Given a time series X n , and a distance function d and two length limits (lmin and lmax ), a motif M is defined as a set of time series called motif occurrences (mi ) that satisfy: l :lmax min 1) mi ∈ Xsub 2) lmin ≤ len (mi ) ≤ lmax 3) arg min (d (mi , x)) ∈ M, x ∈ x n l :l o min max xsub − triv (mi ) d (mi , x) > d (mi , y) 4) wherex ∈ M, y ∈ Xsub − M − triv (mi ) 5) mi is maximal. This means that mi is not a subsequence of any larger motif occurrence satisfying the previous conditions. 6) The number of elements in the motif M is larger than the mean of the number of elements in any other subsequence set that satisfy the previous conditions. Formally this can be stated as: There is a real value η > 1 so that: |M | > η × E {|N M |} where N M is the set of all time series sets satisfying the other conditions in this list, |S| is the number of elements in the set S, and E {} is the expectation operator. Definition 6: Constrained Motif Discovery Problem Given a time series X n , another time series P n of the same length called a constraint (where 0.0 ≤ p (t) < 1.0), two real valued limits lmin and lmax , and a distance function d find all motifs (M lmin :lmax ) of length l where lmin ≤ l ≤ lmax . The unconstrained motif discovery problem solved by available motif discovery algorithms can be considered a special case of the constrained motif discovery problem where p (t1 ) = p (t2 ) for all 1 ≤ t1 ≤ n and 1 ≤ t2 ≤ n. In the remaining of this paper we will assume without loss of generality that the domain of the time series (U ) is the set of real numbers (<). Discrete Time Wrapping (DTW) will be used as the distance function. III. MC A LGORITHMS Catalano et al. [6] proposed an efficient algorithm for locating recurring motifs in time series that works in linear time and constant memory. The algorithm works by processing a fixed sized set of candidate and comparison windows randomly sampled from the time series. The steps of the algorithm can be summarized as: 1) Select a subsequence sw of length w ≥ lmax . 2) Select w values randomly from X and concatenate them to form a noise sequence nw 3) select a set of nc comparison subsequences of X ({cw i }) each of length w. 4) find the set S ŵ of subsequences of of length ŵ where ŵ ≤ lmin for sw ,cw i , and nw . Then normalize all of the resulting subsequences to have unit mean square. 5) For the candidate subsequence of sw (sŵ k ) do the following a) Randomly select w − ŵ − 1 subsequences from the set of all subsequences of the comparison windows (cw i ). call this set the comparison set ĉŵ j w b) find the distances dkj = d sŵ k , ĉ j . c) Group the set ĉŵ j with their parent subsequence cw i and for every group select ĉŵ j that has the minimum distance dkj . This leads to a set of R subsequences c̃ŵ r where R ≤ nc d) keep only the R̂ subsequences of c̃ŵ r with least dkr . e) repeat the previous three steps for subsequences of the noise subsequence nw . This leads to another set of R̂ subsequences called ñŵ r 6) Remove all candidate subsequences sŵ k that has similar ŵ average distance with both ñŵ r and c̃r and then repeat the steps above using this reduced set. Repeat this reduction for nr times. 7) If the final set sŵ k is not empty output each of them as a motif seed Ms after concatenating any continuous subset of them. Assuming that the constraint p (t) is specifying the probability that a motif occurrence ends at or near the time step t, we can modify the first three steps of the original algorithm to speed up the motif discovery process. Two alternative ways to do so are presented in the following subsections. These Modified Catalano algorithms are what we call the MC algorithms in the rest of this paper. In section V we will show that this simple modification can speedup the original Catalano algorithm significantly given that the constraint is selected wisely. In section IV we will show an effective domain independent method for calculating the constraint for any real valued time series. A. MCFull Algorithm If calculating the constraint P is not computationally expensive or cannot be calculated for any required point of the series X using only local information, we can speed up Catalano algorithm by modifying the first three steps as follows: 1) Apply a gaussian smoothing filter (N (0, σ 2 )) to the original P constraint which results on the smoothed constraint P̃ . n P 2) normalize P̃ so that p̂ (t) = 1 and 0 ≤ p̂ (t) ≤ 1. t=1 3) Randomly select a subsequence sw of length w ≥ lmax using P̂ as the probability distribution. 4) Randomly select w values from X and concatenate them to form a noise sequence nw using 1 − P̂ as the probability distribution. 5) Randomly select a set of nc comparison subsequences of X ({cw i }) each of length w using P̂ as the probability distribution. The rest of the algorithm goes exactly as the original algorithm. The smoothing step is required to account for any inaccuracy of the constraint. This way we understand the constraint as specifying the probability that a motif occurrence ends near and not necessary at every time step. This algorithm is called MCFull (standing for Modified Catalano Full) because the P constraint has to be calculated completely before the algorithm is run. B. MCInc Algorithm If calculating the constraint P is computationally expensive and more over if p (t) can be calculated using only local information around t, then an incremental version of MCFull can be constructed (named MCInc hereafter) by modifying the MCFull steps as follows: 1) Randomly select a time step 1 ≤ τ ≤ n using a uniform distribution. 2) Calculate p (t) for τ − m ≤ t ≤ τ + m and calculate τP +m 1 p (t). p̂ (τ ) = 2×m+1 t=τ −m 3) Repeat steps 1 and 2 as long as p̂ (τ ) ≤ th 4) select the subsequence sw of length w ≥ lmax that ends at τ 5) Randomly select w values from X and concatenate them to form a noise sequence nw . 6) Repeat steps 1,2,3 fornc times to select the comparison subsequences of X ({cw i }) each of length w. IV. C ALCULATING THE CONSTRAINT The Constrained Motif Discovery problem as specified in section II does not define any specific way to calculate the constraint value P , because this constraint can come from domain knowledge. In this section we briefly describe a method for calculating the constraint value P given only the time series X. Using this technique it is possible to apply algorithm designed to solve the constrained motif discovery problem even when there is no enough domain knowledge to calculate the constraint. In section V we will use this technique to compare our proposed algorithms with other motif discovery algorithms and will show that in many cases this indirect method of solving the unconstrained motif discovery problem can be even more efficient than using traditional approaches. To find the needed noisy estimation of the motif locations for cases in which domain knowledge cannot be utilized we use our Robust Singular Spectrum Transform (RSST) [7] to find change points in the time series under the assumption that motifs are corresponding to changes in the dynamics of the time series or the generating processes. Although pathological cases that does not follow this assumptions can be designed, in most real-world time series mining we are interested in motifs that cause or result from some change either in the time series itself or related variables. In all such cases the proposed algorithms can be of value in discovering the interesting motifs much faster than currently available techniques. The main idea is to find the change score at every point which is defined as the probability that there is a change in the dynamics of the time series and utilize this score as the constraint value at this point. Only the basic idea of the RSST algorithm is given here. For more details please refer to [7]. The main idea of the RSST algorithm is to calculate the directions of maximal change in the time series before and after every point t of the time series X. The maximal change directions of the past is calculated as the singular vectors associated with the l largest singular values of the matrix defined by arranging n overlapping subsequences of length w before the point in rows. The parameter l is determined dynamically at every point of the time series. The maximal change directions of the future is defined the same way. When used to find the constraint for a time series X we select w = lmin /2 and n = d2 × lmax /lmin e. The angles between directions estimated using the future of the time series and the subspace defined by the directions of the past of the time series is then used to find the change score p (t). V. E VALUATION To evaluate the effectiveness of the proposed modifications of Catalano algorithm in solving the constrained motif discovery problem we will compare the performance of the following algorithms on a wide range of synthetic data with embedded motifs: 1) Projections [Pro]: A general motif discovery algorithm introduced in [8] and is the basis of many other stateof-the-art motif discovery algorithms [8], [1]. This algorithm requires the specification of the exact motif length. 2) Catalano’s algorithm [Cat]: The original Catalano et al.’s algorithm as specified in [6]. 3) MCFull: proposed in section III-A 4) MCInc: proposed in section III-B 50400 different synthetic time series with controlled embedded motifs were produced by changing various features of the time series. Every time series was composed of a recurring pattern called the background signal with embedded motifs embedded at random locations with controllable numbers. Uniform noise was added to the time series. The features of the time series that was changed in the tests are: 1) time series length was changed from 1×103 to 1×106 2) The relative number of embedded motifs was changed from 1% to 10% of the length of the time series. 3) The length of the embedded motif was changed from 50 to 100 points. 4) The noise level defined as the peak value of the white noise added divided by the peak-to-peak value of the motifs from 0% to 20%. 5) The background signal strength defined as the peakto-peak value of the background signal divided by the peak-to-peak value of the embedded pattern was changed between 0% to 40%. 6) The generating process of both the background signal and the motif. For Catalano, MCFull and MCInc, the parameters lmin , lmax were fixed at 40 and 100 respectively. For the Projections algorithm the correct motif size was given to the algorithm. Fig. 1 shows the processing time per point for every one of the tested algorithms. It is clear that the Projections algorithm is subquadratic although not linear while the Catalano, MCFull, and MCInc are all linear specially for long time Fig. 1. Processing Time per point of the input time series for the four tested motif discovery algorithms. For MCFull and MCInc the time required to calculate the constraint using RSST is also added series. The average speedup obtained by using MCFull was 4.19 times over the Catalno algorithm and 10.76 times over the Projections algorithm. The average speedup obtained by MCInc was 4.09 over the Catalano algorithm and 9.84 times over the Projections algorithm. motif discovery problems into constrained motif discovery problems that can be solved in linear time. The evaluation of the proposed algorithms show four folds increase in speed over the original algorithm in a large set of synthetic time series. Comparison with the Projections algorithm also showed a ten folds increase in speed compared. The proposed algorithms was also more robust against noise for high noise levels even though the Projections algorithm was the best algorithm for low noise levels. The proposed algorithms can be used to solve many real world problems in which information about the locations of motifs can be inferred from other available time series. For example in interaction contexts, the behavior of the interactors are dependent on each other and the locations of motifs in one of them can correspond (with some delay) to the locations of motifs in the other. The ability to use this kind of knowledge to speedup and increase the accuracy of motif discovery is a unique feature of the proposed algorithms. Directions of future research include applying the proposed algorithms to real world data and modifying them to solve multidimensional constrained motif discovery problems. Speeding up the RSST algorithm in order to solve unconstrained motif discovery problems faster can be another direction of future research. Other algorithms for solving the constrained motif discovery problem that provide the exhaustiveness in finding motif occurrences guaranteed by PROJECTIONS based algorithms is a third direction of future research. R EFERENCES Fig. 2. The Speedup obtained over Catalano et al.’s original algorithm for the two proposed algorithms To further study the effect of time series length on the speedup obtained by using the MCFull and MCInc algorithms, Fig. 2 plots the speedup over the Catalano Algorithm as a function of the time series length. As the figure shows, MCFull can even be slower than Catalano algorithm for short time series while MCInc shows four folds increase in speed at the same conditions. When the time series length increases, the advantage of MCInc over MCFull decreases until for long time series, MCFull achieves faster processing than MCInc. VI. C ONCLUSION This paper presented the Constrained Motif Discovery problem and the first two algorithms to solve it. The two algorithms (MCFull and MCInc) are modifications of the recently developed algorithm by Catalano et al. for solving unconstrained motif discovery problems. The paper also briefly introduced the RSST algorithm for change point detection and showed how to use it to convert unconstrained [1] Bill Chiu, Eamonn Keogh, and Stefano Lonardi. Probabilistic discovery of time series motifs. In KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 493–498, New York, NY, USA, 2003. ACM. [2] Tom Oates. Peruse: An unsupervised algorithm for finding recurring patterns in time series. In International Conference on Data Mining, pages 330–337, 2002. [3] Kyle L. Jensen, Mark P. Styczynxki, Isidore Rigoutsos, and Greogory N. Stephanopoulos. A generic motif discovery algorithm for sequenctial data. BioInformatics, 22(1):21–28, 2006. [4] D. Minnen, T. Starner, I. Essa, and C.L. Isbell. Improving activity discovery with automatic neighborhood estimation. In Int. Joint Conf. on Artificial Intelligence, 2007. [5] E. Keogh, J. Lin, and A. Fu. Hot sax: efficiently finding the most unusual time series subsequence. Data Mining, Fifth IEEE International Conference on, pages 8 pp.–, Nov. 2005. [6] Joe Catalano, Tom Armstrong, and Tim Oates. Discovering patterns in real-valued time series. In Knowledge Discovery in Databases: PKDD 2006, pages 462–469, 2007. [7] Yasser Mohammad and Toyoaki Nishida. Change point detection using robust singular spectrum transform applied to mining human-human interaction records. In International Conference on Data Mining, 2009. submitted. [8] J. Buhler and M. Tompa. Finding motifs using random projections. In 5th Internatinal Conference on Computational Biology, pages 69–76, 2001.
© Copyright 2026 Paperzz