Constrained Motif Discovery

Yasser Mohammad and Toyoaki Nishida
Abstract— The goal of motif discovery algorithms is to
efficiently find unknown recurring patterns in time series. Most
available algorithms cannot utilize domain knowledge in any
way, which results in quadratic or, at best, sub-quadratic time
and space complexity. For large time series datasets for which
domain knowledge is available, this is a severe limitation.
In this paper we define the Constrained Motif Discovery
problem, which enables the utilization of domain knowledge in the
motif discovery process. We also show that most unconstrained
motif discovery problems can be converted into constrained motif
discovery problems using a change point detection algorithm. We
provide two algorithms for solving this problem and compare
their performance to state-of-the-art motif discovery algorithms
on a large set of synthetic time series. The proposed algorithms
can provide linear time and constant space complexity. They
provided a four- to ten-fold increase in speed
compared to two state-of-the-art motif discovery algorithms
without loss of accuracy, and provided better noise robustness
at high noise levels.
I. INTRODUCTION
Research in unsupervised motif discovery has led
to many techniques including the PROJECTIONS algorithm
[1], PERUSE [2], and Gemoda [3], among many others [4].
With the exception of Gemoda, which is quadratic in time
and space complexity, these algorithms aim to achieve
sub-quadratic time complexity by first looking for candidate
motif stems using some heuristic method and then performing an
exhaustive motif detection (rather than motif discovery), which
is linear in time. The most widely used method for finding these
stems is the PROJECTIONS algorithm [1], which requires
discretization of the data using the SAX [5] algorithm. A
major drawback of all methods relying on discretization of
the data is the need to specify a word length and a vocabulary
size, both of which are difficult to decide in real-world situations.
Another problem of most of the previously proposed techniques
is the need to specify an exact or at least roughly
correct motif length. A third problem is deciding when to
stop searching for new motifs ([4] suggested using density
estimation for this purpose). One problem common to all the
algorithms based on the PROJECTIONS algorithm [4], [1] is
the need to construct and keep the collision matrix, which is
in general quadratic in the length of the SAX word describing
the time series. Given that the optimal word size can be
very short for short motifs in signals with high frequency
components, the size of this matrix can grow quadratically with
the length of the time series. Catalano et al. [6] suggested a
very efficient algorithm for locating variable length patterns
in data series using random sampling that allows it to run
in linear time and constant memory. The main problem of
this approach is that it relies on random sampling, which can
lead to poor performance for long time series with infrequent
embedded motifs, such as long records of human activities.

Yasser Mohammad is with the Department of Intelligence Science
and Technology, Graduate School of Informatics, Kyoto University.
Toyoaki Nishida is with the Department of Intelligence Science
and Technology, Graduate School of Informatics, Kyoto University.
The first author would like to thank the Egyptian Ministry of Higher
Education for supporting his PhD study during which this research took
place.

All of the motif discovery methods that we
are aware of assume no prior knowledge of the probable
locations of the motifs, which leads to this explosion in
the processing time or space needed. This paper introduces
the Constrained Motif Discovery problem, in which we try
to find motifs utilizing information about their probable
locations in the data series. This leads to algorithms that can
run in linear time with constant space complexity, depending on
the user's choice. The introduction of constraints also allows interactive
implementations which can utilize the human eye as the
ultimate pattern recognizer to achieve higher performance.
II. CONSTRAINED MOTIF DISCOVERY PROBLEM
Definition 1: Time series
A time series of length $n$ ($X^n$) is defined as an ordered set
of values $x(t)$ where $1 \le t \le n$ and $x(t) \in U$. $U$ is called
the domain of $X^n$.
Definition 2: Distance Function
A distance function $d$ over a domain $U$ is a function that
implements the transformation
$d(y_1^{l_1}, y_2^{l_2}) : U^{l_1} \times U^{l_2} \to \mathbb{R}^+$
for any two time series $y_1^{l_1}$ and $y_2^{l_2}$ defined over the
domain $U$. The transformation function must be nonzero and
transitive.
Definition 3: Subsequence set
Given a time series $X^n$ of length $n$, the subsequence set of
$X$ of length range $l_{min} : l_{max}$ ($X_{sub}^{l_{min}:l_{max}}$) is defined as the
set of all unique time series $x_{sub}^{l}$ that satisfy:
1) $l_{min} \le l \le l_{max}$
2) $x_{sub}^{l}(t) = x(t + i)$ for all $1 \le t \le l$ and some integer
$i \ge 0$
Definition 4: Trivial Match Set
Given a subsequence $x_{sub}$ of a time series $X$, the trivial
match set of $x_{sub}$ ($triv(x_{sub})$) is defined as the set of all
subsequences $T_{sub}$ that intersect the subsequence $x_{sub}$
in one or more points.
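
Definitions 3 and 4 can be illustrated with a short sketch. It represents each subsequence by a hypothetical `(start, length)` pair with 0-based starts rather than by its values, so it enumerates positions and does not remove subsequences whose values happen to coincide; the function names are assumptions introduced for illustration only.

```python
def subsequence_spans(n, l_min, l_max):
    """Return (start, length) pairs for every subsequence of a length-n
    series whose length l satisfies l_min <= l <= l_max (0-based starts)."""
    return [(i, l)
            for l in range(l_min, l_max + 1)
            for i in range(0, n - l + 1)]


def trivial_matches(span, spans):
    """Return the spans that share at least one time step with `span`."""
    start, length = span
    end = start + length  # exclusive end index
    return [(s, l) for (s, l) in spans if s < end and start < s + l]


spans = subsequence_spans(n=6, l_min=2, l_max=3)
overlapping = trivial_matches((0, 2), spans)
print(len(spans), overlapping)
```

Note that under this reading a subsequence is a trivial match of itself, since it intersects itself at every point.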
Definition 5: Motif
Given a time series $X^n$, a distance function $d$, and two
length limits ($l_{min}$ and $l_{max}$), a motif $M$ is defined as a set
of time series called motif occurrences ($m_i$) that satisfy:
1) $m_i \in X_{sub}^{l_{min}:l_{max}}$
2) $l_{min} \le len(m_i) \le l_{max}$
3) $\arg\min_{x} d(m_i, x) \in M$, where $x \in \{X_{sub}^{l_{min}:l_{max}} - triv(m_i)\}$
4) $d(m_i, x) > d(m_i, y)$,
where $x \in M$, $y \in X_{sub} - M - triv(m_i)$
5) $m_i$ is maximal. This means that $m_i$ is not a subsequence
of any larger motif occurrence satisfying the
previous conditions.
6) The number of elements in the motif $M$ is larger
than the mean of the number of elements in any other
subsequence set that satisfies the previous conditions.
Formally this can be stated as: There is a real value
$\eta > 1$ so that:
$|M| > \eta \times E\{|NM|\}$
where $NM$ is the set of all time series sets satisfying
the other conditions in this list, $|S|$ is the number of
elements in the set $S$, and $E\{\}$ is the expectation
operator.
Definition 6: Constrained Motif Discovery Problem
Given a time series $X^n$, another time series $P^n$ of the same
length called a constraint (where $0.0 \le p(t) < 1.0$), two real
valued limits $l_{min}$ and $l_{max}$, and a distance function $d$, find
all motifs ($M^{l_{min}:l_{max}}$) of length $l$ where $l_{min} \le l \le l_{max}$.
The unconstrained motif discovery problem solved by
available motif discovery algorithms can be considered a
special case of the constrained motif discovery problem
where $p(t_1) = p(t_2)$ for all $1 \le t_1 \le n$ and $1 \le t_2 \le n$.
In the remainder of this paper we will assume without loss
of generality that the domain of the time series ($U$) is the set
of real numbers ($\mathbb{R}$). Dynamic Time Warping (DTW) will
be used as the distance function.
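
For readers unfamiliar with DTW, the following is a minimal sketch of the textbook dynamic-programming formulation. The absolute difference as local cost and the absence of a warping-window constraint are assumptions; the paper does not state which DTW variant was used.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two real-valued sequences,
    computed in O(len(a) * len(b)) time with |a_i - b_j| as local cost."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# Sequences of different lengths can still align perfectly.
print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))
```

Because DTW accepts sequences of different lengths, it fits the distance function of Definition 2, which is defined over $U^{l_1} \times U^{l_2}$.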
III. MC ALGORITHMS
Catalano et al. [6] proposed an efficient algorithm for
locating recurring motifs in time series that works in linear
time and constant memory. The algorithm works by processing
a fixed-size set of candidate and comparison windows
randomly sampled from the time series. The steps of the
algorithm can be summarized as:
1) Select a subsequence $s^w$ of length $w \ge l_{max}$.
2) Select $w$ values randomly from $X$ and concatenate
them to form a noise sequence $n^w$.
3) Select a set of $n_c$ comparison subsequences of $X$
($\{c_i^w\}$), each of length $w$.
4) Find the set $S^{\hat{w}}$ of subsequences of length $\hat{w}$, where
$\hat{w} \le l_{min}$, for $s^w$, $c_i^w$, and $n^w$. Then normalize all of
the resulting subsequences to have unit mean square.
5) For each candidate subsequence $s_k^{\hat{w}}$ of $s^w$ do the
following:
a) Randomly select $w - \hat{w} - 1$ subsequences from
the set of all subsequences of the comparison
windows ($c_i^w$). Call this set the comparison set
$\hat{c}_j^{\hat{w}}$.
b) Find the distances $d_{kj} = d(s_k^{\hat{w}}, \hat{c}_j^{\hat{w}})$.
c) Group the set $\hat{c}_j^{\hat{w}}$ with their parent subsequence
$c_i^w$ and for every group select the $\hat{c}_j^{\hat{w}}$ that has the
minimum distance $d_{kj}$. This leads to a set of $R$
subsequences $\tilde{c}_r^{\hat{w}}$ where $R \le n_c$.
d) Keep only the $\hat{R}$ subsequences of $\tilde{c}_r^{\hat{w}}$ with least
$d_{kr}$.
e) Repeat the previous three steps for subsequences
of the noise subsequence $n^w$. This leads to another
set of $\hat{R}$ subsequences called $\tilde{n}_r^{\hat{w}}$.
6) Remove all candidate subsequences $s_k^{\hat{w}}$ that have a similar
average distance to both $\tilde{n}_r^{\hat{w}}$ and $\tilde{c}_r^{\hat{w}}$ and then repeat
the steps above using this reduced set. Repeat this
reduction $n_r$ times.
7) If the final set $s_k^{\hat{w}}$ is not empty, output each of its elements
as a motif seed $M_s$ after concatenating any continuous
subset of them.
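
Steps 4 and 5(b) to 5(d) can be sketched as follows. This is a simplified illustration, not Catalano et al.'s implementation: it uses every sub-window of each comparison window instead of the random selection of step 5(a), and it uses Euclidean distance as a stand-in distance function. All function names are hypothetical.

```python
import math


def sub_windows(window, w_hat):
    """All contiguous sub-windows of length w_hat (step 4), each scaled so
    that the mean of its squared values is 1."""
    out = []
    for i in range(len(window) - w_hat + 1):
        sub = window[i:i + w_hat]
        rms = math.sqrt(sum(v * v for v in sub) / w_hat)
        out.append([v / rms for v in sub] if rms > 0 else list(sub))
    return out


def euclidean(a, b):
    """Stand-in distance between two equal-length sub-windows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match_distances(candidate, comparison_windows, w_hat, keep):
    """For one normalized candidate sub-window, take the closest sub-window
    inside each comparison window (step 5c), then keep the `keep` smallest
    of those distances (step 5d)."""
    per_window = [min(euclidean(candidate, c) for c in sub_windows(cw, w_hat))
                  for cw in comparison_windows]
    return sorted(per_window)[:keep]


candidate = sub_windows([1.0, 2.0, 1.0, 2.0], 2)[0]
windows = [[2.0, 4.0, 2.0, 4.0], [5.0, 5.0, 5.0, 5.0]]
kept = best_match_distances(candidate, windows, w_hat=2, keep=1)
print(kept)
```

In this toy input the first comparison window contains a scaled copy of the candidate, so after normalization its best distance is essentially zero. Running the same function on sub-windows of the noise sequence gives the noise-side distances of step 5(e).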
Assuming that the constraint $p(t)$ specifies the probability
that a motif occurrence ends at or near the time step $t$,
we can modify the first three steps of the original algorithm
to speed up the motif discovery process. Two alternative
ways to do so are presented in the following subsections.
These Modified Catalano algorithms are what we call the
MC algorithms in the rest of this paper. In Section V we will
show that this simple modification can speed up the original
Catalano algorithm significantly, provided that the constraint is
selected wisely. In Section IV we will show an effective
domain-independent method for calculating the constraint
for any real-valued time series.
A. MCFull Algorithm
If calculating the constraint $P$ is not computationally
expensive, or if it cannot be calculated for any required point of
the series $X$ using only local information, we can speed
up Catalano's algorithm by modifying the first three steps as
follows:
1) Apply a Gaussian smoothing filter ($N(0, \sigma^2)$) to the
original constraint $P$, which results in the smoothed
constraint $\tilde{P}$.
2) Normalize $\tilde{P}$ so that $\sum_{t=1}^{n} \hat{p}(t) = 1$ and $0 \le \hat{p}(t) \le 1$.
3) Randomly select a subsequence $s^w$ of length $w \ge l_{max}$
using $\hat{P}$ as the probability distribution.
4) Randomly select $w$ values from $X$ and concatenate
them to form a noise sequence $n^w$ using $1 - \hat{P}$ as the
probability distribution.
5) Randomly select a set of $n_c$ comparison subsequences
of $X$ ($\{c_i^w\}$), each of length $w$, using $\hat{P}$ as the
probability distribution.
The rest of the algorithm proceeds exactly as in the original
algorithm. The smoothing step is required to account for
any inaccuracy of the constraint. This way we understand
the constraint as specifying the probability that a motif
occurrence ends near, and not necessarily at, every time step.
This algorithm is called MCFull (standing for Modified
Catalano Full) because the constraint $P$ has to be calculated
completely before the algorithm is run.
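
The modified sampling steps of MCFull can be sketched as below. Several details are assumptions rather than part of the paper: the Gaussian kernel is truncated at three standard deviations, border values are replicated, and a sampled time step is taken as the end of the selected window (mirroring the MCInc description). The function names are hypothetical.

```python
import math
import random


def gaussian_smooth(p, sigma):
    """Smooth p with a Gaussian kernel truncated at 3*sigma (step 1).
    Indices outside the series are clamped to the nearest border value."""
    radius = max(1, int(3 * sigma))
    kernel = [math.exp(-(k * k) / (2.0 * sigma * sigma))
              for k in range(-radius, radius + 1)]
    total = sum(kernel)
    kernel = [k / total for k in kernel]
    n = len(p)
    return [sum(kernel[k + radius] * p[min(max(t + k, 0), n - 1)]
                for k in range(-radius, radius + 1))
            for t in range(n)]


def normalize(p):
    """Scale p so that its values sum to 1 (step 2)."""
    s = sum(p)
    return [v / s for v in p]


def sample_window_ends(p_hat, w, count, rng):
    """Draw window end indices with probability proportional to p_hat
    (steps 3 and 5). Only ends that leave room for w points are eligible."""
    ends = list(range(w - 1, len(p_hat)))
    return rng.choices(ends, weights=[p_hat[e] for e in ends], k=count)


def sample_noise(x, p_hat, w, rng):
    """Draw w values of x with weights 1 - p_hat (step 4)."""
    return rng.choices(x, weights=[1.0 - v for v in p_hat], k=w)


rng = random.Random(0)
x = [math.sin(0.1 * t) for t in range(100)]
p = [0.0] * 100
p[50] = 1.0                      # toy constraint: one likely motif end
p_hat = normalize(gaussian_smooth(p, sigma=2.0))
ends = sample_window_ends(p_hat, w=10, count=20, rng=rng)
noise = sample_noise(x, p_hat, w=10, rng=rng)
windows = [x[e - 9:e + 1] for e in ends]   # candidate/comparison windows
```

With this toy constraint every sampled window ends within six points of the spike, which shows how the smoothing step spreads the constraint over nearby time steps.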
B. MCInc Algorithm
If calculating the constraint $P$ is computationally expensive
and, moreover, $p(t)$ can be calculated using only local
information around $t$, then an incremental version of MCFull
can be constructed (named MCInc hereafter) by modifying
the MCFull steps as follows:
1) Randomly select a time step $1 \le \tau \le n$ using a
uniform distribution.
2) Calculate $p(t)$ for $\tau - m \le t \le \tau + m$ and calculate
$\hat{p}(\tau) = \frac{1}{2 \times m + 1} \sum_{t=\tau-m}^{\tau+m} p(t)$.
3) Repeat steps 1 and 2 as long as $\hat{p}(\tau) \le th$.
4) Select the subsequence $s^w$ of length $w \ge l_{max}$ that
ends at $\tau$.
5) Randomly select $w$ values from $X$ and concatenate
them to form a noise sequence $n^w$.
6) Repeat steps 1, 2, and 3 $n_c$ times to select the comparison
subsequences of $X$ ($\{c_i^w\}$), each of length $w$.
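
Steps 1 to 3 amount to rejection sampling with a lazily evaluated constraint, which might be sketched as follows. The callable `local_constraint`, the `max_tries` safeguard, the restriction of $\tau$ to positions that leave room for a full window, and the truncation of the averaging range at the series borders are all assumptions added for illustration.

```python
import random


def sample_end_incremental(local_constraint, n, w, m, threshold, rng,
                           max_tries=10000):
    """Draw tau uniformly and accept it once the constraint averaged over
    [tau - m, tau + m] exceeds `threshold` (steps 1 to 3).

    local_constraint(t) is only evaluated for the time steps visited, so
    the full constraint series is never computed.
    """
    for _ in range(max_tries):
        tau = rng.randrange(w - 1, n)          # 0-based end of a window
        lo, hi = max(0, tau - m), min(n - 1, tau + m)
        p_hat = sum(local_constraint(t) for t in range(lo, hi + 1))
        p_hat /= (2 * m + 1)
        if p_hat > threshold:
            return tau
    raise RuntimeError("no time step passed the threshold")


def toy_constraint(t):
    """Hypothetical local constraint: high only between steps 40 and 60."""
    return 0.9 if 40 <= t <= 60 else 0.0


rng = random.Random(1)
taus = [sample_end_incremental(toy_constraint, n=100, w=10, m=2,
                               threshold=0.5, rng=rng)
        for _ in range(20)]
print(taus)
```

Each accepted `tau` gives the window `x[tau - w + 1 : tau + 1]` of step 4; calling the function $n_c$ more times gives the comparison windows of step 6. Caching the values returned by `local_constraint` would avoid recomputing points visited more than once.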
IV. CALCULATING THE CONSTRAINT
The Constrained Motif Discovery problem as specified in
Section II does not define any specific way to calculate the
constraint $P$, because this constraint can come from
domain knowledge. In this section we briefly describe a
method for calculating the constraint $P$ given only
the time series $X$. Using this technique it is possible to
apply algorithms designed to solve the constrained motif
discovery problem even when there is not enough domain
knowledge to calculate the constraint. In Section V we will
use this technique to compare our proposed algorithms with
other motif discovery algorithms and will show that in many
cases this indirect method of solving the unconstrained motif
discovery problem can be even more efficient than using
traditional approaches.
To find the needed noisy estimate of the motif locations
for cases in which domain knowledge cannot be utilized, we
use our Robust Singular Spectrum Transform (RSST) [7] to
find change points in the time series under the assumption
that motifs correspond to changes in the dynamics
of the time series or the generating processes. Although
pathological cases that do not follow this assumption can
be designed, in most real-world time series mining we are
interested in motifs that cause or result from some change
either in the time series itself or in related variables. In all such
cases the proposed algorithms can be of value in discovering
the interesting motifs much faster than currently available
techniques.
The main idea is to find the change score at every point,
which is defined as the probability that there is a change in
the dynamics of the time series, and to utilize this score as the
constraint value at this point.
Only the basic idea of the RSST algorithm is given here.
For more details please refer to [7]. The main idea of the
RSST algorithm is to calculate the directions of maximal
change in the time series before and after every point $t$ of
the time series $X$. The maximal change directions of the
past are calculated as the singular vectors associated with the
$l$ largest singular values of the matrix defined by arranging
$n$ overlapping subsequences of length $w$ before the point in
rows. The parameter $l$ is determined dynamically at every
point of the time series. The maximal change directions of
the future are defined the same way. When used to find the
constraint for a time series $X$ we select $w = l_{min}/2$ and
$n = \lceil 2 \times l_{max}/l_{min} \rceil$.
The angles between the directions estimated using the future
of the time series and the subspace defined by the directions
of the past of the time series are then used to find the change
score $p(t)$.
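
The general idea of comparing past and future directions can be illustrated with a deliberately simplified sketch. It is not the RSST algorithm of [7]: it keeps only a single dominant direction on each side (a fixed $l = 1$ instead of a dynamically chosen $l$), finds it by power iteration, and returns one minus the squared cosine of the angle between the two directions, which is not calibrated as a probability. All names are hypothetical.

```python
import math


def top_direction(rows, iters=50):
    """Dominant right singular vector of the matrix whose rows are `rows`,
    approximated by power iteration on H^T H."""
    w = len(rows[0])
    v = [float(k + 1) for k in range(w)]       # fixed start vector
    for _ in range(iters):
        u = [sum(r[k] * v[k] for k in range(w)) for r in rows]   # u = H v
        v_new = [sum(rows[i][k] * u[i] for i in range(len(rows)))
                 for k in range(w)]                               # H^T u
        norm = math.sqrt(sum(c * c for c in v_new))
        if norm == 0.0:
            return None                        # no dominant direction
        v = [c / norm for c in v_new]
    return v


def change_score(x, t, w, n):
    """Simplified change score at index t: 1 - cos^2 of the angle between
    the dominant direction of the n windows of length w before t and that
    of the n windows after t. Requires n + w - 1 points on each side of t."""
    past = [x[t - w - i:t - i] for i in range(n)]
    future = [x[t + i:t + i + w] for i in range(n)]
    a, b = top_direction(past), top_direction(future)
    if a is None or b is None:
        return 0.0
    dot = sum(p * q for p, q in zip(a, b))
    return max(0.0, 1.0 - dot * dot)


flat = [1.0] * 40
switch = [1.0] * 20 + [1.0, -1.0] * 10         # dynamics change at t = 20
print(change_score(flat, 20, w=4, n=3), change_score(switch, 20, w=4, n=3))
```

With the parameter choice described above and, for example, $l_{min} = 40$ and $l_{max} = 100$, one would call this with `w = 20` and `n = 5`.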
V. EVALUATION
To evaluate the effectiveness of the proposed modifications
of Catalano's algorithm in solving the constrained motif
discovery problem, we will compare the performance of the
following algorithms on a wide range of synthetic data with
embedded motifs:
1) Projections [Pro]: A general motif discovery algorithm
introduced in [8] that is the basis of many other
state-of-the-art motif discovery algorithms [8], [1]. This
algorithm requires the specification of the exact motif
length.
2) Catalano's algorithm [Cat]: The original algorithm of
Catalano et al. as specified in [6].
3) MCFull: proposed in Section III-A.
4) MCInc: proposed in Section III-B.
50400 different synthetic time series with controlled embedded
motifs were produced by changing various features
of the time series. Every time series was composed of a
recurring pattern called the background signal, with motifs
embedded at random locations in controllable
numbers. Uniform noise was added to the time series. The
features of the time series that were changed in the tests are:
1) The time series length was changed from $1 \times 10^3$ to $1 \times 10^6$.
2) The relative number of embedded motifs was changed
from 1% to 10% of the length of the time series.
3) The length of the embedded motif was changed from
50 to 100 points.
4) The noise level, defined as the peak value of the white
noise added divided by the peak-to-peak value of the
motifs, was changed from 0% to 20%.
5) The background signal strength, defined as the
peak-to-peak value of the background signal divided by
the peak-to-peak value of the embedded pattern, was
changed between 0% and 40%.
6) The generating process of both the background signal
and the motif.
For Catalano, MCFull, and MCInc, the parameters $l_{min}$ and
$l_{max}$ were fixed at 40 and 100 respectively. For the
Projections algorithm the correct motif size was given to the
algorithm.
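
A test series of the kind described above might be generated as in the following sketch. The paper does not give its generator, so the details here are assumptions: the background is a sine wave with an arbitrary period, motif copies replace the background in non-overlapping slots, and the noise is drawn uniformly from a symmetric interval. The function name and parameters are hypothetical.

```python
import math
import random


def make_series(n, motif, n_occurrences, noise_level, background_strength,
                rng):
    """Return (series, starts): a sine background, `n_occurrences`
    non-overlapping copies of `motif`, and additive uniform noise.

    noise_level and background_strength are fractions of the motif's
    peak-to-peak value, following the definitions in the list above.
    """
    m = len(motif)
    peak_to_peak = max(motif) - min(motif)
    amp = 0.5 * background_strength * peak_to_peak
    x = [amp * math.sin(2 * math.pi * t / 25.0) for t in range(n)]
    # Split the series into slots of length m and fill randomly chosen ones.
    slots = rng.sample(range(n // m), n_occurrences)
    starts = sorted(s * m for s in slots)
    for s in starts:
        x[s:s + m] = motif
    noise_peak = noise_level * peak_to_peak
    x = [v + rng.uniform(-noise_peak, noise_peak) for v in x]
    return x, starts


rng = random.Random(7)
motif = [math.sin(2 * math.pi * k / 50.0) for k in range(50)]
series, starts = make_series(n=1000, motif=motif, n_occurrences=5,
                             noise_level=0.0, background_strength=0.2,
                             rng=rng)
print(starts)
```

Sweeping `n`, `n_occurrences`, `len(motif)`, `noise_level`, and `background_strength` over the ranges listed above would produce a family of test series comparable in spirit to the one used in the evaluation.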
Fig. 1 shows the processing time per point for every one
of the tested algorithms. It is clear that the Projections
algorithm is sub-quadratic although not linear, while the Catalano,
MCFull, and MCInc algorithms are all linear, especially for long time
series. The average speedup obtained by using MCFull was
4.19 times over the Catalano algorithm and 10.76 times over
the Projections algorithm. The average speedup obtained by
MCInc was 4.09 times over the Catalano algorithm and 9.84 times
over the Projections algorithm.
Fig. 1. Processing time per point of the input time series for the four tested
motif discovery algorithms. For MCFull and MCInc the time required to
calculate the constraint using RSST is also included.
To further study the effect of time series length on the
speedup obtained by using the MCFull and MCInc algorithms,
Fig. 2 plots the speedup over the Catalano algorithm
as a function of the time series length. As the figure shows,
MCFull can even be slower than the Catalano algorithm for short
time series, while MCInc shows a four-fold increase in speed
under the same conditions. When the time series length increases,
the advantage of MCInc over MCFull decreases until, for long
time series, MCFull achieves faster processing than MCInc.
Fig. 2. The speedup obtained over Catalano et al.'s original algorithm for
the two proposed algorithms.
VI. CONCLUSION
This paper presented the Constrained Motif Discovery
problem and the first two algorithms to solve it. The two
algorithms (MCFull and MCInc) are modifications of the
recently developed algorithm by Catalano et al. for solving
unconstrained motif discovery problems. The paper also
briefly introduced the RSST algorithm for change point
detection and showed how to use it to convert unconstrained
motif discovery problems into constrained motif discovery
problems that can be solved in linear time. The evaluation
of the proposed algorithms shows a four-fold increase in
speed over the original algorithm on a large set of synthetic
time series. Comparison with the Projections algorithm also
showed a ten-fold increase in speed. The proposed
algorithms were also more robust against noise at high noise
levels, even though the Projections algorithm was the best
algorithm for low noise levels.
The proposed algorithms can be used to solve many real-world
problems in which information about the locations
of motifs can be inferred from other available time series.
For example, in interaction contexts the behaviors of the
interactors are dependent on each other, and the locations
of motifs in one of them can correspond (with some delay)
to the locations of motifs in the other. The ability to use
this kind of knowledge to speed up and increase the accuracy
of motif discovery is a unique feature of the proposed
algorithms.
Directions of future research include applying the proposed
algorithms to real-world data and modifying them
to solve multidimensional constrained motif discovery problems.
Speeding up the RSST algorithm in order to solve
unconstrained motif discovery problems faster is another
direction of future research. Developing other algorithms for solving
the constrained motif discovery problem that provide the
exhaustiveness in finding motif occurrences guaranteed by
PROJECTIONS-based algorithms is a third direction of
future research.
REFERENCES
[1] Bill Chiu, Eamonn Keogh, and Stefano Lonardi. Probabilistic discovery
of time series motifs. In KDD ’03: Proceedings of the ninth ACM
SIGKDD international conference on Knowledge discovery and data
mining, pages 493–498, New York, NY, USA, 2003. ACM.
[2] Tom Oates. Peruse: An unsupervised algorithm for finding recurring
patterns in time series. In International Conference on Data Mining,
pages 330–337, 2002.
[3] Kyle L. Jensen, Mark P. Styczynski, Isidore Rigoutsos, and Gregory N.
Stephanopoulos. A generic motif discovery algorithm for sequential
data. Bioinformatics, 22(1):21–28, 2006.
[4] D. Minnen, T. Starner, I. Essa, and C.L. Isbell. Improving activity
discovery with automatic neighborhood estimation. In Int. Joint Conf.
on Artificial Intelligence, 2007.
[5] E. Keogh, J. Lin, and A. Fu. Hot sax: efficiently finding the most
unusual time series subsequence. Data Mining, Fifth IEEE International
Conference on, pages 8 pp.–, Nov. 2005.
[6] Joe Catalano, Tom Armstrong, and Tim Oates. Discovering patterns in
real-valued time series. In Knowledge Discovery in Databases: PKDD
2006, pages 462–469, 2007.
[7] Yasser Mohammad and Toyoaki Nishida. Change point detection using
robust singular spectrum transform applied to mining human-human
interaction records. In International Conference on Data Mining, 2009.
submitted.
[8] J. Buhler and M. Tompa. Finding motifs using random projections. In
5th International Conference on Computational Biology, pages 69–76,
2001.