Efficient Mining of Partial Periodic Patterns
in Time Series Database
In ICDE 99
Jiawei Han
School of Computing Science
Simon Fraser University
han@cs.sfu.ca
Guozhu Dong
Department of Computer Science and Engineering
Wright State University
gdong@cs.wright.edu
Yiwen Yin
School of Computing Science
Simon Fraser University
yiweny@cs.sfu.ca
Abstract
Partial periodicity search, i.e., search for partial periodic patterns in time-series databases, is an interesting data mining problem. Previous studies on periodicity search mainly consider finding full periodic patterns, where every point in time contributes (precisely or approximately) to the periodicity. However, partial periodicity is very common in practice since it is more likely that only some of the time episodes may exhibit periodic patterns.

We present several algorithms for efficient mining of partial periodic patterns, by exploring some interesting properties related to partial periodicity, such as the Apriori property and the max-subpattern hit set property, and by shared mining of multiple periods. The max-subpattern hit set property is a vital new property which allows us to derive the counts of all frequent patterns from a relatively small subset of patterns existing in the time series. We show that mining partial periodicity needs only two scans over the time series database, even for mining multiple periods. The performance study shows our proposed methods are very efficient in mining long periodic patterns.

Keywords. Periodicity search, partial periodicity, time-series analysis, data mining algorithms.
1. Introduction
Finding periodic patterns in time series databases is an
important data mining task with many applications. Many
(Footnote: Research was supported in part by research grants from the Natural Sciences and Engineering Research Council of Canada and the Networks of Centres of Excellence Program of Canada.)
(Footnote: Part of this work was done while visiting Simon Fraser University during his sabbatical from the University of Melbourne, Australia.)
methods have been developed for searching periodicity patterns
in large data sets [8]. However, most previous methods
on periodicity search are on mining full periodic patterns,
where every point in time contributes (precisely or approxi-
mately) to the cyclic behavior of the time series. For exam-
ple, all the days in the year approximately contribute to the
season cycle of the year. A useful related type of periodic
patterns, called partial periodic patterns, which specify the
behavior of the time series at some but not all points in time,
has not received enough attention. An example partial
periodic pattern may state that Jim reads the Vancouver Sun
newspaper from 7:00 to 7:30 every weekday morning but
his activities at other times do not have much regularity.
Thus, partial periodicity is a looser kind of periodicity than
full periodicity, and it exists ubiquitously in the real world.
The purpose of the current paper is to fill the gap by consid-
ering the efficient mining of partial periodic patterns.
Most methods for finding full periodic patterns are ei-
ther inapplicable to or prohibitively expensive for the min-
ing of partial periodic patterns, because of the mixture of
periodic events and non-periodic events in the same period.
For example, FFT (Fast Fourier Transformation) cannot be
applied to mining partial periodicity because it treats the
time-series as an inseparable flow of values. Some peri-
odicity detection methods can detect some partial periodic
patterns, but only if the period, and the length and timing
of the segment in the partial patterns with specific behavior
are explicitly specified. For the newspaper reading example,
we need to explicitly specify details such as "find the regular
activities of Jim during the half-hour after 7:00 for the
period of 24 hours." A naive adaptation of such methods to
our partial periodic pattern mining problem would be pro-
hibitively expensive, requiring their application to a huge
number of possible combinations of the three parameters of
length, timing, and period.
Besides full periodicity search, there are many recent
studies on time series data mining: Most concentrate on
symbolic patterns, although some consider numerical curve
patterns in time series. Agrawal and Srikant [3] devel-
oped an Apriori-like technique [2] for mining sequential
patterns. Mannila et al. [10] consider frequent episodes in
sequences, where episodes are essentially acyclic graphs
of events whose edges specify the temporal before-and-after
relationship but without timing-interval restrictions.
Inter-transaction association rules proposed by Lu et al. [9]
are implication rules whose two sides are totally-ordered
episodes with timing-interval restrictions (on the events in
the episodes and on the two sides). Bettini et al. [5] con-
sider a generalization of inter-transaction association rules:
these are essentially rules whose left-hand and right-hand
sides are episodes with time-interval restrictions. However,
unlike ours, periodicity is not considered in these studies.
Similar to our problem, the mining of cyclic association
rules by Özden et al. [12] also considers the mining of
some patterns over a range of possible periods. Observe that
cyclic association rules are partial periodic patterns with
perfect periodicity, in the sense that each pattern reoccurs in
every cycle, with 100% confidence. The perfectness in
periodicity leads to a key idea used in designing efficient cyclic
association rule mining algorithms: As soon as it is known
that an association rule r does not hold at a particular
instant of time, we can infer that r cannot have periods which
include this time instant. For example, if the maximum
period of interest is p_max and it is discovered that r does not
hold in the first p_max time instants, then r cannot have any
periods. This idea leads to the useful "cycle-elimination"
strategy explored in that paper. Since real life patterns are
usually imperfect, our goal is not to mine perfect periodicity
and thus “cycle-elimination” based optimization will not be
considered here.
An Apriori-like algorithm has been proposed for mining
imperfect partial periodic patterns with a given (single) pe-
riod in a recent study by two of the current authors [7]. It
is an interesting algorithm for mining imperfect partial pe-
riodicity. However, with a detailed examination of the data
characteristics of partial periodicity, we found that Apriori
pruningin mining partial periodicity may not be as effective
as in mining association rules.
Our study has revealed the following new characteristics
of partial periodic patterns in time series: The Apriori-like
property among partial periodic patterns still holds for any
fixed period, but it does not hold for patterns between dif-
ferent periods. Furthermore, there is a strong correlation
among frequencies of partial patterns.
(Footnote: It is important to point out that [12] concentrates on the elimination of candidate itemsets for the association rule mining algorithm, although the cycle-elimination strategy does lead to a small reduction in the number of patterns when we process the time series from left to right.)
(Footnote: Note that a modified strategy, where we stop considering certain patterns as soon as the length of the time series remaining to be processed is not enough to make the confidence higher than the threshold, can be used.)
The main contributions of this paper are as follows. We
consider the efficient mining of partial periodic patterns, for
a single period as well as for a set of periods. We propose
several mining algorithms, by exploring some interesting
properties related to partial periodicity such as the Apri-
ori property and the max-subpattern hit set property, and by
shared mining of multiple periods. The max-subpattern hit
set property is a vital new property which allows us to derive
the counts of all frequent patterns from a relatively small
subset of patterns mined from the time series. We show
that mining partial periodicity needs only two scansoverthe
time series database, even for mining multiple periods. The
performance study shows our proposed methods are very
efficient. The proposed methods are also robust in that they
can be applied in a variety of cases, including mining multiple-
level partial periodicity and mining partial periodicity with
perturbation and evolution.
The remainder of the paper is organized as follows. In
Section 2, concepts related to partial periodicity are intro-
duced. In Section 3, methods for mining partial periodicity
in regard to both single and multiple periods are studied.
In Section 4, the implementation of a novel data structure,
namely the max-subpattern tree, for facilitating the count-
ing of the hit maximal patterns, and the derivation of the set
of frequent patterns from the hit maximal patterns, are pre-
sented. In Section 5, a comparison of the performance of
the proposed algorithms is reported. We conclude our study
in Section 6.
2 Problem Definition
Assume that a sequence of n timestamped datasets have
been collected in a database. For each time instant i, let D_i
be the set of features derived from the dataset collected at the
instant. Thus, the time series of features is represented as
D_1, D_2, ..., D_n.
Let L be the underlying set of features. We will also use
the "don't care" letter *, which can match any single set
of features. We define a pattern s = s_1 s_2 ... s_p as a non-empty
sequence over (2^L − {∅}) ∪ {*}. We will use |s| = p
to denote the length of s, and will say that p is the period
of the pattern s. Let the L-length of s be the number of
positions s_i which contain letters from L. A pattern with
L-length i is also called an i-pattern. Moreover, a subpattern
of a pattern s = s_1 s_2 ... s_p is a pattern s' = s'_1 s'_2 ... s'_p
such that s and s' have the same length, and s'_i ⊆ s_i for
every position i where s'_i ≠ *. For example, a pattern with
non-* letters at exactly four positions is of L-length 4 (i.e., it
is a 4-pattern), and replacing any of its non-* letters by * (or
by a nonempty subset) yields one of its subpatterns.
(Footnote: If s_i is a singleton we will omit the brackets, e.g., we write {a} as a.)
The frequency count and confidence of a pattern s in a
time series S are defined as
frequency_count(s) = |{ i | 0 ≤ i < m, and the string s is true in D_{ip+1} ... D_{ip+p} }|,
and
confidence(s) = frequency_count(s) / m,
where m is the maximum number of periods of length p
contained in the time series (i.e., m is the positive integer
such that mp ≤ n < (m+1)p). Each segment of the
form D_{ip+1} ... D_{ip+p}, where 0 ≤ i < m, is called a
period segment. We say a pattern s = s_1 s_2 ... s_p is true in
the period segment, or the period segment matches s, if, for
each position j, either s_j is * or all the letters in s_j occur
in the j-th set of features in the segment. Thus, if s' is a
subpattern of s, then the set of period segments that can match
s is a subset of the period segments that can match s'.
Example 2.1 For example, a{b,c}* is a pattern of period 3;
its frequency count in the feature series
{a}{b,c}{d} {a}{b,c}{e} {f}{g}{h} is
2; and its confidence is 2/3, where 3 is the maximum
number of periods of length 3. The confidence of a** in
this series is also 2/3.
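The definitions above can be sketched directly in code. This is an illustrative reconstruction, not the paper's implementation: a feature series is a list of sets, a pattern is a list whose entries are either None (the "don't care" letter *) or a set of features, and the series and function names are ours.

```python
def matches(pattern, segment):
    """True if every non-* position's letters all occur in the segment."""
    return all(pos is None or pos <= seg
               for pos, seg in zip(pattern, segment))

def confidence(pattern, series):
    """Frequency count and confidence of a pattern of period p = len(pattern)."""
    p = len(pattern)
    m = len(series) // p          # maximum number of periods of length p
    count = sum(1 for i in range(m)
                if matches(pattern, series[i * p:(i + 1) * p]))
    return count, count / m

# A small hypothetical series of feature sets, period 3.
series = [{'a'}, {'b'}, {'c'},
          {'a'}, {'d'}, {'e'},
          {'a'}, {'b'}, {'f'}]
count, conf = confidence([{'a'}, {'b'}, None], series)
# The pattern is true in the first and third period segments.
```

The subpattern containment noted above falls out of `matches`: relaxing a position to None can only enlarge the set of matching segments.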
Similar to mining association rules [2], we say that a
pattern s is a frequent partial periodic pattern in a time
series if its confidence is larger than or equal to a threshold,
min_conf. The mining of frequent partial periodic patterns
in a time series is to discover, possibly with some restrictions,
all the frequent patterns of the series for one period or
a range of specified periods. More specifically, the input to
the mining includes:
A time series S.
A single specified period; or a range of periods specified by
two integers p_low and p_up.
An integer K indicating that the ratio of the length of S
to the length of the patterns must be at least K. This will ensure that
the patterns mined would be of value to the application at
hand.
Remark: Sometimes the derivation of the feature series
from the original data series is quite involved, and the inter-
action of the periodic patterns with the derivation of features
may lead to improved performance. Hence it is worthwhile
to combine the mining of the features from the datasets with
the mining of the patterns, as is the case for the mining of
cyclic association rules [12]. For our work on the mining of
frequent partial periodic patterns though, this interaction is
not useful for achieving computational advantage and thus
we will assume that we are dealing with the feature time
series in our study.
3 Methods for mining partial periodicity in
time series
In this section, we explore methods for mining partial
periodicity in a time series, proceeding from mining par-
tial periodicity for a single given period to mining partial
periodicity for a specified range of periods (i.e., multiple
periods).
3.1 Mining partial periodicity for a single period
3.1.1 Single-period Apriori method
A popular key idea used in the efficient mining of association
rules is the Apriori property discovered in [2]: If one
subset of an itemset is not frequent, then the itemset itself
cannot be frequent. This allows us to use frequent itemsets
of size i as filters for candidate itemsets of size i + 1.
Interestingly, for each period p, the property supporting
the Apriori "trick" still holds:
Property 3.1 [Apriori on periodicity] Each subpattern of
a frequent pattern of period p is itself a frequent pattern of
period p.
The proof is based on the fact that patterns are more restrictive
than their subpatterns. Suppose s' is a subpattern of a
frequent pattern s. Then s' is obtained from s by changing
the letter sets at some positions to subsets or to *. Hence s is more
restrictive than s', and thus the frequency count of s' is greater than
or equal to that of s. Thus s' is frequent as well.
An algorithm for mining partial periodic patterns for a
given fixed period based on this Apriori “trick” was pre-
sented in [7]. We include a simplified version here for the
sake of completeness.
Algorithm 3.1 [Single-period Apriori] Find all partial
periodic patterns for a given period p satisfying a given
confidence threshold min_conf in time-series S, based on
Apriori Property 3.1.
Method.
1. Find F_1, the set of frequent 1-patterns of period p, by
accumulating the frequency count for each 1-pattern in each
whole period segment and selecting those whose
frequency count is no less than min_conf × m, where m
is the maximum number of periods.
2. Find all frequent i-patterns of period p, for i from 2 up
to p, based on the idea of Apriori, and terminate immediately
when the candidate frequent i-pattern set is empty.
Analysis.
Number of scans over the time series. Step 1 of the
algorithm needs to scan the time series S once. Step 2 needs
to scan S up to p − 1 more times in the worst case. Thus the total
number of scans is no more than the period p.
Space needed. (1) At Step 1, suppose there exist a total of
l_j distinct features at position j of the period segments in S,
for 1 ≤ j ≤ p. We need l_1 + ... + l_p units of space to hold
the counts. In the worst case, when every feature is distinct
in the entire time series S, we need n units of space. After Step 1, we
only need |F_1| units of space to keep F_1, the set of frequent
1-patterns in S. (2) At Step 2, the maximum number of
candidate subpatterns that we may generate is 2^{|F_1|}. Considering
that we still need space to keep F_1, the set of frequent 1-patterns,
the total amount of space needed is 2^{|F_1|} + |F_1| in the
worst case in this computation. However, the average case
should be much smaller than the worst case, since if every
feature is distinct in the time series, then there is no need to
find periodic patterns. The existence of any periodicity in
the time series will reduce the memory needed.
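Algorithm 3.1 can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's implementation: a pattern is represented as a frozenset of (position, feature) pairs with one letter per position, the "scans" are simulated by indexing, and all names are ours.

```python
from itertools import combinations

def mine_single_period(series, p, min_conf):
    """Single-period Apriori sketch: returns {pattern: frequency_count}."""
    m = len(series) // p
    threshold = min_conf * m

    def freq(pattern):
        return sum(1 for i in range(m)
                   if all(f in series[i * p + pos] for pos, f in pattern))

    # Step 1: find F1, the frequent 1-patterns (one scan in spirit).
    letters = {(pos, f) for i in range(m) for pos in range(p)
               for f in series[i * p + pos]}
    level = {frozenset([l]) for l in letters
             if freq(frozenset([l])) >= threshold}
    singles = {l for pat in level for l in pat}
    result = {}
    # Step 2: grow i-patterns Apriori-style until no candidate survives.
    while level:
        result.update((pat, freq(pat)) for pat in level)
        k = len(next(iter(level))) + 1
        candidates = set()
        for pat in level:
            for (pos, f) in singles:
                if all(pos != q for q, _ in pat):     # one letter per position
                    cand = pat | {(pos, f)}
                    # Apriori pruning: every (k-1)-subset must be frequent.
                    if all(frozenset(sub) in result
                           for sub in combinations(cand, k - 1)):
                        candidates.add(cand)
        level = {c for c in candidates if freq(c) >= threshold}
    return result
```

Note that each pass over `level` stands in for one scan of the series, which is where the up-to-p scans of the analysis come from.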
3.1.2 Single-period max-subpattern hit set method
Although the Apriori trick may reduce the search space in
partial periodicity mining in a similar way as association
rule mining, it is important to note that the data characteris-
tics in the two cases are very different. In mining association
rules, the number of frequent i-itemsets shrinks quickly
as i increases, because of the sparsity of frequent i-itemsets
in a large transaction database. However, in mining partial
periodicity, very often the number of frequent i-patterns
shrinks slowly (for i ≥ 2) as i increases. The slow speed
of decrease in the number of frequent i-patterns is due to a
strong correlation between the frequencies of patterns and their
subpatterns. We now illustrate this point.
Example 3.1 Suppose we have two frequent 1-patterns, a***
and *b**, such that conf(a***) = 0.9 and conf(*b**) = 0.9,
in a time-series S. Then it must be the case that
0.8 ≤ conf(ab**) ≤ 0.9, as explained below. Since all period
segments that match ab** match both a*** and *b**,
conf(ab**) ≤ 0.9 holds. To derive the other inequality, let ¬a denote
the predicate that a letter is not a, and similarly ¬b. The confidence of ¬a*** in
S is at most 0.1, because conf(a***) + conf(¬a***) = 1. Similarly,
conf(*¬b**) ≤ 0.1. Since conf(ab**) ≥ 1 − conf(¬a***) − conf(*¬b**),
it follows that conf(ab**) ≥ 0.8.
The slow reduction of the set of candidate frequent i-patterns
as i grows makes the Apriori pruning of Algorithm
3.1 less attractive. Is there a better way?
(Footnote: A unit of space is the space needed to hold a feature identifier and its associated count; its size is usually 2-8 bytes, depending on the implementation.)
(Footnote: This is equal to the total space that the time series occupies.)
Obviously, the derivation of the frequent 1-patterns is still
an effective way to dramatically reduce the candidate set
to be examined later, because there are usually only a small
number of features being frequent at a particular position,
though there could be a large number of features appearing at
that position. This is especially true when the average number
of features per position is large. Thus
our discussion will be focused on how to reduce the search
effort after the set of frequent 1-patterns, F_1, is found.
Our key idea is based on the notions of max-patterns and
hit patterns, defined next.
A candidate (frequent) max-pattern, C_max, is the
maximal pattern which can be generated from F_1, the set of
frequent 1-patterns. For example, if the frequent 1-pattern
set is {a****, *b1***, ***d*}, the candidate
max-pattern is ab1*d*. Notice that a position in the candidate
max-pattern may be allowed to have a disjunction of more
than one non-* letter. For example, if the frequent 1-pattern
set is {a****, *b1***, *b2***, ***d*}, the
candidate max-pattern is a{b1,b2}*d*.
Let the L-length of the candidate max-pattern C_max be d.
A subpattern of C_max is hit in a period segment D of S if it
is the maximal subpattern of C_max true in D. For example,
for C_max = a{b1,b2}*d*, the hit subpattern for
the period segment D = {a}{b1}{e}{f}{g} is ab1***,
because ab1*** is true in D and none of its superpatterns,
ab1*d*, a{b1,b2}***, and a{b1,b2}*d*, is true in D. The
hit set, H, of a time series S is the set of all hit subpatterns
of C_max in S.
The usefulness of hit max-patterns is: We can derive the
complete set of partial periodic patterns from the frequency
counts of all the hit maximal subpatterns of C_max. This will
be detailed below.
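Forming the candidate max-pattern from F_1 and finding the hit subpattern of a segment are both simple set operations. The sketch below is illustrative (the representation and names are ours, with a pattern as a list of feature sets and the empty set playing the role of *):

```python
def candidate_max_pattern(f1, p):
    """Union the frequent 1-patterns position by position into C_max."""
    cmax = [set() for _ in range(p)]
    for pos, feature in f1:
        cmax[pos].add(feature)
    return cmax

def hit_subpattern(cmax, segment):
    """The maximal subpattern of C_max that is true in this period segment:
    at each position, keep exactly the C_max letters the segment contains."""
    return [letters & seg for letters, seg in zip(cmax, segment)]

# F1 as in the running example: C_max = a{b1,b2}*d* with period 5.
f1 = [(0, 'a'), (1, 'b1'), (1, 'b2'), (3, 'd')]
cmax = candidate_max_pattern(f1, 5)
hit = hit_subpattern(cmax, [{'a'}, {'b1'}, {'e'}, {'f'}, {'g'}])
# hit corresponds to ab1***: b2 and d are absent from the segment.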
We would like to give an estimate of the buffer size
needed in computation based on the idea of hit patterns.
One upper bound of the buffer size is estimated in terms
of m, the total number of periods in S. The size |H| of
the hit set in a time series S should be no bigger than m,
i.e., |H| ≤ m. This is obvious since each period segment
can generate at most one hit subpattern, even though a hit
subpattern may be hit in more than one period segment. The other
upper bound of the buffer size is estimated in terms of the
maximal number of patterns that can be generated from F_1,
the set of frequent 1-patterns. Since each hit pattern of S
is a subpattern of C_max, which is generated from F_1, and similar
to the analysis performed for Algorithm 3.1, the size of
the set of subpatterns which can be generated from F_1 is
at most 2^{|F_1|}.
Therefore |H|, the size of the hit set in a time series S,
should be no bigger than 2^{|F_1|}. Combining both upper
bounds, we have
Property 3.2 [The bound of hit set] The size of the hit
set is bounded by the formula |H| ≤ min(m, 2^{|F_1|}),
where m is the total number of periods in S, and F_1 is the
set of frequent 1-patterns.
Using this formula, we can calculate the bound of the
maximal buffer size needed in the processing: Given the set
of frequent 1-patterns, F_1, the maximal (additional) buffer
size needed for registering the counts of all the maximal
subpatterns of C_max is min(m, 2^{|F_1|}).
This property is very useful in practice. For example, if
we found 500 frequent 1-patterns when calculating yearly
periodic patterns for 100 years, the buffer size needed is
at most 100; on the other hand, if we found 8 frequent
1-patterns when calculating weekly periodic patterns for 100
years, the buffer size needed is at most 2^8 = 256.
We can always select the smaller of the two bounds in estimating
the maximal buffer size needed in computation.
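As a quick check, the bound of Property 3.2 as reconstructed here, min(m, 2^|F1|), can be evaluated for the two numeric examples above (the function name is ours):

```python
def hit_set_bound(m, f1_size):
    """Upper bound on the hit set size: each of the m period segments
    yields at most one hit subpattern, and every hit subpattern is one
    of at most 2^|F1| combinations of the frequent 1-patterns."""
    return min(m, 2 ** f1_size)

yearly = hit_set_bound(100, 500)      # 100 yearly periods, 500 1-patterns
weekly = hit_set_bound(100 * 52, 8)   # ~5200 weekly periods, 8 1-patterns
```

In the first case the period-count bound m = 100 dominates; in the second, the 2^8 = 256 pattern-count bound does.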
Before turning to our hit-set based algorithm, we examine
the probability distributions of maximal subpatterns of C_max.
Heuristic 3.1 [Popularity of longer subpatterns] The
probability distribution of the maximal subpatterns of C_max
is usually denser for longer subpatterns (i.e., those with L-length
closer to that of C_max) than for the shorter ones.
This heuristic can be observed in Example 3.1. From the
example, conf(ab**) ≥ 0.8, while each of a*** and *b** alone can be
the maximal hit subpattern with probability at most 0.1. In most
cases, the existence of a short max-subpattern indicates
the nonexistence of some non-* letter, which reduces the
chance for the corresponding non-* letter patterns to reach
high confidence. Thus we have the heuristic.
This heuristic implies that the number of nodes in
the tree data structure of the next section is usually small.
It is also useful for efficient buffer management: In order
to reduce the overall cost of access, the longer subpatterns
should be arranged to be more easily accessible (such as kept
in main memory) than the shorter ones.
We now present a main algorithm for mining partial pe-
riodic patterns for a given period, which is based on the
discussions above.
Algorithm 3.2 [Max-subpattern hit-set] Find all the partial
periodic patterns for a given period p in a time-series S,
based on the max-subpattern hit-set, for a given min_conf
threshold.
Method.
1. Scan S once to find F_1, the set of frequent 1-patterns
of period p, using Step 1 of Algorithm 3.1. Form the
candidate max-pattern, C_max, from F_1.
2. Scan S once more. During the scan, for each period
segment, if its hit subpattern is nonempty, do the following: add the
max-subpattern into the hit set buffer (with the associated
count initialized to 1) if it is not already there; otherwise,
increase the count of the max-subpattern by one. The hit
set buffer is implemented in the form of a max-subpattern
tree, a novel data structure to be discussed in Section 4.
3. After the scan, derive the frequent patterns from the hit
set. We will discuss how to implement the finding of the
counts of the hit patterns and how to use these counts to
derive the frequent patterns in Section 4. It turns out that
both can be done efficiently.
Analysis.
Number of scans over the time series. The first step of
the algorithm needs to scan S once. The second step needs
to scan S one more time. Thus the total number of time-series
scans is 2, independent of the period p.
Space needed. (1) The space needed for Step 1 is the
same as in Algorithm 3.1. After Step 1, we need |F_1| units of
space to keep F_1, the set of frequent 1-patterns in S. (2) At
the second step, according to Property 3.2, the total space needed for
the hit set is at most min(m, 2^{|F_1|}), where m is the
total number of periods in S.
In comparison with Algorithm 3.1, Algorithm 3.2 reduces
the total number of scans of the time series from p
(the length of the period) to 2, and it also uses much less
buffer space in the computation in most cases. This can
also be seen from the following observation: Suppose the
hit subpattern for a period segment is a{b1,b2}*d*, which is not
in the hit set yet. We need only one unit of space to register
the string and its count 1. However, for the Apriori
technique, the candidate 2-patterns to be generated will be
ab1***, ab2***, a**d*, *{b1,b2}***, *b1*d*, and *b2*d*; the 3-patterns to
be generated will be a{b1,b2}***, ab1*d*, ab2*d*, and *{b1,b2}*d*; and the
4-pattern will be a{b1,b2}*d* itself; plus we have to update the count
associated with each of them. Thus, it is expected that the
max-subpattern hit set method may have better performance
in most cases. We will compare the performance of the two
algorithms in Section 5.
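The two-scan scheme of Algorithm 3.2 can be sketched compactly with a plain dictionary standing in for the max-subpattern tree of Section 4. This is an illustrative reconstruction (representation and names are ours): a pattern is a tuple of frozensets, with frozenset() playing the role of *.

```python
from itertools import combinations

def mine_hit_set(series, p, min_conf, f1):
    """f1: frequent 1-patterns as (position, feature) pairs (from scan 1).
    Returns {pattern: frequency_count} for all frequent patterns."""
    m = len(series) // p
    cmax = [set() for _ in range(p)]
    for pos, feature in f1:                 # form the candidate max-pattern
        cmax[pos].add(feature)

    # Scan 2: count the maximal hit subpattern of each period segment.
    hits = {}
    for i in range(m):
        segment = series[i * p:(i + 1) * p]
        hit = tuple(frozenset(c & s) for c, s in zip(cmax, segment))
        if any(hit):                        # nonempty hit subpattern
            hits[hit] = hits.get(hit, 0) + 1

    def count(pattern):
        # A pattern's count is the sum over hit subpatterns containing it.
        return sum(n for w, n in hits.items()
                   if all(q <= wq for q, wq in zip(pattern, w)))

    # Derive all frequent patterns from the (small) hit set.
    letters = [(pos, f) for pos in range(p) for f in sorted(cmax[pos])]
    frequent = {}
    for r in range(1, len(letters) + 1):
        for combo in combinations(letters, r):
            pat = [set() for _ in range(p)]
            for pos, f in combo:
                pat[pos].add(f)
            pat = tuple(frozenset(x) for x in pat)
            c = count(pat)
            if c >= min_conf * m:
                frequent[pat] = c
    return frequent
```

The derivation step enumerates subpatterns of C_max naively here; Section 4's max-subpattern tree makes the same derivation efficient.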
3.2 Mining partial periodicity with multiple peri-
ods
Mining partial periodicity for a given period covers a
good set of applications since people often like to mine periodic
patterns for natural periods, such as annually, quarterly,
monthly, weekly, daily, or hourly. However, certain patterns
may appear at some unexpected periods, such as every 11
years, or every 14 hours. It is interesting to provide facilities
to mine periodicity for a range of periods.
To extend partial periodicity mining from one period to
multiple periods, one might wish to extend the idea of Apriori
to computing partial periodicity among different periods,
that is, to use the patterns of small periods p as filters
for candidate patterns of periods of the form kp for
an integer k > 1. This would work if every frequent pattern
of period kp were built from frequent patterns of period p. Unfortunately,
this is not the case. For example, in the time series
obtained by repeating the four feature sets {a}{b}{c}{b},
the pattern *b of period 2 has confidence 1, while
a* and c* each have confidence 1/2. Suppose the confidence
threshold is 0.8. If we use the frequent partial periodic patterns
of period 2 as a filter for candidate partial periodic patterns
of period 4, we will miss the partial periodic pattern
a*c*, whose confidence is 1.
Given that we cannot extend the Apriori “trick” to mul-
tiple periods, one obvious way to mine partial periodic pat-
terns for a range of periods is to repeatedly apply the single-
period algorithm for each period in the range.
Algorithm 3.3 [Looping over single period computation]
Find all the partial periodic patterns for a set of periods
in a given range of interest, p_low ≤ p ≤ p_up, in the time-series S,
with the given min_conf threshold.
Method.
1. For each period p in the range of interest (i.e.,
p_low ≤ p ≤ p_up), apply Algorithm 3.2 ("max-subpattern hit-set")
on period p.
Analysis.
Number of scans over the time series. Since each period
takes 2 scans of the time series, the total number of scans
of the time series is 2(p_up − p_low + 1).
Space needed. For computing partial periodicity for periods
from p_low to p_up, the space required is basically the sum of the
space for each p. Notice that the space required for the initial
Step 1 computation is still at most n units in the worst case, since
the space once used in the computation for a period p can be
reinitialized and reused for computing other periods. But
we need in total Σ_p |F_1^p| units of space to keep the different
sets of frequent 1-patterns, where F_1^p is the set of
frequent 1-patterns in S derived for period p. Similarly, it
takes at most Σ_p min(m_p, 2^{|F_1^p|}) units of space to
compute all of them, where m_p is the total number of periods in
S for period p.
Algorithm 3.3 provides an iterative method for mining
partial periodicity for multiple periods. However, when the
number of periods is large, we still need a good number
of scans to mine periodicity for multiple periods. An
improvement to the above method is to maximally explore the
mining of periodicity for multiple periods in the same scan,
which leads to the shared mining of periodicity for multiple
periods, as illustrated below.
Algorithm 3.4 [Shared mining of multiple periods]
Shared mining of all the partial periodic patterns for a set
of periods in a given range of interest, p_low ≤ p ≤ p_up, in time-series
S, with the given min_conf threshold.
Method.
1. Scan S once; for all periods p in the range of interest, do
the same as Step 1 in Algorithm 3.2.
That is, for all periods p in the range of interest (i.e.,
p_low ≤ p ≤ p_up), find F_1^p, the set of frequent 1-patterns of
period p, using the same Step 1 as in Algorithm 3.1.
For each set F_1^p of frequent 1-patterns of period p, form the
candidate max-pattern, C_max^p, from F_1^p.
2. Scan S once more; for all periods p in the range of interest, do
the same as Step 2 in Algorithm 3.2, a similar process which
will not be explained in detail.
Analysis.
Number of scans over the time series. The first step of
the algorithm needs to scan S once. The second step needs
to scan S one more time. Thus the total number of time-series
scans is 2, independent of the periods.
Space needed. The total space required in the worst case
is the same as in Algorithm 3.3.
Algorithm 3.4 explores shared processing in mining partial
periodicity for multiple periods. The advantage of the
method is that we only need two scans of the time series for
mining partial periodicity for multiple periods. The overhead
of the method is that although it reduces the number
of scans to 2, it requires more space in the processing
of each scan than the multiple-scan method, because it
needs to register the corresponding counts for each period
p (for p_low ≤ p ≤ p_up). However, since the shared features will
share the space as well (with counts incremented), and there
should be many shared features in periodicity search (otherwise,
why mine periodicity?), the space required will
hardly approach the worst case. Therefore, it should still be
an efficient method in many cases for mining partial periodicity
with multiple periods.
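The shared first scan of Algorithm 3.4 can be sketched as follows; the per-period buffers keyed by p and the function name are our illustration, not the paper's code.

```python
def shared_first_scan(series, periods):
    """One pass over the series updates the 1-pattern counts for every
    period p in `periods` at once: counts[p] maps (position, feature)
    to its frequency count over whole period segments."""
    counts = {p: {} for p in periods}
    for t, features in enumerate(series):
        for p in periods:
            if t < (len(series) // p) * p:   # skip the trailing partial period
                pos = t % p                  # position of instant t within period p
                for f in features:
                    key = (pos, f)
                    counts[p][key] = counts[p].get(key, 0) + 1
    return counts
```

Thresholding each counts[p] by min_conf × ⌊n/p⌋ then yields F_1^p for every period in the range after a single scan.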
4 Derivation of all partial patterns
In this section, we examine the implementation consid-
erations of our proposed algorithms. Algorithm 3.1 is an
Apriori-like algorithm which can be implemented similarly
to other Apriori-like algorithms for mining association rules
(e.g., [2]). Algorithm 3.2 forms the basis for all three
remaining algorithms and requires new tricks to achieve
efficiency, and thus our discussion is focused on its efficient
implementation.
Algorithm 3.2 consists of two steps: Step 1, scan the
time series once and find the frequent 1-pattern set F_1; and
Step 2, scan the time series one more time, collect the set
of the max-subpatterns hit in S, and derive the set of frequent
patterns. The implementation of Step 1 is straightforward
and has been discussed in the presentation of Algorithm 3.1.
However, Step 2 is nontrivial and needs a
good data structure to facilitate the storage of the set of max-subpatterns
hit in S and the derivation of the set of frequent
patterns.
A new data structure, called max-subpattern tree, is de-
signed to facilitate the registration of the hit count of each
max-subpattern and derivation of the set of frequent pat-
terns, as illustrated in Figure 1. Its design is now outlined.
The max-subpattern tree takes the candidate max-pattern
C_max as the root node, where each subpattern of C_max with
one non-* letter missing is a direct child node of the root.
The tree expands recursively, according to the following
rules. A node w, if containing more than 2 non-* letters,
may have a set of children, each of which is a subpattern of
w with one more non-* letter missing. Notice that a node
containing only 2 non-* letters will not have any children,
since every frequent 1-pattern is already counted in F_1. Importantly,
we do not create a node unless the node itself or some descendant
containing more than 1 non-* letter is hit in S.
Each node has a "count" field (which registers the number
of hits of the current node), a parent link (which is nil for
the root), and a set of child links; each child link points to a
child and is labeled by the corresponding missing letter.
A link can be nil when the corresponding child has not been
hit.
Notice that a non-* position of a max-subpattern
in a max-subpattern tree may contain a set of letters, which
matches the set of letters at that position in a period segment.
For example, for C_max = a{b1,b2}*d*, the max-subpattern
of the period segment {a,e}{b1,b2,f}{g}{h}{i} is a{b1,b2}***,
and the segment will contribute one count to this node.
The update of the max-subpattern tree is performed as
follows.

Algorithm 4.1 [Insertion in the max-subpattern tree]
Insert a max-subpattern found during the scan of the time series into
the max-subpattern tree T.

Method.

1. Starting from the root of the tree, find the corresponding
node by checking the missing non-* letters in order.
For example, for a max-subpattern *b1*d* in a tree with
the root a{b1,b2}*d*, there are two letters, a and b2,
missing. The node can be found by (1) following
the ~a link (marked as "~a" in Figure 1) to *{b1,b2}*d*,
and then (2) following the ~b2 link to *b1*d*, as shown
in Figure 1.
2. If the node is found, increase its count by 1. Otherwise,
create a new node (with count 1) and its missing
ancestor nodes (only those on the path to it, with count
0), if any, and insert it (or them) into the corresponding
place(s) of the tree; we show such a node using a dotted box in Figure 1.

For example, if the very first max-subpattern hit in the series
is *b1*d*, we will create the node *b1*d* (with count 1), after creating two
ancestor nodes (with count 0): a{b1,b2}*d* (which
is the root of the tree), and *{b1,b2}*d* (which is
the root's child, following the ~a link). The node *b1*d* is then
*{b1,b2}*d*'s child, following the ~b2 link.

[Figure 1 here: a max-subpattern tree with root a{b1,b2}*d*; level-1 nodes
*{b1,b2}*d* (via ~a), ab2*d* (via ~b1), ab1*d* (via ~b2), and a{b1,b2}*** (via ~d);
level-2 nodes *b1*d*, *b2*d*, a**d*, *{b1,b2}***, ab1***, and ab2***;
node counts include 10, 0, 8, 18, 5, 19, 0, 2, 32, 40, and 50.]

Figure 1. A max-subpattern tree to store the set of
max-subpatterns hit in the time-series.
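The insertion procedure above can be sketched in a few lines; the Node class, the dictionary of ~letter links, and the example calls are our own illustration, assuming a node is addressed by the ordered list of its missing non-* letters:

```python
# Sketch of Algorithm 4.1: insertion into the max-subpattern tree.
# A node is reached from the root by following, in position order, one
# "~letter" link per missing non-* letter; absent ancestors get count 0.

class Node:
    def __init__(self):
        self.count = 0
        self.children = {}        # missing letter -> child Node

def insert(root, missing):
    """Walk the links named by the missing letters, creating absent
    nodes along the path with count 0, then count one hit at the end."""
    node = root
    for letter in missing:
        node = node.children.setdefault(letter, Node())
    node.count += 1

root = Node()
insert(root, ['a', 'b2'])    # first hit of *b1*d* (missing a, then b2)
insert(root, ['a', 'b2'])    # second hit of the same node
insert(root, ['b1'])         # a hit of ab2*d* (missing only b1)
print(root.children['a'].count)                  # ancestor created with count 0
print(root.children['a'].children['b2'].count)   # the node itself, count 2
```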
Analysis. Let the total number of non-* letters in C_max be m. For
a max-subpattern containing k (k <= m) non-* letters,
we need to follow m - k links to find the node and create
at most m - k new nodes in the worst case. Therefore,
the time complexity of node search and node creation
is less than O(m). Also, since each insertion of a max-subpattern
creates either no new node (when it hits an existing one) or
fewer than m nodes, the total number of nodes in the tree
is less than m × n, where n is the size of the hit set.

In general, to insert a subpattern we need to both locate
its position and update the count of the node if the node is
found, or otherwise insert one or several new nodes.
Example 4.1 Let Figure 1 be the current max-subpattern
tree T. To insert a (max-)subpattern *b1*d* into the tree, we
search the tree starting with the root, a{b1,b2}*d*.
The first non-* letter missing is a and the second non-*
letter missing is b2. Thus we first follow the ~a branch to
node *{b1,b2}*d*, and then follow the ~b2 branch. Since the node
*b1*d* is located, its count is incremented by 1.
Before discussing the derivation of the set of frequent
patterns, we need to introduce the concept of reachable ancestors.
Since the traversal and creation of the children of a
node in the max-subpattern tree follow the non-* letter position
order, some of the ancestor nodes of a node may not
be directly linked to it. For example, in Figure 1, the
node *b1*d* is linked to only one parent, *{b1,b2}*d*, but not
the other, ab1*d* (note: this missing link is marked by a
dashed line in the figure).
In general, the set of reachable ancestors of a node w
in a max-subpattern tree T is the set of all the nodes in T
which are proper superpatterns of w. It can be computed as
follows: (1) derive the list L of missing letters of w based
on C_max, which is roughly the position-wise difference; (2)
the set of linked ancestors consists of those patterns whose
missing letters form a proper prefix of L; and (3) the set
of not-linked ancestors consists of those patterns whose missing
letters form a proper sublist (but not a prefix) of L.
Example 4.2 We compute the set of reachable ancestors
for a node *b1*** in a max-subpattern tree with root
a{b1,b2}*d*. The list of missing non-* letters
is (a, b2, d). Thus, the set of linked ancestors is (1) a{b1,b2}*d* (missing
nothing, which is the root); (2) *{b1,b2}*d* (i.e., missing a); and (3) *b1*d* (i.e., missing a, then
missing b2). The set of not-linked ancestors is: ab1*d* (corresponding to the missing-letter
pattern (b2)), a{b1,b2}*** (corresponding to (d)), *{b1,b2}***
(corresponding to (a, d)), and ab1*** (corresponding to (b2, d)).

In other words, one can follow the links whose marks are
missing letters of the node, in order (to avoid visiting the same node more than
once), and collect all the superpattern nodes reached in T.
Essentially there is a tree traversal for each fixed pattern,
except that we do not visit a node and its descendants if the
node is not an ancestor pattern of our current pattern.
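Under the same addressing scheme (a node named by its ordered missing-letter list relative to C_max), the linked and not-linked ancestors can be enumerated as the proper prefixes and the remaining proper sublists of that list; this is an illustrative sketch, not the paper's code:

```python
# Sketch: enumerate the reachable ancestors of a node named by its ordered
# missing-letter list. Proper prefixes are the linked ancestors; the other
# proper sublists are the not-linked ones.

from itertools import combinations

def reachable_ancestors(missing):
    """Return (linked, not_linked) ancestors as missing-letter tuples."""
    n = len(missing)
    prefixes = {tuple(missing[:i]) for i in range(n)}     # proper prefixes
    sublists = set()
    for k in range(n):                                    # proper sublists
        sublists.update(combinations(missing, k))
    order = lambda t: (len(t), t)
    return sorted(prefixes, key=order), sorted(sublists - prefixes, key=order)

# Node *b1***, whose missing letters w.r.t. C_max = a{b1,b2}*d* are (a, b2, d)
linked, not_linked = reachable_ancestors(('a', 'b2', 'd'))
print(linked)       # (): the root; ('a',): *{b1,b2}*d*; ('a','b2'): *b1*d*
print(not_linked)   # ab1*d*, a{b1,b2}***, *{b1,b2}***, ab1***
```

This reproduces the three linked and four not-linked ancestors of Example 4.2.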
The derivation of the frequent i-patterns is performed as
follows.

Algorithm 4.2 [Derivation of frequent patterns from the
max-subpattern tree] Derive the frequent i-patterns
for all i > 1, given a max-subpattern tree T, by an
Apriori-like technique.

Method.

1. The set of frequent 1-patterns is derived in the first
scan of Algorithm 3.2.

2. The max-subpattern tree T is derived in the second scan
of Algorithm 3.2. The set of frequent i-patterns (i > 1)
is derived as follows.

for i := 2 to the maximal pattern length do:
  derive the candidate patterns of length i from the frequent
  patterns of length i - 1 by an Apriori-like join;
  scan tree T to find the frequency counts of these candidate
  patterns and eliminate the non-frequent ones.
  Notice that the frequency count of a node is the sum
  of the count of itself and those of all of its reachable
  ancestors. If the derived frequent i-pattern set
  is empty, return.
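The counting rule in step 2, that a pattern's frequency count is its own node count plus those of all its reachable ancestors, can be sketched as follows, with the tree flattened into a map from missing-letter lists to counts; the counts reuse the Figure 1 running example and all names are illustrative:

```python
# Sketch of the counting step of Algorithm 4.2: a pattern's frequency count
# is the sum of the counts of every tree node whose missing-letter list is a
# sublist of the pattern's own (its reachable ancestors plus itself).

def is_sublist(short, long):
    """True if `short` is an order-preserving sublist of `long`."""
    it = iter(long)
    return all(x in it for x in short)

def frequency_count(missing, tree_counts):
    """tree_counts maps a node's missing-letter tuple to its hit count."""
    return sum(c for m, c in tree_counts.items() if is_sublist(m, missing))

counts = {(): 10,           # root a{b1,b2}*d*
          ('a',): 0,        # *{b1,b2}*d*
          ('b2',): 50,      # ab1*d*
          ('a', 'b2'): 8}   # *b1*d*

print(frequency_count(('a', 'b2'), counts))   # 10 + 0 + 50 + 8 = 68
print(frequency_count(('b2',), counts))       # ab1*d*: 10 + 50 = 60
```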
Analysis. Let the total number of non-* letters in C_max be m. As
shown in the analysis of Algorithm 4.1, the time complexity
for searching a node is less than O(m). Since there are at most
2^m nodes that can be generated from the max-subpattern tree
(including all the missing descendants), and there are at
most m × n reachable ancestors in T, where n is the size of
the hit set, the worst-case time complexity for the derivation of
all the frequent patterns is O(m × n × 2^m), i.e., proportional
to m and the size of the hit set, but exponential in m
(i.e., proportional to the size of the tree that can be generated
from C_max). Since an infrequent node reduces the number
of candidates to be generated in later rounds, the real
processing cost is usually much smaller than the cost in the
worst case.
We illustrate how to derive the frequent i-patterns for
i > 1 from the max-subpattern tree in the following example.

Example 4.3 Let Figure 1 be the derived max-subpattern
tree T, under a given minimum support threshold. We can traverse the max-subpattern
tree to find all the frequent i-patterns for i > 1
as follows. Starting at level 2, all six nodes are frequent,
with counts 68, 68, 47, 119, 92, and 84. We show
the derivation of the count of *b1*d* (68) here: since the list of missing
letters of this node is (a, b2), its set of reachable ancestors
is {a{b1,b2}*d*, *{b1,b2}*d*, ab1*d*}, and thus its frequency count = 10 + 0 + 50 +
8 (itself) = 68. Since level 2 has no infrequent nodes, we
search all the nodes at level 1 and find the following frequent
patterns: {ab1*d* (60), ab2*d* (50)}. Since level 1 contains
an infrequent node, level 0 (the root) has no frequent patterns.
Notice that although we only saved one node's computation in this
case, the saving will be much greater when the tree is large and there
are more missing nodes.
From the above example, one can see that many
frequent i-patterns with small i can be generated from
a max-subpattern tree. In practical applications, people may
only be interested in the set of maximal frequent patterns
instead of all frequent patterns, where the set of maximal frequent
patterns is the subset of the frequent-pattern set such that
every other frequent pattern is a subpattern of some element
of it. For example, if the set of frequent patterns is
{ab**, a*c*, a***, *b**, **c*}, the set of maximal
frequent patterns is {ab**, a*c*}.

If a user is interested in deriving the set of maximal frequent
patterns, the MaxMiner algorithm developed by Bayardo [4] is a good candidate.
The success of this algorithm stems from generating new candidates by joining frequent
itemsets and looking ahead. However, it still requires scanning
the time series multiple times in the worst case. Combining the
max-subpattern hit set method with MaxMiner removes
this problem and makes it more efficient than pure
MaxMiner. The details of the new method will be examined
in future research.
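For illustration, extracting the maximal members from a given frequent-pattern set reduces to a pairwise subpattern test; the representation (positions as sets of letters, the empty set as *) and all names are our own sketch:

```python
# Sketch: keep only the maximal frequent patterns, i.e. those that are not
# proper subpatterns of another frequent pattern. Positions are sets of
# letters; the empty set plays the role of '*'.

def is_subpattern(p, q):
    """Position-wise containment of p's letters in q's letters."""
    return all(a <= b for a, b in zip(p, q))

def maximal(patterns):
    return [p for p in patterns
            if not any(p != q and is_subpattern(p, q) for q in patterns)]

S = frozenset                           # shorthand for building positions
pats = [(S('a'), S(), S('d')),          # a*d
        (S('a'), S(), S()),             # a**  (subsumed by a*d)
        (S(), S('b'), S())]             # *b*
print(maximal(pats))                    # a*d and *b* survive
```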
5 Performance study
In this section we report a performance study which
compares the performance of the periodicity mining algo-
rithms proposed in this paper. In particular, we give a per-
formance comparison between the single-period Apriori
algorithm (Algorithm 3.1) (or simply called Apriori), and
the max-subpattern hit-set algorithm (Algorithm 3.2) (or
simply hit-set) applied to a single period.
This comparison indicates a significant efficiency gain by
max-subpattern hit-set over Apriori. Since the gain is even
larger when max-subpattern hit-set is applied to multiple
periods, it is clearly the winner.
The performance study was conducted on a Pentium 166
machine with 64 megabytes of main memory, running Windows/NT.
The programs were written in Microsoft Visual C++.
5.1 Testing Databases
Each test time series is a synthetic time-series databases
generated using a randomized periodicity data generation
algorithm. From a set of features, potentially frequent 1-
patterns are composed. The size of the potentially frequent
1-patterns is determined based on a Poisson distribution.
These patterns are generated and put into the time-series
according to an exponential distribution.
LENGTH          the length of the time series
p               a period
MAX-PAT-LENGTH  the maximal length of frequent patterns
|F1|            the number of frequent 1-patterns

Table 1. Parameters of synthetic time series
The basic parameters used to generate the synthetic
databases are listed in Table 1. The parameters LENGTH
(the length of the time series) and p (a period) are chosen
independently. The parameters MAX-PAT-LENGTH (the maximal
length of frequent patterns) and |F1| (the number
of frequent 1-patterns) are for a fixed p, and they are controlled
by the choice of an appropriate confidence threshold.
We found that other parameters, such as the number of
features occurring at a fixed position and the number of features
in the time series, do not have much impact on the
performance results, and thus they are not considered in the
tests.
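A rough sketch of such a generator, under our own simplifying assumptions (a single planted periodic feature, exponentially distributed gaps between injections, and uniform background noise; all names and parameter values are illustrative, not the paper's generator):

```python
# Sketch of a randomized periodic-data generator: plant one periodic feature
# 'P' at offset 0 of selected periods, with exponentially distributed gaps
# between injections, over a background of uniformly chosen noise features.

import random

def gen_series(length, period, features, rate=0.3, seed=7):
    rng = random.Random(seed)
    series = [{rng.choice(features)} for _ in range(length)]   # noise slots
    t = 0
    while t < length:
        series[t].add('P')                       # inject the periodic feature
        gap = max(1, round(rng.expovariate(rate) / period)) * period
        t += gap                                 # skip a whole number of periods
    return series

s = gen_series(length=40, period=5, features=list('uvwxyz'))
# every injection lands at offset 0 of some period
print(all(t % 5 == 0 for t, slot in enumerate(s) if 'P' in slot))
```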
5.2 Performance comparison of the algorithms
Figure 2 shows there is a significant efficiency gain by
max-subpattern hit-set over Apriori. In this figure, the
maximal pattern length (the maximal length of frequent
partial periodic patterns) grows from 2 to 10. The other
parameters, the period and the number of frequent 1-patterns,
are kept constant. We ran two sets of tests, one with the length
of the time series being 100,000 and the other 500,000. As
we can see, the running time of max-subpattern hit-set
is almost constant for both cases, while that of Apriori grows
almost linearly. At the largest MAX-PAT-LENGTH, the gain by
max-subpattern hit-set over Apriori is about a factor of two. We
expect this gain to increase for larger MAX-PAT-LENGTH.
[Figure 2 here: running time in seconds (0 to 7000) versus Max-Pat-Length
(2 to 10), with four curves: Apriori 500k, Apriori 100k, HitSet500k, and
HitSet100k.]

Figure 2. Performance gain when MAX-PAT-LENGTH increases
(other parameters held constant).
It is important to note that the gain shown in Figure 2 is
obtained by keeping everything in memory, and by considering
only one period. In general, this is unlikely to be the case,
and max-subpattern hit-set will perform even better than
Apriori, for the following reasons:

In general, the time series of features may need to be
stored on disk, due to factors such as each point in time containing
up to thousands of features and the time series being long.
When the time series is stored on disk,
there would be a large amount of extra disk I/O associated
with Apriori, but not with max-subpattern hit-set,
since the latter only requires two scans. Even when the
time series is not stored on disk, Apriori will need
to go over this huge sequence many more times than
max-subpattern hit-set. Thus max-subpattern hit-set
will be far better than Apriori.
When there is a range of periods to consider,
max-subpattern hit-set can find all frequent patterns
in two scans, but Apriori will require many more
scans, depending on the number of periods and the
length of the maximal frequent patterns. Hence
max-subpattern hit-set will again be far better than
Apriori.
6 Conclusions
We have studied efficient methods for mining partial periodicity
in time series databases. Partial periodicity, which
associates periodic behavior with only a subset of all the
time points, is less restrictive than full periodicity and thus
covers a broad class of applications.
By exploring several interesting properties related to partial
periodicity, including the Apriori property, the max-subpattern
hit set property, and shared mining of multiple
periods, a set of partial periodicity mining algorithms is
proposed, and their relative performance is compared. Our
study shows that the max-subpattern hit set method, which
needs only two scans of the time series database, even for
mining multiple periods, offers excellent performance.
Our study has been confined to mining partial periodic
patterns in one time series of categorical data with a single
level of abstraction. However, the method developed
here can be extended to mining multiple-level, multiple-dimensional
partial periodicity and to mining partial periodicity
with perturbation and evolution.
For mining numerical data, such as stock or power consumption
fluctuations, one can examine the distribution of
numerical values in the time-series data and discretize them
into single- or multiple-level categorical data. For mining
multiple-level partial periodicity, one can explore level-shared
mining by first mining the periodicity at a high level,
and then progressively drilling down with the discovered
periodic patterns to see whether they are still periodic at a
lower level.
Perturbation may occur from period to period, which
can make it difficult to discover partial periodicity in many
applications. For mining partial periodicity with perturbation,
one method is to slightly enlarge the time slot to be
examined. Partial periodic patterns with minor perturbation
are likely to be caught in the generalized time slot. Another
method is to include the features happening in the time slots
surrounding the one being analyzed. We can further employ
regression techniques to reduce the noise of perturbation.
There are still many issues regarding partial periodicity
mining that deserve further study, such as further exploration
of shared mining of periodicity with multiple
periods, mining periodic association rules based on partial
periodicity, and query- and constraint-based mining of partial
periodicity [11]. We are studying these problems and
implementing our algorithms for mining partial periodicity
in a data mining system, and will report our progress in the
future.
References
[1] R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zait. Query-
ing shapes of histories. In Proc. 21st Int. Conf. Very Large
Data Bases, pages 502–514, Zurich, Switzerland, Sept.
1995.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining as-
sociation rules. In Proc. 1994 Int. Conf. Very Large Data
Bases, pages 487–499, Santiago, Chile, September 1994.
[3] R. Agrawal and R. Srikant. Mining sequential patterns. In
Proc. 1995 Int. Conf. Data Engineering, pages 3–14, Taipei,
Taiwan, March 1995.
[4] R. J. Bayardo. Efficiently mining long patterns from
databases. In Proc. 1998 ACM-SIGMOD Int. Conf. Manage-
ment of Data, pages 85–93, Seattle, Washington, June 1998.
[5] C. Bettini, X. Sean Wang, and S. Jajodia. Mining temporal
relationships with multiple granularities in time sequences.
Data Engineering Bulletin, 21:32–38, 1998.
[6] J. Han and Y. Fu. Discovery of multiple-level associa-
tion rules from large databases. In Proc. 1995 Int. Conf.
Very Large Data Bases, pages 420–431, Zurich, Switzerland,
Sept. 1995.
[7] J. Han, W. Gong, and Y. Yin. Mining segment-wise periodic
patterns in time-related databases. In Proc. 1998 Int’l Conf.
on Knowledge Discovery and Data Mining (KDD’98), New
York City, NY, August 1998.
[8] H. J. Loether and D. G. McTavish. Descriptive and Inferen-
tial Statistics: An Introduction. Allyn and Bacon, 1993.
[9] H. Lu, J. Han, and L. Feng. Stock movement and n-
dimensional inter-transaction association rules. In Proc.
1998 SIGMOD Workshop on Research Issues on Data Min-
ing and Knowledge Discovery (DMKD’98), pages 12:1–
12:7, Seattle, Washington, June 1998.
[10] H. Mannila, H. Toivonen, and A. I. Verkamo. Discover-
ing frequent episodes in sequences. In Proc. 1st Int. Conf.
Knowledge Discovery and Data Mining, pages 210–215,
Montreal, Canada, Aug. 1995.
[11] R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Ex-
ploratory mining and pruning optimizations of constrained
associations rules. In Proc. 1998 ACM-SIGMOD Int. Conf.
Management of Data, pages 13–24, Seattle, Washington,
June 1998.
[12] B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic as-
sociation rules. In Proc. 1998 Int. Conf. Data Engineering
(ICDE'98), pages 412–421, Orlando, FL, Feb. 1998.
... Extensive experimentation: To the best of our knowledge, no previous work has compared, empirically, the performance of the most cited algorithms based on association rules, such as Apriori [9], MS-Apriori [10], FP-Growth [11], PPA [12], and Max-Subpattern [11]. Thus, we have conducted a comprehensive comparison of these algorithms over two STDBs-first, a synthetic one, then a real one. ...
... Extensive experimentation: To the best of our knowledge, no previous work has compared, empirically, the performance of the most cited algorithms based on association rules, such as Apriori [9], MS-Apriori [10], FP-Growth [11], PPA [12], and Max-Subpattern [11]. Thus, we have conducted a comprehensive comparison of these algorithms over two STDBs-first, a synthetic one, then a real one. ...
... Han et al. [11] proposed the Max-Subpattern Hit-Set algorithm, often referred to simply as Max-Subpattern. They based their development on a custom data structure called a max-subpattern tree to efficiently generate larger partial periodic patterns from combinations of smaller patterns. ...
Article
Full-text available
Deriving insight from data is a challenging task for researchers and practitioners, especially when working on spatio-temporal domains. If pattern searching is involved, the complications introduced by temporal data dimensions create additional obstacles, as traditional data mining techniques are insufficient to address spatio-temporal databases (STDBs). We hereby present a new algorithm, which we refer to as F1/FP, and can be described as a probabilistic version of the Minus-F1 algorithm to look for periodic patterns. To the best of our knowledge, no previous work has compared the most cited algorithms in the literature to look for periodic patterns—namely, Apriori, MS-Apriori, FP-Growth, Max-Subpattern, and PPA. Thus, we have carried out such comparisons and then evaluated our algorithm empirically using two datasets, showcasing its ability to handle different types of periodicity and data distributions. By conducting such a comprehensive comparative analysis, we have demonstrated that our newly proposed algorithm has a smaller complexity than the existing alternatives and speeds up the performance regardless of the size of the dataset. We expect our work to contribute greatly to the mining of astronomical data and the permanently growing online streams derived from social media.
... Useful information that can facilitate the users to achieve socio-economic development lies hidden in this data. Han et al. [1] first introduced a partial periodic pattern model to discover all periodic regularities in a (binary) multiple time series. Since then, the problem of finding these patterns has received considerable attention [2]- [6]. ...
... Thus, there is value in finding these patterns in applications. However, finding F3Ps in a QIMTS is a non-trivial task due to the following reasons: 1) Since a pattern in a series represents an ordered sequence of itemsets, the search space of partial periodic pattern mining is per i=1 n i , where n represents the total number of distinct objects, and per represents the total number of distinct timestamps (or period [1]) in a sequence. We must investigate new techniques to reduce this colossal search space and the computational cost-effectively. ...
... Han et al. [1] extended the basic frequent itemset model [17] to find partial periodic patterns in binary regular time series data. This model discovers all periodic patterns in time series data that satisfy the user-specified minimum support (minSup). ...
... This task is essential in many applications, including finance, weather forecasting, and healthcare, where recurring patterns can provide valuable insights into the underlying processes. To address this problem, researchers [22][23][24][25][26][27] commonly use a two-step approach. ...
... Time series analysis is a powerful tool for understanding patterns and trends in sequential data. However, several traditional time series analysis approaches [22][23][24][25][26][27][28] to time treat data as a symbolic sequence and may overlook important temporal information within the sequence. For example, consider a financial time series where the price of a stock is recorded at regular intervals over time. ...
Article
Full-text available
Partial periodic pattern (3P) mining is a vital data mining technique that aims to discover all interesting patterns that have exhibited partial periodic behavior in temporal databases. Previous studies have primarily focused on identifying 3Ps only in row temporal databases. One can not ignore the existence of 3Ps in columnar temporal databases as many real-world applications, such as Facebook and Adobe, employ them to store their big data. This paper proposes an efficient single database scan algorithm, Partial Periodic Pattern-Equivalence Class Transformation (3P-ECLAT), to identify all 3Ps in a columnar temporal database. The proposed algorithm compresses the given database into a novel list-based data structure and mines it recursively to find all 3Ps. The 3P-ECLAT leverages the “downward closure property” and “depth-first search technique” to reduce the search space and the computational cost. Extensive experiments have been conducted on synthetic and real-world databases to demonstrate the efficiency of the 3P-ECLAT algorithm. The memory and runtime results show that 3P-ECLAT outperforms its competitor considerably. Furthermore, 3P-ECLAT is highly scalable and is superior to the previous approach in handling large databases. Finally, to demonstrate the practical utility of our algorithm, we provide two real-world case studies, one on analyzing traffic congestion during disasters and another on identifying the highly polluted areas in Japan.
... To address this problem, a common approach used in the literature [44,45,46,47,48,49] is a two-step model. The first step involves partitioning the time series data into distinct subsets or period segments of fixed length or period. ...
... Time series analysis studies often treat time series data as a symbolic sequence, neglecting the temporal information of events within the sequence [44,45,46,47,48,49,50]. However, this approach can be limiting as it may disregard important insights that could be derived from analyzing the temporal aspect of events. ...
Article
Full-text available
Periodic frequent-pattern mining (PFPM) is a vital knowledge discovery technique that identifies periodically occurring patterns in a temporal database. Although traditional PFPM algorithms have many applications, they often produce a large set of periodic-frequent patterns (PFPs) in a database. As a result, analyzing PFPs can be very time-consuming for users. Moreover, a large set of PFPs makes PFPM algorithms less efficient regarding runtime and memory consumption. This paper handles this problem by proposing a novel model of closed 1 Springer Nature 2021 L A T E X template 2 Article Title periodic-frequent patterns (CPFPs) found in databases. CPFPs are less expensive to mine because they represent a concise and lossless subset uniquely describing the entire set of PFPs. We also present an efficient depth-first search algorithm, called Closed Periodic-Frequent Pattern-Miner (CPFP-Miner), to discover the patterns. The proposed algorithm utilizes the weighted ordering of the patterns concept to reduce the patterns' search space. On the other hand, the current periodicity concept is also applied to prune aperiodic patterns from the search space. Extensive experiments on both real-world and synthetic databases demonstrate that the CPFP-Miner algorithm is efficient. It outperforms the state-of-the-art algorithms regarding run-time requirements, memory consumption, and energy consumption on several real-world and synthetic databases. Additionally, the scalabil-ity of the CPFP-Miner algorithm is demonstrated to be more effective and productive than the state-of-the-art algorithms. Finally, we present two case studies to show the functionality of the proposed patterns.
... In this paper we modelled the recurrency by forcing the segments to share their parameters. An alternative approach to discover recurrency is to look explicitly for recurrent patterns (Ozden et al., 1998;Han et al., 1998Han et al., , 1999Ma & Hellerstein, 2001;Yang et al., 2003;Galbrun et al., 2019). We should point out that these works are not design to work with graphs; instead they work with event sequences. ...
Article
Full-text available
A popular approach to model interactions is to represent them as a network with nodes being the agents and the interactions being the edges. Interactions are often timestamped, which leads to having timestamped edges. Many real-world temporal networks have a recurrent or possibly cyclic behaviour. In this paper, our main interest is to model recurrent activity in such temporal networks. As a starting point we use stochastic block model, a popular choice for modelling static networks, where nodes are split into R groups. We extend the block model to temporal networks by modelling the edges with a Poisson process. We make the parameters of the process dependent on time by segmenting the time line into K segments. We require that only $$H \le K$$ H ≤ K different set of parameters can be used. If $$H < K$$ H < K , then several, not necessarily consecutive, segments must share their parameters, modelling repeating behaviour. We propose two variants where a group membership of a node is fixed over the course of entire time line and group memberships are allowed to vary from segment to segment. We prove that searching for optimal groups and segmentation in both variants is NP -hard. Consequently, we split the problem into 3 subproblems where we optimize groups, model parameters, and segmentation in turn while keeping the remaining structures fixed. We propose an iterative algorithm that requires $$\mathcal {O} \left( KHm + Rn + R^2\,H\right)$$ O K H m + R n + R 2 H time per iteration, where n and m are the number of nodes and edges in the network. We demonstrate experimentally that the number of required iterations is typically low, the algorithm is able to discover the ground truth from synthetic datasets, and show that certain real-world networks exhibit recurrent behaviour as the likelihood does not deteriorate when H is lowered.
... Inspired by Ozden's work [24], Han et al. [25] described a model to find partial periodic patterns in an evenly spaced binary time series. Later, the authors proposed an efficient algorithm [26] to discover the partial periodic patterns. In this model, a binary series is split into multiple sequences of a particular length specified by the user, and interesting patterns were discovered using only the minSup threshold value. ...
Article
Full-text available
Periodic-frequent patterns are a vital class of regularities in a temporal database. Most previous studies followed the approach of finding these patterns by storing the temporal occurrence information of a pattern in a list. While this approach facilitates the existing algorithms to be practicable on sparse databases, it also makes them impracticable (or computationally expensive) on dense databases due to increased list sizes. A renowned concept in set theory is larger the set, the smaller its complement will be. Based on this conceptual fact, this paper explores the complements, redefines the periodic-frequent pattern and proposes an efficient depth-first search algorithm called PFPM-C, that finds all periodic-frequent patterns by storing only non-occurrence information of a pattern in a database. Experimental results on several databases demonstrate that our algorithm is efficient.
... Inspired by Ozden's work, Han et al. [23] described a model to find partial periodic patterns in an evenly spaced binary time series. Later, the authors proposed an efficient algorithm [24] to discover the partial periodic patterns. In this model, a binary time series database is split into multiple subsequence databases of a particular length specified by the user, and interesting patterns were discovered using only the minSup threshold value. ...
Article
Full-text available
Partial Periodic Pattern Mining (3PM) is a key knowledge discovery technique with many applications. It involves discovering all patterns that have exhibited partial periodic behavior in a temporal database. Unfortunately, the widespread adoption of this technique has been hindered by the following two limitations: (i) the rare item problem, which involves either missing the patterns containing rare items or producing too many patterns, most of which may be uninteresting to the user, and (ii) computationally expensive mining process as its mining algorithms were inefficient in reducing the enormous search space. This paper makes the following efforts to address the above-mentioned two limitations. First, we introduce a new null-invariant measure, periodic- confidence, to determine the periodic interestingness of a pattern in a database. Second, an alternative model of a partial periodic pattern has been defined based on the proposed measure. Third, an efficient depth-first search algorithm based on the renowned pattern-growth technique has been introduced to discover all partial periodic patterns in a database. Fourth, the proposed algorithm employs a novel lossless pruning technique called “irregularity pruning” to reduce the search space and computational cost-efficiently. Experiments on several datasets demonstrate that our model can effectively tackle the rare item problem, and our algorithm is efficient. Finally, we discuss the usefulness of patterns with case studies performed on air pollution and traffic congestion databases.
Conference Paper
Full-text available
Sequential pattern mining is an important data mining problem with broad applications. However, it is also a difficult problem since the mining may have to degenerate or examine a combinatorially explosive number of intermediate subsequences. Most of the previously developed sequential pattern mining methods, such as GSP, explore a candidate generation-and-test approach [1] to reduce the number of candidates to be examined. However, this approach may not be efficient in mining large sequence databases having numerous patterns and/or long patterns. In this paper, we propose a projection-based, sequential pattern-growth approach for efficient mining of sequential patterns. In this approach, a sequence database is recursively projected into a set of smaller projected databases, and sequential patterns are grown in each projected database by exploring only locally frequent fragments.
Conference Paper
Full-text available
We present a pattern-mining algorithm that scales roughly linearly in the number of maximal patterns embedded in a database irrespective of the length of the longest pattern. In comparison, previous algorithms based on Apriori scale exponentially with longest pattern length. Experiments on real data show that when the patterns are long, our algorithm is more efficient by an order of magnimaximal frequent itemset, Max-Miner’s output implicitly and concisely represents all frequent itemsets. Max-Miner is shown to result in two or more orders of magnitude in performance improvements over Apriori on some data-sets. On other data-sets where the patterns are not so long, the gains are more modest. In practice, Max-Miner is demonstrated to run in time that is roughly linear in the number of maximal frequent itemsets and the size of the database, irrespective of the size of the longest frequent itemset. tude or more. 1.
Conference Paper
From the standpoint of supporting human-centered discovery of knowledge, the present-day model of mining association rules suffers from the following serious shortcomings: (i) lack of user exploration and control, (ii) lack of focus, and (iii) rigid notion of relationships. In effect, this model functions as a black box, admitting little user interaction in between. We propose, in this paper, an architecture that opens up the black box and supports constraint-based, human-centered exploratory mining of associations. The foundation of this architecture is a rich set of constraint constructs, including domain, class, and SQL-style aggregate constraints, which enable users to clearly specify what associations are to be mined. We propose constrained association queries as a means of specifying the constraints to be satisfied by the antecedent and consequent of a mined association. In this paper, we mainly focus on the technical challenges in guaranteeing a level of performance that is commensurate with the selectivities of the constraints in an association query. To this end, we introduce and analyze two properties of constraints that are critical to pruning: anti-monotonicity and succinctness. We then develop characterizations of various constraints into four categories, according to these properties. Finally, we describe a mining algorithm called CAP, which achieves a maximized degree of pruning for all categories of constraints. Experimental results indicate that CAP can run much faster, in some cases as much as 80 times, than several basic algorithms. This demonstrates how important the succinctness and anti-monotonicity properties are in delivering the performance guarantee.
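Anti-monotonicity, one of the two pruning properties mentioned above, can be sketched as follows. Assuming a hypothetical constraint sum(price) <= budget (the prices and budget are invented for illustration), any itemset violating it can be pruned along with all of its supersets during level-wise search. This sketch is illustrative only and is not the CAP algorithm.

```python
def constrained_itemsets(items, price, budget):
    """Level-wise enumeration of itemsets satisfying the anti-monotone
    constraint sum(price) <= budget.

    Anti-monotonicity: if an itemset violates the constraint, so does
    every superset, so violating candidates are discarded and never
    extended. Hypothetical illustration, not the CAP algorithm.
    """
    level = [frozenset([i]) for i in items if price[i] <= budget]
    result = list(level)
    while level:
        nxt = set()
        for itemset in level:
            for i in items:
                if i in itemset:
                    continue
                candidate = itemset | {i}
                # Prune: a violating candidate is dropped here and,
                # since we only extend survivors, none of its supersets
                # are ever generated from it.
                if sum(price[j] for j in candidate) <= budget:
                    nxt.add(candidate)
        level = list(nxt)
        result.extend(level)
    return result

sets = constrained_itemsets(['a', 'b', 'c'], {'a': 1, 'b': 2, 'c': 5}, 3)
```

With these toy prices, {'c'} already violates the budget, so no itemset containing 'c' is ever considered.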
Conference Paper
We study the problem of discovering association rules that display regular cyclic variation over time. For example, if we compute association rules over monthly sales data, we may observe seasonal variation where certain rules are true at approximately the same month each year. Similarly, association rules can also display regular hourly, daily, weekly, etc., variation that is cyclical in nature. We demonstrate that existing methods cannot be naively extended to solve this problem of cyclic association rules. We then present two new algorithms for discovering such rules. The first one, which we call the sequential algorithm, treats association rules and cycles more or less independently. By studying the interaction between association rules and time, we devise a new technique called cycle pruning, which reduces the amount of time needed to find cyclic association rules. The second algorithm, which we call the interleaved algorithm, uses cycle pruning and other optimization techniques for discovering cyclic association rules. We demonstrate the effectiveness of the interleaved algorithm through a series of experiments. These experiments show that the interleaved algorithm can yield significant performance benefits when compared to the sequential algorithm. Performance improvements range from 5% to several hundred percent.
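The notion of a cyclic rule can be made concrete with a small, hypothetical sketch: given a boolean series recording in which time units a rule held, a cycle (l, o) means the rule holds in every unit t with t mod l == o. This brute-force check is purely illustrative; the algorithms above avoid it via cycle pruning and other optimizations.

```python
def cycles(holds, max_len):
    """Find all (length, offset) cycles in a boolean series.

    holds[t] is True iff the rule held in time unit t. A cycle (l, o)
    means the rule holds at every t with t % l == o. Brute-force
    sketch for illustration only.
    """
    found = []
    for l in range(1, max_len + 1):
        for o in range(l):
            # The cycle exists iff the rule held in every sampled unit.
            if all(holds[t] for t in range(o, len(holds), l)):
                found.append((l, o))
    return found

# A rule that holds every third unit, starting at offset 1:
series = [False, True, False, False, True, False, False, True, False]
found = cycles(series, 4)
```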
Conference Paper
Sequences of events describing the behavior and actions of users or systems can be collected in several domains. In this paper we consider the problem of recognizing frequent episodes in such sequences of events. An episode is defined to be a collection of events that occur within time intervals of a given size in a given partial order. Once such episodes are known, one can produce rules for describing or predicting the behavior of the sequence. We describe an efficient algorithm for the discovery of all frequent episodes from a given class of episodes, and present experimental results.
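Window-based episode frequency, as described above, can be sketched for the simple case of parallel episodes (event order ignored). The function and data below are hypothetical illustrations, not the paper's algorithm.

```python
def episode_frequency(events, episode, window):
    """Count sliding windows of width `window` (in discrete time units)
    that contain every event type in `episode`.

    `events` is a list of (time, event_type) pairs. Only parallel
    episodes are handled (the partial order is ignored). Sketch only.
    """
    episode = set(episode)
    if not events:
        return 0
    times = [t for t, _ in events]
    start, end = min(times), max(times)
    count = 0
    # Slide over every window position that overlaps the sequence.
    for w_start in range(start - window + 1, end + 1):
        seen = {e for t, e in events if w_start <= t < w_start + window}
        if episode <= seen:
            count += 1
    return count

evts = [(1, 'A'), (2, 'B'), (5, 'A'), (6, 'B')]
freq = episode_frequency(evts, {'A', 'B'}, 2)
```

An episode is then deemed frequent when this window count (or the corresponding fraction of windows) exceeds a user-given threshold.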
Article
This paper reports the progress in this front. A more detailed study can be found in [4]. In this paper, we focus on algorithms for discovering sequential relationships when a rough pattern of relationships is given. The rough pattern (which we term "event structure") specifies what sort of relationships a user is interested in. For example, a user may be interested in "which pairs of events occur frequently one week after another". The algorithms will find the instances that fit the event...