Efficient Mining of Partial Periodic Patterns
in Time Series Database
In ICDE 99
Jiawei Han
School of Computing Science
Simon Fraser University
han@cs.sfu.ca
Guozhu Dong
Department of Computer Science and Engineering
Wright State University
gdong@cs.wright.edu
Yiwen Yin
School of Computing Science
Simon Fraser University
yiweny@cs.sfu.ca
Abstract
Partial periodicity search, i.e., search for partial periodic patterns in time-series databases, is an interesting data mining problem. Previous studies on periodicity search mainly consider finding full periodic patterns, where every point in time contributes (precisely or approximately) to the periodicity. However, partial periodicity is very common in practice since it is more likely that only some of the time episodes may exhibit periodic patterns.

We present several algorithms for efficient mining of partial periodic patterns, by exploring some interesting properties related to partial periodicity, such as the Apriori property and the max-subpattern hit set property, and by shared mining of multiple periods. The max-subpattern hit set property is a vital new property which allows us to derive the counts of all frequent patterns from a relatively small subset of patterns existing in the time series. We show that mining partial periodicity needs only two scans over the time series database, even for mining multiple periods. The performance study shows our proposed methods are very efficient in mining long periodic patterns.

Keywords. Periodicity search, partial periodicity, time-series analysis, data mining algorithms.
1. Introduction
Finding periodic patterns in time series databases is an
important data mining task with many applications. Many
(Footnote: Research was supported in part by research grants from the Natural Sciences and Engineering Research Council of Canada and the Networks of Centres of Excellence Program of Canada.)
(Footnote: Part of this work was done while visiting Simon Fraser University during his sabbatical from the University of Melbourne, Australia.)
methods have been developed for searching periodicity patterns
in large data sets [8]. However, most previous methods
on periodicity search are on mining full periodic patterns,
where every point in time contributes (precisely or approxi-
mately) to the cyclic behavior of the time series. For exam-
ple, all the days in the year approximately contribute to the
season cycle of the year. A useful related type of periodic
patterns, called partial periodic patterns, which specify the
behavior of the time series at some but not all points in time,
has not received enough attention. An example partial
periodic pattern may state that Jim reads the Vancouver Sun
newspaper from 7:00 to 7:30 every weekday morning but
his activities at other times do not have much regularity.
Thus, partial periodicity is a looser kind of periodicity than
full periodicity, and it exists ubiquitously in the real world.
The purpose of the current paper is to fill the gap by consid-
ering the efficient mining of partial periodic patterns.
Most methods for finding full periodic patterns are ei-
ther inapplicable to or prohibitively expensive for the min-
ing of partial periodic patterns, because of the mixture of
periodic events and non-periodic events in the same period.
For example, FFT (Fast Fourier Transformation) cannot be
applied to mining partial periodicity because it treats the
time-series as an inseparable flow of values. Some peri-
odicity detection methods can detect some partial periodic
patterns, but only if the period, and the length and timing
of the segment in the partial patterns with specific behavior
are explicitly specified. For the newspaper reading example,
we need to explicitly specify details such as "find the regular
activities of Jim during the half-hour after 7:00 for the
period of 24 hours." A naive adaptation of such methods to
our partial periodic pattern mining problem would be pro-
hibitively expensive, requiring their application to a huge
number of possible combinations of the three parameters of
length, timing, and period.
Besides full periodicity search, there are many recent
studies on time series data mining: Most concentrate on
symbolic patterns, although some consider numerical curve
patterns in time series. Agrawal and Srikant [3] devel-
oped an Apriori-like technique [2] for mining sequential
patterns. Mannila et al. [10] consider frequent episodes in
sequences, where episodes are essentially acyclic graphs
of events whose edges specify the temporal before-and-after
relationship but without timing-interval restrictions.
Inter-transaction association rules proposed by Lu et al. [9]
are implication rules whose two sides are totally-ordered
episodes with timing-interval restrictions (on the events in
the episodes and on the two sides). Bettini et al. [5] con-
sider a generalization of inter-transaction association rules:
these are essentially rules whose left-hand and right-hand
sides are episodes with time-interval restrictions. However,
unlike ours, periodicity is not considered in these studies.
Similar to our problem, the mining of cyclic association
rules by Özden et al. [12] also considers the mining of
some patterns over a range of possible periods. Observe that
cyclic association rules are partial periodic patterns with
perfect periodicity, in the sense that each pattern reoccurs in
every cycle, with 100% confidence. The perfectness in
periodicity leads to a key idea used in designing efficient cyclic
association rule mining algorithms: As soon as it is known
that an association rule r does not hold at a particular
instant of time, we can infer that r cannot have periods which
include this time instant. For example, if the maximum
period of interest is p_max and it is discovered that r does not
hold in the first p_max time instants, then r cannot have any
periods. This idea leads to the useful "cycle-elimination"
strategy explored in that paper. Since real life patterns are
usually imperfect, our goal is not to mine perfect periodicity
and thus “cycle-elimination” based optimization will not be
considered here.
An Apriori-like algorithm has been proposed for mining
imperfect partial periodic patterns with a given (single) pe-
riod in a recent study by two of the current authors [7]. It
is an interesting algorithm for mining imperfect partial pe-
riodicity. However, with a detailed examination of the data
characteristics of partial periodicity, we found that Apriori
pruningin mining partial periodicity may not be as effective
as in mining association rules.
Our study has revealed the following new characteristics
of partial periodic patterns in time series: The Apriori-like
property among partial periodic patterns still holds for any
fixed period, but it does not hold for patterns between dif-
ferent periods. Furthermore, there is a strong correlation
among frequencies of partial patterns.
(Footnote: It is important to point out that [12] concentrates on the elimination of candidate itemsets for the association rule mining algorithm, although the cycle-elimination strategy does lead to a small reduction in the number of patterns when we process the time series from left to right.)
(Footnote: Note that a modified strategy, where we stop considering certain patterns as soon as the length of the time series remaining to be processed is not enough to make the confidence higher than the threshold, can be used.)
The main contributions of this paper are as follows. We
consider the efficient mining of partial periodic patterns, for
a single period as well as for a set of periods. We propose
several mining algorithms, by exploring some interesting
properties related to partial periodicity such as the Apri-
ori property and the max-subpattern hit set property, and by
shared mining of multiple periods. The max-subpattern hit
set property is a vital new property which allows us to derive
the counts of all frequent patterns from a relatively small
subset of patterns mined from the time series. We show
that mining partial periodicity needs only two scansoverthe
time series database, even for mining multiple periods. The
performance study shows our proposed methods are very
efficient. The proposed methods are also robust in that they
can be applied in a variety of cases, including mining multiple-
level partial periodicity and mining partial periodicity with
perturbation and evolution.
The remainder of the paper is organized as follows. In
Section 2, concepts related to partial periodicity are intro-
duced. In Section 3, methods for mining partial periodicity
in regard to both single and multiple periods are studied.
In Section 4, the implementation of a novel data structure,
namely the max-subpattern tree, for facilitating the count-
ing of the hit maximal patterns, and the derivation of the set
of frequent patterns from the hit maximal patterns, are pre-
sented. In Section 5, a comparison of the performance of
the proposed algorithms is reported. We conclude our study
in Section 6.
2 Problem Definition
Assume that a sequence of n timestamped datasets have
been collected in a database. For each time instant i, let D_i
be the set of features derived from the dataset collected at the
instant. Thus, the time series of features is represented as
D_1, D_2, ..., D_n.
Let L be the underlying set of features. We will also use
the "don't care" letter *, which can match any single set
of features. We define a pattern s = s_1 s_2 ... s_p as a non-empty
sequence over (2^L − {∅}) ∪ {*}. We will use |s| = p
to denote the length of s, and will say that p is the period
of the pattern s. Let the L-length of s be the number of
positions s_i which contain letters from L. A pattern with
L-length i is also called an i-pattern. Moreover, a subpattern
of a pattern s = s_1 s_2 ... s_p is a pattern s' = s'_1 s'_2 ... s'_p
such that s and s' have the same length, and s'_i ⊆ s_i for
every position i where s'_i ≠ *. For example, a pattern with
non-* letters at exactly four positions is of L-length 4 (i.e., it
is a 4-pattern), and replacing any of its non-* letters by * (or
by a nonempty subset) yields one of its subpatterns.
(Footnote: If s_i is a singleton we will omit the brackets, e.g., we write {a} as a.)
The frequency count and confidence of a pattern s in a
time series S are defined as
frequency_count(s) = |{ i | 0 ≤ i < m, and the string s is true in D_{ip+1} ... D_{ip+p} }|,
and
confidence(s) = frequency_count(s) / m,
where m is the maximum number of periods of length p
contained in the time series (i.e., m is the positive integer
such that mp ≤ n < (m+1)p). Each segment of the
form D_{ip+1} ... D_{ip+p}, where 0 ≤ i < m, is called a
period segment. We say a pattern s = s_1 s_2 ... s_p is true in
the period segment, or the period segment matches s, if, for
each position j, either s_j is * or all the letters in s_j occur
in the j-th set of features in the segment. Thus, if s' is a
subpattern of s, then the set of period segments that can match
s is a subset of the period segments that can match s'.
Example 2.1 For example, a{b,c}* is a pattern of period 3;
its frequency count in the feature series
{a}{b,c}{d} {a}{b,c}{e} {f}{g}{h} is
2; and its confidence is 2/3, where 3 is the maximum
number of periods of length 3. The confidence of a** in
this series is also 2/3.
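The definitions above can be sketched directly in code. This is an illustrative reconstruction, not the paper's implementation: a feature series is a list of sets, a pattern is a list whose entries are either None (the "don't care" letter *) or a set of features, and the series and function names are ours.

```python
def matches(pattern, segment):
    """True if every non-* position's letters all occur in the segment."""
    return all(pos is None or pos <= seg
               for pos, seg in zip(pattern, segment))

def confidence(pattern, series):
    """Frequency count and confidence of a pattern of period p = len(pattern)."""
    p = len(pattern)
    m = len(series) // p          # maximum number of periods of length p
    count = sum(1 for i in range(m)
                if matches(pattern, series[i * p:(i + 1) * p]))
    return count, count / m

# A small hypothetical series of feature sets, period 3.
series = [{'a'}, {'b'}, {'c'},
          {'a'}, {'d'}, {'e'},
          {'a'}, {'b'}, {'f'}]
count, conf = confidence([{'a'}, {'b'}, None], series)
# The pattern is true in the first and third period segments.
```

The subpattern containment noted above falls out of `matches`: relaxing a position to None can only enlarge the set of matching segments.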
Similar to mining association rules [2], we say that a
pattern s is a frequent partial periodic pattern in a time
series if its confidence is larger than or equal to a threshold,
min_conf. The mining of frequent partial periodic patterns
in a time series is to discover, possibly with some restrictions,
all the frequent patterns of the series for one period or
a range of specified periods. More specifically, the input to
the mining includes:
A time series S.
A single specified period; or a range of periods specified by
two integers p_low and p_up.
An integer K indicating that the ratio of the length of S
to the length of the patterns must be at least K. This will ensure that
the patterns mined would be of value to the application at
hand.
Remark: Sometimes the derivation of the feature series
from the original data series is quite involved, and the inter-
action of the periodic patterns with the derivation of features
may lead to improved performance. Hence it is worthwhile
to combine the mining of the features from the datasets with
the mining of the patterns, as is the case for the mining of
cyclic association rules [12]. For our work on the mining of
frequent partial periodic patterns though, this interaction is
not useful for achieving computational advantage and thus
we will assume that we are dealing with the feature time
series in our study.
3 Methods for mining partial periodicity in
time series
In this section, we explore methods for mining partial
periodicity in a time series, proceeding from mining par-
tial periodicity for a single given period to mining partial
periodicity for a specified range of periods (i.e., multiple
periods).
3.1 Mining partial periodicity for a single period
3.1.1 Single-period Apriori method
A popular key idea used in the efficient mining of association
rules is the Apriori property discovered in [2]: If one
subset of an itemset is not frequent, then the itemset itself
cannot be frequent. This allows us to use frequent itemsets
of size i as filters for candidate itemsets of size i + 1.
Interestingly, for each period p, the property supporting
the Apriori "trick" still holds:
Property 3.1 [Apriori on periodicity] Each subpattern of
a frequent pattern of period p is itself a frequent pattern of
period p.
The proof is based on the fact that patterns are more restrictive
than their subpatterns. Suppose s' is a subpattern of a
frequent pattern s. Then s' is obtained from s by changing
the letter sets at some positions to subsets or to *. Hence s is more
restrictive than s', and thus the frequency count of s' is greater than
or equal to that of s. Thus s' is frequent as well.
An algorithm for mining partial periodic patterns for a
given fixed period based on this Apriori “trick” was pre-
sented in [7]. We include a simplified version here for the
sake of completeness.
Algorithm 3.1 [Single-period Apriori] Find all partial
periodic patterns for a given period p satisfying a given
confidence threshold min_conf in time-series S, based on
Apriori Property 3.1.
Method.
1. Find F_1, the set of frequent 1-patterns of period p, by
accumulating the frequency count for each 1-pattern in each
whole period segment and selecting those whose
frequency count is no less than min_conf × m, where m
is the maximum number of periods.
2. Find all frequent i-patterns of period p, for i from 2 up
to p, based on the idea of Apriori, and terminate immediately
when the candidate frequent i-pattern set is empty.
Analysis.
Number of scans over the time series. Step 1 of the
algorithm needs to scan the time series S once. Step 2 needs
to scan S up to p − 1 more times in the worst case. Thus the total
number of scans is no more than the period p.
Space needed. (1) At Step 1, suppose there exist a total of
l_j distinct features at position j of the period segments in S,
for 1 ≤ j ≤ p. We need l_1 + ... + l_p units of space to hold
the counts. In the worst case, when every feature is distinct
in the entire time series S, we need n units of space. After Step 1, we
only need |F_1| units of space to keep F_1, the set of frequent
1-patterns in S. (2) At Step 2, the maximum number of
candidate subpatterns that we may generate is 2^{|F_1|}. Considering
that we still need space to keep F_1, the set of frequent 1-patterns,
the total amount of space needed is 2^{|F_1|} + |F_1| in the
worst case in this computation. However, the average case
should be much smaller than the worst case, since if every
feature is distinct in the time series, then there is no need to
find periodic patterns. The existence of any periodicity in
the time series will reduce the memory needed.
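Algorithm 3.1 can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's implementation: a pattern is represented as a frozenset of (position, feature) pairs with one letter per position, the "scans" are simulated by indexing, and all names are ours.

```python
from itertools import combinations

def mine_single_period(series, p, min_conf):
    """Single-period Apriori sketch: returns {pattern: frequency_count}."""
    m = len(series) // p
    threshold = min_conf * m

    def freq(pattern):
        return sum(1 for i in range(m)
                   if all(f in series[i * p + pos] for pos, f in pattern))

    # Step 1: find F1, the frequent 1-patterns (one scan in spirit).
    letters = {(pos, f) for i in range(m) for pos in range(p)
               for f in series[i * p + pos]}
    level = {frozenset([l]) for l in letters
             if freq(frozenset([l])) >= threshold}
    singles = {l for pat in level for l in pat}
    result = {}
    # Step 2: grow i-patterns Apriori-style until no candidate survives.
    while level:
        result.update((pat, freq(pat)) for pat in level)
        k = len(next(iter(level))) + 1
        candidates = set()
        for pat in level:
            for (pos, f) in singles:
                if all(pos != q for q, _ in pat):     # one letter per position
                    cand = pat | {(pos, f)}
                    # Apriori pruning: every (k-1)-subset must be frequent.
                    if all(frozenset(sub) in result
                           for sub in combinations(cand, k - 1)):
                        candidates.add(cand)
        level = {c for c in candidates if freq(c) >= threshold}
    return result
```

Note that each pass over `level` stands in for one scan of the series, which is where the up-to-p scans of the analysis come from.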
3.1.2 Single-period max-subpattern hit set method
Although the Apriori trick may reduce the search space in
partial periodicity mining in a similar way as association
rule mining, it is important to note that the data characteris-
tics in the two cases are very different. In mining association
rules, the number of frequent i-itemsets shrinks quickly
as i increases, because of the sparsity of frequent i-itemsets
in a large transaction database. However, in mining partial
periodicity, very often the number of frequent i-patterns
shrinks slowly (for i ≥ 2) as i increases. The slow speed
of decrease in the number of frequent i-patterns is due to a
strong correlation between the frequencies of patterns and their
subpatterns. We now illustrate this point.
Example 3.1 Suppose we have two frequent 1-patterns, a***
and *b**, such that conf(a***) = 0.9 and conf(*b**) = 0.9,
in a time-series S. Then it must be the case that
0.8 ≤ conf(ab**) ≤ 0.9, as explained below. Since all period
segments that match ab** match both a*** and *b**,
conf(ab**) ≤ 0.9 holds. To derive the other inequality, let ¬a denote
the predicate that a letter is not a, and similarly ¬b. The confidence of ¬a*** in
S is at most 0.1, because conf(a***) + conf(¬a***) = 1. Similarly,
conf(*¬b**) ≤ 0.1. Since conf(ab**) ≥ 1 − conf(¬a***) − conf(*¬b**),
it follows that conf(ab**) ≥ 0.8.
The slow reduction of the set of candidate frequent i-patterns
as i grows makes the Apriori pruning of Algorithm
3.1 less attractive. Is there a better way?
(Footnote: A unit of space is the space needed to hold a feature identifier and its associated count; its size is usually 2-8 bytes, depending on the implementation.)
(Footnote: This is equal to the total space that the time series occupies.)
Obviously, the derivation of the frequent 1-patterns is still
an effective way to dramatically reduce the candidate set
to be examined later, because there are usually only a small
number of features being frequent at a particular position,
though there could be a large number of features appearing at
that position. This is especially true when the average number
of features per position is large. Thus
our discussion will be focused on how to reduce the search
effort after the set of frequent 1-patterns, F_1, is found.
Our key idea is based on the notions of max-patterns and
hit patterns, defined next.
A candidate (frequent) max-pattern, C_max, is the
maximal pattern which can be generated from F_1, the set of
frequent 1-patterns. For example, if the frequent 1-pattern
set is {a****, *b1***, ***d*}, the candidate
max-pattern is ab1*d*. Notice that a position in the candidate
max-pattern may be allowed to have a disjunction of more
than one non-* letter. For example, if the frequent 1-pattern
set is {a****, *b1***, *b2***, ***d*}, the
candidate max-pattern is a{b1,b2}*d*.
Let the L-length of the candidate max-pattern C_max be d.
A subpattern of C_max is hit in a period segment D of S if it
is the maximal subpattern of C_max true in D. For example,
for C_max = a{b1,b2}*d*, the hit subpattern for
the period segment D = {a}{b1}{e}{f}{g} is ab1***,
because ab1*** is true in D and none of its superpatterns,
ab1*d*, a{b1,b2}***, and a{b1,b2}*d*, is true in D. The
hit set, H, of a time series S is the set of all hit subpatterns
of C_max in S.
The usefulness of hit max-patterns is: We can derive the
complete set of partial periodic patterns from the frequency
counts of all the hit maximal subpatterns of C_max. This will
be detailed below.
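Forming the candidate max-pattern from F_1 and finding the hit subpattern of a segment are both simple set operations. The sketch below is illustrative (the representation and names are ours, with a pattern as a list of feature sets and the empty set playing the role of *):

```python
def candidate_max_pattern(f1, p):
    """Union the frequent 1-patterns position by position into C_max."""
    cmax = [set() for _ in range(p)]
    for pos, feature in f1:
        cmax[pos].add(feature)
    return cmax

def hit_subpattern(cmax, segment):
    """The maximal subpattern of C_max that is true in this period segment:
    at each position, keep exactly the C_max letters the segment contains."""
    return [letters & seg for letters, seg in zip(cmax, segment)]

# F1 as in the running example: C_max = a{b1,b2}*d* with period 5.
f1 = [(0, 'a'), (1, 'b1'), (1, 'b2'), (3, 'd')]
cmax = candidate_max_pattern(f1, 5)
hit = hit_subpattern(cmax, [{'a'}, {'b1'}, {'e'}, {'f'}, {'g'}])
# hit corresponds to ab1***: b2 and d are absent from the segment.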
We would like to give an estimate of the buffer size
needed in computation based on the idea of hit patterns.
One upper bound of the buffer size is estimated in terms
of m, the total number of periods in S. The size |H| of
the hit set in a time series S should be no bigger than m,
i.e., |H| ≤ m. This is obvious since each period segment
can generate at most one hit subpattern, even though a hit
subpattern may be hit in more than one period segment. The other
upper bound of the buffer size is estimated in terms of the
maximal number of patterns that can be generated from F_1,
the set of frequent 1-patterns. Since each hit pattern of S
is a subpattern of C_max, which is generated from F_1, and similar
to the analysis performed for Algorithm 3.1, the size of
the set of subpatterns which can be generated from F_1 is
at most 2^{|F_1|}.
Therefore |H|, the size of the hit set in a time series S,
should be no bigger than 2^{|F_1|}. Combining both upper
bounds, we have
Property 3.2 [The bound of hit set] The size of the hit
set is bounded by the formula |H| ≤ min(m, 2^{|F_1|}),
where m is the total number of periods in S, and F_1 is the
set of frequent 1-patterns.
Using this formula, we can calculate the bound of the
maximal buffer size needed in the processing: Given the set
of frequent 1-patterns, F_1, the maximal (additional) buffer
size needed for registering the counts of all the maximal
subpatterns of C_max is min(m, 2^{|F_1|}).
This property is very useful in practice. For example, if
we found 500 frequent 1-patterns when calculating yearly
periodic patterns for 100 years, the buffer size needed is
at most 100; on the other hand, if we found 8 frequent
1-patterns when calculating weekly periodic patterns for 100
years, the buffer size needed is at most 2^8 = 256.
We can always select the smaller of the two bounds in estimating
the maximal buffer size needed in computation.
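As a quick check, the bound of Property 3.2 as reconstructed here, min(m, 2^|F1|), can be evaluated for the two numeric examples above (the function name is ours):

```python
def hit_set_bound(m, f1_size):
    """Upper bound on the hit set size: each of the m period segments
    yields at most one hit subpattern, and every hit subpattern is one
    of at most 2^|F1| combinations of the frequent 1-patterns."""
    return min(m, 2 ** f1_size)

yearly = hit_set_bound(100, 500)      # 100 yearly periods, 500 1-patterns
weekly = hit_set_bound(100 * 52, 8)   # ~5200 weekly periods, 8 1-patterns
```

In the first case the period-count bound m = 100 dominates; in the second, the 2^8 = 256 pattern-count bound does.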
Before turning to our hit-set based algorithm, we examine
the probability distributions of maximal subpatterns of C_max.
Heuristic 3.1 [Popularity of longer subpatterns] The
probability distribution of the maximal subpatterns of C_max
is usually denser for longer subpatterns (i.e., those with L-length
closer to that of C_max) than for the shorter ones.
This heuristic can be observed in Example 3.1. From the
example, conf(ab**) ≥ 0.8, while each of a*** and *b** alone can be
the maximal hit subpattern with probability at most 0.1. In most
cases, the existence of a short max-subpattern indicates
the nonexistence of some non-* letter, which reduces the
chance for the corresponding non-* letter patterns to reach
high confidence. Thus we have the heuristic.
This heuristic implies that the number of nodes in
the tree data structure of the next section is usually small.
It is also useful for efficient buffer management: In order
to reduce the overall cost of access, the longer subpatterns
should be arranged to be more easily accessible (such as kept
in main memory) than the shorter ones.
We now present a main algorithm for mining partial pe-
riodic patterns for a given period, which is based on the
discussions above.
Algorithm 3.2 [Max-subpattern hit-set] Find all the partial
periodic patterns for a given period p in a time-series S,
based on the max-subpattern hit-set, for a given min_conf
threshold.
Method.
1. Scan S once to find F_1, the set of frequent 1-patterns
of period p, using Step 1 of Algorithm 3.1. Form the
candidate max-pattern, C_max, from F_1.
2. Scan S once more. During the scan, for each period
segment, if its hit subpattern is nonempty, do the following: add the
max-subpattern into the hit set buffer (with the associated
count initialized to 1) if it is not already there; otherwise,
increase the count of the max-subpattern by one. The hit
set buffer is implemented in the form of a max-subpattern
tree, a novel data structure to be discussed in Section 4.
3. After the scan, derive the frequent patterns from the hit
set. We will discuss how to implement the finding of the
counts of the hit patterns and how to use these counts to
derive the frequent patterns in Section 4. It turns out that
both can be done efficiently.
Analysis.
Number of scans over the time series. The first step of
the algorithm needs to scan S once. The second step needs
to scan S one more time. Thus the total number of time-series
scans is 2, independent of the period p.
Space needed. (1) The space needed for Step 1 is the
same as in Algorithm 3.1. After Step 1, we need |F_1| units of
space to keep F_1, the set of frequent 1-patterns in S. (2) At
the second step, according to Property 3.2, the total space needed for
the hit set is at most min(m, 2^{|F_1|}), where m is the
total number of periods in S.
In comparison with Algorithm 3.1, Algorithm 3.2 reduces
the total number of scans of the time series from p
(the length of the period) to 2, and it also uses much less
buffer space in the computation in most cases. This can
also be seen from the following observation: Suppose the
hit subpattern for a period segment is a{b1,b2}*d*, which is not
in the hit set yet. We need only one unit of space to register
the string and its count 1. However, for the Apriori
technique, the candidate 2-patterns to be generated will be
ab1***, ab2***, a**d*, *{b1,b2}***, *b1*d*, and *b2*d*; the 3-patterns to
be generated will be a{b1,b2}***, ab1*d*, ab2*d*, and *{b1,b2}*d*; and the
4-pattern will be a{b1,b2}*d* itself; plus we have to update the count
associated with each of them. Thus, it is expected that the
max-subpattern hit set method may have better performance
in most cases. We will compare the performance of the two
algorithms in Section 5.
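The two-scan scheme of Algorithm 3.2 can be sketched compactly with a plain dictionary standing in for the max-subpattern tree of Section 4. This is an illustrative reconstruction (representation and names are ours): a pattern is a tuple of frozensets, with frozenset() playing the role of *.

```python
from itertools import combinations

def mine_hit_set(series, p, min_conf, f1):
    """f1: frequent 1-patterns as (position, feature) pairs (from scan 1).
    Returns {pattern: frequency_count} for all frequent patterns."""
    m = len(series) // p
    cmax = [set() for _ in range(p)]
    for pos, feature in f1:                 # form the candidate max-pattern
        cmax[pos].add(feature)

    # Scan 2: count the maximal hit subpattern of each period segment.
    hits = {}
    for i in range(m):
        segment = series[i * p:(i + 1) * p]
        hit = tuple(frozenset(c & s) for c, s in zip(cmax, segment))
        if any(hit):                        # nonempty hit subpattern
            hits[hit] = hits.get(hit, 0) + 1

    def count(pattern):
        # A pattern's count is the sum over hit subpatterns containing it.
        return sum(n for w, n in hits.items()
                   if all(q <= wq for q, wq in zip(pattern, w)))

    # Derive all frequent patterns from the (small) hit set.
    letters = [(pos, f) for pos in range(p) for f in sorted(cmax[pos])]
    frequent = {}
    for r in range(1, len(letters) + 1):
        for combo in combinations(letters, r):
            pat = [set() for _ in range(p)]
            for pos, f in combo:
                pat[pos].add(f)
            pat = tuple(frozenset(x) for x in pat)
            c = count(pat)
            if c >= min_conf * m:
                frequent[pat] = c
    return frequent
```

The derivation step enumerates subpatterns of C_max naively here; Section 4's max-subpattern tree makes the same derivation efficient.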
3.2 Mining partial periodicity with multiple peri-
ods
Mining partial periodicity for a given period covers a
good set of applications since people often like to mine periodic
patterns for natural periods, such as annually, quarterly,
monthly, weekly, daily, or hourly. However, certain patterns
may appear at some unexpected periods, such as every 11
years, or every 14 hours. It is interesting to provide facilities
to mine periodicity for a range of periods.
To extend partial periodicity mining from one period to
multiple periods, one might wish to extend the idea of Apriori
to computing partial periodicity among different periods,
that is, to use the patterns of small periods p as filters
for candidate patterns of periods of the form kp for
an integer k > 1. This would work if every frequent pattern
of period kp were built from frequent patterns of period p. Unfortunately,
this is not the case. For example, in the time series
obtained by repeating the four feature sets {a}{b}{c}{b},
the pattern *b of period 2 has confidence 1, while
a* and c* each have confidence 1/2. Suppose the confidence
threshold is 0.8. If we use the frequent partial periodic patterns
of period 2 as a filter for candidate partial periodic patterns
of period 4, we will miss the partial periodic pattern
a*c*, whose confidence is 1.
Given that we cannot extend the Apriori “trick” to mul-
tiple periods, one obvious way to mine partial periodic pat-
terns for a range of periods is to repeatedly apply the single-
period algorithm for each period in the range.
Algorithm 3.3 [Looping over single period computation]
Find all the partial periodic patterns for a set of periods
in a given range of interest, p_low ≤ p ≤ p_up, in the time-series S,
with the given min_conf threshold.
Method.
1. For each period p in the range of interest (i.e.,
p_low ≤ p ≤ p_up), apply Algorithm 3.2 ("max-subpattern hit-set")
on period p.
Analysis.
Number of scans over the time series. Since each period
takes 2 scans of the time series, the total number of scans
of the time series is 2(p_up − p_low + 1).
Space needed. For computing partial periodicity for periods
from p_low to p_up, the space required is basically the sum of the
space for each p. Notice that the space required for the initial
Step 1 computation is still at most n units in the worst case, since
the space once used in the computation for a period p can be
reinitialized and reused for computing other periods. But
we need in total Σ_p |F_1^p| units of space to keep the different
sets of frequent 1-patterns, where F_1^p is the set of
frequent 1-patterns in S derived for period p. Similarly, it
takes at most Σ_p min(m_p, 2^{|F_1^p|}) units of space to
compute all of them, where m_p is the total number of periods in
S for period p.
Algorithm 3.3 provides an iterative method for mining
partial periodicity for multiple periods. However, when the
number of periods is large, we still need a good number
of scans to mine periodicity for multiple periods. An
improvement to the above method is to maximally explore the
mining of periodicity for multiple periods in the same scan,
which leads to the shared mining of periodicity for multiple
periods, as illustrated below.
Algorithm 3.4 [Shared mining of multiple periods]
Shared mining of all the partial periodic patterns for a set
of periods in a given range of interest, p_low ≤ p ≤ p_up, in time-series
S, with the given min_conf threshold.
Method.
1. Scan S once; for all periods p in the range of interest, do
the same as Step 1 in Algorithm 3.2.
That is, for all periods p in the range of interest (i.e.,
p_low ≤ p ≤ p_up), find F_1^p, the set of frequent 1-patterns of
period p, using the same Step 1 as in Algorithm 3.1.
For each set F_1^p of frequent 1-patterns of period p, form the
candidate max-pattern, C_max^p, from F_1^p.
2. Scan S once more; for all periods p in the range of interest, do
the same as Step 2 in Algorithm 3.2, a similar process which
will not be explained in detail.
Analysis.
Number of scans over the time series. The first step of
the algorithm needs to scan S once. The second step needs
to scan S one more time. Thus the total number of time-series
scans is 2, independent of the periods.
Space needed. The total space required in the worst case
is the same as in Algorithm 3.3.
Algorithm 3.4 explores shared processing in mining partial
periodicity for multiple periods. The advantage of the
method is that we only need two scans of the time series for
mining partial periodicity for multiple periods. The overhead
of the method is that although it reduces the number
of scans to 2, it requires more space in the processing
of each scan than the multiple-scan method, because it
needs to register the corresponding counts for each period
p (for p_low ≤ p ≤ p_up). However, since the shared features will
share the space as well (with counts incremented), and there
should be many shared features in periodicity search (otherwise,
why mine periodicity?), the space required will
hardly approach the worst case. Therefore, it should still be
an efficient method in many cases for mining partial periodicity
with multiple periods.
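The shared first scan of Algorithm 3.4 can be sketched as follows; the per-period buffers keyed by p and the function name are our illustration, not the paper's code.

```python
def shared_first_scan(series, periods):
    """One pass over the series updates the 1-pattern counts for every
    period p in `periods` at once: counts[p] maps (position, feature)
    to its frequency count over whole period segments."""
    counts = {p: {} for p in periods}
    for t, features in enumerate(series):
        for p in periods:
            if t < (len(series) // p) * p:   # skip the trailing partial period
                pos = t % p                  # position of instant t within period p
                for f in features:
                    key = (pos, f)
                    counts[p][key] = counts[p].get(key, 0) + 1
    return counts
```

Thresholding each counts[p] by min_conf × ⌊n/p⌋ then yields F_1^p for every period in the range after a single scan.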
4 Derivation of all partial patterns
In this section, we examine the implementation consid-
erations of our proposed algorithms. Algorithm 3.1 is an
Apriori-like algorithm which can be implemented similarly
to other Apriori-like algorithms for mining association rules
(e.g., [2]). Algorithm 3.2 forms the basis for all three
remaining algorithms and requires new tricks to achieve
efficiency, and thus our discussion is focused on its efficient
implementation.
Algorithm 3.2 consists of two steps: Step 1, scan the
time series once and find the frequent 1-pattern set F_1; and
Step 2, scan the time series one more time, collect the set
of the max-subpatterns hit in S, and derive the set of frequent
patterns. The implementation of Step 1 is straightforward
and has been discussed in the presentation of Algorithm 3.1.
However, Step 2 is nontrivial and needs a
good data structure to facilitate the storage of the set of max-subpatterns
hit in S and the derivation of the set of frequent
patterns.
A new data structure, called max-subpattern tree, is de-
signed to facilitate the registration of the hit count of each
max-subpattern and derivation of the set of frequent pat-
terns, as illustrated in Figure 1. Its design is now outlined.
The max-subpattern tree takes the candidate max-pattern
C_max as the root node, where each subpattern of C_max with
one non-* letter missing is a direct child node of the root.
The tree expands recursively, according to the following
rules. A node w, if containing more than 2 non-* letters,
may have a set of children, each of which is a subpattern of
w with one more non-* letter missing. Notice that a node
containing only 2 non-* letters will not have any children,
since every frequent 1-pattern is already counted in F_1. Importantly,
we do not create a node unless the node itself or some descendant
containing more than 1 non-* letter is hit in S.
Each node has a "count" field (which registers the number
of hits of the current node), a parent link (which is nil for
the root), and a set of child links; each child link points to a
child and is labeled by the corresponding missing letter.
A link can be nil when the corresponding child has not been
hit.
Notice that a non-* position of a max-subpattern
in a max-subpattern tree may contain a set of letters, which
matches the set of letters at that position in a period segment.
For example, for C_max = a{b1,b2}*d*, the max-subpattern
of the period segment {a,e}{b1,b2,f}{g}{h}{i} is a{b1,b2}***,
and the segment will contribute one count to this node.
The update of the max-subpattern tree is performed as
follows.

Algorithm 4.1 [Insertion in the max-subpattern tree]
Insert a max-subpattern found during the scan of the time series into
the max-subpattern tree T.

Method.

1. Starting from the root of the tree, find the corresponding
node by checking the missing non-* letters in order.
For example, for a max-subpattern *b1*d* in a tree with
the root a{b1,b2}*d*, there are two letters, a and b2,
missing. The node can be found by (1) following
the ~a link (marked as "~a" in Figure 1) to *{b1,b2}*d*,
and then (2) following the ~b2 link to *b1*d*, as shown
in Figure 1.
2. If the node is found, increase its count by 1. Otherwise,
create a new node (with count 1) and its missing
ancestor nodes (only those on the path to it, with count
0), if any, and insert it (or them) into the corresponding
place(s) of the tree; we show such a node using a dotted box in Figure 1.

For example, if the very first max-subpattern hit in the series
is *b1*d*, we will create the node *b1*d* (with count 1), after creating two
ancestor nodes (with count 0): a{b1,b2}*d* (which
is the root of the tree), and *{b1,b2}*d* (which is
the root's child, following the ~a link). The node *b1*d* is then
*{b1,b2}*d*'s child, following the ~b2 link.

[Figure 1 here: a max-subpattern tree with root a{b1,b2}*d*; level-1 nodes
*{b1,b2}*d* (via ~a), ab2*d* (via ~b1), ab1*d* (via ~b2), and a{b1,b2}*** (via ~d);
level-2 nodes *b1*d*, *b2*d*, a**d*, *{b1,b2}***, ab1***, and ab2***;
node counts include 10, 0, 8, 18, 5, 19, 0, 2, 32, 40, and 50.]

Figure 1. A max-subpattern tree to store the set of
max-subpatterns hit in the time-series.
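The insertion procedure above can be sketched in a few lines; the Node class, the dictionary of ~letter links, and the example calls are our own illustration, assuming a node is addressed by the ordered list of its missing non-* letters:

```python
# Sketch of Algorithm 4.1: insertion into the max-subpattern tree.
# A node is reached from the root by following, in position order, one
# "~letter" link per missing non-* letter; absent ancestors get count 0.

class Node:
    def __init__(self):
        self.count = 0
        self.children = {}        # missing letter -> child Node

def insert(root, missing):
    """Walk the links named by the missing letters, creating absent
    nodes along the path with count 0, then count one hit at the end."""
    node = root
    for letter in missing:
        node = node.children.setdefault(letter, Node())
    node.count += 1

root = Node()
insert(root, ['a', 'b2'])    # first hit of *b1*d* (missing a, then b2)
insert(root, ['a', 'b2'])    # second hit of the same node
insert(root, ['b1'])         # a hit of ab2*d* (missing only b1)
print(root.children['a'].count)                  # ancestor created with count 0
print(root.children['a'].children['b2'].count)   # the node itself, count 2
```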
Analysis. Let the total number of non-* letters in C_max be m. For
a max-subpattern containing k (k <= m) non-* letters,
we need to follow m - k links to find the node and create
at most m - k new nodes in the worst case. Therefore,
the time complexity of node search and node creation
is less than O(m). Also, since each insertion of a max-subpattern
creates either no new node (when it hits an existing one) or
fewer than m nodes, the total number of nodes in the tree
is less than m × n, where n is the size of the hit set.

In general, to insert a subpattern we need to both locate
its position and update the count of the node if the node is
found, or otherwise insert one or several new nodes.
Example 4.1 Let Figure 1 be the current max-subpattern
tree T. To insert a (max-)subpattern *b1*d* into the tree, we
search the tree starting with the root, a{b1,b2}*d*.
The first non-* letter missing is a and the second non-*
letter missing is b2. Thus we first follow the ~a branch to
node *{b1,b2}*d*, and then follow the ~b2 branch. Since the node
*b1*d* is located, its count is incremented by 1.
Before discussing the derivation of the set of frequent
patterns, we need to introduce the concept of reachable ancestors.
Since the traversal and creation of the children of a
node in the max-subpattern tree follow the non-* letter position
order, some of the ancestor nodes of a node may not
be directly linked to it. For example, in Figure 1, the
node *b1*d* is linked to only one parent, *{b1,b2}*d*, but not
the other, ab1*d* (note: this missing link is marked by a
dashed line in the figure).
In general, the set of reachable ancestors of a node w
in a max-subpattern tree T is the set of all the nodes in T
which are proper superpatterns of w. It can be computed as
follows: (1) derive the list L of missing letters of w based
on C_max, which is roughly the position-wise difference; (2)
the set of linked ancestors consists of those patterns whose
missing letters form a proper prefix of L; and (3) the set
of not-linked ancestors consists of those patterns whose missing
letters form a proper sublist (but not a prefix) of L.
Example 4.2 We compute the set of reachable ancestors
for a node *b1*** in a max-subpattern tree with root
a{b1,b2}*d*. The list of missing non-* letters
is (a, b2, d). Thus, the set of linked ancestors is (1) a{b1,b2}*d* (missing
nothing, which is the root); (2) *{b1,b2}*d* (i.e., missing a); and (3) *b1*d* (i.e., missing a, then
missing b2). The set of not-linked ancestors is: ab1*d* (corresponding to the missing-letter
pattern (b2)), a{b1,b2}*** (corresponding to (d)), *{b1,b2}***
(corresponding to (a, d)), and ab1*** (corresponding to (b2, d)).

In other words, one can follow the links whose marks are
missing letters of the node, in order (to avoid visiting the same node more than
once), and collect all the superpattern nodes reached in T.
Essentially there is a tree traversal for each fixed pattern,
except that we do not visit a node and its descendants if the
node is not an ancestor pattern of our current pattern.
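Under the same addressing scheme (a node named by its ordered missing-letter list relative to C_max), the linked and not-linked ancestors can be enumerated as the proper prefixes and the remaining proper sublists of that list; this is an illustrative sketch, not the paper's code:

```python
# Sketch: enumerate the reachable ancestors of a node named by its ordered
# missing-letter list. Proper prefixes are the linked ancestors; the other
# proper sublists are the not-linked ones.

from itertools import combinations

def reachable_ancestors(missing):
    """Return (linked, not_linked) ancestors as missing-letter tuples."""
    n = len(missing)
    prefixes = {tuple(missing[:i]) for i in range(n)}     # proper prefixes
    sublists = set()
    for k in range(n):                                    # proper sublists
        sublists.update(combinations(missing, k))
    order = lambda t: (len(t), t)
    return sorted(prefixes, key=order), sorted(sublists - prefixes, key=order)

# Node *b1***, whose missing letters w.r.t. C_max = a{b1,b2}*d* are (a, b2, d)
linked, not_linked = reachable_ancestors(('a', 'b2', 'd'))
print(linked)       # (): the root; ('a',): *{b1,b2}*d*; ('a','b2'): *b1*d*
print(not_linked)   # ab1*d*, a{b1,b2}***, *{b1,b2}***, ab1***
```

This reproduces the three linked and four not-linked ancestors of Example 4.2.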
The derivation of the frequent i-patterns is performed as
follows.

Algorithm 4.2 [Derivation of frequent patterns from the
max-subpattern tree] Derive the frequent i-patterns
for all i > 1, given a max-subpattern tree T, by an
Apriori-like technique.

Method.

1. The set of frequent 1-patterns is derived in the first
scan of Algorithm 3.2.

2. The max-subpattern tree T is derived in the second scan
of Algorithm 3.2. The set of frequent i-patterns (i > 1)
is derived as follows.

for i := 2 to the maximal pattern length do:
  derive the candidate patterns of length i from the frequent
  patterns of length i - 1 by an Apriori-like join;
  scan tree T to find the frequency counts of these candidate
  patterns and eliminate the non-frequent ones.
  Notice that the frequency count of a node is the sum
  of the count of itself and those of all of its reachable
  ancestors. If the derived frequent i-pattern set
  is empty, return.
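The counting rule in step 2, that a pattern's frequency count is its own node count plus those of all its reachable ancestors, can be sketched as follows, with the tree flattened into a map from missing-letter lists to counts; the counts reuse the Figure 1 running example and all names are illustrative:

```python
# Sketch of the counting step of Algorithm 4.2: a pattern's frequency count
# is the sum of the counts of every tree node whose missing-letter list is a
# sublist of the pattern's own (its reachable ancestors plus itself).

def is_sublist(short, long):
    """True if `short` is an order-preserving sublist of `long`."""
    it = iter(long)
    return all(x in it for x in short)

def frequency_count(missing, tree_counts):
    """tree_counts maps a node's missing-letter tuple to its hit count."""
    return sum(c for m, c in tree_counts.items() if is_sublist(m, missing))

counts = {(): 10,           # root a{b1,b2}*d*
          ('a',): 0,        # *{b1,b2}*d*
          ('b2',): 50,      # ab1*d*
          ('a', 'b2'): 8}   # *b1*d*

print(frequency_count(('a', 'b2'), counts))   # 10 + 0 + 50 + 8 = 68
print(frequency_count(('b2',), counts))       # ab1*d*: 10 + 50 = 60
```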
Analysis. Let the total number of non-* letters in C_max be m. As
shown in the analysis of Algorithm 4.1, the time complexity
for searching a node is less than O(m). Since there are at most
2^m nodes that can be generated from the max-subpattern tree
(including all the missing descendants), and there are at
most m × n reachable ancestors in T, where n is the size of
the hit set, the worst-case time complexity for the derivation of
all the frequent patterns is O(m × n × 2^m), i.e., proportional
to m and the size of the hit set, but exponential in m
(i.e., proportional to the size of the tree that can be generated
from C_max). Since an infrequent node reduces the number
of candidates to be generated in later rounds, the real
processing cost is usually much smaller than the cost in the
worst case.
We illustrate how to derive the frequent i-patterns for
i > 1 from the max-subpattern tree in the following example.

Example 4.3 Let Figure 1 be the derived max-subpattern
tree T, under a given minimum support threshold. We can traverse the max-subpattern
tree to find all the frequent i-patterns for i > 1
as follows. Starting at level 2, all six nodes are frequent,
with counts 68, 68, 47, 119, 92, and 84. We show
the derivation of the count of *b1*d* (68) here: since the list of missing
letters of this node is (a, b2), its set of reachable ancestors
is {a{b1,b2}*d*, *{b1,b2}*d*, ab1*d*}, and thus its frequency count = 10 + 0 + 50 +
8 (itself) = 68. Since level 2 has no infrequent nodes, we
search all the nodes at level 1 and find the following frequent
patterns: {ab1*d* (60), ab2*d* (50)}. Since level 1 contains
an infrequent node, level 0 (the root) has no frequent patterns.
Notice that although we only saved one node's computation in this
case, the saving will be much greater when the tree is large and there
are more missing nodes.
From the above example, one can see that many
frequent i-patterns with small i can be generated from
a max-subpattern tree. In practical applications, people may
only be interested in the set of maximal frequent patterns
instead of all frequent patterns, where the set of maximal frequent
patterns is the subset of the frequent-pattern set such that
every other frequent pattern is a subpattern of some element
of it. For example, if the set of frequent patterns is
{ab**, a*c*, a***, *b**, **c*}, the set of maximal
frequent patterns is {ab**, a*c*}.

If a user is interested in deriving the set of maximal frequent
patterns, the MaxMiner algorithm developed by Bayardo [4] is a good candidate.
The success of this algorithm stems from generating new candidates by joining frequent
itemsets and looking ahead. However, it still requires scanning
the time series multiple times in the worst case. Combining the
max-subpattern hit set method with MaxMiner removes
this problem and makes it more efficient than pure
MaxMiner. The details of the new method will be examined
in future research.
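For illustration, extracting the maximal members from a given frequent-pattern set reduces to a pairwise subpattern test; the representation (positions as sets of letters, the empty set as *) and all names are our own sketch:

```python
# Sketch: keep only the maximal frequent patterns, i.e. those that are not
# proper subpatterns of another frequent pattern. Positions are sets of
# letters; the empty set plays the role of '*'.

def is_subpattern(p, q):
    """Position-wise containment of p's letters in q's letters."""
    return all(a <= b for a, b in zip(p, q))

def maximal(patterns):
    return [p for p in patterns
            if not any(p != q and is_subpattern(p, q) for q in patterns)]

S = frozenset                           # shorthand for building positions
pats = [(S('a'), S(), S('d')),          # a*d
        (S('a'), S(), S()),             # a**  (subsumed by a*d)
        (S(), S('b'), S())]             # *b*
print(maximal(pats))                    # a*d and *b* survive
```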
5 Performance study
In this section we report a performance study which
compares the performance of the periodicity mining algo-
rithms proposed in this paper. In particular, we give a per-
formance comparison between the single-period Apriori
algorithm (Algorithm 3.1) (or simply called Apriori), and
the max-subpattern hit-set algorithm (Algorithm 3.2) (or
simply hit-set) applied to a single period.
This comparison indicates a significant efficiency gain by
max-subpattern hit-set over Apriori. Since the gain is even
larger when max-subpattern hit-set is applied to multiple
periods, it is clearly the winner.
The performance study was conducted on a Pentium 166
machine with 64 megabytes of main memory, running Windows/NT.
The programs were written in Microsoft Visual C++.
5.1 Testing Databases
Each test time series is a synthetic time-series databases
generated using a randomized periodicity data generation
algorithm. From a set of features, potentially frequent 1-
patterns are composed. The size of the potentially frequent
1-patterns is determined based on a Poisson distribution.
These patterns are generated and put into the time-series
according to an exponential distribution.
LENGTH          the length of the time series
p               a period
MAX-PAT-LENGTH  the maximal length of frequent patterns
|F1|            the number of frequent 1-patterns

Table 1. Parameters of synthetic time series
The basic parameters used to generate the synthetic
databases are listed in Table 1. The parameters LENGTH
(the length of the time series) and p (a period) are chosen
independently. The parameters MAX-PAT-LENGTH (the maximal
length of frequent patterns) and |F1| (the number
of frequent 1-patterns) are for a fixed p, and they are controlled
by the choice of an appropriate confidence threshold.
We found that other parameters, such as the number of
features occurring at a fixed position and the number of features
in the time series, do not have much impact on the
performance results, and thus they are not considered in the
tests.
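A rough sketch of such a generator, under our own simplifying assumptions (a single planted periodic feature, exponentially distributed gaps between injections, and uniform background noise; all names and parameter values are illustrative, not the paper's generator):

```python
# Sketch of a randomized periodic-data generator: plant one periodic feature
# 'P' at offset 0 of selected periods, with exponentially distributed gaps
# between injections, over a background of uniformly chosen noise features.

import random

def gen_series(length, period, features, rate=0.3, seed=7):
    rng = random.Random(seed)
    series = [{rng.choice(features)} for _ in range(length)]   # noise slots
    t = 0
    while t < length:
        series[t].add('P')                       # inject the periodic feature
        gap = max(1, round(rng.expovariate(rate) / period)) * period
        t += gap                                 # skip a whole number of periods
    return series

s = gen_series(length=40, period=5, features=list('uvwxyz'))
# every injection lands at offset 0 of some period
print(all(t % 5 == 0 for t, slot in enumerate(s) if 'P' in slot))
```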
5.2 Performance comparison of the algorithms
Figure 2 shows there is a significant efficiency gain by
max-subpattern hit-set over Apriori. In this figure, the
maximal pattern length (the maximal length of frequent
partial periodic patterns) grows from 2 to 10. The other
parameters, the period and the number of frequent 1-patterns,
are kept constant. We ran two sets of tests, one with the length
of the time series being 100,000 and the other 500,000. As
we can see, the running time of max-subpattern hit-set
is almost constant for both cases, while that of Apriori grows
almost linearly. At the largest MAX-PAT-LENGTH, the gain by
max-subpattern hit-set over Apriori is about a factor of two. We
expect this gain to increase for larger MAX-PAT-LENGTH.
[Figure 2 here: running time in seconds (0 to 7000) versus Max-Pat-Length
(2 to 10), with four curves: Apriori 500k, Apriori 100k, HitSet500k, and
HitSet100k.]

Figure 2. Performance gain when MAX-PAT-LENGTH increases
(other parameters held constant).
It is important to note that the gain shown in Figure 2 is
obtained by keeping everything in memory, and by considering
only one period. In general, this is unlikely to be the case,
and max-subpattern hit-set will perform even better than
Apriori, for the following reasons:

In general, the time series of features may need to be
stored on disk, due to factors such as each point in time containing
up to thousands of features and the time series being long.
When the time series is stored on disk,
there would be a large amount of extra disk I/O associated
with Apriori, but not with max-subpattern hit-set,
since the latter only requires two scans. Even when the
time series is not stored on disk, Apriori will need
to go over this huge sequence many more times than
max-subpattern hit-set. Thus max-subpattern hit-set
will be far better than Apriori.
When there is a range of periods to consider,
max-subpattern hit-set can find all frequent patterns
in two scans, but Apriori will require many more
scans, depending on the number of periods and the
length of the maximal frequent patterns. Hence
max-subpattern hit-set will again be far better than
Apriori.
6 Conclusions
We have studied efficient methods for mining partial periodicity
in time series databases. Partial periodicity, which
associates periodic behavior with only a subset of all the
time points, is less restrictive than full periodicity and thus
covers a broad class of applications.
By exploring several interesting properties related to partial
periodicity, including the Apriori property, the max-subpattern
hit set property, and shared mining of multiple
periods, a set of partial periodicity mining algorithms is
proposed, and their relative performance is compared. Our
study shows that the max-subpattern hit set method, which
needs only two scans of the time series database, even for
mining multiple periods, offers excellent performance.
Our study has been confined to mining partial periodic
patterns in one time series of categorical data with a single
level of abstraction. However, the method developed
here can be extended to mining multiple-level, multiple-dimensional
partial periodicity and to mining partial periodicity
with perturbation and evolution.
For mining numerical data, such as stock or power consumption
fluctuations, one can examine the distribution of
numerical values in the time-series data and discretize them
into single- or multiple-level categorical data. For mining
multiple-level partial periodicity, one can explore level-shared
mining by first mining the periodicity at a high level,
and then progressively drilling down with the discovered
periodic patterns to see whether they are still periodic at a
lower level.
Perturbation may occur from period to period, which
can make it difficult to discover partial periodicity in many
applications. For mining partial periodicity with perturbation,
one method is to slightly enlarge the time slot to be
examined. Partial periodic patterns with minor perturbation
are likely to be caught in the generalized time slot. Another
method is to include the features happening in the time slots
surrounding the one being analyzed. We can further employ
regression techniques to reduce the noise of perturbation.
There are still many issues regarding partial periodicity
mining that deserve further study, such as further exploration
of shared mining of periodicity with multiple
periods, mining periodic association rules based on partial
periodicity, and query- and constraint-based mining of partial
periodicity [11]. We are studying these problems and
implementing our algorithms for mining partial periodicity
in a data mining system, and will report our progress in the
future.
References
[1] R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zait. Query-
ing shapes of histories. In Proc. 21st Int. Conf. Very Large
Data Bases, pages 502–514, Zurich, Switzerland, Sept.
1995.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining as-
sociation rules. In Proc. 1994 Int. Conf. Very Large Data
Bases, pages 487–499, Santiago, Chile, September 1994.
[3] R. Agrawal and R. Srikant. Mining sequential patterns. In
Proc. 1995 Int. Conf. Data Engineering, pages 3–14, Taipei,
Taiwan, March 1995.
[4] R. J. Bayardo. Efficiently mining long patterns from
databases. In Proc. 1998 ACM-SIGMOD Int. Conf. Manage-
ment of Data, pages 85–93, Seattle, Washington, June 1998.
[5] C. Bettini, X. Sean Wang, and S. Jajodia. Mining temporal
relationships with multiple granularities in time sequences.
Data Engineering Bulletin, 21:32–38, 1998.
[6] J. Han and Y. Fu. Discovery of multiple-level associa-
tion rules from large databases. In Proc. 1995 Int. Conf.
Very Large Data Bases, pages 420–431, Zurich, Switzerland,
Sept. 1995.
[7] J. Han, W. Gong, and Y. Yin. Mining segment-wise periodic
patterns in time-related databases. In Proc. 1998 Int’l Conf.
on Knowledge Discovery and Data Mining (KDD’98), New
York City, NY, August 1998.
[8] H. J. Loether and D. G. McTavish. Descriptive and Inferen-
tial Statistics: An Introduction. Allyn and Bacon, 1993.
[9] H. Lu, J. Han, and L. Feng. Stock movement and n-
dimensional inter-transaction association rules. In Proc.
1998 SIGMOD Workshop on Research Issues on Data Min-
ing and Knowledge Discovery (DMKD’98), pages 12:1–
12:7, Seattle, Washington, June 1998.
[10] H. Mannila, H. Toivonen, and A. I. Verkamo. Discover-
ing frequent episodes in sequences. In Proc. 1st Int. Conf.
Knowledge Discovery and Data Mining, pages 210–215,
Montreal, Canada, Aug. 1995.
[11] R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Ex-
ploratory mining and pruning optimizations of constrained
associations rules. In Proc. 1998 ACM-SIGMOD Int. Conf.
Management of Data, pages 13–24, Seattle, Washington,
June 1998.
[12] B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic as-
sociation rules. In Proc. 1998 Int. Conf. Data Engineering
(ICDE'98), pages 412–421, Orlando, FL, Feb. 1998.
... Extensive experimentation: To the best of our knowledge, no previous work has compared, empirically, the performance of the most cited algorithms based on association rules, such as Apriori [9], MS-Apriori [10], FP-Growth [11], PPA [12], and Max-Subpattern [11]. Thus, we have conducted a comprehensive comparison of these algorithms over two STDBs-first, a synthetic one, then a real one. ...
... Extensive experimentation: To the best of our knowledge, no previous work has compared, empirically, the performance of the most cited algorithms based on association rules, such as Apriori [9], MS-Apriori [10], FP-Growth [11], PPA [12], and Max-Subpattern [11]. Thus, we have conducted a comprehensive comparison of these algorithms over two STDBs-first, a synthetic one, then a real one. ...
... Han et al. [11] proposed the Max-Subpattern Hit-Set algorithm, often referred to simply as Max-Subpattern. They based their development on a custom data structure called a max-subpattern tree to efficiently generate larger partial periodic patterns from combinations of smaller patterns. ...
Article
Full-text available
Deriving insight from data is a challenging task for researchers and practitioners, especially when working on spatio-temporal domains. If pattern searching is involved, the complications introduced by temporal data dimensions create additional obstacles, as traditional data mining techniques are insufficient to address spatio-temporal databases (STDBs). We hereby present a new algorithm, which we refer to as F1/FP, and can be described as a probabilistic version of the Minus-F1 algorithm to look for periodic patterns. To the best of our knowledge, no previous work has compared the most cited algorithms in the literature to look for periodic patterns—namely, Apriori, MS-Apriori, FP-Growth, Max-Subpattern, and PPA. Thus, we have carried out such comparisons and then evaluated our algorithm empirically using two datasets, showcasing its ability to handle different types of periodicity and data distributions. By conducting such a comprehensive comparative analysis, we have demonstrated that our newly proposed algorithm has a smaller complexity than the existing alternatives and speeds up the performance regardless of the size of the dataset. We expect our work to contribute greatly to the mining of astronomical data and the permanently growing online streams derived from social media.
... Useful information that can facilitate the users to achieve socio-economic development lies hidden in this data. Han et al. [1] first introduced a partial periodic pattern model to discover all periodic regularities in a (binary) multiple time series. Since then, the problem of finding these patterns has received considerable attention [2]- [6]. ...
... Thus, there is value in finding these patterns in applications. However, finding F3Ps in a QIMTS is a non-trivial task due to the following reasons: 1) Since a pattern in a series represents an ordered sequence of itemsets, the search space of partial periodic pattern mining is per i=1 n i , where n represents the total number of distinct objects, and per represents the total number of distinct timestamps (or period [1]) in a sequence. We must investigate new techniques to reduce this colossal search space and the computational cost-effectively. ...
... Han et al. [1] extended the basic frequent itemset model [17] to find partial periodic patterns in binary regular time series data. This model discovers all periodic patterns in time series data that satisfy the user-specified minimum support (minSup). ...
... This task is essential in many applications, including finance, weather forecasting, and healthcare, where recurring patterns can provide valuable insights into the underlying processes. To address this problem, researchers [22][23][24][25][26][27] commonly use a two-step approach. ...
... Time series analysis is a powerful tool for understanding patterns and trends in sequential data. However, several traditional time series analysis approaches [22][23][24][25][26][27][28] to time treat data as a symbolic sequence and may overlook important temporal information within the sequence. For example, consider a financial time series where the price of a stock is recorded at regular intervals over time. ...
Article
Full-text available
Partial periodic pattern (3P) mining is a vital data mining technique that aims to discover all interesting patterns that have exhibited partial periodic behavior in temporal databases. Previous studies have primarily focused on identifying 3Ps only in row temporal databases. One can not ignore the existence of 3Ps in columnar temporal databases as many real-world applications, such as Facebook and Adobe, employ them to store their big data. This paper proposes an efficient single database scan algorithm, Partial Periodic Pattern-Equivalence Class Transformation (3P-ECLAT), to identify all 3Ps in a columnar temporal database. The proposed algorithm compresses the given database into a novel list-based data structure and mines it recursively to find all 3Ps. The 3P-ECLAT leverages the “downward closure property” and “depth-first search technique” to reduce the search space and the computational cost. Extensive experiments have been conducted on synthetic and real-world databases to demonstrate the efficiency of the 3P-ECLAT algorithm. The memory and runtime results show that 3P-ECLAT outperforms its competitor considerably. Furthermore, 3P-ECLAT is highly scalable and is superior to the previous approach in handling large databases. Finally, to demonstrate the practical utility of our algorithm, we provide two real-world case studies, one on analyzing traffic congestion during disasters and another on identifying the highly polluted areas in Japan.
... To address this problem, a common approach used in the literature [44,45,46,47,48,49] is a two-step model. The first step involves partitioning the time series data into distinct subsets or period segments of fixed length or period. ...
... Time series analysis studies often treat time series data as a symbolic sequence, neglecting the temporal information of events within the sequence [44,45,46,47,48,49,50]. However, this approach can be limiting as it may disregard important insights that could be derived from analyzing the temporal aspect of events. ...
Article
Full-text available
Periodic frequent-pattern mining (PFPM) is a vital knowledge discovery technique that identifies periodically occurring patterns in a temporal database. Although traditional PFPM algorithms have many applications, they often produce a large set of periodic-frequent patterns (PFPs) in a database. As a result, analyzing PFPs can be very time-consuming for users. Moreover, a large set of PFPs makes PFPM algorithms less efficient regarding runtime and memory consumption. This paper handles this problem by proposing a novel model of closed 1 Springer Nature 2021 L A T E X template 2 Article Title periodic-frequent patterns (CPFPs) found in databases. CPFPs are less expensive to mine because they represent a concise and lossless subset uniquely describing the entire set of PFPs. We also present an efficient depth-first search algorithm, called Closed Periodic-Frequent Pattern-Miner (CPFP-Miner), to discover the patterns. The proposed algorithm utilizes the weighted ordering of the patterns concept to reduce the patterns' search space. On the other hand, the current periodicity concept is also applied to prune aperiodic patterns from the search space. Extensive experiments on both real-world and synthetic databases demonstrate that the CPFP-Miner algorithm is efficient. It outperforms the state-of-the-art algorithms regarding run-time requirements, memory consumption, and energy consumption on several real-world and synthetic databases. Additionally, the scalabil-ity of the CPFP-Miner algorithm is demonstrated to be more effective and productive than the state-of-the-art algorithms. Finally, we present two case studies to show the functionality of the proposed patterns.
... In this paper we modelled the recurrency by forcing the segments to share their parameters. An alternative approach to discover recurrency is to look explicitly for recurrent patterns (Ozden et al., 1998;Han et al., 1998Han et al., , 1999Ma & Hellerstein, 2001;Yang et al., 2003;Galbrun et al., 2019). We should point out that these works are not design to work with graphs; instead they work with event sequences. ...
Article
Full-text available
A popular approach to model interactions is to represent them as a network with nodes being the agents and the interactions being the edges. Interactions are often timestamped, which leads to having timestamped edges. Many real-world temporal networks have a recurrent or possibly cyclic behaviour. In this paper, our main interest is to model recurrent activity in such temporal networks. As a starting point we use stochastic block model, a popular choice for modelling static networks, where nodes are split into R groups. We extend the block model to temporal networks by modelling the edges with a Poisson process. We make the parameters of the process dependent on time by segmenting the time line into K segments. We require that only $$H \le K$$ H ≤ K different set of parameters can be used. If $$H < K$$ H < K , then several, not necessarily consecutive, segments must share their parameters, modelling repeating behaviour. We propose two variants where a group membership of a node is fixed over the course of entire time line and group memberships are allowed to vary from segment to segment. We prove that searching for optimal groups and segmentation in both variants is NP -hard. Consequently, we split the problem into 3 subproblems where we optimize groups, model parameters, and segmentation in turn while keeping the remaining structures fixed. We propose an iterative algorithm that requires $$\mathcal {O} \left( KHm + Rn + R^2\,H\right)$$ O K H m + R n + R 2 H time per iteration, where n and m are the number of nodes and edges in the network. We demonstrate experimentally that the number of required iterations is typically low, the algorithm is able to discover the ground truth from synthetic datasets, and show that certain real-world networks exhibit recurrent behaviour as the likelihood does not deteriorate when H is lowered.
... Inspired by Ozden's work [24], Han et al. [25] described a model to find partial periodic patterns in an evenly spaced binary time series. Later, the authors proposed an efficient algorithm [26] to discover the partial periodic patterns. In this model, a binary series is split into multiple sequences of a particular length specified by the user, and interesting patterns were discovered using only the minSup threshold value. ...
Article
Full-text available
Periodic-frequent patterns are a vital class of regularities in a temporal database. Most previous studies followed the approach of finding these patterns by storing the temporal occurrence information of a pattern in a list. While this approach facilitates the existing algorithms to be practicable on sparse databases, it also makes them impracticable (or computationally expensive) on dense databases due to increased list sizes. A renowned concept in set theory is larger the set, the smaller its complement will be. Based on this conceptual fact, this paper explores the complements, redefines the periodic-frequent pattern and proposes an efficient depth-first search algorithm called PFPM-C, that finds all periodic-frequent patterns by storing only non-occurrence information of a pattern in a database. Experimental results on several databases demonstrate that our algorithm is efficient.
... Inspired by Ozden's work, Han et al. [23] described a model to find partial periodic patterns in an evenly spaced binary time series. Later, the authors proposed an efficient algorithm [24] to discover the partial periodic patterns. In this model, a binary time series database is split into multiple subsequence databases of a particular length specified by the user, and interesting patterns were discovered using only the minSup threshold value. ...
Article
Full-text available
Partial Periodic Pattern Mining (3PM) is a key knowledge discovery technique with many applications. It involves discovering all patterns that have exhibited partial periodic behavior in a temporal database. Unfortunately, the widespread adoption of this technique has been hindered by the following two limitations: (i) the rare item problem, which involves either missing the patterns containing rare items or producing too many patterns, most of which may be uninteresting to the user, and (ii) computationally expensive mining process as its mining algorithms were inefficient in reducing the enormous search space. This paper makes the following efforts to address the above-mentioned two limitations. First, we introduce a new null-invariant measure, periodic- confidence, to determine the periodic interestingness of a pattern in a database. Second, an alternative model of a partial periodic pattern has been defined based on the proposed measure. Third, an efficient depth-first search algorithm based on the renowned pattern-growth technique has been introduced to discover all partial periodic patterns in a database. Fourth, the proposed algorithm employs a novel lossless pruning technique called “irregularity pruning” to reduce the search space and computational cost-efficiently. Experiments on several datasets demonstrate that our model can effectively tackle the rare item problem, and our algorithm is efficient. Finally, we discuss the usefulness of patterns with case studies performed on air pollution and traffic congestion databases.
Conference Paper
Full-text available
Sequential pattern mining is an important data mining problem with broad applications. However, it is also a difficult problem since the mining may have to degenerate or examine a combinatorially explosive number of intermediate subsequences. Most of the previously developed sequential pattern mining methods, such as GSP, explore a candidate generation-and-test approach [1] to reduce the number of candidates to be examined. However, this approach may not be efficient in mining large sequence databases having numerous patterns and/or long patterns. In this paper, we propose a projection-based, sequential pattern-growth approach for efficient mining of sequential patterns. In this approach, a sequence database is recursively projected into a set of smaller projected databases, and sequential patterns are grown in each projected database by exploring only locally frequent fragments.
Conference Paper
Full-text available
We present a pattern-mining algorithm that scales roughly linearly in the number of maximal patterns embedded in a database irrespective of the length of the longest pattern. In comparison, previous algorithms based on Apriori scale exponentially with longest pattern length. Experiments on real data show that when the patterns are long, our algorithm is more efficient by an order of magnimaximal frequent itemset, Max-Miner’s output implicitly and concisely represents all frequent itemsets. Max-Miner is shown to result in two or more orders of magnitude in performance improvements over Apriori on some data-sets. On other data-sets where the patterns are not so long, the gains are more modest. In practice, Max-Miner is demonstrated to run in time that is roughly linear in the number of maximal frequent itemsets and the size of the database, irrespective of the size of the longest frequent itemset. tude or more. 1.
Conference Paper
From the standpoint of supporting human-centered discovery of knowledge, the present-day model of mining association rules suffers from the following serious shortcomings: (i) lack of user exploration and control, (ii) lack of focus, and (iii) rigid notion of relationships. In effect, this model functions as a black box, admitting little user interaction in between. We propose, in this paper, an architecture that opens up the black box and supports constraint-based, human-centered exploratory mining of associations. The foundation of this architecture is a rich set of constraint constructs, including domain, class, and SQL-style aggregate constraints, which enable users to clearly specify what associations are to be mined. We propose constrained association queries as a means of specifying the constraints to be satisfied by the antecedent and consequent of a mined association. In this paper, we mainly focus on the technical challenges in guaranteeing a level of performance that is commensurate with the selectivities of the constraints in an association query. To this end, we introduce and analyze two properties of constraints that are critical to pruning: anti-monotonicity and succinctness. We then develop characterizations of various constraints into four categories, according to these properties. Finally, we describe a mining algorithm called CAP, which achieves a maximized degree of pruning for all categories of constraints. Experimental results indicate that CAP can run much faster, in some cases as much as 80 times, than several basic algorithms. This demonstrates how important the succinctness and anti-monotonicity properties are in delivering the performance guarantee.
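Anti-monotonicity, one of the two pruning properties mentioned above, can be sketched as follows. Assuming a hypothetical constraint sum(price) <= budget (the prices and budget are invented for illustration), any itemset violating it can be pruned along with all of its supersets during level-wise search. This sketch is illustrative only and is not the CAP algorithm.

```python
def constrained_itemsets(items, price, budget):
    """Level-wise enumeration of itemsets satisfying the anti-monotone
    constraint sum(price) <= budget.

    Anti-monotonicity: if an itemset violates the constraint, so does
    every superset, so violating candidates are discarded and never
    extended. Hypothetical illustration, not the CAP algorithm.
    """
    level = [frozenset([i]) for i in items if price[i] <= budget]
    result = list(level)
    while level:
        nxt = set()
        for itemset in level:
            for i in items:
                if i in itemset:
                    continue
                candidate = itemset | {i}
                # Prune: a violating candidate is dropped here and,
                # since we only extend survivors, none of its supersets
                # are ever generated from it.
                if sum(price[j] for j in candidate) <= budget:
                    nxt.add(candidate)
        level = list(nxt)
        result.extend(level)
    return result

sets = constrained_itemsets(['a', 'b', 'c'], {'a': 1, 'b': 2, 'c': 5}, 3)
```

With these toy prices, {'c'} already violates the budget, so no itemset containing 'c' is ever considered.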
Conference Paper
We study the problem of discovering association rules that display regular cyclic variation over time. For example, if we compute association rules over monthly sales data, we may observe seasonal variation where certain rules are true at approximately the same month each year. Similarly, association rules can also display regular hourly, daily, weekly, etc., variation that is cyclical in nature. We demonstrate that existing methods cannot be naively extended to solve this problem of cyclic association rules. We then present two new algorithms for discovering such rules. The first one, which we call the sequential algorithm, treats association rules and cycles more or less independently. By studying the interaction between association rules and time, we devise a new technique called cycle pruning, which reduces the amount of time needed to find cyclic association rules. The second algorithm, which we call the interleaved algorithm, uses cycle pruning and other optimization techniques for discovering cyclic association rules. We demonstrate the effectiveness of the interleaved algorithm through a series of experiments. These experiments show that the interleaved algorithm can yield significant performance benefits when compared to the sequential algorithm. Performance improvements range from 5% to several hundred percent.
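The notion of a cyclic rule can be made concrete with a small, hypothetical sketch: given a boolean series recording in which time units a rule held, a cycle (l, o) means the rule holds in every unit t with t mod l == o. This brute-force check is purely illustrative; the algorithms above avoid it via cycle pruning and other optimizations.

```python
def cycles(holds, max_len):
    """Find all (length, offset) cycles in a boolean series.

    holds[t] is True iff the rule held in time unit t. A cycle (l, o)
    means the rule holds at every t with t % l == o. Brute-force
    sketch for illustration only.
    """
    found = []
    for l in range(1, max_len + 1):
        for o in range(l):
            # The cycle exists iff the rule held in every sampled unit.
            if all(holds[t] for t in range(o, len(holds), l)):
                found.append((l, o))
    return found

# A rule that holds every third unit, starting at offset 1:
series = [False, True, False, False, True, False, False, True, False]
found = cycles(series, 4)
```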
Conference Paper
Sequences of events describing the behavior and actions of users or systems can be collected in several domains. In this paper we consider the problem of recognizing frequent episodes in such sequences of events. An episode is defined to be a collection of events that occur within time intervals of a given size in a given partial order. Once such episodes are known, one can produce rules for describing or predicting the behavior of the sequence. We describe an efficient algorithm for the discovery of all frequent episodes from a given class of episodes, and present experimental results.
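Window-based episode frequency, as described above, can be sketched for the simple case of parallel episodes (event order ignored). The function and data below are hypothetical illustrations, not the paper's algorithm.

```python
def episode_frequency(events, episode, window):
    """Count sliding windows of width `window` (in discrete time units)
    that contain every event type in `episode`.

    `events` is a list of (time, event_type) pairs. Only parallel
    episodes are handled (the partial order is ignored). Sketch only.
    """
    episode = set(episode)
    if not events:
        return 0
    times = [t for t, _ in events]
    start, end = min(times), max(times)
    count = 0
    # Slide over every window position that overlaps the sequence.
    for w_start in range(start - window + 1, end + 1):
        seen = {e for t, e in events if w_start <= t < w_start + window}
        if episode <= seen:
            count += 1
    return count

evts = [(1, 'A'), (2, 'B'), (5, 'A'), (6, 'B')]
freq = episode_frequency(evts, {'A', 'B'}, 2)
```

An episode is then deemed frequent when this window count (or the corresponding fraction of windows) exceeds a user-given threshold.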
Article
This paper reports the progress in this front. A more detailed study can be found in [4]. In this paper, we focus on algorithms for discovering sequential relationships when a rough pattern of relationships is given. The rough pattern (which we term "event structure") specifies what sort of relationships a user is interested in. For example, a user may be interested in "which pairs of events occur frequently one week after another". The algorithms will find the instances that fit the event...