
Proceedings of the International Conference on Database, Data Warehouse, Data Mining and Big Data (DDDMBD2015), Jakarta, Indonesia 2015

Conference Title
The International Conference on Database, Data Warehouse,
Data Mining and Big Data (DDDMBD2015)
Conference Dates
September 10-12, 2015
Conference Venue
Surya University
Tangerang, Indonesia
ISBN
978-1-941968-20-8 ©2015 SDIWC
Published by
The Society of Digital Information and Wireless
Communications (SDIWC)
Wilmington, New Castle, DE 19801, USA
www.sdiwc.net
DDDMBD2015 Table of Contents
Partitioning Technology and Fast Content Movements of Big Data (p. 1)
Mining Patterns with Attribute Oriented Induction (p. 11)
Supporting Aggregation in Data Warehousing Considering User-Defined Temporal Granularities (p. 22)
Optimal Health Care Inventory Management Using Analytics (p. 37)
Partitioning Technology and Fast Content Movements of Big Data
TE-YUAN LIN, CHIOU-SHANN FUH
Computer Science and Information Engineering, National Taiwan University
8F., No.7, Songren Rd., Xinyi Dist., Taipei City 110, Taiwan (R.O.C.)
D03922002@ntu.edu.tw, fuh@csie.ntu.edu.tw
ABSTRACT
Database storage that holds abundant data is usually accompanied by slow query and data-manipulation performance. This paper presents a model and methodology for faster manipulation (insert/delete) of massive numbers of data rows stored in a big table. It describes how large data sets can be moved into and out of another logical table with outstanding efficiency compared with the traditional transactional approach. To support this idea, the table structure must be redesigned to accommodate and keep the data; in other words, the table must be "partitioned".
The paper also covers partitioning strategies applied to various scenarios, such as the data sliding window, data archiving, and partition consolidation and movement.
KEYWORDS
big data, partition table, partitioning, big table,
MapReduce
1 INTRODUCTION
1.1 Motivation
The term "Big Data" in the software and computer-science field refers to a collection of datasets that grows so large and complex that it becomes difficult to process with traditional relational database management tools or processing applications. In fact, "Big Data" is one of the unstoppable IT trends within the Cloud Computing topic. Nowadays, most applied Big Data topics derive from MapReduce [1]. However, if we shift our focus from search and data analysis to real-time data, the shortcomings of MapReduce become unavoidable, given the nature of its designed architecture. First, MapReduce is good at batched, offline big-data processing but is not suitable for real-time transactional jobs. Second, MapReduce computes results by brute force instead of using indexed data. Third, and most importantly, for many commercial users, SQL (Structured Query Language)-like data and relational databases remain the majority. Once we return to the traditional usage of RDBMS (Relational Database Management Systems), the usual slowness of big-data manipulation reappears. A question came to our minds: Is it possible to insert or delete a huge dataset in almost no time? Is it possible to leverage technical innovations on the current database platform to minimize maintenance time and maximize efficiency for an enterprise-level database? How can all of this work automatically? We therefore decided to design a mechanism that fulfills these requirements.
1.2 Database Platforms
To explain the idea consistently and thoroughly in this paper, Microsoft SQL Server (although the approach is not limited to it) is chosen as the platform for the following samples and details. The hierarchy of database logical structures is shown in Figure 1.
Figure 1: SQL Server Logical Structure.
As described in Figure 1, an instance of the
database engine is a copy of the SQL Server
executable that runs as an operating system
service. Each instance manages several system
databases and one or more user databases. Each
computer can run multiple instances of the
database engine. Applications connect to the
instance in order to perform work in a database
managed by the instance [2].
SQL Server maps a database onto a set of operating-system files that fall into two main types, data and log, with three file extensions: .mdf and .ndf for data, and .ldf for log. Data and log information are never mixed in the same file, and individual files are used by only one database. Each database has one set of data spaces, which can be a filegroup, a partition scheme, or a filestream directory [3].
Filegroups are named collections of files and
are used to help with data placement and
administrative tasks such as backup and restore
operations. Partition schemes in the database
can map the partitions of a partitioned table or
index to filegroups. The filestream directory is
used for integrating with unstructured data such
as text documents, images, and videos that are
often stored outside the database, separate from
its structured data. The data of a table is physically stored in one or more files, but its logical container is not a file; it is a filegroup. A logical container "owns" the logical objects it contains, and a logical object "belongs" to the logical container that holds it. Filegroups own files and tables, so tables and files belong to filegroups. Partition schemes own partitioned tables, so partitioned tables belong to partition schemes, and if a partition scheme exists, one or more filegroups belong to it. Some examples:
- A primary filegroup (which always holds the system tables) and a default filegroup must exist. The primary and default filegroups can be the same filegroup or separate filegroups.
- A filegroup may have more than one file.
- A partition scheme can be defined on a single filegroup with one file.
- A partition scheme can be defined on a single filegroup with multiple files.
- A partition scheme can be defined on multiple filegroups, with each filegroup having one or more files.
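The mapping from files to filegroups and from a partition scheme to filegroups can be set up entirely with DDL. The T-SQL below is a minimal sketch; the database name, filegroup names, file paths, and sizes are illustrative assumptions and are not part of the paper's example.

-- Assumed database "SalesDB"; adjust names and drive paths to the real environment.
ALTER DATABASE SalesDB ADD FILEGROUP FG2012;
ALTER DATABASE SalesDB ADD FILEGROUP FG2013;

-- Each filegroup may own one or more physical files, possibly on different disks.
ALTER DATABASE SalesDB
    ADD FILE (NAME = SalesDB_2012, FILENAME = 'D:\Data\SalesDB_2012.ndf', SIZE = 100MB)
    TO FILEGROUP FG2012;
ALTER DATABASE SalesDB
    ADD FILE (NAME = SalesDB_2013, FILENAME = 'E:\Data\SalesDB_2013.ndf', SIZE = 100MB)
    TO FILEGROUP FG2013;

-- A partition function and a scheme that maps its two partitions onto the two filegroups.
CREATE PARTITION FUNCTION pf_Year (datetime)
    AS RANGE RIGHT FOR VALUES ('2013-01-01');
CREATE PARTITION SCHEME ps_Year
    AS PARTITION pf_Year TO (FG2012, FG2013);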
Figure 2 illustrates the relationships between
database’s data spaces, filegroups, and partition
schemes.
Figure 2: The relationships between data spaces,
filegroups, and partition schemes.
1.3 Table and Partition Overview
A table is contained in one or more partitions
and each partition contains data rows in either a
heap or a clustered index structure. The pages
of the heap or clustered index are managed in
one or more allocation units, depending on the
column types in the data rows. Figure 3 shows
the organization of a table. Table and index
pages are contained in one or more partitions. A
partition is a user-defined unit of data
organization. By default, a table or index has
only one partition that contains all the table or
index pages. The partition resides in a single
filegroup. A table or index with a single
partition is equivalent to the organizational
structure of tables and indexes in earlier
versions of SQL Server [4].
Figure 3: Table and partition organization [4].
When a table or index uses multiple partitions,
the data are partitioned, so that groups of rows
are mapped into individual partitions, based on
a specified column. The partitions can be put on
one or more filegroups in the database. The
table or index is treated as a single logical
entity when queries or updates are performed
on the data.
2 METHODOLOGY
2.1 Partition Approach
The two common techniques for dividing a dataset into slices are horizontal partitioning and vertical partitioning. Put simply, either technique can be regarded as a "logical view" onto a table. In horizontal partitioning, the table keeps the same structure and columns but is partitioned along row-level boundaries. In vertical partitioning, the original table is split into more than one physical table; each table has the same number of records and the same primary key, but the other columns differ. These tables have a 1-to-1 relationship, and the technique can be used to overcome a database engine's limit on the number of columns supported in one table. For example, a financing table might need 1,000 columns that have a 1-to-1 relationship with each other, but the database engine might support a maximum of only 256 columns per table. In that case the logical table could be split (vertically partitioned) into 4 physical tables.
Horizontal partitioning is a method of dividing the rows of data in a table, with different rows belonging to different partitions of that table. We also call these partitions "row groups".
Figure 4: Horizontal partitioning table concept.
Vertical partitions hold subsets of a table's columns, each subset storing part of the column-wise data. In principle, each subset table can be accessed independently; we also call these subsets "segments".
Figure 5: Vertical partitioning table concept.
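As a concrete sketch of the vertical case, assume a hypothetical wide Customer table split into two physical tables that share the same primary key; the table and column names below are illustrative, not taken from the paper.

-- Frequently accessed columns stay in the main segment.
CREATE TABLE CustomerCore (
    CustomerID  int           NOT NULL PRIMARY KEY,
    Name        nvarchar(100) NOT NULL,
    Region      nvarchar(50)  NOT NULL
);

-- Rarely used columns move to a second segment with a 1-to-1 relationship.
CREATE TABLE CustomerExtended (
    CustomerID  int            NOT NULL PRIMARY KEY
        REFERENCES CustomerCore (CustomerID),
    Notes       nvarchar(max)  NULL,
    Photo       varbinary(max) NULL
);

-- The original logical row is reassembled with a join on the shared key.
SELECT c.CustomerID, c.Name, e.Notes
FROM CustomerCore c
JOIN CustomerExtended e ON e.CustomerID = c.CustomerID;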
Horizontal and vertical partitions are like two knives that cut the base table apart from different angles. It is worth pointing out that this "cutting" is different from using an ordinary SQL statement to filter out rows and columns: the quantity of data in the base table remains the same, but it has been divided into pieces under the covers. The smaller pieces can be spread across as many different disks as possible, yet from the developer's point of view the table still behaves like an ordinary table.
As an example, the base table (e.g. SalesTable) looks like the following:
Figure 6: The sample base table called SalesTable.
After combining the horizontal and vertical techniques, the original bigger table is divided into 12 smaller tables.
Figure 7: The concept of how data are stored after
partitioned horizontally and vertically.
If we perform a simple query like this:
SELECT ProductKey, SUM(SalesAmount)
FROM SalesTable
WHERE OrderDateKey < 20130108
GROUP BY ProductKey
the result would involve only the following highlighted region, touching merely 3 small tables:
Figure 8: The sample result of vertical partitions filtering.
We fetch only the needed columns and rows under the filtering condition, and, better still, no disk I/O (Input/Output) effort is spent on the remaining 9 tables (marked with X in Figure 8); only the disks holding the 3 involved tables are touched. This is quite different from running a SELECT against a whole, unpartitioned table or its indexes on the same disks. The benefits that partitioning techniques bring are therefore easy to imagine:
- Less storage used and faster storage speed, by means of multiple I/O paths via multiple filegroups and disks.
- Smaller, faster backups and restores, by means of backing up only the changed filegroups.
- Faster inserts and deletes of large data sets, by means of partition switching.
The first two points are obvious; the last one may be less clear at this stage, and we describe it further in the coming pages.
Of course, every benefit has a cost, and partitioning techniques are no exception:
- Partitioned objects require more memory than non-partitioned objects.
- The database administrator must have sufficient knowledge of partitioning.
- Partitioned objects may have some restrictions in certain situations.
Although partitioning requires extra effort when managing very large numbers of records, once the data reach a sufficient scale, the total administrative time is surely lower than trying to manage the records without partitioning.
Here we introduce the partitioning methodology by which we boost the performance of data access and manipulation. Partitioning the dataset either horizontally or vertically is allowed, but operational complexity and the degree of transparency also matter. The rule of thumb is: from the database administrator's standpoint, the lower the operational complexity the better; from the developer's view, the higher the transparency the better. Consider the side effects of vertical partitioning: when columns are removed from the original base table and placed in child tables, the design is denormalized, and overall data access patterns become more complicated, because developers must know the denormalization relations between the child tables very well in addition to the functional relations between different tables. Combining the horizontal and vertical methods is even more complicated, since it multiplies the number of slices. Another consideration is how the data are used. People usually archive historical data by "time", "region", or "category"; interestingly, these correspond to horizontal distribution rather than vertical column slices. Thus the horizontal partitioning method is adopted throughout the implementation that follows.
2.2 Horizontal Partitioning
The partitioning definition is composed of three parts: defining a "partition boundary", creating a "partition function" and a "partition scheme", and then assigning them to the target table(s). The dependency relationship is shown in Figure 9:
Figure 9: The dependency relationship of partition
definitions.
Each component in the relationship depends on the existence of its predecessor and can be used by many of its successors. The partitioned table must be created on a partition scheme, which defines the filegroup storage locations for the partitioned table. This is the biggest difference from non-partitioned tables. Non-partitioned tables also have a partition, but the number of partitions is one, whereas a partitioned table can be separated into thousands of partitions.
Some terminology in more detail:
Partition boundary: a set of boundary values is specified from the range of possible values of the input parameter's data type. These boundary values define a number of sub-ranges within the overall range, called partitions. For the partitions:
- Partitions are numbered starting with 1.
- Any given value within the boundary value range can exist in only one partition.
- Two types of boundary range (RANGE LEFT / RANGE RIGHT) control which partition a record is stored in when the partitioning value equals a boundary exactly.
Table 1: Example with the boundary value of 1000:

                Partition #1   Partition #2
RANGE LEFT      <= 1000        > 1000
RANGE RIGHT     < 1000         >= 1000
Say the boundary here is 1,000 and it splits the overall range into two partitions (#1 and #2). If an incoming value of 800 is inserted into the partitioned table, which partition should it reside in? Quite simply, #1, since the incoming value is less than the boundary value of 1,000. What if the incoming value is exactly equal to the boundary value of 1,000?
The answer depends on the partition range direction we use. With RANGE LEFT, when the incoming value equals the boundary value it is assigned to the "LEFT" partition, which is partition #1 here. With RANGE RIGHT, the incoming value of 1,000 falls into the "RIGHT" partition, which is partition #2. With multiple boundary values, the partitions look like this:
RANGE LEFT with values (0, 10, 20):
- Partition 1: values from the data type minimum up to values <= 0
- Partition 2: values > 0 and <= 10
- Partition 3: values > 10 and <= 20
- Partition 4: values > 20 up to the data type maximum

RANGE RIGHT with values (0, 10, 20):
- Partition 1: values from the data type minimum up to values < 0
- Partition 2: values >= 0 and < 10
- Partition 3: values >= 10 and < 20
- Partition 4: values >= 20 up to the data type maximum
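The effect of the two range directions can be verified directly against the engine with the $PARTITION function; the function names below are only illustrative.

CREATE PARTITION FUNCTION pf_Left  (int) AS RANGE LEFT  FOR VALUES (0, 10, 20);
CREATE PARTITION FUNCTION pf_Right (int) AS RANGE RIGHT FOR VALUES (0, 10, 20);

-- Ask the engine which partition a given value would land in.
SELECT
    $PARTITION.pf_Left(10)  AS left_partition,   -- returns 2: 10 stays in the "left" partition
    $PARTITION.pf_Right(10) AS right_partition;  -- returns 3: 10 starts the "right" partition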
2.3 Metadata-Only Manipulations
One simple fact: moving a large data set can never be faster than moving a small one, just as, in the same driving mode, traveling 10,000 miles on less gasoline than 5 miles happens only in the Arabian Nights.
For instance, suppose we have two tables (Tables A and B), and a common manipulation is to move the "2008 Data" from Table A to B. The catch is that the size of the "2008 Data" is up to 10 TB.
Figure 10: Traditional SQL operation of data
manipulation.
With the traditional method, the pseudo-SQL takes two steps, as follows:
INSERT INTO TableB
SELECT * FROM TableA
WHERE data_period = '2008 Data';

DELETE FROM TableA
WHERE data_period = '2008 Data';
The first step selects the data set for the specified period and inserts it into the target table (Table B). Because the data size is up to 10 TB, it is no surprise that this operation would take several days on most slow disks, or several hours even on very high-end storage arrays. That only counts the insertion effort: any database engine must record every action for a possible later rollback, so extra disk I/O is spent writing to the log files, and the real duration and space needed can be doubled.
Unfortunately, that is just the beginning. The second step is another disaster for the whole system, because we must remove the "2008 Data" set from the source table (Table A). The ensuing flood of disk I/O costs another several days, or at least several more hours. After a torture like this, if you dared to touch the disk surfaces and feel their temperature, believe me, you would regret it.
The cost does not yet include the network bandwidth consumed, the processor power, and, above all, the precious waiting time.
Thinking differently is the only way out. The trick is: do not actually move the data sets literally; move the "metadata" behind them instead. By changing the link between a big data set and the partition number where it lives, it is possible to load or remove data from the partitioned table almost immediately, regardless of how big the data are.
Let us leverage the technique stated earlier. Partition Table A into six areas so that the data are divided according to date boundaries. Table B is a blank table with an empty partition.
Imagine the following concept:
Figure 11: Table and partition metadata movement
concept (Before switch out).
After the architecture is redesigned with table partitioning techniques, each area of the data set has its own link to a logical partition number within a filegroup. The default partition of Table B shares the same filegroup as partition #2 of Table A. By exchanging the link in the metadata definition of the "2008 Data" partition, a concept called "switch out", the fast performance is no longer a dream. The magic pseudo-statement is as follows:
ALTER TABLE TableA
SWITCH PARTITION 2
TO TableB
Guess what: this short statement does the trick and finishes the job in seconds. After the "switch out" operation, the link definition of the "2008 Data" set has changed from Table A to Table B. From the users' point of view, it is as if the "2008 Data" set had been deleted from Table A and loaded into Table B.
All of this happens in a twinkle, because the "switch" operation avoids any real data movement. The new metadata definition looks like Figure 12:
Figure 12: Table and partition metadata movement
concept (After switch out).
With the same idea, we can perform a "switch in" operation to pull a data set into the table or into a specific partition.
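A minimal sketch of the two operations follows, under the assumptions that Table A is partitioned, the non-partitioned staging table TableB shares the filegroup of the partition being switched, and the partitioning column is a hypothetical data_year integer; none of these names come from the paper's figures.

-- Switch out: partition 2 of the partitioned table is detached into an empty
-- table with an identical structure; only metadata changes hands.
ALTER TABLE dbo.TableA SWITCH PARTITION 2 TO dbo.TableB;

-- Switch in: the rows staged in TableB go back into partition 2 of TableA.
-- The staging table must carry a CHECK constraint proving that every row fits
-- inside the boundaries of the target partition.
ALTER TABLE dbo.TableB
    ADD CONSTRAINT ck_year CHECK (data_year >= 2008 AND data_year < 2009 AND data_year IS NOT NULL);
ALTER TABLE dbo.TableB SWITCH TO dbo.TableA PARTITION 2;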
2.4 Sliding the Partitions
With the interpretation in the previous sections as a basis, now is a good time to design a partitioning strategy that leverages partition switching, and the performance advantage it brings, to manage big tables in business operations.
The strategy slides the partitions to fade out old data and bring new data into the partitioned table periodically. We call this method the "sliding window", by analogy with the network packet term in the TCP/IP (Transmission Control Protocol / Internet Protocol) world. The procedure consists of the following ten steps and pseudo-code:
1. Design a partition function based on the proper column of the source table(s).

CREATE PARTITION FUNCTION
pf_Annually_Right (datetime)
AS RANGE RIGHT FOR VALUES
(
'2008-01-01', '2009-01-01',
'2010-01-01',
'2011-01-01', '2012-01-01')

Although "fixed" dates are used for the values here, real-world systems should be more flexible: replace the fixed values with dynamic values derived from the current system date, or from some calculation that increments or decrements them, formatted appropriately. For example:

SUBSTRING(CONVERT(varchar, DATEADD(dd, +7, GETDATE()), 121), 0, 12)
2. Plan the filegroup distribution and design the partition scheme to combine the partition function with the filegroup definition.

CREATE PARTITION SCHEME ps_pf_Annually
AS PARTITION pf_Annually_Right
ALL TO ([PRIMARY])

For a more performance-oriented design, planning more than one filegroup and assigning them to different physical disk arrays gains further parallel disk I/O benefits. Here, for demonstration, we put everything in the PRIMARY filegroup.
3. Use the existing source table, or design a new source table, to accommodate the data according to the partitioning structure.

CREATE TABLE [sourceTableName]
(
columnDate, column2, column3 ...
)
ON ps_pf_Annually(columnDate)
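For concreteness, a hedged version of such a table with real column types might look as follows; the table and column names are placeholders rather than the paper's.

CREATE TABLE dbo.SalesFact (
    OrderDate   datetime NOT NULL,
    ProductKey  int      NOT NULL,
    SalesAmount money    NOT NULL
) ON ps_pf_Annually (OrderDate);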
4. Design a staging table (temporary archive table) with column and index definitions identical to the source table.

CREATE TABLE [tablename_Staging]
(
columnDate, column2, column3 ...
)
5. Move the oldest data partition from the source partitioned table to the archive table with the "switch out" command. The oldest data now reside in the staging archive table; archive them in the background to other tables (this does not lock or affect the source table's transactions and performance) or truncate the staging table directly (which takes seconds, no matter how many rows there are or how big they are).
Figure 13: Switch out the old partition to staging
table.
ALTER TABLE [sourceTableName]
SWITCH PARTITION 2
TO [tablename_Staging];
TRUNCATE TABLE [tablename_Staging]
6. Remove the boundary value of the oldest partition with the "Merge" command.
Figure 14: Before merging the partitions, the oldest
boundary value is 2008-01-01.
ALTER PARTITION FUNCTION
pf_Annually_Right()
MERGE RANGE ('2008-01-01')
Figure 15: After merging the partitions, the oldest
boundary value becomes 2009-01-01.
7. Specify the NEXT USED filegroup for the
incoming new partition.
ALTER PARTITION SCHEME ps_pf_Annually
NEXT USED [PRIMARY]
8. Create a new boundary value for the new partition at the newest end of the table with the "Split" command; this cuts the leading end partition into two empty partitions.

ALTER PARTITION FUNCTION
pf_Annually_Right()
SPLIT RANGE ('2012-01-01')
9. (Optional) Load data from an existing staging table, whose column and index definitions are identical to the target partitioned table, into the proper partition location with the "switch in" command.
Figure 16: Before extending new partition, the
maximum boundary value is 2011-01-01.
Figure 17: After splitting the partitions, the maximum boundary value becomes 2012-01-01, and the data of the staging table are loaded into Partition 5 of the target partitioned table.
Notice: keep this new partition empty; never load data into either of the leading end partitions. Make sure the indexes of the existing staging table are rebuilt/refreshed before switching it into the target partitioned table.
10. The staging table now is empty and the
source table partition has been refreshed.
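After each slide, the resulting partition layout can be checked from the catalog views. The query below is a hedged sketch against the placeholder source table name used above; it lists every partition with its row count, boundary value, and filegroup.

SELECT p.partition_number,
       p.rows,
       prv.value AS boundary_value,
       fg.name   AS filegroup_name
FROM sys.partitions p
JOIN sys.indexes i
     ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.partition_schemes ps
     ON ps.data_space_id = i.data_space_id
JOIN sys.destination_data_spaces dds
     ON dds.partition_scheme_id = ps.data_space_id
    AND dds.destination_id = p.partition_number
JOIN sys.filegroups fg
     ON fg.data_space_id = dds.data_space_id
LEFT JOIN sys.partition_range_values prv
     ON prv.function_id = ps.function_id
    AND prv.boundary_id = p.partition_number
WHERE p.object_id = OBJECT_ID('dbo.sourceTableName')
  AND i.index_id IN (0, 1)          -- heap or clustered index only
ORDER BY p.partition_number;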
In the steps above we developed a strategy that slides the partitioning structure by cutting off the oldest partition and extending a new one; in the process, the structure is refreshed, the historical data are deleted, and the latest data are loaded, all in seconds. To automate the process further, simply put these steps into scheduled jobs with the proper frequency. The idea is totally different from traditional SQL DML (Data Manipulation Language) and is extremely efficient for dealing with big data in any enterprise-level application.
3 CONCLUSION
By leveraging the "divide and conquer" principle, big data can be partitioned into separate logical units; by changing the metadata relationships, each big dataset unit can "roam" between tables quickly when it needs to be moved, which greatly raises the performance and flexibility of applications in the age of data/information explosion.
Partitioning also has disadvantages and limitations. The framework depends tightly on the database engine: a scenario with good performance and feasibility on one database platform may not work as well on others because of differences in version support and internal engine behavior. The somewhat complex design requirements before setting out are also a showstopper for wide use in daily data operations. However, the benefits of the idea still outweigh its limitations. As the framework matures, the true influence of partitioning is worth looking forward to.
REFERENCES
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proceedings of the Symposium on Operating Systems Design and Implementation, San Francisco, CA, December 2004.
[2] Microsoft, "Database Engine Instances," http://technet.microsoft.com/en-us/library/hh231298.aspx, 2013.
[3] Microsoft, "Filestream Overview," http://technet.microsoft.com/en-us/library/bb933993(v=sql.105).aspx, 2013.
[4] Microsoft, "Table and Index Organization," http://technet.microsoft.com/en-us/library/ms189051(v=sql.105).aspx, 2013.
Mining Patterns with Attribute Oriented Induction
Spits Warnars
Database, Datawarehouse & Data Mining Research Center, Surya University
Jl. Boulevard Gading Serpong Blok O/1, Tangerang 15810, Indonesia
Spits.warnars@surya.ac.id
ABSTRACT
Mining data from areas of human activity such as business, education, engineering, health, and so on is important and helps people justify their decision-making processes. Attribute Oriented Induction (AOI) has been used to mine significantly different patterns since it was coined in 1989, and it has been combined with, and used as a complement to, other data mining techniques. AOI has proved powerful and offers future opportunities for exploration in finding data patterns. AOI is chosen because it can reduce many patterns by summarizing (rolling up) many low-level patterns into high-level ones in a concept tree/hierarchy. Meanwhile, non-summarized patterns at low levels of the concept tree/hierarchy can be used to sharpen the mined knowledge patterns, just like roll-up and drill-down in a data warehouse. Mapping implementations of AOI onto areas of human life such as business, education, engineering, health, and so on is useful for conveying the value of AOI mining patterns, particularly for those interested in AOI as a data mining technique that can summarize many patterns into simple ones.
KEYWORDS
Data Mining, Attribute Oriented Induction, AOI,
pattern, rule.
1. INTRODUCTION
The Attribute Oriented Induction (AOI) method, first proposed in 1989, integrates a machine learning paradigm, especially learning-from-examples techniques, with database operations; it extracts generalized rules from an interesting set of data and discovers high-level data regularities [39]. AOI provides an efficient and effective mechanism for discovering various kinds of knowledge rules from datasets or databases.
The AOI approach was developed for learning different kinds of knowledge rules, such as characteristic rules, discrimination rules, classification rules, data evolution regularities [1], association rules, and cluster description rules [2].
1) A characteristic rule is an assertion that characterizes a concept satisfied by all of the data stored in the database. This rule provides generalized concepts about a property, which can help people recognize the common features of the data in a class, for example the symptoms of a specific disease [9].
2) A discriminant rule is an assertion that discriminates the concept of one (target) class from another (contrasting) class. This rule gives a discriminant criterion that can be used to predict the class membership of new data, for example to distinguish one disease from another [9].
3) A classification rule is a set of rules that classifies the set of relevant data according to one or more specific attributes, for example classifying diseases into classes and providing the symptoms of each [40].
4) An association rule describes association relationships among the set of relevant data, for example discovering a set of symptoms that frequently occur together [12,35].
5) A data evolution regularity rule describes the general evolution behaviour of a set of relevant data (valid only for time-related/temporal data), for example describing the major factors that influence the fluctuations of
stock values through time [3,37]. Data evolution regularities can then be classified into characteristic rules and discrimination rules [3].
6) A cluster description rule is used to cluster data according to data semantics [12], for example clustering university students based on different attribute(s).
2. QUANTITATIVE AND QUALITATIVE
RULES IN AOI
Rules in AOI can be represented as quantitative or qualitative rules:
1) A quantitative rule is a rule associated with quantitative information, such as statistical information that assesses the representativeness of the rule in the database [1]. There are three types of quantitative rule: the quantitative characteristic rule, the quantitative discriminative rule, and the combined quantitative characteristic and discriminative rule.
a. A quantitative characteristic rule carries the quantitative information of a characteristic rule, and each rule in the final generalization is measured with the t-weight in formula 1 (a SQL sketch of computing these weights appears at the end of this section).
t-weight = Votes(qa) / Σ(i=1..N) Votes(qi)     (1)

where:
t-weight = percentage of each rule in the final generalized relation.
Votes(qa) = number of tuples in each rule of the final generalized relation, with Votes(qa) in {Votes(q1), ..., Votes(qN)}.
N = number of rules in the final generalized relation.
A quantitative characteristic rule is represented with the symbol → and takes the form:

∀(x) target_class(x) → condition1(x) [t:w1] ∨ ... ∨ conditionn(x) [t:wn]

where:
x belongs to the target class.
n is the number of rules in the final generalized relation.
[t:w1] is the t-weight (formula 1) for rule 1, through [t:wn], the t-weight (formula 1) for rule n.
Example:
∀(x) graduate(x) → (Birthplace(x) ∈ Canada ∧ GPA(x) ∈ excellent) [t:75%] ∨ (Major(x) ∈ science ∧ Birthplace(x) ∈ Foreign ∧ GPA(x) ∈ good) [t:25%]
b. A quantitative discriminative rule is a discrimination rule that uses quantitative information. Each rule in the target class is discriminated against the same rule in the contrasting class(es) and is measured with the d-weight in formula 2.
d-weight = Votes(qa ∈ Cj) / Σ(i=1..K) Votes(qa ∈ Ci)     (2)

where:
d-weight = percentage ratio, per rule, of the number of tuples in the target class to the total number of tuples in the target class and the contrasting class(es) for the same rule.
Votes(qa ∈ Cj) = number of tuples of each rule in the target class Cj, where Cj is in {C1, ..., CK}.
K = total number of target and contrasting classes for the same rule.
A quantitative discriminative rule is shown with the symbol ← and takes the form:

∀(x) target_class(x) ← condition1(x) [d:w1] ∨ ... ∨ conditionn(x) [d:wn]

where:
x belongs to the target class.
n is the number of rules in the target class.
[d:w1] is the d-weight (formula 2) for rule 1 in the target class, and
[d:wn] is the d-weight (formula 2) for rule n of the target class.
Example:
∀(x) graduate(x) ← (Birthplace(x) ∈ Foreign ∧ GPA(x) ∈ good) [d:100%] ∨ (Major(x) ∈ social ∧ GPA(x) ∈ good) [d:25%]
c. A quantitative characteristic and discriminative rule uses the quantitative information of both the characteristic rule and the discriminative rule, carrying both the t-weight and the d-weight for the same rules. Each rule is measured with the t-weight of formula 1 for the characteristic part and the d-weight of formula 2 for the discriminative part. A quantitative characteristic and discriminative rule is shown with the symbol ↔ and takes the form:

∀(x) target_class(x) ↔ condition1(x) [t:w1, d:w1] ∨ ... ∨ conditionn(x) [t:wn, d:wn]

where:
x belongs to the target class.
n is the number of rules in the target class.
[t:w1] is the t-weight in formula 1.
[d:w1] is the d-weight in formula 2.
Example:
∀(x) professor(x) ↔ (Birthplace(x) ∈ Foreign ∧ GPA(x) ∈ good) [t:20%, d:100%] ∨ (Major(x) ∈ social ∧ GPA(x) ∈ good) [t:10%, d:25%]
2) A qualitative rule can be obtained by the same learning process applied to its quantitative counterpart, without associating the quantitative attribute in the generalized relations [1]. A qualitative characteristic rule uses the symbol → and a qualitative discriminative rule uses the symbol ←. A qualitative rule, whether characteristic or discriminative, takes the form:

∀(x) target_class(x) [→|←] condition1(x) ∨ ... ∨ conditionn(x)

Example:
∀(x) graduate(x) → (Birthplace(x) ∈ Canada ∧ GPA(x) ∈ excellent) ∨ (Major(x) ∈ science ∧ Birthplace(x) ∈ Foreign ∧ GPA(x) ∈ good)
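The t-weight of formula 1 and the d-weight of formula 2 map naturally onto SQL window aggregates over the generalized relations. The sketch below is only an illustration under assumed table names (generalized_relation and class_generalized_relation, each with a votes column); it is not part of the original AOI prototype.

-- t-weight: each generalized rule's share of the votes in the final relation (formula 1).
SELECT rule_id,
       votes,
       100.0 * votes / SUM(votes) OVER () AS t_weight_percent
FROM generalized_relation;

-- d-weight: the target class's share of the votes for the same rule across all classes (formula 2).
SELECT rule_id,
       class_name,
       100.0 * votes / SUM(votes) OVER (PARTITION BY rule_id) AS d_weight_percent
FROM class_generalized_relation;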
3. Concept Hierarchies
One advantage of AOI is that it uses a concept hierarchy as background knowledge, which can be provided by knowledge engineers or domain experts [2,3,4]. A concept hierarchy, stored as a relation in the database, provides essential background knowledge for data generalization and multiple-level data mining. A concept hierarchy represents a taxonomy of concepts over the attribute's domain values. It can be specified based on the relationships among database attributes or by set groupings, and it can be stored in the form of relations in the same database [7].
A concept hierarchy can be adjusted dynamically based on the distribution of the set of data relevant to the data mining task. Hierarchies for numerical attributes can be constructed automatically based on data distribution analysis [7]. Concept hierarchies for numeric attributes are treated differently for the sake of efficiency [20,21,22,23,26]. For example, for a range of values between 0.00 and 1.99 there would be 200 distinct values; for efficiency, a single record with 3 fields is created rather than 200 records with 2 fields.
In a concept hierarchy, concepts are ordered by level, from specific (low-level) concepts to general (higher-level) ones. Generalization is achieved by ascending to the next higher-level concepts along the paths of the concept hierarchy. The most general concept is the null description, described as ANY, while the most specific concepts correspond to the specific attribute values in the database. A concept hierarchy can be balanced or unbalanced; an unbalanced hierarchy must be converted into a balanced one.
Figure 1 shows the concept hierarchy tree for the attribute workclass in the adult dataset [18], which
has three levels. The first (low) level has 8 concepts: without-pay, never-worked, private, self-emp-not-inc, self-emp-inc, federal-gov, state-gov, and local-gov. The second level has 5 concepts: charity, unemployed, entrepreneur, centre, and territory. The third (high) level has 2 concepts: non government and government. For example, the concept non government at the high level has 3 sub-concepts at the second level: charity, unemployed, and entrepreneur. The concept entrepreneur at the second level has 3 sub-concepts at the low level: private, self-emp-not-inc, and self-emp-inc.
Figure 1. A concept hierarchy tree for attribute
workclass in adult dataset[18]
The concept hierarchy in Figure 1 can be represented as:

Without-pay ⊂ Charity
Never-worked ⊂ Unemployed
{Private, self-emp-not-inc, self-emp-inc} ⊂ Entrepreneur
{Federal-gov, state-gov} ⊂ Centre
Local-gov ⊂ Territory
{Charity, Unemployed, Entrepreneur} ⊂ Non government
{Centre, Territory} ⊂ Government
{Non government, Government} ⊂ ANY(workclass)

where the symbol ⊂ indicates generalization; for example, Without-pay ⊂ Charity indicates that the Charity concept is a generalization of the Without-pay concept.
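Since a concept hierarchy can be stored as a relation in the same database, the workclass hierarchy of Figure 1 could be held in a simple mapping table and applied with a join. The SQL below is a hedged sketch; the table, column, and dataset names are illustrative assumptions.

-- One row per low-level concept and its next higher-level concept (only part of the hierarchy shown).
CREATE TABLE concept_hierarchy (
    attribute_name varchar(30) NOT NULL,
    low_concept    varchar(40) NOT NULL,
    high_concept   varchar(40) NOT NULL
);

INSERT INTO concept_hierarchy VALUES
    ('workclass', 'private',          'entrepreneur'),
    ('workclass', 'self-emp-not-inc', 'entrepreneur'),
    ('workclass', 'self-emp-inc',     'entrepreneur'),
    ('workclass', 'federal-gov',      'centre'),
    ('workclass', 'state-gov',        'centre'),
    ('workclass', 'local-gov',        'territory');

-- Climb the workclass attribute one level and accumulate the vote of merged tuples.
SELECT ch.high_concept AS workclass, COUNT(*) AS votes
FROM adult a                       -- hypothetical table holding the adult dataset [18]
JOIN concept_hierarchy ch
  ON ch.attribute_name = 'workclass'
 AND ch.low_concept    = a.workclass
GROUP BY ch.high_concept;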
There are four types of concept generalization
in the concept hierarchy [6]:
1) Unconditional concept generalization: rule
is associated with the unconditional IS-A
type rules. A concept is generalized to a
higher level concept because of the
subsumption relationship indicated in the
concept hierarchy.
2) Conditional/deductive rule generalization: the rule is associated with a generalization path as a deduction rule; this type of rule is conditional and can only be applied to generalize a concept if the corresponding condition is satisfied. For example, the form A(x) ∧ B(x) → C(x) means that, for a tuple x, the concept (attribute value) A can be generalized to concept C if condition B is satisfied by x; in other words, A generalizes to C only when B holds.
3) Computational rule generalization: each
rule is represented by a condition which is
value-based and can be evaluated against an
attribute or a tuple or the database by
performing some computations. The truth value of the condition then determines whether a concept can be generalized via the path.
4) Hybrid rule-based concept generalization: a
hierarchy can have paths associated with all
the above 3 different types of rules. It has a
powerful representation capability and is
suitable for many kinds of application.
Rules 2-4 are the three types of rule-based concept hierarchy [5,34], while rule 1 is a non-rule-based concept hierarchy.
A rule-based concept hierarchy is a concept hierarchy whose paths have associated generalization rules. In rule-based induction, the data cube (hypercube) in a multidimensional data warehouse is the favourable data structure [6]. To perform rule-based induction on the data in a large
data warehouse, the path relation algorithm is an excellent choice, because the data warehouse is already structured as a cube/hypercube [6]. Rule-based concepts suffer from the induction anomaly problem, which affects efficiency and is caused by:
1) A rule may depend on an attribute which
has been removed.
2) A rule may depend on an attribute whose
concept level in the prime relation has been
generalized too high to match the condition
of the rule.
3) A rule may depend on a condition which
can only be evaluated against the initial
relation, e.g. the number-of-tuples in the
relation.
There are three ways to solve the induction anomaly problem [6]:
1) Reapplying the deduction rules all over again on the initial relation, which is costly and wasteful.
2) Repetitive generalization, required by roll-up and drill-down, which can be done efficiently without the induction anomaly problem.
3) Using the proposed path relation method (rather than the backtracking algorithm) [5,6].
4. AOI prototype
The AOI method was implemented in a data mining system prototype called DBMINER [5,7,17,28,29], previously called DBLearn, and has been tested successfully against large relational databases. DBLearn [24,25,27,38] is a prototype data mining system developed at Simon Fraser University. DBMINER was developed by integrating database, OLAP, and data mining technologies [17,36] and has the following features:
1) Incorporating several data mining techniques such as attribute oriented induction, statistical analysis, progressive deepening for mining multiple-level rules, and meta-rule guided knowledge mining [7], together with data cube and OLAP technology [17].
2) Mining new kinds of rules from large
databases including multiple level
association rules, classification rules, cluster
description rules and prediction.
3) Automatic generation of numeric
hierarchies and refinement of concept
hierarchies.
4) High level SQL-like and graphical data
mining interfaces.
5) Client server architecture and performance
improvements for larger application.
6) SQL-like data mining query language
DMQL and Graphical user interfaces have
been enhanced for interactive knowledge
mining.
7) Perform roll-up and drill-down at multiple
concept levels with multiple dimensional
data cubes.
5. AOI algorithms
AOI can be implemented with the architecture design shown in Figure 2, where the characteristic rule (LCHR) and the classification rule (LCLR) can be learned directly from the transactional database (OLTP) or the data warehouse (OLAP) [6,8], with the help of the concept hierarchy for knowledge generalization. The concept hierarchy can be created from the OLTP database as a direct resource.
Figure 2. AOI architecture
From a database we can identify two types of learning:
1) Positive learning, as the target class, where the data are tuples in the database that are consistent with the learning concepts.
The positive learning/target class is built when learning a characteristic rule.
2) Negative learning, as the contrasting class, where the data do not belong to the target class. The negative learning/contrasting class is built when learning a discrimination or classification rule.
The characteristic rule is used by AOI to recognize, learn, and mine a specific character for each attribute as its specific mining characterization. The characteristic rule performs generalization with the help of the concept hierarchy, the stored background knowledge, to find the target class as positive learning. Mining cannot be limited to only one rule: the more rules that can be created, the more mining can be done. This has been shown to act as an intelligent system that can help humans build a system with the ability to think like a human [3]. Rules can often be discovered by generalization in several possible directions [9].
A relational database, as the resource for data mining with AOI, can be read with the data manipulation language SELECT SQL statement [13,14,15,16]. Using a query to build rules gives an efficient mechanism for understanding the mined rules [11,12]. In the current AOI, a query is processed with the SQL-like data mining query language DMQL at the beginning of the process. It collects the relevant sets of data by processing a transformed relational query, generalizes the data by AOI, and then presents the outputs in different forms [7].
AOI generalizes and reduces the prime relation further until the final relation satisfies the user's expectation based on the set threshold. One or two thresholds can be applied: a single threshold controls both the number of distinct attribute values and the number of tuples in the generalization process, whereas two thresholds control the number of distinct attribute values and the number of tuples separately.
The threshold that controls the maximum number of tuples of the target class in the final generalized relation can be replaced with a GROUP BY operator in the SQL SELECT statement, which limits the final result of the generalization. Setting different thresholds generates different generalized tuples, and repeatedly doing so to obtain a global picture of the induction is time-consuming and tedious work [10]. All interesting generalized tuples, as multiple rules forming the global picture of the induction, can be generated by using the GROUP BY operator or the DISTINCT function in the SQL SELECT statement.
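As a hedged illustration of that replacement (the table and column names below are assumptions, not from the paper): once the attributes have been generalized to the desired concept level, a single grouped query returns every generalized tuple with its accumulated vote in one pass.

-- All generalized tuples with their accumulated votes, in one pass.
SELECT workclass_level2, education_level1, COUNT(*) AS votes
FROM adult_generalized          -- hypothetical table holding the generalized relation
GROUP BY workclass_level2, education_level1
ORDER BY votes DESC;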
AOI can mimic data warehouse techniques by performing the generalization process repeatedly in order to generate rules at different concept levels in a concept hierarchy, enabling the user to find the most suitable discovery levels and rules. This technique performs roll-up (progressive generalization [6]) and drill-down (progressive specialization [6]) operations [2,7], which are recognized as data warehouse techniques. Finding the most suitable discovery levels and rules adds multidimensional views to a database by repeating the generalization process at different concept levels.
Building a logical formula as the representation of the final result of AOI cannot be done with the SELECT SQL statement by itself. However, the SQL statement can be combined with other applications such as Java or Visual Basic, or with server-side programs such as ASP, JSP, or PHP. The data returned by the SQL statement can then be used to create a logical formula in one of those application programs.
There are 8 strategy steps in the generalization process [3]: steps one to seven apply to the characteristic rule, and steps one to eight apply to the classification/discriminant rule.
1) Generalization on the smallest decomposable components: generalization should be performed on the smallest decomposable components of a data relation.
2) Attribute removal: if there is a large set of distinct values for an attribute but no higher-level concept is provided for the attribute, the attribute should be removed during generalization.
3) Concept tree ascension: if there exists a higher-level concept in the concept hierarchy for an attribute value of a tuple, substituting the value with its higher-level concept generalizes the tuple.
4) Vote propagation: the value of the vote is the number of accumulated tuples; votes are accumulated when identical tuples are merged during generalization.
5) Threshold control on each attribute: if the number of distinct values of an attribute in the resulting relation is larger than the specified threshold value, further generalization on this attribute should be performed.
6) Threshold control on generalized relations: if the number of tuples is larger than the specified threshold value, further generalization is performed on selected attributes and the identical tuples are merged.
7) Rule transformation: transform the final generalization into a quantitative rule and a qualitative rule, from one tuple (conjunctive) or multiple tuples (disjunctive).
8) Handling overlapping tuples: if there are overlapping tuples in both the target and contrasting classes, these tuples should be marked and eliminated from the final generalized relation.
The AOI characteristic rule algorithm [3] is given as follows:

For each attribute Ai (1 <= i <= n, where n = # of attributes) in the generalized relation GR
{  While #_of_distinct_values_in_attribute_Ai > threshold
   {  If no higher level concept in concept hierarchy for attribute Ai
      Then remove attribute Ai
      Else substitute the value of Ai by its corresponding minimal generalized concept
      Merge identical tuples
   }
}
While #_of_tuples in GR > threshold
{  Selectively generalize attributes
   Merge identical tuples
}
This AOI characteristic rule algorithm implements steps one to seven of the generalization strategy. The algorithm comprises two sub-processes: controlling the number of distinct attribute values and controlling the number of tuples.
1) Controlling the number of distinct attribute values is a vertical process that checks each attribute column by column. All attributes in the learning result of a dataset are checked until their number of distinct values is less than or equal to the threshold; this first sub-process is applied only to attributes whose number of distinct values is greater than the threshold, and only while that is the case. Each such attribute is checked for a higher-level concept in the concept hierarchy: if it has no higher-level concept, the attribute is removed; otherwise, the attribute value is substituted with the value of its higher-level concept. Identical tuples are then merged in order to summarize the generalization and accumulate the vote of the identical tuples while eliminating the redundant tuples. Eventually, after this first sub-process, every attribute in the generalization has a number of distinct values less than or equal to the threshold. This first sub-process implements steps one to five of the generalization strategy.
2) Controlling the number of tuples is a horizontal process that checks the relation rule by rule. It is carried out on the attributes that passed the first sub-process, where each attribute has a number of distinct
values less than or equal to the threshold. This second sub-process runs only while the number of rules is greater than the threshold. Selectively generalizing attributes and merging the identical tuples reduce the number of rules. The candidate attribute for further generalization can be chosen by preference, using the ratio of the number of tuples or the number of distinct attribute values, or it can be chosen by the user based on what is least interesting, either a non-interesting attribute or a non-interesting rule. As in the first sub-process, identical tuples are merged in order to summarize the generalization and accumulate the vote value of the identical tuples while eliminating the redundant tuples. Eventually, after this second sub-process, the number of rules is less than or equal to the threshold. This second sub-process implements steps three, four, and six of the generalization strategy.
The AOI discriminant rule algorithm [1] is shown below:

For each attribute Ai (1 <= i <= n, where n = # of attributes) in the generalized relation GR
{  Mark the overlapping tuples
   While #_of_distinct_values_in_attribute_Ai > threshold
   {  If no higher level concept in concept hierarchy for attribute Ai
      Then remove attribute Ai
      Else substitute the value of Ai by its corresponding minimal generalized concept
      Mark the overlapping tuples
      Merge identical tuples
   }
}
While #_of_tuples in GR > threshold
{  Selectively generalize attributes
   Mark the overlapping tuples
   Merge identical tuples
}
The AOI discriminant rule algorithm implements steps one to eight of the generalization strategy. Since the AOI discriminant rule and AOI characteristic rule algorithms share generalization strategy steps one to seven, they are essentially the same process; the only difference is step eight. They also share the same sub-processes, i.e., controlling the number of distinct attribute values as the first sub-process and controlling the number of tuples as the second. The handling of overlapping tuples, the eighth generalization strategy step, is processed at the beginning, before the first sub-process, and again in both sub-processes before identical tuples are merged.
6. AOI Advantages and disadvantages
AOI provides a simple and efficient way to learn knowledge rules from a large database and has many advantages [9], such as:
1) AOI provides additional flexibility over many machine learning algorithms.
2) AOI can learn knowledge rules in different conjunctive and disjunctive forms and provides more choices for experts and users.
3) AOI can use the facilities of a traditional relational database, such as selection, join, and projection, whereas most learning algorithms suffer from inefficiency problems in a large database environment.
4) AOI can learn qualitative rules with quantitative information, while many machine learning algorithms can only learn qualitative rules.
5) AOI can handle noisy data and exceptional cases elegantly by incorporating statistical techniques in the learning process, whereas
some learning systems can only work in a 'noise-free' environment.
However, AOI also has disadvantages [10], such as:
1) AOI provides only a snapshot of the generalized knowledge, not a global picture. The global picture can still be revealed by trying different thresholds repeatedly.
2) Adjusting different thresholds results in different sets of generalized tuples. However, applying different thresholds repeatedly is time-consuming and tedious work.
3) There is a problem in selecting the best generalized rules between a large and a small threshold. A large threshold value leads to a relatively complex rule with many disjuncts whose results may not be fully generalized; on the other hand, a small threshold value leads to a simple rule with few disjuncts that may over-generalize, with the risk of losing some valuable information.
7. AOI Current Studies
There are a number of recent studies on AOI.
One study by Chen et al has proposed a global
AOI method employing multiple-level mining
technique with multiple minimum supports in
order to generalize all interesting general
knowledge [30]. Wu et al have proposed a
Global Negative AOI (GNAOI) approach that
can generate comprehensive and multiple-level
negative generalized knowledge at the same
time [31]. Furthermore, Muyeba et al have
proposed clusterAOI, a hybrid interestingness
heuristic algorithm, which uses attribute
features such as concept hierarchies and distinct
domain attribute values to dynamically
recalculate new attribute thresholds for each less
significant attribute [32]. Moreover, Huang et al
have introduced the Modified AOI (MAOI)
method to deal with the multi-valued attribute
table and further sort the readers into different
clusters. Instead of using the concept hierarchy
and concept trees, MAOI method implemented
the concept climbing and generalization of
multi-valued attribute table with Boolean
Algebra and modified Karnaugh Map, and then
described the clusters with concept description
[33].
Meanwhile, the over-generalization problem in AOI has been reduced with entropy measurement, where the AOI algorithm was extended with feature selection so that the generalization process depends on a feature entropy measurement [41]. AOI has also been combined with EP (Emerging Pattern) to become AOI-HEP (Attribute Oriented Induction High Emerging Pattern), used to mine frequent and similar patterns [42,43,44], with future research directions such as inverse discovery learning, learning from more than two datasets, and learning other knowledge rules [45]. Moreover, the MAOI (Modified AOI) algorithm was proposed to deal with multi-valued attributes, converting the data to Boolean bits and using a K-map to converge the attributes [46]. Furthermore, AOI was modified into Frequency Count AOI (FC-AOI) and used to mine network data [47]. AOI was also extended as Extended Attribute Oriented Induction (EAOI) for clustering mixed data types, where EAOI adds functions for handling major values and numeric attributes [48,49].
Moreover, AOI was chosen as the second step of a five-step algorithm proposed to produce AOI characteristic rules for parallel machine scheduling [50]. Another approach performs classification using decision tree induction, improving the C4.5 classifier with four steps in which the first step is generalization by AOI [51]. Finally, CFAOI (Concept-Free AOI) was proposed to free AOI from the constraint of the concept tree on multi-valued attributes by combining simplified binary digits with the Karnaugh map [52].
8. Conclusion
AOI is now 26 years old (dating from 1989), which shows that it still has a place in pattern finding and that it has been combined with, and used as a complement to, other data mining techniques. AOI can mine many different patterns, and further patterns may emerge in the future. AOI has proven to be a powerful mining technique, since many patterns can be mined while producing simple pattern results. AOI is particularly strong at rolling up/summarizing data from low to high levels in a concept tree/hierarchy, which yields simple patterns. Implementations of AOI show that it is useful and recognized for mining and summarizing patterns from huge numbers of patterns of many different kinds. The use of AOI in many fields, such as business, education, engineering, and health, should be mapped in order to increase the reliability of AOI as a proven and powerful data mining technique.
Acknowledgement
This research is supported under the Program of Research Incentive of the National Innovation System (SINAS) from the Ministry of Research, Technology and Higher Education of the Republic of Indonesia, decree number 147/M/Kp/IV/2015, research code RD-2015-0020.
REFERENCES
[1] Han,J., Cai, Y., and Cercone, N. 1993. Data-driven discovery of
quantitative rules in relational databases. IEEE Trans on Knowl
and Data Engin, 5(1),29-40.
[2] Han,J. and Fu, Y. 1995. Exploration of the power of attribute-
oriented induction in data mining. in U. Fayyad, G. Piatetsky-
Shapiro, P. Smyth and R. Uthurusamy, eds. Advances in
Knowledge Discovery and Data Mining, 399-421.
[3] Han, J., Cai, Y. and Cercone, N. 1992. Knowledge discovery in
databases: An attribute-oriented approach. In Proceedings of the
18th Int. Conf. Very Large Data Bases, 547-559.
[4] Han,J. 1994. Towards efficient induction mechanisms in database
systems. Theoretical Computer Science, 133(2), 361-385.
[5] Cheung, D.W., Fu, A.W. and Han, J. 1994. Knowledge discovery
in databases: A rule-based attribute-oriented approach. In
Proceedings of Intl Symp on Methodologies for Intelligent
Systems, 164-173.
[6] Cheung, D.W., Hwang, H.Y., Fu, A.W. and Han, J. 2000.
Efficient rule-based attribute-oriented induction for data mining.
Journal of Intelligent Information Systems, 15(2), 175-200.
[7] Han,J., Fu, Y.,Wang, W., Chiang, J., Gong, W., Koperski,
K., Li,D., Lu, Y., Rajan,A., Stefanovic,N., Xia,B. and
Zaiane,O.R.1996. DBMiner:A system for mining knowledge in
large relational databases. In Proceedings of Int'l Conf. on Data
Mining and Knowledge Discovery, 250-255.
[8] Han,J., Lakshmanan, L.V.S. and Ng, R.T. 1999. Constraint-based,
multidimensional data mining. IEEE Computer, 32(5), 46-50.
[9] Cai, Y. 1989. Attribute-oriented induction in relational databases.
Master thesis, Simon Fraser University.
[10] Wu, Y., Chen, Y. and Chang, R. 2009. Generalized Knowledge
Discovery from Relational Databases. International Journal of
Computer Science and Network, 9(6),148-153.
[11] Imielinski, T. and Virmani, A. 1999. MSQL: A Query Language
for Database Mining. in Proceedings of Data Mining and
Knowledge Discovery, 3, 373-408.
[12] Muyeba, M. 2005. On Post-Rule Mining of Inductive Rules using
a Query Operator. In Proceedings of Artificial Intelligence and
Soft Computing.
[13] Meo, R., Psaila,G. and Ceri,S. 1998. An Extension to SQL for
Mining Association Rules. In Proceedings of Data Mining and
Knowledge Discovery,2,195-224.
[14] Muyeba,M.K. and Keane,J.A. 1999. Extending attribute-oriented
induction as a key-preserving data mining method. In Proceedings
3rd European Conference on Principles of Data Mining and
Knowledge Discovery, Lecture Notes in Computer science, 1704,
448-455.
[15] Muyeba, M. and Marnadapali, R. 2005. A framework for Post-
Rule Mining of Distributed Rules Bases. In Proceeding of
Intelligent Systems and Control.
[16] Zaiane, O.R. 2001. Building Virtual Web Views. Data and
Knowledge Engineering, 39, 143-163.
[17] Han, J., Chiang, J. Y., Chee, S., Chen, J., Chen, Q., Cheng, S.,
Gong, W., Kamber, M.,Koperski, K., Liu, G., Lu, Y., Stefanovic,
N., Winstone, L., Xia, B. B., Zaiane, O. R., Zhang, S., and Zhu,
H. 1997. DBMiner: a system for data mining in relational
databases and data warehouses. In Proceedings of the 1997
Conference of the Centre For Advanced Studies on Collaborative
Research, 8-.
[18] Frank, A. and Asuncion, A. 2010. UCI Machine Learning
Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University
of California, School of Information and Computer Science.
[19] Elfeky, M.G., Saad, A.A. and Fouad, S.A. 2000. ODMQL: Object
Data Mining Query Language. In Proceedings of the International
Symposium on Objects and Databases, 128-140.
[20] Han, J. and Fu, Y. 1994. Dynamic Generation and Refinement of
Concept Hierarchies for Knowledge Discovery in Databases. In
Proceedings of AAAI Workshop on Knowledge Discovery in
Databases, 157-168.
[21] Huang, Y. and Lin, S. 1996. An Efficient Inductive Learning
Method for Object-Oriented Database Using Attribute Entropy.
IEEE Transactions on Knowledge and Data Engineering,
8(6),946-951.
[22] Hu, X. 2003. DB-HReduction: A Data Preprocessing Algorithm
for Data Mining Applications. Applied Mathematics
Letters,16(6),889-895.
[23] Hsu, C. 2004. Extending attribute-oriented induction algorithm for
major values and numeric values. Expert Systems with
Applications, 27, 187-202.
[24] Han, J., Fu, Y., Huang, Y., Cai, Y., and Cercone, N. 1994.
DBLearn: a system prototype for knowledge discovery in
relational databases. ACM SIGMOD Record, 23(2), 516.
[25] Han, J., Fu, Y., and Tang, S. 1995. Advances of the DBLearn
system for knowledge discovery in large databases. In
Proceedings of the 14th international Joint Conference on
Artificial intelligence, 2049-2050.
[26] Beneditto, M.E.M.D. and Barros, L.N.D. 2004. Using Concept
Hierarchies in Knowledge Discovery. Lecture Notes in Computer
Science, 3171, 255-265.
[27] Fudger, D. and Hamilton, H.J. 1993. A Heuristic for Evaluating
Databases for Knowledge Discovery with DBLEARN. In
Proceedings of the International Workshop on Rough Sets and
Knowledge Discovery: Rough Sets, Fuzzy Sets and Knowledge
Discovery (RSKD '93), 44-51.
[28] Han, J. 1997. OLAP Mining: An Integration of OLAP with Data
Mining. In Proceedings of the 7th IFIP 2.6 Working Conference on
Database Semantics (DS-7),1-9.
[29] Han, J., Fu,Y., Koperski, K., Melli, G., Wang, W. And Zaïane,
O.R. 1996. Knowledge Mining in Databases: An Integration of
Machine Learning Methodologies with Database Technologies,
Canadian Artificial Intelligence,(38),4-8.
[30] Chen, Y.L., Wu,Y.Y. and Chang, R. 2012. From data to global
generalized knowledge. Decision Support Systems, 52(2), 295-
307.
[31] Wu,Y.Y., Chen,Y.L., and Chang,R., 2011, Mining negative
generalized knowledge from relational databases, Knowledge-
Based Systems,24(1), 134-145.
[32] Muyeba, M.K., Crockett, K. and Keane, J.A. 2011. A hybrid
interestingness heuristic approach for attribute-oriented mining. In
Proceedings of the 5th KES international conference on Agent
and multi-agent systems: technologies and applications (KES-
AMSTA'11), 414-424.
[33] Huang, S., Wang, L. and Wang, W. 2011. Adopting data mining
techniques on the recommendations of the library collections. In
Proceedings of the 11th international conference on Information
and Knowledge engineering, 46-52.
[34] Thanh, N.D., Phong, N.T. and Anh, N.K. 2010. Rule-Based
Attribute-Oriented Induction for Knowledge Discovery. In
Proceedings of the 2010 2nd International Conference on
Knowledge and Systems Engineering (KSE '10), 55-62.
[35] Han,J. and Fu, Y. 1995. Discovery of Multiple-Level Association
Rules from Large Databases. In Proceedings of the 21th
International Conference on Very Large Data Bases (VLDB '95),
420-431.
[36] Han, J. 1998. Towards on-line analytical mining in large
databases. SIGMOD Rec. 27(1), 97-107.
[37] Han, J., Cai, O., Cercone, N. and Huang, Y. 1995. Discovery of
Data Evolution Regularities in Large Databases. Journal of
Computer and Software Engineering,3(1),41-69.
[38] Cercone, N., Han, J., McFetridge, P., Popowich, F., Cai,Y., Fass,
D., Groeneboer, C., Hall, G. and Huang, Y. 1994. System X and
DBLearn: How to Get More from Your Relational Database,
Easily. Integrated Computer-Aided Engineering, 1(4),311-339.
[39] Cai, Y., Cercone, N. and Han, J. 1991. Learning in relational
databases: an attribute-oriented approach. Comput. Intell,
7(3),119-132.
[40] Cai, Y., Cercone, N. and Han, J. 1990. An attribute-oriented
approach for learning classification rules from relational
databases. In Proceedings of 6th International Conference on Data
Engineering, 281-288.
[41] Al-Mamory, S.O., Hasson, S.T. and Hammid, M.K. 2013.
Enhancing Attribute Oriented Induction of Data Mining, Journal
of Babylon University, 7(21), 2286-2295.
[42] S. Warnars. 2015. Mining Frequent and similar patterns with
Attribute Oriented Induction High Level Emerging Pattern (AOI-
HEP) Data Mining technique, International Journal of Emerging
Technologies in Computational and Applied Sciences (IJETCAS),
3(11), 266-276.
[43] S.Warnars, 2014. Mining Frequent pattern with Attribute Oriented
Induction High Level Emerging Pattern (AOI-HEP). Proceedings
of the 2nd International Conference on Information and
Communication Technology (ICoICT), 144-149.
[44] S.Warnars, 2012. Attribute Oriented Induction High Level
Emerging Pattern. Proceedings of the International Conference on
Granular Computing(GrC).
[45] S.Warnars, 2014. Attribute Oriented Induction High Level
Emerging Pattern (AOI-HEP) future research. Proceedings of the
8nd International Conference on Information & Communication
Technology and Systems (ICTS), 13-18.
[46] Huang, S., Hsu, P. and Lam, H.N.N. 2013. An attribute oriented
Induction approach for Knowledge discovery from relational
databases. Advances in Information Sciences and Service
Sciences (AISS), 5(3), 511-519.
[47] Tanutama, L. 2013. Frequency count Attribute Oriented Induction
of Corporate Network data for Mapping Business activity.
International Conference on Advances Science and Contemporary
Engineering (ICASCE), 149-152.
[48] Prasad, D.H. and Punithavalli, M. 2012. An integrated GHSOM-
MLP with Modified LM Algorithm for Mixed Data Clustering,
ARPN Journal of Engineering and Applied Sciences, 7(9), 1162-
1169.
[49] Prasad, D.H. and Punithavalli, M. 2013. A Novel approach for
mixed Data Clustering using Dynamic Growing Hierarchical Self-
Organizing Map and Extended Atrribute-Oriented Induction,Life
Science Journal, 10(1), 3259-3266.
[50] Kaviani, M., Aminnayeri, M, Rafienejad, S.N. and Jolai, F.2012.
An appropriate pattern to solving a parallel machine scheduling
by combination of meta-heuristic and data mining, Journal of
American Science, 8(1), 160-167.
[51] Ali, M.M, Qaseem, M.S., Rajamani, L. and Govardhan, A. 2013.
Extracting useful Rules Through Improved Decision Tree
Induction using Information Entropy, International Journal of
Information Sciences and Techniques(IJIST), 3(1), 27-41.
[52] Huang, S. 2013. CFAOI: Concept-Free AOI on Multi Value
Attributes. Life Science Journal, 10(4), 2341-2348.
Supporting Aggregation in Data Warehousing Considering User-Defined
Temporal Granularities
Paolo Terenziani (1), Antonella Andolina (2)
(1) DISIT – Universita’ del Piemonte Orientale “Amedeo Avogadro”, Viale Teresa Michel 11, 15121 Alessandria, Italy
Email: terenz@di.unito.it
(2) ITCS Sommeiller, C.so Duca degli Abruzzi, 20 - 10129 Torino, Italy
Email: antoando@libero.it
ABSTRACT
Time-varying data are important in many
applications. Starting from the 80’s, they have been
widely studied in the field of temporal databases.
More recently, data warehouses (DWs) have gained
an increasing importance, strongly focusing on
time-varying data. DWs support different temporal
granularities (e.g., days, months, years), as well as
several operators for aggregating data along them.
In the past two decades, the importance of coping
also with user-defined granularities has been clearly
identified. However, most current DWs support only the treatment of limited sets of pre-defined
granularities. In this paper, we ground our approach
on some basic results in the temporal database
literature, including the telic\atelic distinction, and
we propose a general and application-independent
framework supporting aggregation of measures
along user-defined granularities.
KEYWORDS
Temporal Databases, Data Warehousing, user-
defined temporal granularities, temporal
aggregation, telic\atelic data
1 INTRODUCTION
The history of past activities, in the form of
time-stamped data, is a crucial source of
information for knowledge\data-based decision
making (consider, e.g., Business Intelligence).
Data warehouses (DWs) are widely exploited to
access historical data to support strategic
decisions. A DW is “a collection of subject-
oriented, integrated, non-volatile, and time-
variant data to support management’s
decisions” [1]. Thus, the temporal dimension is
an intrinsic part of the definition of DWs. Facts
in DWs have a temporal dimension, regardless
of whether source data are temporal or not. If
source data are not temporal, the time when
data are loaded into the DW is associated to
them, otherwise the time-stamps in the source
data are imported into the DW.
The research in the area of databases clearly
points out that the treatment of temporal data is
a semantically and practically complex issue
and that, given both its generality and
complexity, it cannot be delegated to users and
system-designers (see, e.g., the illuminating
examples in [2]). The area of research of
temporal databases (TDB henceforth) has
originated to cope with temporal data in
databases, leading , in the last three decades, to
a large set of dedicated conferences and
scientific papers (e.g., more than 2000 papers
are referred in the bibliography in [3], and the
recent Encyclopedia of Database Systems [4]
contains about 90 entries on time-related
terms). However, several problems still deserve
further investigation. We focus on one of them:
aggregation of measures along multiple user-
defined temporal granularities, to cope with
problems like the one in Ex.1, where working-
weeks (the period from Mondays to Fridays) is
an example of user-defined granularity.
Ex.1: Given salaries expressed at the granularity of working-weeks, aggregate them at the granularity of months.
In Data Warehousing (DW), the aggregation
(through functions such as sum, average, etc.)
of measures along dimensions (e.g., in “roll-up” operations) is a fundamental and frequently
adopted operation. Considering the temporal
dimension, current DW systems operate on a
fixed pre-defined set of granularities, usually
taken from the Gregorian calendar. On the other
hand, supporting user-defined granularities
involves being able to take into account also new, domain- and\or task-dependent granularities, defined by the system designer or by users.
there is a wide consensus about the importance
of coping with user-defined granularities, and
many different approaches have been devised to
achieve such a task (see the discussion in
Section 5 of this paper). For example, in their
attempt to provide a new standard (TSQL2 [2])
for the treatment of time in temporal databases,
Snodgrass and co-authors emphasized that in
many practical applications, supporting a
“standard” set of granularities (e.g., seconds,
minutes, ..., days, months, years) does not
suffice, since user-defined granularities need to
be used. In [2], the authors stress that the usage
of a calendar (informally, a calendar can be
seen as a set of related time granularities)
depends on the cultural, legal and even business
orientation of the user, and that no calendar is
inherently “better” than another. They also
show many examples of different calendars and
periodicities, including e.g. Gregorian vs. lunar
calendar, and fiscal vs. academic year. On the
other hand, in most DW approaches only a pre-
defined set of temporal granularities (usually
taken from the Gregorian calendar only) is
provided, and only regular mappings (i.e.,
mappings for which some conversion constant
exists [5]) between such a pre-defined set of
granularities are usually supported. In case they
are considered, irregular mappings between
pre-defined granularities are treated in an ad-
hoc way. On the other hand, no support is
provided to extend a DW with new user-
defined granularities (whose mapping onto the
pre-existent ones may be, obviously, irregular).
Indeed, in their recent book about advanced
spatial and temporal DW, Malinowski and
Zimanyi [5] have explicitly claimed: “Even
though the problem of managing data with
multiple temporal granularities has been widely
investigated in the area of temporal databases,
this is still an open research area in the field of
temporal datawarehouses” (page 246).
In this paper, we overcome such a relevant
limitation of the current literature, coping with
temporal aggregation in the context of multiple
user-defined temporal granularities. We
consider a context in which, as in most TDB
approaches to user-defined temporal
granularities (e.g., in TSQL2 [2], and in [6]),
temporal granularities may be arbitrarily
defined by system administrators\users, and
users may switch between any pair of
granularities. Abstractly speaking, the whole
problem may be split into three parts:
(i) defining proper mapping algorithms that determine the correspondence between granules in G_origin and granules in G_task;
(ii) defining efficient algorithms to collect data to be aggregated at a given granularity;
(iii) supporting users with appropriate aggregation functions to perform “meaningful” aggregations.
The TDB literature has devoted a lot of
attention to problems (i) (see, e.g., the book [6])
and (ii) (called temporal group composition in
[7]). On the other hand, in this paper we focus
on the problem (iii) above, which has been
partially neglected by the TDB literature,
especially as concerns its impact on DWs. We
propose a general-purpose and domain-
independent methodology to systematically
provide users with a set of proper aggregation
functions/operations to aggregate values along
the DW temporal dimension, when moving
from an origin temporal granularity (e.g.,
business-weeks) to a task granularity (e.g.,
months), and considering the temporal
relationships of the origin and task granularities
and the nature of the data to be aggregated. In
particular,
(1) we focus on the telic vs. atelic nature of
data, whose impact on TDBs has only
recently started to be explored [8], and
(2) differently from the current DW
approaches, in which the granularities used
in the temporal dimension are limited
(usually, a subset of the Gregorian
calendar), we support user-defined
granularities, and irregular mappings
between them.
To the best of our knowledge, there is no DW
approach that deals in a systematic way with
problem (iii), and with the above tasks.
Obviously, it would be possible to delegate
users and system developers to develop
aggregation functions between multiple user-
defined granularities at the application level.
However, such a choice would result in a huge
extra-amount of (very complex) work for them.
We thus aim at providing a suitable suite of
aggregation functions once-and-for-all, at the
system level. Obviously, we do not claim that
the set of aggregation functions/operations we
identify is “complete”. We believe that it is not
reasonable to demand all of the work to the
users. Thus, our approach is a first step towards
a general (not ad-hoc or application\domain-
dependent) and user-friendly approach to cope
with the above phenomena.
Our methodology has at least three main
contexts of applications within the DW context.
It can be applied:
(1) to aggregate a measure from one temporal
granularity to another one
(2) in cases in which source data are temporal,
to aggregate source data from the source
temporal granularity to the basic temporal
granularity of the DW (or Data Mart)
(3) in cases in which slowly changing
dimensions [9] are coped with using TDB
methodologies (as suggested in [5]), to
cope with the aggregation of measures
when dimensional attributes are changing.
Since we operate at the data level, in this paper
we fully explore only the first two issues above.
Finally, it is worth mentioning that, in this
paper, we focus only on valid time [10].
Transaction time [10] may be present in the
source data (and, possibly, it can be loaded in
the DW), but transaction time for aggregated
data will not give useful information for
decision-making users [11]. Anyway, our
methodology is general, thus it can be applied
also to transaction time (which is inherently
atelic [12]), and to other temporal dimensions,
such as lifespan and loading time in [5].
The paper is organized as follows. We
first introduce the telic\atelic distinction in
Section 2. Section 3 identifies three core
phenomena to be coped with when dealing with
temporal aggregation in DW, as well as a set of
relationships which holds between pairs of
user-defined temporal granularities. In section
4, we introduce a set of general functions to
cope with the above phenomena, and criteria to choose among them. Finally, section 5
presents conclusions and related works.
2 PRELIMINARIES: THE TELIC\ATELIC
DISTINCTION IN DATABASES
The telic/atelic distinction originates from the
ancient Greek philosophy. In [13], Aristotle has
divided facts into two main categories: telic and
atelic. Telic facts (e.g., John built a house) are
those facts that have telos (telos means goal in
ancient Greek), while atelic facts (e.g., John
was asleep) do not have it (the prefix “a” corresponds to the privative α in Greek).
Starting from Aristotle, the distinction between
telic and atelic facts has been widely studied in
many different areas and, in particular within
linguistics. In the area of linguistics, natural
language sentences are studied and classified
according to their syntactic, and semantic
properties, and their linguistic behavior. For instance, Vendler classified sentences as activities, accomplishments, achievements, and states [14]. In particular, the linguistic analysis
has pointed out that the telic\atelic distinction
plays an important role in characterizing the
semantics of sentences\facts. In particular,
Dowty [15] pointed out that stative sentences
(which are atelic) have the downward inheritance property [16]: for example, if John was asleep from 1 to 2pm, he was asleep in each subinterval of such an interval. Dowty also noticed that such a property does not hold for accomplishments (which are telic): if John built a house starting from April 1st and ending on October 31st, we cannot conclude that he built the house in any subinterval of such an interval (indeed, one can say that John was building a house in May, but not that he built the (whole) house in May). Indeed, also upward inheritance [16] holds for atelic facts, but not for telic ones: if John was asleep from 1 to 2 and then from 2 to 3, we can correctly conclude that John was asleep between 1 and 3; on the other hand, if John built a house from April 1st to October 31st, and then (another) house from November 1st to January 30th, one cannot conclude that John built a house from April 1st to January 30th. The telic\atelic distinction has been studied in many different areas, including linguistics, philosophy, and cognitive science, and is quite articulated. However, in this paper, it is enough to oversimplify it, considering the following definition:
Definition. Atelic vs. telic values (data). In
this paper, we define atelic those values (data)
for which upward and downward inheritance
hold, and telic those values for which such
properties do not hold.
Despite its generality and relevance, the
telic/atelic distinction has usually been
neglected in the TDB area. A relevant
exception are the work by Terenziani and
Snodgrass, that have proposed a relational data
model and algebra considering the distinction
[12], and the telic\atelic entry in the database
Encyclopedia [8]. For the first time, in this
paper we analyze the impact of the telic/atelic
distinction on the problem of temporal
aggregation between different temporal
granularities. But, first, it is important to point
out what is the main impact of the telic\atelic
distinction on TDBs: as discussed in [12], the
snapshot semantics ([2]; also called point-based
semantics), which is usually adopted in the
TDB area to model the meaning of temporal
data, applies correctly to atelic facts, but not to
telic ones, for which an interval-based
semantics is required. To show this fact, let us
consider drug administration, as an example of
telic fact. Let us suppose that Sue had an
administration of 500 mg of cyclophosphamide
(a cancer drug) starting at 1 and ending at 3
(inclusive), an administration of 500 mg of
cyclophosphamide starting at 4 and ending at 4,
and that Mary had an administration of 200 mg
of infliximab starting at 2 and ending at 4. If we
adopt the point-based semantics, the above
example is modeled as:
1 {<Sue, cyclophosphamide, 500>}
2 {<Sue, cyclophosphamide, 500>,
<Mary, infliximab, 200>}
3 {<Sue, cyclophosphamide, 500>,
<Mary, infliximab, 200>}
4 {<Sue, cyclophosphamide, 500>,
<Mary, infliximab, 200>}
Point-based semantics is not expressive enough
to deal with the temporal interpretation of telic
data, so that, in the example, we have a loss of
information. The two different drug
administrations given to Sue (one starting at 1
and ending at 3, the other from 4 to 4) cannot
be distinguished in the semantic model. The
effects of such a loss of information are quite
critical. For example, in the representations
above, there is no way to recover the fact that
1000 mg of cyclophosphamide in total were
administered to Sue. It is fundamental to stress
that such a loss of information is completely
independent of the representation language
used to model data. For instance, the above
example could be represented as
<Sue, cyclophosphamide, 500 || {1,2,3,4}> ∈ R and <Mary, infliximab, 200 || {2,3,4}> ∈ R
<Sue, cyclophosphamide, 500 || {[1–3],[4–4]}> ∈ R and <Mary, infliximab, 200 || {[2–4]}> ∈ R
<Sue, cyclophosphamide, 500 || [1–3]> ∈ R and <Sue, cyclophosphamide, 500 || [4–4]> ∈ R and <Mary, infliximab, 200 || {[2–4]}> ∈ R
But, as long as the point-based semantics is
used, the data semantics is the one elicited
above. Thus, the loss of information is the same
in all the representations. Therefore, regardless
of the chosen representation, a more expressive
semantics is needed to cope properly with telic
data. In the interval-based semantics each tuple
in a temporal relation is associated with a set of
time intervals, which are the temporal extents in
which the fact described by the tuple occurs. In
this semantics the index is a time interval. Time
intervals are atomic primitive entities, in the
sense that they cannot be decomposed. In the
interval-based semantics, the above example
can be modelled as follows:
[1–3] {<Sue, cyclophosphamide, 500>}
[2–4] {<Mary, infliximab, 200>}
[4–4] {<Sue, cyclophosphamide, 500>}
Note that the interval-based semantics is
expressive enough to cope with the intended
meaning of telic facts. The above model does
not imply that Sue had a cyclophosphamide
administration of 500 mg at 1, or one in the
period [1–4]. In the interval-based semantics,
intervals are atomic entities, that cannot be
merged or decomposed. This fact correctly
accounts for the fact that neither downward nor
upward inheritance hold for telic facts.
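The following is a small Python sketch of our own (names and data structures are assumptions, not from the paper) contrasting the two semantics on the drug-administration example: under the point-based model the two separate administrations to Sue cannot be recovered, so her total dose cannot be computed correctly, whereas the interval-based model, whose intervals are atomic, preserves it.

point_based = {  # time point -> snapshot (set of facts holding at that point)
    1: {("Sue", "cyclophosphamide", 500)},
    2: {("Sue", "cyclophosphamide", 500), ("Mary", "infliximab", 200)},
    3: {("Sue", "cyclophosphamide", 500), ("Mary", "infliximab", 200)},
    4: {("Sue", "cyclophosphamide", 500), ("Mary", "infliximab", 200)},
}

interval_based = {  # atomic interval -> set of facts holding on that interval
    (1, 3): {("Sue", "cyclophosphamide", 500)},
    (2, 4): {("Mary", "infliximab", 200)},
    (4, 4): {("Sue", "cyclophosphamide", 500)},
}

def total_dose_point_based(db, patient):
    # The best one can do is coalesce maximal runs of identical snapshots,
    # which fuses Sue's two administrations into one.
    total, previous = 0, set()
    for t in sorted(db):
        current = {f for f in db[t] if f[0] == patient}
        total += sum(dose for (_, _, dose) in current - previous)
        previous = current
    return total

def total_dose_interval_based(db, patient):
    # Intervals are atomic: each tuple is one administration, so sums are correct.
    return sum(dose for facts in db.values() for (p, _, dose) in facts if p == patient)

print(total_dose_point_based(point_based, "Sue"))        # 500  (information lost)
print(total_dose_interval_based(interval_based, "Sue"))  # 1000 (correct)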
3 RELATIONSHIPS BETWEEN USER-
DEFINED GRANULARITIES
Intuitively, in DW, temporal aggregation is a
‘roll-up’ operation grouping data from a finer to
a coarser temporal granularity (e.g., from ‘days’
into ‘weeks’). Such an operation is quite
complex, in case user-defined temporal
granularities have to be taken into account.
Granularities have been formally defined in the
temporal database glossary [17] as follows:
Definition (temporal granularity). Assume a
time domain T as a set of totally ordered time
instants. A granularity is a mapping G from the
integers (the index set) to the subsets of the
time domain such that:
(1) if i < j and G(i) and G(j) are non-empty,
then each element in G(i) is less than all the
elements in G(j), and
(2) if i < k < j and G(i) and G(j) are non-empty,
then G(k) is non-empty.
Each non-empty set G(i) in the above definition
is called granule.
The first condition in the granularity definition
represents the fact that granules in a granularity
do not overlap and that their index order is the
same as their time domain order. The second
condition represents the fact that the subset of
the index set for the granules is contiguous. On
the basis of the definition, a granularity defines
a countable set of granules; each granule can be
identified by an integer.
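As an illustration (a sketch under our own assumptions, not part of the formal framework), a granularity can be represented in Python as a mapping from integer indices to granules, and the two conditions of the definition can be checked directly:

def is_granularity(g):
    # g: dict mapping integer indices to granules (sets of time instants)
    nonempty = sorted(i for i, granule in g.items() if granule)
    # condition (2): the indices of the non-empty granules are contiguous
    if nonempty and nonempty != list(range(nonempty[0], nonempty[-1] + 1)):
        return False
    # condition (1): granules do not overlap and follow the index order
    for i, j in zip(nonempty, nonempty[1:]):
        if max(g[i]) >= min(g[j]):
            return False
    return True

# e.g., two "working weeks" over a toy time domain of day numbers (Monday = 0)
working_weeks = {1: {0, 1, 2, 3, 4}, 2: {7, 8, 9, 10, 11}}
print(is_granularity(working_weeks))  # True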
Several formal languages have been
provided to support the definition of temporal
granularities [18]. If the designer (or user) is
free to define any temporal granularity (in the
given formal language), the switching from any
granularity to another one is an hard issue to
cope with. In several approaches, the problem
is simply demanded to users\designers which
define temporal granularities: whenever a new
granularity is defined, the algorithms to convert
and\or aggregate data from the new granularity
to all the already present granularities must be
provided by the user\designer herself [2; 19].
Other approaches (e.g., [6]) introduce semantic
assumptions (see the discussion in Section 5).
In this paper, we propose a flexible and user-
friendly methodology to approach the problem. We first identify a basic set of
temporal relationships that characterize the
mapping between granularities. Then, we
propose a set of appropriate aggregation
functions\operations on the basis of such
relationships. In such a way, appropriate
aggregation functions between any user-defined
granularities can be automatically suggested to
the user, in the hypothesis that the relationship between such granularities is known.
3.1 Motivating Issues
One of the goals of this paper is to identify the
relationships between user-defined granularities
that characterize temporal aggregation in the
DW context. To achieve such a goal, we point
out three fundamental phenomena regarding
temporal aggregation along user-defined
temporal granularities in DWs. In case the
mapping between granularities are not regular
[5], temporal aggregation may involve at least
three different kinds of problems:
Loss of information. Some data may be lost
when aggregating. E.g., if days are aggregated
into working-weeks (e.g., from Monday to
Friday), some data (data concerning Saturdays
and Sundays) are left out of the aggregation;
Lack of information. The granularity being
aggregated might not “cover” the grouping
granularity. For instance, if one aggregates
working days (e.g., days from Monday to
Friday) into weeks, data concerning Saturdays
and Sundays will be missing. This is a potential
problem if operators such as average (AVG
henceforth) have to be applied.
“Split” of information. Some data may be
potentially part of different aggregates. For
instance, when aggregating weeks into months,
what about the data regarding weeks spanning
over two different months? We call this
phenomenon an “internal” split. Also, “external”
splits need to be considered. For instance, let us
take into account a user-defined granularity
“odd-months” (i.e., Jan, March, May, July,
Sept, Nov). If we aggregate weeks into odd-
months, data regarding weeks spanning over
two different months have to be somehow
“split”, since only a part of them may be
relevant for the aggregation (thus, a loss of
information may be caused by external splits).
3.2 Relationships
In our approach, we aim at facing the three
above phenomena; this goal is at the basis of
our identification of the relationships between
granularities. Given any two granularities G and H with index sets I_G and I_H respectively, we identify different possible relationships. We first introduce three general predicates that may hold between two granularities.
Definition. Intersect, Split, Cover
Given any two (user-defined) granularities G and H with index sets I_G and I_H respectively:
Intersect(G,H): ∃i∈I_G ∃j∈I_H G(i) ∩ H(j) ≠ ∅
Split(G,H): ∃i∈I_G ∃j∈I_H (G(i) ∩ H(j) ≠ ∅ ∧ ¬(G(i) ⊆ H(j)))
Cover(G,H): ∀i∈I_H ∃j_1,…,j_k∈I_G H(i) ⊆ G(j_1) ∪ … ∪ G(j_k)
Roughly speaking, Intersect(G,H) holds if there
is at least an intersection between a granule of
G and a granule of H; Split(G,H) holds if there
is at least a granule of G which overlaps (but is
not fully contained into) a granule of H (notice
that the part of G(i) which exceed H(j) may be
contained or not into some other granule of H);
Cover(G,H) holds if each granule of H is fully
contained in the union of one or more granules
of G. Of course, Intersection is implied by
Cover and by Split. We can now introduce our
relationships (see also [20]).
Definition. Relationships between granularities. Given any two (user-defined) granularities G and H with index sets I_G and I_H respectively:
SAME. G and H are the same if they are equal, i.e., ∀i∈I_H ∃j∈I_G H(i) = G(j) ∧ ∀j∈I_G ∃i∈I_H G(j) = H(i)
PARTITION. G is a partition of H if ¬SAME(G,H) ∧ Cover(G,H) ∧ Cover(H,G) ∧ ¬Split(G,H)
SPLIT-PARTITION. G is a split partition of H if Cover(G,H) ∧ Cover(H,G) ∧ Split(G,H)
COVER. G covers H if Cover(G,H) ∧ ¬Cover(H,G) ∧ ¬Split(G,H)
SPLIT-COVER. G split-covers H if Cover(G,H) ∧ ¬Cover(H,G) ∧ Split(G,H)
COVERED_BY. G is covered by H if Cover(H,G) ∧ ¬Cover(G,H) ∧ ¬Split(G,H)
SPLIT-COVERED_BY. G is split-covered by H if Cover(H,G) ∧ ¬Cover(G,H) ∧ Split(G,H)
INTERSECT. G intersects H if Intersect(G,H) ∧ ¬Cover(G,H) ∧ ¬Cover(H,G) ∧ ¬Split(G,H)
SPLIT-INTERSECT. G and H non-strictly intersect if Intersect(G,H) ∧ ¬Cover(G,H) ∧ ¬Cover(H,G) ∧ Split(G,H)
DISJOINT. G and H are disjoint if ¬Intersect(G,H)
SAME covers the case when, indeed, the two granularities are the same. Such a relationship will be used further on in Section 5.
PARTITION and SPLIT-PARTITION cope with the cases in which exactly the same parts of the timeline are covered by G and H. However, in the former case (PARTITION), each granule of H is exactly constituted by a set of granules of G (e.g., days are a partition of months); on the other hand, in the latter case (SPLIT-PARTITION), there are granules of G spanning outside some granules of H (e.g., weeks and months).
COVER and SPLIT-COVER cope with the cases in which all the parts of the timeline covered by H are also covered by G (while there are parts of the timeline covered by G and not by H). However, in the former case, each granule of H is exactly constituted by a set of granules of G (e.g., days cover working-weeks); in the latter case, there are granules of G spanning outside some granules of H (e.g., weeks and working months). COVERED_BY and SPLIT-COVERED_BY cope with the symmetric cases.
INTERSECT and SPLIT-INTERSECT cope with the cases in which G and H do not cover the same parts of the timeline (COVER(G,H) and SPLIT-COVER(G,H) do not hold), but there is at least one intersection. Again, we distinguish between the cases in which the (intersecting) granules of G are fully contained (or not) in granules of H.
Finally, DISJOINT covers the cases of non-intersecting granularities (e.g., odd-days and even-days).
The relationships above can be structured along a decision tree, as shown in Figure 1. The eight relations we have identified constitute the leaves of the tree. At each level of the tree, the alternatives are mutually exclusive. The conditions equal?, intersect? and split? represent binary (yes/no) choices. The condition covering(A,B)? has four (mutually exclusive) possibilities: A covers B and B covers A; A covers B and B does not cover A; A does not cover B and B covers A; A does not cover B and B does not cover A (denoted A|B in Figure 1). We can thus conclude that, being the leaves of such a decision tree, our relationships are exhaustive and mutually exclusive.
Figure 1. Decision tree structuring the different relationships
3.3 Automatic Evaluation of Temporal
Relationships
For the sake of generality, we do not commit to
any specific language modeling user-defined
granularities. Many of such languages have
been proposed in the literature (see Section 5).
However, whichever of such languages is considered, the determination of the relationship between two user-defined granularities can be performed in an automatic way. In particular, if the two granularities are periodic (which are the ones most commonly used), the relationship can be easily determined by generating all the instances of the two granularities in a common interval, which must be as long as the least common multiple of the two periodicities, and then automatically checking which one of the relationships holds between them (exactly one relationship holds between each pair of granularities, since the relationships are exhaustive and mutually exclusive).
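The following Python sketch (our own illustration; the granularities are assumed to be already materialized as lists of granules over a common window whose length is the least common multiple of the two periods) computes the Intersect, Split, and Cover predicates and reads the resulting relationship off the decision tree of Figure 1:

def intersect(G, H):
    return any(g & h for g in G for h in H)

def split(G, H):
    return any((g & h) and not (g <= h) for g in G for h in H)

def cover(G, H):
    union_G = set().union(*G) if G else set()
    return all(h <= union_G for h in H)

def relationship(G, H):
    # G, H: lists of granules (sets of time points) generated over a common window
    if list(G) == list(H):
        return "SAME"
    if not intersect(G, H):
        return "DISJOINT"
    cGH, cHG, s = cover(G, H), cover(H, G), split(G, H)
    if cGH and cHG:
        return "SPLIT-PARTITION" if s else "PARTITION"
    if cGH:
        return "SPLIT-COVER" if s else "COVER"
    if cHG:
        return "SPLIT-COVERED_BY" if s else "COVERED_BY"
    return "SPLIT-INTERSECT" if s else "INTERSECT"

# toy 35-day window (five 7-day weeks); a working week is Monday-Friday
days = [{d} for d in range(35)]
weeks = [set(range(w * 7, w * 7 + 7)) for w in range(5)]
working_weeks = [set(range(w * 7, w * 7 + 5)) for w in range(5)]
print(relationship(days, weeks))          # PARTITION (days partition weeks)
print(relationship(days, working_weeks))  # COVER     (days cover working-weeks)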
4 TEMPORAL AGGREGATION
In our approach, temporal aggregation is based
on the above relationships, and on the
knowledge on whether data (measures) to be
aggregated are telic\atelic and complete (see
Section 4.2) or not. In other words, the below
definitions assume that, given a granularity G to
be aggregated into a granularity H (where both
granularities may be user-defined, through a
formal specification language), the following
information is available:
- The data to be aggregated are complete or
not
- The data to be aggregated are telic or atelic
- The relationship between G and H
In the following, we first point out which
aggregation operators can be used, depending
on the telic\atelic character of the data. We then
focus on the treatment of incomplete
information in the aggregation operations and
on the treatment of “splits of information”.
Finally, we wrap-up all such ingredients into a
comprehensive approach, coping with
aggregation along the temporal dimension and
considering user-defined granularities.
4.1 Aggregation Operators.
In this paper, we just focus on the most typical
aggregation operations, namely the min, max,
avg and sum operators. Min, max and avg
operators can be applied to both telic and atelic
data. On the other hand, additivity (along the
temporal dimension) is a property of telic data,
and not of atelic ones. The reason can be easily
found in the upward inheritance property that
characterize atelic data. Upward inheritance
state that if the data value is α in a time period
[i,j] and in a contiguous period [j+1,k], then it
is α in the whole period [i,k]. This fact trivially
implies that atelic data are not additive (along
the temporal dimension). For instance, if the
balance of an account is 100K in all months of
2008, we can properly infer that the account
value in the year 2008 is 100K (not additive;
the balance is atelic). Notice however, that the
telic\atelic dichotomy does not only explain the
additive\non-additive feature that characterizes
all the DW approaches in the literature, but also
has a deep impact on the treatment of data
incompleteness and of split.
4.2 Treatment of Data Incompleteness
Informally, given a granularity G, we say that a
data (measure) m is complete on G if, for each
instance g of G in the frame of time of interest,
the value of m in g is known (e.g., the balance
of the account is known for each day in the
frame of time of interest). Data incompleteness
is related only in part to the problem of
aggregating data along different temporal
granularities. As a matter of fact, data may be
incomplete for many different reasons, mostly
independent of the aggregation or granularity
conversion operations. In general, the fact that
there is no tuple holding at a given time simply
denotes the fact that nothing happened at that
time. For example, no contract may be active
on 2010. In other cases, data may be incomplete
because the information may exist in the “real
world”, but for some reason it has not been
stored. For example, for a patient in ICU, the
blood pressure is constantly monitored, and
recorded at a given rate (say, each minute). If
the DB does not contain the pressure of the
patient at a given minute, this fact does not
imply that “in the real world” there was no
pressure value for the patient. Simply, for some
reason the data has not been recorded.
However, data incompleteness interacts with
the aggregation problem in at least two
different ways. First of all, if the origin
granularity does not fully cover the task
granularity, from the point of view of
aggregation we have that some of the data (the
ones in the span of time present in the task
granularity but not in the origin one) are
missing. For example, if the origin granularity
is working-week and the task granularity is
years, values in Saturdays and Sundays are
missing, and data incompleteness must be
faced. Second, one has to decide what policy
has to be applied in case some value in the
origin granularity (that should be included in
the task one) is missing (or NULL).
For the sake of generality, we provide users
with the possibility of choosing among different
ways of coping with missing (NULL) values
during aggregation. We have identified six main
options (of course, the list is not exhaustive,
and other options may be investigated):
(i) NULLIFY: if at least one of the values to
be aggregated is NULL, then the result of
the whole aggregation is NULL;
(ii) IGNORE: NULL values are ignored; the
aggregation operators are executed
considering only the available data;
(iii) INTERPOLATE: each NULL value is
substituted by the value obtained through
the average between the temporally closest
values preceding and following it;
(iv) PREC: each NULL value is substituted by
the closest value preceding it;
(v) NEXT: each NULL value is substituted by
the closest value following it;
(vi) AVG_INTERP: each NULL value is
substituted by the value obtained through
the average between all the available
values to be aggregated with it.
EXAMPLES. Suppose we have the following
daily values, and we want to aggregate them into weeks through the sum operator:
Day1: 100 | Day2: NULL | Day3: NULL |
Day4: 150 | Day5: 100 | Day6: NULL |
Day7:100
The results obtained through the above strategies are the following:
NULLIFY: NULL
IGNORE: 450
INTERPOLATE: 100+125+125+150+100+100+100 = 800
PREC: 100+100+100+150+100+100+100 = 750
NEXT: 100+150+150+150+100+100+100 = 850
AVG_INTERP: 100+112.5+112.5+150+100+112.5+100 = 787.5
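The sketch below (our own Python illustration; the function name fill is hypothetical) implements the six options and reproduces the weekly sums of the example above:

def fill(values, strategy):
    # values: list of daily values, with None standing for NULL
    known = [v for v in values if v is not None]
    if strategy == "NULLIFY":
        return None if None in values else list(values)
    if strategy == "IGNORE":
        return known
    if strategy == "AVG_INTERP":
        avg = sum(known) / len(known)          # average of all available values
        return [avg if v is None else v for v in values]
    out = list(values)
    for i, v in enumerate(values):
        if v is None:
            prec = next((values[j] for j in range(i - 1, -1, -1) if values[j] is not None), None)
            nxt = next((values[j] for j in range(i + 1, len(values)) if values[j] is not None), None)
            if strategy == "PREC":
                out[i] = prec
            elif strategy == "NEXT":
                out[i] = nxt
            elif strategy == "INTERPOLATE":
                out[i] = (prec + nxt) / 2      # average of closest preceding and following values
    return out

days = [100, None, None, 150, 100, None, 100]
for s in ["NULLIFY", "IGNORE", "INTERPOLATE", "PREC", "NEXT", "AVG_INTERP"]:
    filled = fill(days, s)
    print(s, None if filled is None else sum(filled))
# NULLIFY None, IGNORE 450, INTERPOLATE 800.0, PREC 750, NEXT 850, AVG_INTERP 787.5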
In general, we allow users to freely choose
between all available options, but recommend
to consider the telic\atelic nature of the data to
be aggregated. In particular:
Atelic data: any of the above strategies can
be selected
Telic data: only the NULLIFY and
IGNORE strategies can be selected
The reason of the above restriction lies
in the different nature of telic and atelic data.
Telic facts are non-homogeneous facts having a specific goal, and have neither the upward nor the downward inheritance property. As a
consequence, in general, each fact is an
“individual” one, independent and unrelated to
the others. In such a context, it does not look
meaningful to evaluate the (missing) value of
one fact on the basis of other values of other
facts. For instance, the fact that I had a 100K contract in May and a 1000K contract in July does not tell me anything about whether I had a contract in June, or about the amount of that contract, if any (contracts are telic). The
behavior of atelic data is different. Atelic facts
are homogeneous facts for which both
downward and upward inheritance hold. Thus,
for such facts, a sort of “persistency” holds, so
that one can meaningfully try to approximate
the value of a fact on the basis of temporally
contiguous values. For instance, from the fact
that my bank account was 100K in May and 120K in July, I can make some reasonable guess about the amount of my account in June
(the balance of accounts is atelic).
4.3 Treatment of “Splits”
The problem of “splits” regards the evaluation
of values of data which, during aggregation,
overlaps different periods. A typical example is
the monthly aggregation of weekly data (where
weeks can stretch over two months). Once
again, we do not think there is just one solution:
the strategy we suggest is to provide users with
a suite of alternative possibilities. We propose
to consider at least the following functions:
(i) TOTAL: the whole value is attributed to
each part of the split
(ii) NONE: the value is NULL for each part of
the split
(iii) DIVIDE: the value is divided and equally
attributed to each part of the split
(iv) PROP_DIVIDE: the value is proportionally
attributed to each part of the split (of
course, such an evaluation requires the
availability of support measures).
EXAMPLES. For instance, the 14th (European-calendar) week of 2009 starts on Monday March 30th and ends on Sunday April 5th. If the value for that week is 140, we can obtain the following result for the split into months:
TOTAL: 140 is attributed to both (the last week of) March and (the first week of) April
NONE: the value 140 is neither attributed to March nor to April
DIVIDE: 70 is attributed to both (the last week of) March and (the first week of) April
PROP_DIVIDE: 40 is attributed to (the last week of) March and 100 to (the first week of) April
Notice that the above approach can be applied
not only to cope with “internal” splits (as in the
above examples), but also with “external”
splits, as exemplified below.
EXAMPLES. Consider again the 14th (European-calendar) week of 2009, but now let us consider aggregation into odd-months (so that only the data concerning March are relevant here). If the value for that week is 140, we can obtain the following result for the split:
TOTAL: 140 is attributed to the odd-month
NONE: the value 140 is not considered
DIVIDE: 70 is attributed to the odd-month
PROP_DIVIDE: 40 is attributed to the odd-month
Obviously, the application of the NONE option
causes a loss of information, while the
application of TOTAL does not preserve the
whole sum of the values.
Finally, it is worth stressing the impact of the telic\atelic dichotomy on the “split” issue. We impose that only the TOTAL function can be adopted to deal with atelic data (while all the options are possible for telic ones). This rule depends on the very nature of atelic data: since they have the downward inheritance property, their value in each subinterval is equal to the value in the whole interval. For example, if the balance of my account is 100K in the 14th week of 2009, it is 100K in each day or subpart of that week. Obviously, such an implication is not true for telic data.
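As a concrete illustration (a Python sketch with hypothetical names, not part of the framework), the four split options can be applied to the 14th week of 2009, whose value of 140 spans 2 days of March and 5 days of April:

def split_value(value, portion_sizes, strategy):
    # portion_sizes: a support measure, e.g. how many days of the granule
    # fall into each target granule; returns the value attributed to each part
    n = len(portion_sizes)
    if strategy == "TOTAL":
        return [value] * n                 # the whole value goes to every part
    if strategy == "NONE":
        return [None] * n                  # the value is dropped for every part
    if strategy == "DIVIDE":
        return [value / n] * n             # equal shares
    if strategy == "PROP_DIVIDE":
        total = sum(portion_sizes)         # shares proportional to the support measure
        return [value * s / total for s in portion_sizes]

week_value, days_per_month = 140, [2, 5]   # March 30-31 and April 1-5
for s in ["TOTAL", "NONE", "DIVIDE", "PROP_DIVIDE"]:
    print(s, split_value(week_value, days_per_month, s))
# TOTAL [140, 140], NONE [None, None], DIVIDE [70.0, 70.0], PROP_DIVIDE [40.0, 100.0]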
4.4 Temporal Aggregation: General
Approach
Table 1 in the following summarizes the
treatment we suggest for temporal aggregation,
considering user-defined granularities. The first
three columns represent the “input” of our
approach, namely the following information
about the data to be aggregated from a temporal
user-defined granularity G into a temporal user-
defined granularity G’:
(i) The first column considers whether the
(values of the) data (measures) to be
aggregated are complete or not,
(ii) The second states whether such data are
telic or atelic, and
(iii) The third states the relationship holding
between G and G’ (the case in which the
relationship SAME holds between the two
granularities is not reported in the table,
since it is not a case of temporal
aggregation - it will be useful in Section 5).
For the sake of compactness, “SPLIT-“ has
been abbreviated by “S-“ (e.g., S-
PARTITION stands for SPLIT-
PARTITION). For each case, Table 1
states the possible actions to be done.
(iv) The fourth column states whether some
treatment of incompleteness (NULL
values) in the input data has to be
performed. If so, the user may take
advantage of one of the operations
identified in section 4.2 (notice that,
depending on the telic\atelic character of
the data, different functions are suggested)
(v) The fifth column states whether a “split” of
information is necessary or not. If the value
is “yes”, one of the dedicated operations
proposed in section 4.3 can be selected by
the user, to choose the modality s\he
prefers (once again, different functions are
available, depending on the telic\atelic
character of data)
(vi) The sixth column simply states whether or not the conversion from G to G’ causes a loss of information. We think that this is important information to be given to the user, who, in case a loss of information is possible, may decide whether or not to go on with the aggregation. Notice that, in this column, the loss of information that may be caused by the application of the “Split” functions (e.g., “NONE”) has not been taken into account.
(vii) Finally, the seventh column identifies the
possible aggregation operators (in practice,
it states for which kind of data “sum” is
possible).
EXAMPLES.
Ex.1. Let us suppose a user wants to aggregate the balance of accounts (atelic facts) from weeks into months. Suppose also that such data are (at the level of weeks) complete, in the sense that there are no missing values. Since balances of accounts are atelic data, and the relationship between the granularities weeks and months is SPLIT-PARTITION, the proposed case corresponds to the 10th row of Table 1. This means that the user may apply the min, max, or avg aggregation operators, and can apply the TOTAL function to cope with the split of information. There is no loss of information due to the granularity conversion, and no need to cope with incompleteness.
Ex.2. Same as above, but the balance (atelic data) has to be aggregated from weeks into odd-months (i.e., Jan, March, May, July, Sept, Nov). The relationship between weeks and odd-months is SPLIT-COVER, so that the 12th row has to be considered. As above, the user may apply the min, max, or avg aggregation operators, and can apply the TOTAL function to cope with the split of information. There is no need to cope with incompleteness. However, in this case a warning that there is a loss of information due to the aggregation may be signaled (since data concerning “even” months are lost in the aggregation).
Ex.3. Suppose that balances (atelic data) are aggregated from working weeks (from Monday to Friday) into months, and that the input data are incomplete (NULL values are present). The corresponding row is the 30th. In such a case, besides the function TOTAL to cope with splits, our approach suggests the choice of one function to cope with incompleteness (one of NULLIFY, IGNORE, INTERPOLATE, PREC, NEXT, AVG_INTERP). There is no loss of information due to the aggregation.
Table 1. Temporal aggregation (summary). The column “Loss of info” does not take into consideration the loss of
information possibly caused by the treatment of “Splits”.
Complete\Incomplete  Telic\Atelic  Relationship  Treat. of Incomp.  Treat. of Split  Loss of Info(*)  Operators (temporal dim.)
Complete Telic PARTITION no no no min,max,avg,sum
Complete Telic S-PARTITION no yes no min,max,avg,sum
Complete Telic COVER no no yes min,max,avg,sum
Complete Telic S-COVER no yes yes min,max,avg,sum
Complete Telic COVERED_BY yes no no min,max,avg,sum
Complete Telic S-COVERED_BY yes yes no min,max,avg,sum
Complete Telic INTERSECTS yes no yes min,max,avg,sum
Complete Telic S-INTERSECTS yes yes yes min,max,avg,sum
Complete Atelic PARTITION no no no min,max,avg
Complete Atelic S-PARTITION no yes no min,max,avg
Complete Atelic COVER no no yes min,max,avg
Complete Atelic S-COVER no yes yes min,max,avg
Complete Atelic COVERED_BY yes no no min,max,avg
Complete Atelic S-COVERED_BY yes yes no min,max,avg
Complete Atelic INTERSECTS yes no yes min,max,avg
Complete Atelic S-INTERSECTS yes yes yes min,max,avg
Incomplete Telic PARTITION yes no no min,max,avg,sum
Incomplete Telic S-PARTITION yes yes no min,max,avg,sum
Incomplete Telic COVER yes no yes min,max,avg,sum
Incomplete Telic S-COVER yes yes yes min,max,avg,sum
Incomplete Telic COVERED_BY yes no no min,max,avg,sum
Incomplete Telic S-COVERED_BY yes yes no min,max,avg,sum
Incomplete Telic INTERSECTS yes no yes min,max,avg,sum
Incomplete Telic S-INTERSECTS yes yes yes min,max,avg,sum
Incomplete Atelic PARTITION yes no no min,max,avg
Incomplete Atelic S-PARTITION yes yes no min,max,avg
Incomplete Atelic COVER yes no yes min,max,avg
Incomplete Atelic S-COVER yes yes yes min,max,avg
Incomplete Atelic COVERED_BY yes no no min,max,avg
Incomplete Atelic S-COVERED_BY yes yes no min,max,avg
Incomplete Atelic INTERSECTS yes no yes min,max,avg
Incomplete Atelic S-INTERSECTS yes yes yes min,max,avg
Ex.4. Suppose contracts (telic data) are
available at the granularity of academic
trimesters (e.g., Nov-Jan, Feb-April, May-July),
and must be aggregated into years. Suppose
also that data are complete (i.e., no NULL
value in academic trimesters). The relationship
between academic trimesters and years is S-
COVERED_BY, thus the corresponding entry
is the 6th row of Table 1. NULLIFY and
IGNORE are suggested to cope with data
incompleteness (data incompleteness arises
because academic trimesters neither partition
nor cover years, so that data corresponding to
August, September, and October are not
available), and one of NONE, DIVIDE, and
DIVIDE_PROP is suggested to cope with the split
of information (from Nov-Jan into years). No loss of
information is caused by this aggregation.
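Table 1 can also be read operationally as a lookup from (completeness, telic/atelic nature, relationship between granularities) to the suggested treatments and operators. The following Python sketch encodes this reading; the encoding and function names are illustrative assumptions, not part of the original proposal.

```python
# Illustrative sketch: Table 1 encoded as a small lookup that, given the
# properties of the data and of the granularity relationship, tells whether a
# treatment of incompleteness / splits is needed, whether information is lost,
# and which aggregation operators apply. The encoding is an assumption made
# only for this example.

SPLIT_RELS = {"S-PARTITION", "S-COVER", "S-COVERED_BY", "S-INTERSECTS"}
LOSSY_RELS = {"COVER", "S-COVER", "INTERSECTS", "S-INTERSECTS"}
INCOMPLETE_RELS = {"COVERED_BY", "S-COVERED_BY", "INTERSECTS", "S-INTERSECTS"}

def suggest(complete: bool, telic: bool, relationship: str) -> dict:
    """Return the Table 1 entry for the given combination."""
    operators = ["min", "max", "avg"] + (["sum"] if telic else [])
    return {
        "treat_incompleteness": (not complete) or relationship in INCOMPLETE_RELS,
        "treat_split": relationship in SPLIT_RELS,
        "loss_of_info": relationship in LOSSY_RELS,
        "operators": operators,
    }

# Ex.4: complete telic data, academic trimesters S-COVERED_BY years (6th row).
print(suggest(complete=True, telic=True, relationship="S-COVERED_BY"))
# -> incompleteness and split must be handled, no loss of info, min/max/avg/sum
```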
5 CONCLUSIONS AND RELATED WORK
We propose a general methodology to cope
with the aggregation of data (measure values)
along different user-defined granularities in the
DW context. Specifically, we propose a new
methodology which, given any pair of
(possibly) user-defined temporal granularities
G_origin and G_task, and given a measure to
be aggregated (from G_origin to G_task), and
considering the temporal relationships between
the two granularities and some features of the data
(e.g., their telic vs. atelic nature), supports
the user with a set of proper aggregation
functions/operations.
The solution we propose builds on our
previous work in [20]. We have identified three
core phenomena in temporal data
conversion/aggregation between different time
granularities (loss of information, lack of
information, split of information), and we have
pointed out the basic relationships between
granularities which characterize such
phenomena. Such relationships are exhaustive
and mutually exclusive, so that, whichever two
temporal granularities have been (or will be)
defined by users, exactly one of the
relationships holds between them. We provide a
set of application-independent functions to cope
with lack of information (missing data) and
split of information, and then, depending on the
relationship between the two granularities
considered in the conversion/aggregation, and
on the telic vs. atelic nature of the data, we state
which functions can be chosen by users. In
particular, we extend the work in [20] (i) by
proposing a general and integrated treatment of
all the above issues, including the challenging
problem of coping with lack of data (see, in
particular, Table 1), and (ii) by specifically
addressing the problems and challenges arising
in the area of Data Warehousing. To the best of
our knowledge, there is no other approach in
the literature integrating the three core issues of
our approach: temporal granularities, temporal
aggregation, and DWs. We briefly review
these areas of research separately in the
following.
DW approaches usually support the
aggregation of data (measure values) along the
temporal dimension. However, to the best of
our knowledge, only a few of such approaches
have faced this problem by considering
different user-defined granularities. For
example, Chamoni and Stock [21] have devised
a temporal extension of OLAP to model the
evolution of dimensions, identifying several
temporal structures, but not taking into account
different user-defined granularities. Recently,
Malinowski and Zimanyi [11; 5] have proposed
a new conceptual model for DWs. The model
provides temporal support for concepts,
attributes, hierarchies, and measures, as well as
aggregation of measure values along the
temporal dimension (in which different
temporal granularities are supported). Some
complex cases of temporal aggregation are
discussed, considering situations such as "lack of
information" and "split of information".
However, they did not propose any general
solution to such cases, simply pointing out that
"mapping between different time granules must
be defined explicitly" [11].
On the basis of the formal definition of
temporal granularity proposed in the Database
glossary [18] (see also Section 3), several
primitive relationships between granularities
have been identified in [18]. Several
researchers have noticed that, in many
application domains, providing a standard set of
pre-defined granularities is not satisfactory, and
user-defined granularities must be supported
(see, e.g., [18; 6]). Starting from the pioneering
work by Clifford and Rao [22], several approaches
have been defined in the areas of temporal
Databases and Artificial Intelligence to cope
with user-defined and periodic granularities
[18; 6; 23], and with the closely related notion
of periodicity [24].
When dealing with multiple granularities, a
main problem concerns the conversion of values
when switching from one granularity to a
different one. Many approaches simply
require users or system designers to
explicitly define customized functions
performing each one of the conversions
in an ad-hoc way (see, e.g., (Dyreson, 1994)).
Bettini et al. [6] have proposed the adoption of
semantic assumptions or coercion functions to
achieve such a goal. Specifically, they
implicitly model the semantics of the
association of tuples with time by pairing
temporal tables with semantic assumptions.
Notably, such assumptions are associated with
tables once and for all by the DB designer.
Such a fixed association does not seem applicable in
the DW context, where one also has to support
the fact that users are free to choose among
different aggregation options when rolling up
along a temporal dimension. In this context, the
flexibility granted by our approach seems more
appropriate.
In temporal relational aggregation,
tuples can be grouped according to their
timestamp values. Then, an aggregation
function is applied to each group, leading to a
temporal relation. Starting already from Ben-
Zvi’s temporal relational model [25], temporal
query languages have been extended to include
aggregation functions. As described in the
survey by Gamper et al., quite different types of
grouping operations have been pointed out [26].
[7] provides an in-depth survey of temporal
(and spatial) aggregation. In particular, Vega
Lopez et al. identify a uniform framework
supporting the analysis and comparison of
different types of aggregation. The research
about temporal aggregation has also proposed a
wide range of computationally efficient
algorithms to evaluate temporal aggregation.
Starting from Tuma’s work [27], different tree-
based algorithms have been proposed in the
literature, including the aggregation tree
algorithm [28], the balanced-tree algorithm [29],
and the SB-tree algorithm [30]. Recent approaches
have also focused on temporal aggregation
within data streams [31]. On the other hand, to
the best of our knowledge, such an area of
research has not investigated the conversion
and aggregation of measure values along
different (user-defined) granularities. For
instance, Vega Lopez et al. [7] delegate all
measure conversions to a (not well-specified)
cast function which should perform all
mappings. To the best of our knowledge, no
approach in this field has systematically faced
the issue of supporting users in the choice of
aggregation functions, depending on the
telic\atelic nature of measures and on the
relationship holding between the origin and task
granularities. Thus, our approach can be
conceived as orthogonal to the ones above.
In our future work, we plan to integrate
our approach with one of the above, to achieve
a comprehensive approach that also copes with
subproblem (ii) mentioned in the introduction.
Lenz and Shoshani [32] have studied the
conditions to get a correct summarization of
OLAP and statistical databases (summarization
roughly corresponds to aggregation).
Interestingly, the summarizability conditions
they have identified roughly correspond to the
temporal problems we have identified in our
paper: loss of information, lack of information,
split of information. Informally speaking, and
using our terminology, Lenz and Shoshani’s
Condition One requires that there must not be
any split of data. Condition Two requires no
lack and no loss of information. Condition
Three resembles our requirement that
operations must be consistent with respect to
the nature of data. However, Lenz and Shoshani
pointed out their conditions in order to identify
non-problematic cases for summarization, to
focus only on such cases. The approach we
have followed in this paper is completely
different in this respect. Indeed, the goal of our
work is to support temporal aggregation along
user-defined temporal granularities, allowing
users to switch from any granularity to any
other one, independently of whether the switch
is “problematic” or not. In our context, split,
loss and lack of information are not a
consequence of some design error, or of some
mistake by users. They are just a natural
implication of having different granularities to
switch from/to.
ACKNOWLEDGEMENTS.
This work has been partially supported by
Compagnia San Paolo, GINSENG project.
REFERENCES
[1] W. H. Inmon (2002). Building the Data Warehouse,
John Wiley & Sons Publishers.
[2] R. T. Snodgrass et al., (1995). The TSQL2
Temporal Query Language, Kluwer.
[3] Y. Wu, S. Jajodia, and X. Sean Wang (1999).
Temporal Database Bibliography Update. Temporal
Databases—Research and Practice, Springer-Verlag,
LNCS 1399. 338–367.
[4] L. Liu and M. Tamer Özsu (2009). Encyclopedia of
Database Systems, Springer.
[5] E. Malinowski and E. Zimanyi (2009). Advanced
Data Warehouse Design. From Conventional to
Spatial and Temporal Applications. Springer Verlag.
[6] C. Bettini, S. Jajodia, and X.S. Wang (2000). Time
granularities in Databases, Data Mining, and
Temporal Reasoning. Springer.
[7] I.F. Vega Lopez, R.T. Snodgrass, and B. Moon
(2005). Spatiotemporal Aggregate Computation: A
Survey. IEEE TKDE, 17(2). 271–286.
[8] V. Khatri, R.T. Snodgrass, and P. Terenziani (2009).
Telic distinction in temporal databases. In [4].
[9] R. Kimball (1996). Slowly Changing Dimensions.
DBMS Magazine 9(4).
[10] C. S. Jensen and C. E. Dyreson (1998). A Consensus
Glossary of Temporal Database Concepts, in
Temporal Databases: Research and Practice, O.
Etzion, S. Jajodia and S. Sripada (eds), Springer-
Verlag. 367–405.
[11] E. Malinowski and E. Zimanyi (2008). A conceptual
model for temporal data warehouses and its
transformation to the ER and object-relational
models. Data & Knowledge Engineering 64, 101-
133.
[12] P. Terenziani, and R.T. Snodgrass (2004).
Reconciling Point-based and Interval-based
Semantics in Temporal Relational Databases: A
Proper Treatment of the Telic/Atelic Distinction.
IEEE TKDE, 16(4). 540-551.
[13] Aristotle. The Categories, on Interpretation. Prior
Analytics. Cambridge, Harvard University Press.
[14] Z. Vendler (1967). Verbs and times. Linguistics in
Philosophy, Cornell Univ. Press, NY, 97–121.
[15] D. Dowty (1986). The Effects of the Aspectual Class
on the Temporal Structure of Discourse, Tense and
Aspect in Discourse, Linguistics and Philosophy
9(1). 37–61.
[16] Y. Shoham (1987). Temporal logics in AI:
semantical and ontological considerations. Artificial
Intelligence, 33(1). 89-104.
[17] C. Bettini, C.E. Dyreson, W.S. Evans, R.T.
Snodgrass, and X. Wang (1997). A glossary of time
granularity concepts. Temporal Databases, Dagstuhl,
Springer, 406–413.
[18] C. Bettini (2009). Temporal granularities. In [4].
[19] M. Soo and R. Snodgrass (1993). Multiple Calendar
Support for Conventional Database Management
Systems. Proc. Int'l Workshop on Infrastructure for
Temporal Databases.
[20] P. Terenziani (2012). Temporal aggregation on user-
defined granularities. Journal of Intelligent
Information Systems, 38(3), pp 785-813.
[21] P. Chamoni and S. Stock (1999). Temporal
Structures in Data Warehousing. Proc. DaWaK’99,
LNCS 1676, 353-358.
[22] J. Clifford and A. Rao (1987). A simple, general
structure for temporal domains. In Proc. Conf. on
Temporal Aspects in Information Systems, 23–30.
[23] C. Combi, M. Franceschet, A. Peron (2004).
Representing and reasoning about temporal
granularities, J. Logic and Computation 14 (1), 1–77.
[24] P. Terenziani (2009). Temporal periodicity. In [4].
[25] J. Ben-Zvi (1982). The Time Relational Model. PhD
Thesis, Computer Science Department, UCLA.
[26] J. Gamper, M. Bohlen, and C. Jensen (2009).
Temporal aggregation. In [4].
[27] P.A. Tuma (1992). Implementing Historical
Aggregates in TempIS. PhD thesis, Wayne State
University, USA.
[28] M. Kline and R.T. Snodgrass (1995). Computing
Temporal Aggregates. In Proc. Internat. Conf. on
Data Engineering, 222-231.
[29] B. Moon, I.F. Vega Lopez, and V. Immanuel (2003).
Efficient algorithms for large-scale temporal
aggregation. IEEE TKDE 15(3). 744-759.
[30] J. Yang and J. Widom (2003). Incremental
computation and maintenance of temporal
aggregates. VLDB Journal 12(3). 262-283.
[31] D. Zhang, D. Gunopulos, V.J. Tsotras, and B. Seeger
(2003). Temporal and spatio-temporal aggregations
over data streams using multiple time granularities.
Information Systems 28. 61-84.
[32] H.-J. Lenz and A. Shoshani (1997). Summarizability in
OLAP and statistical data bases. Proceedings of the Ninth
International Conference on Scientific and Statistical
Database Management, 132-143.
Optimal Health Care Inventory Management Using Analytics
Neeraj Agrawal Prakarsh Paritosh Ashish Paralikar Dibyajyoti Pati
General Electric
JFWTC, 122 EPIP Bangalore, India 560066
neeraj.agrawal@ge.com
ABSTRACT
Inventory Management is an important issue for
health-care systems because it influences clinical
and financial decisions. Before selecting, adapt-
ing and implementing a process for Inventory man-
agement, it is important that the various factors af-
fecting Inventory management are carefully consid-
ered. The objective of this paper is twofold. First,
it proposes an analytic model for hospital inventory
management which is able to predict the future
demand for various inventory commodities. The
model takes into account previous demand,
population, geographic location, and other factors
to successfully predict future demand. Second, the
paper suggests an optimization model that minimizes
the cost involved in supply chain and logistics
management so that the required commodities can be
made available to the hospitals at the minimum
possible cost. A web-based portal is proposed to help
hospitals optimally distribute future inventory
requirements among the different vendors.
KEYWORDS
Health-care Information System, Analytic Modeling,
Supply Chain Management, Inventory Optimization,
Hospital Vendor Portal
1 INTRODUCTION
For the optimal use of hospital resources and
the consequent need to restore budgets,
management is responsible for finding solutions
to achieve more operational efficiency in
hospital processes [1]. Some of the key areas of
hospital management are controlling the
inventory of medical gases, drugs, and medical
supplies. According to classical inventory
models, optimum inventory levels are
maintained with the objective of minimizing
the sum of the costs of overstocking and
understocking. It is observed that, in the case of
hospital inventories, the concepts of holding
and ordering cost are not very suitable [4].
Awareness of logistics management is becom-
ing more widespread and many initiatives and
studies dealing with supply chain integration
have been undertaken [1]. However, the internal
supply chain remains the weak link in process
integration and optimization [1]. The lack of a
systemic approach to internal supply chain
management is reflected in huge materials
management costs and low service quality
delivered to patients [1]. In the current scenario
of increasing health care costs, inventory
must be optimized without sacrificing the
level of service provided.
Figure 1. Proposed Inventory Management Model.
Good inventory management is essential to the
successful operation of any health care organi-
zation. One of the most important considerations
is the proportion of the organization's budget
that is spent on inventory. Although
the amounts and dollar values of the invento-
ries carried by different types of health care
providers vary widely, in a typical hospital’s
budget 25 to 30 percent goes for medical sup-
plies and their handling. Because the inventory
of medical supplies may comprise a signifi-
cant portion of a health care organization’s to-
tal assets, reducing its inventories significantly
raises its return on investment, and hence its
position in the financial markets [2].
Figure 2. Different stages of the proposed inventory model: analyze demand and usage from
hospital inventory data; develop an analytical model to predict future demand; optimally divide
the demand among different vendors; and provide a web-based portal to display and order
inventory from the different vendors.
Since the revenues of hospitals are impacted by
the carrying costs and ordering costs of phar-
macy, medical-surgical supplies, and medical
gases, it is important to create a mathematical
model to minimize the total cost while
simultaneously ensuring that the commodities are
available on time [4]. The proposed inventory
management model and its stages are shown in
Figure 1 and Figure 2, respectively. The objective
of this paper is to come up with a predictive
analytical model to optimize the inventory of
pharmacy, medical-surgical supplies, and medical
gases from the database available in the hospital
information system. For this purpose, the usage
patterns of various commodities were analyzed
and a model was designed to predict the usage for
the upcoming month and to give an economical
solution for choosing among available vendors,
keeping in mind their capacity and storage
capabilities. The model has been designed using
various parameters such as previous demand for
surgical supplies, oxygen and other anesthetic
gases, geographical location, population density
and population distribution of the area, and
seasonal variation for different hospitals. A
web-based interface was developed for different
hospitals to display the optimal future demand of
inventory. This portal will also aid in placing
orders with the different available vendors.
2 FACTORS TO BE CONSIDERED FOR
DESIGNING ANALYTIC MODEL
The present system of inventory management
suffers from several shortcomings that result in
inefficiency.
Availability of consumables on time is of
utmost importance in the health-care industry,
where a delay of a few seconds can cost a
life. The challenge is even greater as the
number of expected patients is unpredictable,
suppliers are unreliable, and costs are rising [3].
The analytic model should be able to reduce
these uncertainties and make consumables
available on time.
The cost of medical supplies has been spiraling
up, and greater numbers of patients are
demanding high-quality and reasonably priced
health-care services. A state-of-the-art
inventory management analytic model should
automate the inventory reconciliation and
visibility process. This would reduce the amount
of work performed by hospital staff to maintain
correct inventory levels, in contrast to an archaic
inventory management system.
Figure 3. The overall ranking of top 25 drugs from the HDU across all years with the inclusion of Methotrexate (MTX)
and Tacrolimus (TAC).
Inventory standards can be measured by
the ability to deliver utmost quality ser-
vice to the patients. It can be measured
in terms of services offered, price charged
and perceived value [3]. A hospital will
have to make decisions regarding the stor-
age of supplies or direct delivery at the
point of use. This involves the implementation
of CPFR (Collaborative Planning, Forecasting
and Replenishment) among hospitals, suppliers,
and manufacturers of medical supplies.
Good inventory management also depends on
the warehouse's location, the production level
of goods, etc. It also involves finalizing the
distribution network between suppliers and
hospitals to minimize transportation and
inventory costs. Hence, optimization of the
distribution network between suppliers and
hospitals is needed for an efficient inventory
management system.
3 INVENTORY MANAGEMENT
MODEL
To accommodate the factors explained in Sec-
tion 2, analytics has been used to predict the
future requirements well in advance. The predicted
inventory requirement is then distributed among
available vendors using an optimization model.
3.1 Medical Inventory Analytic Model
For designing the analytical model, the following
features from the hospital database were taken
into consideration:
Previous demand for various pharmacy items
such as surgical supplies, oxygen, and other
anesthetic gases
Geographical location of the hospital
Population density and population distribution
of the area
Month of the year
Number of surgeries and patients admitted in a
year
Figure 4. Trends in the percentage of persons using pre-
scription drugs: United States, 1999-2008
For model design, the results of various surveys
relating the demand for inventory commodities
to various factors were studied and analyzed.
For instance, the graph in Figure 3 shows the
ranking of the top 25 drugs based on their usage
[5]. It can be clearly seen from the graph that
usage shows a linear relationship with year.
Figure 4 shows trends in the percentage of
persons using prescription drugs; this graph
also shows a linear increase in usage with
time [6].
The trends of various other hospital drugs,
commodities, and anesthetic gases were plotted
against time for different geographical locations.
It was seen that the usage pattern of the majority
of hospital commodities varies linearly with
time. The usage of various hospital drugs also
varied linearly with the logarithm of population
growth. Hence, a regression model can be
developed on the hospital database after
clustering. Clustering can be done to group
regions of different population density and to
cluster data depending on geographical location,
as sketched below.
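As an illustration of this step, the following Python sketch clusters hospitals by population density and location and then fits a separate regression per cluster. The data values, column names, and the use of scikit-learn are hypothetical assumptions for the example, not the paper's actual pipeline.

```python
# Illustrative sketch: cluster hospitals by population density and geographical
# coordinates, then fit one regression model per cluster. The toy data frame
# and its column names are assumptions for illustration only.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Hypothetical hospital records: location, population density, and yearly demand.
df = pd.DataFrame({
    "latitude":    [12.97, 12.99, 28.61, 28.70, 19.07, 19.10],
    "longitude":   [77.59, 77.60, 77.20, 77.10, 72.87, 72.88],
    "pop_density": [11000, 10500, 11300, 9800, 20000, 21000],
    "year":        [2013, 2014, 2013, 2014, 2013, 2014],
    "demand":      [410, 445, 390, 420, 610, 655],
})

# Group hospitals into regions with similar density and location.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(df[["latitude", "longitude", "pop_density"]])

# Fit a separate linear regression of demand on year within each cluster.
for cluster_id, group in df.groupby("cluster"):
    model = LinearRegression().fit(group[["year"]], group["demand"])
    print(f"cluster {cluster_id}: demand slope per year = {model.coef_[0]:.1f}")
```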
A random sample of 500 observations was
created to predict the demand for different
gases, and this set was used to train the model.
The sample data set was created for a region
with low population density. The attributes
used in this data set are the year and the
corresponding gas demand in the last month.
The results of the regression model developed
for predicting oxygen demand are shown in
Figure 5.
Figure 5. Summary of regression model developed to
predict the demand of oxygen gases.
Different regression models were built for
different commodities and different gases. The
generalized equation for commodity prediction
is given in (1):

PredictedValue = A × LastMonthDemand + B × Year + C × Month + D × log(Population)    (1)
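A minimal sketch of how the coefficients A, B, C, and D of (1) could be estimated by multiple linear regression is given below. The sample data and the library choice (scikit-learn) are assumptions made only for illustration, not the paper's actual model.

```python
# Illustrative sketch of equation (1): fit the coefficients A, B, C, D by
# multiple linear regression on (assumed) historical records, then predict
# next month's demand. Data values are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression

# Features per record: last month's demand, year, month, log(population).
X = np.array([
    [400, 2013, 1, np.log(950_000)],
    [410, 2013, 2, np.log(952_000)],
    [420, 2013, 3, np.log(954_000)],
    [405, 2013, 4, np.log(956_000)],
    [430, 2014, 1, np.log(970_000)],
    [445, 2014, 2, np.log(972_000)],
])
y = np.array([410, 420, 405, 415, 445, 460])  # observed demand in the target month

# Equation (1) has no intercept term, so fit without one.
model = LinearRegression(fit_intercept=False).fit(X, y)
A, B, C, D = model.coef_
print(f"A={A:.3f}, B={B:.3f}, C={C:.3f}, D={D:.3f}")

# Predict March 2014 demand, given February 2014's demand of 460 units.
next_month = np.array([[460, 2014, 3, np.log(974_000)]])
print("predicted demand:", model.predict(next_month)[0])
```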
3.2 Medical Inventory Optimization Solution
The primary function of the medical inventory
optimization solution is to allow hospitals to
effectively fulfill demand and to identify how to
gain additional profit. Improved efficiency
through effective resource management and
optimization leads to an increase in service
level and improved performance against
patients' requests.
Figure 6. Stages of the optimization model for inventory distribution among vendors for a
hospital: manufacturers supply distribution centers (Distributor 1, Distributor 2, Distributor 3),
which in turn supply the hospitals.
Hence, an important aspect of efficient inven-
tory management requires that the forecast de-
mand of hospital commodities is optimally dis-
tributed among the different distribution cen-
ters so that the cost incurred by the hospitals is
minimum.
The hospital medical optimization model can be
broadly divided into three stages. The stages
are shown in Figure 6.
The members participating in the model are n
manufacturers as M1, M2, M3, . . . , Mn and i
distribution centers as D1, D2, D3, . . . , Di for
j products as P1, P2, P3, . . . , Pj and k trans-
portation costs as T1, T2, T3, . . . , Tk. The
value of each product is V1, V2, V3, . . . , Vj.
The database holds the information about the
stock levels as Q1, Q2, Q3, . . . , Qj of each
product and lead time of products at each level.
The lower and upper bounds for each product
are calculated based on the regression analysis
model explained in Section 3.1.
The optimization is used to predict the
emerging excess or shortage of stock levels,
which is vital information to be maintained in
the future to minimize the cost and maximize
the availability. A multi-echelon model has been
used as a building block in the formulation of
the entire model; it considers uncertainty,
non-stationary demand, and the lead times of
the products. The evaluation function is
determined for each randomly generated
individual.
The cost function is given by (2):

Cost = Σ_D Σ_i (T_i + V_i) × Q_i    (2)

The constraints for the optimization model are given by (3):

Q_min ≤ Q_i ≤ Q_max    (3)
Evolutionary Algorithms (EAs) such as the
Genetic Algorithm (GA) are applied to the above
cost equation to minimize the cost by distributing
the demand across the various distributors (a
sketch is given below). Given the non-stationary
nature of the demand process, there may be more
inventory left over at the end of a period than is
desired in the next period, because it is assumed
that inventory can be reduced only through
demand. The minimization takes into account the
interdependence of decisions between time
periods through the use of cumulative supply
variables and demand random variables.
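The following is a minimal sketch of such an evolutionary search: a tiny genetic algorithm that splits a predicted demand among three distributors so as to minimize the cost of (2) while respecting the bounds of (3). The costs, bounds, demand value, and GA parameters are hypothetical assumptions, not the authors' implementation.

```python
# Illustrative GA sketch: allocate a predicted demand across distributors,
# minimizing total (transport + product) cost with bounded allocations.
import random

random.seed(0)

DEMAND = 500                                  # predicted demand from Section 3.1
UNIT_COST = [7.0, 6.5, 8.2]                   # T_i + V_i per distributor (assumed)
Q_MIN, Q_MAX = [50, 50, 50], [300, 250, 400]  # bounds per distributor (assumed)

def cost(q):
    """Equation (2) plus a penalty if the allocation misses the predicted demand."""
    base = sum(c * qi for c, qi in zip(UNIT_COST, q))
    return base + 1000.0 * abs(sum(q) - DEMAND)

def random_individual():
    return [random.randint(lo, hi) for lo, hi in zip(Q_MIN, Q_MAX)]

def mutate(q):
    i = random.randrange(len(q))
    q = list(q)
    q[i] = min(Q_MAX[i], max(Q_MIN[i], q[i] + random.randint(-20, 20)))
    return q

def crossover(a, b):
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

population = [random_individual() for _ in range(50)]
for _ in range(200):
    population.sort(key=cost)
    parents = population[:10]                 # simple truncation selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(40)]
    population = parents + children

best = min(population, key=cost)
print("allocation per distributor:", best, "cost:", cost(best))
```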
Figure 7. Authentication page for hospital and vendors.
3.3 A Web Based Portal Design
A web-based portal is designed to give hospitals
easy access for ordering their inventory as
suggested by the optimization model. The portal
requires authentication from hospitals and
vendors, as shown in Figure 7.
After authentication, hospitals can see on the
portal the future demand for various
commodities, as predicted by the analytical
model and optimally distributed among vendors
by the optimization model running on the
hospital server. The predicted demand for
various gases is shown in Figure 8.
When a vendor logs into their account using the
portal, the same demand is shown to the vendor.
Vendors have the option to either accept or
decline the requirement from hospitals. If a
vendor declines the request, the optimization
model automatically calculates the next best
way to distribute the demands. The vendor page
to accept or decline the request is shown in
Figure 9.
4 CONCLUSION
The analytic model that was developed provides
a solution for how much of each inventory
commodity needs to be ordered next month. The
output from the analytic model makes it possible
to eliminate orders for commodities whose
demand is not expected in the near future.
Figure 8. Next month's demand for different gases predicted by the model.
Figure 9. Vendor's page to decline or accept the hospital demand.
The objective function and the
constraints used to define the parameters for
the optimization model found the optimal
solution by distributing the demand among
different vendors, taking into account their
distance from the respective hospitals, cost of
supply, and storage capacities, such that the cost
incurred by the hospital is minimized. Finally, a
web-based portal has been designed and has
proved to be an effective and fast tool for
ordering optimum inventory. Implemented with
accurate and real-time data, this model can give
a health care system an opportunity to reduce
its inventory surplus and allow it to track what
the optimal stock levels for different inventory
commodities should be. Reduction in storage
room and optimization of ordered items provide
more visibility in supply chain management and
allow a health care system to be more responsive
in its inventory ordering. The lead time to locate
items is reduced by the reduction in storage
rooms, and the storage rooms that are freed of
inventory can be used for other hospital
purposes.
REFERENCES
[1] Raffaele Iannone, Alfredo Lambiase, Salvatore Mi-
randa, Stefano Riemma and Debora Sarno. ”Mod-
elling Hospital Materials Management Processes.”
International Journal of Engineering Business
Management, 2013, 5:15. doi: 10.5772/56607
[2] Varghese, Vijith, Manuel Rossetti, Edward Pohl,
Server Apras, and Douglas Marek. ”Applying Ac-
tual Usage Inventory Management Best Practice in
a Health Care Supply Chain.” International Journal
of Supply Chain Management 1, no. 2 (2012).
[3] Sarda, Anil N., and Yogesh J. Gharpure. "Trends
in Inventory Management in Health Care Sector in
India: Issues & Challenges."
[4] Kyong Kim, Yusuf Ahmed, Alper Murat, and
Dean Pichette, "Inventory Management and Storage
Optimization for Health Care System".
[5] Jayaraman B and Barrett JS, ”A SAS-based Query
and Exploitation Interface to Hospital Drug Utiliza-
tion Data.” ACCP Annual Meeting, 2008.
[6] Gu, Qiuping, Charles F. Dillon, and Vicki L. Burt.
”Prescription drug use continues to increase: US
prescription drug data for 2007-2008.” NCHS data
brief 42 (2010): 1-8.