
Proceedings of the International Conference on Database, Data Warehouse, Data Mining and Big Data (DDDMBD2015), Jakarta, Indonesia 2015

Conference Title
The International Conference on Database, Data Warehouse,
Data Mining and Big Data (DDDMBD2015)
Conference Dates
September 10-12, 2015
Conference Venue
Surya University
Tangerang, Indonesia
ISBN
978-1-941968-20-8 ©2015 SDIWC
Published by
The Society of Digital Information and Wireless
Communications (SDIWC)
Wilmington, New Castle, DE 19801, USA
www.sdiwc.net
DDDMBD2015 Table of Contents
Partitioning Technology and Fast Content Movements of Big Data (p. 1)
Mining Patterns with Attribute Oriented Induction (p. 11)
Supporting Aggregation in Data Warehousing Considering User-Defined Temporal Granularities (p. 22)
Optimal Health Care Inventory Management Using Analytics (p. 37)
Partitioning Technology and Fast Content Movements of Big Data
TE-YUAN LIN, CHIOU-SHANN FUH
Computer Science and Information Engineering, National Taiwan University
8F., No.7, Songren Rd., Xinyi Dist., Taipei City 110, Taiwan (R.O.C.)
D03922002@ntu.edu.tw, fuh@csie.ntu.edu.tw
ABSTRACT
Database storage that holds abundant data is usually accompanied by slow query and data-manipulation performance. This paper presents a model and methodology for faster manipulation (insert/delete) of massive numbers of data rows stored in a big table. It describes how large data sets can be moved into and out of another logical table with outstanding efficiency compared with the traditional transactional approach. To support this idea, the table structure must be redesigned to accommodate and keep the data; in other words, the table must be "partitioned".
The paper also covers partitioning strategies applied to various scenarios, such as the data sliding window, data archiving, and partition consolidation and movement.
KEYWORDS
big data, partition table, partitioning, big table,
MapReduce
1 INTRODUCTION
1.1 Motivation
The term "Big Data" in the software and computer-science field refers to a collection of datasets that grows so large and complex that it becomes difficult to process with traditional relational database management tools or processing applications. In fact, "Big Data" is one of the unstoppable IT trends within the Cloud Computing topic. Nowadays, most applied Big Data topics derive from MapReduce [1]. However, if we shift our focus from search and data analysis to real-time data, the shortcomings of MapReduce become unavoidable, given the nature of its designed architecture. First, MapReduce is good at batched, offline big-data processing but is not suitable for real-time transactional jobs. Second, MapReduce computes results by brute force instead of using indexed data. Third, and most importantly, for many commercial users, SQL (Structured Query Language)-like data and relational databases remain the majority. Once we return to the traditional usage of RDBMS (Relational Database Management Systems), the usual slowness of big-data manipulation reappears. A question came to our minds: Is it possible to insert or delete a huge dataset in almost no time? Is it possible to leverage technical innovations on the current database platform to minimize maintenance time and maximize efficiency for an enterprise-level database? How can all of this work automatically? We therefore decided to design a mechanism that fulfills these requirements.
1.2 Database Platforms
To explain the idea consistently and thoroughly in this paper, Microsoft SQL Server (although the approach is not limited to it) is chosen as the platform for the following samples and details. The hierarchy of database logical structures is shown in Figure 1.
Figure 1: SQL Server Logical Structure.
As described in Figure 1, an instance of the
database engine is a copy of the SQL Server
executable that runs as an operating system
service. Each instance manages several system
databases and one or more user databases. Each
computer can run multiple instances of the
database engine. Applications connect to the
instance in order to perform work in a database
managed by the instance [2].
SQL Server maps a database onto a set of operating-system files that fall into two main types, data and log, with three file extensions: .mdf and .ndf for data, and .ldf for log. Data and log information are never mixed in the same file, and individual files are used by only one database. Each database has one set of data spaces, which can be a filegroup, a partition scheme, or a filestream directory [3].
Filegroups are named collections of files and
are used to help with data placement and
administrative tasks such as backup and restore
operations. Partition schemes in the database
can map the partitions of a partitioned table or
index to filegroups. The filestream directory is
used for integrating with unstructured data such
as text documents, images, and videos that are
often stored outside the database, separate from
its structured data. The data of a table is physically stored in one or more files, but its logical container is not a file; it is a filegroup. A logical container "owns" the logical objects it contains, and a logical object "belongs" to the logical container that holds it. Filegroups own files and tables, so tables and files belong to filegroups. Partition schemes own partitioned tables, so partitioned tables belong to partition schemes, and if a partition scheme exists, one or more filegroups belong to it. Some examples:
- A primary filegroup (which always holds the system tables) and a default filegroup must exist. The primary and default filegroups can be the same filegroup or separate filegroups.
- A filegroup may have more than one file.
- A partition scheme can be defined on a single filegroup with one file.
- A partition scheme can be defined on a single filegroup with multiple files.
- A partition scheme can be defined on multiple filegroups, with each filegroup having one or more files.
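The mapping from files to filegroups and from a partition scheme to filegroups can be set up entirely with DDL. The T-SQL below is a minimal sketch; the database name, filegroup names, file paths, and sizes are illustrative assumptions and are not part of the paper's example.

-- Assumed database "SalesDB"; adjust names and drive paths to the real environment.
ALTER DATABASE SalesDB ADD FILEGROUP FG2012;
ALTER DATABASE SalesDB ADD FILEGROUP FG2013;

-- Each filegroup may own one or more physical files, possibly on different disks.
ALTER DATABASE SalesDB
    ADD FILE (NAME = SalesDB_2012, FILENAME = 'D:\Data\SalesDB_2012.ndf', SIZE = 100MB)
    TO FILEGROUP FG2012;
ALTER DATABASE SalesDB
    ADD FILE (NAME = SalesDB_2013, FILENAME = 'E:\Data\SalesDB_2013.ndf', SIZE = 100MB)
    TO FILEGROUP FG2013;

-- A partition function and a scheme that maps its two partitions onto the two filegroups.
CREATE PARTITION FUNCTION pf_Year (datetime)
    AS RANGE RIGHT FOR VALUES ('2013-01-01');
CREATE PARTITION SCHEME ps_Year
    AS PARTITION pf_Year TO (FG2012, FG2013);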
Figure 2 illustrates the relationships between
database’s data spaces, filegroups, and partition
schemes.
Figure 2: The relationships between data spaces,
filegroups, and partition schemes.
1.3 Table and Partition Overview
A table is contained in one or more partitions
and each partition contains data rows in either a
heap or a clustered index structure. The pages
of the heap or clustered index are managed in
one or more allocation units, depending on the
column types in the data rows. Figure 3 shows
the organization of a table. Table and index
pages are contained in one or more partitions. A
partition is a user-defined unit of data
organization. By default, a table or index has
only one partition that contains all the table or
index pages. The partition resides in a single
filegroup. A table or index with a single
partition is equivalent to the organizational
structure of tables and indexes in earlier
versions of SQL Server [4].
Figure 3: Table and partition organization [4].
When a table or index uses multiple partitions,
the data are partitioned, so that groups of rows
are mapped into individual partitions, based on
a specified column. The partitions can be put on
one or more filegroups in the database. The
table or index is treated as a single logical
entity when queries or updates are performed
on the data.
2 METHODOLOGY
2.1 Partition Approach
The two common techniques for dividing a dataset into slices are horizontal partitioning and vertical partitioning. Put simply, either technique can be regarded as a "logical view" onto a table. In horizontal partitioning, the table keeps the same structure and columns but is partitioned along row-level boundaries. In vertical partitioning, the original table is split into more than one physical table; each table has the same number of records and the same primary key, but the other columns differ. These tables have a 1-to-1 relationship, and the technique can be used to overcome a database engine's limit on the number of columns supported in one table. For example, a financing table might need 1,000 columns that have a 1-to-1 relationship with each other, but the database engine might support a maximum of only 256 columns per table. In that case the logical table could be split (vertically partitioned) into 4 physical tables.
Horizontal partitioning is a method of dividing the rows of data in a table, with different rows belonging to different partitions of that table. We also call these partitions "row groups".
Figure 4: Horizontal partitioning table concept.
Vertical partitions hold subsets of a table's columns, each subset storing part of the column-wise data. In principle, each subset table can be accessed independently; we also call these subsets "segments".
Figure 5: Vertical partitioning table concept.
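As a concrete sketch of the vertical case, assume a hypothetical wide Customer table split into two physical tables that share the same primary key; the table and column names below are illustrative, not taken from the paper.

-- Frequently accessed columns stay in the main segment.
CREATE TABLE CustomerCore (
    CustomerID  int           NOT NULL PRIMARY KEY,
    Name        nvarchar(100) NOT NULL,
    Region      nvarchar(50)  NOT NULL
);

-- Rarely used columns move to a second segment with a 1-to-1 relationship.
CREATE TABLE CustomerExtended (
    CustomerID  int            NOT NULL PRIMARY KEY
        REFERENCES CustomerCore (CustomerID),
    Notes       nvarchar(max)  NULL,
    Photo       varbinary(max) NULL
);

-- The original logical row is reassembled with a join on the shared key.
SELECT c.CustomerID, c.Name, e.Notes
FROM CustomerCore c
JOIN CustomerExtended e ON e.CustomerID = c.CustomerID;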
Horizontal and vertical partitions are like two knives that cut the base table apart from different angles. It is worth pointing out that this "cutting" is different from using an ordinary SQL statement to filter out rows and columns: the quantity of data in the base table remains the same, but it has been divided into pieces under the covers. The smaller pieces can be spread across as many different disks as possible, yet from the developer's point of view the table still behaves like an ordinary table.
As an example, the base table (e.g. SalesTable) looks like the following:
Figure 6: The sample base table called SalesTable.
After combining the horizontal and vertical techniques, the original bigger table is divided into 12 smaller tables.
Figure 7: The concept of how data are stored after
partitioned horizontally and vertically.
If we perform a simple query like this:
SELECT ProductKey, SUM(SalesAmount)
FROM SalesTable
WHERE OrderDateKey < 20130108
GROUP BY ProductKey
the result would involve only the following highlighted region, touching merely 3 small tables:
Figure 8: The sample result of vertical partitions filtering.
We fetch only the needed columns and rows under the filtering condition, and, better still, no disk I/O (Input/Output) effort is spent on the remaining 9 tables (marked with X in Figure 8); only the disks holding the 3 involved tables are touched. This is quite different from running a SELECT against a whole, unpartitioned table or its indexes on the same disks. The benefits that partitioning techniques bring are therefore easy to imagine:
- Less storage used and faster storage speed, by means of multiple I/O paths via multiple filegroups and disks.
- Smaller, faster backups and restores, by means of backing up only the changed filegroups.
- Faster inserts and deletes of large data sets, by means of partition switching.
The first two points are obvious; the last one may be less clear at this stage, and we describe it further in the coming pages.
Of course, every benefit has a cost, and partitioning techniques are no exception:
- Partitioned objects require more memory than non-partitioned objects.
- The database administrator must have sufficient knowledge of partitioning.
- Partitioned objects may have some restrictions in certain situations.
Although partitioning requires extra effort when managing very large numbers of records, once the data reach a sufficient scale, the total administrative time is surely lower than trying to manage the records without partitioning.
Here we introduce the partitioning methodology by which we boost the performance of data access and manipulation. Partitioning the dataset either horizontally or vertically is allowed, but operational complexity and the degree of transparency also matter. The rule of thumb is: from the database administrator's standpoint, the lower the operational complexity the better; from the developer's view, the higher the transparency the better. Consider the side effects of vertical partitioning: when columns are removed from the original base table and placed in child tables, the design is denormalized, and overall data access patterns become more complicated, because developers must know the denormalization relations between the child tables very well in addition to the functional relations between different tables. Combining the horizontal and vertical methods is even more complicated, since it multiplies the number of slices. Another consideration is how the data are used. People usually archive historical data by "time", "region", or "category"; interestingly, these correspond to horizontal distribution rather than vertical column slices. Thus the horizontal partitioning method is adopted throughout the implementation that follows.
2.2 Horizontal Partitioning
The partitioning definition is composed of three parts: defining a "partition boundary", creating a "partition function" and a "partition scheme", and then assigning them to the target table(s). The dependency relationship is shown in Figure 9:
Figure 9: The dependency relationship of partition
definitions.
Each component in the relationship depends on the existence of its predecessor and can be used by many of its successors. The partitioned table must be created on a partition scheme, which defines the filegroup storage locations for the partitioned table. This is the biggest difference from non-partitioned tables. Non-partitioned tables also have a partition, but the number of partitions is one, whereas a partitioned table can be separated into thousands of partitions.
Some terminology in more detail:
Partition boundary: a set of boundary values is specified from the range of possible values of the input parameter's data type. These boundary values define a number of sub-ranges within the overall range, called partitions. For the partitions:
- Partitions are numbered starting with 1.
- Any given value within the boundary value range can exist in only one partition.
- Two types of boundary range (RANGE LEFT / RANGE RIGHT) control which partition a record is stored in when the partitioning value equals a boundary exactly.
Table 1: Example with the boundary value of 1000:

                Partition #1   Partition #2
RANGE LEFT      <= 1000        > 1000
RANGE RIGHT     < 1000         >= 1000
Say the boundary here is 1,000 and it splits the overall range into two partitions (#1 and #2). If an incoming value of 800 is inserted into the partitioned table, which partition should it reside in? Quite simply, #1, since the incoming value is less than the boundary value of 1,000. What if the incoming value is exactly equal to the boundary value of 1,000?
The answer depends on the partition range direction we use. With RANGE LEFT, when the incoming value equals the boundary value it is assigned to the "LEFT" partition, which is partition #1 here. With RANGE RIGHT, the incoming value of 1,000 falls into the "RIGHT" partition, which is partition #2. With multiple boundary values, the partitions look like this:
RANGE LEFT with values (0, 10, 20):
- Partition 1: values from the data type minimum up to values <= 0
- Partition 2: values > 0 and <= 10
- Partition 3: values > 10 and <= 20
- Partition 4: values > 20 up to the data type maximum

RANGE RIGHT with values (0, 10, 20):
- Partition 1: values from the data type minimum up to values < 0
- Partition 2: values >= 0 and < 10
- Partition 3: values >= 10 and < 20
- Partition 4: values >= 20 up to the data type maximum
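The effect of the two range directions can be verified directly against the engine with the $PARTITION function; the function names below are only illustrative.

CREATE PARTITION FUNCTION pf_Left  (int) AS RANGE LEFT  FOR VALUES (0, 10, 20);
CREATE PARTITION FUNCTION pf_Right (int) AS RANGE RIGHT FOR VALUES (0, 10, 20);

-- Ask the engine which partition a given value would land in.
SELECT
    $PARTITION.pf_Left(10)  AS left_partition,   -- returns 2: 10 stays in the "left" partition
    $PARTITION.pf_Right(10) AS right_partition;  -- returns 3: 10 starts the "right" partition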
2.3 Metadata-Only Manipulations
One simple fact: moving a large data set can never be faster than moving a small one, just as, in the same driving mode, traveling 10,000 miles on less gasoline than 5 miles happens only in the Arabian Nights.
For instance, suppose we have two tables (Tables A and B), and a common manipulation is to move the "2008 Data" from Table A to B. The catch is that the size of the "2008 Data" is up to 10 TB.
Figure 10: Traditional SQL operation of data
manipulation.
With the traditional method, the pseudo-SQL takes two steps, as follows:
INSERT INTO TableB
SELECT * FROM TableA
WHERE data_period = '2008 Data';

DELETE FROM TableA
WHERE data_period = '2008 Data';
The first step selects the data set for the specified period and inserts it into the target table (Table B). Because the data size is up to 10 TB, it is no surprise that this operation would take several days on most slow disks, or several hours even on very high-end storage arrays. That only counts the insertion effort: any database engine must record every action for a possible later rollback, so extra disk I/O is spent writing to the log files, and the real duration and space needed can be doubled.
Unfortunately, that is just the beginning. The second step is another disaster for the whole system, because we must remove the "2008 Data" set from the source table (Table A). The ensuing flood of disk I/O costs another several days, or at least several more hours. After a torture like this, if you dared to touch the disk surfaces and feel their temperature, believe me, you would regret it.
The cost does not yet include the network bandwidth consumed, the processor power, and, above all, the precious waiting time.
Thinking differently is the only way out. The trick is: do not actually move the data sets literally; move the "metadata" behind them instead. By changing the link between a big data set and the partition number where it lives, it is possible to load or remove data from the partitioned table almost immediately, regardless of how big the data are.
Let us leverage the technique stated earlier. Partition Table A into six areas so that the data are divided according to date boundaries. Table B is a blank table with an empty partition.
Imagine the following concept:
Figure 11: Table and partition metadata movement
concept (Before switch out).
After the architecture is redesigned with table partitioning techniques, each area of the data set has its own link to a logical partition number within a filegroup. The default partition of Table B shares the same filegroup as partition #2 of Table A. By exchanging the link in the metadata definition of the "2008 Data" partition, a concept called "switch out", the fast performance is no longer a dream. The magic pseudo-statement is as follows:
ALTER TABLE TableA
SWITCH PARTITION 2
TO TableB
Guess what: this short statement does the trick and finishes the job in seconds. After the "switch out" operation, the link definition of the "2008 Data" set has changed from Table A to Table B. From the users' point of view, it is as if the "2008 Data" set had been deleted from Table A and loaded into Table B.
All of this happens in a twinkle, because the "switch" operation avoids any real data movement. The new metadata definition looks like Figure 12:
Figure 12: Table and partition metadata movement
concept (After switch out).
With the same idea, we can perform a "switch in" operation to pull a data set into the table or into a specific partition.
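A minimal sketch of the two operations follows, under the assumptions that Table A is partitioned, the non-partitioned staging table TableB shares the filegroup of the partition being switched, and the partitioning column is a hypothetical data_year integer; none of these names come from the paper's figures.

-- Switch out: partition 2 of the partitioned table is detached into an empty
-- table with an identical structure; only metadata changes hands.
ALTER TABLE dbo.TableA SWITCH PARTITION 2 TO dbo.TableB;

-- Switch in: the rows staged in TableB go back into partition 2 of TableA.
-- The staging table must carry a CHECK constraint proving that every row fits
-- inside the boundaries of the target partition.
ALTER TABLE dbo.TableB
    ADD CONSTRAINT ck_year CHECK (data_year >= 2008 AND data_year < 2009 AND data_year IS NOT NULL);
ALTER TABLE dbo.TableB SWITCH TO dbo.TableA PARTITION 2;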
2.4 Sliding the Partitions
With the interpretation in the previous sections as a basis, now is a good time to design a partitioning strategy that leverages partition switching, and the performance advantage it brings, to manage big tables in business operations.
The strategy slides the partitions to fade out old data and bring new data into the partitioned table periodically. We call this method the "sliding window", by analogy with the network packet term in the TCP/IP (Transmission Control Protocol / Internet Protocol) world. The procedure consists of the following ten steps and pseudo-code:
1. Design a partition function based on the proper column of the source table(s).

CREATE PARTITION FUNCTION
pf_Annually_Right (datetime)
AS RANGE RIGHT FOR VALUES
(
'2008-01-01', '2009-01-01',
'2010-01-01',
'2011-01-01', '2012-01-01')

Although "fixed" dates are used for the values here, real-world systems should be more flexible: replace the fixed values with dynamic values derived from the current system date, or from some calculation that increments or decrements them, formatted appropriately. For example:

SUBSTRING(CONVERT(varchar, DATEADD(dd, +7, GETDATE()), 121), 0, 12)
2. Plan the filegroup distribution and design the partition scheme to combine the partition function with the filegroup definition.

CREATE PARTITION SCHEME ps_pf_Annually
AS PARTITION pf_Annually_Right
ALL TO ([PRIMARY])

For a more performance-oriented design, planning more than one filegroup and assigning them to different physical disk arrays gains further parallel disk I/O benefits. Here, for demonstration, we put everything in the PRIMARY filegroup.
3. Use the existing source table, or design a new source table, to accommodate the data according to the partitioning structure.

CREATE TABLE [sourceTableName]
(
columnDate, column2, column3 ...
)
ON ps_pf_Annually(columnDate)
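For concreteness, a hedged version of such a table with real column types might look as follows; the table and column names are placeholders rather than the paper's.

CREATE TABLE dbo.SalesFact (
    OrderDate   datetime NOT NULL,
    ProductKey  int      NOT NULL,
    SalesAmount money    NOT NULL
) ON ps_pf_Annually (OrderDate);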
4. Design a staging table (temporary archive table) with column and index definitions identical to the source table.

CREATE TABLE [tablename_Staging]
(
columnDate, column2, column3 ...
)
5. Move the oldest data partition from the source partitioned table to the archive table with the "switch out" command. The oldest data now reside in the staging archive table; archive them in the background to other tables (this does not lock or affect the source table's transactions and performance) or truncate the staging table directly (which takes seconds, no matter how many rows there are or how big they are).
Figure 13: Switch out the old partition to staging
table.
ALTER TABLE [sourceTableName]
SWITCH PARTITION 2
TO [tablename_Staging];
TRUNCATE TABLE [tablename_Staging]
6. Remove the boundary value of the oldest partition with the "Merge" command.
Figure 14: Before merging the partitions, the oldest
boundary value is 2008-01-01.
ALTER PARTITION FUNCTION
pf_Annually_Right()
MERGE RANGE ('2008-01-01')
Figure 15: After merging the partitions, the oldest
boundary value becomes 2009-01-01.
7. Specify the NEXT USED filegroup for the
incoming new partition.
ALTER PARTITION SCHEME ps_pf_Annually
NEXT USED [PRIMARY]
8. Create a new boundary value for the new partition at the newest end of the table with the "Split" command; this cuts the leading end partition into two empty partitions.

ALTER PARTITION FUNCTION
pf_Annually_Right()
SPLIT RANGE ('2012-01-01')
9. (Optional) Load data from an existing staging table, whose column and index definitions are identical to the target partitioned table, into the proper partition location with the "switch in" command.
Figure 16: Before extending new partition, the
maximum boundary value is 2011-01-01.
Figure 17: After splitting the partitions, the maximum boundary value becomes 2012-01-01, and the data of the staging table are loaded into Partition 5 of the target partitioned table.
Notice: keep this new partition empty; never load data into either of the leading end partitions. Make sure the indexes of the existing staging table are rebuilt/refreshed before switching it into the target partitioned table.
10. The staging table now is empty and the
source table partition has been refreshed.
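After each slide, the resulting partition layout can be checked from the catalog views. The query below is a hedged sketch against the placeholder source table name used above; it lists every partition with its row count, boundary value, and filegroup.

SELECT p.partition_number,
       p.rows,
       prv.value AS boundary_value,
       fg.name   AS filegroup_name
FROM sys.partitions p
JOIN sys.indexes i
     ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.partition_schemes ps
     ON ps.data_space_id = i.data_space_id
JOIN sys.destination_data_spaces dds
     ON dds.partition_scheme_id = ps.data_space_id
    AND dds.destination_id = p.partition_number
JOIN sys.filegroups fg
     ON fg.data_space_id = dds.data_space_id
LEFT JOIN sys.partition_range_values prv
     ON prv.function_id = ps.function_id
    AND prv.boundary_id = p.partition_number
WHERE p.object_id = OBJECT_ID('dbo.sourceTableName')
  AND i.index_id IN (0, 1)          -- heap or clustered index only
ORDER BY p.partition_number;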
In the steps above we developed a strategy that slides the partitioning structure by cutting off the oldest partition and extending a new one; in the process, the structure is refreshed, the historical data are deleted, and the latest data are loaded, all in seconds. To automate the process further, simply put these steps into scheduled jobs with the proper frequency. The idea is totally different from traditional SQL DML (Data Manipulation Language) and is extremely efficient for dealing with big data in any enterprise-level application.
3 CONCLUSION
By leveraging the "divide and conquer" principle, big data can be partitioned into separate logical units; by changing the metadata relationships, each big dataset unit can "roam" between tables quickly when it needs to be moved, which greatly raises the performance and flexibility of applications in the age of data/information explosion.
Partitioning also has disadvantages and limitations. The framework depends tightly on the database engine: a scenario with good performance and feasibility on one database platform may not work as well on others because of differences in version support and internal engine behavior. The somewhat complex design requirements before setting out are also a showstopper for wide use in daily data operations. However, the benefits of the idea still outweigh its limitations. As the framework matures, the true influence of partitioning is worth looking forward to.
REFERENCES
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proceedings of the Symposium on Operating Systems Design and Implementation, San Francisco, CA, December 2004.
[2] Microsoft, "Database Engine Instances," http://technet.microsoft.com/en-us/library/hh231298.aspx, 2013.
[3] Microsoft, "Filestream Overview," http://technet.microsoft.com/en-us/library/bb933993(v=sql.105).aspx, 2013.
[4] Microsoft, "Table and Index Organization," http://technet.microsoft.com/en-us/library/ms189051(v=sql.105).aspx, 2013.
Mining Patterns with Attribute Oriented Induction
Spits Warnars
Database, Datawarehouse & Data Mining Research Center, Surya University
Jl. Boulevard Gading Serpong Blok O/1, Tangerang 15810, Indonesia
Spits.warnars@surya.ac.id
ABSTRACT
Mining data from areas of human activity such as business, education, engineering, health, and so on is important and helps people justify their decision-making processes. Attribute Oriented Induction (AOI) has been used to mine significantly different patterns since it was coined in 1989, and it has been combined with, and used as a complement to, other data mining techniques. AOI has proved powerful and offers future opportunities for exploration in finding data patterns. AOI is chosen because it can reduce many patterns by summarizing (rolling up) many low-level patterns into high-level ones in a concept tree/hierarchy. Meanwhile, non-summarized patterns at low levels of the concept tree/hierarchy can be used to sharpen the mined knowledge patterns, just like roll-up and drill-down in a data warehouse. Mapping implementations of AOI onto areas of human life such as business, education, engineering, health, and so on is useful for conveying the value of AOI mining patterns, particularly for those interested in AOI as a data mining technique that can summarize many patterns into simple ones.
KEYWORDS
Data Mining, Attribute Oriented Induction, AOI,
pattern, rule.
1. INTRODUCTION
The Attribute Oriented Induction (AOI) method, first proposed in 1989, integrates a machine learning paradigm, especially learning-from-examples techniques, with database operations; it extracts generalized rules from an interesting set of data and discovers high-level data regularities [39]. AOI provides an efficient and effective mechanism for discovering various kinds of knowledge rules from datasets or databases.
The AOI approach was developed for learning different kinds of knowledge rules, such as characteristic rules, discrimination rules, classification rules, data evolution regularities [1], association rules, and cluster description rules [2].
1) A characteristic rule is an assertion that characterizes a concept satisfied by all of the data stored in the database. This rule provides generalized concepts about a property, which can help people recognize the common features of the data in a class, for example the symptoms of a specific disease [9].
2) A discriminant rule is an assertion that discriminates the concept of one (target) class from another (contrasting) class. This rule gives a discriminant criterion that can be used to predict the class membership of new data, for example to distinguish one disease from another [9].
3) A classification rule is a set of rules that classifies the set of relevant data according to one or more specific attributes, for example classifying diseases into classes and providing the symptoms of each [40].
4) An association rule describes association relationships among the set of relevant data, for example discovering a set of symptoms that frequently occur together [12,35].
5) A data evolution regularity rule describes the general evolution behaviour of a set of relevant data (valid only for time-related/temporal data), for example describing the major factors that influence the fluctuations of
stock values through time [3,37]. Data evolution regularities can then be classified into characteristic rules and discrimination rules [3].
6) A cluster description rule is used to cluster data according to data semantics [12], for example clustering university students based on different attribute(s).
2. QUANTITATIVE AND QUALITATIVE
RULES IN AOI
Rules in AOI can be represented as quantitative or qualitative rules:
1) A quantitative rule is a rule associated with quantitative information, such as statistical information that assesses the representativeness of the rule in the database [1]. There are three types of quantitative rule: the quantitative characteristic rule, the quantitative discriminative rule, and the combined quantitative characteristic and discriminative rule.
a. A quantitative characteristic rule carries the quantitative information of a characteristic rule, and each rule in the final generalization is measured with the t-weight in formula 1 (a SQL sketch of computing these weights appears at the end of this section).
t-weight = Votes(qa) / Σ(i=1..N) Votes(qi)     (1)

where:
t-weight = percentage of each rule in the final generalized relation.
Votes(qa) = number of tuples in each rule of the final generalized relation, with Votes(qa) in {Votes(q1), ..., Votes(qN)}.
N = number of rules in the final generalized relation.
A quantitative characteristic rule is represented with the symbol → and takes the form:

∀(x) target_class(x) → condition1(x) [t:w1] ∨ ... ∨ conditionn(x) [t:wn]

where:
x belongs to the target class.
n is the number of rules in the final generalized relation.
[t:w1] is the t-weight (formula 1) for rule 1, through [t:wn], the t-weight (formula 1) for rule n.
Example:
∀(x) graduate(x) → (Birthplace(x) ∈ Canada ∧ GPA(x) ∈ excellent) [t:75%] ∨ (Major(x) ∈ science ∧ Birthplace(x) ∈ Foreign ∧ GPA(x) ∈ good) [t:25%]
b. A quantitative discriminative rule is a discrimination rule that uses quantitative information. Each rule in the target class is discriminated against the same rule in the contrasting class(es) and is measured with the d-weight in formula 2.
d-weight = Votes(qa ∈ Cj) / Σ(i=1..K) Votes(qa ∈ Ci)     (2)

where:
d-weight = percentage ratio, per rule, of the number of tuples in the target class to the total number of tuples in the target class and the contrasting class(es) for the same rule.
Votes(qa ∈ Cj) = number of tuples of each rule in the target class Cj, where Cj is in {C1, ..., CK}.
K = total number of target and contrasting classes for the same rule.
A quantitative discriminative rule is shown with the symbol ← and takes the form:

∀(x) target_class(x) ← condition1(x) [d:w1] ∨ ... ∨ conditionn(x) [d:wn]

where:
x belongs to the target class.
n is the number of rules in the target class.
[d:w1] is the d-weight (formula 2) for rule 1 in the target class, and
[d:wn] is the d-weight (formula 2) for rule n of the target class.
Example:
∀(x) graduate(x) ← (Birthplace(x) ∈ Foreign ∧ GPA(x) ∈ good) [d:100%] ∨ (Major(x) ∈ social ∧ GPA(x) ∈ good) [d:25%]
c. A quantitative characteristic and discriminative rule uses the quantitative information of both the characteristic rule and the discriminative rule, carrying both the t-weight and the d-weight for the same rules. Each rule is measured with the t-weight of formula 1 for the characteristic part and the d-weight of formula 2 for the discriminative part. A quantitative characteristic and discriminative rule is shown with the symbol ↔ and takes the form:

∀(x) target_class(x) ↔ condition1(x) [t:w1, d:w1] ∨ ... ∨ conditionn(x) [t:wn, d:wn]

where:
x belongs to the target class.
n is the number of rules in the target class.
[t:w1] is the t-weight in formula 1.
[d:w1] is the d-weight in formula 2.
Example:
∀(x) professor(x) ↔ (Birthplace(x) ∈ Foreign ∧ GPA(x) ∈ good) [t:20%, d:100%] ∨ (Major(x) ∈ social ∧ GPA(x) ∈ good) [t:10%, d:25%]
2) A qualitative rule can be obtained by the same learning process applied to its quantitative counterpart, without associating the quantitative attribute in the generalized relations [1]. A qualitative characteristic rule uses the symbol → and a qualitative discriminative rule uses the symbol ←. A qualitative rule, whether characteristic or discriminative, takes the form:

∀(x) target_class(x) [→|←] condition1(x) ∨ ... ∨ conditionn(x)

Example:
∀(x) graduate(x) → (Birthplace(x) ∈ Canada ∧ GPA(x) ∈ excellent) ∨ (Major(x) ∈ science ∧ Birthplace(x) ∈ Foreign ∧ GPA(x) ∈ good)
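The t-weight of formula 1 and the d-weight of formula 2 map naturally onto SQL window aggregates over the generalized relations. The sketch below is only an illustration under assumed table names (generalized_relation and class_generalized_relation, each with a votes column); it is not part of the original AOI prototype.

-- t-weight: each generalized rule's share of the votes in the final relation (formula 1).
SELECT rule_id,
       votes,
       100.0 * votes / SUM(votes) OVER () AS t_weight_percent
FROM generalized_relation;

-- d-weight: the target class's share of the votes for the same rule across all classes (formula 2).
SELECT rule_id,
       class_name,
       100.0 * votes / SUM(votes) OVER (PARTITION BY rule_id) AS d_weight_percent
FROM class_generalized_relation;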
3. Concept Hierarchies
One advantage of AOI is that it uses a concept hierarchy as background knowledge, which can be provided by knowledge engineers or domain experts [2,3,4]. A concept hierarchy, stored as a relation in the database, provides essential background knowledge for data generalization and multiple-level data mining. A concept hierarchy represents a taxonomy of concepts over the attribute's domain values. It can be specified based on the relationships among database attributes or by set groupings, and it can be stored in the form of relations in the same database [7].
A concept hierarchy can be adjusted dynamically based on the distribution of the set of data relevant to the data mining task. Hierarchies for numerical attributes can be constructed automatically based on data distribution analysis [7]. Concept hierarchies for numeric attributes are treated differently for the sake of efficiency [20,21,22,23,26]. For example, for a range of values between 0.00 and 1.99 there would be 200 distinct values; for efficiency, a single record with 3 fields is created rather than 200 records with 2 fields.
In a concept hierarchy, concepts are ordered by level, from specific (low-level) concepts to general (higher-level) ones. Generalization is achieved by ascending to the next higher-level concepts along the paths of the concept hierarchy. The most general concept is the null description, described as ANY, while the most specific concepts correspond to the specific attribute values in the database. A concept hierarchy can be balanced or unbalanced; an unbalanced hierarchy must be converted into a balanced one.
Figure 1 shows the concept hierarchy tree for the attribute workclass in the adult dataset [18], which
has three levels. The first (low) level has 8 concepts: without-pay, never-worked, private, self-emp-not-inc, self-emp-inc, federal-gov, state-gov, and local-gov. The second level has 5 concepts: charity, unemployed, entrepreneur, centre, and territory. The third (high) level has 2 concepts: non government and government. For example, the concept non government at the high level has 3 sub-concepts at the second level: charity, unemployed, and entrepreneur. The concept entrepreneur at the second level has 3 sub-concepts at the low level: private, self-emp-not-inc, and self-emp-inc.
Figure 1. A concept hierarchy tree for attribute
workclass in adult dataset[18]
The concept hierarchy in Figure 1 can be represented as:

Without-pay ⊂ Charity
Never-worked ⊂ Unemployed
{Private, self-emp-not-inc, self-emp-inc} ⊂ Entrepreneur
{Federal-gov, state-gov} ⊂ Centre
Local-gov ⊂ Territory
{Charity, Unemployed, Entrepreneur} ⊂ Non government
{Centre, Territory} ⊂ Government
{Non government, Government} ⊂ ANY(workclass)

where the symbol ⊂ indicates generalization; for example, Without-pay ⊂ Charity indicates that the Charity concept is a generalization of the Without-pay concept.
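Since a concept hierarchy can be stored as a relation in the same database, the workclass hierarchy of Figure 1 could be held in a simple mapping table and applied with a join. The SQL below is a hedged sketch; the table, column, and dataset names are illustrative assumptions.

-- One row per low-level concept and its next higher-level concept (only part of the hierarchy shown).
CREATE TABLE concept_hierarchy (
    attribute_name varchar(30) NOT NULL,
    low_concept    varchar(40) NOT NULL,
    high_concept   varchar(40) NOT NULL
);

INSERT INTO concept_hierarchy VALUES
    ('workclass', 'private',          'entrepreneur'),
    ('workclass', 'self-emp-not-inc', 'entrepreneur'),
    ('workclass', 'self-emp-inc',     'entrepreneur'),
    ('workclass', 'federal-gov',      'centre'),
    ('workclass', 'state-gov',        'centre'),
    ('workclass', 'local-gov',        'territory');

-- Climb the workclass attribute one level and accumulate the vote of merged tuples.
SELECT ch.high_concept AS workclass, COUNT(*) AS votes
FROM adult a                       -- hypothetical table holding the adult dataset [18]
JOIN concept_hierarchy ch
  ON ch.attribute_name = 'workclass'
 AND ch.low_concept    = a.workclass
GROUP BY ch.high_concept;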
There are four types of concept generalization
in the concept hierarchy [6]:
1) Unconditional concept generalization: rule
is associated with the unconditional IS-A
type rules. A concept is generalized to a
higher level concept because of the
subsumption relationship indicated in the
concept hierarchy.
2) Conditional/deductive rule generalization: the rule is associated with a generalization path as a deduction rule; this type of rule is conditional and can only be applied to generalize a concept if the corresponding condition is satisfied. For example, the form A(x) ∧ B(x) → C(x) means that, for a tuple x, the concept (attribute value) A can be generalized to concept C if condition B is satisfied by x; in other words, A generalizes to C only when B holds.
3) Computational rule generalization: each
rule is represented by a condition which is
value-based and can be evaluated against an
attribute or a tuple or the database by
performing some computations. The truth value of the condition then determines whether a concept can be generalized via the path.
4) Hybrid rule-based concept generalization: a
hierarchy can have paths associated with all
the above 3 different types of rules. It has a
powerful representation capability and is
suitable for many kinds of application.
Rules 2-4 are the three types of rule-based concept hierarchy [5,34], while rule 1 is a non-rule-based concept hierarchy.
A rule-based concept hierarchy is a concept hierarchy whose paths have associated generalization rules. In rule-based induction, the data cube (hypercube) in a multidimensional data warehouse is the favourable data structure [6]. To perform rule-based induction on the data in a large
data warehouse, the path relation algorithm is an excellent choice, because the data warehouse is already structured as a cube/hypercube [6]. Rule-based concepts suffer from the induction anomaly problem, which affects efficiency and is caused by:
1) A rule may depend on an attribute which
has been removed.
2) A rule may depend on an attribute whose
concept level in the prime relation has been
generalized too high to match the condition
of the rule.
3) A rule may depend on a condition which
can only be evaluated against the initial
relation, e.g. the number-of-tuples in the
relation.
There are three ways to solve the induction anomaly problem [6]:
1) Reapplying the deduction rules all over again on the initial relation, which is costly and wasteful.
2) Repetitive generalization, required by roll-up and drill-down, which can be done efficiently without the induction anomaly problem.
3) Using the proposed path relation method (rather than the backtracking algorithm) [5,6].
4. AOI prototype
The AOI method was implemented in a data mining system prototype called DBMINER [5,7,17,28,29], previously called DBLearn, and has been tested successfully against large relational databases. DBLearn [24,25,27,38] is a prototype data mining system developed at Simon Fraser University. DBMINER was developed by integrating database, OLAP, and data mining technologies [17,36] and has the following features:
1) Incorporating several data mining techniques such as attribute oriented induction, statistical analysis, progressive deepening for mining multiple-level rules, and meta-rule guided knowledge mining [7], together with data cube and OLAP technology [17].
2) Mining new kinds of rules from large
databases including multiple level
association rules, classification rules, cluster
description rules and prediction.
3) Automatic generation of numeric
hierarchies and refinement of concept
hierarchies.
4) High level SQL-like and graphical data
mining interfaces.
5) Client server architecture and performance
improvements for larger application.
6) SQL-like data mining query language
DMQL and Graphical user interfaces have
been enhanced for interactive knowledge
mining.
7) Perform roll-up and drill-down at multiple
concept levels with multiple dimensional
data cubes.
5. AOI algorithms
AOI can be implemented with the architecture design shown in Figure 2, where the characteristic rule (LCHR) and the classification rule (LCLR) can be learned directly from the transactional database (OLTP) or the data warehouse (OLAP) [6,8], with the help of the concept hierarchy for knowledge generalization. The concept hierarchy can be created from the OLTP database as a direct resource.
Figure 2. AOI architecture
From a database we can identify two types of learning:
1) Positive learning, as the target class, where the data are tuples in the database that are consistent with the learning concepts.
The positive learning/target class is built when learning a characteristic rule.
2) Negative learning, as the contrasting class, where the data do not belong to the target class. The negative learning/contrasting class is built when learning a discrimination or classification rule.
The characteristic rule is used by AOI to recognize, learn, and mine a specific character for each attribute as its specific mining characterization. The characteristic rule performs generalization with the help of the concept hierarchy, the stored background knowledge, to find the target class as positive learning. Mining cannot be limited to only one rule: the more rules that can be created, the more mining can be done. This has been shown to act as an intelligent system that can help humans build a system with the ability to think like a human [3]. Rules can often be discovered by generalization in several possible directions [9].
A relational database, as the resource for data mining with AOI, can be read with the data manipulation language SELECT SQL statement [13,14,15,16]. Using a query to build rules gives an efficient mechanism for understanding the mined rules [11,12]. In the current AOI, a query is processed with the SQL-like data mining query language DMQL at the beginning of the process. It collects the relevant sets of data by processing a transformed relational query, generalizes the data by AOI, and then presents the outputs in different forms [7].
AOI generalizes and reduces the prime relation further until the final relation satisfies the user's expectation based on the set threshold. One or two thresholds can be applied: a single threshold controls both the number of distinct attribute values and the number of tuples in the generalization process, whereas two thresholds control the number of distinct attribute values and the number of tuples separately.
The threshold that controls the maximum number of tuples of the target class in the final generalized relation can be replaced with a GROUP BY operator in the SQL SELECT statement, which limits the final result of the generalization. Setting different thresholds generates different generalized tuples, and repeatedly doing so to obtain a global picture of the induction is time-consuming and tedious work [10]. All interesting generalized tuples, as multiple rules forming the global picture of the induction, can be generated by using the GROUP BY operator or the DISTINCT function in the SQL SELECT statement.
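As a hedged illustration of that replacement (the table and column names below are assumptions, not from the paper): once the attributes have been generalized to the desired concept level, a single grouped query returns every generalized tuple with its accumulated vote in one pass.

-- All generalized tuples with their accumulated votes, in one pass.
SELECT workclass_level2, education_level1, COUNT(*) AS votes
FROM adult_generalized          -- hypothetical table holding the generalized relation
GROUP BY workclass_level2, education_level1
ORDER BY votes DESC;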
AOI can mimic data warehouse techniques by performing the generalization process repeatedly in order to generate rules at different concept levels in a concept hierarchy, enabling the user to find the most suitable discovery levels and rules. This technique performs roll-up (progressive generalization [6]) and drill-down (progressive specialization [6]) operations [2,7], which are recognized as data warehouse techniques. Finding the most suitable discovery levels and rules adds multidimensional views to a database by repeating the generalization process at different concept levels.
Building a logical formula as the representation of the final result of AOI cannot be done with the SELECT SQL statement by itself. However, the SQL statement can be combined with other applications such as Java or Visual Basic, or with server-side programs such as ASP, JSP, or PHP. The data returned by the SQL statement can then be used to create a logical formula in one of those application programs.
There are 8 strategy steps in the generalization process [3]: steps one to seven apply to the characteristic rule, and steps one to eight apply to the classification/discriminant rule.
1) Generalization on the smallest decomposable components: generalization should be performed on the smallest decomposable components of a data relation.
2) Attribute removal: if there is a large set of distinct values for an attribute but no higher-level concept is provided for the attribute, the attribute should be removed during generalization.
3) Concept tree ascension: if there exists a higher-level concept in the concept hierarchy for an attribute value of a tuple, substituting the value with its higher-level concept generalizes the tuple.
4) Vote propagation: the value of the vote is the number of accumulated tuples; votes are accumulated when identical tuples are merged during generalization.
5) Threshold control on each attribute: if the number of distinct values of an attribute in the resulting relation is larger than the specified threshold value, further generalization on this attribute should be performed.
6) Threshold control on generalized relations: if the number of tuples is larger than the specified threshold value, further generalization is performed on selected attributes and the identical tuples are merged.
7) Rule transformation: transform the final generalization into a quantitative rule and a qualitative rule, from one tuple (conjunctive) or multiple tuples (disjunctive).
8) Handling overlapping tuples: if there are overlapping tuples in both the target and contrasting classes, these tuples should be marked and eliminated from the final generalized relation.
The AOI characteristic rule algorithm [3] is given as follows:

For each attribute Ai (1 <= i <= n, where n = # of attributes) in the generalized relation GR
{  While #_of_distinct_values_in_attribute_Ai > threshold
   {  If no higher level concept in concept hierarchy for attribute Ai
      Then remove attribute Ai
      Else substitute the value of Ai by its corresponding minimal generalized concept
      Merge identical tuples
   }
}
While #_of_tuples in GR > threshold
{  Selectively generalize attributes
   Merge identical tuples
}
This AOI characteristic rule algorithm implements steps one to seven of the generalization strategy. The algorithm comprises two sub-processes: controlling the number of distinct attribute values and controlling the number of tuples.
1) Controlling the number of distinct attribute values is a vertical process that checks each attribute column by column. All attributes in the learning result of a dataset are checked until their number of distinct values is less than or equal to the threshold; this first sub-process is applied only to attributes whose number of distinct values is greater than the threshold, and only while that is the case. Each such attribute is checked for a higher-level concept in the concept hierarchy: if it has no higher-level concept, the attribute is removed; otherwise, the attribute value is substituted with the value of its higher-level concept. Identical tuples are then merged in order to summarize the generalization and accumulate the vote of the identical tuples while eliminating the redundant tuples. Eventually, after this first sub-process, every attribute in the generalization has a number of distinct values less than or equal to the threshold. This first sub-process implements steps one to five of the generalization strategy.
2) Controlling the number of tuples is a horizontal process that checks the relation rule by rule. It is carried out on the attributes that passed the first sub-process, where each attribute has a number of distinct
values less than or equal to the threshold. This second sub-process runs only while the number of rules is greater than the threshold. Selectively generalizing attributes and merging the identical tuples reduce the number of rules. The candidate attribute for further generalization can be chosen by preference, using the ratio of the number of tuples or the number of distinct attribute values, or it can be chosen by the user based on what is least interesting, either a non-interesting attribute or a non-interesting rule. As in the first sub-process, identical tuples are merged in order to summarize the generalization and accumulate the vote value of the identical tuples while eliminating the redundant tuples. Eventually, after this second sub-process, the number of rules is less than or equal to the threshold. This second sub-process implements steps three, four, and six of the generalization strategy.
The AOI discriminant rule algorithm [1] is shown below:

For each attribute Ai (1 <= i <= n, where n = # of attributes) in the generalized relation GR
{  Mark the overlapping tuples
   While #_of_distinct_values_in_attribute_Ai > threshold
   {  If no higher level concept in concept hierarchy for attribute Ai
      Then remove attribute Ai
      Else substitute the value of Ai by its corresponding minimal generalized concept
      Mark the overlapping tuples
      Merge identical tuples
   }
}
While #_of_tuples in GR > threshold
{  Selectively generalize attributes
   Mark the overlapping tuples
   Merge identical tuples
}
The AOI discriminant rule algorithm implements steps one to eight of the generalization strategy. Since the AOI discriminant rule and AOI characteristic rule algorithms share generalization strategy steps one to seven, they are essentially the same process; the only difference is step eight. They also share the same sub-processes, i.e., controlling the number of distinct attribute values as the first sub-process and controlling the number of tuples as the second. The handling of overlapping tuples, the eighth generalization strategy step, is processed at the beginning, before the first sub-process, and again in both sub-processes before identical tuples are merged.
6. AOI Advantages and disadvantages
AOI provides a simple and efficient way to learn knowledge rules from a large database and has many advantages [9], such as:
1) AOI provides additional flexibility over many machine learning algorithms.
2) AOI can learn knowledge rules in different conjunctive and disjunctive forms and provides more choices for experts and users.
3) AOI can use the facilities of a traditional relational database, such as selection, join, and projection, whereas most learning algorithms suffer from inefficiency problems in a large database environment.
4) AOI can learn qualitative rules with quantitative information, while many machine learning algorithms can only learn qualitative rules.
5) AOI can handle noisy data and exceptional cases elegantly by incorporating statistical techniques in the learning process, whereas
some learning systems can only work in a 'noise-free' environment.
However, AOI also has disadvantages [10], such as:
1) AOI provides only a snapshot of the generalized knowledge, not a global picture. The global picture can still be revealed by trying different thresholds repeatedly.
2) Adjusting different thresholds results in different sets of generalized tuples. However, applying different thresholds repeatedly is time-consuming and tedious work.
3) There is a problem in selecting the best generalized rules between a large and a small threshold. A large threshold value leads to a relatively complex rule with many disjuncts whose results may not be fully generalized; on the other hand, a small threshold value leads to a simple rule with few disjuncts that may over-generalize, with the risk of losing some valuable information.
7. AOI Current Studies
There are a number of recent studies on AOI.
One study by Chen et al has proposed a global
AOI method employing multiple-level mining
technique with multiple minimum supports in
order to generalize all interesting general
knowledge [30]. Wu et al have proposed a
Global Negative AOI (GNAOI) approach that
can generate comprehensive and multiple-level
negative generalized knowledge at the same
time [31]. Furthermore, Muyeba et al have
proposed clusterAOI, a hybrid interestingness
heuristic algorithm, which uses attribute
features such as concept hierarchies and distinct
domain attribute values to dynamically
recalculate new attribute thresholds for each less
significant attribute [32]. Moreover, Huang et al
have introduced the Modified AOI (MAOI)
method to deal with the multi-valued attribute
table and further sort the readers into different
clusters. Instead of using the concept hierarchy
and concept trees, MAOI method implemented
the concept climbing and generalization of
multi-valued attribute table with Boolean
Algebra and modified Karnaugh Map, and then
described the clusters with concept description
[33].
Meanwhile, the over-generalization problem in AOI has been reduced with entropy measurement, where the AOI algorithm was extended with feature selection so that the generalization process depends on a feature entropy measurement [41]. AOI has also been combined with EP (Emerging Pattern) to become AOI-HEP (Attribute Oriented Induction High Emerging Pattern), used to mine frequent and similar patterns [42,43,44], with future research directions such as inverse discovery learning, learning from more than two datasets, and learning other knowledge rules [45]. Moreover, the MAOI (Modified AOI) algorithm was proposed to deal with multi-valued attributes, converting the data to Boolean bits and using a K-map to converge the attributes [46]. Furthermore, AOI was modified into Frequency Count AOI (FC-AOI) and used to mine network data [47]. AOI was also extended as Extended Attribute Oriented Induction (EAOI) for clustering mixed data types, where EAOI adds functions for handling major values and numeric attributes [48,49].
Moreover, AOI was chosen as the second step of a five-step algorithm proposed to produce AOI characteristic rules for parallel machine scheduling [50]. Another approach performs classification using decision tree induction, improving the C4.5 classifier with four steps in which the first step is generalization by AOI [51]. Finally, CFAOI (Concept-Free AOI) was proposed to free AOI from the constraint of the concept tree on multi-valued attributes by combining simplified binary digits with the Karnaugh map [52].
8. Conclusion
AOI is now 26 years old (dating from 1989), which shows that it still has a place in pattern finding and that it has been combined with, and used as a complement to, other data mining techniques. AOI can mine many different patterns, and further patterns may emerge in the future. AOI has proven to be a powerful mining technique, since many patterns can be mined while producing simple pattern results. AOI is particularly strong at rolling up/summarizing data from low to high levels in a concept tree/hierarchy, which yields simple patterns. Implementations of AOI show that it is useful and recognized for mining and summarizing patterns from huge numbers of patterns of many different kinds. The use of AOI in many fields, such as business, education, engineering, and health, should be mapped in order to increase the reliability of AOI as a proven and powerful data mining technique.
Acknowledgement
This research is supported under the Program of Research Incentive of the National Innovation System (SINAS) from the Ministry of Research, Technology and Higher Education of the Republic of Indonesia, decree number 147/M/Kp/IV/2015, research code RD-2015-0020.
REFERENCES
[1] Han,J., Cai, Y., and Cercone, N. 1993. Data-driven discovery of
quantitative rules in relational databases. IEEE Trans on Knowl
and Data Engin, 5(1),29-40.
[2] Han,J. and Fu, Y. 1995. Exploration of the power of attribute-
oriented induction in data mining. in U. Fayyad, G. Piatetsky-
Shapiro, P. Smyth and R. Uthurusamy, eds. Advances in
Knowledge Discovery and Data Mining, 399-421.
[3] Han, J., Cai, Y. and Cercone, N. 1992. Knowledge discovery in
databases: An attribute-oriented approach. In Proceedings of the
18th Int. Conf. Very Large Data Bases, 547-559.
[4] Han,J. 1994. Towards efficient induction mechanisms in database
systems. Theoretical Computer Science, 133(2), 361-385.
[5] Cheung, D.W., Fu, A.W. and Han, J. 1994. Knowledge discovery
in databases: A rule-based attribute-oriented approach. In
Proceedings of Intl Symp on Methodologies for Intelligent
Systems, 164-173.
[6] Cheung, D.W., Hwang, H.Y., Fu, A.W. and Han, J. 2000.
Efficient rule-based attribute-oriented induction for data mining.
Journal of Intelligent Information Systems, 15(2), 175-200.
[7] Han,J., Fu, Y.,Wang, W., Chiang, J., Gong, W., Koperski,
K., Li,D., Lu, Y., Rajan,A., Stefanovic,N., Xia,B. and
Zaiane,O.R.1996. DBMiner:A system for mining knowledge in
large relational databases. In Proceedings of Int'l Conf. on Data
Mining and Knowledge Discovery, 250-255.
[8] Han,J., Lakshmanan, L.V.S. and Ng, R.T. 1999. Constraint-based,
multidimensional data mining. IEEE Computer, 32(5), 46-50.
[9] Cai, Y. 1989. Attribute-oriented induction in relational databases.
Master thesis, Simon Fraser University.
[10] Wu, Y., Chen, Y. and Chang, R. 2009. Generalized Knowledge
Discovery from Relational Databases. International Journal of
Computer Science and Network, 9(6),148-153.
[11] Imielinski, T. and Virmani, A. 1999. MSQL: A Query Language
for Database Mining. in Proceedings of Data Mining and
Knowledge Discovery, 3, 373-408.
[12] Muyeba, M. 2005. On Post-Rule Mining of Inductive Rules using
a Query Operator. In Proceedings of Artificial Intelligence and
Soft Computing.
[13] Meo, R., Psaila,G. and Ceri,S. 1998. An Extension to SQL for
Mining Association Rules. In Proceedings of Data Mining and
Knowledge Discovery,2,195-224.
[14] Muyeba,M.K. and Keane,J.A. 1999. Extending attribute-oriented
induction as a key-preserving data mining method. In Proceedings
3rd European Conference on Principles of Data Mining and
Knowledge Discovery, Lecture Notes in Computer science, 1704,
448-455.
[15] Muyeba, M. and Marnadapali, R. 2005. A framework for Post-
Rule Mining of Distributed Rules Bases. In Proceeding of
Intelligent Systems and Control.
[16] Zaiane, O.R. 2001. Building Virtual Web Views. Data and
Knowledge Engineering, 39, 143-163.
[17] Han, J., Chiang, J. Y., Chee, S., Chen, J., Chen, Q., Cheng, S.,
Gong, W., Kamber, M.,Koperski, K., Liu, G., Lu, Y., Stefanovic,
N., Winstone, L., Xia, B. B., Zaiane, O. R., Zhang, S., and Zhu,
H. 1997. DBMiner: a system for data mining in relational
databases and data warehouses. In Proceedings of the 1997
Conference of the Centre For Advanced Studies on Collaborative
Research, 8-.
[18] Frank, A. and Asuncion, A. 2010. UCI Machine Learning
Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University
of California, School of Information and Computer Science.
[19] Elfeky, M.G., Saad, A.A. and Fouad, S.A. 2000. ODMQL: Object
Data Mining Query Language. In Proceedings of the International
Symposium on Objects and Databases, 128-140.
[20] Han, J. and Fu, Y. 1994. Dynamic Generation and Refinement of
Concept Hierarchies for Knowledge Discovery in Databases. In
Proceedings of AAAI Workshop on Knowledge Discovery in
Databases, 157-168.
[21] Huang, Y. and Lin, S. 1996. An Efficient Inductive Learning
Method for Object-Oriented Database Using Attribute Entropy.
IEEE Transactions on Knowledge and Data Engineering,
8(6),946-951.
[22] Hu, X. 2003. DB-HReduction: A Data Preprocessing Algorithm
for Data Mining Applications. Applied Mathematics
Letters,16(6),889-895.
[23] Hsu, C. 2004. Extending attribute-oriented induction algorithm for
major values and numeric values. Expert Systems with
Applications, 27, 187-202.
[24] Han, J., Fu, Y., Huang, Y., Cai, Y., and Cercone, N. 1994.
DBLearn: a system prototype for knowledge discovery in
relational databases. ACM SIGMOD Record, 23(2), 516.
[25] Han, J., Fu, Y., and Tang, S. 1995. Advances of the DBLearn
system for knowledge discovery in large databases. In
Proceedings of the 14th international Joint Conference on
Artificial intelligence, 2049-2050.
[26] Beneditto, M.E.M.D. and Barros, L.N.D. 2004. Using Concept
Hierarchies in Knowledge Discovery. Lecture Notes in Computer
Science, 3171, 255-265.
[27] Fudger, D. and Hamilton, H.J. 1993. A Heuristic for Evaluating
Databases for Knowledge Discovery with DBLEARN. In
Proceedings of the International Workshop on Rough Sets and
Knowledge Discovery: Rough Sets, Fuzzy Sets and Knowledge
Discovery (RSKD '93), 44-51.
[28] Han, J. 1997. OLAP Mining: An Integration of OLAP with Data
Mining. In Proceedings of the 7th IFIP 2.6 Working Conference on
Database Semantics (DS-7),1-9.
[29] Han, J., Fu,Y., Koperski, K., Melli, G., Wang, W. And Zaïane,
O.R. 1996. Knowledge Mining in Databases: An Integration of
Machine Learning Methodologies with Database Technologies,
Canadian Artificial Intelligence,(38),4-8.
[30] Chen, Y.L., Wu,Y.Y. and Chang, R. 2012. From data to global
generalized knowledge. Decision Support Systems, 52(2), 295-
307.
[31] Wu,Y.Y., Chen,Y.L., and Chang,R., 2011, Mining negative
generalized knowledge from relational databases, Knowledge-
Based Systems,24(1), 134-145.
[32] Muyeba, M.K., Crockett, K. and Keane, J.A. 2011. A hybrid
interestingness heuristic approach for attribute-oriented mining. In
Proceedings of the 5th KES international conference on Agent
and multi-agent systems: technologies and applications (KES-
AMSTA'11), 414-424.
[33] Huang, S., Wang, L. and Wang, W. 2011. Adopting data mining
techniques on the recommendations of the library collections. In
Proceedings of the 11th international conference on Information
and Knowledge engineering, 46-52.
[34] Thanh, N.D., Phong, N.T. and Anh, N.K. 2010. Rule-Based
Attribute-Oriented Induction for Knowledge Discovery. In
Proceedings of the 2010 2nd International Conference on
Knowledge and Systems Engineering (KSE '10), 55-62.
[35] Han,J. and Fu, Y. 1995. Discovery of Multiple-Level Association
Rules from Large Databases. In Proceedings of the 21th
International Conference on Very Large Data Bases (VLDB '95),
420-431.
[36] Han, J. 1998. Towards on-line analytical mining in large
databases. SIGMOD Rec. 27(1), 97-107.
[37] Han, J., Cai, O., Cercone, N. and Huang, Y. 1995. Discovery of
Data Evolution Regularities in Large Databases. Journal of
Computer and Software Engineering,3(1),41-69.
[38] Cercone, N., Han, J., McFetridge, P., Popowich, F., Cai,Y., Fass,
D., Groeneboer, C., Hall, G. and Huang, Y. 1994. System X and
DBLearn: How to Get More from Your Relational Database,
Easily. Integrated Computer-Aided Engineering, 1(4),311-339.
[39] Cai, Y., Cercone, N. and Han, J. 1991. Learning in relational
databases: an attribute-oriented approach. Comput. Intell,
7(3),119-132.
[40] Cai, Y., Cercone, N. and Han, J. 1990. An attribute-oriented
approach for learning classification rules from relational
databases. In Proceedings of 6th International Conference on Data
Engineering, 281-288.
[41] Al-Mamory, S.O., Hasson, S.T. and Hammid, M.K. 2013.
Enhancing Attribute Oriented Induction of Data Mining, Journal
of Babylon University, 7(21), 2286-2295.
[42] S. Warnars. 2015. Mining Frequent and similar patterns with
Attribute Oriented Induction High Level Emerging Pattern (AOI-
HEP) Data Mining technique, International Journal of Emerging
Technologies in Computational and Applied Sciences (IJETCAS),
3(11), 266-276.
[43] S.Warnars, 2014. Mining Frequent pattern with Attribute Oriented
Induction High Level Emerging Pattern (AOI-HEP). Proceedings
of the 2nd International Conference on Information and
Communication Technology (ICoICT), 144-149.
[44] S.Warnars, 2012. Attribute Oriented Induction High Level
Emerging Pattern. Proceedings of the International Conference on
Granular Computing(GrC).
[45] S.Warnars, 2014. Attribute Oriented Induction High Level
Emerging Pattern (AOI-HEP) future research. Proceedings of the
8nd International Conference on Information & Communication
Technology and Systems (ICTS), 13-18.
[46] Huang, S., Hsu, P. and Lam, H.N.N. 2013. An attribute oriented
Induction approach for Knowledge discovery from relational
databases. Advances in Information Sciences and Service
Sciences (AISS), 5(3), 511-519.
[47] Tanutama, L. 2013. Frequency count Attribute Oriented Induction
of Corporate Network data for Mapping Business activity.
International Conference on Advances Science and Contemporary
Engineering (ICASCE), 149-152.
[48] Prasad, D.H. and Punithavalli, M. 2012. An integrated GHSOM-
MLP with Modified LM Algorithm for Mixed Data Clustering,
ARPN Journal of Engineering and Applied Sciences, 7(9), 1162-
1169.
[49] Prasad, D.H. and Punithavalli, M. 2013. A Novel approach for
mixed Data Clustering using Dynamic Growing Hierarchical Self-
Organizing Map and Extended Atrribute-Oriented Induction,Life
Science Journal, 10(1), 3259-3266.
[50] Kaviani, M., Aminnayeri, M, Rafienejad, S.N. and Jolai, F.2012.
An appropriate pattern to solving a parallel machine scheduling
by combination of meta-heuristic and data mining, Journal of
American Science, 8(1), 160-167.
[51] Ali, M.M, Qaseem, M.S., Rajamani, L. and Govardhan, A. 2013.
Extracting useful Rules Through Improved Decision Tree
Induction using Information Entropy, International Journal of
Information Sciences and Techniques(IJIST), 3(1), 27-41.
[52] Huang, S. 2013. CFAOI: Concept-Free AOI on Multi Value
Attributes. Life Science Journal, 10(4), 2341-2348.
Supporting Aggregation in Data Warehousing Considering User-Defined
Temporal Granularities
Paolo Terenziani (1), Antonella Andolina (2)
(1) DISIT – Universita’ del Piemonte Orientale “Amedeo Avogadro”, Viale Teresa Michel 11, 15121 Alessandria, Italy
Email: terenz@di.unito.it
(2) ITCS Sommeiller, C.so Duca degli Abruzzi, 20 - 10129 Torino, Italy
Email: antoando@libero.it
ABSTRACT
Time-varying data are important in many
applications. Starting from the 80’s, they have been
widely studied in the field of temporal databases.
More recently, data warehouses (DWs) have gained
an increasing importance, strongly focusing on
time-varying data. DWs support different temporal
granularities (e.g., days, months, years), as well as
several operators for aggregating data along them.
In the past two decades, the importance of coping
also with user-defined granularities has been clearly
identified. However, most current DWs support only the treatment of limited sets of pre-defined
granularities. In this paper, we ground our approach
on some basic results in the temporal database
literature, including the telic\atelic distinction, and
we propose a general and application-independent
framework supporting aggregation of measures
along user-defined granularities.
KEYWORDS
Temporal Databases, Data Warehousing, user-
defined temporal granularities, temporal
aggregation, telic\atelic data
1 INTRODUCTION
The history of past activities, in the form of
time-stamped data, is a crucial source of
information for knowledge\data-based decision
making (consider, e.g., Business Intelligence).
Data warehouses (DWs) are widely exploited to
access historical data to support strategic
decisions. A DW is “a collection of subject-
oriented, integrated, non-volatile, and time-
variant data to support management’s
decisions” [1]. Thus, the temporal dimension is
an intrinsic part of the definition of DWs. Facts
in DWs have a temporal dimension, regardless
of whether source data are temporal or not. If
source data are not temporal, the time when
data are loaded into the DW is associated to
them, otherwise the time-stamps in the source
data are imported into the DW.
The research in the area of databases clearly
points out that the treatment of temporal data is
a semantically and practically complex issue
and that, given both its generality and
complexity, it cannot be delegated to users and
system-designers (see, e.g., the illuminating
examples in [2]). The area of research of
temporal databases (TDB henceforth) has
originated to cope with temporal data in
databases, leading , in the last three decades, to
a large set of dedicated conferences and
scientific papers (e.g., more than 2000 papers
are referred in the bibliography in [3], and the
recent Encyclopedia of Database Systems [4]
contains about 90 entries on time-related
terms). However, several problems still deserve
further investigation. We focus on one of them:
aggregation of measures along multiple user-
defined temporal granularities, to cope with
problems like the one in Ex.1, where working-
weeks (the period from Mondays to Fridays) is
an example of user-defined granularity.
Ex.1: Given salaries expressed at the granularity of working-weeks, aggregate them at the granularity of months.
In Data Warehousing (DW), the aggregation
(through functions such as sum, average, etc.)
of measures along dimensions (e.g., in “roll-up” operations) is a fundamental and frequently
adopted operation. Considering the temporal
dimension, current DW systems operate on a
fixed pre-defined set of granularities, usually
taken from the Gregorian calendar. On the other
hand, supporting user-defined granularities
involves being able to take into account also new, domain- and\or task-dependent granularities, defined by the system designer or by users.
there is a wide consensus about the importance
of coping with user-defined granularities, and
many different approaches have been devised to
achieve such a task (see the discussion in
Section 5 of this paper). For example, in their
attempt to provide a new standard (TSQL2 [2])
for the treatment of time in temporal databases,
Snodgrass and co-authors emphasized that in
many practical applications, supporting a
“standard” set of granularities (e.g., seconds,
minutes, ..., days, months, years) does not
suffice, since user-defined granularities need to
be used. In [2], the authors stress that the usage
of a calendar (informally, a calendar can be
seen as a set of related time granularities)
depends on the cultural, legal and even business
orientation of the user, and that no calendar is
inherently “better” than another. They also
show many examples of different calendars and
periodicities, including e.g. Gregorian vs. lunar
calendar, and fiscal vs. academic year. On the
other hand, in most DW approaches only a pre-
defined set of temporal granularities (usually
taken from the Gregorian calendar only) is
provided, and only regular mappings (i.e.,
mappings for which some conversion constant
exists [5]) between such a pre-defined set of
granularities are usually supported. In case they
are considered, irregular mappings between
pre-defined granularities are treated in an ad-
hoc way. On the other hand, no support is
provided to extend a DW with new user-
defined granularities (whose mapping onto the
pre-existent ones may be, obviously, irregular).
Indeed, in their recent book about advanced
spatial and temporal DW, Malinowski and
Zimanyi [5] have explicitly claimed: “Even
though the problem of managing data with
multiple temporal granularities has been widely
investigated in the area of temporal databases,
this is still an open research area in the field of
temporal datawarehouses” (page 246).
In this paper, we overcome such a relevant
limitation of the current literature, coping with
temporal aggregation in the context of multiple
user-defined temporal granularities. We
consider a context in which, as in most TDB
approaches to user-defined temporal
granularities (e.g., in TSQL2 [2], and in [6]),
temporal granularities may be arbitrarily
defined by system administrators\users, and
users may switch between any pair of
granularities. Abstractly speaking, the whole
problem may be split into three parts:
(i) defining proper mapping algorithms that determine the correspondence between granules in G_origin and granules in G_task;
(ii) defining efficient algorithms to collect data to be aggregated at a given granularity;
(iii) supporting users with appropriate aggregation functions to perform “meaningful” aggregations.
The TDB literature has devoted a lot of
attention to problems (i) (see, e.g., the book [6])
and (ii) (called temporal group composition in
[7]). On the other hand, in this paper we focus
on the problem (iii) above, which has been
partially neglected by the TDB literature,
especially as concerns its impact on DWs. We
propose a general-purpose and domain-
independent methodology to systematically
provide users with a set of proper aggregation
functions/operations to aggregate values along
the DW temporal dimension, when moving
from an origin temporal granularity (e.g.,
business-weeks) to a task granularity (e.g.,
months), and considering the temporal
relationships of the origin and task granularities
and the nature of the data to be aggregated. In
particular,
(1) we focus on the telic vs. atelic nature of
data, whose impact on TDBs has only
recently started to be explored [8], and
(2) differently from the current DW
approaches, in which the granularities used
in the temporal dimension are limited
(usually, a subset of the Gregorian
calendar), we support user-defined
granularities, and irregular mappings
between them.
To the best of our knowledge, there is no DW
approach that deals in a systematic way with
problem (iii), and with the above tasks.
Obviously, it would be possible to delegate
users and system developers to develop
aggregation functions between multiple user-
defined granularities at the application level.
However, such a choice would result in a huge
extra-amount of (very complex) work for them.
We thus aim at providing a suitable suite of
aggregation functions once-and-for-all, at the
system level. Obviously, we do not claim that
the set of aggregation functions/operations we
identify is “complete”. We believe that it is not
reasonable to demand all of the work to the
users. Thus, our approach is a first step towards
a general (not ad-hoc or application\domain-
dependent) and user-friendly approach to cope
with the above phenomena.
Our methodology has at least three main
contexts of applications within the DW context.
It can be applied:
(1) to aggregate a measure from one temporal
granularity to another one
(2) in cases in which source data are temporal,
to aggregate source data from the source
temporal granularity to the basic temporal
granularity of the DW (or Data Mart)
(3) in cases in which slowly changing
dimensions [9] are coped with using TDB
methodologies (as suggested in [5]), to
cope with the aggregation of measures
when dimensional attributes are changing.
Since we operate at the data level, in this paper
we fully explore only the first two issues above.
Finally, it is worth mentioning that, in this
paper, we focus only on valid time [10].
Transaction time [10] may be present in the
source data (and, possibly, it can be loaded in
the DW), but transaction time for aggregated
data will not give useful information for
decision-making users [11]. Anyway, our
methodology is general, thus it can be applied
also to transaction time (which is inherently
atelic [12]), and to other temporal dimensions,
such as lifespan and loading time in [5].
The paper is organized as follows. We
first introduce the telic\atelic distinction in
Section 2. Section 3 identifies three core
phenomena to be coped with when dealing with
temporal aggregation in DW, as well as a set of
relationships which holds between pairs of
user-defined temporal granularities. In section
4, we introduce a set of general functions to
cope with the above phenomena, and criteria to choose among them. Finally, section 5
presents conclusions and related works.
2 PRELIMINARIES: THE TELIC\ATELIC
DISTINCTION IN DATABASES
The telic/atelic distinction originates from the
ancient Greek philosophy. In [13], Aristotle has
divided facts into two main categories: telic and
atelic. Telic facts (e.g., John built a house) are
those facts that have telos (telos means goal in
ancient Greek), while atelic facts (e.g., John
was asleep) do not have it (the prefix “a” corresponds to the privative α in Greek).
Starting from Aristotle, the distinction between
telic and atelic facts has been widely studied in
many different areas and, in particular within
linguistics. In the area of linguistics, natural
language sentences are studied and classified
according to their syntactic, and semantic
properties, and their linguistic behavior. For instance, Vendler classified sentences as activities, accomplishments, achievements, and states [14]. In particular, the linguistic analysis
has pointed out that the telic\atelic distinction
plays an important role in characterizing the
semantics of sentences\facts. In particular,
Dowty [15] pointed out that stative sentences
(which are atelic) have the downward inheritance property [16]: for example, if John was asleep from 1 to 2pm, he was asleep in each subinterval of such an interval. Dowty also noticed that such a property does not hold for accomplishments (which are telic): if John built a house starting from April 1st and ending on October 31st, we cannot conclude that he built the house in any subinterval of such an interval (indeed, one can say that John was building a house in May, but not that he built the (whole) house in May). Indeed, also upward inheritance [16] holds for atelic facts, but not for telic ones: if John was asleep from 1 to 2 and then from 2 to 3, we can correctly conclude that John was asleep between 1 and 3; on the other hand, if John built a house from April 1st to October 31st, and then (another) house from November 1st to January 30th, one cannot conclude that John built a house from April 1st to January 30th. The telic\atelic distinction has been studied in many different areas, including linguistics, philosophy, and cognitive science, and is quite articulated. However, in this paper, it is enough to oversimplify it, considering the following definition:
Definition. Atelic vs. telic values (data). In
this paper, we define atelic those values (data)
for which upward and downward inheritance
hold, and telic those values for which such
properties do not hold.
Despite its generality and relevance, the
telic/atelic distinction has usually been
neglected in the TDB area. A relevant
exception are the work by Terenziani and
Snodgrass, that have proposed a relational data
model and algebra considering the distinction
[12], and the telic\atelic entry in the database
Encyclopedia [8]. For the first time, in this
paper we analyze the impact of the telic/atelic
distinction on the problem of temporal
aggregation between different temporal
granularities. But, first, it is important to point
out what is the main impact of the telic\atelic
distinction on TDBs: as discussed in [12], the
snapshot semantics ([2]; also called point-based
semantics), which is usually adopted in the
TDB area to model the meaning of temporal
data, applies correctly to atelic facts, but not to
telic ones, for which an interval-based
semantics is required. To show this fact, let us
consider drug administration, as an example of
telic fact. Let us suppose that Sue had an
administration of 500 mg of cyclophosphamide
(a cancer drug) starting at 1 and ending at 3
(inclusive), an administration of 500 mg of
cyclophosphamide starting at 4 and ending at 4,
and that Mary had an administration of 200 mg
of infliximab starting at 2 and ending at 4. If we
adopt the point-based semantics, the above
example is modeled as:
1 {<Sue, cyclophosphamide, 500>}
2 {<Sue, cyclophosphamide, 500>,
<Mary, infliximab, 200>}
3 {<Sue, cyclophosphamide, 500>,
<Mary, infliximab, 200>}
4 {<Sue, cyclophosphamide, 500>,
<Mary, infliximab, 200>}
Point-based semantics is not expressive enough
to deal with the temporal interpretation of telic
data, so that, in the example, we have a loss of
information. The two different drug
administrations given to Sue (one starting at 1
and ending at 3, the other from 4 to 4) cannot
be distinguished in the semantic model. The
effects of such a loss of information are quite
critical. For example, in the representations
above, there is no way to recover the fact that
1000 mg of cyclophosphamide in total were
administered to Sue. It is fundamental to stress
that such a loss of information is completely
independent of the representation language
used to model data. For instance, the above
example could be represented as
<Sue, cyclophosphamide, 500 || {1,2,3,4}> ∈ R and <Mary, infliximab, 200 || {2,3,4}> ∈ R
<Sue, cyclophosphamide, 500 || {[1–3],[4–4]}> ∈ R and <Mary, infliximab, 200 || {[2–4]}> ∈ R
<Sue, cyclophosphamide, 500 || [1–3]> ∈ R and <Sue, cyclophosphamide, 500 || [4–4]> ∈ R and <Mary, infliximab, 200 || {[2–4]}> ∈ R
But, as long as the point-based semantics is
used, the data semantics is the one elicited
above. Thus, the loss of information is the same
in all the representations. Therefore, regardless
of the chosen representation, a more expressive
semantics is needed to cope properly with telic
data. In the interval-based semantics each tuple
in a temporal relation is associated with a set of
time intervals, which are the temporal extents in
which the fact described by the tuple occurs. In
this semantics the index is a time interval. Time
intervals are atomic primitive entities, in the
sense that they cannot be decomposed. In the
interval-based semantics, the above example
can be modelled as follows:
[1–3] {<Sue, cyclophosphamide, 500>}
[2–4] {<Mary, infliximab, 200>}
[4–4] {<Sue, cyclophosphamide, 500>}
Note that the interval-based semantics is
expressive enough to cope with the intended
meaning of telic facts. The above model does
not imply that Sue had a cyclophosphamide
administration of 500 mg at 1, or one in the
period [1–4]. In the interval-based semantics,
intervals are atomic entities, that cannot be
merged or decomposed. This fact correctly
accounts for the fact that neither downward nor
upward inheritance hold for telic facts.
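The following is a small Python sketch of our own (names and data structures are assumptions, not from the paper) contrasting the two semantics on the drug-administration example: under the point-based model the two separate administrations to Sue cannot be recovered, so her total dose cannot be computed correctly, whereas the interval-based model, whose intervals are atomic, preserves it.

point_based = {  # time point -> snapshot (set of facts holding at that point)
    1: {("Sue", "cyclophosphamide", 500)},
    2: {("Sue", "cyclophosphamide", 500), ("Mary", "infliximab", 200)},
    3: {("Sue", "cyclophosphamide", 500), ("Mary", "infliximab", 200)},
    4: {("Sue", "cyclophosphamide", 500), ("Mary", "infliximab", 200)},
}

interval_based = {  # atomic interval -> set of facts holding on that interval
    (1, 3): {("Sue", "cyclophosphamide", 500)},
    (2, 4): {("Mary", "infliximab", 200)},
    (4, 4): {("Sue", "cyclophosphamide", 500)},
}

def total_dose_point_based(db, patient):
    # The best one can do is coalesce maximal runs of identical snapshots,
    # which fuses Sue's two administrations into one.
    total, previous = 0, set()
    for t in sorted(db):
        current = {f for f in db[t] if f[0] == patient}
        total += sum(dose for (_, _, dose) in current - previous)
        previous = current
    return total

def total_dose_interval_based(db, patient):
    # Intervals are atomic: each tuple is one administration, so sums are correct.
    return sum(dose for facts in db.values() for (p, _, dose) in facts if p == patient)

print(total_dose_point_based(point_based, "Sue"))        # 500  (information lost)
print(total_dose_interval_based(interval_based, "Sue"))  # 1000 (correct)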
3 RELATIONSHIPS BETWEEN USER-
DEFINED GRANULARITIES
Intuitively, in DW, temporal aggregation is a
‘roll-up’ operation grouping data from a finer to
a coarser temporal granularity (e.g., from ‘days’
into ‘weeks’). Such an operation is quite
complex, in case user-defined temporal
granularities have to be taken into account.
Granularities have been formally defined in the
temporal database glossary [17] as follows:
Definition (temporal granularity). Assume a
time domain T as a set of totally ordered time
instants. A granularity is a mapping G from the
integers (the index set) to the subsets of the
time domain such that:
(1) if i < j and G(i) and G(j) are non-empty,
then each element in G(i) is less than all the
elements in G(j), and
(2) if i < k < j and G(i) and G(j) are non-empty,
then G(k) is non-empty.
Each non-empty set G(i) in the above definition
is called granule.
The first condition in the granularity definition
represents the fact that granules in a granularity
do not overlap and that their index order is the
same as their time domain order. The second
condition represents the fact that the subset of
the index set for the granules is contiguous. On
the basis of the definition, a granularity defines
a countable set of granules; each granule can be
identified by an integer.
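As an illustration (a sketch under our own assumptions, not part of the formal framework), a granularity can be represented in Python as a mapping from integer indices to granules, and the two conditions of the definition can be checked directly:

def is_granularity(g):
    # g: dict mapping integer indices to granules (sets of time instants)
    nonempty = sorted(i for i, granule in g.items() if granule)
    # condition (2): the indices of the non-empty granules are contiguous
    if nonempty and nonempty != list(range(nonempty[0], nonempty[-1] + 1)):
        return False
    # condition (1): granules do not overlap and follow the index order
    for i, j in zip(nonempty, nonempty[1:]):
        if max(g[i]) >= min(g[j]):
            return False
    return True

# e.g., two "working weeks" over a toy time domain of day numbers (Monday = 0)
working_weeks = {1: {0, 1, 2, 3, 4}, 2: {7, 8, 9, 10, 11}}
print(is_granularity(working_weeks))  # True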
Several formal languages have been
provided to support the definition of temporal
granularities [18]. If the designer (or user) is
free to define any temporal granularity (in the
given formal language), the switching from any
granularity to another one is an hard issue to
cope with. In several approaches, the problem
is simply demanded to users\designers which
define temporal granularities: whenever a new
granularity is defined, the algorithms to convert
and\or aggregate data from the new granularity
to all the already present granularities must be
provided by the user\designer herself [2; 19].
Other approaches (e.g., [6]) introduce semantic
assumptions (see the discussion in Section 5).
In this paper, we propose a flexible and user-
friendly methodology to approach the problem. We first identify a basic set of
temporal relationships that characterize the
mapping between granularities. Then, we
propose a set of appropriate aggregation
functions\operations on the basis of such
relationships. In such a way, appropriate
aggregation functions between any user-defined
granularities can be automatically suggested to
the user, in the hypothesis that the relationship between such granularities is known.
3.1 Motivating Issues
One of the goals of this paper is to identify the
relationships between user-defined granularities
that characterize temporal aggregation in the
DW context. To achieve such a goal, we point
out three fundamental phenomena regarding
temporal aggregation along user-defined
temporal granularities in DWs. In case the
mapping between granularities are not regular
[5], temporal aggregation may involve at least
three different kinds of problems:
Loss of information. Some data may be lost
when aggregating. E.g., if days are aggregated
into working-weeks (e.g., from Monday to
Friday), some data (data concerning Saturdays
and Sundays) are left out of the aggregation;
Lack of information. The granularity being
aggregated might not “cover” the grouping
granularity. For instance, if one aggregates
working days (e.g., days from Monday to
Friday) into weeks, data concerning Saturdays
and Sundays will be missing. This is a potential
problem if operators such as average (AVG
henceforth) have to be applied.
“Split” of information. Some data may be
potentially part of different aggregates. For
instance, when aggregating weeks into months,
what about the data regarding weeks spanning
over two different months? We call this
phenomenon an “internal” split. Also, “external”
splits need to be considered. For instance, let us
take into account a user-defined granularity
“odd-months” (i.e., Jan, March, May, July,
Sept, Nov). If we aggregate weeks into odd-
months, data regarding weeks spanning over
two different months have to be somehow
“split”, since only a part of them may be
relevant for the aggregation (thus, a loss of
information may be caused by external splits).
3.2 Relationships
In our approach, we aim at facing the three
above phenomena; this goal is at the basis of
our identification of the relationships between
granularities. Given any two granularities G and H with index sets I_G and I_H respectively, we identify different possible relationships. We first introduce three general predicates that may hold between two granularities.
Definition. Intersect, Split, Cover
Given any two (user-defined) granularities G and H with index sets I_G and I_H respectively:
Intersect(G,H): ∃i∈I_G ∃j∈I_H G(i) ∩ H(j) ≠ ∅
Split(G,H): ∃i∈I_G ∃j∈I_H (G(i) ∩ H(j) ≠ ∅ ∧ ¬(G(i) ⊆ H(j)))
Cover(G,H): ∀i∈I_H ∃j_1,…,j_k∈I_G H(i) ⊆ G(j_1) ∪ … ∪ G(j_k)
Roughly speaking, Intersect(G,H) holds if there
is at least an intersection between a granule of
G and a granule of H; Split(G,H) holds if there
is at least a granule of G which overlaps (but is
not fully contained into) a granule of H (notice
that the part of G(i) which exceed H(j) may be
contained or not into some other granule of H);
Cover(G,H) holds if each granule of H is fully
contained in the union of one or more granules
of G. Of course, Intersection is implied by
Cover and by Split. We can now introduce our
relationships (see also [20]).
Definition. Relationships between granularities. Given any two (user-defined) granularities G and H with index sets I_G and I_H respectively:
SAME. G and H are the same if they are equal, i.e., ∀i∈I_H ∃j∈I_G H(i) = G(j) ∧ ∀j∈I_G ∃i∈I_H G(j) = H(i)
PARTITION. G is a partition of H if ¬SAME(G,H) ∧ Cover(G,H) ∧ Cover(H,G) ∧ ¬Split(G,H)
SPLIT-PARTITION. G is a split partition of H if Cover(G,H) ∧ Cover(H,G) ∧ Split(G,H)
COVER. G covers H if Cover(G,H) ∧ ¬Cover(H,G) ∧ ¬Split(G,H)
SPLIT-COVER. G split-covers H if Cover(G,H) ∧ ¬Cover(H,G) ∧ Split(G,H)
COVERED_BY. G is covered by H if Cover(H,G) ∧ ¬Cover(G,H) ∧ ¬Split(G,H)
SPLIT-COVERED_BY. G is split-covered by H if Cover(H,G) ∧ ¬Cover(G,H) ∧ Split(G,H)
INTERSECT. G intersects H if Intersect(G,H) ∧ ¬Cover(G,H) ∧ ¬Cover(H,G) ∧ ¬Split(G,H)
SPLIT-INTERSECT. G and H non-strictly intersect if Intersect(G,H) ∧ ¬Cover(G,H) ∧ ¬Cover(H,G) ∧ Split(G,H)
DISJOINT. G and H are disjoint if ¬Intersect(G,H)
SAME covers the case when, indeed, the two granularities are the same. Such a relationship will be used further on in Section 5.
PARTITION and SPLIT-PARTITION cope with the cases in which exactly the same parts of the timeline are covered by G and H. However, in the former case (PARTITION), each granule of H is exactly constituted by a set of granules of G (e.g., days are a partition of months); on the other hand, in the latter case (SPLIT-PARTITION), there are granules of G spanning outside some granules of H (e.g., weeks and months).
COVER and SPLIT-COVER cope with the cases in which all the parts of the timeline covered by H are also covered by G (while there are parts of the timeline covered by G and not by H). However, in the former case, each granule of H is exactly constituted by a set of granules of G (e.g., days cover working-weeks); in the latter case, there are granules of G spanning outside some granules of H (e.g., weeks and working months). COVERED_BY and SPLIT-COVERED_BY cope with the symmetric cases.
INTERSECT and SPLIT-INTERSECT cope with the cases in which G and H do not cover the same parts of the timeline (COVER(G,H) and SPLIT-COVER(G,H) do not hold), but there is at least one intersection. Again, we distinguish between the cases in which the (intersecting) granules of G are fully contained (or not) in granules of H.
Finally, DISJOINT covers the cases of non-intersecting granularities (e.g., odd-days and even-days).
The relationships above can be structured along a decision tree, as shown in Figure 1. The eight relations we have identified constitute the leaves of the tree. At each level of the tree, the alternatives are mutually exclusive. The conditions equal?, intersect? and split? represent binary (yes/no) choices. The condition covering(A,B)? has four (mutually exclusive) possibilities: A covers B and B covers A; A covers B and B does not cover A; A does not cover B and B covers A; A does not cover B and B does not cover A (denoted A|B in Figure 1). We can thus conclude that, being the leaves of such a decision tree, our relationships are exhaustive and mutually exclusive.
Figure 1. Decision tree structuring the different relationships
3.3 Automatic Evaluation of Temporal
Relationships
For the sake of generality, we do not commit to
any specific language modeling user-defined
granularities. Many of such languages have
been proposed in the literature (see Section 5).
However, whichever of such languages is considered, the determination of the relationship between two user-defined granularities can be performed in an automatic way. In particular, if the two granularities are periodic (which are the ones most commonly used), the relationship can be easily determined by generating all the instances of the two granularities in a common interval, which must be as long as the least common multiple of the two periodicities, and then automatically checking which one of the relationships holds between them (exactly one relationship holds between each pair of granularities, since the relationships are exhaustive and mutually exclusive).
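The following Python sketch (our own illustration; the granularities are assumed to be already materialized as lists of granules over a common window whose length is the least common multiple of the two periods) computes the Intersect, Split, and Cover predicates and reads the resulting relationship off the decision tree of Figure 1:

def intersect(G, H):
    return any(g & h for g in G for h in H)

def split(G, H):
    return any((g & h) and not (g <= h) for g in G for h in H)

def cover(G, H):
    union_G = set().union(*G) if G else set()
    return all(h <= union_G for h in H)

def relationship(G, H):
    # G, H: lists of granules (sets of time points) generated over a common window
    if list(G) == list(H):
        return "SAME"
    if not intersect(G, H):
        return "DISJOINT"
    cGH, cHG, s = cover(G, H), cover(H, G), split(G, H)
    if cGH and cHG:
        return "SPLIT-PARTITION" if s else "PARTITION"
    if cGH:
        return "SPLIT-COVER" if s else "COVER"
    if cHG:
        return "SPLIT-COVERED_BY" if s else "COVERED_BY"
    return "SPLIT-INTERSECT" if s else "INTERSECT"

# toy 35-day window (five 7-day weeks); a working week is Monday-Friday
days = [{d} for d in range(35)]
weeks = [set(range(w * 7, w * 7 + 7)) for w in range(5)]
working_weeks = [set(range(w * 7, w * 7 + 5)) for w in range(5)]
print(relationship(days, weeks))          # PARTITION (days partition weeks)
print(relationship(days, working_weeks))  # COVER     (days cover working-weeks)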
4 TEMPORAL AGGREGATION
In our approach, temporal aggregation is based
on the above relationships, and on the
knowledge on whether data (measures) to be
aggregated are telic\atelic and complete (see
Section 4.2) or not. In other words, the below
definitions assume that, given a granularity G to
be aggregated into a granularity H (where both
granularities may be user-defined, through a
formal specification language), the following
information is available:
- The data to be aggregated are complete or
not
- The data to be aggregated are telic or atelic
- The relationship between G and H
In the following, we first point out which
aggregation operators can be used, depending
on the telic\atelic character of the data. We then
focus on the treatment of incomplete
information in the aggregation operations and
on the treatment of “splits of information”.
Finally, we wrap-up all such ingredients into a
comprehensive approach, coping with
aggregation along the temporal dimension and
considering user-defined granularities.
4.1 Aggregation Operators.
In this paper, we just focus on the most typical
aggregation operations, namely the min, max,
avg and sum operators. Min, max and avg
operators can be applied to both telic and atelic
data. On the other hand, additivity (along the
temporal dimension) is a property of telic data,
and not of atelic ones. The reason can be easily
found in the upward inheritance property that
characterize atelic data. Upward inheritance
state that if the data value is α in a time period
[i,j] and in a contiguous period [j+1,k], then it
is α in the whole period [i,k]. This fact trivially
implies that atelic data are not additive (along
the temporal dimension). For instance, if the
balance of an account is 100K in all months of
2008, we can properly infer that the account
value in the year 2008 is 100K (not additive;
the balance is atelic). Notice however, that the
telic\atelic dichotomy does not only explain the
additive\non-additive feature that characterizes
all the DW approaches in the literature, but also
has a deep impact on the treatment of data
incompleteness and of split.
4.2 Treatment of Data Incompleteness
Informally, given a granularity G, we say that a
data (measure) m is complete on G if, for each
instance g of G in the frame of time of interest,
the value of m in g is known (e.g., the balance
of the account is known for each day in the
frame of time of interest). Data incompleteness
is related only in part to the problem of
aggregating data along different temporal
granularities. As a matter of fact, data may be
incomplete for many different reasons, mostly
independent of the aggregation or granularity
conversion operations. In general, the fact that
there is no tuple holding at a given time simply
denotes the fact that nothing happened at that
time. For example, no contract may be active
on 2010. In other cases, data may be incomplete
because the information may exist in the “real
world”, but for some reason it has not been
stored. For example, for a patient in ICU, the
blood pressure is constantly monitored, and
recorded at a given rate (say, each minute). If
the DB does not contain the pressure of the
patient at a given minute, this fact does not
imply that “in the real world” there was no
pressure value for the patient. Simply, for some
reason the data has not been recorded.
However, data incompleteness interacts with
the aggregation problem in at least two
different ways. First of all, if the origin
granularity does not fully cover the task
granularity, from the point of view of
aggregation we have that some of the data (the
ones in the span of time present in the task
granularity but not in the origin one) are
missing. For example, if the origin granularity
is working-week and the task granularity is
years, values in Saturdays and Sundays are
missing, and data incompleteness must be
faced. Second, one has to decide what policy
has to be applied in case some value in the
origin granularity (that should be included in
the task one) is missing (or NULL).
For the sake of generality, we provide users
with the possibility of choosing among different
ways of coping with missing (NULL) values
during aggregation. We have identified six main
options (of course, the list is not exhaustive,
and other options may be investigated):
(i) NULLIFY: if at least one of the values to
be aggregated is NULL, then the result of
the whole aggregation is NULL;
(ii) IGNORE: NULL values are ignored; the
aggregation operators are executed
considering only the available data;
(iii) INTERPOLATE: each NULL value is
substituted by the value obtained through
the average between the temporally closest
values preceding and following it;
(iv) PREC: each NULL value is substituted by
the closest value preceding it;
(v) NEXT: each NULL value is substituted by
the closest value following it;
(vi) AVG_INTERP: each NULL value is
substituted by the value obtained through
the average between all the available
values to be aggregated with it.
EXAMPLES. Suppose we have the following
daily values, and we want to aggregate them into weeks through the sum operator:
Day1: 100 | Day2: NULL | Day3: NULL |
Day4: 150 | Day5: 100 | Day6: NULL |
Day7:100
The results obtained through the above strategies are the following:
NULLIFY: NULL
IGNORE: 450
INTERPOLATE: 100+125+125+150+100+100+100 = 800
PREC: 100+100+100+150+100+100+100 = 750
NEXT: 100+150+150+150+100+100+100 = 850
AVG_INTERP: 100+112.5+112.5+150+100+112.5+100 = 787.5
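The sketch below (our own Python illustration; the function name fill is hypothetical) implements the six options and reproduces the weekly sums of the example above:

def fill(values, strategy):
    # values: list of daily values, with None standing for NULL
    known = [v for v in values if v is not None]
    if strategy == "NULLIFY":
        return None if None in values else list(values)
    if strategy == "IGNORE":
        return known
    if strategy == "AVG_INTERP":
        avg = sum(known) / len(known)          # average of all available values
        return [avg if v is None else v for v in values]
    out = list(values)
    for i, v in enumerate(values):
        if v is None:
            prec = next((values[j] for j in range(i - 1, -1, -1) if values[j] is not None), None)
            nxt = next((values[j] for j in range(i + 1, len(values)) if values[j] is not None), None)
            if strategy == "PREC":
                out[i] = prec
            elif strategy == "NEXT":
                out[i] = nxt
            elif strategy == "INTERPOLATE":
                out[i] = (prec + nxt) / 2      # average of closest preceding and following values
    return out

days = [100, None, None, 150, 100, None, 100]
for s in ["NULLIFY", "IGNORE", "INTERPOLATE", "PREC", "NEXT", "AVG_INTERP"]:
    filled = fill(days, s)
    print(s, None if filled is None else sum(filled))
# NULLIFY None, IGNORE 450, INTERPOLATE 800.0, PREC 750, NEXT 850, AVG_INTERP 787.5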
In general, we allow users to freely choose
between all available options, but recommend
to consider the telic\atelic nature of the data to
be aggregated. In particular:
Atelic data: any of the above strategies can
be selected
Telic data: only the NULLIFY and
IGNORE strategies can be selected
The reason of the above restriction lies
in the different nature of telic and atelic data.
Telic facts are non-homogeneous facts having a specific goal, and have neither the upward nor the downward inheritance property. As a
consequence, in general, each fact is an
“individual” one, independent and unrelated to
the others. In such a context, it does not look
meaningful to evaluate the (missing) value of
one fact on the basis of other values of other
facts. For instance, the fact that I had a 100K contract in May and a 1000K contract in July does not tell me anything about whether I had a contract in June, or about the amount of that contract, if any (contracts are telic). The
behavior of atelic data is different. Atelic facts
are homogeneous facts for which both
downward and upward inheritance hold. Thus,
for such facts, a sort of “persistency” holds, so
that one can meaningfully try to approximate
the value of a fact on the basis of temporally
contiguous values. For instance, from the fact
that my bank account was 100K in May and 120K in July, I can make some reasonable guess about the amount of my account in June
(the balance of accounts is atelic).
4.3 Treatment of “Splits”
The problem of “splits” regards the evaluation
of values of data which, during aggregation,
overlaps different periods. A typical example is
the monthly aggregation of weekly data (where
weeks can stretch over two months). Once
again, we do not think there is just one solution:
the strategy we suggest is to provide users with
a suite of alternative possibilities. We propose
to consider at least the following functions:
(i) TOTAL: the whole value is attributed to
each part of the split
(ii) NONE: the value is NULL for each part of
the split
(iii) DIVIDE: the value is divided and equally
attributed to each part of the split
(iv) PROP_DIVIDE: the value is proportionally
attributed to each part of the split (of
course, such an evaluation requires the
availability of support measures).
EXAMPLES. For instance, the 14th (European-calendar) week of 2009 starts on Monday March 30th and ends on Sunday April 5th. If the value for that week is 140, we can obtain the following result for the split into months:
TOTAL: 140 is attributed to both (the last week of) March and (the first week of) April
NONE: the value 140 is neither attributed to March nor to April
DIVIDE: 70 is attributed to both (the last week of) March and (the first week of) April
PROP_DIVIDE: 40 is attributed to (the last week of) March and 100 to (the first week of) April
Notice that the above approach can be applied
not only to cope with “internal” splits (as in the
above examples), but also with “external”
splits, as exemplified below.
EXAMPLES. Consider again the 14th (European-calendar) week of 2009, but now let us consider aggregation into odd-months (so that only the data concerning March are relevant here). If the value for that week is 140, we can obtain the following result for the split:
TOTAL: 140 is attributed to the odd-month
NONE: the value 140 is not considered
DIVIDE: 70 is attributed to the odd-month
PROP_DIVIDE: 40 is attributed to the odd-month
Obviously, the application of the NONE option
causes a loss of information, while the
application of TOTAL does not preserve the
whole sum of the values.
Finally, it is worth stressing the impact of the telic\atelic dichotomy on the “split” issue. We impose that only the TOTAL function can be adopted to deal with atelic data (while all the options are possible for telic ones). This rule depends on the very nature of atelic data: since they have the downward inheritance property, their value in each subinterval is equal to the value in the whole interval. For example, if the balance of my account is 100K in the 14th week of 2009, it is 100K in each day or subpart of that week. Obviously, such an implication is not true for telic data.
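As a concrete illustration (a Python sketch with hypothetical names, not part of the framework), the four split options can be applied to the 14th week of 2009, whose value of 140 spans 2 days of March and 5 days of April:

def split_value(value, portion_sizes, strategy):
    # portion_sizes: a support measure, e.g. how many days of the granule
    # fall into each target granule; returns the value attributed to each part
    n = len(portion_sizes)
    if strategy == "TOTAL":
        return [value] * n                 # the whole value goes to every part
    if strategy == "NONE":
        return [None] * n                  # the value is dropped for every part
    if strategy == "DIVIDE":
        return [value / n] * n             # equal shares
    if strategy == "PROP_DIVIDE":
        total = sum(portion_sizes)         # shares proportional to the support measure
        return [value * s / total for s in portion_sizes]

week_value, days_per_month = 140, [2, 5]   # March 30-31 and April 1-5
for s in ["TOTAL", "NONE", "DIVIDE", "PROP_DIVIDE"]:
    print(s, split_value(week_value, days_per_month, s))
# TOTAL [140, 140], NONE [None, None], DIVIDE [70.0, 70.0], PROP_DIVIDE [40.0, 100.0]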
4.4 Temporal Aggregation: General
Approach
Table 1 in the following summarizes the
treatment we suggest for temporal aggregation,
considering user-defined granularities. The first
three columns represent the “input” of our
approach, namely the following information
about the data to be aggregated from a temporal
user-defined granularity G into a temporal user-
defined granularity G’:
(i) The first column considers whether the
(values of the) data (measures) to be
aggregated are complete or not,
(ii) The second states whether such data are
telic or atelic, and
(iii) The third states the relationship holding
between G and G’ (the case in which the
relationship SAME holds between the two
granularities is not reported in the table,
since it is not a case of temporal
aggregation - it will be useful in Section 5).
For the sake of compactness, “SPLIT-“ has
been abbreviated by “S-“ (e.g., S-
PARTITION stands for SPLIT-
PARTITION). For each case, Table 1
states the possible actions to be done.
(iv) The fourth column states whether some
treatment of incompleteness (NULL
values) in the input data has to be
performed. If so, the user may take
advantage of one of the operations
identified in section 4.2 (notice that,
depending on the telic\atelic character of
the data, different functions are suggested)
(v) The fifth column states whether a “split” of
information is necessary or not. If the value
is “yes”, one of the dedicated operations
proposed in section 4.3 can be selected by
the user, to choose the modality s\he
prefers (once again, different functions are
available, depending on the telic\atelic
character of data)
(vi) The sixth column simply states whether or not the conversion from G to G’ causes a loss of information. We think that this is important information to be given to the user, who, in case a loss of information is possible, may decide whether or not to go on with the aggregation. Notice that, in this column, the loss of information that may be caused by the application of the “Split” functions (e.g., “NONE”) has not been taken into account.
(vii) Finally, the seventh column identifies the
possible aggregation operators (in practice,
it states for which kind of data “sum” is
possible).
EXAMPLES.
Ex.1. Let us suppose a user wants to aggregate the balance of accounts (atelic facts) from weeks into months. Suppose also that such data are (at the level of weeks) complete, in the sense that there are no missing values. Since balances of accounts are atelic data, and the relationship between the granularities weeks and months is SPLIT-PARTITION, the proposed case corresponds to the 10th row of Table 1. This means that the user may apply the min, max, or avg aggregation operators, and can apply the TOTAL function to cope with the split of information. There is no loss of information due to the granularity conversion, and no need to cope with incompleteness.
Ex.2. Same as above, but the balance (atelic data) has to be aggregated from weeks into odd-months (i.e., Jan, March, May, July, Sept, Nov). The relationship between weeks and odd-months is SPLIT-COVER, so that the 12th row has to be considered. As above, the user may apply the min, max, or avg aggregation operators, and can apply the TOTAL function to cope with the split of information. There is no need to cope with incompleteness. However, in this case a warning that there is a loss of information due to the aggregation may be signaled (since data concerning “even” months are lost in the aggregation).
Ex.3. Suppose that balances (atelic data) are aggregated from working weeks (from Monday to Friday) into months, and that the input data are incomplete (NULL values are present). The corresponding row is the 30th. In such a case, besides the function TOTAL to cope with splits, our approach suggests the choice of one function to cope with incompleteness (one of NULLIFY, IGNORE, INTERPOLATE, PREC, NEXT, AVG_INTERP). There is no loss of information due to the aggregation.
Table 1. Temporal aggregation (summary). The column “Loss of info” does not take into consideration the loss of
information possibly caused by the treatment of “Splits”.
Complete\Incomplete  Telic\Atelic  Relationship  Treat. of Incomp.  Treat. of Split  Loss of Info(*)  Operators (temporal dim.)
Complete Telic PARTITION no no no min,max,avg,sum
Complete Telic S-PARTITION no yes no min,max,avg,sum
Complete Telic COVER no no yes min,max,avg,sum
Complete Telic S-COVER no yes yes min,max,avg,sum
Complete Telic COVERED_BY yes no no min,max,avg,sum
Complete Telic S-COVERED_BY yes yes no min,max,avg,sum
Complete Telic INTERSECTS yes no yes min,max,avg,sum
Complete Telic S-INTERSECTS yes yes yes min,max,avg,sum
Complete Atelic PARTITION no no no min,max,avg
Complete Atelic S-PARTITION no yes no min,max,avg
Complete Atelic COVER no no yes min,max,avg
Complete Atelic S-COVER no yes yes min,max,avg
Complete Atelic COVERED_BY yes no no min,max,avg
Complete Atelic S-COVERED_BY yes yes no min,max,avg
Complete Atelic INTERSECTS yes no yes min,max,avg
Complete Atelic S-INTERSECTS yes yes yes min,max,avg
Incomplete Telic PARTITION yes no no min,max,avg,sum
Incomplete Telic S-PARTITION yes yes no min,max,avg,sum
Incomplete Telic COVER yes no yes min,max,avg,sum
Incomplete Telic S-COVER yes yes yes min,max,avg,sum
Incomplete Telic COVERED_BY yes no no min,max,avg,sum
Incomplete Telic S-COVERED_BY yes yes no min,max,avg,sum
Incomplete Telic INTERSECTS yes no yes min,max,avg,sum
Incomplete Telic S-INTERSECTS yes yes yes min,max,avg,sum
Incomplete Atelic PARTITION yes no no min,max,avg
Incomplete Atelic S-PARTITION yes yes no min,max,avg
Incomplete Atelic COVER yes no yes min,max,avg
Incomplete Atelic S-COVER yes yes yes min,max,avg
Incomplete Atelic COVERED_BY yes no no min,max,avg
Incomplete Atelic S-COVERED_BY yes yes no min,max,avg
Incomplete Atelic INTERSECTS yes no yes min,max,avg
Incomplete Atelic S-INTERSECTS yes yes yes min,max,avg
Ex.4. Suppose contracts (telic data) are
available at the granularity of academic
trimesters (e.g., Nov-Jan, Feb-April, May-July),
and must be aggregated into years. Suppose
also that data are complete (i.e., no NULL
value in academic trimesters). The relationship
between academic trimesters and years is S-
COVERED_BY, thus the corresponding entry
is the 6th row of Table 1. NULLIFY and
IGNORE are suggested to cope with data
incompleteness (data incompleteness arises
because academic trimesters neither partition
nor cover years, so that data corresponding to
August, September, and October are not
available), and one of NONE, DIVIDE, and
DIVIDE_PROP is suggested to cope with the split
of information (from Nov-Jan into years). No loss of
information is caused by this aggregation.
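Table 1 can also be read operationally as a lookup from (completeness, telic/atelic nature, relationship between granularities) to the suggested treatments and operators. The following Python sketch encodes this reading; the encoding and function names are illustrative assumptions, not part of the original proposal.

```python
# Illustrative sketch: Table 1 encoded as a small lookup that, given the
# properties of the data and of the granularity relationship, tells whether a
# treatment of incompleteness / splits is needed, whether information is lost,
# and which aggregation operators apply. The encoding is an assumption made
# only for this example.

SPLIT_RELS = {"S-PARTITION", "S-COVER", "S-COVERED_BY", "S-INTERSECTS"}
LOSSY_RELS = {"COVER", "S-COVER", "INTERSECTS", "S-INTERSECTS"}
INCOMPLETE_RELS = {"COVERED_BY", "S-COVERED_BY", "INTERSECTS", "S-INTERSECTS"}

def suggest(complete: bool, telic: bool, relationship: str) -> dict:
    """Return the Table 1 entry for the given combination."""
    operators = ["min", "max", "avg"] + (["sum"] if telic else [])
    return {
        "treat_incompleteness": (not complete) or relationship in INCOMPLETE_RELS,
        "treat_split": relationship in SPLIT_RELS,
        "loss_of_info": relationship in LOSSY_RELS,
        "operators": operators,
    }

# Ex.4: complete telic data, academic trimesters S-COVERED_BY years (6th row).
print(suggest(complete=True, telic=True, relationship="S-COVERED_BY"))
# -> incompleteness and split must be handled, no loss of info, min/max/avg/sum
```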
5 CONCLUSIONS AND RELATED WORK
We propose a general methodology to cope
with the aggregation of data (measure values)
along different user-defined granularities in the
DW context. Specifically, we propose a new
methodology which, given any pair of
(possibly) user-defined temporal granularities
G_origin and G_task, and given a measure to
be aggregated (from G_origin to G_task), and
considering the temporal relationships between
the two granularities and some features of the data
(e.g., their telic vs. atelic nature), supports
the user with a set of proper aggregation
functions/operations.
The solution we propose builds on our
previous work in [20]. We have identified three
core phenomena in temporal data
conversion/aggregation between different time
granularities (loss of information, lack of
information, split of information), and we have
pointed out the basic relationships between
granularities which characterize such
phenomena. Such relationships are exhaustive
and mutually exclusive, so that, whichever two
temporal granularities have been (or will be)
defined by users, exactly one of the
relationships holds between them. We provide a
set of application-independent functions to cope
with lack of information (missing data) and
split of information, and then, depending on the
relationship between the two granularities
considered in the conversion/aggregation, and
on the telic vs. atelic nature of the data, we state
which functions can be chosen by users. In
particular, we extend the work in [20] (i) by
proposing a general and integrated treatment of
all the above issues, including the challenging
problem of coping with lack of data (see, in
particular, Table 1), and (ii) by specifically
addressing the problems and challenges arising
in the area of Data Warehousing. To the best of
our knowledge, there is no other approach in
the literature integrating the three core issues of
our approach: temporal granularities, temporal
aggregation, and DWs. We briefly review
these areas of research separately in the
following.
DW approaches usually support the
aggregation of data (measure values) along the
temporal dimension. However, to the best of
our knowledge, only a few of such approaches
have faced this problem by considering
different user-defined granularities. For
example, Chamoni and Stock [21] have devised
a temporal extension of OLAP to model the
evolution of dimensions, identifying several
temporal structures, but not taking into account
different user-defined granularities. Recently,
Malinowski and Zimanyi [11; 5] have proposed
a new conceptual model for DWs. The model
provides temporal support for concepts,
attributes, hierarchies, and measures, as well as
aggregation of measure values along the
temporal dimension (in which different
temporal granularities are supported). Some
complex cases of temporal aggregation are
discussed, considering situations such as "lack of
information" and "split of information".
However, they did not propose any general
solution to such cases, simply pointing out that
"mapping between different time granules must
be defined explicitly" [11].
On the basis of the formal definition of
temporal granularity proposed in the Database
glossary [18] (see also Section 3), several
primitive relationships between granularities
have been identified in [18]. Several
researchers have noticed that, in many
application domains, providing a standard set of
pre-defined granularities is not satisfactory, and
user-defined granularities must be supported
(see, e.g., [18; 6]). Starting from the pioneering
work by Clifford and Rao [22], several approaches
have been defined in the areas of temporal
Databases and Artificial Intelligence to cope
with user-defined and periodic granularities
[18; 6; 23], and with the closely related notion
of periodicity [24].
When dealing with multiple granularities, a
main problem concerns the conversion of values
when switching from one granularity to a
different one. Many approaches simply
require users or system designers to
explicitly define customized functions
performing each one of the conversions
in an ad-hoc way (see, e.g., (Dyreson, 1994)).
Bettini et al. [6] have proposed the adoption of
semantic assumptions or coercion functions to
achieve such a goal. Specifically, they
implicitly model the semantics of the
association of tuples with time by pairing
temporal tables with semantic assumptions.
Notably, such assumptions are associated with
tables once and for all by the DB designer.
Such a fixed association does not seem applicable in
the DW context, where one also has to support
the fact that users are free to choose among
different aggregation options when rolling up
along a temporal dimension. In this context, the
flexibility granted by our approach seems more
appropriate.
In temporal relational aggregation,
tuples can be grouped according to their
timestamp values. Then, an aggregation
function is applied to each group, leading to a
temporal relation. Starting already from Ben-
Zvi’s temporal relational model [25], temporal
query languages have been extended to include
aggregation functions. As described in the
survey by Gamper et al., quite different types of
grouping operations have been pointed out [26].
[7] provides an in-depth survey of temporal
(and spatial) aggregation. In particular, Vega
Lopez et al. identify a uniform framework
supporting the analysis and comparison of
different types of aggregation. The research
about temporal aggregation has also proposed a
wide range of computationally efficient
algorithms to evaluate temporal aggregation.
Starting from Tuma’s work [27], different tree-
based algorithms have been proposed in the
literature, including the aggregation tree
algorithm [28], the balanced-tree algorithm [29],
and the SB-tree algorithm [30]. Recent approaches
have also focused on temporal aggregation
within data streams [31]. On the other hand, to
the best of our knowledge, such an area of
research has not investigated the conversion
and aggregation of measure values along
different (user-defined) granularities. For
instance, Vega Lopez et al. [7] delegate all
measure conversions to a (not well-specified)
cast function which should perform all
mappings. To the best of our knowledge, no
approach in this field has systematically faced
the issue of supporting users in the choice of
aggregation functions, depending on the
telic\atelic nature of measures and on the
relationship holding between the origin and task
granularities. Thus, our approach can be
conceived as orthogonal to the ones above.
In our future work, we plan to integrate
our approach with one of the above, to achieve
a comprehensive approach that also copes with
subproblem (ii) mentioned in the introduction.
Lenz and Shoshani [32] have studied the
conditions to get a correct summarization of
OLAP and statistical databases (summarization
roughly corresponds to aggregation).
Interestingly, the summarizability conditions
they have identified roughly correspond to the
temporal problems we have identified in our
paper: loss of information, lack of information,
split of information. Informally speaking, and
using our terminology, Lenz and Shoshani’s
Condition One requires that there must not be
any split of data. Condition Two requires no
lack and no loss of information. Condition
Three resembles our requirement that
operations must be consistent with respect to
the nature of data. However, Lenz and Shoshani
pointed out their conditions in order to identify
non-problematic cases for summarization, to
focus only on such cases. The approach we
have followed in this paper is completely
different in this respect. Indeed, the goal of our
work is to support temporal aggregation along
user-defined temporal granularities, allowing
users to switch from any granularity to any
other one, independently of whether the switch
is “problematic” or not. In our context, split,
loss and lack of information are not a
consequence of some design error, or of some
mistake by users. They are just a natural
implication of having different granularities to
switch from/to.
ACKNOWLEDGEMENTS.
This work has been partially supported by
Compagnia San Paolo, GINSENG project.
REFERENCES
[1] W. H. Inmon (2002). Building the Data Warehouse,
John Wiley & Sons Publishers.
[2] R. T. Snodgrass et al., (1995). The TSQL2
Temporal Query Language, Kluwer.
[3] Y. Wu, S. Jajodia, and X. Sean Wang (1999).
Temporal Database Bibliography Update. Temporal
Databases—Research and Practice, Springer-Verlag,
LNCS 1399. 338–367.
[4] L. Liu and M. Tamer Özsu (2009). Encyclopedia of
Database Systems, Springer.
[5] E. Malinowski and E. Zimanyi (2009). Advanced
Data Warehouse Design. From Conventional to
Spatial and Temporal Applications. Springer Verlag.
[6] C. Bettini, S. Jajodia, and X.S. Wang (2000). Time
granularities in Databases, Data Mining, and
Temporal Reasoning. Springer.
[7] I.F. Vega Lopez, R.T. Snodgrass, and B. Moon
(2005). Spatiotemporal Aggregate Computation: A
Survey. IEEE TKDE, 17(2). 271–286.
[8] V. Khatri, R.T. Snodgrass, and P. Terenziani (2009).
Telic distinction in temporal databases. In [4].
[9] R. Kimball (1996). Slowly Changing Dimensions.
DBMS Magazine 9(4).
[10] C. S. Jensen and C. E. Dyreson (1998). A Consensus
Glossary of Temporal Database Concepts, in
Temporal Databases: Research and Practice, O.
Etzion, S. Jajodia and S. Sripada (eds), Springer-
Verlag. 367–405.
[11] E. Malinowski and E. Zimanyi (2008). A conceptual
model for temporal data warehouses and its
transformation to the ER and object-relational
models. Data & Knowledge Engineering 64, 101-
133.
[12] P. Terenziani, and R.T. Snodgrass (2004).
Reconciling Point-based and Interval-based
Semantics in Temporal Relational Databases: A
Proper Treatment of the Telic/Atelic Distinction.
IEEE TKDE, 16(4). 540-551.
[13] Aristotle. The Categories, on Interpretation. Prior
Analytics. Cambridge, Harvard University Press.
[14] Z. Vendler (1967). Verbs and times. Linguistics in
Philosophy, Cornell Univ. Press, NY, 97–121.
[15] D. Dowty (1986). The Effects of the Aspectual Class
on the Temporal Structure of Discourse, Tense and
Aspect in Discourse, Linguistics and Philosophy
9(1). 37–61.
[16] Y. Shoham (1987). Temporal logics in AI:
semantical and ontological considerations. Artificial
Intelligence, 33(1). 89-104.
[17] C. Bettini, C.E. Dyreson, W.S. Evans, R.T.
Snodgrass, and X. Wang (1997). A glossary of time
granularity concepts. Temporal Databases, Dagstuhl,
Springer, 406–413.
[18] C. Bettini (2009). Temporal granularities. In [4].
[19] M. Soo and R. Snodgrass (1993). Multiple Calendar
Support for Conventional Database Management
Systems. Proc. Int'l Workshop on Infrastructure for
Temporal Databases.
[20] P. Terenziani (2012). Temporal aggregation on user-
defined granularities. Journal of Intelligent
Information Systems, 38(3), pp 785-813.
[21] P. Chamoni and S. Stock (1999). Temporal
Structures in Data Warehousing. Proc. DaWaK’99,
LNCS 1676, 353-358.
[22] J. Clifford and A. Rao (1987). A simple, general
structure for temporal domains. In Proc. Conf. on
Temporal Aspects in Information Systems, 23–30.
[23] C. Combi, M. Franceschet, A. Peron (2004).
Representing and reasoning about temporal
granularities, J. Logic and Computation 14 (1), 1–77.
[24] P. Terenziani (2009). Temporal periodicity. In [4].
[25] J. Ben-Zvi (1982). The Time Relational Model. PhD
Thesis, Computer Science Department, UCLA.
[26] J. Gamper, M. Bohlen, and C. Jensen (2009).
Temporal aggregation. In [4].
[27] P.A. Tuma (1992). Implementing Historical
Aggregates in TempIS. PhD thesis, Wayne State
University, USA.
[28] M. Kline and R.T. Snodgrass (1995). Computing
Temporal Aggregates. In Proc. Internat. Conf. on
Data Engineering, 222-231.
[29] B. Moon, I.F. Vega Lopez, and V. Immanuel (2003).
Efficient algorithms for large-scale temporal
aggregation. IEEE TKDE 15(3). 744-759.
[30] J. Yang and J. Widom (2003). Incremental
computation and maintenance of temporal
aggregates. VLDB Journal 12(3). 262-283.
[31] D. Zhang, D. Gunopulos, V.J. Tsotras, and B. Seeger
(2003). Temporal and spatio-temporal aggregations
over data streams using multiple time granularities.
Information Systems 28. 61-84.
[32] H.-J. Lenz and A. Shoshani (1997). Summarizability in
OLAP and statistical data bases. Proceedings of the Ninth
International Conference on Scientific and Statistical
Database Management, 132-143.
Optimal Health Care Inventory Management Using Analytics
Neeraj Agrawal Prakarsh Paritosh Ashish Paralikar Dibyajyoti Pati
General Electric
JFWTC, 122 EPIP Bangalore, India 560066
neeraj.agrawal@ge.com
ABSTRACT
Inventory Management is an important issue for
health-care systems because it influences clinical
and financial decisions. Before selecting, adapt-
ing and implementing a process for Inventory man-
agement, it is important that the various factors af-
fecting Inventory management are carefully consid-
ered. The objective of this paper is twofold. First,
it proposes an analytic model for hospital inventory
management which is able to predict the future
demand for various inventory commodities. The
model takes into account previous demand,
population, geographic location, and other factors
to successfully predict future demand. Second, the
paper suggests an optimization model that minimizes
the cost involved in supply chain and logistics
management so that the required commodities can be
made available to the hospitals at the minimum
possible cost. A web-based portal is proposed to help
hospitals optimally distribute future inventory
requirements among the different vendors.
KEYWORDS
Health-care Information System, Analytic Modeling,
Supply Chain Management, Inventory Optimization,
Hospital Vendor Portal
1 INTRODUCTION
For the optimal use of hospital resources and
the consequent need to restore budgets,
management is responsible for finding solutions
to achieve more operational efficiency in
hospital processes [1]. Some of the key areas of
hospital management are controlling the
inventory of medical gases, drugs, and medical
supplies. According to classical inventory
models, optimum inventory levels are
maintained with the objective of minimizing
the sum of the costs of overstocking and
understocking. It is observed that, in the case of
hospital inventories, the concepts of holding
and ordering cost are not very suitable [4].
Awareness of logistics management is becom-
ing more widespread and many initiatives and
studies dealing with supply chain integration
have been undertaken [1]. However, the internal
supply chain remains the weak link in process
integration and optimization [1]. The lack of a
systemic approach to internal supply chain
management is reflected in huge materials
management costs and low service quality
delivered to patients [1]. In the current scenario
of increasing health care costs, inventory
must be optimized without sacrificing the
level of service provided.
Figure 1. Proposed Inventory Management Model.
Good inventory management is essential to the
successful operation of any health care organi-
zation. One of the most important considerations
is the proportion of the organization's budget
that is spent on inventory. Although
the amounts and dollar values of the invento-
ries carried by different types of health care
providers vary widely, in a typical hospital’s
budget 25 to 30 percent goes for medical sup-
plies and their handling. Because the inventory
of medical supplies may comprise a signifi-
cant portion of a health care organization’s to-
tal assets, reducing its inventories significantly
raises its return on investment, and hence its
position in the financial markets [2].
Figure 2. Different stages of the proposed inventory model: analyze demand and usage from
hospital inventory data; develop an analytical model to predict future demand; optimally divide
the demand among different vendors; and provide a web-based portal to display and order
inventory from the different vendors.
Since the revenues of hospitals are impacted by
the carrying costs and ordering costs of phar-
macy, medical-surgical supplies, and medical
gases, it is important to create a mathematical
model to minimize the total cost while
simultaneously ensuring that the commodities are
available on time [4]. The proposed inventory
management model and its stages are shown in
Figure 1 and Figure 2, respectively. The objective
of this paper is to come up with a predictive
analytical model to optimize the inventory of
pharmacy, medical-surgical supplies, and medical
gases from the database available in the hospital
information system. For this purpose, the usage
patterns of various commodities were analyzed
and a model was designed to predict the usage for
the upcoming month and to give an economical
solution for choosing among available vendors,
keeping in mind their capacity and storage
capabilities. The model has been designed using
various parameters such as previous demand for
surgical supplies, oxygen and other anesthetic
gases, geographical location, population density
and population distribution of the area, and
seasonal variation for different hospitals. A
web-based interface was developed for different
hospitals to display the optimal future demand of
inventory. This portal will also aid in placing
orders with the different available vendors.
2 FACTORS TO BE CONSIDERED FOR
DESIGNING ANALYTIC MODEL
The present system of inventory management
suffers from several shortcomings that result in
inefficiency.
Availability of consumables on time is of
utmost importance in the health-care industry,
where a delay of a few seconds can cost a
life. The challenge is even greater as the
number of expected patients is unpredictable,
suppliers are unreliable, and costs are rising [3].
The analytic model should be able to reduce
these uncertainties and make consumables
available on time.
The cost of medical supplies has been spiraling
up, and greater numbers of patients are
demanding high-quality and reasonably priced
health-care services. A state-of-the-art
inventory management analytic model should
automate the inventory reconciliation and
visibility process. This would reduce the amount
of work performed by hospital staff to maintain
correct inventory levels, in contrast to an archaic
inventory management system.
Figure 3. The overall ranking of top 25 drugs from the HDU across all years with the inclusion of Methotrexate (MTX)
and Tacrolimus (TAC).
Inventory standards can be measured by
the ability to deliver utmost quality ser-
vice to the patients. It can be measured
in terms of services offered, price charged
and perceived value [3]. A hospital will
have to make decisions regarding the stor-
age of supplies or direct delivery at the
point of use. This involves the implementation
of CPFR (Collaborative Planning, Forecasting
and Replenishment) among hospitals, suppliers,
and manufacturers of medical supplies.
Good inventory management also depends on
the warehouse's location, the production level
of goods, etc. It also involves finalizing the
distribution network between suppliers and
hospitals to minimize transportation and
inventory costs. Hence, optimization of the
distribution network between suppliers and
hospitals is needed for an efficient inventory
management system.
3 INVENTORY MANAGEMENT
MODEL
To accommodate the factors explained in Sec-
tion 2, analytics has been used to predict the
future requirements well in advance. The predicted
inventory requirement is then distributed among
available vendors using an optimization model.
3.1 Medical Inventory Analytic Model
For designing the analytical model, the following
features from the hospital database were taken
into consideration:
Previous demand for various pharmacy items
such as surgical supplies, oxygen, and other
anesthetic gases
Geographical location of the hospital
Population density and population distribution
of the area
Month of the year
Number of surgeries and patients admitted in a
year
Figure 4. Trends in the percentage of persons using pre-
scription drugs: United States, 1999-2008
For model design, the results of various surveys
relating the demand for inventory commodities
to various factors were studied and analyzed.
For instance, the graph in Figure 3 shows the
ranking of the top 25 drugs based on their usage
[5]. It can be clearly seen from the graph that
usage shows a linear relationship with year.
Figure 4 shows trends in the percentage of
persons using prescription drugs; this graph
also shows a linear increase in usage with
time [6].
The trends of various other hospital drugs,
commodities, and anesthetic gases were plotted
against time for different geographical locations.
It was seen that the usage pattern of the majority
of hospital commodities varies linearly with
time. The usage of various hospital drugs also
varied linearly with the logarithm of population
growth. Hence, a regression model can be
developed on the hospital database after
clustering. Clustering can be done to group
regions of different population density and to
cluster data depending on geographical location,
as sketched below.
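As an illustration of this step, the following Python sketch clusters hospitals by population density and location and then fits a separate regression per cluster. The data values, column names, and the use of scikit-learn are hypothetical assumptions for the example, not the paper's actual pipeline.

```python
# Illustrative sketch: cluster hospitals by population density and geographical
# coordinates, then fit one regression model per cluster. The toy data frame
# and its column names are assumptions for illustration only.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Hypothetical hospital records: location, population density, and yearly demand.
df = pd.DataFrame({
    "latitude":    [12.97, 12.99, 28.61, 28.70, 19.07, 19.10],
    "longitude":   [77.59, 77.60, 77.20, 77.10, 72.87, 72.88],
    "pop_density": [11000, 10500, 11300, 9800, 20000, 21000],
    "year":        [2013, 2014, 2013, 2014, 2013, 2014],
    "demand":      [410, 445, 390, 420, 610, 655],
})

# Group hospitals into regions with similar density and location.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(df[["latitude", "longitude", "pop_density"]])

# Fit a separate linear regression of demand on year within each cluster.
for cluster_id, group in df.groupby("cluster"):
    model = LinearRegression().fit(group[["year"]], group["demand"])
    print(f"cluster {cluster_id}: demand slope per year = {model.coef_[0]:.1f}")
```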
A random sample of 500 observations was
created to predict the demand for different
gases, and this set was used to train the model.
The sample data set was created for a region
with low population density. The attributes
used in this data set are the year and the
corresponding gas demand in the last month.
The results of the regression model developed
for predicting oxygen demand are shown in
Figure 5.
Figure 5. Summary of regression model developed to
predict the demand of oxygen gases.
Different regression models were built for
different commodities and different gases. The
generalized equation for commodity prediction
is given in (1):

PredictedValue = A × LastMonthDemand + B × Year + C × Month + D × log(Population)    (1)
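A minimal sketch of how the coefficients A, B, C, and D of (1) could be estimated by multiple linear regression is given below. The sample data and the library choice (scikit-learn) are assumptions made only for illustration, not the paper's actual model.

```python
# Illustrative sketch of equation (1): fit the coefficients A, B, C, D by
# multiple linear regression on (assumed) historical records, then predict
# next month's demand. Data values are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression

# Features per record: last month's demand, year, month, log(population).
X = np.array([
    [400, 2013, 1, np.log(950_000)],
    [410, 2013, 2, np.log(952_000)],
    [420, 2013, 3, np.log(954_000)],
    [405, 2013, 4, np.log(956_000)],
    [430, 2014, 1, np.log(970_000)],
    [445, 2014, 2, np.log(972_000)],
])
y = np.array([410, 420, 405, 415, 445, 460])  # observed demand in the target month

# Equation (1) has no intercept term, so fit without one.
model = LinearRegression(fit_intercept=False).fit(X, y)
A, B, C, D = model.coef_
print(f"A={A:.3f}, B={B:.3f}, C={C:.3f}, D={D:.3f}")

# Predict March 2014 demand, given February 2014's demand of 460 units.
next_month = np.array([[460, 2014, 3, np.log(974_000)]])
print("predicted demand:", model.predict(next_month)[0])
```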
3.2 Medical Inventory Optimization Solution
The primary function of the medical inventory
optimization solution is to allow hospitals to
effectively fulfill demand and to identify how to
gain additional profit. Improved efficiency
through effective resource management and
optimization leads to an increase in service
level and improved performance against
patients' requests.
Figure 6. Stages of the optimization model for inventory distribution among vendors for a
hospital: manufacturers supply distribution centers (Distributor 1, Distributor 2, Distributor 3),
which in turn supply the hospitals.
Hence, an important aspect of efficient inven-
tory management requires that the forecast de-
mand of hospital commodities is optimally dis-
tributed among the different distribution cen-
ters so that the cost incurred by the hospitals is
minimum.
The hospital medical optimization model can be
broadly divided into three stages. The stages
are shown in Figure 6.
The members participating in the model are n
manufacturers as M1, M2, M3, . . . , Mn and i
distribution centers as D1, D2, D3, . . . , Di for
j products as P1, P2, P3, . . . , Pj and k trans-
portation costs as T1, T2, T3, . . . , Tk. The
value of each product is V1, V2, V3, . . . , Vj.
The database holds the information about the
stock levels as Q1, Q2, Q3, . . . , Qj of each
product and lead time of products at each level.
The lower and upper bounds for each product
are calculated based on the regression analysis
model explained in Section 3.1.
The optimization is used to predict the
emerging excess or shortage of stock levels,
which is vital information to be maintained in
the future to minimize the cost and maximize
the availability. A multi-echelon model has been
used as a building block in the formulation of
the entire model; it considers uncertainty,
non-stationary demand, and the lead times of
the products. The evaluation function is
determined for each randomly generated
individual.
The cost function is given by (2):

Cost = Σ_D Σ_i (T_i + V_i) × Q_i    (2)

The constraints for the optimization model are given by (3):

Q_min ≤ Q_i ≤ Q_max    (3)
Evolutionary Algorithms (EAs) such as the
Genetic Algorithm (GA) are applied to the above
cost equation to minimize the cost by distributing
the demand across the various distributors (a
sketch is given below). Given the non-stationary
nature of the demand process, there may be more
inventory left over at the end of a period than is
desired in the next period, because it is assumed
that inventory can be reduced only through
demand. The minimization takes into account the
interdependence of decisions between time
periods through the use of cumulative supply
variables and demand random variables.
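The following is a minimal sketch of such an evolutionary search: a tiny genetic algorithm that splits a predicted demand among three distributors so as to minimize the cost of (2) while respecting the bounds of (3). The costs, bounds, demand value, and GA parameters are hypothetical assumptions, not the authors' implementation.

```python
# Illustrative GA sketch: allocate a predicted demand across distributors,
# minimizing total (transport + product) cost with bounded allocations.
import random

random.seed(0)

DEMAND = 500                                  # predicted demand from Section 3.1
UNIT_COST = [7.0, 6.5, 8.2]                   # T_i + V_i per distributor (assumed)
Q_MIN, Q_MAX = [50, 50, 50], [300, 250, 400]  # bounds per distributor (assumed)

def cost(q):
    """Equation (2) plus a penalty if the allocation misses the predicted demand."""
    base = sum(c * qi for c, qi in zip(UNIT_COST, q))
    return base + 1000.0 * abs(sum(q) - DEMAND)

def random_individual():
    return [random.randint(lo, hi) for lo, hi in zip(Q_MIN, Q_MAX)]

def mutate(q):
    i = random.randrange(len(q))
    q = list(q)
    q[i] = min(Q_MAX[i], max(Q_MIN[i], q[i] + random.randint(-20, 20)))
    return q

def crossover(a, b):
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

population = [random_individual() for _ in range(50)]
for _ in range(200):
    population.sort(key=cost)
    parents = population[:10]                 # simple truncation selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(40)]
    population = parents + children

best = min(population, key=cost)
print("allocation per distributor:", best, "cost:", cost(best))
```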
Figure 7. Authentication page for hospital and vendors.
3.3 A Web Based Portal Design
A web-based portal is designed to give hospitals
easy access for ordering their inventory as
suggested by the optimization model. The portal
requires authentication from hospitals and
vendors, as shown in Figure 7.
After authentication, hospitals can see on the
portal the future demand for various
commodities, as predicted by the analytical
model and optimally distributed among vendors
by the optimization model running on the
hospital server. The predicted demand for
various gases is shown in Figure 8.
When a vendor logs into their account using the
portal, the same demand is shown to the vendor.
Vendors have the option to either accept or
decline the requirement from hospitals. If a
vendor declines the request, the optimization
model automatically calculates the next best
way to distribute the demands. The vendor page
to accept or decline the request is shown in
Figure 9.
4 CONCLUSION
The analytic model that was developed provides
a solution for how much of each inventory
commodity needs to be ordered next month. The
output from the analytic model makes it possible
to eliminate orders for commodities whose
demand is not expected in the near future.
Figure 8. Next month's demand for different gases predicted by the model.
Figure 9. Vendor's page to decline or accept the hospital demand.
The objective function and the
constraints used to define the parameters for
the optimization model found the optimal
solution by distributing the demand among
different vendors, taking into account their
distance from the respective hospitals, cost of
supply, and storage capacities, such that the cost
incurred by the hospital is minimized. Finally, a
web-based portal has been designed and has
proved to be an effective and fast tool for
ordering optimum inventory. Implemented with
accurate and real-time data, this model can give
a health care system an opportunity to reduce
its inventory surplus and allow it to track what
the optimal stock levels for different inventory
commodities should be. Reduction in storage
room and optimization of ordered items provide
more visibility in supply chain management and
allow a health care system to be more responsive
in its inventory ordering. The lead time to locate
items is reduced by the reduction in storage
rooms, and the storage rooms that are freed of
inventory can be used for other hospital
purposes.
REFERENCES
[1] Raffaele Iannone, Alfredo Lambiase, Salvatore Mi-
randa, Stefano Riemma and Debora Sarno. ”Mod-
elling Hospital Materials Management Processes.”
International Journal of Engineering Business
Management, 2013, 5:15. doi: 10.5772/56607
[2] Varghese, Vijith, Manuel Rossetti, Edward Pohl,
Server Apras, and Douglas Marek. ”Applying Ac-
tual Usage Inventory Management Best Practice in
a Health Care Supply Chain.” International Journal
of Supply Chain Management 1, no. 2 (2012).
[3] Sarda, Anil N., and Yogesh J. Gharpure. "Trends
in Inventory Management in Health Care Sector in
India: Issues & Challenges."
[4] Kyong Kim, Yusuf Ahmed, Alper Murat, and
Dean Pichette, "Inventory Management and Storage
Optimization for Health Care System".
[5] Jayaraman B and Barrett JS, ”A SAS-based Query
and Exploitation Interface to Hospital Drug Utiliza-
tion Data.” ACCP Annual Meeting, 2008.
[6] Gu, Qiuping, Charles F. Dillon, and Vicki L. Burt.
”Prescription drug use continues to increase: US
prescription drug data for 2007-2008.” NCHS data
brief 42 (2010): 1-8.