A Survey of Extract–Transform–Load Technology

Panos Vassiliadis, University of Ioannina, Greece
ABSTRACT
The software processes that facilitate the original loading and the periodic refreshment of the data warehouse
contents are commonly known as Extraction-Transformation-Loading (ETL) processes. The intention of
this survey is to present the research work in the field of ETL technology in a structured way. To this end,
we organize the coverage of the field as follows: (a) first, we cover the conceptual and logical modeling of
ETL processes, along with some design methods, (b) we visit each stage of the E-T-L triplet, and examine
problems that fall within each of these stages, (c) we discuss problems that pertain to the entirety of an ETL
process, and, (d) we review some research prototypes of academic origin.
Keywords: ETL, data warehouse refreshment
INTRODUCTION
A data warehouse typically collects data from
several operational or external systems (also
known as the sources of the data warehouse)
in order to provide its end-users with access
to integrated and manageable information. In
practice, this task of data collection (also known
as data warehouse population) has to overcome
several inherent problems, which can be briefly summarized as follows. First, since the different sources structure information in completely different schemata, the need to transform the
incoming source data to a common, “global”
data warehouse schema that will eventually
be used by end user applications for querying
is imperative. Second, the data coming from
the operational sources suffer from quality
problems, ranging from simple misspellings
in textual attributes to value inconsistencies,
database constraint violations and conflicting or
missing information; consequently, this kind of
“noise” from the data must be removed so that
end-users are provided data that are as clean,
complete and truthful as possible. Third, since
the information is constantly updated in the
production systems that populate the warehouse,
it is necessary to refresh the data warehouse
contents regularly, in order to provide the users
with up-to-date information. All these problems
require that the respective software processes
are constructed by the data warehouse develop-
ment team (either manually, or via specialized
tools) and executed in appropriate time intervals
for the correct and complete population of the
data warehouse.
The software processes that facilitate the
population of the data warehouse are commonly
known as Extraction-Transformation-Loading
(ETL) processes. ETL processes are responsible
for (i) the extraction of the appropriate data from
the sources, (ii) their transportation to a special-
purpose area of the data warehouse where they
will be processed, (iii) the transformation of
the source data and the computation of new
values (and, possibly records) in order to obey
the structure of the data warehouse relation
to which they are targeted, (iv) the isolation
and cleansing of problematic tuples, in order
to guarantee that business rules and database
constraints are respected and (v) the loading of
the cleansed, transformed data to the appropri-
ate relation in the warehouse, along with the
refreshment of its accompanying indexes and
materialized views.
A naïve, exemplary ETL scenario imple-
mented in MS SQL Server Integration Ser-
vices is depicted in Figure 1. The scenario is
organized in two parts. The first part, named Extraction task (Figure 1a), is responsible for the identification of the new and the updated
rows in the source table LINEITEM. The idea
is that we have an older snapshot for line items,
stored in table LINEITEM which is compared
to the new snapshot coming from the sources of
the warehouse (depicted as NEW_LINEITEM)
in Figure 1b. Depending on whether a row is
(a) newly inserted, or, (b) an existing tuple
that has been updated, it is routed to the table
LINEITEM, via the appropriate transformation
(remember that insertions and updates cannot
be uniformly handled by the DBMS). Once the
table LINEITEM is populated, the second part
of the scenario, named Transform & Load task
(Figure 1a) is executed. This task is depicted
in Figure 1c and its purpose is to populate
with the update information several tables in
the warehouse that act as materialized views.
The scenario rst computes the value for the
attribute Prot for each tuple and then sends the
transformed rows towards four “materialized”
views that compute the following aggregated
measures (keep in mind that ExtendedPrice
refers to the money clients pay per line item,
PartKey is the primary key for items and Sup-
pKey is the primary key for suppliers):
View A: aggregate prot and average dis-
count grouped by PartKey and SuppKey
View B: average prot and extended price
grouped by PartKey and LineStatus
View C: aggregate prot and extended price
grouped by LineStatus and PartKey
View D: aggregate prot and extended price
grouped by LineStatus
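Purely as an illustration of these four aggregations (and not of the actual Integration Services package), the following sketch assumes the transformed rows are available in a pandas DataFrame with the column names used above; the sample values, and the choice of aggregation function for the extended price in View B, are our own assumptions.

import pandas as pd

# hypothetical, already transformed line items (Profit computed by the previous step)
rows = pd.DataFrame([
    {"PartKey": 1, "SuppKey": 10, "LineStatus": "O", "ExtendedPrice": 100.0, "Discount": 0.05, "Profit": 12.0},
    {"PartKey": 1, "SuppKey": 11, "LineStatus": "F", "ExtendedPrice": 80.0, "Discount": 0.00, "Profit": 9.5},
])

view_a = rows.groupby(["PartKey", "SuppKey"]).agg(Profit=("Profit", "sum"), AvgDiscount=("Discount", "mean"))
view_b = rows.groupby(["PartKey", "LineStatus"]).agg(AvgProfit=("Profit", "mean"), AvgExtPrice=("ExtendedPrice", "mean"))
view_c = rows.groupby(["LineStatus", "PartKey"]).agg(Profit=("Profit", "sum"), ExtPrice=("ExtendedPrice", "sum"))
view_d = rows.groupby(["LineStatus"]).agg(Profit=("Profit", "sum"), ExtPrice=("ExtendedPrice", "sum"))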
As one can observe, an ETL process is
the synthesis of individual tasks that perform
extraction, transformation, cleaning or loading
of data in an execution graph – also referred
to as a workow. Also, due to the nature of the
design artifact and the user interface of ETL
tools, an ETL process is accompanied by a
plan that is to be executed. For these reasons,
in the rest of this survey we will use the terms
ETL process, ETL workflow and ETL scenario
interchangeably.
The historical background for ETL pro-
cesses goes all the way back to the birth of
information processing software. Software for
transforming and filtering information from one
(structured, semi-structured, or even unstruc-
tured) le to another has been constructed since
the early years of data banks, where the relational
model and declarative database querying were
not invented. Data and software were considered
an inseparable duo for data management by that
time and thus, this software was not treated as a
stand-alone, special purpose module of the in-
formation system’s architecture. As Vassiliadis
and Simitsis (2009) mention “since then, any
kind of data processing software that reshapes
or lters records, calculates new values, and
populates another data store than the original
one is a form of an ETL program.”
After the relational model had been born
and the declarative nature of relational database
querying had started to gain ground, it was quite
natural that the research community would
try to apply the declarative paradigm to data
transformations. The EXPRESS system (Shu, Housel, Taylor, Ghosh, & Lum, 1977) is the first attempt that we know of with the purpose of producing data transformations, taking as input data definitions or the involved nonprocedural statements. During the later years, the emphasis on the data integration problem was significant, and wrapper-based exchange of data between integrated database systems was the closest thing to ETL that we can report – for example, see Roth and Schwarz (1997).

ETL took its name and existence as a separate set of tools and processes in the early '00s. Despite the fact that data warehouses had become an established practice in large organizations since the latter part of the '90s, it was only in the early '00s that both the industrial vendors and the research community cared to deal seriously with the field. It is noteworthy that, till then, the research community had typically hidden the internals of ETL processes "under the carpet" by treating the data warehouse as a set of materialized views over the sources. At the same time, the industrial vendors were focused on providing fast querying and reporting facilities to end users. Still, once data warehouses were established as a practice, it was time to focus on the tasks faced by the developers. As a result, during the '00s, the industrial field is flourishing with tools from both the major database vendors and specialized companies and, at the same time, the research community has abandoned the treatment of data warehouses as collections of materialized views and focuses on the actual issues of ETL processes.

Figure 1. The environment of Extraction-Transformation-Loading processes: (a) control flow of an ETL scenario, (b) simple extraction part of an ETL scenario, (c) computation of extra values and multiple aggregations as part of an ETL scenario.

The intention of this survey is to present the research work in the field of ETL in a structured way. The reader is assumed to be familiar with
the fundamental concepts of databases and
data warehousing. Since the focus is on ETL
processes, we will avoid the detailed cover-
age of research topics like materialized view
refreshment and data cleaning that are close to
ETL processes (in fact, in practical situations,
these tasks can be important constituents of
an ETL workow) but have an existence of
their own in the research literature. Therefore,
the coverage of these topics mainly includes a
discussion of the problems under investigation
and refers the reader to dedicated surveys with
a broad discussion of the state of the art.
The discussion will start with the examina-
tion of design issues and then, it will proceed
to cover technical issues for ETL processes. In
Section 2, we cover the conceptual and logical
modeling of ETL processes, along with some
design methods. In Section 3, we visit each stage
of the E-T-L triplet, and examine problems that
fall within each of these stages. Then, in Sec-
tion 4, we discuss problems that pertain to the
entirety of an ETL process (and not just in one
of its components), such as issues around the
optimization, resumption and benchmarking of
ETL processes, along with a discussion of the
newest trend in ETL technology, near-real time
ETL. In Section 5, we review some research
prototypes with academic origin. Finally, in
Section 6, we conclude our coverage of the
topic with an eye for the future.
STATE OF THE ART FOR THE
DESIGN AND MODELING OF
ETL PROCESSES
Traditionally, a large part of the literature, the
research activity, and the research community of
data warehouses is related to the area of concep-
tual modeling and design. To a large extent, this
is due to the fact that data warehouse projects
are highly costly and highly risky endeavors;
therefore, careful design and preparation are
necessary. Moreover, due to their complexity,
data warehouse environments should be care-
fully documented for maintenance purposes.
ETL processes could not escape the above rule
and, therefore, they have attracted the attention
of the research community. A typical reason for
this attention is also the fact that mainstream
industrial approaches –see for example Kim-
ball, Reeves, Ross & Thornthwaite (1998), or
Kimball & Caserta (2004)– focus on the physi-
cal-level details and lack a principled approach
towards the problem of designing a data ware-
house refreshment process. In this section, we
will discuss the appearance of research efforts
for the conceptual modeling of ETL processes
with a chronological perspective and also cover
some efforts concerning the logical modeling
and design methods for this task.
UML Meta Modeling for Data
Warehouses
The rst approaches that are related to the
conceptual design of data warehouses were
hidden in discussions pertaining to data ware-
house metadata. Stöhr, Müller, & Rahm (1999)
propose a UML-based metamodel for data
warehouses that covers both the back stage
and the front-end of the data warehouse. Con-
cerning the back stage of the data warehouse,
which is the part of the paper that falls in the
scope of this survey, the authors cover the
workow from the sources towards the target
data stores with entities like Mapping (among
entities) and Transformation (further classified into aggregations, filters, etc.) that are used
to trace the inter-concept relationships in this
workow environment. The overall approach
is a coherent, UML-based framework for data
warehouse metadata, defined at a high level
of abstraction. The main contribution of the
authors is that they provide a framework where
specialized ETL activities (e.g., aggregations,
cleanings, pivots, etc) can be plugged in easily
via some kind of specialization.
First Attempts towards a
Conceptual Model
The rst attempt towards a conceptual model
dedicated to the design and documentation of the
data warehouse refreshment was presented by
Vassiliadis, Simitsis & Skiadopoulos (DOLAP
2002). The main motivation for the model was
the observation that during the earliest stages
of the data warehouse design, the designer is
concerned with the analysis of the structure
and content of the existing data sources and
their mapping to the common data warehouse
model. Therefore, a formal model for this task
was necessary at the time.
The model of Vassiliadis et al (DOLAP
2002) involves concepts (standing for source
and warehouse data holders) and their attributes,
which dene their internal structure. Part-of
relationships correlate concepts and their
constituent attributes. Attributes of source and
warehouse concepts are related to each other
with provider relationships. If a transformation
takes place during the population of a target
attribute, then the provider relationship passes
through a transformation node in the model.
Multiple transformations are connected via
serial composition relationships. The model
allows the denition of ETL constraints that
signify the need for certain checks at the data
(e.g., a certain eld must be not null or within
a certain range value, etc). Finally, multiple
source candidates for the population of a ware-
house concept are also tracked via candidate
relationships. This is particularly useful for the
early stages of the design, where more than one
source can be chosen for the population of a
warehouse fact table. If one of them is eventu-
ally chosen (e.g., due to its higher data quality),
then this source concept is characterized as the
active candidate for the model.
A second observation of Vassiliadis et al
(DOLAP 2002) was that it is practically impos-
sible to forecast all the transformations and
cleanings that a designer might ever need. So,
instead of proposing a closed set of transforma-
tions, the authors discuss the extensibility of the
model with template transformations that are
dened by the designer.
UML Revisited for ETL Processes
Trujillo & Luján-Mora (2003) revisit the con-
ceptual modeling of ETL workflows from the
view point of UML with the basic argument
that the previous modeling by Vassiliadis et al
is done via an ad-hoc model. So, the authors
try to facilitate the modeling effort for ETL
workows with standard methods and they
employ UML for this purpose.
It is interesting that the authors employ
class diagrams and not activity diagrams for
their modeling. The participating entities are
UML packages; this is a powerful feature of the
model, since it allows the arbitrary nesting of
tasks. This nesting mechanism, quite common
to UML, alleviates the complexity of the model
of Vassiliadis et al., since it allows a gradual
zooming in and out of tasks at different levels
of detail. The main reason for dealing with class
diagrams is that the focus of the modeling is
on the interconnection of activities and data
stores and not on the actual sequence of steps
that each activity performs. Under this prism,
whenever an activity A1 populates an activity
A2, then A2 is connected to A1 with a dependency
association.
Then, Trujillo & Luján-Mora (2003) pro-
vide a short description for a set of commonly
encountered activities. Each such template
activity is graphically depicted as an icon. The
activities covered by the authors are: aggrega-
tion, conversion, logging, filtering, join, and
loading of data, as well as checks for incorrect
data, merging of data coming from different
sources, wrapping of various kinds of external
data sources and surrogate key assignment.
The authors build their approach on a ge-
neric structure for the design process for ETL
workows. So, they structure their generic
design process in six stages, specifically, (i)
source selection, (ii) source data transformation,
(iii) source data join, (iv) target selection, (v)
attribute mappings between source and target
data and (vi) data loading.
UML and Attribute Mappings
One of the main assumptions of the approach of
Trujillo & Luján-Mora (2003) was that the user
must not be overwhelmed with the multitude
of attribute mappings between sources, ETL
activities and target warehouse tables. Still, as
already mentioned, such detail is important for
the back-stage of the data warehouse. Without
capturing the details of the inter-attribute
mappings, important transformations, checks
and contingency actions are not present in
the documentation of the process and can be
ignored at the construction/generation of code.
Despite the effort needed, this documentation
can be useful at the early stages of the project,
where the designer is familiarized with the
internal structure and contents of the sources
(which include data quality problems, cryptic
codes, conventions made by the administrators
and the programmers of the sources and so on).
Moreover, this documentation can also be useful
during later stages of the project where the data
schemata as well as the ETL tasks evolve and
sensitive parts of the ETL workflow, both at the
data and the activities, need to be highlighted
and protected.
It is interesting that no standard formalism
like the ER model or the UML treats attributes
as rst-class citizens –and as such, they are un-
able to participate in relationships. Therefore,
Luján-Mora, Vassiliadis & Trujillo (2004) stress
the need to devise a mechanism for capturing
the relationships of attributes in a way that
is (a) as standard as possible and (b) allows
different levels of zooming, in order to avoid
overloading the designer with the large amount
of attribute relationships that are present in a
data warehouse setting.
To this end, the authors devise a mechanism
for capturing these relationships, via a UML
data mapping diagram. UML is employed
as a standard notation and its extensibility
mechanism is exploited, in order to provide a
standard model to the designers. Data mapping
diagrams treat relations as classes (like the
UML relational profile does). Attributes are
represented via proxy classes, connected to
the relation classes via stereotyped “Contain”
relationships. Attributes can be related to each
other via stereotyped “Map” relationships.
A particular point of emphasis made by
Luján-Mora et al. (2004) is the requirement
for multiple, complementary diagrams at dif-
ferent levels of detail. The authors propose four
different layers of data mappings, specifically, (a) the database level, where the involved databases are represented as UML packages, (b) the dataflow level, where the relationships among source and target relations are captured, each in a single UML package, (c) the table level, where the dataflow diagram is zoomed in and
each individual transformation is captured as
a package, and (d) the attribute level, which
offers a zoom-in to a table-level data mapping
diagram, with all the attributes and the individual
attribute level mappings captured.
State-of-the-Art at the Logical
Level
Apart from the conceptual modeling process
that constructs a first design of the ETL process,
once the process has been implemented, there
is a need to organize and document the meta-
information around it. The organization of the
metadata for the ETL process constitutes its
logical level description much like the system’s
catalog acts as the logical level description of
a relational database.
Davidson & Kosky (1999) present WOL,
a Horn-clause language, to specify transforma-
tions between complex types. The transforma-
tions are specied as rules in a Horn-clause
language. An interesting idea behind this ap-
proach is that a transformation of an element
can be decomposed to a set of rules for its ele-
ments, thus avoiding the difficulty of employing complex definitions.
As already mentioned, the first attempt towards a systematic description of the metadata of the ETL process goes back to the works
by Stöhr et al. (1999) and Vassiliadis, Quix,
Vassiliou & Jarke (2001). This research has
been complemented by the approach of Vas-
siliadis, Simitsis & Skiadopoulos (DMDW
2002) where a formal logical model for ETL processes is proposed. The main idea around this model concerns the capturing of the data flow
from the sources to the warehouse. This means
that meta-information concerning relations or
files and their schemata is kept, along with the
meta-information on the activities involved and
their semantics. Naturally, the interconnection
of all these components in a workflow graph that
implements the ETL process is also captured.
The proposal of Vassiliadis et al (DMDW 2002)
models the ETL workow as a graph. The nodes
of the graph are activities, recordsets (uniformly
modeling les and relations) and attributes;
note that recordsets have a schema in the typi-
cal way and activities have input schemata, as
well as an output and a rejection schema (the
latter serving the routing of offending rows
to quarantine). Part-of edges connect the at-
tributes with their encompassing nodes and
provider relationships show how the values are
propagated among schemata (in other words,
provider relationships trace the dependencies
in terms of data provision between one or more
data provider attributes and a populated, data
consumer attribute). Derived relationships also
capture the transitive dependency of attributes
that are populated with values that are computed
via functions and parameters. An ETL workflow
is a serializable combination of ETL activities,
provider relationships and data stores. The
overall modeling of the environment is called
Architecture Graph; in other words, an archi-
tecture graph is the logical level description of
the data ow of an ETL process.
The above-mentioned model suffered from the lack of concrete modeling of the semantics
of the activities as part of the graph. In other
words, the provider relationships capture only
the dependencies of a consumer attribute to its
data providers, without incorporating the actual
filterings and transformations that take place on
the way from the provider to the consumer. This
shortcoming was complemented by the work of
Vassiliadis, Simitsis, Georgantas, Terrovitis, &
Skiadopoulos (CAiSE 2003, IS 2005, DaWaK
2005, ER 2005). Due to the complicated nature
of the internal semantics of the activities, the
fundamental idea of these papers is to describe
the meta-information via a series of intermediate
schemata. Coarsely speaking, the semantics of
the activity are related to this graph of internal,
intermediate schemata via a simple convention
that a schema corresponds to a predicate in a rule-based language. LDL, a Datalog variant, is the chosen language for this line of papers.
Then, each rule of the form
OUTPUT <- INPUT, filters, functions, input-to-output mappings
practically stands for a combination of inputs,
outputs, comparison nodes and functions that
connect the respective schemata via provider
edges (see the long version of Vassiliadis et al,
ER 2005 for a detailed description).
The works by Vassiliadis et al. (CAiSE
2003, IS 2005) also present a template language
that allows ETL designers to define reusable
templates of LDL programs via the appropriate
macros (see also the discussion in the section
“Systems” for the ARKTOS tool).
From the very beginning, this line of
work was concerned with the exploitation of
the meta-information for ETL processes. The
papers by Vassiliadis et al. (DMDW 2002;
ER 2005) are concerned with metrics for the
identication of important properties of an ETL
design. The metrics proposed by Vassiliadis et
al. (DMDW 2002) are simple but quite power-
ful and capture the degree of dependence of a
node to other nodes and vice versa. The metrics
proposed by Vassiliadis et al. (ER 2005) are
based on a rigorous framework for metrics of
graph-based software constructs and show the
size, cohesion, coupling, and complexity of a
constructed ETL process.
Semantics-aware Design Methods
for ETL
A semi-automatic transition from the conceptual
to the logical model for ETL processes has been
proposed rst by Simitsis (2005) and later by
Simitsis & Vassiliadis (DSS 2008). Simple
mappings involve the mapping of concepts to
relations, and the mapping of transformations
and constraint checks to the respective activi-
ties. The hardest problem that is dealt with is
the mapping of the declarative requirements for
transformations and checks at the conceptual
level to a sequence of activities at the logical
level. The paper proposes an algorithm that
groups transformations and checks in stages,
with each stage being a set of activities whose
execution order can be transposed. Stage deriva-
tion is based on the idea of dependency: if the
input attributes of a certain activity a depend
on the (output) attributes of another activity or
recordset b, then a should be on a higher order
stage than b. Binary activities and persistent
record sets typically act as boundaries for stages.
Since the presented algorithms start with sim-
ple cases of single source – single target pairs,
binary activities are used as pivotal points of the
scenario for the derivation of common sub-paths
(after the output of the binary activity). Overall,
since the ordering of stages is quite simple, the
determination of the execution orders for all the
activities is greatly simplified.
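As a rough illustration of the staging idea only (not the authors' actual algorithm), the sketch below assigns each node a stage one higher than the deepest node it depends on; the encoding of the dependency graph as a depends_on dictionary is an assumption made here.

def assign_stages(nodes, depends_on):
    # depends_on: node -> list of activities/recordsets whose output attributes it consumes
    stages = {}
    def stage(n):
        if n not in stages:
            preds = depends_on.get(n, [])
            stages[n] = 1 + max((stage(p) for p in preds), default=0)
        return stages[n]
    for n in nodes:
        stage(n)
    return stages

# two activities assigned to the same stage have no dependency path between
# them, so their execution order can be transposed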
Skoutas & Simitsis (DOLAP 2006,
IJSWIS 2007) use ontologies to construct the
conceptual model of an ETL process. Based
on domain knowledge of the designer and user
requirements about the data sources and the
data warehouse, an appropriate OWL ontol-
ogy is constructed and used to annotate the
schemas of these data stores. Then, according
to the ontology and these annotations, an OWL
reasoner is employed to infer correspondences
and conicts between the sources and the tar-
get, and to propose conceptual ETL operations
for transferring data between them. Skoutas
and Simitsis (NLDB 2007) use ontologies to
document the requirements of an ETL process.
The proposed method takes a formal, OWL
description of the semantic descriptions of (a)
the data sources, (b) the data warehouse, as
well as (c) the conceptual specification of an
ETL process and translates them to a textual
format that resembles natural language. This
is facilitated via a template-based approach for
the constructs of the formal description.
INDIVIDUAL OPERATORS AND
TASKS IN ETL SCENARIOS
In this section, we organize the review
of the literature based on the constituents of
the ETL process, specifically, the extraction, transformation (& cleaning), and loading phases.
For each phase, we discuss practical problems
and solutions proposed by the research com-
munity.
Research Efforts Concerning Data
Extraction Tasks
The extraction is the hardest part of the refresh-
ment of the data warehouse. This is due to two
facts. First, the extraction software must impose minimum overhead on the source system both at runtime and during the nightly time window that
is dedicated to the refreshment of the ware-
house. Second, the extraction software must be
installed at the source side with minimum effect
to the source's software configuration. Typical
techniques for the task include taking the dif-
ference of consecutive snapshots, the usage of
timestamps (acting as transaction time) in source
relations, or the "replaying" of the source's log file
at the warehouse side. Non-traditional, rarely
used techniques require the modification of the
source applications to inform the warehouse on
the performed alterations at the source, or, the
usage of triggers at the source side.
Snapshot difference between a newer and
an older snapshot of a relation seems straightfor-
ward: In principle, the identification of records
that are present in one snapshot and not in the
other gives us the insertions and deletions
performed; updates refer to two tuples that
are present in the two snapshots and share the
same primary key, but different non-key values.
Despite this theoretical simplicity, performance
considerations are very important and pose a
research problem.
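For illustration, a minimal in-memory sketch of this comparison follows, assuming both snapshots fit in memory and that each row is a dictionary carrying its primary key under a known field name (the dictionary representation and field name are assumptions):

def snapshot_diff(old_rows, new_rows, key="key"):
    old = {r[key]: r for r in old_rows}
    new = {r[key]: r for r in new_rows}
    inserts = [new[k] for k in new.keys() - old.keys()]
    deletes = [old[k] for k in old.keys() - new.keys()]
    updates = [new[k] for k in new.keys() & old.keys() if new[k] != old[k]]
    return inserts, deletes, updates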
The research community has dealt with
the problem from the mid ‘80s. Lindsay, Haas,
Mohan, Pirahesh, & Wilms (1986) propose a
timestamp based algorithm for detecting inser-
tions, deletions and updates in two snapshots.
A simple algorithm annotating changes with
timestamps and comparing the values for tuples
changed after the last comparison is proposed.
Improvements concerning the identification
of “empty areas” in the table speed up the
process further. Labio & Garcia-Molina (1996)
have presented the state-of-the-art method on the topic of snapshot difference. The paper
assumes two snapshots as input and produces
a sequence of insertion, deletion and update
actions to tuples, with each tuple being identi-
fied by its key. Several algorithms are discussed
in the paper, including sort-merge outerjoins
and partitioned hash joins. These algorithms
can be extended with compression techniques
in order to reduce the size of the processed
data and the incurred I/Os as well as in order
to exploit the opportunity of fitting one of the
two snapshots in main memory. Compression
is applied to each tuple individually (or to a
whole block) and interestingly, the employed
compression function is orthogonal to the al-
gorithm of choice, with the extra observation
that lossy compression methods might introduce
erroneously identied modications with very
small probability. Overall, the investigated
compression techniques include two methods,
(a) simple tuple compression and (b) simple
tuple compression with a pointer to the original
uncompressed record. Then, the authors pro-
pose several algorithms for the identification of modifications. The first algorithm performs
a sort-merge outer-join. Due to the anticipated
periodic execution of the algorithm it is safe to
assume that the previous snapshot is sorted. The
algorithm sorts the new snapshot into runs and
proceeds as the typical sort-merge join with the
extra fundamental checks that are mentioned
above, concerning the characterization of a tuple
as an insertion (deletion), update, or existing
entry. A variant of the algorithm involves the
usage of compressed snapshot. In this case, the
pointer to the original tuple can help with the
identication of updates. Another variant of the
algorithm concerning the hash join is also dis-
cussed. The main proposal of the paper, though,
is the so-called window algorithm, which ex-
ploits the idea that the tuples that are present in
both snapshots are found in approximately the
same place in the two snapshots. The algorithm
uses an input buffer and a buffer for candidate
modied tuples (ageing buffer in the paper’s
terminology) per snapshot. The algorithm lls
the input buffers and compares their contents.
Tuples found identical in the two buffers are no
longer considered. Misses are tested over the
ageing buffer of the other snapshot. If a match
occurs, the tuples are further ignored. The tuples
that remain in the input buffers after these tests
are candidates to be insertions or deletions and
they are pushed to the ageing buffer of their
snapshot. Due to space requirements, if an
ageing buffer is full, it must be emptied. To this
end, a queue of pointers is maintained, keeping
track of the order in which the tuples entered
the buffer; if the buffer is full, the oldest tuples
are emptied. The algorithm is quite efficient and
safe if the same records are physically placed
in nearby areas in the two snapshots and the
experimental results have proved that this is a
realistic assumption.
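A much-simplified, in-memory sketch of the window idea is given below; it is not the authors' algorithm. Here key is a function extracting the primary key, both snapshots are read in lockstep, and dictionaries approximate the ageing buffers; as in the original, an evicted tuple may be mis-reported as an insert/delete pair if its match lies farther away than the buffer allows.

from collections import OrderedDict
from itertools import zip_longest

def window_diff(old_rows, new_rows, key, buffer_size=1000):
    old_ageing, new_ageing = OrderedDict(), OrderedDict()   # key -> unmatched row
    changes = []

    def consume(row, own_buf, other_buf, role):
        k = key(row)
        if k in other_buf:                                   # match found on the other side
            other = other_buf.pop(k)
            old_row, new_row = (row, other) if role == "old" else (other, row)
            if old_row != new_row:
                changes.append(("update", new_row))
            return
        own_buf[k] = row                                     # park as candidate delete/insert
        if len(own_buf) > buffer_size:                       # evict the oldest candidate
            _, evicted = own_buf.popitem(last=False)
            changes.append(("delete" if role == "old" else "insert", evicted))

    for old_row, new_row in zip_longest(old_rows, new_rows):
        if old_row is not None:
            consume(old_row, old_ageing, new_ageing, "old")
        if new_row is not None:
            consume(new_row, new_ageing, old_ageing, "new")

    changes.extend(("delete", r) for r in old_ageing.values())
    changes.extend(("insert", r) for r in new_ageing.values())
    return changes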
Research Efforts Concerning Data
Transformation Tasks
Although naïve data transformations are inher-
ently built inside SQL and relational algebra,
transformations that are used in ETL scenarios
have not really found their way in the research
literature. The main reason for this is probably
due to the fact that transformations of this kind
are typically ad-hoc and rather straightforward if
studied individually; the complexity arises when
their combination in a workow is introduced.
Nevertheless, there are some approaches that
try to deal with the way to transform input to
output data efciently, and elegantly in terms
of semantics.
The Pivoting Problem
Pivoting refers to a common spreadsheet opera-
tion for the tabular presentation of data to the
end user, which is also quite common in ETL
processes. Assume that a user wants to repre-
sent information about employees’ revenues,
with the revenues of each employee split into
categories like salary, tax, work bonus, and
family bonus. There are two possible organiza-
tions for this kind of data. First, a row-oriented
organization is based on a table with the above
categories represented as attributes -- e.g., con-
sider a relation of the form EMPr(EID, ESAL,
ETAX, EWBonus, EFBonus). A second, attri-
bute-value representation splits the revenues
of each employee into different rows – e.g.,
consider a relation of the form EMPav(EID,
RID, Amount) with RID being a foreign key to a
table REVENUE_CATEGORIES(RID, RDescr)
that contains values {(10, Sal), (20,Tax), …}.
Pivoting is the operation that transforms relation
EMPav to EMPr; unpivot is the inverse opera-
tion. Cunningham, Galindo-Legaria & Graefe
(2004) discuss efficient ways to implement
these two operations in a DBMS environment.
The authors start by suggesting compact exten-
sions of SQL to allow the end user to directly
express the fact that this is the requested op-
eration (as opposed to asking the user to perform
pivot/unpivot through complicated expressions
that the optimizer will fail to recognize as one
operation). Special treatment is taken for data
collisions and NULL values. The authors build
upon the fact that pivot is a special case of ag-
gregation, whereas unpivot is a special case of
a correlated nested loop self join and explore
query processing and execution strategies, as
well as query optimization transformations,
including the possibilities of pushing projec-
tions and selections down in execution plans
that directly involve them.
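A minimal sketch of the two operations on the employee example above (pandas is used purely for illustration, and the numeric values are made up):

import pandas as pd

# attribute-value representation (EMPav joined with the category descriptions)
emp_av = pd.DataFrame({
    "EID":    [1, 1, 2, 2],
    "RDescr": ["Sal", "Tax", "Sal", "Tax"],
    "Amount": [3000, 700, 2800, 650],
})

# pivot: one row per employee, one column per revenue category (the EMPr layout)
emp_r = emp_av.pivot(index="EID", columns="RDescr", values="Amount").reset_index()

# unpivot (the inverse operation): back to one row per (employee, category)
emp_av_back = emp_r.melt(id_vars="EID", var_name="RDescr", value_name="Amount").dropna()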
Data Mappers
A sequence of papers by Carreira et al. (DaWaK
2005, DKE 2007, ICEIS 2007) explore the
possibilities imposed by data transformations
that require one-to-many mappings, i.e., trans-
formations that produce several output tuples
for each input tuple. This kind of operations is
typically encountered in ETL scenarios. Since
relational algebra is not equipped with an op-
erator that performs this kind of input-output
mapping, Carreira et al. extend it by proposing
a new operator called data mapper and explore
its semantics and properties. In this discussion,
we mainly focus on the work of Carreira et
al (DKE, 2007) which is a long version of a
previous work (DaWaK 2005). The authors
dene the data mapper operator as a comput-
able function mapping the space of values of an
input schema to the space of values of an output
schema. Mappers are characterized as single or
multi-value mappers, depending on whether
one or more tuples occur at the output given
an arbitrary tuple at the input of the operator.
Mappers under investigation are minimalistic
operators and should be highly cohesive (i.e.,
they should do exactly one job); to this end,
a mapper is dened to be in normal form if it
cannot be expressed as the composition of two
other mappers. Carreira et al. propose algebraic
rewritings to speed up the execution of compos-
ite expressions of the extended relational algebra. Specifically, a data mapper can be combined with an (originally subsequent) selection condi-
tion, if the condition operates on a parameter of
a mapper function. A second rule directs how
a selection condition that uses attributes of the
output of a data mapper can be translated to
the attributes that generate them and thus be
pushed through the mapper. A third rule deals
with how projection can help avoid unnecessary
computations of mappers that will be subse-
quently projected-out later. Finally, Carreira
et al (ICEIS 2007) report some first results on
their experimentation with implementing the
data mappers in a real RDBMS. A more detailed
description of alternative implementations is
given by Carreira et al (QDB 2007) concern-
ing unions, recursive queries, table functions,
stored procedures and pivoting operations as
candidates for the implementation of the data
mapper operator. The first four alternatives were used for experimentation in two different
DBMSs and table functions appear to provide
the highest throughput for data mappers.
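A minimal sketch of a one-to-many mapper as a higher-order operator follows; the phone-list example is hypothetical and not taken from the cited papers.

def data_mapper(rows, f):
    # f maps one input tuple to zero, one, or many output tuples
    for row in rows:
        yield from f(row)

# example: split a comma-separated list of phones into one output row per phone
rows = [{"id": 1, "phones": "555-1234,555-9876"}, {"id": 2, "phones": ""}]
explode = lambda r: ({"id": r["id"], "phone": p} for p in r["phones"].split(",") if p)
print(list(data_mapper(rows, explode)))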
The Data Lineage Problem
Apart from efciently performing the transfor-
mations in an ETL process, it is also important to
be able to deal with the inverse problem. Given
a certain warehouse tuple (or, set of tuples), the
data lineage problem involves the identification of the input data that are the originators of
the given tuple (or, tuples).
Cui and Widom have investigated the prob-
lem of data lineage in a set of papers (Cui &
Widom, 2001 & 2003). Here, we mostly focus on
Cui & Widom (2003) that describes the authors’
method in more detail. This approach treats
each transformation as a procedure that can be
applied to one or more datasets and produces
one or more datasets as output. Cui and Widom
present a set of properties that a transformation
can have, specically (a) stability, if it never
produces datasets as output without taking
any datasets as input, (b) determinism and (c)
completeness, if each input data item always
contributes to some output data item. The au-
thors assume that all transformations employed
in their work are stable and deterministic and
proceed to dene three major transformation
classes of interest: dispatchers, aggregators
and black-boxes.
A transformation is a dispatcher, if each
input data item produces zero or more out-
put items (with each output item possibly
being derived by more than one input data
items). A special category of dispatchers are
filters. A dispatcher is a filter if each input
item produces either itself or nothing.
A transformation is an aggregator, if it is
complete and there exists a unique disjoint
partition of the input data set that contrib-
utes to some output data item. An aggre-
gator is context-free if the output group to
which an input data item is mapped can be
found by observing the value of this data
item alone, independently of the rest of the
data items of the input. An aggregator is
key preserving if all the input originators of
an output tuple have the same key value.
A transformation is a black-box, if it is
neither a dispatcher nor an aggregator.
Several other sub-categories concerning the
mapping of input to output values on the basis
of the keys of the input and the output are also
dened. The main idea is that if a certain value
of an output (input) tuple can directly lead to its
respective input (output) tuple(s), then this can
be exploited during lineage determination. For
example, backward key transformations have
the property that given an output tuple, one can
determine the key of the input tuples that pro-
duced it; in this case, lineage is straightforward.
In the general case, the lineage determination for
dispatchers requires one pass of the input; the
lineage determination of aggregators requires
several full scans of the input, whereas black
boxes have the entire input as their lineage.
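The sketch below illustrates, only for the two easy classes and with dictionaries standing in for tuples, why filters and context-free aggregators admit cheap lineage tracing; the function names are ours, not Cui and Widom's.

def filter_lineage(output_row, input_rows):
    # a filter emits an input item unchanged or not at all,
    # so the lineage of an output row is the identical input row(s)
    return [r for r in input_rows if r == output_row]

def aggregator_lineage(output_row, input_rows, grouping_attrs):
    # context-free aggregator: the group of an input row is determined by the row
    # alone, so one scan over the input collects the originators of an output row
    group = tuple(output_row[a] for a in grouping_attrs)
    return [r for r in input_rows if tuple(r[a] for a in grouping_attrs) == group]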
Given a sequence of transformations in an
ETL scenario, it is necessary to keep a set of
intermediate results between transformations,
in order to be able to determine the data lin-
eage of the output. Appropriate indexes may
be used to relate each output to its input. To
avoid storing all the intermediate results, the
authors propose a normalization process that
allows the reordering of the transformations in
such a way that adjacent transformations share
similar properties. Assuming two adjacent
transformations a and b, the idea is that their
grouping is benecial if one can determine the
lineage of a tuple in the output of b at the input
of a, without storing any intermediate results.
To this end, Cui and Widom propose a greedy
algorithm, called Normalize, which repeatedly
discovers benecial combinations of adjacent
transformations and combines the best pair of
transformations.
Theoretical Foundations for ETL
Processes
The theoretical underpinnings concerning the
internal complexity of simple ETL transforma-
tions are investigated by research that concerns
the data exchange problem. Assume a source
and a target schema, a set of mappings that
specify the relationship between the source and
the target schema and a set of constraints at the
target schema. The problem of data exchange
studies whether we can materialize an instance
at the target, given an instance at the source,
such that all mappings and constraints are re-
spected. Data exchange is a specific variant of the general data integration meta-problem that requires the materialization of the result at the
target, with the ultimate purpose of being able
to subsequently pose queries to it, without the
luxury of referring back to the source. Due to
this inherent characteristic of the problem, we
believe that the data exchange is the closest
theoretical problem to ETL that we know of.
As Fagin, Kolaitis & Popa (2005) mention,
several problems arise around the data exchange
problem, with particular emphasis on (a) ma-
terializing the solution that reflects the source data as accurately as possible and (b) doing so in an efficient manner. A first important result
comes in the identification of the best possible
solutions to the problem, called universal solu-
tions (Fagin, Kolaitis, Miller, & Popa, 2005).
Universal solutions have the good property of
having exactly the data needed for the data ex-
change and can be computed in polynomial time
via the chase. It is interesting that if a solution
exists, then a universal solution exists, too, and
every other solution has a homomorphism to it.
At the same time, Fagin et al (TODS 2005) deal
with the problem of more than one universal
solution for a problem and introduce the no-
tion of a core, which is the universal solution
of the smallest size. For practical cases of data
exchanges, polynomial time algorithms can be
used for the identification of the core. It is worth
noting that the schema mappings explored are
investigated for the case of tuple-generating
dependencies, which assure that for each source
tuple x, whenever a conjunctive formula applies
to it, then there exists a target tuple y, such that
a conjunctive formula over x and y applies too
(in other words, we can assure that mappings
and constraints are respected). From our point of view, this is a starting point for the investiga-
tion of more complex mappings that arise in
ETL settings.
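For reference, a source-to-target tuple-generating dependency of the kind mentioned above is commonly written as follows (standard notation, not specific to the cited papers):

\forall \bar{x}\, \big( \varphi_S(\bar{x}) \;\rightarrow\; \exists \bar{y}\; \psi_T(\bar{x}, \bar{y}) \big)

where \varphi_S is a conjunction of atoms over the source schema and \psi_T a conjunction of atoms over the target schema.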
Further results have to do with the abil-
ity to pose queries and obtain certain answers
from a data exchange setting as well as with the
management of inverse mappings relationships
(Fagin, 2007).
Data Cleaning
The area of data cleaning, although inherently
related to the ETL process, practically consti-
tutes a different field on its own. Covering this field adequately is well beyond the scope of
this paper. The topic that has been in the center
of attention of the research community in the
area of data cleaning concerns record matching,
with a particular emphasis on textual attributes.
Record matching refers to the problem of iden-
tifying records that represent the same object
/ fact in the real world, with different values.
Essentially, it is the case of textual fields that
presents the major research challenge, since
their unstructured nature allows users to per-
form data entry in arbitrary ways. Moreover,
value discrepancies due to problems in the data
entry or data processing, as well as different
snapshots of the representation of the real world
fact in the database contribute to making the
problem harder. Typically, the identification of duplicates requires an efficient algorithm for
deciding which tuples are to be compared and
a distance (or, similarity) metric based on the
values of the compared tuples (and possibly,
some knowledge by an expert).
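As a toy illustration of the two ingredients just mentioned (cheap candidate selection plus a similarity metric), and not of any specific technique from the surveys cited below:

from difflib import SequenceMatcher

def candidate_pairs(records, block_key):
    # blocking: only records that share a block key are ever compared
    blocks = {}
    for r in records:
        blocks.setdefault(block_key(r), []).append(r)
    for group in blocks.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                yield group[i], group[j]

def probable_duplicates(records, block_key, field, threshold=0.85):
    # report pairs whose textual field is similar enough under a string metric
    for a, b in candidate_pairs(records, block_key):
        if SequenceMatcher(None, a[field], b[field]).ratio() >= threshold:
            yield a, b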
For recent, excellent surveys of the field,
the reader is encouraged to rst refer to a survey
by Elmagarmid, Ipeirotis & Verykios (2007)
as well as to a couple of tutorials by Koudas
& Srivastava (2005) and Koudas, Sarawagi &
Srivastava (2006) in the recent past.
Research Efforts Concerning Data
Loading Tasks
Typically, warehouse loading tasks take place
in a periodic fashion, within a particular time
window during which the system is dedicated
to this purpose. Bulk loading is performed (a)
during the very first construction of the ware-
house and (b) in an incremental way, during its
everyday maintenance. During the latter task, a
set of new insertions, deletions and updates ar-
rive at the warehouse, after being identied at the
extraction phase and, subsequently transformed
and cleaned. This set of data is typically called
the ‘delta’ relation and has to be loaded to the
appropriate destination table. In all occasions,
loading is performed via vendor-specific bulk
loaders that provide maximum performance (see
Vassiliadis & Simitsis, 2009 for a discussion on
the topic). Also, apart from the dimension tables
and the fact tables, indexes and materialized
views must also be maintained.
It is interesting that only the loading of
views and indexes has been investigated by
the research community; in this subsection,
we give an overview of important references
in the bibliography.
Maintenance of Indexes
Concerning the general setting of index main-
tenance, Fenk, Kawakami, Markl, Bayer &
Osaki (2000) highlight the critical parameters
of index bulk loading that include the minimi-
zation of random disk accesses, disk I/O, CPU
load, and the optimization of data clustering and
page lling. As typically happens with spatial
indexes, Fenk et al. identify three stages in the
bulk loading of a multidimensional warehouse
index: (a) key calculation for each tuple of the
data set, (b) sorting of the tuples on the basis
of their keys and (c) loading of the sorted data
into the index.
Roussopoulos, Kotidis & Roussopoulos
(1997) discuss the efficient construction and
bulk load of cube trees. Cube trees and their
variants are based on the fundamental idea of
nding an efcient data structure for all the
possible aggregations of these data. Assume
a relation with M attributes, out of which N,
N<M, can act as grouping attributes for ag-
gregate queries. The idea behind cube trees is
the mapping of all the data of all these possible
aggregations to a single index that can efficiently
answer aggregate range queries. The packing of
the empty space is crucial for the compression
and efcient query answering of the cube tree.
Roussopoulos et al (1997) discuss the bulk
loading and maintenance of cube trees that are
implemented over packed R-trees. The authors
make the sharp observation that individual
updates are practically impossible in terms of
performance and thus, bulk updates are neces-
sary. The approach is based on sorting the delta
increment that is to be loaded and merging it
with the existing cube tree. Since all possible
aggregates are kept by cube trees, for every delta
tuple, all its possible projections are computed
and stored in the appropriate buffer. When a
buffer is full, it is sorted and then it is staged
for the subsequent merge that takes place once
all the delta increment has been processed. The
authors also highlight that as time passes, the
points of all the possible cubes of the multidi-
mensional space are covered with some values,
which makes the incremental update even
easier. Moreover, since the merging involves
three parts (i) the old cube tree, (ii) the delta
and (iii) the new cube tree, their merging can be
efciently obtained by using three disks, one for
each part. Roussopoulos, Kotidis & Sismanis
(1999) discuss this possibility.
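A tiny sketch of the projection step described above, i.e., the group-by combinations that a single delta tuple touches (the dictionary representation of tuples is an assumption):

from itertools import combinations

def delta_projections(delta_tuple, grouping_attrs):
    # every subset of the grouping attributes corresponds to one aggregate
    # that the delta tuple contributes to
    for k in range(len(grouping_attrs) + 1):
        for subset in combinations(grouping_attrs, k):
            yield {a: delta_tuple[a] for a in subset}

# e.g. a delta tuple over (PartKey, SuppKey) touches the aggregates grouped by
# {}, {PartKey}, {SuppKey} and {PartKey, SuppKey}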
Fenk et al. (2000) propose two bulk loading
algorithms for the UB-Tree, one for the initial
and one for the incremental bulk loading of a
UB-tree. The UB-Tree is a multidimensional
index, which is used along with specialized
query algorithms for the sequential access to
the stored data. The UB-Tree clusters data ac-
cording to a space-filling Z-curve. Each point of
the multidimensional space is characterized by
its Z-address and Z-addresses are organized in
disjoint Z-regions mapped to the appropriate
disk pages. This way, a tree can be formed and
the location of specic points at the disk can
be computed. Concerning the abovementioned
generic method for bulk loading that Fenk et al
have highlighted, in the case of UB-trees, the
key is computed by taking the primary key of a
tuple and calculating its Z-value and the sorting
is performed with external merge sorting. To
achieve high page filling, the construction of
the UB-tree uses the idea of organizing data in
buffers twice as large as disk pages; these buffers
are split in two and one of the two halves is stored
as a UB-tree page. The incremental variant of
the algorithm tries to identify the correct page
for inserting a tuple under consideration.
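For concreteness, a Z-address can be obtained by bit-interleaving the (integer-encoded) key attributes; this is the standard Morton encoding, sketched here rather than the UB-tree implementation itself:

def z_address(coords, bits=16):
    # interleave the bits of the coordinates so that nearby points in the
    # multidimensional space tend to receive nearby addresses on the Z-curve
    z = 0
    for bit in range(bits):
        for dim, c in enumerate(coords):
            z |= ((c >> bit) & 1) << (bit * len(coords) + dim)
    return z

# tuples are then sorted by their Z-address (e.g., with external merge sort)
# before being packed into pages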
Dwarfs (Sismanis, Deligiannakis, Rous-
sopoulos & Kotidis, 2002) are indexes built by
exploiting an appropriate ordering of the dimen-
sions and the exploitation of common prefixes and suffixes in the data. A dwarf resembles a
trie, with one level for each dimension of a cube
and appropriate pointers among internal levels.
High cardinality dimensions are placed highly
in the Dwarf, to quickly reduce the branching
factor. Dwarf construction exploits a single,
appropriate sorting of data and proceeds in a
top-down fashion. Identical sub-dwarfs (i.e.,
dwarfs generated from the same set of tuples
of the fact table) are pinpointed and coalesced.
Dwarfs constitute the state-of-the-art, both in
querying and in updating sets of aggregate
views and, when size and update time are
considered, they significantly outperform other
approaches.
Materialized View Maintenance
View maintenance is a vast area by itself;
covering the topic to the full of its extent is
far beyond the focus of this paper. The main
idea around view maintenance is developed
around the following setting: Assume a set of
base tables upon which a materialized view is
dened via a query. Changes (i.e., insertions,
deletions and updates) occur at the base tables
resulting in the need to refresh the contents of
the materialized view (a) correctly and (b) as
efciently as possible. The parameters of the
problem vary and include:
a. The query class: The area started by deal-
ing with Select-Project-Join views, but the
need for aggregate views, views with nested
queries in their definition, outerjoins and other more complicated components has significantly extended the field. The query
class (along with several assumptions) can
also determine whether the materialized
view can be updated solely with the changes
and its current state, without accessing the
base tables.
b. The nature of the changes: apart from
simple insertions and deletions, it is an
issue whether updates are treated per se
or as a pair (delete old value, insert new
value). Also, sets of updates as opposed to
individual updates can be considered.
c. The number of materialized views that
are concurrently being updated: in the
context of data warehousing, and ETL in
particular, simply dealing with one view
being updated is too simplistic. Typically,
several materialized views (possibly related
in some hierarchical fashion, where one
view can be derived from the other) have
to be simultaneously refreshed.
d. The way the update is physically imple-
mented: Typically, there are three ways
to refresh views, (a) on-update, i.e., the
instant a change occurs at the sources, (b)
on-demand, i.e., in a deferred way, only
when someone poses a query to the view
and (c), periodically, which is the typical
case for data warehousing, so far.
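As an illustration of the deferred/periodic case for a simple aggregate view (SUM and COUNT per group), the sketch below applies a batch of deltas to a dictionary standing in for the materialized view; the names and the delta encoding are assumptions, not a method from the papers cited next.

def refresh_sum_view(view, deltas, grouping_attrs, measure):
    # view: group tuple -> (sum, count); deltas: iterable of (op, row), op in {"+", "-"}
    for op, row in deltas:
        g = tuple(row[a] for a in grouping_attrs)
        s, c = view.get(g, (0, 0))
        sign = 1 if op == "+" else -1
        s, c = s + sign * row[measure], c + sign
        if c == 0:
            view.pop(g, None)          # the group disappears from the view
        else:
            view[g] = (s, c)
    return view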
There are several surveys that give point-
ers for further reading. The earliest of them
was authored by Gupta & Mumick (1995).
One can also refer to a note by Roussopoulos
(1998) and a chapter by Kotidis (2002). Gupta
& Mumick (2006) is a recent paper that dis-
cusses maintenance issues for a complicated
class of views. The reader is also referred to
papers by Mumick, Quass & Mumick (1997),
Colby, Kawaguchi, Lieuwen, Mumick & Ross
(1997), Labio, Yerneni & Garcia-Molina (1999),
and Stanoi, Agrawal & El Abbadi (1999), for
the update of groups of views in the presence
of updates.
HOLISTIC APPROACHES
In the previous section, we have dealt with
operators facilitating tasks that are located in
isolation in one of the three main areas of the ETL
process. In this section, we take one step back
and give a holistic view of research problems that pertain to the entirety of the ETL process. First, we discuss the problems of optimization and resumption of entire ETL workflows. Second, we visit the practical problem of the lack of a reference benchmark for ETL processes. Finally, we cover the newest topic of research in the area, concerning the near real time refreshment of a data warehouse.
Optimization
The minimization of the execution time of an ETL process is of particular importance, since ETL processes have to complete their task within specific time windows. Moreover, in the unfortunate case where a failure occurs during the execution of an ETL process, there must be enough time left for the resumption of the workflow. Traditional optimization methods are not necessarily applicable to ETL scenarios. As mentioned by Tziovara, Vassiliadis & Simitsis (2007), "ETL workflows are NOT big queries: their structure is not a left-deep or bushy tree, black box functions are employed, there is a considerable amount of savepoints to aid faster resumption in cases of failures, and different servers and environments are possibly involved. Moreover, frequently, the objective is to meet specific time constraints with respect to both regular operation and recovery (rather than the best possible throughput)." For all these reasons, the optimization of the execution of an ETL process poses an important research problem with straightforward practical implications.
Simitsis, Vassiliadis & Sellis (ICDE 2005,
TKDE 2005) handle the problem of ETL op-
timization as a state-space problem. Given an
original ETL scenario provided by the ware-
house designer, the goal of the paper is to nd
a scenario which is equivalent to the original
and has the best execution time possible. Each
state is a directed acyclic graph with relations
and activities as the nodes and data provider
relationships as edges. The authors propose a
method that produces states that are equivalent
to the original one (i.e., given the same input,
they produce the same output) via transitions
from one state to another. A transition involves
a restructuring of the graph mainly in one of
the following ways:
•	swapping of two consecutive activities, if this is feasible, with the goal of bringing highly selective activities towards the beginning of the process (in a manner very similar to the respective query optimization heuristic in relational DBMSs),
•	factorization of common (or, in the paper's terminology, homologous) activities that appear in different paths of the workflow ending in a binary transformation, with the goal of applying a transformation only once, later in the workflow, possibly to fewer or sorted data,
•	distribution of common activities (the inverse of factorization) by pushing a transformation that appears late in the workflow towards its start, with the hope that a highly selective activity found after a binary transformation is pushed early enough in the workflow.
Two other transitions, merge and split, are used for special cases. The important problem behind the proposed transitions is that a restructuring of the graph is not always possible. For example, it is important to block the swapping of activities where the operation of the second requires an attribute computed by the first. At the same time, the detection of homologous activities requires the identification of activities with common functionality over data with similar semantics. An ontological mapping of attributes to a common conceptual space facilitates this detection.
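The following toy sketch, which is my own illustration rather than the authors' algorithm, captures the essence of the swap transition together with the feasibility check just described: two consecutive unary activities may be swapped only if the later one does not require an attribute computed by the earlier one, and swapping is worthwhile when it pushes the more selective activity towards the start of the workflow.

```python
# Illustrative swap heuristic over a linear chain of activities; field names
# ('needs', 'adds', 'selectivity') are assumptions made for this sketch.

def can_swap(a, b):
    """a precedes b; swapping is legal if b does not need an attribute added by a."""
    return not (b["needs"] & a["adds"])

def greedy_push_selective(chain):
    """Repeatedly swap adjacent activities when legal and beneficial."""
    changed = True
    while changed:
        changed = False
        for i in range(len(chain) - 1):
            a, b = chain[i], chain[i + 1]
            if b["selectivity"] < a["selectivity"] and can_swap(a, b):
                chain[i], chain[i + 1] = b, a
                changed = True
    return chain

chain = [
    {"name": "add_surrogate_key", "needs": {"id"}, "adds": {"skey"}, "selectivity": 1.0},
    {"name": "filter_invalid_date", "needs": {"date"}, "adds": set(), "selectivity": 0.2},
]
print([x["name"] for x in greedy_push_selective(chain)])
# ['filter_invalid_date', 'add_surrogate_key']
```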
The paper uses a very simple cost model to assess the execution cost of a state and presents three algorithms to detect the best possible state. Apart from the exhaustive algorithm that explores the full search space, heuristics are also employed to reduce the search space and speed up the process.
The work of Simitsis et al (ICDE 2005,
TKDE 2005) for the optimization of an ETL
process at the logical level was complemented
with a paper on the optimization of ETL workflows at the physical level by Tziovara et al (2007). In this paper, the authors propose a method that produces an optimal physical level scenario given its logical representation as an input. Again, the input ETL workflow is considered to be a directed acyclic graph constructed as mentioned above. A logical-level activity corresponds to a template, i.e., an abstract operation that is customized with schema and parameter information for the ETL scenario under consideration. Each logical-level template is physically implemented by a variety of implementations (much like a relational join is implemented by nested loops, merge-sort, or hash join physical-level operators). Each physical operator is sensitive to the order of the incoming stream of tuples and has a different cost and different needs of system resources (e.g., memory, disk space, etc.). Again, the problem is modeled as a state-space problem with states representing full implementations of all the logical level activities with their respective physical-level operators. Transitions are of two kinds: (a) replacement of a physical implementation and (b) introduction of sorter activities, which apply on stored recordsets and sort their tuples according to the values of some attributes that are critical for the sorting. The main idea behind the introduction of sorters is that order-aware implementations can be much faster than their order-neutral equivalents and possibly outweigh the cost imposed by the sorting of data. As with classical query optimization, existing ordering in the storage of incoming data or the reuse of interesting orders in more than one operator can prove beneficial. Finally, this is the first paper where butterflies have been used as an experimental tool for the assessment of the proposed method (see the coming section on benchmarking for more details).
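A back-of-the-envelope comparison illustrates why introducing a sorter can pay off; the cost formulas below are hypothetical stand-ins (pairwise versus neighbour-based duplicate detection), not the cost model of the paper.

```python
# Hypothetical costs only: an order-neutral duplicate detection that compares
# all pairs of tuples is quadratic, whereas a sorter followed by an order-aware
# neighbour comparison is close to linearithmic.

import math

def cost_pairwise_duplicates(n):       # order-neutral: compare every pair
    return n * (n - 1) / 2

def cost_sort(n):                      # sorter activity over the recordset
    return n * math.log2(n)

def cost_sorted_duplicates(n):         # order-aware: compare sorted neighbours
    return n

n = 100_000
plan_without_sorter = cost_pairwise_duplicates(n)
plan_with_sorter = cost_sort(n) + cost_sorted_duplicates(n)
print(f"order-neutral: {plan_without_sorter:.2e} comparisons")
print(f"sorter + order-aware: {plan_with_sorter:.2e} comparisons")
```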
The Resumption Problem
An ETL process typically processes several MBs or GBs of data. Due to the complex nature of the process and the sheer amount of data, failures create a significant problem for warehouse administrators, mainly due to the time limits (typically referred to as the time window) within which the loading process must be completed. Therefore, the efficient resumption of the ETL process in the case of failures is very important.
Labio, Wiener, Garcia-Molina & Gorelik (2000) are concerned with the issue of efficiently resuming an interrupted workflow. Instead of redoing the workflow all over from the beginning, the authors propose a resumption algorithm, called DR, based on the fundamental observation that whenever an activity outputs data in an ordered fashion, then its resumption can start right where it was interrupted. Activities are practically treated as black boxes (where only the input and output are of interest) and a tree of activities is used to model the workflow. Each path of the tree is characterized by the relationship of output to input tuples and by the possibility of ignoring some tuples. Each transformation in an ETL process is characterized with respect to a set of properties that concern (a) the extent to which an input tuple produces more than one output tuple (if not, this can be exploited at resumption time), (b) the extent to which a prefix or a suffix of a transformation's output can be produced by a prefix or a suffix of the input (in which case, resumption can start from the last tuple under process at the time of failure), (c) the order produced by the transformation (independently of the input's order) and the deterministic nature of the transformation. Combinations of these properties are also considered by the authors. Moreover, these properties are not defined only for transformations in isolation, but they are also generalized for sequences of transformations, i.e., the whole ETL process.
The resumption algorithm has two phases: (a) design, where the activities of the workflow are characterized with respect to the aforementioned properties and (b) resumption, which is based on the previous characterization and invoked in the event of failure.
Design constructs a workow customized
to execute the resumption of the original
workow. To this end, re-extraction pro-
cedures are assigned to extractors; these
procedures regulate whether all or part of
the input will be extracted from the sources.
These re-extraction procedures are comple-
mented with lters that are responsible for
blocking source or intermediate tuples that
have already been processed and most
importantly, stored – during the regular
operation of the ETL process.
•	Resume assigns the correct values to the resumption workflow's filters, so that the tuples that are already stored in the warehouse are blocked, as sketched below. Then, Resume performs the application of the re-extraction procedures. Subsequently, the load of the warehouse continues as in regular operation.
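The sketch below, an illustration of the idea rather than the DR algorithm itself, shows such a blocking filter for the common case where a transformation emits tuples in the same order in which the target was loaded: resumption simply skips every tuple up to the last one known to be stored.

```python
# Illustrative resumption filter; assumes ordered output and a known last
# stored key, which is only one of the cases handled by the DR algorithm.

def resumption_filter(ordered_tuples, last_stored_key, key=lambda t: t[0]):
    """Yield only the tuples that come strictly after the last stored key."""
    for t in ordered_tuples:
        if last_stored_key is None or key(t) > last_stored_key:
            yield t

source = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
# The failure happened after the tuple with key 2 was committed to the warehouse.
print(list(resumption_filter(source, last_stored_key=2)))   # [(3, 'c'), (4, 'd')]
```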
Benchmarking
Unfortunately, the area of ETL processes suffers
from the absence of a thorough benchmark that
puts tools and algorithms to the test, taking into
consideration parameters like the complexity
of the process, the data volume, the amount of
necessary cleaning and the computational cost of
individual activities of the ETL workows.
The Transaction Processing Council has
proposed two benchmarks for the area of deci-
sion support. The TPC-H benchmark (TPC-H,
2008) is a decision support benchmark that
consists of a suite of business-oriented ad-hoc
queries and concurrent data modications. The
database describes a sales system, keeping in-
formation for the parts and the suppliers, and
data about orders and the supplier’s customers.
The relational schema of TPC-H is not a typical warehouse star or snowflake schema; on the contrary, apart from a set of tables that are clearly classified as dimension tables, the facts
are organized in a combination of tables acting
as bridges and fact tables. Concerning the ETL
process, the TPC-H requires the existence of
very simple insertion and deletion SQL state-
ments that directly modify the contents of the
LINEITEM and ORDERS warehouse tables.
Clearly, TPC-H is not related to ETL, since there
is no workow of cleanings or transformations,
no value computations and no routing of data
from the sources to the appropriate targets in
the warehouse.
TPC-DS is a new Decision Support (DS)
workload being developed by the TPC (TPC-
DS, 2005). This benchmark models the decision
support system of a retail product supplier,
including queries and data maintenance. The
relational schema of this benchmark is more
complex than the schema presented in TPC-
H. TPC-DS involves six star schemata (with a
large overlap of shared dimensions) standing for
Sales and Returns of items purchased via three
sales channels: a Store, a Catalog and the Web
channel. The structure of the schemata is more
natural for data warehousing than TPC-H; still,
the schemata are neither pure stars, nor pure snowflakes. The dimensions follow a snowflake pattern, with a different table for each level; nevertheless, the fact table has foreign keys to all the dimension tables of interest (resulting in fast joins with the appropriate dimension level whenever necessary). TPC-DS provides a significantly more sophisticated palette of
refreshment operations for the data warehouse
than TPC-H. There is a variety of maintenance
processes that insert or delete facts, maintain
inventories and refresh dimension records,
either in a history keeping or in a non-history
keeping method. To capture the semantics of
the refreshment functions, warehouse tables
are pseudo-dened as views over the sources.
The refreshment scenarios of TPC-DS require
the usage of functions for transformations and
computations (with date transformations and
surrogate key assignments being very popular).
Fact tables are also populated via a large number
of inner and outer joins to dimension tables.
Overall, TPC-DS is a signicant improvement
over TPC-H in terms of benchmarking the ETL
process; nevertheless, it still lacks the notion of
large workows of activities with schema and
value transformations, row routing and other
typical ETL features.
A rst academic effort for the benchmark-
ing of warehouses is found in Darmont, Ben-
Darmont, Bentayeb, & Boussaïd (2005). Still, the benchmark explicitly mentions ETL processes as future work and does not address the abovementioned problems. A second approach towards a benchmark for ETL processes is presented in Vassiliadis, Karagiannis, Tziovara & Simitsis (2007). The authors provide a rough classification of template structures, which are called butterflies due to their shape. In the general case, a butterfly comprises three elements: a left wing, where source data are successively combined (typically via join operations) towards a central point of storage, called the body of the butterfly; and a right wing, which comprises the routing and aggregation of the tuples that arrived at the body towards materialized views (also simulating reports, spreadsheets, etc.). Depending on the shape of the workflow (where one of the wings might be missing), the authors propose several template workflow structures and a concrete set of scenarios that are materializations of these templates.
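As a rough illustration of the template structure (with invented node names, not the benchmark's actual generator), the following sketch assembles a butterfly-shaped workflow as a directed acyclic graph: a left wing of successive joins, a body, and a right wing of aggregations feeding materialized views.

```python
# Illustrative butterfly template; node labels and the adjacency-list encoding
# are assumptions made for this sketch.

def butterfly(sources, views, body="fact_table"):
    """Return an ETL workflow as a DAG: {node: [successor, ...]}."""
    dag = {}
    join_chain = None
    for s in sources:                                   # left wing: successive joins
        if join_chain is None:
            join_chain = s
            dag[s] = []
        else:
            join = f"join({join_chain},{s})"
            dag[join_chain] = dag.get(join_chain, []) + [join]
            dag[s] = [join]
            dag[join] = []
            join_chain = join
    dag[join_chain] = dag.get(join_chain, []) + [body]  # body of the butterfly
    dag[body] = [f"aggregate_{v}" for v in views]       # right wing: routing/aggregation
    for v in views:
        dag[f"aggregate_{v}"] = [v]
        dag[v] = []
    return dag

print(butterfly(["orders", "customers"], ["sales_per_region", "sales_per_month"]))
```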
Near-Real-Time ETL
Traditionally, the data warehouse refreshment process has been executed in an off-line mode, during a nightly time window. Nowadays, business needs for on-line monitoring of source-side business processes drive the request for 100% data freshness at the user's reports at all times. Absolute freshness (or real time warehousing, as it is often loosely called) practically contradicts one of the fundamental reasons for separating the OLTP systems from reporting tasks, namely load, lock contention and efficiency (at least when large scale is concerned). Still, due to the user requests, data warehouses cannot escape their transformation to data providers for the end users with an increasing rate of incoming, fresh data. For all these reasons, a compromise is necessary and, as a result, the refreshment process moves to periodic refresh operations with a period of hours or even minutes (instead of days). This brings near real time data warehousing to the stage. In this subsection, we will review some research efforts in this direction.
Karakasidis, Vassiliadis & Pitoura (2005) propose a framework for the implementation of a near real time warehouse (called "active" data warehousing by the authors). Several goals are taken into consideration by the authors, and specifically: (a) minimal changes in the software configuration of the source, (b) minimal overhead on the source due to the continuity of data propagation, and (c) the possibility of smoothly regulating the overall configuration of the environment in a principled way. The architecture of the system is based on pipelining: each activity behaves as a queue (and thus, it is called an ETL queue) that periodically checks the contents of its queue buffers and passes a block of tuples to a subsequent queue once they are appropriately processed (i.e., filtered or transformed). The queues of an ETL workflow form a queue network and pipelining takes place. The paper explores a few other possibilities: (a) queueing theory is used for the prediction of the behavior of the queue network, (b) a legacy application over ISAM files was modified with minimal software changes and (c) web services were employed at the warehouse end to accept the final blocks of tuples and load them into the warehouse. The latter performed reasonably well, although the heavyweight nature of web service architectures poses an interesting research challenge.
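The following condensed sketch, a simplification of the description above rather than the authors' implementation, shows two such ETL queues chained in a pipeline: each queue periodically drains a block of tuples from its buffer, applies its filter or transformation, and forwards the result to the next queue.

```python
# Simplified ETL-queue pipeline; class and field names are assumptions of this sketch.

from collections import deque

class ETLQueue:
    def __init__(self, transform, block_size=2):
        self.buffer = deque()
        self.transform = transform            # filters or transforms a block of tuples
        self.block_size = block_size
        self.next = None

    def enqueue(self, tuples):
        self.buffer.extend(tuples)

    def tick(self):
        """Periodic check: if a full block is buffered, process and forward it."""
        if len(self.buffer) >= self.block_size:
            block = [self.buffer.popleft() for _ in range(self.block_size)]
            out = self.transform(block)
            if self.next is not None:
                self.next.enqueue(out)
            return out
        return []

clean = ETLQueue(lambda blk: [t for t in blk if t["amount"] > 0])
to_euro = ETLQueue(lambda blk: [dict(t, amount=t["amount"] * 0.9) for t in blk])
clean.next = to_euro

clean.enqueue([{"amount": 10}, {"amount": -1}, {"amount": 5}, {"amount": 7}])
clean.tick(); clean.tick()                    # the first queue pushes two blocks
print(to_euro.tick())                         # [{'amount': 9.0}, {'amount': 4.5}]
```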
The work by Luo, Naughton, Ellmann &
Watzke (2006) deals with the problem of con-
tinuous maintenance of materialized views. The
continuous loading of the warehouse with new
or updated source data is typically performed
via concurrent sessions. In this case, the existence of materialized views that are defined as joins of the source tables may cause deadlocks (practically, the term 'source tables' should be understood as referring to cleansed, integrated "replicas" of the sources within the warehouse). This is because the maintenance of the view after an update of source R1 may require a lookup to source relation R2 to match the new delta. If R2 is updated concurrently
with R1, then a deadlock may occur. To avoid
deadlock, Luo et al. propose the reordering
of transactions in the warehouse. The authors
assume that the refreshment process operates
over join or aggregate join materialized views.
The authors suggest several rules for avoiding
deadlocks. Specically, the authors require that
(a) at any time, only one of the relations of a
join view is updated, (b) data coming from the
sources are not randomly assigned to update
sessions, but rather, all the modications of a
tuple are routed to the same refreshment ses-
sion, and, (c) the traditional 2PL, tuple-level
locking mechanism is replaced by some higher
level protocol. Individual modications to the
same relation are grouped in transactions and
a scheduling protocol for these transactions is
proposed, along with some starvation avoidance
heuristics. A second improvement that Luo et
al. propose is the usage of “pre-aggregation” for
aggregate materialized views, which practically
involves the grouping of all the modications
of a certain view tuple in the same transaction
to one modication with their net effect. This
can easily be achieved by sorting the individual
updates over their target tuple in the aggregate
materialized view. The experimental assess-
ment indicates that the throughput is signi-
cantly increased as an effect of the reordering
of transactions especially as the number of
modications per transaction increases.
Thiele, Fischer & Lehner (2007) discuss
the problem of managing the workload of the
warehouse in a near real time warehouse (called
real time warehouse by the authors). The au-
thors discuss a load-balancing mechanism that
schedules the execution of query and update
transactions according to the preferences of the
users. There are two fundamental, conicting
goals that the scheduling tries to reconcile. On
the one hand, users want efcient processing of
their queries and low response time. This goal is
referred to as Quality of Service by the authors.
On the other hand, the users also want the ware-
house data to be as up-to-date as possible with
respect to their originating records at the sources.
This goal is referred to as Quality of Data by
the authors. Two queues are maintained by the
algorithm, one for the queries and one for the
update transactions. Since a reconciliation must
be made between two conicting requirements,
the authors assume that queries are tagged with
two scores, one for the requirement for fresh-
ness and another for the requirement of query
efciency. To enable the near real time loading
of the warehouse, Thiele et al. propose a two-
level scheduling mechanism. The rst level of
scheduling is dedicated in deciding whether
a user query or an update transaction will be
executed; this decision is based on the sum of
the scores of both requirements for all the que-
ries. The winner sum determines if an update
or a query will be executed. The second level
of scheduling resolves which transaction will
be executed. To avoid implications with data
correctness and to serve both goals better, the
authors resolve in two scheduling guidelines.
If an update transaction is to be executed, then,
it should be the one related to the data that are
going to be queried by a query at the beginning
of the query queue. On the other hand, if a query
is to be executed (i.e., efciency has won the
rst level of the scheduling contest), then the
query with a higher need for Quality of Service
is picked. To avoid starvation of queries with
low preference to quality of service, the QoS
tags of all queries that remain in the queue are
increased after each execution of a query.
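The following compact sketch reflects my reading of the description above rather than the authors' code: the first level compares the summed freshness and efficiency scores of the waiting queries to choose between an update and a query, and the second level picks the concrete transaction, boosting the QoS tags of the queries left behind.

```python
# Illustrative two-level scheduling step; the dictionary fields ('qos', 'qod',
# 'partition') are assumptions made for this sketch.

def schedule_step(query_queue, update_queue):
    """query_queue: dicts {'id', 'qos', 'qod', 'partition'};
    update_queue: dicts {'id', 'partition'}. Returns the chosen action."""
    if not query_queue:
        return ("update", update_queue.pop(0)) if update_queue else None
    total_qos = sum(q["qos"] for q in query_queue)
    total_qod = sum(q["qod"] for q in query_queue)
    if total_qod > total_qos and update_queue:
        # Prefer the update touching the data the head query is about to read.
        head = query_queue[0]
        idx = next((i for i, u in enumerate(update_queue)
                    if u["partition"] == head["partition"]), 0)
        return ("update", update_queue.pop(idx))
    # Otherwise run the query with the highest QoS need and boost the rest,
    # so that low-QoS queries do not starve.
    query_queue.sort(key=lambda q: q["qos"], reverse=True)
    chosen = query_queue.pop(0)
    for q in query_queue:
        q["qos"] += 1
    return ("query", chosen)

queries = [{"id": "q1", "qos": 3, "qod": 5, "partition": "sales"},
           {"id": "q2", "qos": 4, "qod": 4, "partition": "stock"}]
updates = [{"id": "u_stock", "partition": "stock"},
           {"id": "u_sales", "partition": "sales"}]
print(schedule_step(queries, updates))
# ('update', {'id': 'u_sales', 'partition': 'sales'})
```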
Thomsen, Pedersen & Lehner (2008) dis-
cuss a loader for near real time, or right time
data warehouses. The loader tries to synchronize
the loading of the warehouse with queries that
require source data with a specic freshness
guarantee. The idea of loading the data when
they are needed (as opposed to before they are
needed) produces the notion of right time ware-
housing. The architecture of the middleware
discussed by the authors involves three tiers.
The rst tier concerns the data producer, at the
source side. The middleware module provided
for the producer captures the modication
operations “insert” and “commit” for JDBC
statements. The user at the source side can
decide whether committed data are to become
available for the warehouse (in this case, this
is referred to as materialization by the authors).
On insert, the new source values are cached in
a source buffer and wait to be propagated to the next tier, which is called the catalyst. Apart from that, the source middleware provides callbacks for regulating its policy as steal or no-steal with respect to the flushing of committed data to the hard disk at the source side, as well as a third option for regulating its policy on the basis of its current load or elapsed idle time. The catalyst side, at the same time, also buffers data for the consumer. An interesting property of the catalyst is the guarantee of freshness for the user. So, whenever a warehouse user requests data, he can set a time interval for the data he must have; then the catalyst checks its own buffers and communicates with the source-side middleware to compile the necessary data. A precise synchronization of the data that must be shed from main memory via appropriate main memory indexing is also discussed by the authors. The third tier, at the warehouse side, operates in a REPEATABLE READ isolation mode and appropriately locks records (via a shared lock) so that the catalyst does not shed them during their loading to the warehouse. Finally, the authors discuss a simple programmatic facility via appropriate view definitions to allow the warehouse user to retrieve the freshest data possible in a transparent way.
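A rough sketch of the catalyst's freshness guarantee, as I interpret the description above (the class and method names are hypothetical), is the following: the consumer states how stale the data may be, and the catalyst serves its own buffer unless it first has to drain the source-side buffer.

```python
# Hypothetical catalyst buffer; not the RiTE middleware's actual interface.

import time

class Catalyst:
    def __init__(self, source_buffer):
        self.source_buffer = source_buffer      # rows buffered at the producer side
        self.buffer = []                        # rows already propagated to the catalyst
        self.last_sync = 0.0

    def read(self, max_staleness_seconds):
        now = time.time()
        if now - self.last_sync > max_staleness_seconds:
            # The catalyst buffer may be too stale: drain the source-side buffer first.
            self.buffer.extend(self.source_buffer)
            self.source_buffer.clear()
            self.last_sync = now
        return list(self.buffer)

source_rows = [("order_42", "inserted")]
catalyst = Catalyst(source_rows)
print(catalyst.read(max_staleness_seconds=60))   # [('order_42', 'inserted')]
```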
Polyzotis, Skiadopoulos, Vassiliadis,
Simitsis & Frantzell (2007, 2008) deal with
an individual operator of the near real time
ETL process, namely the join of a continuous
stream of updates generated at the sources with
a large, disk resident relation at the warehouse
side, under the assumption of limited memory.
Such a join can be used in several situations,
such as surrogate key assignment, duplicate
detection, simple data transformations etc. To
this end, a specialized join algorithm, called
MeshJoin is introduced. The main idea is that
the relation is continuously brought to main
memory in scans of sequential blocks that
are joined with the buffered stream tuples. A
precise expiration mechanism for the stream
tuples guarantees the correctness of the result.
The authors propose an analytic cost model that
relates the stream rate and the memory budget.
This way, the administrator can tune the opera-
tion of the algorithm, either in order to maximize
throughput for a given memory budget, or in
order to minimize the necessary memory for a
given stream rate. In the case of thrashing, an
approximate version of the algorithm with load
shedding strategies that minimize the loss of
output tuples is discussed. MeshJoin makes no
assumption on the order, indexing, join condi-
tion and join relationship of the joined stream
and relation; at the same time it relates the
stream rate with the available memory budget
and gives correctness guarantees for an exact
result if this is possible, or, allows a lightweight
approximate result otherwise.
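The following simplified sketch conveys the core of the idea while omitting the paper's memory budgeting and load shedding; the block size and data layout are illustrative assumptions. The relation is scanned cyclically in blocks, every block is probed against all buffered stream tuples, and a stream tuple expires once a full cycle over the relation has passed since it entered the buffer.

```python
# Simplified MeshJoin-style loop; illustrative only.

def mesh_join(stream_batches, relation, block_size=2, key=lambda t: t[0]):
    n_blocks = (len(relation) + block_size - 1) // block_size
    buffered = []                       # entries: [stream_tuple, blocks_seen]
    block_no = 0
    for batch in stream_batches:
        buffered.extend([t, 0] for t in batch)          # admit new stream tuples
        start = block_no * block_size
        block = relation[start:start + block_size]      # next sequential block
        block_no = (block_no + 1) % n_blocks
        for entry in buffered:
            s, _ = entry
            for r in block:
                if key(s) == key(r):
                    yield (s, r)                        # join output
            entry[1] += 1
        buffered = [e for e in buffered if e[1] < n_blocks]   # expire completed tuples

relation = [(1, "dim_a"), (2, "dim_b"), (3, "dim_c"), (4, "dim_d")]
stream = [[(2, "upd_x")], [(4, "upd_y")], [], []]
print(list(mesh_join(stream, relation)))
# [((2, 'upd_x'), (2, 'dim_b')), ((4, 'upd_y'), (4, 'dim_d'))]
```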
SYSTEMS
Industrial systems for ETL are provided both by
the major database vendors and the individual
ETL-targeted vendors. Popular tools include
Oracle Warehouse Builder (2008), IBM Datast-
age (2008), Microsoft Integration Services
(2008) and Informatica PowerCenter (2008).
There is an excellent survey by Barateiro & Galhardas (2005) that provides a thorough discussion and feature comparison of ETL tools of both academic and industrial origin. Friedman,
Beyer & Bitterer (2007) as well as Friedman &
Bitterer (2007) give two interesting surveys of
the area from a marketing perspective. In this
section, we will discuss only academic efforts
related to ETL systems.
AJAX
The AJAX system (Galhardas, Florescu, Sha-
sha & Simon, 2000) is a data cleaning tool
developed at INRIA France that deals with
typical data quality problems, such as duplicate
identication, errors due to mistyping and data
inconsistencies between matching records. This
tool can be used either for a single source or
for integrating multiple data sources. AJAX
provides a framework wherein the logic of a
data cleaning program is modeled as a directed
graph of data transformations that start from
some input source data. Four types of data
transformations are supported:
•	Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.
•	Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs and each such pair is assigned a similarity value.
•	Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., transitive closure), as sketched below.
•	Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.
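The toy sketch below, which is my own illustration and not AJAX's declarative language, shows the matching and clustering steps: pairs of records above a similarity threshold form matching pairs, which are then grouped into duplicate clusters by transitive closure.

```python
# Illustrative matching and clustering; the similarity function and threshold
# are assumptions of this sketch.

from difflib import SequenceMatcher
from itertools import combinations

def matching_pairs(records, threshold=0.8):
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        sim = SequenceMatcher(None, a, b).ratio()
        if sim >= threshold:
            pairs.append((i, j, sim))          # a matching pair with its similarity
    return pairs

def cluster_by_transitive_closure(n, pairs):
    parent = list(range(n))                    # union-find over record indices
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j, _ in pairs:
        parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

recs = ["John Smith", "Jon Smith", "John Smyth", "Mary Jones"]
pairs = matching_pairs(recs)
print(cluster_by_transitive_closure(len(recs), pairs))   # [[0, 1, 2], [3]]
```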
AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled and to support a stepwise refinement design of data cleaning programs. The linguistic aspects and the theoretical foundations of this tool can be found in Galhardas, Florescu, Shasha, Simon & Saita (2001), and Galhardas, Florescu, Shasha & Simon (1999), where apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.
ARKTOS
Arktos (Vassiliadis et al., Information Systems 2005) is an ETL tool that has been prototypically implemented with the goal of facilitating the design, the (re-)use, and the optimization of ETL workflows. Arktos is based on the metamodel of Vassiliadis et al. (Information Systems 2005) for representing ETL activities and ETL workflows, and its two key features are (a) the extensibility mechanisms for reusing transformations and (b) the close linkage to formal semantics and their representation. Arktos is accompanied by an ETL library that contains template code of built-in functions and maintains template code of user-defined functions. Template activities are registered in the system and they can be reused for the specification of a scenario (either graphically, or via forms and declarative languages). The customization process results in producing an ETL scenario which is a DAG of ETL activities, each specified as a parameterized software module, having instantiated input and output schemata, concrete parameters, and a special-purpose schema for problematic records. The set of templates is extensible to allow users to register their own frequently used transformations.
Arktos also offers zoom-in/zoom-out ca-
pabilities. The designer can deal with a scenario
at two levels of granularity: (a) at the entity or
zoom-out level, where only the participating
recordsets and activities are visible and their
provider relationships are abstracted as edges
between the respective entities, or (b) at the
attribute or zoom-in level, where the user can
see and manipulate the constituent parts of an
activity, along with their respective providers
at the attribute level.
Finally, it is noteworthy that the model of
Vassiliadis et al. (Information Systems 2005)
comes with a mechanism for expressing the
semantics of activities in LDL. The expression
of semantics can be done both at the template
and the instance level and a macro language is
discussed in the paper for this purpose. The ap-
proach is based on the fundamental observation
that the LDL description can be mapped to a
useful graph representation of the internals of
an activity, which allows both the visualization
and the measurement of interesting properties
of the graph.
HumMer - Fusion
HumMer (Naumann, Bilke, Bleiholder & Weis, 2006) is a tool developed at the Hasso-Plattner Institute in Potsdam that deals with the problem of data fusion. Data fusion is the task of
identifying and resolving different representations of the same object of the real world. HumMer splits the process into three steps: (a) schema alignment, (b) duplicate detection and (c) the core of the data fusion process. Schema alignment deals with the problem of schema inconsistencies. Assuming the user is in possession of different data sets representing the same entities of the real world, HumMer is equipped with a dedicated module that detects a small sample of duplicates in these data sets and tries to find schema similarities. Duplicate detection is performed via the comparison of tuples against a similarity threshold. Once
inconsistencies at the schema and tuple level
have been resolved, it is time for the resolution
of value inconsistencies, which is referred to as
data fusion by Naumann et al (2006). FuSem
(Bleiholder, Draba & Naumann, 2007) is an
extension of HumMer with the purpose of
interactively facilitating the data fusion pro-
cess. Assuming that the user has different data
sets representing the same real world entities,
FuSem allows the SQL querying of these data
sources, their combination through outer join
and union operations and the application of dif-
ferent tuple matching operators. The extension
of SQL with the FUSE BY operator, proposed by Bleiholder & Naumann (2005), is also part of FuSem. FUSE BY allows both the alignment
of different relations in terms of schemata and
the identication of tuples representing the
same real world object. An important feature
of FuSem is the visualization of results, which
is primarily based on different representations
of Venn diagrams for the involved records and
the interactive exploration of areas where two
data sets overlap or differ.
Potter’s Wheel
Raman & Hellerstein (2000, 2001) present the
Potter’s Wheel system, which is targeted to
provide interactive data cleaning to its users.
The system offers the possibility of performing
several algebraic operations over an underlying
data set, including format (application of a func-
tion), drop, copy, add a column, merge delimited
columns, split a column on the basis of a regular
expression or a position in a string, divide a col-
umn on the basis of a predicate (resulting in two
columns, the rst involving the rows satisfying
the condition of the predicate and the second
involving the rest), selection of rows on the basis
of a condition, folding columns (where a set of
attributes of a record is split into several rows)
and unfolding. Optimization algorithms are also
provided for the CPU usage for certain classes
of operators. The general idea behind Potter’s
Wheel is that users build data transformations
in iterative and interactive way. Specically,
users gradually build transformations to clean
the data by adding or undoing transformations
on a spreadsheet-like interface; the effect of
a transformation is shown at once on records
visible on screen. These transformations are
specied either through simple graphical op-
erations, or by showing the desired effects on
example data values. In the background, Potter’s
Wheel automatically infers structures for data
values in terms of user-dened domains, and
accordingly checks for constraint violations.
Thus, users can gradually build a transformation
as discrepancies are found, and clean the data
without writing complex programs or enduring
long delays.
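As an illustration of one of these operations (my own sketch, not Potter's Wheel code), the following fold transform splits a chosen set of columns of each record into several rows, one per folded column.

```python
# Illustrative fold transform; column names in the example are invented.

def fold(rows, fold_columns, name_col="attribute", value_col="value"):
    """rows: list of dicts; fold_columns: columns turned into (name, value) rows."""
    out = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in fold_columns}
        for col in fold_columns:
            out.append({**kept, name_col: col, value_col: row[col]})
    return out

data = [{"store": "S1", "q1_sales": 10, "q2_sales": 12}]
print(fold(data, ["q1_sales", "q2_sales"]))
# [{'store': 'S1', 'attribute': 'q1_sales', 'value': 10},
#  {'store': 'S1', 'attribute': 'q2_sales', 'value': 12}]
```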
CONCLUDING REMARKS
This survey has presented the research work in the field of Extraction-Transformation-Loading (ETL) processes and tools. The main research goals around which research has been organized so far can be summarized as follows.
a. The first goal concerns the construction of commonly accepted conceptual and logical modeling tools for ETL processes, with a view to a standardized approach.
b. The second goal concerns the efficiency of individual ETL operators. To structure the discussion better, we have organized it around the three main parts of the E-T-L triplet, and examined problems that fall within each of these stages. So far, research has come up with interesting solu-
tions for the detection of differences in two snapshots of data, specific data transformations, duplicate detection mechanisms and the practical and theoretical investigation of the lineage of warehouse data.
c. A third research goal has been to devise algorithms for the efficient operation of the entire ETL process. Specifically, research has produced some first results for the optimization and resumption of entire ETL processes as well as some first investigations towards (near) real time ETL.
d. A fourth goal of the academic world has been the construction of tools for the facilitation of the ETL process.
Apparently, the eld is quite new, and there
is still too much to be done. In the sequel, we
present a personal viewpoint on the possible
advancements that can be made in the eld.
Starting with the traditional ETL setting,
there are quite a few research opportunities.
Individual operators are still far from being
closed as research problems. The extraction
of data still remains a hard problem, mostly
due to the closed nature of the sources. The
loading problem is still quite unexplored with
respect to its practical aspects. The optimiza-
tion and resumption problems -along with the
appropriate cost models- are far from maturity.
The absence of a benchmark is hindering future
research (that lacks a commonly agreed way
to perform experiments). The introduction
of design patterns for warehouse schemata
that take ETL into consideration is also open
research ground.
At the same time, the presence of new heavily parallelized processors and the near certainty of disruptive effects in hardware and disk technology in the immediate future put all the database issues of efficiency back on the table. ETL cannot escape the rule, and both individual transformations as well as the whole process can be reconsidered in the presence of these improvements. Most importantly, all these research opportunities should be viewed via the looking glass of near real time ETL, with the need for completeness and freshness of data pressing on the part of the users.
As an overall conclusion, we believe that design, algorithmic and theoretical results in the field of ETL processes are open to exploration, both due to the present problems and due to the changing environment of computer science, databases and user needs.
REFERENCES
Barateiro, J., & Galhardas, H. (2005). A Survey
of Data Quality Tools. Datenbank-Spektrum 14,
15-21
Bleiholder, J., & Naumann, F. (2005). Declarative
Data Fusion - Syntax, Semantics, and Implementa-
tion. 9th East European Conference on Advances
in Databases and Information Systems (ADBIS
2005), pp.: 58-73, Tallinn, Estonia, September 12-
15, 2005.
Bleiholder, J., Draba, K., & Naumann, F. (2007).
FuSem - Exploring Different Semantics of Data
Fusion. Proceedings of the 33rd International Confer-
ence on Very Large Data Bases (VLDB 2007), pp.:
1350-1353, University of Vienna, Austria, September
23-27, 2007.
Carreira P., Galhardas, H., Pereira, J., Martins,
F., & Silva, M. (2007). On the performance of
one-to-many data transformations. Proceedings
of the Fifth International Workshop on Quality in
Databases (QDB 2007), pp.: 39-48, in conjunction
with the VLDB 2007 conference, Vienna, Austria,
September 23, 2007
Carreira, P., Galhardas, H., Lopes A., & Pereira J.
(2007). One-to-many data transformations through
data mappers. Data Knowledge Engineering, 62,
3, 483-503
Carreira, P., Galhardas, H., Pereira, J., & Lopes,
A. (2005). Data Mapper: An Operator for Ex-
pressing One-to-Many Data Transformations. 7th
International Conference on Data Warehousing and
Knowledge Discovery (DaWaK 2005), pp.: 136-145,
Copenhagen, Denmark, August 22-26, 2005
Carreira, P., Galhardas, H., Pereira, J., & Wichert,
A. (2007). One-to-many data transformation opera-
tions - optimization and execution on an RDBMS.
Proceedings of the Ninth International Conference
on Enterprise Information Systems, Volume DISI,
ICEIS (1) 2007, pp.: 21-27, Funchal, Madeira, Por-
tugal, June 12-16, 2007
Colby, L., Kawaguchi, A., Lieuwen, D., Mumick, I.,
& Ross, K. (1997). Supporting Multiple View Main-
tenance Policies. In Proceedings ACM SIGMOD
International Conference on Management of Data
(SIGMOD 1997), pp.: 405-416, May 13-15, 1997,
Tucson, Arizona, USA
Cui, Y., & Widom, J. (2001). Lineage Tracing for
General Data Warehouse Transformations. Proceed-
ings of 27th International Conference on Very Large
Data Bases (VLDB 2001), pp.: 471-480, September
11-14, 2001, Roma, Italy
Cui, Y., & Widom, J. (2003). Lineage tracing for
general data warehouse transformations. VLDB
Journal 12, 1, 41-58
Cunningham, C., Galindo-Legaria, C., & Graefe, G.
(2004). PIVOT and UNPIVOT: Optimization and
Execution Strategies in an RDBMS. Proceedings of
the Thirtieth International Conference on Very Large
Data Bases (VLDB 2004), pp. 998-1009, Toronto,
Canada, August 31 - September 3 2004.
Darmont, J., Bentayeb, F., & Boussaïd, O. (2005).
DWEB: A Data Warehouse Engineering Benchmark.
Proceedings 7th International Conference Data
Warehousing and Knowledge Discovery (DaWaK
2005), pp. 85–94, Copenhagen, Denmark, August
22-26 2005
Davidson, S., & Kosky, A. (1999). Specifying
Database Transformations in WOL. Bulletin of the
Technical Committee on Data Engineering, 22, 1,
25-30.
Elmagarmid, A., Ipeirotis, P., & Verykios, V. (2007).
Duplicate Record Detection: A Survey. IEEE Trans-
actions on Knowledge and Data Engineering, 19,
1, 1-16
Fagin, R. (2007). Inverting schema mappings.
ACM Transactions on Database Systems, 32, 4,
25:1-25:53
Fagin, R., Kolaitis, P., & Popa, L. (2005). Data
exchange: getting to the core. ACM Transactions
on Database Systems, 30, 1, 174-210
Fagin, R., Kolaitis, P., Miller, R., & Popa, L. (2005).
Data exchange: semantics and query answering.
Theoretical Computer Science, 336, 1, 89-124
Fenk, R., Kawakami, A., Markl, V., Bayer, R., &
Osaki, S. (2000) Bulk Loading a Data Warehouse
Built Upon a UB-Tree. Proceedings 2000 Inter-
national Database Engineering and Applications
Symposium (IDEAS 2000), pp.: 179-187, September
18-20, 2000, Yokohoma, Japan
Friedman, T., & Bitterer, A. (2007). Magic Quad-
rant for Data Quality Tools, 2007. Gartner RAS
Core Research Note G00149359, 29 June 2007.
Available at http://mediaproducts.gartner.com/re-
prints/businessobjects/149359.html (last accessed
23 July 2008)
Friedman, T., Beyer, M., & Bitterer, A. (2007). Magic
Quadrant for Data Integration Tools, 2007. Gartner
RAS Core Research Note G00151150, 5 October
2007. Available at http://mediaproducts.gartner.
com/reprints/oracle/151150.html (last accessed 23
July 2008)
Galhardas, H., Florescu, D., Shasha, D., & Simon, E.
(1999). An Extensible Framework for Data Cleaning.
Technical Report INRIA 1999 (RR-3742).
Galhardas, H., Florescu, D., Shasha, D., & Simon, E.
(2000). Ajax: An Extensible Data Cleaning Tool. In
Proc. ACM SIGMOD International Conference on
the Management of Data, pp. 590, Dallas, Texas.
Galhardas, H., Florescu, D., Shasha, D., Simon,
E. & Saita, C. (2001). Declarative Data Cleaning:
Language, Model, and Algorithms. Proceedings of
27th International Conference on Very Large Data
Bases (VLDB 2001), pp.: 371-380, September 11-14,
2001, Roma, Italy
Gupta, A., & Mumick, I. (1995). Maintenance of
Materialized Views: Problems, Techniques, and
Applications. IEEE Data Engineering Bulletin,
18, 2, 3-18
Gupta, H., & Mumick, I. (2006). Incremental main-
tenance of aggregate and outerjoin expressions.
Information Systems, 31, 6, 435-464
IBM. WebSphere DataStage. Product’s web page
at http://www-306.ibm.com/software/data/integra-
tion/datastage/Last accessed 21 July 2008.
Informatica. PowerCenter. Product’s web page at
http://www.informatica.com/products_services/
powercenter/Pages/index.aspxLast accessed 21
July 2008.
Karakasidis, A., Vassiliadis, P., & Pitoura, E. (2005).
ETL Queues for Active Data Warehousing. In Proc.
2nd International Workshop on Information Qual-
ity in Information Systems (IQIS 2005), co-located
with ACM SIGMOD/PODS 2005, June 17, 2005,
Baltimore, MD, USA.
Kimball, R., Reeves, L., Ross, M., & Thornthwaite, W.
(1998). The Data Warehouse Lifecycle Toolkit: Expert
Methods for Designing, Developing, and Deploying
Data Warehouses. John Wiley & Sons.
Kimball, R., & Caserta, J. (2004). The Data Ware-
house ETL Toolkit. Wiley Publishing, Inc.
Kotidis, Y. (2002). Aggregate View Management
in Data Warehouses. In Handbook of Massive Data
Sets (pages 711-742). Kluwer Academic Publishers,
ISBN 1-4020-0489-3
Koudas, N., & Srivastava, D. (2005). Approximate
Joins: Concepts and Techniques. Tutorial at the
31st International Conference on Very Large Data
Bases (VLDB 2005), pp.: 1363, Trondheim, Norway,
August 30 - September 2, 2005. Slides available at
http://queens.db.toronto.edu/~koudas/docs/AJ.pdf
(last accessed 23 July 2008)
Koudas, N., Sarawagi, S., & Srivastava, D. (2006).
Record linkage: similarity measures and algorithms.
Tutorial at the ACM SIGMOD International Con-
ference on Management of Data (SIGMOD 2006),
pp.:802-803, Chicago, Illinois, USA, June 27-29,
2006. Slides available at http://queens.db.toronto.
edu/~koudas/docs/aj.pdf (last accessed 23 July
2008)
Labio, W., & Garcia-Molina, H. (1996). Efcient
Snapshot Differential Algorithms for Data Warehous-
ing. In Proceedings of 22nd International Conference
on Very Large Data Bases (VLDB 1996), pp. 63-74,
September 3-6, 1996, Mumbai (Bombay), India
Labio, W., Wiener, J., Garcia-Molina, H., & Gore-
lik, V. (2000). Efcient Resumption of Interrupted
Warehouse Loads. In Proceedings of the 2000 ACM
SIGMOD International Conference on Manage-
ment of Data (SIGMOD 2000), pp. 46-57, Dallas,
Texas, USA
Labio, W., Yerneni, R., & Garcia-Molina, H. (1999).
Shrinking the Warehouse Update Window. Proceed-
ings ACM SIGMOD International Conference on
Management of Data, (SIGMOD 1999), pp.: 383-394,
June 1-3, 1999, Philadelphia, Pennsylvania, USA
Lindsay, B., Haas, L., Mohan, C., Pirahesh, H., &
Wilms, P. (1986). A Snapshot Differential Refresh
Algorithm. Proceedings of the 1986 ACM SIGMOD
International Conference on Management of Data
(SIGMOD 1986), pp: 53-60, Washington, D.C.,
May 28-30, 1986.
Luján-Mora, S., Vassiliadis, P., & Trujillo, J. (2004).
Data Mapping Diagrams for Data Warehouse Design
with UML. In Proc. 23rd International Conference
on Conceptual Modeling (ER 2004), pp. 191-204,
Shanghai, China, 8-12 November 2004.
Luo, G., Naughton, J., Ellmann, C., & Watzke, M.
(2006). Transaction Reordering and Grouping for
Continuous Data Loading. First International Work-
shop on Business Intelligence for the Real-Time
Enterprises (BIRTE 2006), pp. 34-49. Seoul, Korea,
September 11, 2006
Microsoft. SQL Server Integration Services. Product's web page at http://www.microsoft.com/sql/technologies/integration/default.mspx. Last accessed 21 July 2008.
Mumick, I., Quass, D., & Mumick, B. (1997)
Maintenance of Data Cubes and Summary Tables
in a Warehouse. In Proceedings ACM SIGMOD
International Conference on Management of Data
(SIGMOD 1997), pp.: 100-111, May 13-15, 1997,
Tucson, Arizona, USA
Naumann, F., Bilke, A., Bleiholder, J., & Weis,
M. (2006). Data Fusion in Three Steps: Resolving
Schema, Tuple, and Value Inconsistencies. IEEE
Data Engineering Bulletin, 29, 2, 21-31
Oracle. Oracle Warehouse Builder. Product’s web
page at http://www.oracle.com/technology/products/
warehouse/index.html. Last accessed 21 July 2008.
Polyzotis, N., Skiadopoulos, S., Vassiliadis, P., Sim-
itsis, A., & Frantzell, N. (2007). Supporting Stream-
ing Updates in an Active Data Warehouse. In Proc.
23rd International Conference on Data Engineering
(ICDE 2007), pp 476-485, Constantinople, Turkey,
April 16-20, 2007.
Polyzotis, N., Skiadopoulos, S., Vassiliadis, P., Sim-
itsis, A., & Frantzell, N. (2008). Meshing Streaming
Updates with Persistent Data in an Active Data
Warehouse. IEEE Transactions on Knowledge and
Data Engineering, 20, 7, 976-991
Raman, V., & Hellerstein, J. (2000). Potters Wheel:
An Interactive Framework for Data Cleaning and
Transformation. Technical Report University of
California at Berkeley, Computer Science Divi-
sion, 2000. Available at http://www.cs.berkeley.
edu/~rshankar/papers/pwheel.pdf
Raman, V., & Hellerstein, J. (2001). Potter’s Wheel:
An Interactive Data Cleaning System. In Proceedings
of 27th International Conference on Very Large Data
Bases (VLDB 2001), pp. 381-390, Roma, Italy.
Roth M., & Schwarz P. (1997). Don’t Scrap It, Wrap
It! A Wrapper Architecture for Legacy Data Sources.
In Proceedings of 23rd International Conference on
Very Large Data Bases (VLDB 1997), pp. 266-275,
August 25-29, 1997, Athens, Greece
Roussopoulos, N. (1998). Materialized Views and
Data Warehouses. SIGMOD Record, 27, 1, 21-26
Roussopoulos, N., Kotidis, Y., & Roussopoulos, M.
(1997). Cubetree: Organization of and Bulk Updates
on the Data Cube. Proceedings ACM SIGMOD
International Conference on Management of Data
(SIGMOD 1997), pp. 89-99, May 13-15, 1997,
Tucson, Arizona, USA
Roussopoulos, N., Kotidis, Y., & Sismanis, Y. (1999).
The Active MultiSync Controller of the Cubetree
Storage Organization. Proceedings ACM SIGMOD
International Conference on Management of Data
(SIGMOD 1999), pp.: 582-583, June 1-3, 1999,
Philadelphia, Pennsylvania, USA
Shu, N., Housel, B., Taylor, R., Ghosh, S., & Lum, V.
(1977). EXPRESS: A Data EXtraction, Processing,
and REStructuring System. ACM Transactions on
Database Systems, 2, 2, 134-174.
Simitsis, A. (2005). Mapping conceptual to logical
models for ETL processes. In Proceedings of ACM
8th International Workshop on Data Warehousing
and OLAP (DOLAP 2005), pp.: 67-76 Bremen,
Germany, November 4-5, 2005
Simitsis, A., & Vassiliadis, P. (2008). A Method
for the Mapping of Conceptual Designs to Logical
Blueprints for ETL Processes. Decision Support
Systems, 45, 1, 22-40.
Simitsis, A., Vassiliadis, P., & Sellis, T. (2005). Opti-
mizing ETL Processes in Data Warehouses. Proceed-
ings 21st Int. Conference on Data Engineering (ICDE
2005), pp. 564-575, Tokyo, Japan, April 2005.
Simitsis, A., Vassiliadis, P., & Sellis, T. (2005).
State-Space Optimization of ETL Workows. IEEE
Transactions on Knowledge and Data Engineering,
17, 10, 1404-1419
Simitsis, A., Vassiliadis, P., Terrovitis, M., &
Skiadopoulos, S. (2005). Graph-Based Modeling
of ETL activities with Multi-level Transformations
and Updates. In Proc. 7th International Conference
on Data Warehousing and Knowledge Discovery
2005 (DaWaK 2005), pp. 43-52, 22-26 August 2005,
Copenhagen, Denmark.
Sismanis, Y., Deligiannakis, A., Roussopoulos,
N., & Kotidis, Y. (2002). Dwarf: shrinking the
PetaCube. Proceedings of the 2002 ACM SIGMOD
International Conference on Management of Data
(SIGMOD 2002), 464-475, Madison, Wisconsin,
June 3-6, 2002
Skoutas, D., & Simitsis, A., (2006). Designing ETL
processes using semantic web technologies. In Pro-
ceedings ACM 9th International Workshop on Data
Warehousing and OLAP (DOLAP 2006), pp.:67-74,
Arlington, Virginia, USA, November 10, 2006
Skoutas, D., & Simitsis, A., (2007). Flexible and
Customizable NL Representation of Requirements
for ETL processes. In Proceedings 12th International
Conference on Applications of Natural Language to
Information Systems (NLDB 2007), pp.: 433-439,
Paris, France, June 27-29, 2007
Skoutas, D., & Simitsis, A., (2007). Ontology-Based
Conceptual Design of ETL Processes for Both
Structured and Semi-Structured Data. Int. Journal
of Semantic Web Information Systems (IJSWIS) 3,
4, 1-24
Stanoi, I., Agrawal, D., & El Abbadi, A. (1999).
Modeling and Maintaining Multi-View Data Ware-
houses. Proceedings 18th International Conference
on Conceptual Modeling (ER 1999), pp.: 161-175,
Paris, France, November, 15-18, 1999
Stöhr, T., Müller, R., & Rahm, E. (1999). An integra-
tive and Uniform Model for Metadata Management
in Data Warehousing Environments. In Proc. Intl.
Workshop on Design and Management of Data
Warehouses (DMDW 1999), pp. 12.1 – 12.16, Hei-
delberg, Germany, (1999).
Thiele, M., Fischer, U., & Lehner, W. (2007).
Partition-based workload scheduling in living
data warehouse environments. In Proc. ACM 10th
International Workshop on Data Warehousing and
OLAP (DOLAP 2007), pp. 57-64, Lisbon, Portugal,
November 9, 2007.
Thomsen, C., Pedersen, T., & Lehner, W. (2008).
RiTE: Providing On-Demand Data for Right-Time
Data Warehousing. Proceedings of the 24th Interna-
tional Conference on Data Engineering (ICDE 2008),
pp 456 – 465, April 7-12, 2008, Cancun, Mexico.
TPC. The TPC-DS benchmark. Transaction Process-
ing Council. Available at http://www.tpc.org/tpcds/
default.asp (last accessed at 24 July 2008)
TPC. The TPC-H benchmark. Transaction Process-
ing Council. Available at http://www.tpc.org/ (last
accessed at 24 July 2008)
Trujillo, J., & Luján-Mora, S. (2003). A UML Based
Approach for Modeling ETL Processes in Data
Warehouses. In Proceedings of 22nd International
Conference on Conceptual Modeling (ER 2003), pp.
307-320, Chicago, IL, USA, October 13-16, 2003.
Tziovara, V., Vassiliadis, P., & Simitsis, A. (2007).
Deciding the Physical Implementation of ETL Work-
flows. Proceedings ACM 10th International Workshop
on Data Warehousing and OLAP (DOLAP 2007), pp.
49-56, Lisbon, Portugal, 9 November 2007.
Vassiliadis, P., & Simitsis, A. (2009). Extraction-
Transformation-Loading. In Encyclopedia of Da-
tabase Systems, L. Liu, T.M. Özsu (eds), Springer,
2009.
Vassiliadis, P., Karagiannis, A., Tziovara, V., &
Simitsis, A. (2007). Towards a Benchmark for ETL
Workows. 5th International Workshop on Quality
in Databases (QDB 2007), held in conjunction with
VLDB 2007, Vienna, Austria, 23 September 2007.
Vassiliadis, P., Quix, C., Vassiliou, Y., & Jarke,
M. (2001). Data Warehouse Process Management.
Information Systems, 26, 3, pp. 205-236
Vassiliadis, P., Simitsis, A., & Skiadopoulos, S.
(2002). Modeling ETL activities as graphs. In Proc.
4th International Workshop on the Design and Man-
agement of Data Warehouses (DMDW 2002), held in
conjunction with the 14th Conference on Advanced
Information Systems Engineering (CAiSE’02), pp.
52-61, Toronto, Canada, May 27, 2002.
Vassiliadis, P., Simitsis, A., & Skiadopoulos, S.
(2002). Conceptual Modeling for ETL Processes.
In Proc. ACM 5th International Workshop on Data
Warehousing and OLAP (DOLAP 2002), McLean,
VA, USA November 8, 2002.
Vassiliadis, P., Simitsis, A., Georgantas, P., & Ter-
rovitis, M. (2003). A Framework for the Design of
ETL Scenarios. In Proc. 15th Conference on Advanced
Information Systems Engineering (CAiSE 2003),
pp. 520- 535, Klagenfurt/Velden, Austria, 16 - 20
June, 2003.
Vassiliadis, P., Simitsis, A., Georgantas, P., Ter-
rovitis, M., & Skiadopoulos, S. (2005). A generic
and customizable framework for the design of ETL
scenarios. Information Systems, 30, 7, 492-525.
Vassiliadis, P., Simitsis, A., Terrovitis, M., & Skia-
dopoulos, S. (2005). Blueprints for ETL workows.
In Proc. 24th International Conference on Con-
ceptual Modeling (ER 2005), pp. 385-400, 24-28
October 2005, Klagenfurt, Austria. Long version
available at
http://www.cs.uoi.gr/~pvassil/publications/2005_
ER_AG/ETL_blueprints_long.pdf (last accessed at
25 July 2008)
Panos Vassiliadis received the PhD degree from the National Technical University of Athens in 2000. Since
2002, he has been with the Department of Computer Science, University of Ioannina, Greece, where he is
also a member of the Distributed Management of Data (DMOD) Laboratory (http://www.dmod.cs.uoi.gr).
His research activity and published work concerns the area of data warehousing, with particular emphasis
on metadata, OLAP, and ETL, as well as the areas of database evolution and web services. He is a member
of the ACM, the IEEE, and the IEEE Computer Society.
... This requires good information management capabilities to integrate, extract, transform, and access transactional data (Davenport & Harris, 2007). Whether we are speaking about the Knowledge Discovery in Databases (KDD) process in datamining (Fayyad et al., 1996) (Figure 3) or of the extract, transform and load (ETL) process in Business Intelligence & Analytics (Vassiliadis, 2009), the base of qualitative analytics is materialised by the analytical database on which visualisations, reports, Machine Learning Training, etc. for the data consumers can be built. Figure 3 Overview of the steps constituting the KDD process (Fayyad, Piatetsky-Shapiro et al. 1996). ...
Conference Paper
Higher Education Institutes (HEIs) are undergoing a profound and widespread digital transformation with the introduction of digital technologies in education. The introduction of digitised educational environments produced huge data repositories that could serve learning analytics, for the purposes of understanding and optimising learning and the environments in which it occurs. Where the digital transformation of education made learning analytics better and more qualitative, learning analytics leverages the digital transformation of HEIs by taking better informed actions.This paper elaborates on the process of developing a conceptual learning analytics platform at KU Leuven (Belgium). Where the project started off as a pure data project, gradually several preconditions emerged, which caused a lot of side-tracking. The introduction of learning analytics needs a digital transformation of education. At the same time the digital transformation of education needs qualitative and quantitative learning analytics to take it to the next level. As if Learning Analytics and digital transformation live in symbiosis with each other.
... W jednym z wywiadów Justin Marsico, Chief Data Officer of the Bureau of the Fiscal Service 58 stwierdził, że wstępne wyniki badania wskazują, iż AI pozwala zaoszczędzić czas, zoptymalizować zasoby pracy, ułatwić proces wizualizacji danych, a także wykrywać oszustwa związane z zarządzaniem systemem treasury warrants. Dodał, że następne działania będą związane z ułatwieniem wyszukiwania danych w witrynach publicznych, integracją danych w chmurze w celu zwiększenia zdolności do automatyzacji procesu ETL (Extract-Transform-Load) 59 , który stanowi najistotniejszy element wdrożenia Business Intelligence 60 . ...
Article
Full-text available
In this article, the authors assesses the scope of the use of artificial intelligence technology in the management of the US federal public finance system in the years 2019–2022, with particular emphasis on public finance control. This assessment is indispensable in answering the question of whether artificial intelligence, which is a relatively new technology, is used in the area in question, and to what extent, and most importantly whether, it enables the US federal administration to control public finances. In order to answer this question, a holistic review was carried out of American law standards from the last decade and government documents that directly regulate the issue of artificial intelligence, including the definition of artificial intelligence adopted for the first time in American law, but also a number of other hard law and soft law standards. Law files that are directly or indirectly related to AI. The above allowed the determination of the scope of the political and legal approach to artificial intelligence in the United States of America, as well as the extraction of many conclusions about the use of artificial intelligence and the challenges facing its development. In the opinion of the author, the added value of this publication is the preliminary answer to the question of whether artificial intelligence can change the paradigm of the essence of managing the public finance system.
Article
Objective: This article proposes a framework to support the scientific research of standards so that they can be better measured, evaluated, and designed. Methods: Beginning with the notion of common models, the framework describes the general standard problem-the seeming impossibility of creating a singular, persistent, and definitive standard which is not subject to change over time in an open system. Results: The standard problem arises from uncertainty driven by variations in operating context, standard quality, differences in implementation, and drift over time. As a result, fitting work using conformance services is needed to repair these gaps between a standard and what is required for real-world use. To guide standards design and repair, a framework for measuring performance in context is suggested, based on signal detection theory and technomarkers. Based on the type of common model in operation, different conformance strategies are identified: (1) Universal conformance (all agents access the same standard); (2) Mediated conformance (an interoperability layer supports heterogeneous agents); and (3) Localized conformance (autonomous adaptive agents manage their own needs). Conformance methods include incremental design, modular design, adaptors, and creating interactive and adaptive agents. Discussion: Machine learning should have a major role in adaptive fitting. Research to guide the choice and design of conformance services may focus on the stability and homogeneity of shared tasks, and whether common models are shared ahead of time or adjusted at task time. Conclusion: This analysis conceptually decouples interoperability and standardization. While standards facilitate interoperability, interoperability is achievable without standardization.
Article
Full-text available
The aim of this article is to investigate the possibilities for tax offices in the Czech Republic to control recurrent property tax returns and the data contained therein. The hypothesis that the recurrent property tax control in the Czech Republic is unproblematic and that there is no need for any amendments in related tax law regulation has been mostly confirmed. However, there are still several minor issues (e.g. the renewal of cadastral records) where amendments to the existing legal regulation would be helpful There are also other much more problematic issues in the property tax regulation to be improved. The role of the immovable property tax in the Czech Republic is mostly marginal; the revenue is very low. The paper follows the IMRaD structure; in the research part, we firstly summarise recurrent property tax regulation de lege lata from the theoretical perspective. Secondly, we describe existing practical problems of recurrent property tax control as these problematic issues have been identified from structured interviews with tax office clerks. In the discussion, we critically analyse substantive and procedural tax law concerning the recurrent property tax with regard to controls carried out by the tax office, combining the theoretical background with practical experience. Synthesising the knowledge gained, we identify the strengths and weaknesses of de lege lata regulation and suggest amendments de lege ferenda .
Preprint
Crafting, adhering to, and maintaining standards is an ongoing challenge. This paper uses a framework based on common models to explore the standard problem: the impossibility of creating, implementing or maintain definitive common models in an open system. The problem arises from uncertainty driven by variations in operating context, standard quality, differences in implementation, and drift over time. Fitting work by conformance services repairs these gaps between a standard and what is required for interoperation, using several strategies: (a) Universal conformance (all agents access the same standard); (b) Mediated conformance (an interoperability layer supports heterogeneous agents) and (c) Localized conformance, (autonomous adaptive agents manage their own needs). Conformance methods include incremental design, modular design, adaptors, and creating interactive and adaptive agents. Machine learning should have a major role in adaptive fitting. Choosing a conformance service depends on the stability and homogeneity of shared tasks, and whether common models are shared ahead of time or are adjusted at task time. This analysis thus decouples interoperability and standardization. While standards facilitate interoperability, interoperability is achievable without standardization.
Article
Full-text available
Data Pipeline is a series of actions which moves data from the one source to the destination, the complexity of Data Pipeline varies from use-case to use-case. The traditional data pipeline cleanups the data, aggregates the data and move it from one place to another, it sounds simple but it’s very complex as the organization deals with huge and complex data and the expectation from pipeline is that it should be robust, fast, notify about the status and it should do the same task repeatedly without failing. The modern data pipelines are slightly different in nature they are supposed to deal with Petabytes of data, they stores the data in various flavors of the cloud, should provide real-time data analysis. Apache Airflow is one such tool which simplifies the entire Data Pipeline creation to a great extent and the only prerequisite is the basic Python Knowledge. This paper focuses on the stock-exchange data pipeline creation by using the Airflow concepts such as DAGs and Operators.
Article
Full-text available
SYNONYMS ETL; ETL process; ETL tool; Data warehouse back stage; Data warehouse refreshment DEFINITION Extraction, Transformation, and Loading (ETL) processes are responsible for the operations taking place in the back stage of a data warehouse architecture. In a high level description of an ETL process, first, the data are extracted from the source data stores that can be On-Line Transaction Processing (OLTP) or legacy systems, files under any format, web pages, various kinds of documents (e.g., spreadsheets and text documents) or even data coming in a streaming fashion. Typically, only the data that are different from the previous execution of an ETL process (newly inserted, updated, and deleted information) should be extracted from the sources. After this phase, the extracted data are propagated to a special-purpose area of the warehouse, called the Data Staging Area (DSA), where their transformation, homogenization, and cleansing take place. The most frequently used transformations include filters and checks to ensure that the data propagated to the warehouse respect business rules and integrity constraints, as well as schema transformations that ensure that data fit the target data warehouse schema. Finally, the data are loaded to the central data warehouse (DW) and all its counterparts (e.g., data marts and views). In a traditional data warehouse setting, the ETL process periodically refreshes the data warehouse during idle or low-load, periods of its operation (e.g., every night) and has a specific time-window to complete. Nowadays, business necessities and demands require near real-time data warehouse refreshment and significant attention is drawn to this kind of technological advancement. HISTORICAL BACKGROUND Despite the fact that ETL obtained its separate existence during the first decade of the 21st century, ETL processes have been a companion to database technology for a lengthier period of time –in fact, from the beginning of its existence. During that period, ETL software was just silently hidden as a routine programming task without any particular name or individual importance. ETL was born on the first day that a programmer constructed a program that takes records from a certain persistent file and populates or enriches another file with this information. Since then, any kind of data processing software that reshapes or filters records, calculates new values, and populates another data store than the original one is a form of an ETL program. The earliest form of ETL system goes back to the EXPRESS system [13] that was intended to act as an engine that produces data transformations given some data definition and conversion nonprocedural statements. In later years, during the early days of data integration, the driving force behind data integration were wrapper-mediator schemes; the construction of the wrappers is a primitive form of ETL scripting [12]. In the mid '90's, data warehousing came in the central stage of database research and still, ETL was there, but hidden behind the lines. Popular books [3] do not mention the ETL triplet at all, although the different parts (transformation, cleansing, staging of intermediate data, and loading) are all covered -even if this is done very briefly at times (it is also noteworthy that the 3rd edition of the same book in 2003 mentions ETL, although briefly). 
At the same time, early research literature treated data warehouses as collections of materialized views since this abstraction was simple and quite convenient for the formulation of research problems.
Article
Full-text available
Materialized views and their potential have been recently rediscovered for the content of OLAP and data warehousing. A flurry of papers has been generated on how views can be used to accelerate ad-hoc computations over massive datasets. In this chapter we introduce and comment on some main-stream approaches for defining, computing, using and maintaining materialized views with aggregations in a large data warehouse.
Conference Paper
Full-text available
Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. In this paper, we delve into the logical design of ETL scenarios. We describe a framework for the declarative specification of ETL scenarios with two main characteristics: genericity and customization. Moreover, we present a palette of several templates, representing frequently used ETL activities along with their semantics and their interconnection. Finally, we discuss implementation issues and we present a graphical tool, ARKTOS II that facilitates the design of ETL scenarios, based on our model.
Conference Paper
Full-text available
In today’s integrating information systems data fusion, i.e., the merging of multiple tuples about the same real-world object into a single tuple, is left to ETL tools and other specialized software. While much attention has been paid to architecture, query languages, and query execution, the final step of actually fusing data from multiple sources into a consistent and homogeneous set is often ignored. This paper states the formal problem of data fusion in relational databases and discusses which parts of the problem can already be solved with standard Sql. To bridge the final gap, we propose the SQL Fuse By statement and define its syntax and semantics. A first implementation of the statement in a prototypical database system shows the usefulness and feasibility of the new operator.
Article
Data warehousing systems integrate information from operational data sources into a central repository to enable analysis and mining of the integrated information. During the integration process, source data typically undergoes a series of transformations, which may vary from simple algebraic operations or aggregations to complex “data cleansing” procedures. In a warehousing environment, the data lineage problem is that of tracing warehouse data items back to the original source items from which they were derived. We formally define the lineage tracing problem in the presence of general data warehouse transformations, and we present algorithms for lineage tracing in this environment. Our tracing procedures take advantage of known structure or properties of transformations when present, but also work in the absence of such information. Our results can be used as the basis for a lineage tracing tool in a general warehousing setting, and also can guide the design of data warehouses that enable efficient lineage tracing.
Article
EXPRESS is an experimental prototype data translation system which can access a wide variety of data and restructure it for new uses. The system is driven by two very high level nonprocedural languages: DEFINE for data description and CONVERT for data restructuring. Program generation and cooperating process techniques are used to achieve efficient operation. This paper describes the design and implementation of EXPRESS. DEFINE and CONVERT are summarized and the implementation architecture presented. The DEFINE description is compiled into a customized PL/1 program for accessing source data. The restructuring specified in CONVERT is compiled into a set of customized PL/1 procedures to derive multiple target files from multiple input files. Job steps and job control statements are generated automatically. During execution, the generated procedures run under control of a process supervisor, which coordinates buffer management and handles file allocation, deallocation, and all input/output requests. The architecture of EXPRESS allows efficiency in execution by avoiding unnecessary secondary storage references while at the same time allowing the individual procedures to be independent of each other. Its modular structure permits the system to be extended or transferred to another environment easily.
Conference Paper
The design of an Extract – Transform – Load (ETL) workflow for the population of a Data Warehouse is a complex and challenging procedure. In previous work, we have presented an ontology-based approach to facilitate the conceptual design of an ETL scenario. In this paper, we elaborate on this work, by investigating the application of Natural Language (NL) techniques to the ETL environment and we present a flexible and customizable template-based mechanism for generating natural language representations for the ETL process requirements and operations.
Article
Extraction–Transformation–Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. In previous work, we presented a modeling framework for ETL processes comprised of a conceptual model that concretely deals with the early stages of a data warehouse project, and a logical model that deals with the definition of data-centric workflows. In this paper, we describe the mapping of the conceptual model to the logical model. First, we identify how conceptual entities are mapped to logical entities. Next, we determine the execution order in the logical workflow using information adapted from the conceptual model. Finally, we provide a method for the transition from the conceptual model to the logical model.
Article
Previous research has provided metadata models that enable the capturing of the static components of a data warehouse architecture, along with information on different quality factors over these components. This paper complements this work with the modeling of the dynamic parts of the data warehouse. The proposed metamodel of data warehouse operational processes is capable of modeling complex activities, their interrelationships, and the relationship of activities with data sources and execution details. Moreover, the metamodel complements the existing architecture and quality models in a coherent fashion, resulting in a full framework for quality-oriented data warehouse management, capable of supporting the design, administration and especially evolution of a data warehouse. Finally, we exploit our framework to revert the widespread belief that data warehouses can be treated as collections of materialized views. We have implemented this metamodel using the language Telos and the metadata repository system ConceptBase.