Innovations in Systems and Software Engineering
https://doi.org/10.1007/s11334-019-00344-4
S.I. : CICBA 2018
Efficient incremental loading in ETL processing for real-time data
integration
Neepa Biswas · Anamitra Sarkar · Kartick Chandra Mondal
Department of Information Technology, Jadavpur University, Kolkata, India
Received: 12 February 2019 / Accepted: 8 May 2019
© Springer-Verlag London Ltd., part of Springer Nature 2019
Abstract
ETL (extract transform load) is the widely used standard process for creating and maintaining a data warehouse (DW). ETL is the most resource-, cost- and time-demanding process in DW implementation and maintenance. Nowadays, many graphical user interface (GUI)-based solutions are available to facilitate the ETL processes. In spite of the high popularity of GUI-based tools, there are still some downsides to such an approach. This paper focuses on an alternative ETL development approach taken by hand coding. In some contexts, like research and academic work, it is appropriate to go for a custom-coded solution which can be cheaper, faster and more maintainable compared to GUI-based tools. Some well-known code-based open-source ETL tools developed by the academic world have been studied in this article. Their architecture and implementation details are addressed here. The aim of this paper is to present a comparative evaluation of these code-based ETL tools. Finally, an efficient ETL model is designed to meet the near real-time requirements of the present day.
Keywords Data warehouse · Code-based ETL · ETL tools · Pygrametl · Petl · Scriptella · Incremental load · Bulk load · CDC
1 Introduction
Nowadays, data analysis has become an integral part of any
organization to achieve optimized performance. A data warehouse (DW) [1] is a wide store of data which is mainly used for analytical reporting. To construct a DW, data are generally collected from heterogeneous data sources, cleaned and restructured as per the required standard, and finally loaded into the DW. This well-known process is called ETL (extract transform load). ETL [2] is one of the important components in the DW environment. It is observed that the ETL process accounts for much of the time, cost and complexity overhead of any DW project [3].
ETL tools come as a solution provider with a user-friendly graphical user interface (GUI) to map data items between the source and target system in a fast, hassle-free manner.
Compared with developing and maintaining a custom hand-coded ETL system, it is easier and faster to select and use an ETL tool. The user needs to configure the tool as per their requirements. Many open-source (e.g., Pentaho Kettle, Talend) [4,5] and commercial (e.g., Informatica, SAS, ODI, IBM) ETL tools [6–8] come with a nice GUI which is easy to use for non-programmers. Using this type of tool, developers design the visual flow of data throughout the ETL process. One disadvantage of this kind of visual approach is that it is sometimes difficult to design a specific ETL scenario with the limited items available in the graphical tool.
Writing a few lines of code can be a better way to obtain the high flexibility needed to tackle complex transformations for any projected function, because it is trickier to drag icons, draw flow lines and set properties for a complex case design than to write one's own customizable ETL code. Here, one of the main considerations should be the productivity of the system. Using a GUI-based tool cannot assure more productivity compared to a code-based approach. Generally, ETL development is done by skilled technical people, so it is justified to go for a code-based ETL option rather than a GUI-based choice. We agree that graphical programs are effective for self-documentation and standardized features. Still, there are aspects where a code-based approach can give an effective solution, particularly when the project and infrastructure cost is low and experienced developers are available.
Some review articles [9–11] on ETL tools have been published so far. However, they typically cover commercial ETL tools available in the market. Only a high-level view is included in those works, without covering any technical details. No such work has been noticed so far regarding code-based ETL tools developed by academic people. The focus of this paper is to give an integrated analysis report on the research domain of programmable ETL systems. Afterward, a new solution model is proposed to meet the near real-time ETL demand. The target is achieved by implementing an incremental loading model with the assistance of the CDC (change data capture) approach.
The contribution of this paper is to highlight a new area of programmable ETL development techniques. For this purpose, four prominent works on code-based ETL frameworks are selected, namely Pygrametl, Petl, Scriptella and etl. An in-depth experimental evaluation is done on each tool. Subsequently, feature-wise and performance-wise analysis reports are provided. The work is continued through designing a new ETL-based data integration technique. The proposed solution makes data integration much more efficient by incrementally populating only the changed data in the DW at the right time.
The rest of the paper is organized in the following manner. Section 2 outlines the research development related to this area. Each of the evaluated ETL tools is discussed with its unique features in Sect. 3. Section 4 states the experimental details and results, followed by a comparative performance analysis report. In Sect. 5, incremental loading and CDC techniques are first reviewed. Afterward, the proposed ETL model is discussed with algorithmic details. Section 6 holds the result and discussion. Finally, Sect. 7 summarizes the overall contribution of this article with future directions.
2 Related work
A number of commercial and open-source data integration
tools are available in the market. Every year, Gartner Inc.
publishes a market research report [12] on these tools, where
Informatica, IBM, SAS, SAP, Oracle and Microsoft are sug-
gested as leading commercial tools and Talend and Pentaho
are the open-source challengers in the market. All those tools
offer GUI-based ETL process design.
Thomsen and Pedersen [13] have done a survey on open-
source business intelligence tools. An overview of ETL tool characteristics is discussed there without any performance
comparison.
A detailed survey is done by Vassiliadis [2] which mainly
addresses research work in each stage of the ETL process
with some academic ETL tools (Ajax, Arktos, Potter’s Wheel
and HumMer Fusion). These tools mainly offer data cleaning
or workflow design tasks. Following that work, Vassiliadis
et al. addressed three commercial ETL tools (SQL SSIS,
OWB and DataStage) and made a taxonomy of distinct ETL
characteristics in article [14].
The focus of this article is programmable ETL tools.
Some surveys have been conducted as comparative studies of popular GUI-based ETL tools which are available in the market [9–11,15]. But no survey covering experimental analysis work has been found in this area from which a feature- and performance-based overview of code-based ETL tools can be evaluated. For this purpose, we have selected four code-based ETL tools that are well accepted by the academic world. Two Python-based tools, Pygrametl and Petl, one Java-based tool, Scriptella, and one R tool, etl, are evaluated. The architectural and characteristic overviews of these tools are discussed in the next section.
Regarding incremental loading in ETL, an algebraic optimization of incremental DW maintenance is discussed in the literature [16]. Zhang et al. [17] have proposed automated generation of incremental loading. A series of works targeting the incremental loading process has been done by Jörg et al. [18]. Article [19] deals with the derivation of ETL jobs for incremental loading. An optimized approach is discussed in the literature [20]. An efficient real-time snapshot maintenance proposal is given in article [21].
In the context of CDC, some research works based on the log-based method are introduced in the literature [22,23] for improving the quality of data. There is a Web service-based approach [24] for modeling an efficient CDC mechanism. A high-performance join on changed data is applied in the literature [25]. A trigger-based approach for capturing changed data from different data sources is mentioned in articles [26,27]. The following section will cover a basic overview of the selected code-based ETL tools, including some comparative feature analyses.
3 ETL tools overview
The limited degree of customization available for modeling and integrating extension environments in GUI-based ETL tools has led many organizations to go for programmable solutions for the ETL process. In this paper, some code-based ETL tools are selected. All these tools are open-source, and none of them offers a graphical user interface.
3.1 Pygrametl
Pygrametl [28,29] is an open-source Python-based ETL framework first released in 2009. The software is licensed under BSD and has been continuously updated since its release. Pygrametl [30] suggests performing ETL tasks by writing Python code. It offers some commonly used ETL functionality to populate data in a DW. PostgreSQL, MySQL and Oracle are the supported databases. Seamless integration of any new kind of data source can be done using merge join, hash join and union source functions. Both batch and bulk loading can be performed as per the requirement.
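To make the description concrete, the following is a minimal sketch (not taken from the paper) of how a pygrametl job typically looks; it uses pygrametl's documented ConnectionWrapper, Dimension and FactTable classes, while the table, column and connection names are hypothetical.

```python
# Minimal pygrametl sketch: fill one dimension and one fact table from a CSV
# source. All table/column names and credentials are illustrative only.
import psycopg2
import pygrametl
from pygrametl.datasources import CSVSource
from pygrametl.tables import Dimension, FactTable

pgconn = psycopg2.connect(dbname="dw", user="dwuser", password="dwpass")
conn = pygrametl.ConnectionWrapper(connection=pgconn)

bookdim = Dimension(name="book_dim",
                    key="book_id",
                    attributes=["title", "genre"],
                    lookupatts=["title"])

salesfact = FactTable(name="sale_fact",
                      keyrefs=["book_id"],
                      measures=["sale"])

with open("sales.csv") as f:
    for row in CSVSource(f, delimiter=","):
        row["book_id"] = bookdim.ensure(row)   # lookup-or-insert the dimension row
        row["sale"] = int(row["sale"])
        salesfact.insert(row)

conn.commit()
conn.close()
```

The ensure call is what gives pygrametl its compactness: it combines the dimension lookup and insert into a single step.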
3.2 Petl
Petl [31] is a general-purpose Python package which is able to perform the conventional tasks of ETL. The package is distributed under the MIT License. Petl provides support for both object-oriented and functional programming styles. Petl can handle a wide range of data sources, with structured files like CSV and text and semi-structured files like XML, JSON, etc. PyMySQL, PostgreSQL and SQLite are three databases compatible with this package. Petl supports most of the transformation patterns required in any ETL process. Besides, utility functions such as timing, materialized views, lookups, etc., provide extra benefit to the developer. Any third-party package can easily be added within it. The ETL data flow is synchronized using ETL pipelines. However, it does not have an SCD or parallelism handling mechanism.
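As an illustration only, a petl pipeline of the kind described above might look like the following; the file and column names are hypothetical, and the chained functions (fromcsv, convert, select, rename, tocsv) belong to petl's standard API.

```python
# Minimal petl sketch: read a CSV, clean and filter it, write the result out.
# petl tables are lazy; nothing is executed until tocsv() materializes the flow.
import petl as etl

table = etl.fromcsv("sales.csv")                      # CSV source
table = etl.convert(table, "sale", int)               # cast a column to int
table = etl.select(table, lambda rec: rec.sale > 0)   # drop invalid rows
table = etl.rename(table, "store", "store_name")      # rename a column

etl.tocsv(table, "sales_clean.csv")                   # execute the pipeline
```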
3.3 Scriptella
Scriptella [32,33] is another script-based ETL tool, written in Java and licensed under Apache Version 2.0. Plain SQL queries are executed using a JDBC bridge in this scripting language. Non-JDBC connections can be added using mixed SQL scripts. XML scripts are used to describe the various ETL tasks, while SQL or other scripting languages can be used for transformation purposes. The main application is focused on executing scripts written in SQL, JEXL, JavaScript and Velocity for the purpose of ETL operations to/from various databases as well as file formats like text, CSV, XML, LDAP, etc. A thin wrapper created by the XML script gives extra facility to build dynamic SQL scripts.
3.4 R_etl
Nowadays, R is a promising language which is gaining popu-
larity in the field of data science. A newly developed package
for R [34] named etl is selected for this piece of work. It
is licensed under CC0 with version 0.3.7 and available in
CRAN [35,36]. It provides a pipeable framework to execute
core ETL operations suitable for working with medium-size
data. RPostgreSQL and RSQLite are the DBI drivers for R
which are compatible with this package. It is suitable for handling data which can reside either in a local or a remote database. Database creation and management can be done without having any expertise in SQL. Only some basic
ETL functionalities are enabled here.
Table 1 Feature comparison matrix on selected code-based ETL tools
Specifications Pygrametl Petl Scriptella R_etl
Easy usability N Y Y Y
Data centric Y Y Y Y
SOA-enabled N N N N
Reusable function Y Y N Y
Single install Y N N N
Big data handle N Y Y N
Data segregation Y Y Y N
Real-time triggers N N N N
Unstruct. data support N Y Y N
Multiple source join Y Y Y N
Complex transform N Y N N
Data validations N Y N Y
SCD support Y N N N
Parallelism support Y N N N
Bulk load Y N Y N
Data pipeline N Y N Y
Easy data map Y Y Y Y
Lookup support N N Y N
Code reusability Y Y N Y
Exception handling Y Y Y N
Proper documentation Y Y N Y
Third-party dependency N Y Y Y
Version control Y Y Y Y
Deploy in cloud N N N N
Licensed Y Y Y Y
Community forum Y Y Y Y
The comparison matrix given in Table 1 represents a brief feature-wise overview of these tools. These are general characteristics which can be taken as criteria before evaluating any ETL tool.
4 Experimental analysis
The different functionalities offered by these tools make it difficult to create a ranking, because each of them has some special type of feature. So, the respective use case is the main point in choosing any tool. The general specifications of these tools were discussed in the previous section. This section will discuss the performance evaluation of each tool based on its characteristics.
4.1 Performance analysis of code-based ETL tools
For evaluation purposes, all the selected ETL tools have been deployed on a local machine. The hardware and software specifications of the machine are given below.
Hardware specifications Processor: Intel(R) Core(TM) i5-6500 CPU @ 3.20 GHz; installed memory (RAM): 4.00 GB; system type: 64-bit operating system, x64-based processor; operating system: Windows 8.1.
Software specifications The above-mentioned ETL tools have been installed on the machine with the following specifications.
Pygrametl Version 2.6 has been installed with the PostgreSQL database and MySQLdb for deployment. Python 3.6 is employed in the Spyder IDE.
Petl Version 1.1.1 is used. It does not have any installation dependencies. SQLAlchemy and PostgreSQL databases have been used. Here also, Python 3.6 is used in Spyder.
Scriptella Version 1.1 has been installed along with the HSQLDB database.
R_etl The etl package version 0.3.7 and PostgreSQL are installed along with the DBI driver RPostgreSQL. RStudio is utilized as the IDE along with R version 3.4.4.
Before starting the assessment of any tool, some criteria should be selected. We have evaluated the performance of the four code-based ETL tools on the basis of four criteria: execution time, transformation support, throughput and code length. After evaluation, the results have been graphically represented to justify the performance analysis. Here,
in the figures below, P1 represents pygrametl, P2 represents
petl, S represents Scriptella and R represents R_etl tool.
4.2 Execution timing
Quantifying the amount of time taken to execute an ETL process is the most important criterion for analyzing any tool. The execution time taken by each tool is collected and plotted in a graph, as shown in Fig. 1. Of the two source input files, one contains 120 rows and the other contains 8 rows which need to be moved to the DW. The execution time is measured in seconds. It is observed that R_etl is much more efficient than the other three options with respect to execution time.

Fig. 1 Execution timing
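For reference, the execution time and throughput of a script-based ETL run can be captured with Python's standard library alone; the sketch below is illustrative, and run_etl_job is a hypothetical placeholder for a tool-specific job.

```python
# Illustrative timing harness: measures wall-clock execution time and derives
# throughput as rows processed per second, as in the evaluation above.
import time

def run_etl_job():
    """Hypothetical placeholder for one tool's extract-transform-load code.
    It should return the number of rows loaded into the DW."""
    return 128   # e.g., the 120 + 8 source rows used in this experiment

start = time.perf_counter()
rows_processed = run_etl_job()
elapsed = time.perf_counter() - start

print(f"execution time: {elapsed:.3f} s")
print(f"throughput: {rows_processed / max(elapsed, 1e-9):.1f} rows/s")
```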
4.3 Transformation support
Performing the required transformations is the most important task in the ETL process. The generation of correct data depends upon the successful completion of the transformation process. In this study, a list of the transformation methods supported by each tool has been identified. Based on the range of transformation support provided by each tool, a comparative graph is drawn, as shown in Fig. 2. It is observed that Petl provides the widest variety of transformations.
4.4 Throughput
To evaluate any system's efficiency, throughput is one of the important metrics. Here, the throughput of each ETL tool is measured after deploying all four systems on premise. Figure 3 presents their throughput performance. The throughput is measured with respect to the number of rows processed per second. The performance of R_etl is better compared to the other tools.
Fig. 2 Transform support
Fig. 3 Throughput rate
Fig. 4 Line of code
4.5 Line of code
This is a measure of how many lines of code are required to accomplish the total ETL process. The comparative graph presented in Fig. 4 shows the lines of code required by each tool to establish the task. An approximate value is considered for this case. Fewer lines of code mean that an ETL implementation can be written more easily and in less time. From the graph, it is visible that etl using R requires fewer lines of code than the others.
5 Modeling ETL jobs
The main objective of this paper is to design a model for a code-based ETL tool by which the data flow and latency can be reduced, achieving near real-time ETL processing in the DW environment through incremental loading utilizing the CDC technique. This approach can noticeably cut down the ETL processing latency by reducing the amount of data propagated throughout the extraction, transformation and loading stages of the ETL workflow.
5.1 Incremental loading
The loading task in the DW is executed as a background process at certain time intervals. An initial load is performed to populate the DW for the first time. Afterward, if any modified or new data arrive at the source side, DW refreshing is done by a full reloading process to bring the repository up to date. But nowadays, with increasing data size, high data rates and complex data structures, full reloading becomes inefficient and inadequate. The more practical approach is to continually update the DW with the changes made in the source data since the previous reload. This approach is termed incremental loading [18,19]. This type of loading is more efficient compared to a full load. The main idea of incremental loading is to capture only the changes that happen to the source data and propagate them to the DW.
5.2 Change data capture (CDC)
CDC is a real-time data integration approach which identifies and captures changes that occur to data sources and delivers only the changed data to the operational system [37]. This approach does not need DW downtime or ETL batch windows. Some CDC technologies operate in batch mode with a pulling technique, meaning that the ETL tool periodically receives a batch of all new changes made since the last retrieval and executes them. Real-time CDC solutions apply a continuous streaming "push" approach for delivering data.
The advantage of CDC is that the latency can be cut down to minutes or even seconds, which makes the data instantly available and eliminates the use of batch windows. Besides, it minimizes the amount of data flow; therefore, the resource requirement is minimized and the data flow speed and efficiency are maximized. CDC addresses business needs like building operational data stores (ODS), business activity monitoring (BAM), application integration, real-time dashboards, data quality improvement, etc. There are several techniques by which CDC [38,39] can be implemented for detecting the change.
Transactional log Most data sources maintain a change log which keeps a record of all changes performed on them. This log keeps track of the date and time of the last modification in the table. This transactional log can be used to capture the change.
Database log scraping and log sniffing The former technique takes a snapshot of the transaction logs maintained by the database system for backup and recovery at a scheduled time. The latter involves "polling" of the active log file and identifies changes in real time.
Snapshot differential During the extraction phase, a snapshot of the complete source table is taken. Changed data can be identified by comparing consecutive snapshots [40,41].
Timestamped index The operational system often maintains a timestamp column for keeping track of the last update. This column, referred to as an audit column, gets a new timestamp for any modification of a tuple. These audit columns can be used to identify new changes since the last loading cycle, as illustrated in the sketch at the end of this subsection.
Database trigger This is a special type of activity in a database system which is fired based on some predefined functions [26]. Triggers can be set on every (add, delete and update) event for finding new data. The output of the trigger program, which is stored in another file, can be used for extracting data.
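As a small illustration of the timestamped-index technique described above (not code from the paper), the sketch below pulls only the rows whose hypothetical last_modified audit column is newer than the previous load, using psycopg2 as the driver; the table, column and credential names are assumptions.

```python
# Hypothetical audit-column CDC sketch: extract only rows changed since the
# last successful load, then advance the high-water mark for the next cycle.
import datetime
import psycopg2

def extract_changes(conn, last_loaded):
    """Return source rows whose audit column is newer than last_loaded."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT * FROM source_table WHERE last_modified > %s",
            (last_loaded,),
        )
        return cur.fetchall()

conn = psycopg2.connect(dbname="source_db", user="etl", password="secret")
last_loaded = datetime.datetime(2019, 1, 1)        # stored from the previous cycle
changed_rows = extract_changes(conn, last_loaded)
# ... transform changed_rows and load them into the DW ...
last_loaded = datetime.datetime.now()              # new high-water mark
```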
5.3 Proposed approach
For implementing incremental loading, the proposed work
is divided into three parts. In the first part of the work,
snapshot-based CDC is implemented which captures only the
changes in data from the input dataset and loads them into
the data warehouse. An algorithm for dimension process-
ing is described in the second part. Finally, in the third part,
an algorithm for fact processing is represented in a formal
way.
For the algorithm, four database tables and one input dataset are taken. The first table is the main table, which contains the previously loaded data. The second and third tables are dimension tables which will be used at the time of dimension processing. The fourth one is a fact table Tab_fact, which will be used at the time of fact processing. Let us assume that the name of the main table is Tab_main and the names of the two dimension tables are Tab1_dim and Tab2_dim. Now, let us focus on the input dataset. It is considered as a new record New_data which needs to be extracted from the source and loaded into the main table.
Algorithm 1: Change Data Capture Algorithm
Result: DataFrame of new, updated and deleted records
1   DF_new ← Take New_data dataset as input from machine;
2   DF_Tabmain ← Fetch Tab_main table from database;
3   if DF_Tabmain is empty then
4       Load DF_new in the database;
5   else
6       IN_df ← Concat(DF_new, DF_Tabmain, DF_Tabmain).drop_duplicate();
7       UP_temp ← Concat(DF_new, DF_Tabmain, DF_Tabmain).drop_duplicate(subset=key);
8       UP_df ← Concat(IN_df, UP_temp).drop_duplicate();
9       DEL_df ← Concat(DF_new, DF_new, DF_Tabmain).drop_duplicate(subset=key);
10  end
For Algorithm 1, at first the New_data input dataset is taken as a pandas DataFrame DF_new. Next, the previously loaded record of the Tab_main table is fetched into a DataFrame DF_Tabmain from the database. If the main table DF_Tabmain is empty, the data from the DataFrame DF_new are loaded into the Tab_main table using bulk load.
If DF_Tabmain is not empty, then the changes occurring in the New_data dataset are determined. For this purpose, DF_new, DF_Tabmain and DF_Tabmain are concatenated. Then, duplicate rows are deleted and the DataFrame IN_df of new elements is stored as the new record.
Afterward, for getting the updated values, DF_new, DF_Tabmain and DF_Tabmain are again concatenated on the key. Then, duplicate rows are removed and the result is stored in a temporary DataFrame UP_temp. Next, IN_df and UP_temp are concatenated. After deleting the duplicates, the final updated DataFrame UP_df is established. At the last step, duplicate records are deleted and the DataFrame of deleted elements DEL_df is obtained.
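A possible pandas realization of Algorithm 1 is sketched below. It is illustrative, not the authors' code: table names, the key column and the connection string are hypothetical, both frames are assumed to share the same schema, and drop_duplicates(keep=False) is assumed so that the concatenation behaves as a set difference.

```python
# Pandas sketch of the snapshot-based CDC of Algorithm 1.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://etl:secret@localhost/dw")
key = ["book_id"]                                  # hypothetical business key

df_new = pd.read_csv("new_data.csv")               # New_data dataset
df_main = pd.read_sql_table("tab_main", engine)    # previously loaded Tab_main

if df_main.empty:
    # Initial load: no change detection needed.
    df_new.to_sql("tab_main", engine, if_exists="append", index=False)
else:
    # Rows of df_new not present identically in df_main (new or changed rows).
    in_df = pd.concat([df_new, df_main, df_main]).drop_duplicates(keep=False)

    # Rows of df_new whose key does not exist in df_main at all (inserts).
    up_temp = pd.concat([df_new, df_main, df_main]).drop_duplicates(
        subset=key, keep=False)

    # Changed rows whose key already exists (updates) = in_df minus up_temp.
    up_df = pd.concat([in_df, up_temp]).drop_duplicates(keep=False)

    # Rows of df_main whose key no longer appears in df_new (deletes).
    del_df = pd.concat([df_new, df_new, df_main]).drop_duplicates(
        subset=key, keep=False)
```

Repeating a frame in the concatenation makes its rows appear at least twice, so keep=False removes them and only the one-sided rows survive, which is exactly the set difference the algorithm relies on.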
5.4 Dimension processing algorithm
Algorithm 2: Dimension Processing Algorithm
Result: Dimension Table
1   DF_Dimnew ← Fetch dimension attributes from recently updated main table Tab_main from database;
2   DF_Dimold ← Fetch old dimension table Tab1_dim from database;
3   DIM_df ← Concat(DF_Dimnew, DF_Dimold, DF_Dimold).drop_duplicate();
4   Load DIM_df DataFrame in old dimension table Tab1_dim of the database;
For Algorithm 2, in the first step, the dimension attributes are fetched from the recently updated main table Tab_main into a DataFrame DF_Dimnew. Consider it as the new record. In the next step, the old dimension table Tab1_dim is fetched into a pandas DataFrame DF_Dimold from the database.
Afterward, for getting the changes in the dimension table, the DataFrame DF_Dimnew is concatenated with DF_Dimold. After dropping the duplicates, the changed data are loaded into the database dimension table Tab1_dim.
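Under the same assumptions as the previous sketch (hypothetical names, shared schema, keep=False semantics), Algorithm 2 could be expressed in pandas as follows.

```python
# Pandas sketch of the dimension processing of Algorithm 2.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://etl:secret@localhost/dw")
dim_attrs = ["book_id", "title", "genre"]          # hypothetical dimension attributes

# New dimension values taken from the freshly updated main table.
df_dim_new = pd.read_sql_table("tab_main", engine)[dim_attrs]

# Current contents of the dimension table.
df_dim_old = pd.read_sql_table("tab1_dim", engine)

# Keep only the dimension rows that are not already present.
dim_df = pd.concat([df_dim_new, df_dim_old, df_dim_old]).drop_duplicates(keep=False)

# Append the new/changed rows to the dimension table.
dim_df.to_sql("tab1_dim", engine, if_exists="append", index=False)
```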
5.5 Fact processing algorithm
Algorithm 3: Fact Processing Algorithm
Result: Fact Table
1   DF_Tabnew ← Fetch recently updated main table Tab_main from database;
2   DF_Dim1new ← Fetch recently updated dimension table Tab1_dim from database;
3   DF_Dim2new ← Fetch another recently updated dimension table Tab2_dim from database;
4   DF_Factold ← Fetch fact table Tab_fact with old data from database;
5   Fact1_df ← Merge(DF_Tabnew, DF_Dim1new) using common attribute Com_attr1;
6   Fact2_df ← Merge(Fact1_df, DF_Dim2new) using common attribute Com_attr2;
7   Fact3_df ← Only select the key attributes from Fact2_df;
8   Fact_df ← Concat(Fact3_df, DF_Factold, DF_Factold).drop_duplicate();
9   Load Fact_df DataFrame in the old fact table Tab_fact of the database;
Algorithm 3 describes the fact processing part of the approach. At first, the recently updated main table Tab_main is fetched into a DataFrame DF_Tabnew from the database. It is the new record. In the next two steps, the two recently updated dimension tables Tab1_dim and Tab2_dim are fetched into DataFrames DF_Dim1new and DF_Dim2new from the database. In the next step, the fact table Tab_fact with the old records is fetched.
Now, for determining the changes and updating them in the fact table Tab_fact, first DF_Tabnew and DF_Dim1new are merged with respect to a common attribute Com_attr1. The resulting DataFrame is Fact1_df. Then, Fact1_df and DF_Dim2new are merged with respect to a common attribute Com_attr2, with the resulting DataFrame Fact2_df. From Fact2_df, we need to select the key attributes which can reference the dimension tables and make another DataFrame Fact3_df. Then, Fact3_df and two copies of DF_Factold are concatenated and the duplicates are dropped from the resulting DataFrame. In the last step, the changed data are loaded into the fact table Tab_fact.
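Finally, a pandas sketch of Algorithm 3 under the same assumptions; the join attributes (com_attr1, com_attr2), key columns and measure are hypothetical and must match the layout of the existing fact table for the set-difference step to work.

```python
# Pandas sketch of the fact processing of Algorithm 3.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://etl:secret@localhost/dw")

df_tab_new = pd.read_sql_table("tab_main", engine)   # recently updated main table
df_dim1_new = pd.read_sql_table("tab1_dim", engine)  # refreshed dimension 1
df_dim2_new = pd.read_sql_table("tab2_dim", engine)  # refreshed dimension 2
df_fact_old = pd.read_sql_table("tab_fact", engine)  # existing fact rows

# Join the main table with both dimensions on their common attributes.
fact1_df = df_tab_new.merge(df_dim1_new, on="com_attr1")
fact2_df = fact1_df.merge(df_dim2_new, on="com_attr2")

# Keep only the keys referencing the dimensions plus the measure.
fact3_df = fact2_df[["book_id", "time_id", "sale"]]

# Append only the fact rows that are not already in the fact table.
fact_df = pd.concat([fact3_df, df_fact_old, df_fact_old]).drop_duplicates(keep=False)
fact_df.to_sql("tab_fact", engine, if_exists="append", index=False)
```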
6 Example scenario
For our example scenario, a small DW is designed for a book store management system with stores located in various cities. It contains a single fact table and three dimension tables for book details, time of sale and location of book stores. Figure 5 shows the star schema constructed with the corresponding fact table and dimension tables.
Fig. 5 DW schema design
Dimensional modeling is an established and very popular methodology for DW design. It reflects the logical schema
design of a particular DW. Fact tables usually contain mea-
sures of business processes that are referred to as facts.
Dimension tables store textual descriptions of the business
entities. The dimension attributes are used to select, group
and aggregate facts of interest in the DW queries. Multiple
hierarchical relationships in a single table can often be rep-
resented by dimensions. Dimension tables are usually highly
denormalized.
6.1 Experimental analysis
Here, experimental results are provided to demonstrate the
advantage of incremental loading over full reloading. To
implement and maintain the DW, input data are taken from a
database table and an external CSV file. A dataset of 60,000
tuples first inserted in a database table. Then, six datasets
are created having 1, 5, 10, 20, 25, 30, 40, 45 percentage of
changed data and inserted in different steps. At each step,
equal numbers of tuples were inserted, deleted and updated.
Deletions are not propagated since historical data are kept in
the DW. The Hardware requirement to run the tests was PC
having Intel(R) Core(TM) i5-4200U (2.3GHz) processor, a
DDR2 4 GB main memory on a 1 TB hard disc, SATA stan-
dard, transfer rate of 3.0 Gbit/s and 5400 rpm (rotations per
minute). The equipment had Microsoft Windows 8.1 system
software, Anaconda Python 3.6 compiler and PostgreSQL
10.4 DBMS.
6.2 Results and discussion
Fig. 6 Time comparison between full load and incremental load

The tests were done in two stages. At the first stage, the time to compute change data for full reloading is measured. A fictitious database containing a single table was used. This
table contains 6000 tuples. Then, the previously mentioned
six datasets are given as an input for the time measurement
to compute change data for DW refreshment. At the second
stage, the time to compute change data for incremental load-
ing is measured. We followed exactly the same steps as in the first stage to measure the time to compute change data for DW refreshment.
A comparative graph is shown in Fig. 6 for both full load and incremental load using the CDC technique. As expected, a full reload is considerably slower than incremental loading, where only the changes are captured and loaded into the data warehouse. Incremental loading clearly outperforms full reloading. The performance of the proposed method is not affected even when the source relation changes dramatically.
The aim of the proposed model is to reduce the processing overhead throughout the ETL stages and also to minimize the time complexity compared to the full loading technique. The performance of incremental loading proves quite good compared to full loading. It is observed that the CPU time grows linearly with the size of the dataset in both cases. From Fig. 6, it is clearly visible how the incremental method outperforms full reloading of data into the data warehouse. The result is expected because when we perform full reloading, we need to load the whole table containing the changed data as well as the old data into the data warehouse. For incremental loading, only the changed data are identified and loaded into the data warehouse, so the size of the data which needs to be loaded becomes much smaller. For this reason, incremental loading proves to be very efficient for refreshing the DW in near real time.
7 Conclusion
At present, the requirement of handling a continuously increasing amount of data within more complex environments is a great challenge in the research domain. It demands a standardized ETL process, which has a great business impact on the BI industry. Most organizations opt for a vendor-made GUI-based product for their ETL solution. But still, in some cases custom-coded ETL can be the best option in terms of performance. This paper has chosen the second option for the ETL solution. Four promising code-based ETL tools, namely Pygrametl, Petl, Scriptella and etl, have been chosen in this paper. Both performance- and feature-based analyses are presented for these tools.
The main contribution of this work is the modeling of a near real-time ETL process through incremental loading. However, the development of ETL jobs for incremental loading is not at all well supported by existing ETL tools. In fact, separate ETL jobs for initial loading and incremental loading have so far had to be created by ETL programmers. Since such hand-written incremental load jobs are considerably more complex and error prone, the proposed approach proves much better compared to the former practice.
Future work aims to design a unified code-based ETL framework supporting relational and NoSQL databases with a rich transformation library hosted on a cloud platform. We will also focus on advanced transformation operators such as aggregation and outer joins, and on data restructuring such as pivoting. Further, we plan to utilize the staging area in a better way; CDC limitations can be reduced to some extent by utilizing the staging area. Moreover, we expect performance improvements from persisting intermediary results.
References
1. Inmon W (2005) Building the data warehouse. Wiley, New York
2. Vassiliadis P (2009) A survey of extract—transform—load tech-
nology. Int J Data Warehous Min 5(3):1–27
3. Eckerson W, White C (2003) Evaluating ETL and data integration
platforms. Report of The Data Warehousing Institute 184
4. Data integration. http://www.pentaho.com/product/data-
integration. Accessed 06 Feb 2018
5. Data integration: talend enterprise data integration services.
http://www.talend.com/products/data-integration. Accessed 06
Feb 2018
6. Data integration tools and software solutions | informatica
India. https://www.informatica.com/in/products/data-integration.
html. Accessed 06 Feb 2018
7. Oracle Data Integrator. http://www.oracle.com/technetwork/
middleware/data-integrator/overview/index.html. Accessed 06
Feb 2018
8. IBM, InfoSphere Information Server. http://www-03.ibm.com/
software/products/en/infosphere-information-server/. Accessed
06 Feb 2018
9. Schmidt N, Rosa M, Garcia R, Molina E, Reyna R, Gonzalez J
(2011) Etl tool evaluation—a criteria framework. University of
Texas-Pan American, Texas
10. Majchrzak TA, Jansen T, Kuchen H (2011) Efficiency evaluation
of open source ETL tools. In: Proceedings of the 2011 ACM sym-
posium on applied computing. ACM, pp 287–294
11. Pall AS, Khaira JS (2013) A comparative review of extrac-
tion, transformation and loading tools. Database Syst J BOARD
4(2):42–51
12. 2017 Gartner magic quadrant for data integration tools. https://www.informatica.com/in/data-integration-magic-quadrant.html. Accessed 06 Dec 2017
13. Thomsen C, Pedersen T (2005) A survey of open source tools for
business intelligence. In: International conference on data ware-
housing and knowledge discovery. Springer, pp 74–84
14. Vassiliadis P, Simitsis A, Baikousi E (2009) A taxonomy of ETL
activities. In: Proceedings of the ACM twelfth international work-
shop on Data warehousing and OLAP. ACM, pp 25–32
15. Kabiri A, Chiadmi D (2013) Survey on ETL processes. J Theor
Appl Inf Technol 54(2):219–229
16. Labio W, Yang J, Cui Y, Garcia-Molina H, Widom J (1999) Performance issues in incremental warehouse maintenance. In: Proceedings of the 26th international conference on very large data bases (VLDB'00), Cairo, Egypt, September 2000. Stanford InfoLab
17. Zhang X, Sun W, Wang W, Feng Y, Shi B (2006) Generat-
ing incremental etl processes automatically. In: First interna-
tional multi-symposiums on computer and computational sciences
(IMSCCS’06), vol 2. IEEE, pp 516–521
18. Jörg T, Dessloch S (2008) Towards generating ETL processes for
incremental loading. In: Proceedings of the 2008 international sym-
posium on database engineering applications (IDEAS’08). ACM,
pp 101–110
19. Jörg T, Dessloch S (2009) Formalizing etl jobs for incremental
loading of data warehouses. In: BTW, pp 327–346
20. Behrend A, Jörg T (2010) Optimized incremental etl jobs for
maintaining data warehouses. In: Proceedings of the fourteenth
international database engineering and applications symposium,
ACM, pp 216–224
21. Qu W, Basavaraj V, Shankar S, Dessloch S (2015) Real-time
snapshot maintenance with incremental ETL pipelines in data
warehouses. In: Big data analytics and knowledge discovery.
Springer, pp 217–228
22. Shi J, Bao Y, Leng F, Yu G (2008) Study on log-based change data
capture and handling mechanism in real-time data warehouse. In:
2008 international conference on computer science and software
engineering, vol 4, IEEE, pp 478–481
23. Ma K, Yang B (2015) Log-based change data capture from schema-
free document stores using mapreduce. In: 2015 International
conference on cloud technologies and applications (CloudTech).
IEEE, pp 1–6
24. Eccles MJ, Evans DJ, Beaumont AJ (2010) True real-time change
data capture with web service database encapsulation. In: 2010 6th
world congress on services (SERVICES-1). IEEE, pp 128–131
25. Tank DM, Ganatra A, Kosta YP, Bhensdadia CK (2010) Speeding
ETL processing in data warehouses using high-performance joins
for changed data capture (cdc). In: 2010 international conference on
advances in recent technologies in communication and computing
(ARTCom). IEEE, pp 365–368
26. Sukarsa IM, Wisswani NW, Darma IG (2012) Change data capture
on OLTP staging area for nearly real time data warehouse base on
database trigger. Int J. Comput. Appl. 52(11):32–37
27. Valêncio CR, Marioto MH, Zafalon GFD, Machado J, Momente
J (2013) Real time delta extraction based on triggers to support
data warehousing. In: International conference on parallel and dis-
tributed computing, applications and technologies (PDCAT’13).
IEEE, pp 293–297
28. Thomsen C, Pedersen T (2009) pygrametl: a powerful pro-
gramming framework for extract-transform-load programmers. In:
Proceedings of the ACM twelfth international workshop on Data
warehousing and OLAP. ACM, pp 49–56
29. Thomsen C, Pedersen T (2011) Easy and effective parallel pro-
grammable etl. In: Proceedings of the ACM 14th international
workshop on data warehousing and OLAP. ACM, pp 37–44
30. pygrametl, ETL programming in Python. http://www.pygrametl.org/. Accessed 25 Feb 2018
31. Petl - Extract, transform and load (tables of data). http://petl.readthedocs.io/en/latest/. Accessed 06 Dec 2017
32. Welcome to Scriptella ETL Project. http://scriptella.org/. Accessed
06 Dec 2017
33. Scriptella/scriptella-etl. https://github.com/scriptella/scriptella-
etl/wiki. Accessed 06 Dec 2017
34. Baumer B (2017) A grammar for reproducible and painless
extract-transform-load operations on medium data. arXiv preprint
arXiv:1708.07073
35. Baumer B (2017) etl: extract-transform-load framework for
medium data. http://github.com/beanumber/etl, r package version
0.3.7
36. ETL. https://cran.r-project.org/web/packages/etl/README.html. Accessed 10 Mar 2018
37. Efficient and real time data integration with change data capture.
White paper, Attunity Ltd. (2009). http://attunity.com
38. Ankorion I (2005) Change data capture efficient ETL for real-time
bi. Inf Manag 15(1):36
39. Bokade MB, Dhande SS, Vyavahare HR (2013) Framework of
change data capture and real time data warehouse. In: International
journal of engineering research and technology, vol2. ESRSA Pub-
lications
40. Lindsay B, Haas L, Mohan C, Pirahesh H, Wilms P (1986) A
snapshot differential refresh algorithm. In: Proceedings of the
ACM-SIGMOD conference, vol 15
41. Labio W, Garcia-Molina H (1996) Efficient snapshot differential
algorithms for data warehousing. In: Proceedings of the 22th inter-
national conference on very large data bases (VLDB’96). Morgan
Kaufmann Publishers Inc, pp 63–74
Publisher’s Note Springer Nature remains neutral with regard to juris-
dictional claims in published maps and institutional affiliations.