Innovations in Systems and Software Engineering
https://doi.org/10.1007/s11334-019-00344-4
S.I. : CICBA 2018
Efficient incremental loading in ETL processing for real-time data
integration
Neepa Biswas · Anamitra Sarkar · Kartick Chandra Mondal
Department of Information Technology, Jadavpur University, Kolkata, India
Received: 12 February 2019 / Accepted: 8 May 2019
© Springer-Verlag London Ltd., part of Springer Nature 2019
Abstract
ETL (extract transform load) is the widely used standard process for creating and maintaining a data warehouse (DW). ETL is the most resource-, cost- and time-demanding process in DW implementation and maintenance. Nowadays, many graphical user interface (GUI)-based solutions are available to facilitate the ETL processes. In spite of the high popularity of GUI-based tools, there are still some downsides to such an approach. This paper focuses on an alternative ETL development approach taken by hand coding. In some contexts, like research and academic work, it is appropriate to go for a custom-coded solution which can be cheaper, faster and more maintainable compared to GUI-based tools. Some well-known code-based open-source ETL tools developed by the academic world have been studied in this article. Their architecture and implementation details are addressed here. The aim of this paper is to present a comparative evaluation of these code-based ETL tools. Finally, an efficient ETL model is designed to meet the near real-time requirements of the present day.
Keywords Data warehouse · Code-based ETL · ETL tools · Pygrametl · Petl · Scriptella · Incremental load · Bulk load · CDC
1 Introduction
Nowadays, data analysis has become an integral part of any
organization to achieve optimized performance. A data warehouse (DW) [1] is a wide store of data which is mainly used for analytical reporting. To construct a DW, data are generally collected from heterogeneous data sources, cleaned and restructured as per the required standard, and finally loaded into the DW. This well-known process is called ETL (extract transform load). ETL [2] is one of the important components in the DW environment. It is observed that the ETL process accounts for much of the time, cost and complexity overhead of any DW project [3].
ETL tools come as a solution provider with a user-friendly graphical user interface (GUI) to map data items between the source and target system in a fast, hassle-free manner.
Compared with developing and maintaining a custom hand-coded ETL system, it is easier and faster to select and use an ETL tool. The user needs to configure the tool as per their requirements. Many open-source (e.g., Pentaho Kettle, Talend) [4,5] and commercial (e.g., Informatica, SAS, ODI, IBM) ETL tools [6–8] come with a nice GUI which is easy to use for non-programmers. Using this type of tool, developers design the visual flow of data throughout the ETL process. One disadvantage of this kind of visual approach is that it is sometimes difficult to design a specific ETL scenario with the limited items available in the graphical tool.
Writing a few lines of code can be a better way to obtain the high flexibility needed to tackle complex transformations for any projected function, because it is trickier to drag icons, draw flow lines and set properties for a complex case design than to write one's own customizable ETL code. Here, one of the main considerations should be the productivity of the system. Using a GUI-based tool cannot assure more productivity compared to a code-based approach. Generally, ETL development is done by skilled technical people, so it is justified to go for a code-based ETL option rather than a GUI-based choice. We agree that graphical programs are effective for self-documentation and standardized features. Still, there are aspects where a code-based approach can give an effective solution, particularly when the project and infrastructure cost is low and experienced developers are available.
Some review articles [9–11] on ETL tools have been published so far. However, they typically cover commercial ETL tools available in the market. Only a high-level view is included in those works, without covering any technical details. No such work has been noticed so far regarding code-based ETL tools developed by academic people. The focus of this paper is to give an integrated analysis report on the research domain of programmable ETL systems. Afterward, a new solution model is proposed to meet the near real-time ETL demand. The target is achieved by implementing an incremental loading model with the assistance of the CDC (change data capture) approach.
The contribution of this paper is to highlight a new area of programmable ETL development techniques. For this purpose, four prominent works on code-based ETL frameworks are selected, namely Pygrametl, Petl, Scriptella and etl. An in-depth experimental evaluation is done on each tool. Subsequently, feature-wise and performance-wise analysis reports are provided. The work is continued through designing a new ETL-based data integration technique. The proposed solution makes data integration much more efficient by incrementally populating only the changed data in the DW at the right time.
The rest of the paper is organized in the following manner. Section 2 outlines the research development related to this area. Each of the evaluated ETL tools is discussed with its unique features in Sect. 3. Section 4 states the experimental details and results, followed by a comparative performance analysis report. In Sect. 5, incremental loading and CDC techniques are first reviewed. Afterward, the proposed ETL model is discussed with algorithmic details. Section 6 holds the result and discussion. Finally, Sect. 7 summarizes the overall contribution of this article with future directions.
2 Related work
A number of commercial and open-source data integration
tools are available in the market. Every year, Gartner Inc.
publishes a market research report [12] on these tools, where
Informatica, IBM, SAS, SAP, Oracle and Microsoft are sug-
gested as leading commercial tools and Talend and Pentaho
are the open-source challengers in the market. All those tools
offer GUI-based ETL process design.
Thomsen and Pedersen [13] have done a survey on open-
source business intelligence tools. An overview of ETL tool characteristics is discussed there without any performance
comparison.
A detailed survey is done by Vassiliadis [2] which mainly
addresses research work in each stage of the ETL process
with some academic ETL tools (Ajax, Arktos, Potter’s Wheel
and HumMer Fusion). These tools mainly offer data cleaning
or workflow design tasks. Following that work, Vassiliadis
et al. addressed three commercial ETL tools (SQL SSIS,
OWB and DataStage) and made a taxonomy of distinct ETL
characteristics in article [14].
The focus of this article is programmable ETL tools.
Some surveys have been conducted as comparative studies of popular GUI-based ETL tools which are available in the market [9–11,15]. But no survey covering experimental analysis work has been found in this area from which a feature- and performance-based overview of code-based ETL tools can be evaluated. For this purpose, we have selected four code-based ETL tools that are well accepted by the academic world. Two Python-based tools, Pygrametl and Petl, one Java-based tool, Scriptella, and one R tool, etl, are evaluated. The architectural and characteristic overviews of these tools are discussed in the next section.
Regarding incremental loading in ETL, an algebraic optimization of incremental DW maintenance is discussed in the literature [16]. Zhang et al. [17] have proposed automated generation of incremental loading. A series of works targeting the incremental loading process has been done by Jörg et al. [18]. Article [19] deals with the derivation of ETL jobs for incremental loading. An optimized approach is discussed in the literature [20]. An efficient real-time snapshot maintenance proposal is given in article [21].
In the context of CDC, some research works based on the log-based method are introduced in the literature [22,23] for improving the quality of data. There is a Web service-based approach [24] for modeling an efficient CDC mechanism. A high-performance join on changed data is applied in the literature [25]. A trigger-based approach for capturing changed data from different data sources is mentioned in articles [26,27]. The following section will cover a basic overview of the selected code-based ETL tools, including some comparative feature analyses.
3 ETL tools overview
The limited degree of customization available for modeling and integrating extension environments in GUI-based ETL tools has led many organizations to go for programmable solutions for the ETL process. In this paper, some code-based ETL tools are selected. All these tools are open-source, and none of them offers a graphical user interface.
3.1 Pygrametl
Pygrametl [28,29] is an open-source Python-based ETL framework first released in 2009. The software is licensed under BSD and has been continuously updated since its release. Pygrametl [30] suggests performing ETL tasks by writing Python code. It offers some commonly used ETL functionality to populate data in a DW. PostgreSQL, MySQL and Oracle are the supported databases. Seamless integration of any new kind of data source can be done using merge join, hash join and union source functions. Both batch and bulk loading can be performed as per the requirement.
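To make the description concrete, the following is a minimal sketch (not taken from the paper) of how a pygrametl job typically looks; it uses pygrametl's documented ConnectionWrapper, Dimension and FactTable classes, while the table, column and connection names are hypothetical.

```python
# Minimal pygrametl sketch: fill one dimension and one fact table from a CSV
# source. All table/column names and credentials are illustrative only.
import psycopg2
import pygrametl
from pygrametl.datasources import CSVSource
from pygrametl.tables import Dimension, FactTable

pgconn = psycopg2.connect(dbname="dw", user="dwuser", password="dwpass")
conn = pygrametl.ConnectionWrapper(connection=pgconn)

bookdim = Dimension(name="book_dim",
                    key="book_id",
                    attributes=["title", "genre"],
                    lookupatts=["title"])

salesfact = FactTable(name="sale_fact",
                      keyrefs=["book_id"],
                      measures=["sale"])

with open("sales.csv") as f:
    for row in CSVSource(f, delimiter=","):
        row["book_id"] = bookdim.ensure(row)   # lookup-or-insert the dimension row
        row["sale"] = int(row["sale"])
        salesfact.insert(row)

conn.commit()
conn.close()
```

The ensure call is what gives pygrametl its compactness: it combines the dimension lookup and insert into a single step.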
3.2 Petl
Petl [31] is a general-purpose Python package which is able to perform the conventional tasks of ETL. The package is distributed under the MIT License. Petl provides support for both object-oriented and functional programming styles. Petl can handle a wide range of data sources, with structured files like CSV and text and semi-structured files like XML, JSON, etc. PyMySQL, PostgreSQL and SQLite are three databases compatible with this package. Petl supports most of the transformation patterns required in any ETL process. Besides, utility functions such as timing, materialized views, lookups, etc., provide extra benefit to the developer. Any third-party package can easily be added within it. The ETL data flow is synchronized using ETL pipelines. However, it does not have an SCD or parallelism handling mechanism.
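As an illustration only, a petl pipeline of the kind described above might look like the following; the file and column names are hypothetical, and the chained functions (fromcsv, convert, select, rename, tocsv) belong to petl's standard API.

```python
# Minimal petl sketch: read a CSV, clean and filter it, write the result out.
# petl tables are lazy; nothing is executed until tocsv() materializes the flow.
import petl as etl

table = etl.fromcsv("sales.csv")                      # CSV source
table = etl.convert(table, "sale", int)               # cast a column to int
table = etl.select(table, lambda rec: rec.sale > 0)   # drop invalid rows
table = etl.rename(table, "store", "store_name")      # rename a column

etl.tocsv(table, "sales_clean.csv")                   # execute the pipeline
```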
3.3 Scriptella
Scriptella [32,33] is another script-based ETL tool, written in Java and licensed under Apache Version 2.0. Plain SQL queries are executed using a JDBC bridge in this scripting language. Non-JDBC connections can be added using mixed SQL scripts. XML scripts are used to describe the various ETL tasks, while SQL or other scripting languages can be used for transformation purposes. The main application is focused on executing scripts written in SQL, JEXL, JavaScript and Velocity for the purpose of ETL operations to/from various databases as well as file formats like text, CSV, XML, LDAP, etc. A thin wrapper created by the XML script gives extra facility to build dynamic SQL scripts.
3.4 R_etl
Nowadays, R is a promising language which is gaining popu-
larity in the field of data science. A newly developed package
for R [34] named etl is selected for this piece of work. It
is licensed under CC0 with version 0.3.7 and available in
CRAN [35,36]. It provides a pipeable framework to execute
core ETL operations suitable for working with medium-size
data. RPostgreSQL and RSQLite are the DBI drivers for R
which are compatible with this package. It is suitable for handling data which can reside either in a local or a remote database. Database creation and management can be done without having any expertise in SQL. Only some basic
ETL functionalities are enabled here.
Table 1 Feature comparison matrix on selected code-based ETL tools
Specifications Pygrametl Petl Scriptella R_etl
Easy usability N Y Y Y
Data centric Y Y Y Y
SOA-enabled N N N N
Reusable function Y Y N Y
Single install Y N N N
Big data handle N Y Y N
Data segregation Y Y Y N
Real-time triggers N N N N
Unstruct. data support N Y Y N
Multiple source join Y Y Y N
Complex transform N Y N N
Data validations N Y N Y
SCD support Y N N N
Parallelism support Y N N N
Bulk load Y N Y N
Data pipeline N Y N Y
Easy data map Y Y Y Y
Lookup support N N Y N
Code reusability Y Y N Y
Exception handling Y Y Y N
Proper documentation Y Y N Y
Third-party dependency N Y Y Y
Version control Y Y Y Y
Deploy in cloud N N N N
Licensed Y Y Y Y
Community forum Y Y Y Y
The comparison matrix given in Table 1 represents a brief feature-wise overview of these tools. These are general characteristics which can be taken as criteria before evaluating any ETL tool.
4 Experimental analysis
The different functionalities offered by these tools make it difficult to create a ranking, because each of them has some special type of feature. So, the respective use case is the main point in choosing any tool. The general specifications of these tools were discussed in the previous section. This section will discuss the performance evaluation of each tool based on its characteristics.
4.1 Performance analysis of code-based ETL tools
For evaluation purposes, all the selected ETL tools have been deployed on a local machine. The hardware and software specifications of the machine are given below.
Hardware specifications Processor: Intel(R) Core(TM) i5-6500 CPU @ 3.20 GHz; installed memory (RAM): 4.00 GB; system type: 64-bit operating system, x64-based processor; operating system: Windows 8.1.
Software specifications The above-mentioned ETL tools have been installed on the machine with the following specifications.
Pygrametl Version 2.6 has been installed with the PostgreSQL database and MySQLdb for deployment. Python 3.6 is employed in the Spyder IDE.
Petl Version 1.1.1 is used. It does not have any installation dependencies. SQLAlchemy and PostgreSQL databases have been used. Here also, Python 3.6 is used in Spyder.
Scriptella Version 1.1 has been installed along with the HSQLDB database.
R_etl The etl package version 0.3.7 and PostgreSQL are installed along with the DBI driver RPostgreSQL. RStudio is utilized as the IDE along with R version 3.4.4.
Before starting the assessment of any tool, some criteria should be selected. We have evaluated the performance of the four code-based ETL tools on the basis of four criteria: execution time, transformation support, throughput and code length. After evaluation, the results have been graphically represented to justify the performance analysis. Here,
in the figures below, P1 represents pygrametl, P2 represents
petl, S represents Scriptella and R represents R_etl tool.
4.2 Execution timing
Quantifying the amount of time taken to execute an ETL process is the most important criterion for analyzing any tool. The execution time taken by each tool is collected and plotted in a graph, as shown in Fig. 1. Of the two source input files, one contains 120 rows and the other contains 8 rows which need to be moved to the DW. The execution time is measured in seconds. It is observed that R_etl is much more efficient than the other three options with respect to execution time.

Fig. 1 Execution timing
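For reference, the execution time and throughput of a script-based ETL run can be captured with Python's standard library alone; the sketch below is illustrative, and run_etl_job is a hypothetical placeholder for a tool-specific job.

```python
# Illustrative timing harness: measures wall-clock execution time and derives
# throughput as rows processed per second, as in the evaluation above.
import time

def run_etl_job():
    """Hypothetical placeholder for one tool's extract-transform-load code.
    It should return the number of rows loaded into the DW."""
    return 128   # e.g., the 120 + 8 source rows used in this experiment

start = time.perf_counter()
rows_processed = run_etl_job()
elapsed = time.perf_counter() - start

print(f"execution time: {elapsed:.3f} s")
print(f"throughput: {rows_processed / max(elapsed, 1e-9):.1f} rows/s")
```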
4.3 Transformation support
Performing the required transformations is the most important task in the ETL process. The generation of correct data depends upon the successful completion of the transformation process. In this study, a list of the transformation methods supported by each tool has been identified. Based on the range of transformation support provided by each tool, a comparative graph is drawn, as shown in Fig. 2. It is observed that Petl provides the widest variety of transformations.
4.4 Throughput
To evaluate any system's efficiency, throughput is one of the important metrics. Here, the throughput of each ETL tool is measured after deploying all four systems on premise. Figure 3 presents their throughput performance. The throughput is measured with respect to the number of rows processed per second. The performance of R_etl is better compared to the other tools.
Fig. 2 Transform support
Fig. 3 Throughput rate
Fig. 4 Line of code
4.5 Line of code
This is a measure of how many lines of code are required to accomplish the total ETL process. The comparative graph presented in Fig. 4 shows the lines of code required by each tool to establish the task. An approximate value is considered for this case. Fewer lines of code mean that an ETL implementation can be written more easily and in less time. From the graph, it is visible that etl using R requires fewer lines of code than the others.
5 Modeling ETL jobs
The main objective of this paper is to design a model for a code-based ETL tool by which the data flow and latency can be reduced, achieving near real-time ETL processing in the DW environment through incremental loading utilizing the CDC technique. This approach can noticeably cut down the ETL processing latency by reducing the amount of data propagated throughout the extraction, transformation and loading stages of the ETL workflow.
5.1 Incremental loading
The loading task in the DW is executed as a background process at certain time intervals. An initial load is performed to populate the DW for the first time. Afterward, if any modified or new data arrive at the source side, DW refreshing is done by a full reloading process to bring the repository up to date. But nowadays, with increasing data size, high data rates and complex data structures, full reloading becomes inefficient and inadequate. The more practical approach is to continually update the DW with the changes made in the source data since the previous reload. This approach is termed incremental loading [18,19]. This type of loading is more efficient compared to a full load. The main idea of incremental loading is to capture only the changes that happen to the source data and propagate them to the DW.
5.2 Change data capture (CDC)
CDC is a real-time data integration approach which identifies and captures changes that occur to data sources and delivers only the changed data to the operational system [37]. This approach does not need DW downtime or ETL batch windows. Some CDC technologies operate in batch mode with a pulling technique, meaning that the ETL tool periodically receives a batch of all new changes made since the last retrieval and executes them. Real-time CDC solutions apply a continuous streaming "push" approach for delivering data.
The advantage of CDC is that the latency can be cut down to minutes or even seconds, which makes the data instantly available and eliminates the use of batch windows. Besides, it minimizes the amount of data flow; therefore, the resource requirement is minimized and the data flow speed and efficiency are maximized. CDC addresses business needs like building operational data stores (ODS), business activity monitoring (BAM), application integration, real-time dashboards, data quality improvement, etc. There are several techniques by which CDC [38,39] can be implemented for detecting the change.
Transactional log Most data sources maintain a change log which keeps a record of all changes performed on them. This log keeps track of the date and time of the last modification in the table. This transactional log can be used to capture the change.
Database log scraping and log sniffing The former technique takes a snapshot of the transaction logs maintained by the database system for backup and recovery at a scheduled time. The latter involves "polling" of the active log file and identifies changes in real time.
Snapshot differential During the extraction phase, a snapshot of the complete source table is taken. Changed data can be identified by comparing consecutive snapshots [40,41].
Timestamped index The operational system often maintains a timestamp column for keeping track of the last update. This column, referred to as an audit column, gets a new timestamp for any modification of a tuple. These audit columns can be used to identify new changes since the last loading cycle, as illustrated in the sketch at the end of this subsection.
Database trigger This is a special type of activity in a database system which is fired based on some predefined functions [26]. Triggers can be set on every (add, delete and update) event for finding new data. The output of the trigger program, which is stored in another file, can be used for extracting data.
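As a small illustration of the timestamped-index technique described above (not code from the paper), the sketch below pulls only the rows whose hypothetical last_modified audit column is newer than the previous load, using psycopg2 as the driver; the table, column and credential names are assumptions.

```python
# Hypothetical audit-column CDC sketch: extract only rows changed since the
# last successful load, then advance the high-water mark for the next cycle.
import datetime
import psycopg2

def extract_changes(conn, last_loaded):
    """Return source rows whose audit column is newer than last_loaded."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT * FROM source_table WHERE last_modified > %s",
            (last_loaded,),
        )
        return cur.fetchall()

conn = psycopg2.connect(dbname="source_db", user="etl", password="secret")
last_loaded = datetime.datetime(2019, 1, 1)        # stored from the previous cycle
changed_rows = extract_changes(conn, last_loaded)
# ... transform changed_rows and load them into the DW ...
last_loaded = datetime.datetime.now()              # new high-water mark
```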
5.3 Proposed approach
For implementing incremental loading, the proposed work
is divided into three parts. In the first part of the work,
snapshot-based CDC is implemented which captures only the
changes in data from the input dataset and loads them into
the data warehouse. An algorithm for dimension process-
ing is described in the second part. Finally, in the third part,
an algorithm for fact processing is represented in a formal
way.
For the algorithm, four database tables and one input dataset are taken. The first table is the main table, which contains the previously loaded data. The second and third tables are dimension tables which will be used at the time of dimension processing. The fourth one is a fact table Tab_fact, which will be used at the time of fact processing. Let us assume that the name of the main table is Tab_main and the names of the two dimension tables are Tab1_dim and Tab2_dim. Now, let us focus on the input dataset. It is considered as a new record New_data which needs to be extracted from the source and loaded into the main table.
Algorithm 1: Change Data Capture Algorithm
Result: DataFrame of new, updated and deleted records
1   DF_new ← Take New_data dataset as input from machine;
2   DF_Tabmain ← Fetch Tab_main table from database;
3   if DF_Tabmain is empty then
4       Load DF_new in the database;
5   else
6       IN_df ← Concat(DF_new, DF_Tabmain, DF_Tabmain).drop_duplicate();
7       UP_temp ← Concat(DF_new, DF_Tabmain, DF_Tabmain).drop_duplicate(subset=key);
8       UP_df ← Concat(IN_df, UP_temp).drop_duplicate();
9       DEL_df ← Concat(DF_new, DF_new, DF_Tabmain).drop_duplicate(subset=key);
10  end
For Algorithm 1, at first the New_data input dataset is taken as a pandas DataFrame DF_new. Next, the previously loaded record of the Tab_main table is fetched into a DataFrame DF_Tabmain from the database. If the main table DF_Tabmain is empty, the data from the DataFrame DF_new are loaded into the Tab_main table using bulk load.
If DF_Tabmain is not empty, then the changes occurring in the New_data dataset are determined. For this purpose, DF_new, DF_Tabmain and DF_Tabmain are concatenated. Then, duplicate rows are deleted and the DataFrame IN_df of new elements is stored as the new record.
Afterward, for getting the updated values, DF_new, DF_Tabmain and DF_Tabmain are again concatenated on the key. Then, duplicate rows are removed and the result is stored in a temporary DataFrame UP_temp. Next, IN_df and UP_temp are concatenated. After deleting the duplicates, the final updated DataFrame UP_df is established. At the last step, duplicate records are deleted and the DataFrame of deleted elements DEL_df is obtained.
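A possible pandas realization of Algorithm 1 is sketched below. It is illustrative, not the authors' code: table names, the key column and the connection string are hypothetical, both frames are assumed to share the same schema, and drop_duplicates(keep=False) is assumed so that the concatenation behaves as a set difference.

```python
# Pandas sketch of the snapshot-based CDC of Algorithm 1.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://etl:secret@localhost/dw")
key = ["book_id"]                                  # hypothetical business key

df_new = pd.read_csv("new_data.csv")               # New_data dataset
df_main = pd.read_sql_table("tab_main", engine)    # previously loaded Tab_main

if df_main.empty:
    # Initial load: no change detection needed.
    df_new.to_sql("tab_main", engine, if_exists="append", index=False)
else:
    # Rows of df_new not present identically in df_main (new or changed rows).
    in_df = pd.concat([df_new, df_main, df_main]).drop_duplicates(keep=False)

    # Rows of df_new whose key does not exist in df_main at all (inserts).
    up_temp = pd.concat([df_new, df_main, df_main]).drop_duplicates(
        subset=key, keep=False)

    # Changed rows whose key already exists (updates) = in_df minus up_temp.
    up_df = pd.concat([in_df, up_temp]).drop_duplicates(keep=False)

    # Rows of df_main whose key no longer appears in df_new (deletes).
    del_df = pd.concat([df_new, df_new, df_main]).drop_duplicates(
        subset=key, keep=False)
```

Repeating a frame in the concatenation makes its rows appear at least twice, so keep=False removes them and only the one-sided rows survive, which is exactly the set difference the algorithm relies on.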
5.4 Dimension processing algorithm
Algorithm 2: Dimension Processing Algorithm
Result: Dimension Table
1   DF_Dimnew ← Fetch dimension attributes from recently updated main table Tab_main from database;
2   DF_Dimold ← Fetch old dimension table Tab1_dim from database;
3   DIM_df ← Concat(DF_Dimnew, DF_Dimold, DF_Dimold).drop_duplicate();
4   Load DIM_df DataFrame in old dimension table Tab1_dim of the database;
For Algorithm 2, in the first step, the dimension attributes are fetched from the recently updated main table Tab_main into a DataFrame DF_Dimnew. Consider it as the new record. In the next step, the old dimension table Tab1_dim is fetched into a pandas DataFrame DF_Dimold from the database.
Afterward, for getting the changes in the dimension table, the DataFrame DF_Dimnew is concatenated with DF_Dimold. After dropping the duplicates, the changed data are loaded into the database dimension table Tab1_dim.
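Under the same assumptions as the previous sketch (hypothetical names, shared schema, keep=False semantics), Algorithm 2 could be expressed in pandas as follows.

```python
# Pandas sketch of the dimension processing of Algorithm 2.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://etl:secret@localhost/dw")
dim_attrs = ["book_id", "title", "genre"]          # hypothetical dimension attributes

# New dimension values taken from the freshly updated main table.
df_dim_new = pd.read_sql_table("tab_main", engine)[dim_attrs]

# Current contents of the dimension table.
df_dim_old = pd.read_sql_table("tab1_dim", engine)

# Keep only the dimension rows that are not already present.
dim_df = pd.concat([df_dim_new, df_dim_old, df_dim_old]).drop_duplicates(keep=False)

# Append the new/changed rows to the dimension table.
dim_df.to_sql("tab1_dim", engine, if_exists="append", index=False)
```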
5.5 Fact processing algorithm
Algorithm 3: Fact Processing Algorithm
Result: Fact Table
1   DF_Tabnew ← Fetch recently updated main table Tab_main from database;
2   DF_Dim1new ← Fetch recently updated dimension table Tab1_dim from database;
3   DF_Dim2new ← Fetch another recently updated dimension table Tab2_dim from database;
4   DF_Factold ← Fetch fact table Tab_fact with old data from database;
5   Fact1_df ← Merge(DF_Tabnew, DF_Dim1new) using common attribute Com_attr1;
6   Fact2_df ← Merge(Fact1_df, DF_Dim2new) using common attribute Com_attr2;
7   Fact3_df ← Only select the key attributes from Fact2_df;
8   Fact_df ← Concat(Fact3_df, DF_Factold, DF_Factold).drop_duplicate();
9   Load Fact_df DataFrame in the old fact table Tab_fact of the database;
Algorithm 3 describes the fact processing part of the approach. At first, the recently updated main table Tab_main is fetched into a DataFrame DF_Tabnew from the database. It is the new record. In the next two steps, the two recently updated dimension tables Tab1_dim and Tab2_dim are fetched into DataFrames DF_Dim1new and DF_Dim2new from the database. In the next step, the fact table Tab_fact with the old records is fetched.
Now, for determining the changes and updating them in the fact table Tab_fact, first DF_Tabnew and DF_Dim1new are merged with respect to a common attribute Com_attr1. The resulting DataFrame is Fact1_df. Then, Fact1_df and DF_Dim2new are merged with respect to a common attribute Com_attr2, with the resulting DataFrame Fact2_df. From Fact2_df, we need to select the key attributes which can reference the dimension tables and make another DataFrame Fact3_df. Then, Fact3_df and two copies of DF_Factold are concatenated and the duplicates are dropped from the resulting DataFrame. In the last step, the changed data are loaded into the fact table Tab_fact.
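Finally, a pandas sketch of Algorithm 3 under the same assumptions; the join attributes (com_attr1, com_attr2), key columns and measure are hypothetical and must match the layout of the existing fact table for the set-difference step to work.

```python
# Pandas sketch of the fact processing of Algorithm 3.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://etl:secret@localhost/dw")

df_tab_new = pd.read_sql_table("tab_main", engine)   # recently updated main table
df_dim1_new = pd.read_sql_table("tab1_dim", engine)  # refreshed dimension 1
df_dim2_new = pd.read_sql_table("tab2_dim", engine)  # refreshed dimension 2
df_fact_old = pd.read_sql_table("tab_fact", engine)  # existing fact rows

# Join the main table with both dimensions on their common attributes.
fact1_df = df_tab_new.merge(df_dim1_new, on="com_attr1")
fact2_df = fact1_df.merge(df_dim2_new, on="com_attr2")

# Keep only the keys referencing the dimensions plus the measure.
fact3_df = fact2_df[["book_id", "time_id", "sale"]]

# Append only the fact rows that are not already in the fact table.
fact_df = pd.concat([fact3_df, df_fact_old, df_fact_old]).drop_duplicates(keep=False)
fact_df.to_sql("tab_fact", engine, if_exists="append", index=False)
```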
6 Example scenario
For our example scenario, a small DW is designed for a book store management system with stores located in various cities. It contains a single fact table and three dimension tables for book details, time of sale and location of book stores. Figure 5 shows the star schema constructed with the corresponding fact table and dimension tables.
Fig. 5 DW schema design
Dimensional modeling is an established and very popular methodology for DW design. It reflects the logical schema
design of a particular DW. Fact tables usually contain mea-
sures of business processes that are referred to as facts.
Dimension tables store textual descriptions of the business
entities. The dimension attributes are used to select, group
and aggregate facts of interest in the DW queries. Multiple
hierarchical relationships in a single table can often be rep-
resented by dimensions. Dimension tables are usually highly
denormalized.
6.1 Experimental analysis
Here, experimental results are provided to demonstrate the
advantage of incremental loading over full reloading. To
implement and maintain the DW, input data are taken from a
database table and an external CSV file. A dataset of 60,000
tuples first inserted in a database table. Then, six datasets
are created having 1, 5, 10, 20, 25, 30, 40, 45 percentage of
changed data and inserted in different steps. At each step,
equal numbers of tuples were inserted, deleted and updated.
Deletions are not propagated since historical data are kept in
the DW. The Hardware requirement to run the tests was PC
having Intel(R) Core(TM) i5-4200U (2.3GHz) processor, a
DDR2 4 GB main memory on a 1 TB hard disc, SATA stan-
dard, transfer rate of 3.0 Gbit/s and 5400 rpm (rotations per
minute). The equipment had Microsoft Windows 8.1 system
software, Anaconda Python 3.6 compiler and PostgreSQL
10.4 DBMS.
6.2 Results and discussion
Fig. 6 Time comparison between full load and incremental load

The tests were done in two stages. At the first stage, the time to compute change data for full reloading is measured. A fictitious database containing a single table was used. This
table contains 6000 tuples. Then, the previously mentioned
six datasets are given as an input for the time measurement
to compute change data for DW refreshment. At the second
stage, the time to compute change data for incremental load-
ing is measured. We followed exactly the same steps as in the first stage to measure the time to compute change data for DW refreshment.
A comparative graph is shown in Fig. 6 for both full load and incremental load using the CDC technique. As expected, a full reload is considerably slower than incremental loading, where only the changes are captured and loaded into the data warehouse. Incremental loading clearly outperforms full reloading. The performance of the proposed method is not affected even when the source relation changes dramatically.
The aim of the proposed model is to reduce the processing overhead throughout the ETL stages and also to minimize the time complexity compared to the full loading technique. The performance of incremental loading proves quite good compared to full loading. It is observed that the CPU time grows linearly with the size of the dataset in both cases. From Fig. 6, it is clearly visible how the incremental method outperforms full reloading of data into the data warehouse. The result is expected because when we perform full reloading, we need to load the whole table containing the changed data as well as the old data into the data warehouse. For incremental loading, only the changed data are identified and loaded into the data warehouse, so the size of the data which needs to be loaded becomes much smaller. For this reason, incremental loading proves to be very efficient for refreshing the DW in near real time.
7 Conclusion
At present, the requirement of handling a continuously increasing amount of data within more complex environments is a great challenge in the research domain. It demands a standardized ETL process, which has a great business impact on the BI industry. Most organizations opt for a vendor-made GUI-based product for their ETL solution. But still, in some cases custom-coded ETL can be the best option in terms of performance. This paper has chosen the second option for the ETL solution. Four promising code-based ETL tools, namely Pygrametl, Petl, Scriptella and etl, have been chosen in this paper. Both performance- and feature-based analyses are presented for these tools.
The main contribution of this work is the modeling of a near real-time ETL process through incremental loading. However, the development of ETL jobs for incremental loading is not at all well supported by existing ETL tools. In fact, separate ETL jobs for initial loading and incremental loading have so far had to be created by ETL programmers. Since such hand-written incremental load jobs are considerably more complex and error prone, the proposed approach proves much better compared to the former practice.
Future work aims to design a unified code-based ETL framework supporting relational and NoSQL databases with a rich transformation library hosted on a cloud platform. We will also focus on advanced transformation operators such as aggregation and outer joins, and on data restructuring such as pivoting. Further, we plan to utilize the staging area in a better way; CDC limitations can be reduced to some extent by utilizing the staging area. Moreover, we expect performance improvements from persisting intermediary results.
References
1. Inmon W (2005) Building the data warehouse. Wiley, New York
2. Vassiliadis P (2009) A survey of extract—transform—load tech-
nology. Int J Data Warehous Min 5(3):1–27
3. Eckerson W, White C (2003) Evaluating ETL and data integration
platforms. Report of The Data Warehousing Institute 184
4. Data integration. http://www.pentaho.com/product/data-
integration. Accessed 06 Feb 2018
5. Data integration: talend enterprise data integration services.
http://www.talend.com/products/data-integration. Accessed 06
Feb 2018
6. Data integration tools and software solutions | informatica
India. https://www.informatica.com/in/products/data-integration.
html. Accessed 06 Feb 2018
7. Oracle Data Integrator. http://www.oracle.com/technetwork/
middleware/data-integrator/overview/index.html. Accessed 06
Feb 2018
8. IBM, InfoSphere Information Server. http://www-03.ibm.com/
software/products/en/infosphere-information-server/. Accessed
06 Feb 2018
9. Schmidt N, Rosa M, Garcia R, Molina E, Reyna R, Gonzalez J
(2011) Etl tool evaluation—a criteria framework. University of
Texas-Pan American, Texas
10. Majchrzak TA, Jansen T, Kuchen H (2011) Efficiency evaluation
of open source ETL tools. In: Proceedings of the 2011 ACM sym-
posium on applied computing. ACM, pp 287–294
11. Pall AS, Khaira JS (2013) A comparative review of extrac-
tion, transformation and loading tools. Database Syst J BOARD
4(2):42–51
12. 2017 Gartner magic quadrant for data integration tools. https://www.informatica.com/in/data-integration-magic-quadrant.html. Accessed 06 Dec 2017
13. Thomsen C, Pedersen T (2005) A survey of open source tools for
business intelligence. In: International conference on data ware-
housing and knowledge discovery. Springer, pp 74–84
14. Vassiliadis P, Simitsis A, Baikousi E (2009) A taxonomy of ETL
activities. In: Proceedings of the ACM twelfth international work-
shop on Data warehousing and OLAP. ACM, pp 25–32
15. Kabiri A, Chiadmi D (2013) Survey on ETL processes. J Theor
Appl Inf Technol 54(2):219–229
16. Labio W, Yang J, Cui Y, Garcia-Molina H, Widom J (1999) Performance issues in incremental warehouse maintenance. In: Proceedings of the 26th international conference on very large data bases (VLDB'00), Cairo, Egypt, September 2000. Stanford InfoLab
17. Zhang X, Sun W, Wang W, Feng Y, Shi B (2006) Generat-
ing incremental etl processes automatically. In: First interna-
tional multi-symposiums on computer and computational sciences
(IMSCCS’06), vol 2. IEEE, pp 516–521
18. Jörg T, Dessloch S (2008) Towards generating ETL processes for
incremental loading. In: Proceedings of the 2008 international sym-
posium on database engineering applications (IDEAS’08). ACM,
pp 101–110
19. Jörg T, Dessloch S (2009) Formalizing etl jobs for incremental
loading of data warehouses. In: BTW, pp 327–346
20. Behrend A, Jörg T (2010) Optimized incremental etl jobs for
maintaining data warehouses. In: Proceedings of the fourteenth
international database engineering and applications symposium,
ACM, pp 216–224
21. Qu W, Basavaraj V, Shankar S, Dessloch S (2015) Real-time
snapshot maintenance with incremental ETL pipelines in data
warehouses. In: Big data analytics and knowledge discovery.
Springer, pp 217–228
22. Shi J, Bao Y, Leng F, Yu G (2008) Study on log-based change data
capture and handling mechanism in real-time data warehouse. In:
2008 international conference on computer science and software
engineering, vol 4, IEEE, pp 478–481
23. Ma K, Yang B (2015) Log-based change data capture from schema-
free document stores using mapreduce. In: 2015 International
conference on cloud technologies and applications (CloudTech).
IEEE, pp 1–6
24. Eccles MJ, Evans DJ, Beaumont AJ (2010) True real-time change
data capture with web service database encapsulation. In: 2010 6th
world congress on services (SERVICES-1). IEEE, pp 128–131
25. Tank DM, Ganatra A, Kosta YP, Bhensdadia CK (2010) Speeding
ETL processing in data warehouses using high-performance joins
for changed data capture (cdc). In: 2010 international conference on
advances in recent technologies in communication and computing
(ARTCom). IEEE, pp 365–368
26. Sukarsa IM, Wisswani NW, Darma IG (2012) Change data capture
on OLTP staging area for nearly real time data warehouse base on
database trigger. Int J. Comput. Appl. 52(11):32–37
27. Valêncio CR, Marioto MH, Zafalon GFD, Machado J, Momente
J (2013) Real time delta extraction based on triggers to support
data warehousing. In: International conference on parallel and dis-
tributed computing, applications and technologies (PDCAT’13).
IEEE, pp 293–297
28. Thomsen C, Pedersen T (2009) pygrametl: a powerful pro-
gramming framework for extract-transform-load programmers. In:
Proceedings of the ACM twelfth international workshop on Data
warehousing and OLAP. ACM, pp 49–56
29. Thomsen C, Pedersen T (2011) Easy and effective parallel pro-
grammable etl. In: Proceedings of the ACM 14th international
workshop on data warehousing and OLAP. ACM, pp 37–44
30. pygrametl, ETL programming in Python. http://www.pygrametl.org/. Accessed 25 Feb 2018
31. Petl - Extract, transform and load (tables of data). http://petl.readthedocs.io/en/latest/. Accessed 06 Dec 2017
32. Welcome to Scriptella ETL Project. http://scriptella.org/. Accessed
06 Dec 2017
33. Scriptella/scriptella-etl. https://github.com/scriptella/scriptella-
etl/wiki. Accessed 06 Dec 2017
34. Baumer B (2017) A grammar for reproducible and painless
extract-transform-load operations on medium data. arXiv preprint
arXiv:1708.07073
35. Baumer B (2017) etl: extract-transform-load framework for
medium data. http://github.com/beanumber/etl, r package version
0.3.7
36. ETL. https://cran.r-project.org/web/packages/etl/README.html. Accessed 10 Mar 2018
37. Efficient and real time data integration with change data capture.
White paper, Attunity Ltd. (2009). http://attunity.com
38. Ankorion I (2005) Change data capture efficient ETL for real-time
bi. Inf Manag 15(1):36
39. Bokade MB, Dhande SS, Vyavahare HR (2013) Framework of
change data capture and real time data warehouse. In: International
journal of engineering research and technology, vol2. ESRSA Pub-
lications
40. Lindsay B, Haas L, Mohan C, Pirahesh H, Wilms P (1986) A
snapshot differential refresh algorithm. In: Proceedings of the
ACM-SIGMOD conference, vol 15
41. Labio W, Garcia-Molina H (1996) Efficient snapshot differential
algorithms for data warehousing. In: Proceedings of the 22th inter-
national conference on very large data bases (VLDB’96). Morgan
Kaufmann Publishers Inc, pp 63–74
Publisher’s Note Springer Nature remains neutral with regard to juris-
dictional claims in published maps and institutional affiliations.