Speeding ETL Processing in Data Warehouses Using
High-Performance Joins
For Changed Data Capture (CDC)
Darshan M. Tank1, Amit Ganatra2, Y. P. Kosta3
1,3Department of IT, 2Department of CE
Charotar Institute of Technology (Faculty of Technology)
Charotar University of Science and Technology, Changa
Anand-388421, India
1dmtank@gmail.com, 2amitganu@yahoo.com,
3ypkosta@yahoo.com
C. K. Bhensdadia
Department of CE
Faculty of Technology
Dharmsinh Desai University
Nadiad-387001, India
ckbhensdadia@yahoo.co.in
Abstract - In today's fast-changing, competitive environment,
a complaint frequently heard by data warehouse users is that
access to time-critical data is too slow. Shrinking batch windows
and data volume that increases exponentially are placing
increasing demands on data warehouses to deliver instantly-
available information. Additionally, data warehouses must be
able to consistently generate accurate results. But achieving
accuracy and speed with large, diverse sets of data can be
challenging.
Various operations can be used to optimize data
manipulation and thus accelerate data warehouse processes. In
this paper we introduce two such operations: 1. Join and 2.
Aggregation – which play an integral role during preprocessing as
well as in manipulating and consolidating data in a data warehouse.
Our approach demonstrates how hours or even days can be saved when
processing large amounts of data for ETL, data warehousing, business
intelligence (BI), and other mission-critical applications.
Keywords - Business Intelligence, Near Real-Time Data
Warehousing, Change Data Capture (CDC), Extract-Transform-
Load (ETL)
I. INTRODUCTION
The widespread use of the Internet and related
technologies in various business domains has accelerated the
intensity of competition, increased the volume of data and
information available, and shortened decision-making cycles
considerably. Typically, in a large organization, many
distributed, heterogeneous data sources, applications, and
processes have to be integrated to ensure delivery of the best
information to the decision makers. In order to support
effective analysis and mining of such diverse, distributed
information, a data warehouse (DWH) collects data from
multiple, heterogeneous (operational) source systems and
stores integrated information in a central repository.
There are numerous challenges facing IT departments
today as they deal with reduced budgets, smaller staffs, and
ever-increasing demands on the development of business
critical applications. Within the Corporate Information Factory
(Fig. 1), there are often large volumes of data that have to be
processed from application to application or from repository to
repository, creating a flow of data across the enterprise [1].
Figure 1. Technical architecture of Corporate Information Factory.
From Fig. 1 above, it is easy to see that data volumes can
grow quickly while demands on the data increase as well. This
can occur when processing the data and putting it into another
form or database. The management of all of this data
movement, and the time it takes to process all of the
information, creates a new set of IT challenges [5].
II. ETL PROCESS
Extract-Transform-Load (ETL) is a common term in data
warehousing that refers to extracting data from source systems,
transforming the data according to business rules, and loading
it into the target data warehouse. ETL is the process in data
warehousing responsible for pulling data out of the source
systems and placing it into a data warehouse [3].
ETL systems move data from OLTP systems to a data
warehouse, but they can also be used to move data from one
data warehouse to another. A heterogeneous architecture for
an ETL system is one that extracts data from multiple sources.
The complexity of this architecture arises from the fact that
data from more than one source must be merged, rather than
from the fact that data may be formatted differently in the
different sources.
Figure 2. ETL System Architecture.
The ETL process is not a one-time event; new data is
added to a data warehouse periodically [4]. Typical periodicity
may be monthly, weekly, daily, or even hourly, depending on
the purpose of the data warehouse and the type of business it
serves.
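As an illustration, the following Python sketch walks through the three ETL stages over hypothetical in-memory records; the source rows, the business rule, and the list standing in for the warehouse are illustrative placeholders rather than part of any particular ETL tool.

def extract(source_rows):
    """Pull raw records out of the operational source (here: an in-memory list)."""
    return list(source_rows)

def transform(rows):
    """Apply business rules: normalize the customer name and derive a revenue measure."""
    transformed = []
    for r in rows:
        transformed.append({
            "customer": r["customer"].strip().upper(),
            "qty": int(r["qty"]),
            "revenue": int(r["qty"]) * float(r["unit_price"]),
        })
    return transformed

def load(rows, warehouse):
    """Append the transformed records to the target store (here: a list)."""
    warehouse.extend(rows)

if __name__ == "__main__":
    source = [{"customer": " acme ", "qty": "3", "unit_price": "9.99"}]
    warehouse = []
    load(transform(extract(source)), warehouse)
    print(warehouse)   # one cleansed row with the derived revenue column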
III. DATA PROCESSING CHALLENGES
Staying competitive in today’s real-time business world
demands the capability to process ever-increasing volumes of
data at lightning speed. In today's fast-changing, competitive
environment, a complaint frequently heard by data warehouse
users is that access to time-critical data is too slow. Shrinking
batch windows and data volume that increases exponentially
are placing increasing demands on data warehouses to deliver
instantly-available information.
Additionally, data warehouses must be able to consistently
generate accurate results. But achieving accuracy and speed
with large, diverse sets of data can be challenging [5]. There
has been an increased need to update the underlying data
infrastructure to improve access to, and the quality of, real-
time business intelligence (BI) and other mission-critical
operations, including data marts, data mining, online data
stores, and online transaction processing (OLTP) systems.
One of the primary factors affecting data warehouse
performance is the underlying physical distribution of the data
in the databases, which must be consolidated in various ways
[6]. An inability to perform data intensive operations
efficiently will inevitably impede business analyses and
subsequent decision-making.
IV. KEY COMPONENTS OF DATA WAREHOUSE
PROCESSING
Various operations can be used to optimize data
manipulation and thus accelerate data warehouse processes.
Two operations in particular, join and aggregation, play an
integral role during preprocessing as well as in manipulating and
consolidating data in a data warehouse.
Deriving useful information from raw data within a data
warehouse involves joining factual and dimensional
information before aggregating it to produce a report or output
for downstream analysis. However, joins and aggregations are
data-intensive and time-intensive, which prevents data
warehouses with insufficiently fast ETL software from
performing them frequently; but frequency is key to providing
data for timely business analysis [7].
For processing data for BI analytics and other mission-
critical applications, the usual go-to software products include
various extract, transform, load (ETL) and data warehousing
solutions [12]. However, these products do not always provide
the speed required to deliver data on time. Such software must
support Advanced Data Management (ADM) capabilities that reduce the
processing time of data-intensive operations through the use of
high-performance joins and high-performance aggregations.
A. Joins
Joins are used to combine information from two or more
data sources, such as database tables, and place it into a new
data source suitable for downstream processing or reports.
Joins are particularly powerful because they enable rapidly
changing data to be organized into categories for subsequent
report preparation through the use of matching keys. Joins are
used to pre-process data, to improve the efficiency of queries,
and to accelerate changed data capture (CDC).
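As an illustration of the matching-key idea, the following Python sketch performs a simple hash join between hypothetical fact and dimension records; production ETL engines apply the same build-and-probe pattern with hashed or sorted inputs at far larger scale.

def hash_join(facts, dims, key):
    """Combine each fact row with its matching dimension row via a shared key."""
    lookup = {d[key]: d for d in dims}      # build phase: index the smaller input
    joined = []
    for f in facts:                         # probe phase: one lookup per fact row
        d = lookup.get(f[key])
        if d is not None:
            joined.append({**d, **f})
        # unmatched facts could be routed to an error or repair stream here
    return joined

sales = [{"product_id": 1, "qty": 5}, {"product_id": 2, "qty": 1}]
products = [{"product_id": 1, "category": "Books"},
            {"product_id": 2, "category": "Music"}]
print(hash_join(sales, products, "product_id"))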
B. Aggregations
Aggregations play a key role at various stages of data
warehouse processing. From preprocessing of data prior to its
entering the data warehouse, to dimensional data analysis used
for conducting queries and generating reports, aggregations
are critical to efficiently preparing, configuring, and analyzing
large volumes of data.
1) Using High-Performance Aggregations for
Preprocessing: By aggregating data prior to loading it into the
data warehouse, queries, database loads, and other
downstream processing can be performed much faster. ADM
should quickly summarize factual data to the minimum level
of granularity required by the data warehouse. A common
application removes redundant transaction data from multiple
sources for faster queries and database loads [17].
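The following Python sketch illustrates this kind of pre-load aggregation over hypothetical transaction rows: detail records are summarized to the day/product grain, so the warehouse load and downstream queries touch fewer rows.

from collections import defaultdict

def pre_aggregate(transactions, keys, measure):
    """Summarize detail rows to the minimum grain required by the warehouse."""
    totals = defaultdict(float)
    for t in transactions:
        totals[tuple(t[k] for k in keys)] += t[measure]
    return [dict(zip(keys, group), **{measure: value})
            for group, value in totals.items()]

txns = [
    {"day": "2011-01-01", "product": "A", "amount": 10.0},
    {"day": "2011-01-01", "product": "A", "amount": 15.0},
    {"day": "2011-01-01", "product": "B", "amount": 7.5},
]
print(pre_aggregate(txns, ["day", "product"], "amount"))
# three detail rows collapse to two rows at day/product grain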
2) Pre-Calculated Aggregations Speed Queries: Data
warehouse experts agree that aggregates are the best way to
speed data warehouse queries. According to data warehousing
expert Ralph Kimball, “Every data warehouse should contain
precalculated and prestored aggregation tables”. Aggregate
operations yield the greatest performance benefits when they
are used for input to data analysis.
In the case of a six million row query, reducing the number
of rows read by creating aggregations across dimensions can
vastly accelerate processing time [13]. A query answered from
base-level data can take hours and involve millions of data
records and millions of calculations. With pre-calculated
aggregates, the same query can be answered in seconds with
just a few records and calculations.
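The contrast can be sketched in Python with hypothetical data: the base-level answer scans every detail row at query time, whereas the pre-calculated answer is a single lookup into an aggregate table that was populated during ETL.

# Base-level answer: scan and sum every detail row at query time.
base_rows = [{"region": "EU", "month": "2011-01", "amount": a} for a in range(1000)]
answer_from_base = sum(r["amount"] for r in base_rows
                       if r["region"] == "EU" and r["month"] == "2011-01")

# Pre-calculated answer: the sum was stored during ETL, so the query is a lookup.
aggregate_table = {("EU", "2011-01"): answer_from_base}
answer_from_aggregate = aggregate_table[("EU", "2011-01")]

assert answer_from_base == answer_from_aggregate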
3) Using Multi-Level Hierarchical Aggregations:
Sophisticated aggregation schemes recognize dimensional
hierarchies and build higher-level aggregations from more
granular aggregations. For instance, a daily aggregation of
sales for a region could be used to build a monthly aggregation
of sales by region, and thereby avoid an aggregation of the
more granular daily sales totals. Such aggregate-awareness can
significantly increase speed and enhance reporting capacity.
Figure 3. Multi-Level Hierarchical Aggregation.
Aggregations can also be used to replace fact data with
rolled-up versions of themselves. ADM should also perform
multi-level hierarchical aggregation (Fig. 3) to help build cubes
for more advanced dimensional data analyses.
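The following Python sketch, using hypothetical daily totals, illustrates such an aggregate-aware roll-up: the monthly aggregation by region is built from the already-computed daily aggregation instead of from the detailed transactions.

from collections import defaultdict

daily = [  # output of the lower-level (daily) aggregation step
    {"region": "EU", "day": "2011-01-01", "sales": 120.0},
    {"region": "EU", "day": "2011-01-02", "sales": 80.0},
    {"region": "US", "day": "2011-01-01", "sales": 200.0},
]

monthly = defaultdict(float)
for row in daily:
    month = row["day"][:7]                  # derive the parent level of the hierarchy
    monthly[(row["region"], month)] += row["sales"]

print(dict(monthly))   # {('EU', '2011-01'): 200.0, ('US', '2011-01'): 200.0}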
V. CDC (CHANGE DATA CAPTURE)
Change data capture is an approach to data integration,
based on the identification, capture, and delivery of only the
changes made to operational/transactional data systems. CDC
solutions occur most often in data-warehouse environments
since capturing and preserving the state of data across time is
one of the core functions of a data warehouse, but CDC can be
utilized in any database or data repository system. By
processing only the changes, CDC makes the data integration,
and more specifically the ‘Extract’ part of the ETL process
more efficient. When done correctly, it also reduces the
‘latency’ between the time a change occurs in the source
systems and the time the same change is made available to the
business user in the data warehouse.
Next-generation data integration and ETL tools need to
support Change Data Capture (CDC), a technology that makes it
possible to identify, capture, and move only the changes made
to enterprise data sources; moving the entire source data set
each time is no longer viable. Implementing CDC makes real-time
data and information integration significantly more efficient
and delivers data at the right time [6].
Figure 4. Working of CDC in conjunction with ETL tools
A common case for using CDC is in conjunction with ETL
tools for faster and more efficient data extract in data
warehouse implementations. A key goal of CDC is to improve
efficiency by reducing the amount of data that needs to be
processed to a minimum [6]. Therefore, if the business
requirements call for only certain changes to be captured,
it would be wasteful to transfer all changes. The most
advanced CDC solutions therefore provide filters that reduce
the amount of information transferred, again minimizing
resource requirements and maximizing speed and efficiency.
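A minimal Python sketch of this filtering idea is given below, using hypothetical change records and a hypothetical business rule; only the changes the warehouse actually tracks are transferred downstream.

changes = [
    {"op": "INSERT", "table": "orders",    "row": {"id": 1, "status": "NEW"}},
    {"op": "UPDATE", "table": "customers", "row": {"id": 7, "tier": "GOLD"}},
    {"op": "UPDATE", "table": "audit_log", "row": {"id": 9}},
]

def relevant(change):
    """Business filter: ship only changes to the tables the warehouse tracks."""
    return change["table"] in {"orders", "customers"}

to_transfer = [c for c in changes if relevant(c)]
print(to_transfer)   # the audit_log change is never sent across the wire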
VI. HIGH-PERFORMANCE JOINS FOR CHANGED DATA
CAPTURE (CDC)
To reduce the time needed to retrieve information, data
must be preprocessed into the proper form for the dimensional
data warehouse. High-speed joins are critical to this process,
which can include lookups of legacy values for appropriate
replacement, cleansing and validating data, identifying and
eliminating mismatching values, and pre-aggregation. Using
high-performance joins for data-intensive operations,
descriptive information can be combined with factual data late
in a processing sequence so that storage and throughput
requirements are minimized.
CDC is an increasingly important pre-processing function.
By loading only new, updated, and deleted records into the
data warehouse, CDC significantly conserves time and
resources when carrying out data-intensive processes [16].
Figure 5. Joins Used for Changed Data Capture.
Rather than replacing the information in the data
warehouse with the data in the entire online transactional
database, a join will match the primary key of the previously
loaded record with its corresponding new record and then
compare the data portions of the two records to determine if
they’ve changed. In this way, only added, deleted, and altered
records are updated, which significantly reduces elapsed time
of database loads. By using a high-performance join for CDC,
data warehouse updates can be performed with far greater
efficiency.
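The following Python sketch, over hypothetical before and after snapshots, illustrates this join-based CDC: records are matched on the primary key, the data portions are compared, and only the added, deleted, and changed rows are returned for loading.

def detect_changes(previous, current, key="id"):
    """Full outer join on the primary key, then compare the data portions."""
    prev = {r[key]: r for r in previous}
    curr = {r[key]: r for r in current}
    added   = [curr[k] for k in curr.keys() - prev.keys()]
    deleted = [prev[k] for k in prev.keys() - curr.keys()]
    changed = [curr[k] for k in curr.keys() & prev.keys() if curr[k] != prev[k]]
    return added, deleted, changed

loaded = [{"id": 1, "name": "Ann", "city": "Anand"},
          {"id": 2, "name": "Raj", "city": "Nadiad"}]
source = [{"id": 2, "name": "Raj", "city": "Changa"},   # data portion changed
          {"id": 3, "name": "Mia", "city": "Anand"}]    # new record; id 1 was deleted

added, deleted, changed = detect_changes(loaded, source)
print(added, deleted, changed)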
VII. CONCLUSIONS
Aggregations and joins are widely used in nearly every
industrial sector where large volumes of data must be analyzed,
and they significantly accelerate data processing. They speed up
processes such as ETL, staging data for a data warehouse, and
database loading. They also minimize resource consumption,
making it possible to consolidate hardware for significant cost
savings.
ACKNOWLEDGMENT
The authors wish to express special thanks to the management of
Charusat for their moral support and continuous
encouragement.
REFERENCES
[1] C. Imhoff, "The Corporate Information Factory," DM Review,
December 1999 (http://dmreview.com)
[2] Bruckner, R. M., and Tjoa, A. M. "Capturing Delays and Valid Times in
Data Warehouses—Towards Timely Consistent Analyses,” Journal of
Intelligent Information Systems, September 2002.
[3] Josef Schiefer & Robert M. Bruckner “Container-Managed ETL
Applications for integrating data in near real-time” Twenty-Fourth
International Conference on Information Systems (2003)
[4] W.H. Inmon and Dan Meers “Maximizing the “E” in Legacy Extract,
Transform & Load (ETL)” December 2003
[5] White Paper by Syncsort Incorporated “Solving the Challenges of
Exponential Data Growth” 2009 (http://syncsort.com)
[6] White Paper by Attunity Ltd. “Efficient and Real Time Data Integration
with Change Data Capture” February 2009 (http://attunity.com)
[7] Robert M. Bruckner, A. M. Tjoa "Managing Time Consistency for
Active Data Warehouse Environments”. In Proceedings of the Third
International Conference on Data Warehousing and Knowledge
Discovery (DaWaK 2001), Springer LNCS 2114, pp. 254-263, Munich,
Germany, September 2001.
[8] Robert M. Bruckner, Beate List, Josef Schiefer “Striving Toward Near
Real-Time Data Integration for Data Warehouses”. In Proceedings of the
Fourth International Conference on Data Warehousing and Knowledge
Discovery (DaWaK 2002), Springer LNCS 2454, pp. 317-326, Aix-en-
Provence, France, September 2002.
[9] Josef Schiefer, Jun-Jang Jeng, Robert M. Bruckner “Managing
Continuous Data Integration Flows”. Decision Systems Engineering
Workshop (DSE'03), Velden, Austria, June 2003.
[10] Thomas Jörg and Stefan Dessloch "Near Real-Time Data Warehousing
Using State-of-the-Art ETL Tools”, 2007
[11] Jorg, T., Dessloch, S. “Towards generating ETL processes for
incremental loading” IDEAS, 2008
[12] Jorg, T., Dessloch, S. “Formalizing ETL Jobs for Incremental Loading
of Data Warehouses” BTW, 2009
[13] Kimball, R., Caserta, J. “The Data Warehouse ETL Toolkit: Practical
Techniques for Extracting, Cleaning, Conforming, and Delivering Data”
John Wiley & Sons, 2004
[14] Samuel S. Conn “OLTP and OLAP Data Integration: A Review of
Feasible Implementation Methods and Architectures for Real Time Data
Analysis” 2005 IEEE.
[15] N. Kannan, “Real-Time Business Intelligence – Building Block for
Business Process Optimization”, DM Review Online. July 2004
[16] I. Ankorion. Change Data Capture-Efficient ETL for Real-Time BI.
Article published in DM Review Magazine, January 2005 Issue.
[17] E. Schallehn, K. U. Sattler, and G. Saake, “Advanced Grouping and
Aggregation for Data Integration”. CIKM- Atlanta, GA, 2007.