Speeding ETL Processing in Data Warehouses Using High-Performance Joins for Change Data Capture (CDC)
Darshan M. Tank1, Amit Ganatra2, Y. P. Kosta3
1,3Department of IT, 2Department of CE
Charotar Institute of Technology (Faculty of Technology)
Charotar University of Science and Technology, Changa
Anand-388421, India
1dmtank@gmail.com, 2amitganu@yahoo.com,
3ypkosta@yahoo.com
C. K. Bhensdadia
Department of CE
Faculty of Technology
Dharmsinh Desai University
Nadiad-387001, India
ckbhensdadia@yahoo.co.in
Abstract—In today's fast-changing, competitive environment, a complaint frequently heard from data warehouse users is that access to time-critical data is too slow. Shrinking batch windows and exponentially increasing data volumes place growing demands on data warehouses to deliver instantly available information. At the same time, data warehouses must consistently generate accurate results, and achieving both accuracy and speed with large, diverse data sets is challenging.
Various operations can be used to optimize data manipulation and thus accelerate data warehouse processes. In this paper we examine two such operations, join and aggregation, which play an integral role during preprocessing as well as in manipulating and consolidating data in a data warehouse. Our approach demonstrates how hours or even days can be saved when processing large amounts of data for ETL, data warehousing, business intelligence (BI), and other mission-critical applications.
Keywords—Business Intelligence, Near Real-Time Data
Warehousing, Change Data Capture (CDC), Extract-Transform-
Load (ETL)
I. INTRODUCTION
The widespread use of the Internet and related
technologies in various business domains has accelerated the
intensity of competition, increased the volume of data and
information available, and shortened decision-making cycles
considerably. Typically, in a large organization, many
distributed, heterogeneous data sources, applications, and
processes have to be integrated to ensure delivery of the best
information to the decision makers. In order to support
effective analysis and mining of such diverse, distributed
information, a data warehouse (DWH) collects data from
multiple, heterogeneous (operational) source systems and
stores integrated information in a central repository.
There are numerous challenges facing IT departments
today as they deal with reduced budgets, smaller staffs, and
ever-increasing demands for business-critical applications. Within the corporate information factory, large volumes of data must often be moved from application to application or from repository to repository, creating a flow of data across the enterprise [1].
Figure 1. Technical architecture of the Corporate Information Factory.
As Fig. 1 shows, data volumes can grow rapidly even as demands on the data increase, for example when data is processed and moved into another form or database. Managing all of this data movement, and the time it takes to process the information, creates a new set of IT challenges [5].
II. ETL PROCESS
Extract, Transform, Load (ETL) is a common term in data warehousing that refers to extracting data from source systems, transforming the data according to business rules, and loading it into the target data warehouse. ETL is the process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse [3].
ETL systems move data from OLTP systems to a data
warehouse, but they can also be used to move data from one
data warehouse to another. A heterogeneous architecture for
an ETL system is one that extracts data from multiple sources.
The complexity of this architecture arises from the fact that
data from more than one source must be merged, rather than
from the fact that data may be formatted differently in the
different sources.
Figure 2. ETL System Architecture.
The ETL process is not a one-time event; new data is
added to a data warehouse periodically [4]. Typical periodicity
may be monthly, weekly, daily, or even hourly, depending on
the purpose of the data warehouse and the type of business it
serves.
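To make the three stages concrete, the following minimal Python sketch runs the extract, transform, and load steps over in-memory records. The field names and business rules here are purely illustrative, not taken from any particular system; a production ETL job would read from source databases and write to warehouse tables.

# Minimal ETL sketch: extract, transform, load over in-memory records.
# All field names and rules are illustrative assumptions.

def extract(source_rows):
    """Pull raw records out of a source system (here, a list of dicts)."""
    return list(source_rows)

def transform(rows):
    """Apply business rules: normalize names, derive a total, drop invalid rows."""
    clean = []
    for row in rows:
        if row.get("quantity", 0) <= 0:
            continue                      # illustrative rule: skip empty orders
        clean.append({
            "customer": row["customer"].strip().upper(),
            "total": row["quantity"] * row["unit_price"],
        })
    return clean

def load(rows, warehouse):
    """Append transformed records to the target store (here, a list)."""
    warehouse.extend(rows)

warehouse = []
source = [
    {"customer": " acme ", "quantity": 3, "unit_price": 9.5},
    {"customer": "Globex", "quantity": 0, "unit_price": 4.0},  # filtered out
]
load(transform(extract(source)), warehouse)
print(warehouse)  # [{'customer': 'ACME', 'total': 28.5}]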
III. DATA PROCESSING CHALLENGES
Staying competitive in today's real-time business world demands the capability to process ever-increasing volumes of data at high speed. As noted above, data warehouse users frequently complain that access to time-critical data is too slow, while shrinking batch windows and exponentially growing data volumes place ever greater demands on data warehouses to deliver instantly available information.
Additionally, data warehouses must consistently generate accurate results, and achieving both accuracy and speed with large, diverse data sets can be challenging [5]. There is an increasing need to update the underlying data infrastructure to improve access to, and the quality of, real-time business intelligence (BI) and other mission-critical operations, including data marts, data mining, online data stores, and online transaction processing (OLTP) systems.
One of the primary factors affecting data warehouse
performance is the underlying physical distribution of the data
in the databases, which must be consolidated in various ways
[6]. An inability to perform data intensive operations
efficiently will inevitably impede business analyses and
subsequent decision-making.
IV. KEY COMPONENTS OF DATA WAREHOUSE
PROCESSING
Various operations can be used to optimize data
manipulation and thus accelerate data warehouse processes.
Two operations in particular, join and aggregation, play an integral role during preprocessing as well as in manipulating and consolidating data in a data warehouse.
Deriving useful information from raw data within a data
warehouse involves joining factual and dimensional
information before aggregating it to produce a report or output
for downstream analysis. However, joins and aggregations are data-intensive and time-intensive, which prevents data warehouses with insufficiently fast ETL software from performing them frequently, even though frequency is key to providing data for timely business analysis [7].
The usual go-to software products for processing data for BI analytics and other mission-critical applications include various extract, transform, load (ETL) and data warehousing solutions [12]. However, these products do not always provide the speed required to deliver data on time. Such products must support Advanced Data Management (ADM) capabilities that reduce the processing time of data-intensive operations through high-performance joins and high-performance aggregations.
A. Joins
Joins are used to combine information from two or more
data sources, such as database tables, and place it into a new
data source suitable for downstream processing or reports.
Joins are particularly powerful because they enable rapidly
changing data to be organized into categories for subsequent
report preparation through the use of matching keys. Joins are
used to pre-process data, to improve the efficiency of queries,
and to accelerate change data capture (CDC).
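As a simple illustration of this idea, the Python sketch below (with made-up table contents) performs a hash join that combines fact rows with descriptive dimension data via a matching key; each fact row is resolved with an average O(1) lookup in the dimension hash table.

# Sketch of a hash join: combine fact rows with descriptive dimension
# data via a matching key. Table contents are invented for illustration.

dim_product = {                      # dimension table keyed by product_id
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Electronics"},
}

fact_sales = [                       # fact rows referencing the dimension key
    {"product_id": 1, "amount": 120.0},
    {"product_id": 2, "amount": 75.0},
    {"product_id": 1, "amount": 40.0},
]

# Build the joined output in one pass over the (larger) fact table;
# each dimension lookup is a constant-time hash probe.
joined = [
    {**fact, **dim_product[fact["product_id"]]}
    for fact in fact_sales
    if fact["product_id"] in dim_product
]
for row in joined:
    print(row)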
B. Aggregations
Aggregations play a key role at various stages of data
warehouse processing. From preprocessing of data prior to its
entering the data warehouse, to dimensional data analysis used
for conducting queries and generating reports, aggregations
are critical to efficiently preparing, configuring, and analyzing
large volumes of data.
1) Using High-Performance Aggregations for
Preprocessing: By aggregating data prior to loading it into the
data warehouse, queries, database loads, and other
downstream processing can be performed much faster. ADM
should quickly summarize factual data to the minimum level
of granularity required by the data warehouse. A common
application removes redundant transaction data from multiple
sources for faster queries and database loads [17].
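A minimal sketch of such preprocessing, assuming an illustrative (day, store, amount) transaction layout, is shown below: raw transaction rows are summarized to one row per day and store before loading, so downstream steps read far fewer rows.

# Sketch: pre-aggregate raw transactions to daily store totals before the
# warehouse load. Field layout is an assumption for illustration.
from collections import defaultdict

transactions = [
    ("2016-08-01", "store_1", 10.0),
    ("2016-08-01", "store_1", 25.0),   # redundant detail rows collapse...
    ("2016-08-01", "store_2", 7.5),
]

daily_totals = defaultdict(float)
for day, store, amount in transactions:
    daily_totals[(day, store)] += amount   # ...into one row per (day, store)

print(dict(daily_totals))
# {('2016-08-01', 'store_1'): 35.0, ('2016-08-01', 'store_2'): 7.5}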
2) Pre-Calculated Aggregations Speed Queries: Data
warehouse experts agree that aggregates are the best way to
speed data warehouse queries. According to data warehousing
expert Ralph Kimball, “Every data warehouse should contain
precalculated and prestored aggregation tables”. Aggregate
operations yield the greatest performance benefits when they
are used for input to data analysis.
In the case of a six million row query, reducing the number
of rows read by creating aggregations across dimensions can
vastly accelerate processing time [13]. A query answered from
base-level data can take hours and involve millions of data
records and millions of calculations. With pre-calculated
aggregates, the same query can be answered in seconds with
just a few records and calculations.
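The sketch below illustrates the difference on toy data: the same question is answered once by scanning base-level rows and once by a single lookup into a prestored aggregate table. Names and figures are illustrative.

# Sketch: the same question answered two ways. Scanning base-level rows
# touches every record; the precomputed aggregate answers with one lookup.

base_rows = [("east", 100.0), ("west", 50.0), ("east", 25.0)] * 1000

# Expensive path: full scan of the base data for one answer.
east_total_scan = sum(amt for region, amt in base_rows if region == "east")

# Cheap path: a prestored aggregate table built once, reused by every query.
aggregate = {}
for region, amt in base_rows:
    aggregate[region] = aggregate.get(region, 0.0) + amt

east_total_fast = aggregate["east"]          # single dictionary lookup
assert east_total_scan == east_total_fast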
3) Using Multi-Level Hierarchical Aggregations:
Sophisticated aggregation schemes recognize dimensional
hierarchies and build higher-level aggregations from more
granular aggregations. For instance, a daily aggregation of
sales for a region could be used to build a monthly aggregation
of sales by region, and thereby avoid an aggregation of the
more granular daily sales totals. Such aggregate-awareness can
significantly increase speed and enhance reporting capacity.
Figure 3. Multi-Level Hierarchical Aggregation.
Aggregations can also be used to replace fact data with rolled-up versions of themselves. ADM should also perform multi-level hierarchical aggregation (Fig. 3) to help build cubes for more advanced dimensional data analyses.
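A minimal sketch of this aggregate-aware roll-up, assuming an illustrative (date, region) keying, builds monthly regional totals from already-computed daily aggregates rather than from the raw sales rows:

# Sketch of aggregate-aware roll-up: build monthly regional totals from
# existing daily aggregates instead of the raw sales rows.
from collections import defaultdict

daily = {                                  # (date, region) -> daily total
    ("2016-07-01", "east"): 35.0,
    ("2016-07-02", "east"): 15.0,
    ("2016-07-01", "west"): 20.0,
}

monthly = defaultdict(float)
for (date, region), total in daily.items():
    month = date[:7]                       # '2016-07'
    monthly[(month, region)] += total      # rolls up few rows, not raw data

print(dict(monthly))
# {('2016-07', 'east'): 50.0, ('2016-07', 'west'): 20.0}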
V. CDC (CHANGE DATA CAPTURE)
Change data capture is an approach to data integration,
based on the identification, capture, and delivery of only the
changes made to operational/transactional data systems. CDC
solutions occur most often in data-warehouse environments
since capturing and preserving the state of data across time is
one of the core functions of a data warehouse, but CDC can be
utilized in any database or data repository system. By
processing only the changes, CDC makes the data integration,
and more specifically the ‘Extract’ part of the ETL process
more efficient. When done correctly, it also reduces the
‘latency’ between the time a change occurs in the source
systems and the time the same change is made available to the
business user in the data warehouse.
Next-generation data integration and ETL tools need to support Change Data Capture (CDC), a technology that makes it possible to identify, capture, and move only the changes made to enterprise data sources, rather than moving the entire source data set. Implementing CDC makes real-time data and information integration significantly more efficient and delivers data at the right time [6].
Figure 4. CDC working in conjunction with ETL tools.
A common case for using CDC is in conjunction with ETL
tools for faster and more efficient data extract in data
warehouse implementations. A key goal of CDC is to improve
efficiency by reducing the amount of data that needs to be
processed to a minimum [6]. Therefore, if the business requires only certain changes to be captured, it would be wasteful to transfer all of them. The most advanced CDC solutions provide filters that reduce the amount of information transferred, again minimizing resource requirements and maximizing speed and efficiency.
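The sketch below illustrates such filtering on a hypothetical change stream: only events for tables the warehouse actually maintains are passed downstream. The event layout is invented for illustration.

# Sketch: filter a captured change stream so only changes the warehouse
# actually needs are transferred downstream.

changes = [
    {"table": "orders",    "op": "insert", "key": 101},
    {"table": "audit_log", "op": "insert", "key": 900},  # not warehoused
    {"table": "orders",    "op": "update", "key": 77},
    {"table": "orders",    "op": "delete", "key": 12},
]

WAREHOUSED_TABLES = {"orders", "customers"}

to_transfer = [c for c in changes if c["table"] in WAREHOUSED_TABLES]
print(to_transfer)   # only the three 'orders' events move downstream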
VI. HIGH-PERFORMANCE JOINS FOR CHANGE DATA CAPTURE (CDC)
To reduce the time needed to retrieve information, data
must be preprocessed into the proper form for the dimensional
data warehouse. High-speed joins are critical to this process,
which can include lookups of legacy values for appropriate
replacement, cleansing and validating data, identifying and
eliminating mismatched values, and pre-aggregation. Using high-performance joins for data-intensive operations,
descriptive information can be combined with factual data late
in a processing sequence so that storage and throughput
requirements are minimized.
CDC is an increasingly important pre-processing function.
By loading only new, updated, and deleted records into the
data warehouse, CDC significantly conserves time and
resources when carrying out data-intensive processes [16].
Figure 5. Joins Used for Change Data Capture.
Rather than replacing the information in the data
warehouse with the data in the entire online transactional
database, a join will match the primary key of the previously
loaded record with its corresponding new record and then
compare the data portions of the two records to determine if
they’ve changed. In this way, only added, deleted, and altered
records are updated, which significantly reduces elapsed time
of database loads. By using a high-performance join for CDC,
data warehouse updates can be performed with far greater
efficiency.
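The following Python sketch illustrates this join-based comparison on toy records (the record layout is an assumption): primary keys of the previous load are matched against the new extract, and the data portions are compared to classify each record as added, deleted, or changed.

# Sketch of join-based change data capture: match records on the primary
# key, then compare the non-key portions to classify each row.

previous = {          # primary key -> data portion, as loaded last time
    1: {"name": "Alice", "city": "Anand"},
    2: {"name": "Bob",   "city": "Nadiad"},
}
current = {           # the new extract from the transactional source
    1: {"name": "Alice", "city": "Changa"},   # changed
    3: {"name": "Carol", "city": "Anand"},    # added
}

added   = {k: v for k, v in current.items() if k not in previous}
deleted = {k: v for k, v in previous.items() if k not in current}
changed = {k: v for k, v in current.items()
           if k in previous and previous[k] != v}

# Only these three small sets are applied to the warehouse, instead of
# reloading the entire source table.
print("added:", added)      # key 3
print("deleted:", deleted)  # key 2
print("changed:", changed)  # key 1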
VII. CONCLUSIONS
Aggregations and joins are widely used in nearly every industrial sector where large volumes of data must be analyzed, and they can significantly accelerate data processing. They speed up processes such as ETL, staging data for a data warehouse, and database loading. They also minimize resource consumption, making it possible to consolidate hardware for significant cost savings.
ACKNOWLEDGMENT
The authors wish to express special thanks to the management of Charusat for their moral support and continuous encouragement.
REFERENCES
[1] C. Imhoff, “The Corporate Information Factory,” DM Review, December 1999 (http://dmreview.com).
[2] R. M. Bruckner and A. M. Tjoa, “Capturing Delays and Valid Times in Data Warehouses—Towards Timely Consistent Analyses,” Journal of Intelligent Information Systems, September 2002.
[3] J. Schiefer and R. M. Bruckner, “Container-Managed ETL Applications for Integrating Data in Near Real-Time,” Twenty-Fourth International Conference on Information Systems, 2003.
[4] W. H. Inmon and D. Meers, “Maximizing the ‘E’ in Legacy Extract, Transform & Load (ETL),” December 2003.
[5] Syncsort Incorporated, “Solving the Challenges of Exponential Data Growth,” white paper, 2009 (http://syncsort.com).
[6] Attunity Ltd., “Efficient and Real Time Data Integration with Change Data Capture,” white paper, February 2009 (http://attunity.com).
[7] R. M. Bruckner and A. M. Tjoa, “Managing Time Consistency for Active Data Warehouse Environments,” in Proceedings of the Third International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2001), Springer LNCS 2114, pp. 254-263, Munich, Germany, September 2001.
[8] R. M. Bruckner, B. List, and J. Schiefer, “Striving Toward Near Real-Time Data Integration for Data Warehouses,” in Proceedings of the Fourth International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2002), Springer LNCS 2454, pp. 317-326, Aix-en-Provence, France, September 2002.
[9] J. Schiefer, J.-J. Jeng, and R. M. Bruckner, “Managing Continuous Data Integration Flows,” Decision Systems Engineering Workshop (DSE'03), Velden, Austria, June 2003.
[10] T. Jörg and S. Dessloch, “Near Real-Time Data Warehousing Using State-of-the-Art ETL Tools,” 2007.
[11] T. Jörg and S. Dessloch, “Towards Generating ETL Processes for Incremental Loading,” IDEAS, 2008.
[12] T. Jörg and S. Dessloch, “Formalizing ETL Jobs for Incremental Loading of Data Warehouses,” BTW, 2009.
[13] R. Kimball and J. Caserta, “The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data,” John Wiley & Sons, 2004.
[14] S. S. Conn, “OLTP and OLAP Data Integration: A Review of Feasible Implementation Methods and Architectures for Real Time Data Analysis,” IEEE, 2005.
[15] N. Kannan, “Real-Time Business Intelligence – Building Block for Business Process Optimization,” DM Review Online, July 2004.
[16] I. Ankorion, “Change Data Capture – Efficient ETL for Real-Time BI,” DM Review Magazine, January 2005.
[17] E. Schallehn, K. U. Sattler, and G. Saake, “Advanced Grouping and Aggregation for Data Integration,” CIKM, Atlanta, GA, 2007.