Fig. 1 - uploaded by Panos Vassiliadis
A simple ETL workflow. 

Source publication
Article
Full-text available
Extraction-transformation-loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization, and insertion into a data warehouse. In this paper, we delve into the logical optimization of ETL processes, modeling it as a state-space search problem. We consider each ETL workflow as...

Contexts in source publication

Context 1
... Is it possible to perform the aggregation before the transformation of American values to European ones? In principle, this should be allowed, since the dates are kept in the resulting data and can be transformed later; in that case, the aggregation can be pushed before the function. How can we deal with naming problems? PARTS1.COST and PARTS2.COST are homonyms, but they do not correspond to the same entity (the first is in Euros and the second in Dollars). Assuming that the transformation $2€ produces the attribute COST (in Euros), how can we guarantee that it corresponds to the same real-world entity as PARTS1.COST? In Fig. 2, we can see how the workflow of Fig. 1 can be transformed into an equivalent workflow performing the same task. The selection on Euros has been propagated to both branches of the workflow so that low values are pruned early. Still, we cannot push the selection either before the transformation $2€ or before the aggregation. At the same time, the aggregation and the DATE conversion function (A2E) have been swapped. In summary, the two problems that have arisen are 1) to determine which operations over the workflow are legal and 2) which ordering is the best in terms of performance gains. We take a novel approach to the problem by taking this peculiarity into consideration. Moreover, we employ a workflow paradigm for the modeling of ETL processes, i.e., we do not strictly require that an activity output data to some persistent data store; rather, activities are allowed to output data to one another. In such a context, I/O minimization is not the primary problem. In this paper, we focus on the optimization of the process in terms of logical transformations of the workflow. To this end, we devise a method based on the specifics of an ETL workflow that can reduce its execution cost either by decreasing the total number of processes or by changing the execution order of the processes. The paper deals with the specification of the design of an ETL workflow and its optimization. Our contributions can be listed as ...
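Whether such a swap is legal depends on which attributes each activity consumes and which it produces. The sketch below illustrates that intuition with a deliberately crude attribute-disjointness test; the activity names and the test itself are illustrative assumptions, not the paper's formal swap conditions (which, for instance, do allow swapping the aggregation with the A2E date conversion).

```python
# Illustrative sketch (not the paper's formal conditions): approximate the
# legality of swapping two adjacent activities by checking that neither one
# reads an attribute that the other one produces.
from dataclasses import dataclass, field

@dataclass
class Activity:
    name: str
    reads: set = field(default_factory=set)    # attributes the activity uses
    writes: set = field(default_factory=set)   # attributes it produces/overwrites

def may_swap(a: Activity, b: Activity) -> bool:
    """Conservative test: True only if no read/write conflict exists."""
    return a.writes.isdisjoint(b.reads) and b.writes.isdisjoint(a.reads)

dollars_to_euros = Activity("$2E(COST)", reads={"COST"}, writes={"COST"})
select_cost      = Activity("sigma(COST>0)", reads={"COST"})

# The selection reads the attribute that the conversion produces, so this
# crude test (correctly) refuses to push the selection before the conversion.
print(may_swap(dollars_to_euros, select_cost))   # False
```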
Context 2
... To give an idea of the complexity of the problem, we mention the characteristics of a vast data warehouse as cited in a recent experience report [2]. There, the authors report that their data warehouse population system has to process, within a time window of 4 hours, 80 million records per hour for the entire process (compression, FTP of files, decompression, transformation, and loading) on a daily basis. The volume of data rises to about 2 TB, with the main fact table containing about 3 billion records. The request for performance is so pressing that some processes are hard-coded in low-level Database Management System (DBMS) calls to avoid the extra step of storing data to a target file to be loaded to the data warehouse through the DBMS loader. The above clearly shows that intelligent techniques for data preparation can greatly improve the overall process of data warehouse population. So far, research has only partially dealt with the problem of designing and managing ETL workflows. Typically, research approaches concern 1) the optimization of stand-alone problems (e.g., the problem of duplicate detection [21]) in an isolated setting and 2) problems mostly related to Web data (e.g., [10]). Recently, research on data streams [1], [4] has brought up the possibility of taking an alternative look at the problem of ETL. Nevertheless, for the moment, research in data streaming has focused on different topics, such as on-the-fly computation of queries [1], [4], [15]. To our knowledge, there is no systematic treatment of the problem as far as the design of an optimal ETL workflow is concerned. On the other hand, leading commercial tools [12], [13], [17], [19] allow the design of ETL workflows, but do not use any optimization technique. The designed workflows are propagated to the DBMS for execution; thus, the DBMS undertakes the task of optimization. Clearly, we can do better than this, because an ETL process cannot be considered as a "big" query. Instead, it is more realistic to treat an ETL process as a complex transaction. In addition, in an ETL workflow, there are processes that run in separate environments, usually not simultaneously and under time constraints. One could argue that we could express all ETL operations in terms of relational algebra and then optimize the resulting expression as usual. Later in this paper, we demonstrate that traditional logic-based algebraic query optimization can be blocked, basically due to the existence of data manipulation functions. Consider the example of Fig. 1, which describes the population of a table of a data warehouse DW from two source databases S1 and S2. In particular, it involves the propagation of data from the recordset PARTS1(PKEY,SOURCE,DATE,COST) of source S1, which stores monthly information, as well as from the recordset PARTS2(PKEY,SOURCE,DATE,DEPT,COST) of source S2, which stores daily information. In the DW, the recordset PARTS(PKEY,SOURCE,DATE,COST) stores monthly information for the cost in Euros (COST) of parts (PKEY) per source (SOURCE). We assume that both the first supplier and the data warehouse are European and the second supplier is American; thus, the data coming from the second source need to be converted to European values and formats. In Fig. 1, activities are numbered with their execution priority and tagged with a description of their functionality. The flow for source S1 is: (3) a check for Not Null values is performed on attribute COST.
The flow for source S2 is: (4) a conversion from Dollars ($) to Euros (€) is performed on attribute COST, (5) dates (DATE) are converted from American to European format, and (6) an aggregation for monthly supplies is performed and the unnecessary attribute DEPT (for department) is discarded from the flow. Then, (7) the two flows are unified and, before being loaded to the warehouse, (8) a final check is performed on the COST attribute, ensuring that only values above a certain threshold (e.g., COST > 0) are propagated to the warehouse. There are several interesting problems and optimization opportunities in the example of Fig. ...
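For concreteness, here is a minimal sketch of the Fig. 1 flow as a directed acyclic graph, built only from the activities described above (activities 1 and 2, which the excerpt does not detail, are omitted; node labels are informal, and the standard-library graphlib module is used):

```python
# Minimal sketch of the Fig. 1 workflow as a DAG; the structure is reconstructed
# from the textual description above. Requires Python 3.9+ (graphlib).
from graphlib import TopologicalSorter

# node -> set of predecessors
flow = {
    "3: NotNull(COST)":             {"PARTS1@S1"},
    "4: $2E(COST)":                 {"PARTS2@S2"},
    "5: A2E(DATE)":                 {"4: $2E(COST)"},
    "6: Aggregate(monthly, -DEPT)": {"5: A2E(DATE)"},
    "7: Union":                     {"3: NotNull(COST)", "6: Aggregate(monthly, -DEPT)"},
    "8: sigma(COST>0)":             {"7: Union"},
    "DW.PARTS":                     {"8: sigma(COST>0)"},
}

# Any topological order of this graph is a valid execution order for the
# (unoptimized) flow; the optimizer's job is to choose among equivalent graphs.
print(list(TopologicalSorter(flow).static_order()))
```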
Context 3
... Fig. 2. A workflow equivalent to the one of Fig. 1. ...

Citations

... Moreover, understanding the steps that make up such data preparation opens the door to its optimization, an important step given that data flows in Big Data settings typically work with large quantities of data and require complex transformations over them. In addition to classical flow optimization (Simitsis et al. 2005), given the dynamicity of different data flows and the fact that their parts are typically shared by different end users (with high temporal locality: 80% of data is reused within minutes or hours (Chen et al. 2012)), more advanced techniques like multi-flow optimization or view materialization may also be required. ...
Article
Full-text available
Obtaining valuable insights and actionable knowledge from data requires cross-analysis of domain data typically coming from various sources. Doing so inevitably imposes burdensome processes of unifying different data formats and discovering integration paths, all this given the specific analytical needs of a data analyst. Along with large volumes of data, the variety of formats, data models, and semantics drastically contributes to the complexity of such processes. Although there have been many attempts to automate various processes along the Big Data pipeline, no unified platforms accessible by users without technical skills (like statisticians or business analysts) have been proposed. In this paper, we present a Big Data integration platform (Quarry) that uses hypergraph-based metadata to facilitate (and largely automate) the integration of domain data coming from a variety of sources, and provides an intuitive interface to assist end users both in: (1) data exploration with the goal of discovering potentially relevant analysis facets, and (2) consolidation and deployment of data flows which integrate the data and prepare them for further analysis (descriptive or predictive), visualization, and/or publishing. We validate Quarry’s functionalities with the use case of World Health Organization (WHO) epidemiologists and data analysts in their fight against Neglected Tropical Diseases (NTDs).
... The extraction of data from different types of sources and their careful processing are associated with a number of problems [5]. The source data reside in sources of a wide variety of types and formats, created in various systems. ...
Article
Full-text available
The paper discusses the experience of introducing a developed automated system for integrating data directories from various, often unrelated and heterogeneous, information sources at the aircraft manufacturing enterprise Aviastar-SP JSC (Ulyanovsk, Russia). This automated system allows the integration, in both manual and automatic modes, of the directory databases used at the enterprise. To test its operability on real data, the developed system went through trial operation and produced good results.
... Cost models, apart from being useful in their own right, are encapsulated in cost-based optimizers; currently, for example, cost-based optimization solutions for task ordering in data flows employ simple cost models that may not capture the flow execution running time accurately, as shown in this work. For example, the sum cost metric, which is employed by many state-of-the-art task ordering techniques [17,20,30], merely sums the cost of the individual tasks. This results in an execution cost computation that may deviate from the real execution time, and the corresponding optimizations may not be reflected in response time. ...
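To make the distinction concrete, the following toy sketch contrasts a sum-style cost with a bottleneck-style bound that is closer to what a pipelined, parallel execution would exhibit; the task names and costs are made up for illustration and do not come from the cited works.

```python
# Toy contrast between a sum-style cost metric and a bottleneck-style bound.
# Task names and costs are made up for illustration only.
task_cost = {"extract": 40.0, "convert": 25.0, "aggregate": 30.0, "load": 45.0}

# SCM-style metric: simply sum the per-task costs.
scm = sum(task_cost.values())

# On a pipelined/parallel engine the flow is often bound by its slowest stage,
# so the observed response time can be far below the plain sum.
bottleneck_bound = max(task_cost.values())

print(f"sum cost metric  : {scm:.1f}")              # 140.0
print(f"bottleneck bound : {bottleneck_bound:.1f}") # 45.0
```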
... In [10], an optimization algorithm for query plans with dependency constraints between algebraic operators is presented. The techniques in [17] build on top of [14,21] and are shown to be superior to approaches such as [5,12,15,22,26,30,36] when SCM-F is targeted. In [18], an exhaustive optimization proposal for data flows is presented that aims to produce all the topological sortings of the tasks in such a way that each sorting is produced from the previous one with a minimal amount of changes. ...
Article
Full-text available
Although modern data flows are executed in parallel and distributed environments, e.g., on a multi-core machine or on the cloud, current cost models, e.g., those considered by state-of-the-art data flow optimization techniques, do not accurately reflect the response time of real data flow execution in these execution environments. This is mainly due to the fact that the impact of parallelism, and more specifically, the impact of concurrent task execution on the running time, is not adequately modeled in current cost models. The contribution of this work is twofold. Firstly, we propose an advanced cost model that aims to reflect the response time of a data flow that is executed in parallel more accurately. Secondly, we show that existing optimization solutions are inadequate and develop new optimization techniques targeting the proposed cost model. We focus on the single multi-core machine environment provided by modern business intelligence tools, such as Pentaho Kettle, but our approach can be extended to massively parallel and distributed settings. The distinctive feature of our proposal is that we model both time overlaps and the impact of concurrency on task running times in a combined manner; the latter is appropriately quantified and its significance is exemplified. Furthermore, we propose extensions to current optimizers that decide on the exact ordering of flow tasks, taking into account the new optimization metric. Finally, we evaluate the new optimization algorithms and show up to 59% response time improvement over state-of-the-art task ordering techniques.
... In an ETL tool, unknown functions are possible in the form of self-written functions. Such a tool cannot optimize the process holistically because of these black-box functionalities [16]. ...
Chapter
Full-text available
Extract-Transform-Load (ETL) describes the process of loading data from a source to a destination. The source and the destination can be physically separated, and transformations may take place in between. Data preparation happens regularly. To minimize interference with other business processes and to guarantee high data availability, these processes are often run overnight. Therefore, the demand for shorter processing times of ETL processes is increasing steadily. Besides data availability and freshness, another reason is the transition to real-time or near-real-time analysis of data and the growing data volume. There are several approaches to the optimization of ETL processes, which are highlighted in detail in this article. A closer look is taken at the advantages and disadvantages of the presented approaches. In conclusion, the approaches are compared against each other and a recommendation depending on the use case is given.
... There has been a lot of research over the past decade on the optimization of ETL workflows due to a critical requirement on execution time. The research by Simitsis et al. (2005), Tziovara et al. (2007), Kumar and Kumar (2010), Karagiannis et al. (2013), and Vassiliadis et al. (2009) highlights the problem of ill-timed completion of an ETL workflow and discusses methods to improve its execution performance. ...
... The work presented by Simitsis et al. (2005) proposes to optimize the ETL workflow by re-ordering the ETL activities in a directed acyclic graph (DAG), pushing the highly selective activities toward the beginning of the flow. Simitsis et al. (2010) use the same approach as previously (Simitsis et al., 2005), but it is now more focused on generating an optimal ETL workflow in terms of performance, fault-tolerance, and freshness. Similarly to the aforementioned approaches in terms of ETL workflow design, Tziovara et al. (2007) propose to change the order of input tuples to improve the execution cost of an ETL activity, which results in an optimal overall cost of the execution flow. ...
Article
Full-text available
Today’s ETL tools provide capabilities to develop custom code as user-defined functions (UDFs) to extend the expressiveness of the standard ETL operators. However, while this allows us to easily add new functionalities, it also comes with the risk that the custom code is not amenable to optimization, e.g., by parallelism, and for this reason performs poorly for data-intensive ETL workflows. In this paper, we present a novel framework which allows the ETL developer to choose a design pattern in order to write parallelizable code, and which generates a configuration for the UDFs to be executed in a distributed environment. This enables ETL developers with minimal expertise in distributed and parallel computing to develop UDFs without having to deal with parallelization configurations and complexities. We perform experiments on large-scale datasets based on TPC-DS and BigBench. The results show that our approach significantly reduces the effort of ETL developers and at the same time generates efficient parallel configurations to support complex and data-intensive ETL tasks.
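As a rough illustration of the kind of UDF such design patterns target, the sketch below shows a stateless, row-wise function that can safely be partitioned across a worker pool; the UDF, data, conversion rate, and pool size are hypothetical and do not come from the paper's framework.

```python
# Hypothetical example of a "parallelizable by construction" row-wise UDF:
# the function is stateless, so its input can be partitioned and processed by
# a worker pool. The UDF, data, rate, and pool size are illustrative only.
from multiprocessing import Pool

def normalize_cost(row: dict) -> dict:
    """Stateless per-row UDF: convert COST from Dollars to Euros (fixed, assumed rate)."""
    out = dict(row)
    out["COST"] = round(row["COST"] * 0.9, 2)
    return out

if __name__ == "__main__":
    rows = [{"PKEY": i, "COST": float(i)} for i in range(1, 1001)]
    with Pool(processes=4) as pool:                       # degree of parallelism
        transformed = pool.map(normalize_cost, rows, chunksize=100)
    print(transformed[:3])
```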
... These different source systems can originate from the organization itself or from external organizations [7]. Data to be taken over can be stored in flat files, Excel files, relational databases, non-relational databases, or data repositories from different tools and systems like a Customer Relationship Management (CRM) tool, a Project Management tool, or an Enterprise Resource Planning (ERP) system [8] [9]. ...
Conference Paper
Full-text available
Contemporary organizations tend to improve their decision-making processes in order to improve the organization's business as well. Improving the decision-making process is often associated with the establishment of a Data Warehouse (DW) system, which in many cases proves to be a very good solution. The central part of the establishment of a DW system is the Extract-Transform-Load (ETL) process, which involves extracting data from different homogeneous or heterogeneous data sources, transforming them, and finally placing the transformed data in a common location. In addition to its application in DW systems, the ETL process is widely applied in data integration, migration, staging, and master data management efforts. The organization can choose from a variety of ETL tools, and the choice of an appropriate ETL tool is extremely important for each organization. In the paper, we analyze different approaches to ETL tool selection. We select criteria proposed in the reviewed research and illustrate how they can be applied in a case study comparing two ETL tools: Talend Open Studio for Data Integration and SQL Server Integration Services.
... -Task schemata, which refer to the definition of the schema of the data input and/or output of each task. Note that dependencies may be produced by task schemata through simple processing [87], especially if they contain information about which schema elements are bound or free [58]. However, task schemata may serve purposes other than deriving dependencies, e.g., to check whether a task contributes to the final desired output of the flow. ...
... Another exhaustive technique is to define the problem as a state-space search one [87]. In such a space, each possible task ordering is modeled as a distinct state and all states are eventually visited. ...
... Similar to the optimization proposals described previously, this technique is not scalable either. Another form of task re-ordering is when a single input/output task is moved before or after a multi-input or a multi-output task [87,92]. An example case is when two copies of a proliferate single input/output task are originally placed on the two inputs of a binary fork operation and, after re-ordering, are moved after the fork. ...
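The following toy sketch illustrates that kind of move, using a bag union as the binary operation and a NotNull filter as the unary task; the operation, data, and functions are illustrative stand-ins, not the exact example from the survey.

```python
# Toy illustration of moving a unary task across a binary operation: two copies
# of a NotNull filter on the inputs of a union are equivalent to a single copy
# placed after the union. Data and operations are made up for the sketch.
def not_null_cost(rows):
    return [r for r in rows if r["COST"] is not None]

def union(left, right):
    return left + right   # order-preserving bag union, for the sketch

s1 = [{"PKEY": 1, "COST": 10.0}, {"PKEY": 2, "COST": None}]
s2 = [{"PKEY": 3, "COST": 5.0},  {"PKEY": 4, "COST": None}]

before = union(not_null_cost(s1), not_null_cost(s2))  # one copy per input branch
after  = not_null_cost(union(s1, s2))                 # single copy after the union

assert before == after
print(after)   # [{'PKEY': 1, 'COST': 10.0}, {'PKEY': 3, 'COST': 5.0}]
```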
Article
Full-text available
Workflow technology is rapidly evolving and, rather than being limited to modeling the control flow in business processes, is becoming a key mechanism to perform advanced data management, such as big data analytics. This survey focuses on data-centric workflows (or workflows for data analytics or data flows), where a key aspect is data passing through and getting manipulated by a sequence of steps. The large volume and variety of data, the complexity of operations performed, and the long time such workflows take to compute give rise to the need for optimization. In general, data-centric workflow optimization is a technology in evolution. This survey focuses on techniques applicable to workflows comprising arbitrary types of data manipulation steps and semantic inter-dependencies between such steps. Further, it serves a twofold purpose. Firstly, to present the main dimensions of the relevant optimization problems and the types of optimizations that occur before flow execution. Secondly, to provide a concise overview of the existing approaches with a view to highlighting key observations and areas deserving more attention from the community.
... Furthermore, the community has been focusing on techniques for optimizing the execution of an ETL workflow [9]. The most common techniques are based on rearranging tasks and moving more selective tasks toward the beginning of a workflow, e.g., [10][11][12]. On top of that, existing research shows that applying processing parallelism at the data level, at the activity level, or both, is a known approach to attain better execution of an ETL workflow. ...
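A common way to realize the "more selective tasks first" idea for freely commutable unary tasks is to order them by a rank that combines selectivity and per-tuple cost; the sketch below uses the classic selectivity/cost ratio with made-up numbers, not a formula taken from the cited works.

```python
# Illustrative rank ordering of freely commutable unary tasks: cheap and
# selective tasks move toward the start of the flow. Numbers are made up.
tasks = [
    {"name": "surrogate-key lookup", "selectivity": 1.00, "cost_per_tuple": 5.0},
    {"name": "COST > 0 filter",      "selectivity": 0.20, "cost_per_tuple": 1.0},
    {"name": "NotNull(COST)",        "selectivity": 0.95, "cost_per_tuple": 0.5},
]

def rank(task):
    # Classic rank: (selectivity - 1) / cost; the most negative rank
    # (i.e., the cheapest, most selective task) is scheduled first.
    return (task["selectivity"] - 1.0) / task["cost_per_tuple"]

for t in sorted(tasks, key=rank):
    print(f"{rank(t):6.2f}  {t['name']}")
```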
... (d) Represent ETL constraints with separate activities in a logical model and determine their execution order. (e) Generate the schemata involved in a logical model using the algorithm proposed in [12] in order to ensure that the semantics of the involved concepts do not change even after changing the execution order of tasks in an ETL workflow. ...
... The work of [31] extends that of [12] w.r.t. generating an optimal ETL workflow in terms of performance, fault tolerance, and freshness, as described in [23]. ...
Article
Full-text available
In this paper, we discuss the state of the art and current trends in designing and optimizing ETL workflows. We explain the existing techniques for: (1) constructing a conceptual and a logical model of an ETL workflow, (2) its corresponding physical implementation, and (3) its optimization, illustrated by examples. The discussed techniques are analyzed w.r.t. their advantages, disadvantages, and challenges in the context of metrics such as autonomous behavior, support for quality metrics, and support for ETL activities as user-defined functions. We draw conclusions on still open research and technological issues in the field of ETL. Finally, we propose a theoretical ETL framework for ETL optimization.
... In a right-time DW architecture [12] there are two components whose performance is crucial to assuring real-time or near-real-time processing of data: optimized ETL software and refreshing software. Logical optimization, focusing on restructuring ETL processes in order to minimize the cardinality of data flows, has been proposed by Simitsis et al. [13,14]. In particular, the second paper proposes a heuristic for searching the space of possible ETL graphs to find the most efficient execution. ...
Article
In traditional OLAP systems, the ETL process loads all available data in the data warehouse before users start querying them. In some cases, this may be either inconvenient (because data are supplied from a provider for a fee) or unfeasible (because of their size); on the other hand, directly launching each analysis query on source data would not enable data reuse, leading to poor performance and high costs. The alternative investigated in this paper is that of fetching and storing data on-demand, i.e., as they are needed during the analysis process. In this direction we propose the Query-Extract-Transform-Load (QETL) paradigm to feed a multidimensional cube; the idea is to fetch facts from the source data provider, load them into the cube only when they are needed to answer some OLAP query, and drop them when some free space is needed to load other facts. Remarkably, QETL includes an optimization step to cheaply extract the required data based on the specific features of the data provider. The experimental tests, made on a real case study in the genomics area, show that QETL effectively reuses data to cut extraction costs, thus leading to significant performance improvements.
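Reduced to its bare bones, the on-demand idea described above can be pictured as a fetch-on-miss store with eviction; the sketch below uses an LRU policy, a fixed capacity, and a dummy provider purely as assumptions and is not the paper's actual QETL algorithm.

```python
# Rough sketch of on-demand loading: fact partitions are fetched from the
# provider only when a query needs them, and dropped (LRU here, as an
# assumption) when space is needed for others. Not the paper's QETL algorithm.
from collections import OrderedDict

class OnDemandFactStore:
    def __init__(self, fetch_fn, capacity=3):
        self.fetch_fn = fetch_fn          # pulls a fact partition from the source
        self.capacity = capacity          # max partitions kept locally
        self.store = OrderedDict()

    def get(self, partition_key):
        if partition_key in self.store:               # already loaded: reuse it
            self.store.move_to_end(partition_key)
            return self.store[partition_key]
        facts = self.fetch_fn(partition_key)          # extract + transform on demand
        if len(self.store) >= self.capacity:          # make room: drop least recently used
            self.store.popitem(last=False)
        self.store[partition_key] = facts
        return facts

# Usage with a dummy provider:
store = OnDemandFactStore(lambda k: [f"fact rows for {k}"], capacity=2)
for q in ["2021-01", "2021-02", "2021-01", "2021-03"]:
    store.get(q)
print(list(store.store))   # ['2021-01', '2021-03']
```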
... 1. Sum Cost Metric of the Full plan (SCM-F): minimize the sum of the task and communication costs of a data flow [7,13,20,10,15,8,16]. The first three metrics and the associated cost models can capture the response time under specific assumptions only. ...
Conference Paper
Although modern data flows are executed in parallel and distributed environments, e.g., on a multi-core machine or on the cloud, current cost models, e.g., those considered by state-of-the-art data flow optimization techniques, do not accurately reflect the response time of real data flow execution in these execution environments. This is mainly due to the fact that the impact of parallelism, and more specifically, the impact of concurrent task execution on the running time, is not adequately modeled. In this work, we propose a cost modeling solution that aims to accurately reflect the response time of a data flow that is executed in parallel. We focus on the single multi-core machine environment provided by modern business intelligence tools, such as Pentaho Kettle, but our approach can be extended to massively parallel and distributed settings. The distinctive feature of our proposal is that we model both time overlaps and the impact of concurrency on task running times in a combined manner; the latter is appropriately quantified and its significance is exemplified.