Fig. 1 - uploaded by Panos Vassiliadis
A simple ETL workflow. 

Source publication
Article
Full-text available
Extraction-transformation-loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization, and insertion into a data warehouse. In this paper, we delve into the logical optimization of ETL processes, modeling it as a state-space search problem. We consider each ETL workflow as...

Contexts in source publication

Context 1
... Is it possible to perform the aggregation before the transformation of American values to European ones? In principle, this should be allowed, since the dates are kept in the resulting data and can be transformed later; in that case, the aggregation can be pushed before the function. How can we deal with naming problems? PARTS1.COST and PARTS2.COST are homonyms, but they do not correspond to the same entity (the first is in Euros and the second in Dollars). Assuming that the transformation $2€ produces the attribute COST (in Euros), how can we guarantee that it corresponds to the same real-world entity as PARTS1.COST? In Fig. 2, we can see how the workflow of Fig. 1 can be transformed into an equivalent workflow performing the same task. The selection on Euros has been propagated to both branches of the workflow so that low values are pruned early. Still, we cannot push the selection either before the transformation $2€ or before the aggregation. At the same time, the aggregation and the DATE conversion function (A2E) have been swapped. In summary, the two problems that have arisen are 1) to determine which operations over the workflow are legal and 2) which ordering is the best in terms of performance gains. We take a novel approach to the problem by taking this peculiarity into consideration. Moreover, we employ a workflow paradigm for the modeling of ETL processes, i.e., we do not strictly require that an activity output data to some persistent data store; rather, activities are allowed to output data to one another. In such a context, I/O minimization is not the primary problem. In this paper, we focus on the optimization of the process in terms of logical transformations of the workflow. To this end, we devise a method based on the specifics of an ETL workflow that can reduce its execution cost either by decreasing the total number of processes or by changing the execution order of the processes. The paper deals with the specification of the design of an ETL workflow and its optimization. Our contributions can be listed as ...
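Whether such a swap is legal depends on which attributes each activity consumes and which it produces. The sketch below illustrates that intuition with a deliberately crude attribute-disjointness test; the activity names and the test itself are illustrative assumptions, not the paper's formal swap conditions (which, for instance, do allow swapping the aggregation with the A2E date conversion).

```python
# Illustrative sketch (not the paper's formal conditions): approximate the
# legality of swapping two adjacent activities by checking that neither one
# reads an attribute that the other one produces.
from dataclasses import dataclass, field

@dataclass
class Activity:
    name: str
    reads: set = field(default_factory=set)    # attributes the activity uses
    writes: set = field(default_factory=set)   # attributes it produces/overwrites

def may_swap(a: Activity, b: Activity) -> bool:
    """Conservative test: True only if no read/write conflict exists."""
    return a.writes.isdisjoint(b.reads) and b.writes.isdisjoint(a.reads)

dollars_to_euros = Activity("$2E(COST)", reads={"COST"}, writes={"COST"})
select_cost      = Activity("sigma(COST>0)", reads={"COST"})

# The selection reads the attribute that the conversion produces, so this
# crude test (correctly) refuses to push the selection before the conversion.
print(may_swap(dollars_to_euros, select_cost))   # False
```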
Context 2
... To give an idea of the complexity of the problem, we mention the characteristics of a vast data warehouse as cited in a recent experience report [2]. There, the authors report that their data warehouse population system has to process, within a time window of 4 hours, 80 million records per hour for the entire process (compression, FTP of files, decompression, transformation, and loading) on a daily basis. The volume of data rises to about 2 TB, with the main fact table containing about 3 billion records. The request for performance is so pressing that some processes are hard-coded in low-level Database Management System (DBMS) calls to avoid the extra step of storing data to a target file to be loaded to the data warehouse through the DBMS loader. The above clearly shows that intelligent techniques for data preparation can greatly improve the overall process of data warehouse population. So far, research has only partially dealt with the problem of designing and managing ETL workflows. Typically, research approaches concern 1) the optimization of stand-alone problems (e.g., the problem of duplicate detection [21]) in an isolated setting and 2) problems mostly related to Web data (e.g., [10]). Recently, research on data streams [1], [4] has brought up the possibility of taking an alternative look at the problem of ETL. Nevertheless, for the moment, research in data streaming has focused on different topics, such as on-the-fly computation of queries [1], [4], [15]. To our knowledge, there is no systematic treatment of the problem as far as the design of an optimal ETL workflow is concerned. On the other hand, leading commercial tools [12], [13], [17], [19] allow the design of ETL workflows, but do not use any optimization technique. The designed workflows are propagated to the DBMS for execution; thus, the DBMS undertakes the task of optimization. Clearly, we can do better than this, because an ETL process cannot be considered as a "big" query. Instead, it is more realistic to treat an ETL process as a complex transaction. In addition, in an ETL workflow, there are processes that run in separate environments, usually not simultaneously and under time constraints. One could argue that we could express all ETL operations in terms of relational algebra and then optimize the resulting expression as usual. Later in this paper, we demonstrate that traditional logic-based algebraic query optimization can be blocked, basically due to the existence of data manipulation functions. Consider the example of Fig. 1, which describes the population of a table of a data warehouse DW from two source databases S1 and S2. In particular, it involves the propagation of data from the recordset PARTS1(PKEY,SOURCE,DATE,COST) of source S1, which stores monthly information, as well as from the recordset PARTS2(PKEY,SOURCE,DATE,DEPT,COST) of source S2, which stores daily information. In the DW, the recordset PARTS(PKEY,SOURCE,DATE,COST) stores monthly information for the cost in Euros (COST) of parts (PKEY) per source (SOURCE). We assume that both the first supplier and the data warehouse are European and the second supplier is American; thus, the data coming from the second source need to be converted to European values and formats. In Fig. 1, activities are numbered with their execution priority and tagged with a description of their functionality. The flow for source S1 is: (3) a check for Not Null values is performed on attribute COST.
The flow for source S2 is: (4) a conversion from Dollars ($) to Euros (€) is performed on attribute COST, (5) dates (DATE) are converted from American to European format, and (6) an aggregation for monthly supplies is performed and the unnecessary attribute DEPT (for department) is discarded from the flow. Then, (7) the two flows are unified and, before being loaded to the warehouse, (8) a final check is performed on the COST attribute, ensuring that only values above a certain threshold (e.g., COST > 0) are propagated to the warehouse. There are several interesting problems and optimization opportunities in the example of Fig. ...
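For concreteness, here is a minimal sketch of the Fig. 1 flow as a directed acyclic graph, built only from the activities described above (activities 1 and 2, which the excerpt does not detail, are omitted; node labels are informal, and the standard-library graphlib module is used):

```python
# Minimal sketch of the Fig. 1 workflow as a DAG; the structure is reconstructed
# from the textual description above. Requires Python 3.9+ (graphlib).
from graphlib import TopologicalSorter

# node -> set of predecessors
flow = {
    "3: NotNull(COST)":             {"PARTS1@S1"},
    "4: $2E(COST)":                 {"PARTS2@S2"},
    "5: A2E(DATE)":                 {"4: $2E(COST)"},
    "6: Aggregate(monthly, -DEPT)": {"5: A2E(DATE)"},
    "7: Union":                     {"3: NotNull(COST)", "6: Aggregate(monthly, -DEPT)"},
    "8: sigma(COST>0)":             {"7: Union"},
    "DW.PARTS":                     {"8: sigma(COST>0)"},
}

# Any topological order of this graph is a valid execution order for the
# (unoptimized) flow; the optimizer's job is to choose among equivalent graphs.
print(list(TopologicalSorter(flow).static_order()))
```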
Context 3
... Fig. 2. A workflow equivalent to the one of Fig. 1. ...

Citations

... Moreover, understanding the steps that make up such data preparation opens the door to its optimization, an important step given that data flows in Big Data settings typically work with large quantities of data and require complex transformations over them. In addition to classical flow optimization (Simitsis et al. 2005), given the dynamicity of different data flows and the fact that their parts are typically shared by different end users (with high temporal locality: 80% of data is reused within minutes or hours (Chen et al. 2012)), more advanced techniques like multi-flow optimization or view materialization may also be required. ...
Article
Full-text available
Obtaining valuable insights and actionable knowledge from data requires cross-analysis of domain data typically coming from various sources. Doing so inevitably imposes burdensome processes of unifying different data formats and discovering integration paths, all this given the specific analytical needs of a data analyst. Along with large volumes of data, the variety of formats, data models, and semantics drastically contributes to the complexity of such processes. Although there have been many attempts to automate various processes along the Big Data pipeline, no unified platforms accessible by users without technical skills (like statisticians or business analysts) have been proposed. In this paper, we present a Big Data integration platform (Quarry) that uses hypergraph-based metadata to facilitate (and largely automate) the integration of domain data coming from a variety of sources, and provides an intuitive interface to assist end users both in: (1) data exploration with the goal of discovering potentially relevant analysis facets, and (2) consolidation and deployment of data flows which integrate the data and prepare them for further analysis (descriptive or predictive), visualization, and/or publishing. We validate Quarry’s functionalities with the use case of World Health Organization (WHO) epidemiologists and data analysts in their fight against Neglected Tropical Diseases (NTDs).
... The extraction of data from different types of sources and their careful processing are associated with a number of problems [5]. The source data reside in sources of a wide variety of types and formats, created in various systems. ...
Article
Full-text available
The paper discusses the experience of introducing a developed automated system for integrating data directories from various, often unrelated and heterogeneous, information sources at the aircraft manufacturing enterprise Aviastar-SP JSC (Ulyanovsk, Russia). This automated system allows the integration, in both manual and automatic modes, of the directory databases used at the enterprise. To test its operability on real data, the developed system went through trial operation and produced good results.
... Cost models, apart from being useful in their own right, are encapsulated in cost-based optimizers; currently, for example, cost-based optimization solutions for task ordering in data flows employ simple cost models that may not capture the flow execution running time accurately, as shown in this work. For example, the sum cost metric, which is employed by many state-of-the-art task ordering techniques [17,20,30], merely sums the cost of the individual tasks. This results in an execution cost computation that may deviate from the real execution time, and the corresponding optimizations may not be reflected in response time. ...
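To make the distinction concrete, the following toy sketch contrasts a sum-style cost with a bottleneck-style bound that is closer to what a pipelined, parallel execution would exhibit; the task names and costs are made up for illustration and do not come from the cited works.

```python
# Toy contrast between a sum-style cost metric and a bottleneck-style bound.
# Task names and costs are made up for illustration only.
task_cost = {"extract": 40.0, "convert": 25.0, "aggregate": 30.0, "load": 45.0}

# SCM-style metric: simply sum the per-task costs.
scm = sum(task_cost.values())

# On a pipelined/parallel engine the flow is often bound by its slowest stage,
# so the observed response time can be far below the plain sum.
bottleneck_bound = max(task_cost.values())

print(f"sum cost metric  : {scm:.1f}")              # 140.0
print(f"bottleneck bound : {bottleneck_bound:.1f}") # 45.0
```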
... In [10], an optimization algorithm for query plans with dependency constraints between algebraic operators is presented. The techniques in [17] build on top of [14,21] and are shown to be superior to approaches such as [5,12,15,22,26,30,36] when SCM-F is targeted. In [18], an exhaustive optimization proposal for data flows is presented that aims to produce all the topological sortings of the tasks in such a way that each sorting is produced from the previous one with a minimal amount of changes. ...
Article
Full-text available
Although modern data flows are executed in parallel and distributed environments, e.g., on a multi-core machine or on the cloud, current cost models, e.g., those considered by state-of-the-art data flow optimization techniques, do not accurately reflect the response time of real data flow execution in these execution environments. This is mainly due to the fact that the impact of parallelism, and more specifically, the impact of concurrent task execution on the running time, is not adequately modeled in current cost models. The contribution of this work is twofold. Firstly, we propose an advanced cost model that aims to reflect the response time of a data flow that is executed in parallel more accurately. Secondly, we show that existing optimization solutions are inadequate and develop new optimization techniques targeting the proposed cost model. We focus on the single multi-core machine environment provided by modern business intelligence tools, such as Pentaho Kettle, but our approach can be extended to massively parallel and distributed settings. The distinctive feature of our proposal is that we model both time overlaps and the impact of concurrency on task running times in a combined manner; the latter is appropriately quantified and its significance is exemplified. Furthermore, we propose extensions to current optimizers that decide on the exact ordering of flow tasks, taking into account the new optimization metric. Finally, we evaluate the new optimization algorithms and show up to 59% response time improvement over state-of-the-art task ordering techniques.
... In an ETL tool, unknown functions are possible in the form of self-written functions. Such a tool cannot optimize the process holistically because of these black-box functionalities [16]. ...
Chapter
Full-text available
Extract-Transform-Load (ETL) describes the process of loading data from a source to a destination. The source and the destination can be physically separated, and transformations may take place in between. Data preparation happens regularly. To minimize interference with other business processes and to guarantee high data availability, these processes are often run overnight. Therefore, the demand for shorter processing times of ETL processes is increasing steadily. Besides data availability and freshness, another reason is the transition to real-time or near-real-time analysis of data and the growing data volume. There are several approaches to the optimization of ETL processes, which are highlighted in detail in this article. A closer look is taken at the advantages and disadvantages of the presented approaches. In conclusion, the approaches are compared against each other and a recommendation depending on the use case is given.
... There has been a lot of research over the past decade on the optimization of ETL workflows due to a critical requirement on execution time. The research by Simitsis et al. (2005), Tziovara et al. (2007), Kumar and Kumar (2010), Karagiannis et al. (2013), and Vassiliadis et al. (2009) highlights the problem of ill-timed completion of an ETL workflow and discusses methods to improve its execution performance. ...
... The work presented by Simitsis et al. (2005) proposes to optimize the ETL workflow by re-ordering the ETL activities in a directed acyclic graph (DAG), pushing the highly selective activities toward the beginning of the flow. Simitsis et al. (2010) use the same approach as previously (Simitsis et al., 2005), but it is now more focused on generating an optimal ETL workflow in terms of performance, fault-tolerance, and freshness. Similarly to the aforementioned approaches in terms of ETL workflow design, Tziovara et al. (2007) propose to change the order of input tuples to improve the execution cost of an ETL activity, which results in an optimal overall cost of the execution flow. ...
Article
Full-text available
Today’s ETL tools provide capabilities to develop custom code as user-defined functions (UDFs) to extend the expressiveness of the standard ETL operators. However, while this allows us to easily add new functionalities, it also comes with the risk that the custom code is not amenable to optimization, e.g., by parallelism, and for this reason performs poorly for data-intensive ETL workflows. In this paper, we present a novel framework which allows the ETL developer to choose a design pattern in order to write parallelizable code, and which generates a configuration for the UDFs to be executed in a distributed environment. This enables ETL developers with minimal expertise in distributed and parallel computing to develop UDFs without having to deal with parallelization configurations and complexities. We perform experiments on large-scale datasets based on TPC-DS and BigBench. The results show that our approach significantly reduces the effort of ETL developers and at the same time generates efficient parallel configurations to support complex and data-intensive ETL tasks.
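As a rough illustration of the kind of UDF such design patterns target, the sketch below shows a stateless, row-wise function that can safely be partitioned across a worker pool; the UDF, data, conversion rate, and pool size are hypothetical and do not come from the paper's framework.

```python
# Hypothetical example of a "parallelizable by construction" row-wise UDF:
# the function is stateless, so its input can be partitioned and processed by
# a worker pool. The UDF, data, rate, and pool size are illustrative only.
from multiprocessing import Pool

def normalize_cost(row: dict) -> dict:
    """Stateless per-row UDF: convert COST from Dollars to Euros (fixed, assumed rate)."""
    out = dict(row)
    out["COST"] = round(row["COST"] * 0.9, 2)
    return out

if __name__ == "__main__":
    rows = [{"PKEY": i, "COST": float(i)} for i in range(1, 1001)]
    with Pool(processes=4) as pool:                       # degree of parallelism
        transformed = pool.map(normalize_cost, rows, chunksize=100)
    print(transformed[:3])
```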
... These different source systems can originate from the organization itself or from external organizations [7]. Data to be taken over can be stored in flat files, Excel files, relational databases, non-relational databases, or data repositories from different tools and systems like a Customer Relationship Management (CRM) tool, a Project Management tool, or an Enterprise Resource Planning (ERP) system [8] [9]. ...
Conference Paper
Full-text available
Contemporary organizations tend to improve their decision-making processes in order to improve the organization's business as well. Improving the decision-making process is often associated with the establishment of a Data Warehouse (DW) system, which in many cases proves to be a very good solution. The central part of the establishment of a DW system is the Extract-Transform-Load (ETL) process, which involves extracting data from different homogeneous or heterogeneous data sources, transforming them, and finally placing the transformed data in a common location. In addition to its application in DW systems, the ETL process is widely applied in data integration, migration, staging, and master data management efforts. The organization can choose from a variety of ETL tools, and the choice of an appropriate ETL tool is extremely important for each organization. In the paper, we analyze different approaches to ETL tool selection. We select criteria proposed in the reviewed research and illustrate how they can be applied in a case study comparing two ETL tools: Talend Open Studio for Data Integration and SQL Server Integration Services.
... -Task schemata, which refer to the definition of the schema of the data input and/or output of each task. Note that dependencies may be produced by task schemata through simple processing [87], especially if they contain information about which schema elements are bound or free [58]. However, task schemata may serve purposes other than deriving dependencies, e.g., to check whether a task contributes to the final desired output of the flow. ...
... Another exhaustive technique is to define the problem as a state-space search one [87]. In such a space, each possible task ordering is modeled as a distinct state and all states are eventually visited. ...
... Similar to the optimization proposals described previously, this technique is not scalable either. Another form of task re-ordering is when a single input/output task is moved before or after a multi-input or a multi-output task [87,92]. An example case is when two copies of a proliferate single input/output task are originally placed on the two inputs of a binary fork operation and, after re-ordering, are moved after the fork. ...
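The following toy sketch illustrates that kind of move, using a bag union as the binary operation and a NotNull filter as the unary task; the operation, data, and functions are illustrative stand-ins, not the exact example from the survey.

```python
# Toy illustration of moving a unary task across a binary operation: two copies
# of a NotNull filter on the inputs of a union are equivalent to a single copy
# placed after the union. Data and operations are made up for the sketch.
def not_null_cost(rows):
    return [r for r in rows if r["COST"] is not None]

def union(left, right):
    return left + right   # order-preserving bag union, for the sketch

s1 = [{"PKEY": 1, "COST": 10.0}, {"PKEY": 2, "COST": None}]
s2 = [{"PKEY": 3, "COST": 5.0},  {"PKEY": 4, "COST": None}]

before = union(not_null_cost(s1), not_null_cost(s2))  # one copy per input branch
after  = not_null_cost(union(s1, s2))                 # single copy after the union

assert before == after
print(after)   # [{'PKEY': 1, 'COST': 10.0}, {'PKEY': 3, 'COST': 5.0}]
```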
Article
Full-text available
Workflow technology is rapidly evolving and, rather than being limited to modeling the control flow in business processes, is becoming a key mechanism to perform advanced data management, such as big data analytics. This survey focuses on data-centric workflows (or workflows for data analytics or data flows), where a key aspect is data passing through and getting manipulated by a sequence of steps. The large volume and variety of data, the complexity of operations performed, and the long time such workflows take to compute give rise to the need for optimization. In general, data-centric workflow optimization is a technology in evolution. This survey focuses on techniques applicable to workflows comprising arbitrary types of data manipulation steps and semantic inter-dependencies between such steps. Further, it serves a twofold purpose. Firstly, to present the main dimensions of the relevant optimization problems and the types of optimizations that occur before flow execution. Secondly, to provide a concise overview of the existing approaches with a view to highlighting key observations and areas deserving more attention from the community.
... Furthermore, the community has been focusing on techniques for optimizing the execution of an ETL workflow [9]. The most common techniques are based on rearranging tasks and moving more selective tasks toward the beginning of a workflow, e.g., [10][11][12]. On top of that, existing research shows that applying processing parallelism at the data level, at the activity level, or both, is a known approach to attain better execution of an ETL workflow. ...
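A common way to realize the "more selective tasks first" idea for freely commutable unary tasks is to order them by a rank that combines selectivity and per-tuple cost; the sketch below uses the classic selectivity/cost ratio with made-up numbers, not a formula taken from the cited works.

```python
# Illustrative rank ordering of freely commutable unary tasks: cheap and
# selective tasks move toward the start of the flow. Numbers are made up.
tasks = [
    {"name": "surrogate-key lookup", "selectivity": 1.00, "cost_per_tuple": 5.0},
    {"name": "COST > 0 filter",      "selectivity": 0.20, "cost_per_tuple": 1.0},
    {"name": "NotNull(COST)",        "selectivity": 0.95, "cost_per_tuple": 0.5},
]

def rank(task):
    # Classic rank: (selectivity - 1) / cost; the most negative rank
    # (i.e., the cheapest, most selective task) is scheduled first.
    return (task["selectivity"] - 1.0) / task["cost_per_tuple"]

for t in sorted(tasks, key=rank):
    print(f"{rank(t):6.2f}  {t['name']}")
```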
... (d) Represent ETL constraints with separate activities in a logical model and determine their execution order. (e) Generate the schemata involved in a logical model using the algorithm proposed in [12] in order to ensure that the semantics of the involved concepts do not change even after changing the execution order of tasks in an ETL workflow. ...
... The work of [31] extends that of [12] w.r.t. generating an optimal ETL workflow in terms of performance, fault tolerance, and freshness, as described in [23]. ...
Article
Full-text available
In this paper, we discuss the state of the art and current trends in designing and optimizing ETL workflows. We explain the existing techniques for: (1) constructing a conceptual and a logical model of an ETL workflow, (2) its corresponding physical implementation, and (3) its optimization, illustrated by examples. The discussed techniques are analyzed w.r.t. their advantages, disadvantages, and challenges in the context of metrics such as autonomous behavior, support for quality metrics, and support for ETL activities as user-defined functions. We draw conclusions on still open research and technological issues in the field of ETL. Finally, we propose a theoretical ETL framework for ETL optimization.
... In a right-time DW architecture [12] there are two components whose performance is crucial to assuring real-time or near-real-time processing of data: optimized ETL software and refreshing software. Logical optimization, focusing on restructuring ETL processes in order to minimize the cardinality of data flows, has been proposed by Simitsis et al. [13,14]. In particular, the second paper proposes a heuristic for searching the space of possible ETL graphs to find the most efficient execution. ...
Article
In traditional OLAP systems, the ETL process loads all available data in the data warehouse before users start querying them. In some cases, this may be either inconvenient (because data are supplied from a provider for a fee) or unfeasible (because of their size); on the other hand, directly launching each analysis query on source data would not enable data reuse, leading to poor performance and high costs. The alternative investigated in this paper is that of fetching and storing data on-demand, i.e., as they are needed during the analysis process. In this direction we propose the Query-Extract-Transform-Load (QETL) paradigm to feed a multidimensional cube; the idea is to fetch facts from the source data provider, load them into the cube only when they are needed to answer some OLAP query, and drop them when some free space is needed to load other facts. Remarkably, QETL includes an optimization step to cheaply extract the required data based on the specific features of the data provider. The experimental tests, made on a real case study in the genomics area, show that QETL effectively reuses data to cut extraction costs, thus leading to significant performance improvements.
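Reduced to its bare bones, the on-demand idea described above can be pictured as a fetch-on-miss store with eviction; the sketch below uses an LRU policy, a fixed capacity, and a dummy provider purely as assumptions and is not the paper's actual QETL algorithm.

```python
# Rough sketch of on-demand loading: fact partitions are fetched from the
# provider only when a query needs them, and dropped (LRU here, as an
# assumption) when space is needed for others. Not the paper's QETL algorithm.
from collections import OrderedDict

class OnDemandFactStore:
    def __init__(self, fetch_fn, capacity=3):
        self.fetch_fn = fetch_fn          # pulls a fact partition from the source
        self.capacity = capacity          # max partitions kept locally
        self.store = OrderedDict()

    def get(self, partition_key):
        if partition_key in self.store:               # already loaded: reuse it
            self.store.move_to_end(partition_key)
            return self.store[partition_key]
        facts = self.fetch_fn(partition_key)          # extract + transform on demand
        if len(self.store) >= self.capacity:          # make room: drop least recently used
            self.store.popitem(last=False)
        self.store[partition_key] = facts
        return facts

# Usage with a dummy provider:
store = OnDemandFactStore(lambda k: [f"fact rows for {k}"], capacity=2)
for q in ["2021-01", "2021-02", "2021-01", "2021-03"]:
    store.get(q)
print(list(store.store))   # ['2021-01', '2021-03']
```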
... 1. Sum Cost Metric of the Full plan (SCM-F): minimize the sum of the task and communication costs of a data flow [7,13,20,10,15,8,16]. The first three metrics and the associated cost models can capture the response time under specific assumptions only. ...
Conference Paper
Although modern data flows are executed in parallel and distributed environments, e.g., on a multi-core machine or on the cloud, current cost models, e.g., those considered by state-of-the-art data flow optimization techniques, do not accurately reflect the response time of real data flow execution in these execution environments. This is mainly due to the fact that the impact of parallelism, and more specifically, the impact of concurrent task execution on the running time, is not adequately modeled. In this work, we propose a cost modeling solution that aims to accurately reflect the response time of a data flow that is executed in parallel. We focus on the single multi-core machine environment provided by modern business intelligence tools, such as Pentaho Kettle, but our approach can be extended to massively parallel and distributed settings. The distinctive feature of our proposal is that we model both time overlaps and the impact of concurrency on task running times in a combined manner; the latter is appropriately quantified and its significance is exemplified.