Fig 4: Level 2 of a data mapping diagram

Source publication
Conference Paper
Full-text available
In Data Warehouse (DW) scenarios, ETL (Extraction, Transformation, Loading) processes are responsible for the extraction of data from heterogeneous operational data sources, their transformation (conversion, cleaning, normalization, etc.) and their loading into the DW. In this paper, we present a framework for the design of the DW back-stage (a...

Contexts in source publication

Context 1
... will detail the stereotypes of the table level in the next section and defer the discussion of the stereotypes of the attribute level to subsection 4.2. During the integration process from data sources into the DW, source data may undergo a series of transformations, which may vary from simple algebraic operations or aggregations to complex procedures. In our approach, the designer can segment a long and complex transformation process into simple and small parts represented by means of UML packages that materialize the Mapping stereotype and contain an attribute/class diagram. Moreover, Mapping packages are linked by Input and Output dependencies that represent the flow of data. During this process, the designer can create intermediate classes, represented by the Intermediate stereotype, in order to simplify or clarify the models. These classes represent intermediate storage that may or may not actually exist, but they help to understand the mappings. In Fig. 4, a schematic representation of a data mapping diagram at the table level is shown. This level specifies the data sources and the targets to which these data are directed. At this level, the classes are represented as usual in UML, with the attributes depicted inside the container class. Since all the classes are imported from other packages, the legend (from ...) appears below the name of each class. The mapping diagram is shown as a package decorated with the Mapping stereotype and hides the complexity of the mapping, since a vast number of attributes can be involved in a data mapping. This package presents two kinds of stereotyped dependencies: Input to the data providers (i.e., the data sources) and Output to the data consumers (i.e., the tables of the DW). As already mentioned, at the attribute level, the diagram includes the relationships between the attributes of the classes involved in a data mapping. At this level, we offer two design ...
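To make this structure concrete, the following minimal Python sketch models the table-level elements just described: a Mapping package with Input dependencies to data providers, Output dependencies to data consumers, and optional Intermediate classes. All names and attributes are illustrative, not part of the cited notation.

```python
from dataclasses import dataclass, field

@dataclass
class TableClass:
    name: str
    from_package: str  # rendered as the "(from ...)" legend in the diagram
    attributes: list[str] = field(default_factory=list)

@dataclass
class MappingPackage:
    name: str
    inputs: list[TableClass] = field(default_factory=list)         # <<Input>> dependencies
    outputs: list[TableClass] = field(default_factory=list)        # <<Output>> dependencies
    intermediates: list[TableClass] = field(default_factory=list)  # <<Intermediate>> classes

# Toy example in the spirit of Fig. 4: two sources feeding one dimension
# table (the attribute names are invented).
ds1 = TableClass("DS1", "Sources", ["id", "amount"])
ds2 = TableClass("DS2", "Sources", ["id", "region"])
dim1 = TableClass("Dim1", "DW", ["id", "amount", "region"])
mapping = MappingPackage("Mapping diagram", inputs=[ds1, ds2], outputs=[dim1])
```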
Context 2
... content of the package from Fig. 4 is defined in the following way (recall that it is a package that contains an attribute/class diagram): The classes DS1, DS2, ..., and Dim1 are imported in Mapping diagram. The attributes of these classes are suppressed because they are shown as Attribute classes in this ...
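At the attribute level, the same idea reduces to explicit attribute-to-attribute links. A hedged sketch, with invented qualified names and transformation labels rather than the paper's notation:

```python
from dataclasses import dataclass

@dataclass
class AttributeMapping:
    sources: list[str]   # qualified source attributes, e.g. "DS1.amount"
    target: str          # qualified target attribute, e.g. "Dim1.amount"
    transformation: str  # conversion applied along the link

# Illustrative mappings only; the real diagram expresses these as
# relationships between Attribute classes.
mappings = [
    AttributeMapping(["DS1.id"], "Dim1.id", "identity"),
    AttributeMapping(["DS1.amount"], "Dim1.amount", "currency conversion"),
]
```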

Citations

... In this section, we propose a new categorization of the studied research works. As shown in Figure 1, we categorize these chronologically sorted research works into six main classes, according to the modeling formalism on which they are based: (i) models based on UML [22]; (ii) models based on ontology [23]; (iii) models based on MDA [24] and model-driven development (MDD) [25]; (iv) models based on graphical flow formalisms, including BPMN [26], the CPN modeling language [27], YAWL (Yet Another Workflow Language) [28], and data flow visualization [29]; (v) models based on ad hoc formalisms, including conceptual constructs [30], CommonCube [31], and EMD [32]; and, finally, (vi) contributions dealing with Big Data. ...
... ETL process modeling proposals based on the UML standard modeling language were among the first attempts in this area of research [6,22,[33][34][35]. UML is the most popular modeling language, owing to its wide range of uses in software system development and the succession of improvements it has undergone; nevertheless, it has both advantages and drawbacks, and all modeling work based on it inherits those strengths and limitations. ...
... In [22], conceptual modeling of the DW backstage at a very low level of granularity (attributes) was proposed. For this purpose, and since UML represents relationships between classes but not between attributes, the authors took advantage of an extension mechanism that enables UML to treat attributes as first-class modeling elements (FCME, also known as first-class citizens) of the model. ...
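The gist of that extension can be illustrated with a small sketch: each attribute is promoted to a modeling element of its own, so that a mapping (a relationship) can reference attributes directly rather than whole classes. The Python names below are invented for illustration and are not the stereotype names used in [22].

```python
from dataclasses import dataclass, field

@dataclass
class AttributeElement:  # plays the role of an attribute promoted to FCME
    owner: str           # container class, e.g. "DS1"
    name: str            # attribute name, e.g. "amount"

@dataclass
class ContainerClass:
    name: str
    attributes: list[AttributeElement] = field(default_factory=list)

    def attribute(self, name: str) -> AttributeElement:
        attr = AttributeElement(self.name, name)
        self.attributes.append(attr)
        return attr

ds1 = ContainerClass("DS1")
dim1 = ContainerClass("Dim1")
# Relationships can now target attributes directly:
mapping_edge = (ds1.attribute("amount"), dim1.attribute("total_amount"))
```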
Article
Full-text available
The extract, transform, and load (ETL) process is at the core of data warehousing architectures. As such, the success of data warehouse (DW) projects is essentially based on the proper modeling of the ETL process. As there is no standard model for the representation and design of this process, several researchers have made efforts to propose modeling methods based on different formalisms, such as unified modeling language (UML), ontology, model-driven architecture (MDA), model-driven development (MDD), and graphical flow, which includes business process model notation (BPMN), colored Petri nets (CPN), Yet Another Workflow Language (YAWL), CommonCube, entity modeling diagram (EMD), and so on. With the emergence of Big Data, despite the multitude of relevant approaches proposed for modeling the ETL process in classical environments, part of the community has been motivated to provide new data warehousing methods that support Big Data specifications. In this paper, we present a summary of relevant works related to the modeling of data warehousing approaches, from classical ETL processes to ELT design approaches. A systematic literature review is conducted and a detailed set of comparison criteria are defined in order to allow the reader to better understand the evolution of these processes. Our study paints a complete picture of ETL modeling approaches, from their advent to the era of Big Data, while comparing their main characteristics. This study allows for the identification of the main challenges and issues related to the design of Big Data warehousing systems, mainly involving the lack of a generic design model for data collection, storage, processing, querying, and analysis.
... To address those activities, the literature shows several approaches describing the design and modeling of ETL processes. In general, these proposals can be classified into those based on UML [3,4,5,6], on BPMN [7,8], or on semantic web technologies, such as ontologies or logical models [9,10,11]. With respect to the proposals based on UML, the approaches generally extend the language with stereotypes representing different functions of the ETL process. ...
Article
Full-text available
Data analysis is a widely researched field, where innumerable applications allow to discover domain particularities that are specially useful. In this paper, we introduce the data analysis process that we applied to two different systems storing information about statements and testimonies of crimes against Humanity. We describe the activities, design decisions and lessons learned from implementing a specific goal, which involves transforming text data into georeferenced information.
... The authors use the declarative database language LDL to define the semantics of ETL activities. Other researchers have used approaches such as the Unified Modeling Language (UML) [7], [8] to model ETL activities. The above researchers describe the design of their ETL workflows, but they do not address the semantics of the data. ...
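As a rough illustration of what a declaratively defined ETL activity looks like, here is a logic-rule-shaped selection step written in Python. The cited work expresses this in the logic language LDL; the Python below is only an analogy, with invented names and data.

```python
def selection_activity(rows, predicate):
    """Analogue of a rule OUT(x) <- IN(x), p(x): keep tuples satisfying p."""
    return [row for row in rows if predicate(row)]

incoming = [{"id": 1, "amount": 50}, {"id": 2, "amount": -3}]
clean = selection_activity(incoming, lambda r: r["amount"] >= 0)
print(clean)  # only the non-negative tuple survives
```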
Conference Paper
Big Data has become the new inclusive term describing enormous collections of datasets that are difficult to handle with traditional database and software technologies. Big Data research deals with a variety of data in various formats, specifically semi-structured and unstructured data. In the context of Big Data processing, current Extract-Transform-Load (ETL) frameworks are not suitable for this "real-world scenario" because they do not consider semantic issues in the integration process. This paper develops a programmable, semi-automatic Semantic ETL (SETL) framework using the Apache Jena tool, which uses semantic technologies to integrate and publish structured data from multiple data sources.
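SETL itself is built on Apache Jena, a Java framework. As a hedged sketch of the underlying idea only (lifting structured source records into RDF so that integration can operate on shared semantics), here is a Python version using rdflib with a made-up vocabulary:

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/etl/")  # invented vocabulary
g = Graph()

# Structured records from some source; schema is illustrative.
records = [{"id": "c1", "name": "Alice"}, {"id": "c2", "name": "Bob"}]
for rec in records:
    subject = EX[rec["id"]]
    g.add((subject, RDF.type, EX.Customer))
    g.add((subject, EX.name, Literal(rec["name"])))

# Publish the integrated graph in a standard serialization.
print(g.serialize(format="turtle"))
```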
... In data warehouse (DW) scenarios, ETL (extraction, transformation, loading) processes are responsible for the extraction of data from heterogeneous operational data sources, their transformation (conversion, cleaning, normalization, etc.) and their loading into the DW [5]. A data warehouse is used as a central repository of data for medium and large business organizations [6]. ...
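That definition maps directly onto three functions. A minimal, self-contained sketch, with an invented schema and data rather than any specific paper's design:

```python
import sqlite3

def extract():
    # extraction from a (here: hard-coded) heterogeneous source
    return [(" Alice ", "120"), ("Bob", "95")]

def transform(rows):
    # conversion and cleaning: trim names, cast amounts to integers
    return [(name.strip(), int(amount)) for name, amount in rows]

def load(rows):
    # loading into the warehouse, modeled as an in-memory SQLite table
    dw = sqlite3.connect(":memory:")
    dw.execute("CREATE TABLE fact_sales (customer TEXT, amount INTEGER)")
    dw.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    return dw

warehouse = load(transform(extract()))
```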
Article
Full-text available
The Constitutional Court of The Republic of Indonesia is an institution which has the authority to judge and examine constitutions. In this institution, each component is responsible for managing its own finance and budgeting. They therefore need a solution to summarize the amount and absorption of the budget in every component. This paper proposes a solution: designing a data warehouse model to process and analyse the down payment administration data. The nine-step design methodology for developing data warehouses is implemented in this work. This paper aims to bring some benefit to The Constitutional Court, especially in the form of information about down payment absorption for every component each year. The result is presented as a PDF report and an informative dashboard using data warehouse tools.
... 4.4 SCS-DWCS mapping process [Luján-Mora 2004]. 4.5 Level 2 of the data mapping diagram (DM) modeling: mapping between tables [Luján-Mora 2004]. 4.6 Level 3 of the data mapping diagram (DM) modeling: mapping between attributes [Luján-Mora 2004]. 4.7 "Mechanism" modeling element as a classifier (a) and as graphical notation (b) [Trujillo 2003]. ...
... The top-down modeling approach in BPMN is interesting. The Subprocess workflow object in its collapsed form (Collapsed Subprocess) resembles the package idea of the approach in [Luján-Mora 2004], since an initial BPMN model presents a general architecture of the process without details of the content of the subprocesses. Each subprocess is then modeled separately in its expanded form (Expanded Subprocess). ...
Thesis
This thesis deals with the processing of "Big Data" in decision-support systems. Indeed, our research has been motivated by the observation that the well-known Extracting-Transforming-Loading (ETL) process, which is responsible for integrating data from various sources for analysis and decision-making purposes, had to be revisited in the light of the recent emergence of "Big Data" in order to deal with its complexity, characterized by the so-called "4Vs": "Volume", "Variety", "Velocity", and "Veracity". Among all the complexity dimensions, our work focuses on the issue of "Volume" in Big Data by means of adapting the ETL to parallel and distributed environments. With respect to ETL modeling, a number of interesting approaches have been suggested in the research literature. Unfortunately, these have proven to be unsuitable for Big Data environments. Furthermore, the existing parallel/distributed processing approaches have been defined at the implementation stage only. In an effort to fill this gap, we have exploited and hybridized conceptual modeling and parallel/distributed methods. In particular, we have adopted Vassiliadis' notation, which we have extended with MapReduce paradigm concepts in order to model classical and parallel/distributed entities. As for the parallelization/distribution of ETL processes on clusters, most existing solutions have been defined only at a "process" level, which tends to be coarse-grained in nature. For the purposes of defining fine-grained and various topologies of distribution, our approach allows the parallelization/distribution of the ETL at three levels: the "functionality"/"elementary functions" levels as well as the "process" level. In addition, the distribution can be processed in vertical and/or horizontal directions. Based on our new methods, we have developed a prototype referred to as Parallel-ETL (P-ETL for short). The P-ETL platform has been developed under the Apache Hadoop environment with Java 1.7 (Java SE Runtime Environment) and NetBeans IDE 7.4, under Ubuntu 12.10. Our performance analysis has revealed that employing 25 to 38 parallel tasks enables our novel approach to speed up the ETL process by up to 33%, with the improvement rate being linear.
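The actual P-ETL prototype runs on Apache Hadoop; the standalone Python sketch below only mirrors the partition/transform/merge shape of the process-level parallelization described above, with invented names and a trivial transformation:

```python
from multiprocessing import Pool

def transform_partition(rows):
    # "map" step: apply an elementary transformation to one partition
    return [{"id": r["id"], "amount": r["amount"] * 2} for r in rows]

def parallel_etl(rows, workers=4):
    chunk = max(1, len(rows) // workers)
    partitions = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with Pool(workers) as pool:
        parts = pool.map(transform_partition, partitions)
    return [row for part in parts for row in part]  # merge ("reduce") step

if __name__ == "__main__":
    data = [{"id": i, "amount": i} for i in range(100)]
    print(len(parallel_etl(data)))  # 100 rows, transformed in parallel
```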
... This book is widely accepted as the best guidance for data warehouse developers. Ali El-Sappagh et al. [4] divide the research in the field of modeling ETL processes into three groups: (1) modeling ETL using mapping expressions and guidelines [103,88,119,120]; (2) modeling using conceptual constructs [129,114,18]; (3) modeling based on UML [86,85]. Over time, data warehousing has become a mature and well-established technology. ...
Thesis
With the rapid development of information technology, the nature of evidence presented in court has shifted from paper documents to electronically stored information, also known as ESI. During the fact-finding stage of civil litigation, ESI in the form of structured and unstructured data is sought out and mined for relevant information that can be used as evidence. This process is known as Electronic Discovery, commonly referred to as eDiscovery. Since organizations are increasingly using databases to store critical business information, such as financial information, human resource data and customer data, structured data in databases is now requested more often by courts. Organizations increasingly face the enormous problem of how to find relevant information across large numbers of relational databases where data is highly structured and information is stored separately in different tables due to database normalization. To address this problem, we propose an approach to determine the relevance of databases with regard to given search terms, which are linked by Boolean operators. By applying this approach, the volume of data can be significantly reduced by filtering out irrelevant databases that have no probative value on any issue in the lawsuit. The methodology adopted for this research is design science, which has been widely applied to guide and evaluate information systems research. First, we describe the current business context of eDiscovery and the emergence of structured data in this field. A review of the existing literature and current approaches is followed by an assessment of each approach from both theoretical and practical perspectives. This includes literature found relating to eDiscovery, keyword search methods, data warehouses and NoSQL databases. Our assessment finds that none of these approaches is best suited for our problem, although these methods do contain knowledge that could be helpful for developing a new approach. We suggest denormalizing databases before conducting a search, to put information together in a single place and enable queries containing Boolean operators. The denormalized data should then be exported as a file in Avro format and imported into the Hadoop Distributed File System (HDFS) using Sqoop. Hadoop MapReduce should then be adopted for distributed and scalable searching on HDFS. We conducted some experiments to evaluate our approach using both public and production data provided by eDiscovery experts at Volkswagen Group. The results demonstrate that our approach is feasible compared with an Elasticsearch-based search carried out by an external consulting firm. Finally, this thesis is concluded by outlining limitations and open issues.
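The core relevance test described above, a Boolean combination of search terms evaluated over denormalized rows, can be sketched as follows. The thesis implements this at scale with Sqoop, Avro and Hadoop MapReduce; this toy version shows only the predicate logic, with an invented query structure:

```python
def row_matches(row_text: str, query) -> bool:
    """query is either a term or a nested tuple ("AND"|"OR", q1, q2, ...)."""
    if isinstance(query, str):
        return query.lower() in row_text.lower()
    op, *operands = query
    hits = (row_matches(row_text, q) for q in operands)
    return all(hits) if op == "AND" else any(hits)

rows = ["Invoice 4711 paid by ACME", "HR record for J. Doe"]
query = ("AND", "invoice", ("OR", "acme", "globex"))
# A database is deemed relevant if any of its denormalized rows matches.
relevant = any(row_matches(r, query) for r in rows)
print(relevant)  # True
```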
... The modeling of ETL processes using existing general-purpose modeling languages (such as the Unified Modeling Language (UML) or the Business Process Model and Notation (BPMN)), which have been extended to incorporate the concepts specific to the ETL process domain, has been proposed in (Trujillo and Luján-Mora 2003; Luján-Mora et al. 2004; Muñoz et al. 2008, 2009; El Akkaoui and Zimányi 2009; El Akkaoui et al. 2011). At the same time, the use of DSLs which are tailored to a particular domain has also been proposed in (Vassiliadis et al. 2002a, b, 2003; Simitsis and Vassiliadis 2003, 2008; Simitsis et al. 2005; Simitsis 2005). ...
... As a final point, it should be noted that some of the approaches do not provide explicit concepts which allow for the formal definition of the semantics of the data transformations. For example, in (Vassiliadis et al. 2002a, b; Simitsis and Vassiliadis 2003; Luján-Mora and Trujillo 2004) notes or annotations are used for the explanation of the semantics of the transformations (e.g., type, expression, conditions, constraints, etc.), while in (Trujillo and Luján-Mora 2003) even the actual attribute mappings are defined through notes. Since these approaches allow the notes to be given in a natural language (and often without any restrictions on their content), they do not represent a formal specification. ...
Article
Full-text available
The development of Extract–Transform–Load (ETL) processes is the most complex, time-consuming and expensive phase of data warehouse development. Yet, the dynamics of modern business systems demand a more agile and flexible approach to their development. As a result, current research in this area is focused on ETL process conceptualization and the automation of ETL process development. This paper proposes a novel solution for automating ETL processes using the domain-specific modeling (DSM) approach. The proposed solution is based on the formal specification of ETL processes and the implementation of such formal specifications. Thus, in accordance with the DSM approach, several new domain-specific languages (DSLs) are introduced, each defining concepts relevant for a specific aspect of an ETL process. The focus of this paper is the actual implementation of the formal specification of an ETL process. To this end, a specific ETL platform (ETL-PL) is introduced to technologically support both the modeling of ETL processes (i.e., the creation of models in accordance with the introduced DSLs) and the automated transformation of the created models into the executable code of a specific application framework (representing ETL-PL’s execution environment). It should be emphasized that ETL-PL actually presumes the dynamic execution of ETL models or, more precisely, the executable code is generated at runtime. Thus the execution environment consists of code generator components and the components implementing the application framework. ETL-PL has been implemented as an extension of the .NET platform.
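ETL-PL generates executable code at runtime from models expressed in its DSLs, targeting the .NET platform. As a language-agnostic illustration of that idea only (the model as data, behavior produced from it at runtime), here is a toy Python interpreter for an invented two-operation ETL specification:

```python
SPEC = [  # the "model": a tiny, made-up declarative ETL description
    {"op": "filter", "column": "amount", "min": 0},
    {"op": "rename", "from": "amount", "to": "net_amount"},
]

def compile_step(step):
    # turn one model element into executable behavior at runtime
    if step["op"] == "filter":
        return lambda rows: [r for r in rows if r[step["column"]] >= step["min"]]
    if step["op"] == "rename":
        return lambda rows: [
            {step["to"] if k == step["from"] else k: v for k, v in r.items()}
            for r in rows
        ]
    raise ValueError(f"unknown op: {step['op']}")

def run(spec, rows):
    for step in spec:
        rows = compile_step(step)(rows)
    return rows

print(run(SPEC, [{"amount": 5}, {"amount": -1}]))  # [{'net_amount': 5}]
```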
... We produce a huge quantity of agricultural produce, and rice is the staple diet of this country; we produce an average annual quantity of 35.8 metric tons [2]. Rice is the world's 3rd most important cereal crop based on production volume [3] and is the prime agricultural produce of Bangladesh. In the fiscal year 2013-14, the government sanctioned 4.66 billion Taka as fertilizer subsidy; on average, more than 5% of a fiscal year's total budget is allocated to the agricultural sector [5], and a survey in 2010 showed that about 47% of the country's total employment is provided by agriculture [5]. ...
Article
Full-text available
In this paper, we have tried to identify the prominent factors of rice production in all three seasons of the year (Aus, Aman, and Boro) by applying K-Means clustering to climate and soil variables' data, warehoused using a fact constellation schema. For the clustering, the popular machine-learning tool Weka was used, whose visualization feature was principally useful for determining the patterns, dependencies, and relationships of rice yield with different climate and soil factors.
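The paper performed the clustering in Weka; a hedged scikit-learn equivalent, with invented sample values for the climate and soil variables, looks like this:

```python
import numpy as np
from sklearn.cluster import KMeans

# Columns are illustrative: temperature (°C), rainfall (mm), soil pH.
X = np.array([
    [25.1, 180.0, 6.5],
    [31.4, 320.0, 5.9],
    [22.8,  90.0, 7.1],
    [30.2, 300.0, 6.0],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignment per observation (e.g., per season)
```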