Fig 4: Level 2 of a data mapping diagram

Source publication
Conference Paper
Full-text available
In Data Warehouse (DW) scenarios, ETL (Extraction, Transformation, Loading) processes are responsible for the extraction of data from heterogeneous operational data sources, their transformation (conversion, cleaning, normalization, etc.) and their loading into the DW. In this paper, we present a framework for the design of the DW back-stage (a...

Contexts in source publication

Context 1
... will detail the stereotypes of the table level in the next section and defer the discussion of the stereotypes of the attribute level to subsection 4.2. During the integration process from data sources into the DW, source data may undergo a series of transformations, which may vary from simple algebraic operations or aggregations to complex procedures. In our approach, the designer can segment a long and complex transformation process into simple and small parts represented by means of UML packages that materialize the Mapping stereotype and contain an attribute/class diagram. Moreover, Mapping packages are linked by Input and Output dependencies that represent the flow of data. During this process, the designer can create intermediate classes, represented by the Intermediate stereotype, in order to simplify or clarify the models. These classes represent intermediate storage that may or may not actually exist, but they help to understand the mappings. In Fig. 4, a schematic representation of a data mapping diagram at the table level is shown. This level specifies the data sources and the targets to which these data are directed. At this level, the classes are represented as usual in UML, with the attributes depicted inside the container class. Since all the classes are imported from other packages, the legend (from ...) appears below the name of each class. The mapping diagram is shown as a package decorated with the Mapping stereotype and hides the complexity of the mapping, since a vast number of attributes can be involved in a data mapping. This package presents two kinds of stereotyped dependencies: Input to the data providers (i.e., the data sources) and Output to the data consumers (i.e., the tables of the DW). As already mentioned, at the attribute level, the diagram includes the relationships between the attributes of the classes involved in a data mapping. At this level, we offer two design ...
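To make this structure concrete, the following minimal Python sketch models the table-level elements just described: a Mapping package with Input dependencies to data providers, Output dependencies to data consumers, and optional Intermediate classes. All names and attributes are illustrative, not part of the cited notation.

```python
from dataclasses import dataclass, field

@dataclass
class TableClass:
    name: str
    from_package: str  # rendered as the "(from ...)" legend in the diagram
    attributes: list[str] = field(default_factory=list)

@dataclass
class MappingPackage:
    name: str
    inputs: list[TableClass] = field(default_factory=list)         # <<Input>> dependencies
    outputs: list[TableClass] = field(default_factory=list)        # <<Output>> dependencies
    intermediates: list[TableClass] = field(default_factory=list)  # <<Intermediate>> classes

# Toy example in the spirit of Fig. 4: two sources feeding one dimension
# table (the attribute names are invented).
ds1 = TableClass("DS1", "Sources", ["id", "amount"])
ds2 = TableClass("DS2", "Sources", ["id", "region"])
dim1 = TableClass("Dim1", "DW", ["id", "amount", "region"])
mapping = MappingPackage("Mapping diagram", inputs=[ds1, ds2], outputs=[dim1])
```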
Context 2
... content of the package from Fig. 4 is defined in the following way (recall that it is a package that contains an attribute/class diagram): The classes DS1, DS2, ..., and Dim1 are imported in Mapping diagram. The attributes of these classes are suppressed because they are shown as Attribute classes in this ...
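At the attribute level, the same idea reduces to explicit attribute-to-attribute links. A hedged sketch, with invented qualified names and transformation labels rather than the paper's notation:

```python
from dataclasses import dataclass

@dataclass
class AttributeMapping:
    sources: list[str]   # qualified source attributes, e.g. "DS1.amount"
    target: str          # qualified target attribute, e.g. "Dim1.amount"
    transformation: str  # conversion applied along the link

# Illustrative mappings only; the real diagram expresses these as
# relationships between Attribute classes.
mappings = [
    AttributeMapping(["DS1.id"], "Dim1.id", "identity"),
    AttributeMapping(["DS1.amount"], "Dim1.amount", "currency conversion"),
]
```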

Citations

... In this section, we propose a new categorization of the studied research works. As shown in Figure 1, we categorize these chronologically sorted research works into six main classes, according to the modeling formalism on which they are based: (i) models based on UML [22]; (ii) models based on ontology [23]; (iii) models based on MDA [24] and model-driven development (MDD) [25]; (iv) models based on graphical flow formalisms, including BPMN [26], the CPN modeling language [27], YAWL (Yet Another Workflow Language) [28], and data flow visualization [29]; (v) models based on ad hoc formalisms, including conceptual constructs [30], CommonCube [31], and EMD [32]; and, finally, (vi) contributions dealing with Big Data. ...
... ETL process modeling proposals based on the UML standard modeling language were among the first attempts in this area of research [6,22,[33][34][35]. UML is the most popular modeling language, owing to its wide range of uses in software system development and the succession of improvements it has undergone; nevertheless, it has both advantages and drawbacks, and all modeling work based on it inherits those strengths and limitations. ...
... In [22], conceptual modeling of the DW backstage at a very low level of granularity (attributes) was proposed. For this purpose, and since UML represents relationships between classes but not between attributes, the authors took advantage of an extension mechanism that enables UML to treat attributes as first-class modeling elements (FCME, also known as first-class citizens) of the model. ...
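The gist of that extension can be illustrated with a small sketch: each attribute is promoted to a modeling element of its own, so that a mapping (a relationship) can reference attributes directly rather than whole classes. The Python names below are invented for illustration and are not the stereotype names used in [22].

```python
from dataclasses import dataclass, field

@dataclass
class AttributeElement:  # plays the role of an attribute promoted to FCME
    owner: str           # container class, e.g. "DS1"
    name: str            # attribute name, e.g. "amount"

@dataclass
class ContainerClass:
    name: str
    attributes: list[AttributeElement] = field(default_factory=list)

    def attribute(self, name: str) -> AttributeElement:
        attr = AttributeElement(self.name, name)
        self.attributes.append(attr)
        return attr

ds1 = ContainerClass("DS1")
dim1 = ContainerClass("Dim1")
# Relationships can now target attributes directly:
mapping_edge = (ds1.attribute("amount"), dim1.attribute("total_amount"))
```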
Article
Full-text available
The extract, transform, and load (ETL) process is at the core of data warehousing architectures. As such, the success of data warehouse (DW) projects is essentially based on the proper modeling of the ETL process. As there is no standard model for the representation and design of this process, several researchers have made efforts to propose modeling methods based on different formalisms, such as unified modeling language (UML), ontology, model-driven architecture (MDA), model-driven development (MDD), and graphical flow, which includes business process model notation (BPMN), colored Petri nets (CPN), Yet Another Workflow Language (YAWL), CommonCube, entity modeling diagram (EMD), and so on. With the emergence of Big Data, despite the multitude of relevant approaches proposed for modeling the ETL process in classical environments, part of the community has been motivated to provide new data warehousing methods that support Big Data specifications. In this paper, we present a summary of relevant works related to the modeling of data warehousing approaches, from classical ETL processes to ELT design approaches. A systematic literature review is conducted and a detailed set of comparison criteria are defined in order to allow the reader to better understand the evolution of these processes. Our study paints a complete picture of ETL modeling approaches, from their advent to the era of Big Data, while comparing their main characteristics. This study allows for the identification of the main challenges and issues related to the design of Big Data warehousing systems, mainly involving the lack of a generic design model for data collection, storage, processing, querying, and analysis.
... To address those activities, the literature shows several approaches describing the design and modeling of ETL processes. In general, these proposals can be classified into those based on UML [3,4,5,6], on BPMN [7,8], or on semantic web technologies, such as ontologies or logical models [9,10,11]. With respect to the proposals based on UML, the approaches generally extend the language with stereotypes representing different functions of the ETL process. ...
Article
Full-text available
Data analysis is a widely researched field, where innumerable applications allow to discover domain particularities that are specially useful. In this paper, we introduce the data analysis process that we applied to two different systems storing information about statements and testimonies of crimes against Humanity. We describe the activities, design decisions and lessons learned from implementing a specific goal, which involves transforming text data into georeferenced information.
... The authors use the declarative database language LDL to define the semantics of ETL activities. Other researchers have used approaches such as the Unified Modeling Language (UML) [7], [8] to model ETL activities. The above researchers describe the design of their ETL workflows, but they do not address the semantics of the data. ...
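As a rough illustration of what a declaratively defined ETL activity looks like, here is a logic-rule-shaped selection step written in Python. The cited work expresses this in the logic language LDL; the Python below is only an analogy, with invented names and data.

```python
def selection_activity(rows, predicate):
    """Analogue of a rule OUT(x) <- IN(x), p(x): keep tuples satisfying p."""
    return [row for row in rows if predicate(row)]

incoming = [{"id": 1, "amount": 50}, {"id": 2, "amount": -3}]
clean = selection_activity(incoming, lambda r: r["amount"] >= 0)
print(clean)  # only the non-negative tuple survives
```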
Conference Paper
Big Data has become the new inclusive term describing enormous collections of datasets that are difficult to handle with traditional database and software technologies. Big Data research deals with a variety of data in various formats, specifically semi-structured and unstructured data. In the context of Big Data processing, current Extract-Transform-Load (ETL) frameworks are not suitable for this "real-world scenario" because they do not consider semantic issues in the integration process. This paper develops a programmable, semi-automatic Semantic ETL (SETL) framework using the Apache Jena tool, which uses semantic technologies to integrate and publish structured data from multiple data sources.
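SETL itself is built on Apache Jena, a Java framework. As a hedged sketch of the underlying idea only (lifting structured source records into RDF so that integration can operate on shared semantics), here is a Python version using rdflib with a made-up vocabulary:

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/etl/")  # invented vocabulary
g = Graph()

# Structured records from some source; schema is illustrative.
records = [{"id": "c1", "name": "Alice"}, {"id": "c2", "name": "Bob"}]
for rec in records:
    subject = EX[rec["id"]]
    g.add((subject, RDF.type, EX.Customer))
    g.add((subject, EX.name, Literal(rec["name"])))

# Publish the integrated graph in a standard serialization.
print(g.serialize(format="turtle"))
```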
... In data warehouse (DW) scenarios, ETL (extraction, transformation, loading) processes are responsible for the extraction of data from heterogeneous operational data sources, their transformation (conversion, cleaning, normalization, etc.) and their loading into the DW [5]. A data warehouse is used as a central repository of data for medium and large business organizations [6]. ...
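That definition maps directly onto three functions. A minimal, self-contained sketch, with an invented schema and data rather than any specific paper's design:

```python
import sqlite3

def extract():
    # extraction from a (here: hard-coded) heterogeneous source
    return [(" Alice ", "120"), ("Bob", "95")]

def transform(rows):
    # conversion and cleaning: trim names, cast amounts to integers
    return [(name.strip(), int(amount)) for name, amount in rows]

def load(rows):
    # loading into the warehouse, modeled as an in-memory SQLite table
    dw = sqlite3.connect(":memory:")
    dw.execute("CREATE TABLE fact_sales (customer TEXT, amount INTEGER)")
    dw.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    return dw

warehouse = load(transform(extract()))
```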
Article
Full-text available
The Constitutional Court of The Republic of Indonesia is an institution which has the authority to judge and examine constitutions. In this institution, each component is responsible for managing its own finance and budgeting. They therefore need a solution to summarize the amount and absorption of the budget in every component. This paper proposes a solution: designing a data warehouse model to process and analyse the down payment administration data. The nine-step design methodology for developing data warehouses is implemented in this work. This paper aims to bring some benefit to The Constitutional Court, especially in the form of information about down payment absorption for every component each year. The result is presented as a PDF report and an informative dashboard using data warehouse tools.
... 4.4 SCS-DWCS mapping process [Luján-Mora 2004]. 4.5 Level 2 of the data mapping diagram (DM) modeling: mapping between tables [Luján-Mora 2004]. 4.6 Level 3 of the data mapping diagram (DM) modeling: mapping between attributes [Luján-Mora 2004]. 4.7 "Mechanism" modeling element as a classifier (a) and as graphical notation (b) [Trujillo 2003]. ...
... The top-down modeling approach in BPMN is interesting. The Subprocess workflow object in its collapsed form (Collapsed Subprocess) resembles the package idea of the approach in [Luján-Mora 2004], since an initial BPMN model presents a general architecture of the process without details of the content of the subprocesses. Each subprocess is then modeled separately in its expanded form (Expanded Subprocess). ...
Thesis
This thesis deals with the processing of "Big Data" in decision-support systems. Indeed, our research has been motivated by the observation that the well-known Extracting-Transforming-Loading (ETL) process, which is responsible for integrating data from various sources for analysis and decision-making purposes, had to be revisited in the light of the recent emergence of "Big Data" in order to deal with its complexity, characterized by the so-called "4Vs": "Volume", "Variety", "Velocity", and "Veracity". Among all the complexity dimensions, our work focuses on the issue of "Volume" in Big Data by means of adapting the ETL to parallel and distributed environments. With respect to ETL modeling, a number of interesting approaches have been suggested in the research literature. Unfortunately, these have proven to be unsuitable for Big Data environments. Furthermore, the existing parallel/distributed processing approaches have been defined at the implementation stage only. In an effort to fill this gap, we have exploited and hybridized conceptual modeling and parallel/distributed methods. In particular, we have adopted Vassiliadis' notation, which we have extended with MapReduce paradigm concepts in order to model classical and parallel/distributed entities. As for the parallelization/distribution of ETL processes on clusters, most existing solutions have been defined only at a "process" level, which tends to be coarse-grained in nature. For the purposes of defining fine-grained and various topologies of distribution, our approach allows the parallelization/distribution of the ETL at three levels: the "functionality"/"elementary functions" levels as well as the "process" level. In addition, the distribution can be processed in vertical and/or horizontal directions. Based on our new methods, we have developed a prototype referred to as Parallel-ETL (P-ETL for short). The P-ETL platform has been developed under the Apache Hadoop environment with Java 1.7 (Java SE Runtime Environment) and NetBeans IDE 7.4, under Ubuntu 12.10. Our performance analysis has revealed that employing 25 to 38 parallel tasks enables our novel approach to speed up the ETL process by up to 33%, with the improvement rate being linear.
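The actual P-ETL prototype runs on Apache Hadoop; the standalone Python sketch below only mirrors the partition/transform/merge shape of the process-level parallelization described above, with invented names and a trivial transformation:

```python
from multiprocessing import Pool

def transform_partition(rows):
    # "map" step: apply an elementary transformation to one partition
    return [{"id": r["id"], "amount": r["amount"] * 2} for r in rows]

def parallel_etl(rows, workers=4):
    chunk = max(1, len(rows) // workers)
    partitions = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with Pool(workers) as pool:
        parts = pool.map(transform_partition, partitions)
    return [row for part in parts for row in part]  # merge ("reduce") step

if __name__ == "__main__":
    data = [{"id": i, "amount": i} for i in range(100)]
    print(len(parallel_etl(data)))  # 100 rows, transformed in parallel
```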
... This book is widely accepted as the best guidance for data warehouse developers. Ali El-Sappagh et al. [4] divide the research in the field of modeling ETL processes into three groups: (1) modeling ETL using mapping expressions and guidelines [103,88,119,120]; (2) modeling using conceptual constructs [129,114,18]; (3) modeling based on UML [86,85]. Over time, data warehousing has become a mature and well-established technology. ...
Thesis
With the rapid development of information technology, the nature of evidence presented in court has shifted from paper documents to electronically stored information, also known as ESI. During the fact-finding stage of civil litigation, ESI in the form of structured and unstructured data is sought out and mined for relevant information that can be used as evidence. This process is known as Electronic Discovery, commonly referred to as eDiscovery. Since organizations are increasingly using databases to store critical business information, such as financial information, human resource data and customer data, structured data in databases is now requested more often by courts. Organizations increasingly face the enormous problem of how to find relevant information across large numbers of relational databases where data is highly structured and information is stored separately in different tables due to database normalization. To address this problem, we propose an approach to determine the relevance of databases with regard to given search terms, which are linked by Boolean operators. By applying this approach, the volume of data can be significantly reduced by filtering out irrelevant databases that have no probative value on any issue in the lawsuit. The methodology adopted for this research is design science, which has been widely applied to guide and evaluate information systems research. First, we describe the current business context of eDiscovery and the emergence of structured data in this field. A review of the existing literature and current approaches is followed by an assessment of each approach from both theoretical and practical perspectives. This includes literature found relating to eDiscovery, keyword search methods, data warehouses and NoSQL databases. Our assessment finds that none of these approaches is best suited for our problem, although these methods do contain knowledge that could be helpful for developing a new approach. We suggest denormalizing databases before conducting a search, to put information together in a single place and enable queries containing Boolean operators. The denormalized data should then be exported as a file in Avro format and imported into the Hadoop Distributed File System (HDFS) using Sqoop. Hadoop MapReduce should then be adopted for distributed and scalable searching on HDFS. We conducted some experiments to evaluate our approach using both public and production data provided by eDiscovery experts at Volkswagen Group. The results demonstrate that our approach is feasible compared with an Elasticsearch-based search carried out by an external consulting firm. Finally, this thesis is concluded by outlining limitations and open issues.
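The core relevance test described above, a Boolean combination of search terms evaluated over denormalized rows, can be sketched as follows. The thesis implements this at scale with Sqoop, Avro and Hadoop MapReduce; this toy version shows only the predicate logic, with an invented query structure:

```python
def row_matches(row_text: str, query) -> bool:
    """query is either a term or a nested tuple ("AND"|"OR", q1, q2, ...)."""
    if isinstance(query, str):
        return query.lower() in row_text.lower()
    op, *operands = query
    hits = (row_matches(row_text, q) for q in operands)
    return all(hits) if op == "AND" else any(hits)

rows = ["Invoice 4711 paid by ACME", "HR record for J. Doe"]
query = ("AND", "invoice", ("OR", "acme", "globex"))
# A database is deemed relevant if any of its denormalized rows matches.
relevant = any(row_matches(r, query) for r in rows)
print(relevant)  # True
```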
... The modeling of ETL processes using existing general-purpose modeling languages (such as the Unified Modeling Language (UML) or the Business Process Model and Notation (BPMN)), which have been extended to incorporate the concepts specific to the ETL process domain, has been proposed in (Trujillo and Luján-Mora 2003; Luján-Mora et al. 2004; Muñoz et al. 2008, 2009; El Akkaoui and Zimányi 2009; El Akkaoui et al. 2011). At the same time, the use of DSLs which are tailored to a particular domain has also been proposed in (Vassiliadis et al. 2002a, b, 2003; Simitsis and Vassiliadis 2003, 2008; Simitsis et al. 2005; Simitsis 2005). ...
... As a final point, it should be noted that some of the approaches do not provide explicit concepts which allow for the formal definition of the semantics of the data transformations. For example, in (Vassiliadis et al. 2002a, b; Simitsis and Vassiliadis 2003; Luján-Mora and Trujillo 2004) notes or annotations are used for the explanation of the semantics of the transformations (e.g., type, expression, conditions, constraints, etc.), while in (Trujillo and Luján-Mora 2003) even the actual attribute mappings are defined through notes. Since these approaches allow the notes to be given in a natural language (and often without any restrictions on their content), they do not represent a formal specification. ...
Article
Full-text available
The development of Extract–Transform–Load (ETL) processes is the most complex, time-consuming and expensive phase of data warehouse development. Yet, the dynamics of modern business systems demand a more agile and flexible approach to their development. As a result, current research in this area is focused on ETL process conceptualization and the automation of ETL process development. This paper proposes a novel solution for automating ETL processes using the domain-specific modeling (DSM) approach. The proposed solution is based on the formal specification of ETL processes and the implementation of such formal specifications. Thus, in accordance with the DSM approach, several new domain-specific languages (DSLs) are introduced, each defining concepts relevant for a specific aspect of an ETL process. The focus of this paper is the actual implementation of the formal specification of an ETL process. To this end, a specific ETL platform (ETL-PL) is introduced to technologically support both the modeling of ETL processes (i.e., the creation of models in accordance with the introduced DSLs) and the automated transformation of the created models into the executable code of a specific application framework (representing ETL-PL’s execution environment). It should be emphasized that ETL-PL actually presumes the dynamic execution of ETL models or, more precisely, the executable code is generated at runtime. Thus the execution environment consists of code generator components and the components implementing the application framework. ETL-PL has been implemented as an extension of the .NET platform.
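ETL-PL generates executable code at runtime from models expressed in its DSLs, targeting the .NET platform. As a language-agnostic illustration of that idea only (the model as data, behavior produced from it at runtime), here is a toy Python interpreter for an invented two-operation ETL specification:

```python
SPEC = [  # the "model": a tiny, made-up declarative ETL description
    {"op": "filter", "column": "amount", "min": 0},
    {"op": "rename", "from": "amount", "to": "net_amount"},
]

def compile_step(step):
    # turn one model element into executable behavior at runtime
    if step["op"] == "filter":
        return lambda rows: [r for r in rows if r[step["column"]] >= step["min"]]
    if step["op"] == "rename":
        return lambda rows: [
            {step["to"] if k == step["from"] else k: v for k, v in r.items()}
            for r in rows
        ]
    raise ValueError(f"unknown op: {step['op']}")

def run(spec, rows):
    for step in spec:
        rows = compile_step(step)(rows)
    return rows

print(run(SPEC, [{"amount": 5}, {"amount": -1}]))  # [{'net_amount': 5}]
```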
... We produce a huge quantity of agricultural produce, and rice is the staple diet of this country; we produce an average annual quantity of 35.8 metric tons [2]. Rice is the world's 3rd most important cereal crop based on production volume [3] and is the prime agricultural produce of Bangladesh. In the fiscal year 2013-14, the government sanctioned 4.66 billion Taka as fertilizer subsidy; on average, more than 5% of a fiscal year's total budget is allocated to the agricultural sector [5], and a survey in 2010 showed that about 47% of the country's total employment is provided by agriculture [5]. ...
Article
Full-text available
In this paper, we have tried to identify the prominent factors of rice production in all three seasons of the year (Aus, Aman, and Boro) by applying K-Means clustering to climate and soil variables' data, warehoused using a fact constellation schema. For the clustering, the popular machine-learning tool Weka was used, whose visualization feature was principally useful for determining the patterns, dependencies, and relationships of rice yield with different climate and soil factors.
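The paper performed the clustering in Weka; a hedged scikit-learn equivalent, with invented sample values for the climate and soil variables, looks like this:

```python
import numpy as np
from sklearn.cluster import KMeans

# Columns are illustrative: temperature (°C), rainfall (mm), soil pH.
X = np.array([
    [25.1, 180.0, 6.5],
    [31.4, 320.0, 5.9],
    [22.8,  90.0, 7.1],
    [30.2, 300.0, 6.0],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignment per observation (e.g., per season)
```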