Figure (available from: Journal of Big Data): Manual effort estimates.

Source publication
Article
Full-text available
Background: Data scientists spend considerable amounts of time preparing data for analysis. Data preparation is labour-intensive because the data scientist typically takes fine-grained control over each aspect of each step in the process, motivating the development of techniques that seek to reduce this burden. Results: This paper presents an archit...

Citations

... This work focuses on curating social data through several curation services combined with crowdsourcing. We also examined [6], which ensures data preparation for structured data source curation via loosely coupled modules. In the same context, we identified other architectures and frameworks like KAYAK [7], which represents several data preparation tasks as a Directed Acyclic Graph (DAG) to constitute pipelines. ...
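As a rough illustration of the DAG-style pipelines mentioned above, a set of preparation tasks and their dependencies can be executed in topological order. This is only a minimal sketch in Python; the task names and the dictionary representation are assumptions for illustration, not KAYAK's actual primitives or API.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical preparation tasks and their prerequisites (assumed names):
# each task may run only after all of its predecessors have completed.
pipeline = {
    "profile":   set(),
    "clean":     {"profile"},
    "match":     {"profile"},
    "integrate": {"clean", "match"},
}

def run_task(name):
    print(f"running {name}")  # placeholder for the real preparation step

# Execute the pipeline in a valid topological order of the DAG.
for task in TopologicalSorter(pipeline).static_order():
    run_task(task)
```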
Article
Full-text available
Recently, we noticed the emergence of several data management architectures to cope with the challenges imposed by big data. Among them, data lakehouses are receiving much interest from industry and academia due to their ability to hold disparate multi-structured batch and streaming data sources in a single data repository. The heterogeneous and complex nature of the data therefore requires a dedicated process to improve its quality and retrieve value from it. Data curation encompasses several tasks that clean and enrich data to ensure it continues to fit user requirements. Nevertheless, most existing data curation approaches lack the dynamism, flexibility, and customization needed to constitute a data curation pipeline aligned with end-user requirements, which may vary according to the user's decision context. Moreover, they are dedicated to curating batch data sources of a single structure type (e.g., semi-structured). Considering the changing requirements of users and the need to build a customized data curation pipeline according to user and data source characteristics, we propose a service-based framework for adaptive data curation in data lakehouses that encompasses five modules: data collection, data quality evaluation, data characterization, curation service composition, and data curation. The proposed framework is built upon a new modular ontology for data characterization and evaluation and a curation service composition approach, both of which we detail in this paper. The experimental findings validate the contributions' performance in terms of effectiveness and execution time.
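A minimal sketch of how the five modules named in the abstract could be chained, with the service choice driven by data source characteristics and a quality score. All function names, the record format, and the selection rules below are assumptions for illustration; the paper's actual framework relies on a dedicated ontology and composition approach.

```python
# Minimal, assumed sketch of chaining the five modules named in the abstract.

def collect(source):
    return source.get("records", [])

def evaluate_quality(records):
    # Toy quality score: fraction of records with no missing fields.
    complete = [r for r in records if all(v is not None for v in r.values())]
    return len(complete) / len(records) if records else 0.0

def characterize(source):
    return {"structure": source.get("structure", "unknown")}

def compose_services(profile, quality):
    # Pick curation services based on source characteristics and quality.
    services = []
    if quality < 0.9:
        services.append("impute_missing")
    if profile["structure"] == "semi-structured":
        services.append("flatten")
    return services

def curate(records, services):
    for s in services:
        print(f"applying {s} to {len(records)} records")
    return records

source = {"structure": "semi-structured",
          "records": [{"a": 1, "b": None}, {"a": 2, "b": 3}]}
records = collect(source)
plan = compose_services(characterize(source), evaluate_quality(records))
curate(records, plan)
```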
... Here, we describe automation in data preparation using Data Preparer, a descendant of VADA [57]. In Data Preparer, the problem description is: given a target, some sources and some optional instances that are related to the target, populate the target with data from the sources. ...
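The problem framing can be pictured as populating a target schema from sources whose columns are named differently, guided by discovered matchings. The sketch below is only an illustration of that framing under assumed data and matchings; it is not Data Preparer's implementation, which discovers matchings and mappings automatically, possibly helped by example instances.

```python
# Illustrative framing only: populate a target schema from sources whose
# column names differ, using assumed column matchings.

target_schema = ["street", "city", "price"]

sources = [
    {"name": "agency_a", "rows": [{"addr": "1 High St", "town": "Leeds", "cost": 150000}]},
    {"name": "agency_b", "rows": [{"street": "2 Low Rd", "city": "York", "price": 120000}]},
]

# Assumed matchings from each source's columns to the target schema.
matchings = {
    "agency_a": {"addr": "street", "town": "city", "cost": "price"},
    "agency_b": {"street": "street", "city": "city", "price": "price"},
}

target = []
for src in sources:
    mapping = matchings[src["name"]]
    for row in src["rows"]:
        # Rename matched columns and drop everything else.
        target.append({mapping[c]: v for c, v in row.items() if c in mapping})

print(target)
```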
Article
Full-text available
Data analysis often uses data sets that were collected for different purposes. Indeed, new insights are often obtained by combining data sets that were produced independently of each other, for example by combining data from outside an organization with internal data resources. As a result, there is a need to discover, clean, integrate and restructure data into a form that is suitable for an intended analysis. Data preparation, also known as data wrangling, is the process by which data are transformed from its existing representation into a form that is suitable for analysis. In this paper, we review the state-of-the-art in data preparation, by: (i) describing functionalities that are central to data preparation pipelines, specifically profiling, matching, mapping, format transformation and data repair; and (ii) presenting how these capabilities surface in different approaches to data preparation, that involve programming, writing workflows, interacting with individual data sets as tables, and automating aspects of the process. These functionalities and approaches are illustrated with reference to a running example that combines open government data with web extracted real estate data.
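Profiling, the first functionality listed in the abstract, typically produces per-column statistics that downstream matching and repair steps consume. A minimal sketch, assuming a simple list-of-dicts record format:

```python
# Per-column profiling sketch (assumed record format): the kind of statistics
# that matching and repair steps typically consume.
from collections import Counter

rows = [
    {"postcode": "LS1 4DY", "price": "150000"},
    {"postcode": "YO1 7HH", "price": None},
    {"postcode": "LS1 4DY", "price": "175000"},
]

def profile(rows, column):
    values = [r[column] for r in rows]
    present = [v for v in values if v is not None]
    return {
        "null_ratio": 1 - len(present) / len(values),
        "distinct": len(set(present)),
        "most_common": Counter(present).most_common(1),
    }

for col in ("postcode", "price"):
    print(col, profile(rows, col))
```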
... This graph is used to compute queries that discover join paths with data cleaning operators. An alternative is the VADA [28] architecture, which aims to support the process of data wrangling (i.e., extracting, cleaning and collating datasets). VADA is a knowledge-representation system that, relying on the expressivity of the Datalog± family of languages, provides services for schema matching, schema alignment, data fusion and data quality, among others. ...
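Datalog± programs are evaluated by a dedicated reasoning engine; the toy snippet below only mimics, in Python, the idea of deriving new facts (column matches) from existing facts by applying a rule to a fixpoint. It is not Vadalog/Datalog± syntax, and the facts and rule are invented for illustration.

```python
# Toy forward-chaining illustration of rule-based match derivation.
# Facts: (table, column, normalized_label)
columns = {
    ("listings_a", "addr", "address"),
    ("listings_b", "street_address", "address"),
    ("listings_b", "asking_price", "price"),
}

# Rule: match(T1, C1, T2, C2) :- col(T1, C1, L), col(T2, C2, L), T1 != T2.
matches = set()
changed = True
while changed:  # iterate to a fixpoint
    changed = False
    for (t1, c1, l1) in columns:
        for (t2, c2, l2) in columns:
            if t1 != t2 and l1 == l2 and (t1, c1, t2, c2) not in matches:
                matches.add((t1, c1, t2, c2))
                changed = True

print(sorted(matches))
```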
Article
Full-text available
The ability to cross data from multiple sources represents a competitive advantage for organizations. Yet, the governance of the data lifecycle, from the data sources into valuable insights, is largely performed in an ad-hoc or manual manner. This is particularly concerning in scenarios where tens or hundreds of continuously evolving data sources produce semi-structured data. To overcome this challenge, we develop a framework for operationalizing and automating data governance. For the former, we propose a zoned data lake architecture and a set of data governance processes that allow the systematic ingestion, transformation and integration of data from heterogeneous sources, in order to make them readily available for business users. For the latter, we propose a set of metadata artifacts that allow the automatic execution of data governance processes, addressing a wide range of data management challenges. We showcase the usefulness of the proposed approach using a real-world use case, stemming from a collaborative project with the World Health Organization for the management and analysis of data about Neglected Tropical Diseases. Overall, this work contributes to facilitating organizations' adoption of data-driven strategies through a cohesive framework for operationalizing and automating data governance.
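One possible reading of the metadata-driven automation described above: each source is described by a declarative metadata artifact, and the governance process dispatches on that description. The artifact fields and zone names below are assumptions for illustration, not the paper's actual metadata model.

```python
# Sketch of metadata-driven ingestion with assumed artifact fields.
source_metadata = [
    {"name": "ntd_cases", "format": "json", "landing_zone": "raw/ntd",
     "schedule": "daily"},
    {"name": "facilities", "format": "csv", "landing_zone": "raw/who",
     "schedule": "weekly"},
]

def ingest(meta):
    # Placeholder: a real process would read the source, validate it against
    # the declared format, and write it to the declared landing zone.
    print(f"ingesting {meta['name']} ({meta['format']}) "
          f"into {meta['landing_zone']} on a {meta['schedule']} schedule")

for meta in source_metadata:
    ingest(meta)
```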
... Several works in the literature address the data curation process. Some of them target the sequencing and dynamic orchestration of automatic data preparation and cleansing combined with semi-automated curation steps, as in [2] for social data via a pool of services, and in [4], which is dedicated only to structured data source curation via loosely coupled data preparation components. In the same vein, some architectures and frameworks were proposed, such as KAYAK [8], which lies between users/applications and the file system (i.e., the data storage location) and exposes a set of primitives and tasks for data preparation represented as a Directed Acyclic Graph. ...
Chapter
Data lakehouses are novel data management designs intended to hold disparate batch and streaming data sources in a single data repository. These data sources could be retrieved from different origins, including sensors, social networks, and open data. Because the data held in data lakehouses is heterogeneous and complex, data curation is required to improve its quality. Most existing data curation systems are static, require expert intervention (which can be error-prone and time-consuming), do not meet user expectations, and do not handle real-time data. Given these constraints, we propose a service-based framework for adaptive data curation in data lakehouses that encompasses five modules: data collection, data quality evaluation, data characterization, curation services composition, and data curation. The curation services composition module, which leverages several curation services to curate multi-structured batch and streaming data sources, is the focus of this work. A reinforcement learning-based method is provided for adaptively deriving the curation services composition scheme based on the data source type and the end user's functional and non-functional requirements. The experimental findings validate the proposal's effectiveness and demonstrate that it outperforms the First Visit Monte Carlo and Temporal Learning algorithms in terms of scalability, execution time, and alignment with functional and non-functional requirements. Keywords: Data curation, Service composition, Machine learning, Reinforcement learning, Data lakehouse.
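As a sketch of how reinforcement learning can drive service composition, the toy tabular Q-learning loop below treats a coarse data-quality level as the state and curation services as actions. The states, actions, rewards and hyperparameters are all assumed for illustration and do not reproduce the chapter's formulation.

```python
# Toy tabular Q-learning for choosing curation services (assumed setting).
import random
from collections import defaultdict

actions = ["deduplicate", "impute", "normalize", "stop"]
alpha, gamma, epsilon, episodes = 0.1, 0.9, 0.2, 500
Q = defaultdict(float)  # Q[(state, action)] -> estimated return

def step(state, action):
    # Toy environment: the state is a coarse quality level 0..3; useful
    # services may raise it, and "stop" yields the reached quality as reward.
    if action == "stop":
        return state, float(state), True
    return min(state + random.choice([0, 1]), 3), -0.1, False

for _ in range(episodes):
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        target = reward if done else reward + gamma * max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = next_state

print(max(actions, key=lambda a: Q[(0, a)]))  # preferred first curation service
```

In this toy setting the learned policy simply prefers services that raise the quality level before stopping.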
... A survey by Forbes in 2016 reported that data scientists typically spend 80% of their time preparing data before analysing it [69]. As Opal Card data comes from an official government system used across the whole State of NSW, it has several advantages: it is a structured, complete, and maintained dataset. ...
Article
Full-text available
In an era of smart cities, artificial intelligence and machine learning, data is purported to be the 'new oil', fuelling increasingly complex analytics and assisting us to craft and invent future cities. This paper outlines the role of what we know today as big data in understanding the city and includes a summary of its evolution. Through a critical reflective case study approach, the research examines the application of urban transport big data for informing planning of the city of Sydney. Specifically, transport smart card data, with its diverse constraints, was used to understand mobility patterns through the lens of the 30 min city concept. The paper concludes by offering reflections on the opportunities and challenges of big data and the promise it holds in supporting data-driven approaches to planning future cities.
... In our case, this is the Vadalog system, which combines an expressive, scalable Knowledge Graph engine with good support for the data science workflow. The Vadalog system is Oxford's contribution to VADA [4,5], a joint project of the universities of Edinburgh, Manchester, and Oxford. ...
Article
Following the recent successful examples of large technology companies, many modern enterprises seek to build Knowledge Graphs to provide a unified view of corporate knowledge, and to draw deep insights using machine learning and logical reasoning. There is currently a perceived disconnect between the traditional approaches for data science, typically based on machine learning and statistical modelling, and systems for reasoning with domain knowledge. In this paper, we demonstrate how to perform a broad spectrum of data science tasks in a unified Knowledge Graph environment. This includes data wrangling, complex logical and probabilistic reasoning, and machine learning. We base our work on the state-of-the-art Knowledge Graph Management System Vadalog, which delivers highly expressive and efficient logical reasoning and provides seamless integration with modern data science toolkits such as the Jupyter platform. We argue that this is a significant step forward towards practical, holistic data science workflows that combine machine learning and reasoning.
... "Matcher" is the term used for the algorithm that is used to determine if two entities are similar enough to represent the same real-world entity. It also examines the possibility to combine multiple matches and use training data if necessary (see [25,26]). In general, researchers are aware of the difficulty of detecting duplicates within incomplete data sets [12,18]. ...
Article
Full-text available
Duplicate records are a common problem within data sets, especially in huge-volume databases. The accuracy of duplicate detection determines the efficiency of the duplicate removal process. However, duplicate detection has become more challenging due to the presence of missing values within the records: during the clustering and matching process, missing values can cause records deemed similar to be inserted into the wrong group, leading to undetected duplicates. In this paper, an improvement to duplicate detection in the presence of missing values is proposed through the Duplicate Detection within the Incomplete Data set (DDID) method. The missing values were hypothetically added to the key attributes of three data sets under study, using an arbitrary pattern to simulate both complete and incomplete data sets. The results were analyzed, and the performance of duplicate detection was evaluated using the Hot Deck method to compensate for the missing values in the key attributes. It was hypothesized that by using Hot Deck, duplicate detection performance would be improved. Furthermore, DDID's performance was compared to an earlier duplicate detection method, namely DuDe, in terms of accuracy and speed. The findings showed that, even though the data sets were incomplete, DDID offered better accuracy and faster duplicate detection than DuDe. The results of this study offer insights into the constraints of duplicate detection within incomplete data sets.
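A compact sketch of the Hot Deck idea used above: a record with a missing key attribute borrows the value from a similar, complete "donor" record before clustering and matching. The record format and donor-selection rule are assumed for illustration and do not reproduce DDID's exact procedure.

```python
# Hot Deck imputation sketch (assumed data and donor rule): fill a missing
# key attribute from a donor record that agrees on another attribute, so
# that clustering/matching does not place the record in the wrong group.

records = [
    {"id": 1, "city": "Leeds", "postcode": "LS1 4DY"},
    {"id": 2, "city": "Leeds", "postcode": None},      # missing key attribute
    {"id": 3, "city": "York",  "postcode": "YO1 7HH"},
]

def hot_deck(records, key="postcode", match_on="city"):
    donors = [r for r in records if r[key] is not None]
    for r in records:
        if r[key] is None:
            # donor = a complete record that agrees on the matching attribute
            candidates = [d for d in donors if d[match_on] == r[match_on]]
            if candidates:
                r[key] = candidates[0][key]
    return records

print(hot_deck(records))
```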
... After the first two operations are completed, we transform the tables into an RDF KB and apply reasoning with existentially quantified rules to identify and link entities in different tables. Reasoning with existentially quantified rules is a well-known technology for data integration and wrangling [22]. For our problem, we designed a set of rules that considers the types of columns and string similarities to establish links using the sameAs relation. ...
Preprint
Tables in scientific papers contain a wealth of valuable knowledge for the scientific enterprise. To help the many of us who frequently consult this type of knowledge, we present Tab2Know, a new end-to-end system to build a Knowledge Base (KB) from tables in scientific papers. Tab2Know addresses the challenge of automatically interpreting the tables in papers and of disambiguating the entities that they contain. To solve these problems, we propose a pipeline that employs both statistical-based classifiers and logic-based reasoning. First, our pipeline applies weakly supervised classifiers to recognize the type of tables and columns, with the help of a data labeling system and an ontology specifically designed for our purpose. Then, logic-based reasoning is used to link equivalent entities (via sameAs links) in different tables. An empirical evaluation of our approach using a corpus of papers in the Computer Science domain has returned satisfactory performance. This suggests that ours is a promising step to create a large-scale KB of scientific knowledge.
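A sketch of the kind of sameAs linking described above, assuming column types have already been inferred: two cells from different tables are linked when their column types agree and their labels are sufficiently similar. The data, threshold and rule are invented for illustration and are not Tab2Know's actual rules.

```python
# Sketch of sameAs linking over cells from different tables (assumed data).
from difflib import SequenceMatcher

cells = [
    {"table": "t1", "column_type": "Method",  "label": "BERT-base"},
    {"table": "t2", "column_type": "Method",  "label": "BERT base"},
    {"table": "t3", "column_type": "Dataset", "label": "SQuAD"},
]

def similar(a, b, threshold=0.85):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Link cells from different tables with the same column type and similar labels.
same_as = [
    (x["label"], y["label"])
    for i, x in enumerate(cells)
    for y in cells[i + 1:]
    if x["table"] != y["table"]
    and x["column_type"] == y["column_type"]
    and similar(x["label"], y["label"])
]
print(same_as)  # [('BERT-base', 'BERT base')]
```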
... Alternatively, given that graphs are widely accepted as a convenient data model to represent real-world abstractions and their relationships [3], the community has proposed different solutions grounded in this formalism. However, as the popularity of wrangling systems grows, non-technical users face high entry barriers when interacting with them, as queries must be written in technical languages such as Datalog [4] or SPARQL [5]. Additionally, the vast number of heterogeneous and independent datasets available on the web poses several challenges for contemporary wrangling demands [6]. ...
... Pre: G_ψ is a rewritings graph. Post: V(G_ψ) is a set of source queries.
function COMBINEREWRITINGS(G_ψ)
    while E(G_ψ) ≠ ∅ do
        e ← CHOOSEEDGE(G_ψ)
        ...
Article
Modern data analysis applications require the ability to provide on-demand integration of data sources while offering a flexible and user-friendly query interface. Traditional techniques for answering queries using views, which focus on a rather static setting, fail to address such requirements. To overcome these issues, we propose a fully-fledged data integration approach based on graph-based constructs. The extensibility of graphs allows us to extend the traditional data integration framework with view definitions. Furthermore, we also propose a query language based on subgraphs. We tackle query answering via a query rewriting algorithm based on well-known algorithms for answering queries using views. We experimentally show that the proposed method yields good performance and does not introduce a significant overhead.
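The pseudocode fragment quoted earlier suggests a loop that repeatedly picks an edge of the rewritings graph and combines the two partial rewritings it connects, until no edges remain. The Python sketch below illustrates that shape under assumed data structures (nodes as sets of source relations) and an arbitrary edge-selection policy; it is not the paper's algorithm.

```python
# Sketch of a combine-rewritings loop: pick an edge, merge its endpoints,
# repeat until no edges remain. Data structures and merge rule are assumed.

# Nodes are partial rewritings (here, sets of source relations);
# edges indicate which pairs can be combined.
nodes = {"r1": {"S1"}, "r2": {"S2"}, "r3": {"S3"}}
edges = [("r1", "r2"), ("r2", "r3")]

def choose_edge(edges):
    return edges[0]  # any selection policy would do in this sketch

while edges:
    a, b = choose_edge(edges)
    merged = a + "+" + b
    nodes[merged] = nodes.pop(a) | nodes.pop(b)  # combine the two rewritings
    # Redirect remaining edges to the merged node and drop the used one.
    edges = [(merged if x in (a, b) else x, merged if y in (a, b) else y)
             for (x, y) in edges[1:] if {x, y} != {a, b}]

print(nodes)  # each remaining node is a combined source query
```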
... "Matcher" is the term used for the algorithm that is used to determine if two entities are similar enough to represent the same real-world entity. It also examines the possibility to combine multiple matches and use training data if necessary (for more detail, refer to [25,26]). Researchers are aware of the difficulty of detecting duplicates within incomplete data sets [12,18]. ...
Preprint
Full-text available
Duplicate records are a known problem within data sets, especially within databases of huge volumes. The accuracy of duplicate detection determines the efficiency of the duplicate removal process. Unfortunately, detecting duplicates becomes more challenging due to the presence of missing values within the records. This is because, during the clustering and matching process, missing values can cause records that are similar to be assigned to the wrong group, leaving duplicates undetected. In this paper, we present how duplicate detection can be improved, even when missing values are present within a data set, using our Duplicates Detection within the Incomplete Data set (DDID) method. We hypothetically add the missing values to the key attributes of two data sets under study, using an arbitrary pattern to simulate both complete and incomplete data sets. We analyze the results to evaluate the performance of duplicate detection when the Hot Deck method is used to compensate for the missing values in the key attributes. We hypothesize that using Hot Deck improves duplicate detection performance. The performance of DDID is compared with an earlier duplicate detection method (called DuDe) in terms of accuracy and speed. The findings of the experiment show that, even though the data sets are incomplete, DDID is capable of offering better accuracy and faster duplicate detection than the benchmark method DuDe. The results of this study contribute to duplicate detection under the constraint of incomplete data sets.