Figure (available from: Journal of Big Data): Manual effort estimates.

Source publication
Article
Full-text available
Background: Data scientists spend considerable amounts of time preparing data for analysis. Data preparation is labour-intensive because the data scientist typically takes fine-grained control over each aspect of each step in the process, motivating the development of techniques that seek to reduce this burden. Results: This paper presents an archit...

Citations

... This work focuses on curating social data through several curation services combined with crowdsourcing. We also examined [6], which ensures data preparation for structured data source curation via loosely coupled modules. In the same context, we identified other architectures and frameworks like KAYAK [7], which represents several data preparation tasks as a Directed Acyclic Graph (DAG) to constitute pipelines. ...
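As a rough illustration of the DAG-style pipelines mentioned above, a set of preparation tasks and their dependencies can be executed in topological order. This is only a minimal sketch in Python; the task names and the dictionary representation are assumptions for illustration, not KAYAK's actual primitives or API.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical preparation tasks and their prerequisites (assumed names):
# each task may run only after all of its predecessors have completed.
pipeline = {
    "profile":   set(),
    "clean":     {"profile"},
    "match":     {"profile"},
    "integrate": {"clean", "match"},
}

def run_task(name):
    print(f"running {name}")  # placeholder for the real preparation step

# Execute the pipeline in a valid topological order of the DAG.
for task in TopologicalSorter(pipeline).static_order():
    run_task(task)
```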
Article
Full-text available
Recently, we noticed the emergence of several data management architectures to cope with the challenges imposed by big data. Among them, data lakehouses are receiving much interest from industry and academia due to their ability to hold disparate multi-structured batch and streaming data sources in a single data repository. The heterogeneous and complex nature of the data therefore requires a dedicated process to improve its quality and retrieve value from it. Data curation encompasses several tasks that clean and enrich data to ensure it continues to fit user requirements. Nevertheless, most existing data curation approaches lack the dynamism, flexibility, and customization needed to constitute a data curation pipeline aligned with end-user requirements, which may vary according to the user's decision context. Moreover, they are dedicated to curating batch data sources of a single structure type (e.g., semi-structured). Considering the changing requirements of users and the need to build a customized data curation pipeline according to user and data source characteristics, we propose a service-based framework for adaptive data curation in data lakehouses that encompasses five modules: data collection, data quality evaluation, data characterization, curation service composition, and data curation. The proposed framework is built upon a new modular ontology for data characterization and evaluation and a curation service composition approach, both of which we detail in this paper. The experimental findings validate the contributions' performance in terms of effectiveness and execution time.
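A minimal sketch of how the five modules named in the abstract could be chained, with the service choice driven by data source characteristics and a quality score. All function names, the record format, and the selection rules below are assumptions for illustration; the paper's actual framework relies on a dedicated ontology and composition approach.

```python
# Minimal, assumed sketch of chaining the five modules named in the abstract.

def collect(source):
    return source.get("records", [])

def evaluate_quality(records):
    # Toy quality score: fraction of records with no missing fields.
    complete = [r for r in records if all(v is not None for v in r.values())]
    return len(complete) / len(records) if records else 0.0

def characterize(source):
    return {"structure": source.get("structure", "unknown")}

def compose_services(profile, quality):
    # Pick curation services based on source characteristics and quality.
    services = []
    if quality < 0.9:
        services.append("impute_missing")
    if profile["structure"] == "semi-structured":
        services.append("flatten")
    return services

def curate(records, services):
    for s in services:
        print(f"applying {s} to {len(records)} records")
    return records

source = {"structure": "semi-structured",
          "records": [{"a": 1, "b": None}, {"a": 2, "b": 3}]}
records = collect(source)
plan = compose_services(characterize(source), evaluate_quality(records))
curate(records, plan)
```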
... Here, we describe automation in data preparation using Data Preparer, a descendant of VADA [57]. In Data Preparer, the problem description is: given a target, some sources and some optional instances that are related to the target, populate the target with data from the sources. ...
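The problem framing can be pictured as populating a target schema from sources whose columns are named differently, guided by discovered matchings. The sketch below is only an illustration of that framing under assumed data and matchings; it is not Data Preparer's implementation, which discovers matchings and mappings automatically, possibly helped by example instances.

```python
# Illustrative framing only: populate a target schema from sources whose
# column names differ, using assumed column matchings.

target_schema = ["street", "city", "price"]

sources = [
    {"name": "agency_a", "rows": [{"addr": "1 High St", "town": "Leeds", "cost": 150000}]},
    {"name": "agency_b", "rows": [{"street": "2 Low Rd", "city": "York", "price": 120000}]},
]

# Assumed matchings from each source's columns to the target schema.
matchings = {
    "agency_a": {"addr": "street", "town": "city", "cost": "price"},
    "agency_b": {"street": "street", "city": "city", "price": "price"},
}

target = []
for src in sources:
    mapping = matchings[src["name"]]
    for row in src["rows"]:
        # Rename matched columns and drop everything else.
        target.append({mapping[c]: v for c, v in row.items() if c in mapping})

print(target)
```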
Article
Full-text available
Data analysis often uses data sets that were collected for different purposes. Indeed, new insights are often obtained by combining data sets that were produced independently of each other, for example by combining data from outside an organization with internal data resources. As a result, there is a need to discover, clean, integrate and restructure data into a form that is suitable for an intended analysis. Data preparation, also known as data wrangling, is the process by which data are transformed from its existing representation into a form that is suitable for analysis. In this paper, we review the state-of-the-art in data preparation, by: (i) describing functionalities that are central to data preparation pipelines, specifically profiling, matching, mapping, format transformation and data repair; and (ii) presenting how these capabilities surface in different approaches to data preparation, that involve programming, writing workflows, interacting with individual data sets as tables, and automating aspects of the process. These functionalities and approaches are illustrated with reference to a running example that combines open government data with web extracted real estate data.
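Profiling, the first functionality listed in the abstract, typically produces per-column statistics that downstream matching and repair steps consume. A minimal sketch, assuming a simple list-of-dicts record format:

```python
# Per-column profiling sketch (assumed record format): the kind of statistics
# that matching and repair steps typically consume.
from collections import Counter

rows = [
    {"postcode": "LS1 4DY", "price": "150000"},
    {"postcode": "YO1 7HH", "price": None},
    {"postcode": "LS1 4DY", "price": "175000"},
]

def profile(rows, column):
    values = [r[column] for r in rows]
    present = [v for v in values if v is not None]
    return {
        "null_ratio": 1 - len(present) / len(values),
        "distinct": len(set(present)),
        "most_common": Counter(present).most_common(1),
    }

for col in ("postcode", "price"):
    print(col, profile(rows, col))
```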
... This graph is used to compute queries that discover join paths with data cleaning operators. An alternative is the VADA [28] architecture, which aims to support the process of data wrangling (i.e., extracting, cleaning and collating datasets). VADA is a knowledge-representation system that, relying on the expressivity of the Datalog± family of languages, provides services for schema matching, schema alignment, data fusion and data quality, among others. ...
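Datalog± programs are evaluated by a dedicated reasoning engine; the toy snippet below only mimics, in Python, the idea of deriving new facts (column matches) from existing facts by applying a rule to a fixpoint. It is not Vadalog/Datalog± syntax, and the facts and rule are invented for illustration.

```python
# Toy forward-chaining illustration of rule-based match derivation.
# Facts: (table, column, normalized_label)
columns = {
    ("listings_a", "addr", "address"),
    ("listings_b", "street_address", "address"),
    ("listings_b", "asking_price", "price"),
}

# Rule: match(T1, C1, T2, C2) :- col(T1, C1, L), col(T2, C2, L), T1 != T2.
matches = set()
changed = True
while changed:  # iterate to a fixpoint
    changed = False
    for (t1, c1, l1) in columns:
        for (t2, c2, l2) in columns:
            if t1 != t2 and l1 == l2 and (t1, c1, t2, c2) not in matches:
                matches.add((t1, c1, t2, c2))
                changed = True

print(sorted(matches))
```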
Article
Full-text available
The ability to cross data from multiple sources represents a competitive advantage for organizations. Yet, the governance of the data lifecycle, from the data sources into valuable insights, is largely performed in an ad-hoc or manual manner. This is particularly concerning in scenarios where tens or hundreds of continuously evolving data sources produce semi-structured data. To overcome this challenge, we develop a framework for operationalizing and automating data governance. For the former, we propose a zoned data lake architecture and a set of data governance processes that allow the systematic ingestion, transformation and integration of data from heterogeneous sources, in order to make them readily available for business users. For the latter, we propose a set of metadata artifacts that allow the automatic execution of data governance processes, addressing a wide range of data management challenges. We showcase the usefulness of the proposed approach using a real-world use case, stemming from a collaborative project with the World Health Organization for the management and analysis of data about Neglected Tropical Diseases. Overall, this work contributes to facilitating organizations' adoption of data-driven strategies through a cohesive framework for operationalizing and automating data governance.
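One possible reading of the metadata-driven automation described above: each source is described by a declarative metadata artifact, and the governance process dispatches on that description. The artifact fields and zone names below are assumptions for illustration, not the paper's actual metadata model.

```python
# Sketch of metadata-driven ingestion with assumed artifact fields.
source_metadata = [
    {"name": "ntd_cases", "format": "json", "landing_zone": "raw/ntd",
     "schedule": "daily"},
    {"name": "facilities", "format": "csv", "landing_zone": "raw/who",
     "schedule": "weekly"},
]

def ingest(meta):
    # Placeholder: a real process would read the source, validate it against
    # the declared format, and write it to the declared landing zone.
    print(f"ingesting {meta['name']} ({meta['format']}) "
          f"into {meta['landing_zone']} on a {meta['schedule']} schedule")

for meta in source_metadata:
    ingest(meta)
```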
... Several works in the literature address the data curation process. Some of them target the sequencing and dynamic orchestration of automatic data preparation and cleansing combined with semi-automated curation steps, as in [2] for social data via a pool of services, and in [4], which is dedicated only to structured data source curation via loosely coupled data preparation components. In the same vein, some architectures and frameworks were proposed, such as KAYAK [8], which lies between users/applications and the file system (i.e., the data storage location) and exposes a set of primitives and tasks for data preparation represented as a Directed Acyclic Graph. ...
Chapter
Data lakehouses are novel data management designs intended to hold disparate batch and streaming data sources in a single data repository. These data sources could be retrieved from different origins, including sensors, social networks, and open data. Because the data held in data lakehouses is heterogeneous and complex, data curation is required to improve its quality. Most existing data curation systems are static, require expert intervention (which can be error-prone and time-consuming), do not meet user expectations, and do not handle real-time data. Given these constraints, we propose a service-based framework for adaptive data curation in data lakehouses that encompasses five modules: data collection, data quality evaluation, data characterization, curation services composition, and data curation. The curation services composition module, which leverages several curation services to curate multi-structured batch and streaming data sources, is the focus of this work. A reinforcement learning-based method is provided for adaptively deriving the curation services composition scheme based on the data source type and the end user's functional and non-functional requirements. The experimental findings validate the proposal's effectiveness and demonstrate that it outperforms the First Visit Monte Carlo and Temporal Learning algorithms in terms of scalability, execution time, and alignment with functional and non-functional requirements. Keywords: Data curation, Service composition, Machine learning, Reinforcement learning, Data lakehouse.
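As a sketch of how reinforcement learning can drive service composition, the toy tabular Q-learning loop below treats a coarse data-quality level as the state and curation services as actions. The states, actions, rewards and hyperparameters are all assumed for illustration and do not reproduce the chapter's formulation.

```python
# Toy tabular Q-learning for choosing curation services (assumed setting).
import random
from collections import defaultdict

actions = ["deduplicate", "impute", "normalize", "stop"]
alpha, gamma, epsilon, episodes = 0.1, 0.9, 0.2, 500
Q = defaultdict(float)  # Q[(state, action)] -> estimated return

def step(state, action):
    # Toy environment: the state is a coarse quality level 0..3; useful
    # services may raise it, and "stop" yields the reached quality as reward.
    if action == "stop":
        return state, float(state), True
    return min(state + random.choice([0, 1]), 3), -0.1, False

for _ in range(episodes):
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        target = reward if done else reward + gamma * max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = next_state

print(max(actions, key=lambda a: Q[(0, a)]))  # preferred first curation service
```

In this toy setting the learned policy simply prefers services that raise the quality level before stopping.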
... A survey by Forbes in 2016 reported that data scientists typically spend 80% of their time preparing data before analysing it [69]. As Opal Card data comes from an official government system used across the whole State of NSW, it has several advantages: it is a structured, complete, and maintained dataset. ...
Article
Full-text available
In an era of smart cities, artificial intelligence and machine learning, data is purported to be the 'new oil', fuelling increasingly complex analytics and assisting us to craft and invent future cities. This paper outlines the role of what we know today as big data in understanding the city and includes a summary of its evolution. Through a critical reflective case study approach, the research examines the application of urban transport big data for informing planning of the city of Sydney. Specifically, transport smart card data, with its diverse constraints, was used to understand mobility patterns through the lens of the 30 min city concept. The paper concludes by offering reflections on the opportunities and challenges of big data and the promise it holds in supporting data-driven approaches to planning future cities.
... In our case, this is the Vadalog system, which combines an expressive, scalable Knowledge Graph engine with good support for the data science workflow. The Vadalog system is Oxford's contribution to VADA [4,5], a joint project of the universities of Edinburgh, Manchester, and Oxford. ...
Article
Following the recent successful examples of large technology companies, many modern enterprises seek to build Knowledge Graphs to provide a unified view of corporate knowledge, and to draw deep insights using machine learning and logical reasoning. There is currently a perceived disconnect between the traditional approaches for data science, typically based on machine learning and statistical modelling, and systems for reasoning with domain knowledge. In this paper, we demonstrate how to perform a broad spectrum of data science tasks in a unified Knowledge Graph environment. This includes data wrangling, complex logical and probabilistic reasoning, and machine learning. We base our work on the state-of-the-art Knowledge Graph Management System Vadalog, which delivers highly expressive and efficient logical reasoning and provides seamless integration with modern data science toolkits such as the Jupyter platform. We argue that this is a significant step forward towards practical, holistic data science workflows that combine machine learning and reasoning.
... "Matcher" is the term used for the algorithm that is used to determine if two entities are similar enough to represent the same real-world entity. It also examines the possibility to combine multiple matches and use training data if necessary (see [25,26]). In general, researchers are aware of the difficulty of detecting duplicates within incomplete data sets [12,18]. ...
Article
Full-text available
Duplicate records are a common problem within data sets, especially in huge-volume databases. The accuracy of duplicate detection determines the efficiency of the duplicate removal process. However, duplicate detection has become more challenging due to the presence of missing values within the records: during the clustering and matching process, missing values can cause records deemed similar to be inserted into the wrong group, leading to undetected duplicates. In this paper, an improvement to duplicate detection in the presence of missing values is proposed through the Duplicate Detection within the Incomplete Data set (DDID) method. The missing values were hypothetically added to the key attributes of three data sets under study, using an arbitrary pattern to simulate both complete and incomplete data sets. The results were analyzed, and the performance of duplicate detection was evaluated using the Hot Deck method to compensate for the missing values in the key attributes. It was hypothesized that by using Hot Deck, duplicate detection performance would be improved. Furthermore, DDID's performance was compared to an earlier duplicate detection method, namely DuDe, in terms of accuracy and speed. The findings showed that, even though the data sets were incomplete, DDID offered better accuracy and faster duplicate detection than DuDe. The results of this study offer insights into the constraints of duplicate detection within incomplete data sets.
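A compact sketch of the Hot Deck idea used above: a record with a missing key attribute borrows the value from a similar, complete "donor" record before clustering and matching. The record format and donor-selection rule are assumed for illustration and do not reproduce DDID's exact procedure.

```python
# Hot Deck imputation sketch (assumed data and donor rule): fill a missing
# key attribute from a donor record that agrees on another attribute, so
# that clustering/matching does not place the record in the wrong group.

records = [
    {"id": 1, "city": "Leeds", "postcode": "LS1 4DY"},
    {"id": 2, "city": "Leeds", "postcode": None},      # missing key attribute
    {"id": 3, "city": "York",  "postcode": "YO1 7HH"},
]

def hot_deck(records, key="postcode", match_on="city"):
    donors = [r for r in records if r[key] is not None]
    for r in records:
        if r[key] is None:
            # donor = a complete record that agrees on the matching attribute
            candidates = [d for d in donors if d[match_on] == r[match_on]]
            if candidates:
                r[key] = candidates[0][key]
    return records

print(hot_deck(records))
```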
... After the first two operations are completed, we transform the tables into an RDF KB and apply reasoning with existentially quantified rules to identify and link entities in different tables. Reasoning with existentially quantified rules is a well-known technology for data integration and wrangling [22]. For our problem, we designed a set of rules that considers the types of columns and string similarities to establish links using the sameAs relation. ...
Preprint
Tables in scientific papers contain a wealth of valuable knowledge for the scientific enterprise. To help the many of us who frequently consult this type of knowledge, we present Tab2Know, a new end-to-end system to build a Knowledge Base (KB) from tables in scientific papers. Tab2Know addresses the challenge of automatically interpreting the tables in papers and of disambiguating the entities that they contain. To solve these problems, we propose a pipeline that employs both statistical-based classifiers and logic-based reasoning. First, our pipeline applies weakly supervised classifiers to recognize the type of tables and columns, with the help of a data labeling system and an ontology specifically designed for our purpose. Then, logic-based reasoning is used to link equivalent entities (via sameAs links) in different tables. An empirical evaluation of our approach using a corpus of papers in the Computer Science domain has returned satisfactory performance. This suggests that ours is a promising step to create a large-scale KB of scientific knowledge.
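A sketch of the kind of sameAs linking described above, assuming column types have already been inferred: two cells from different tables are linked when their column types agree and their labels are sufficiently similar. The data, threshold and rule are invented for illustration and are not Tab2Know's actual rules.

```python
# Sketch of sameAs linking over cells from different tables (assumed data).
from difflib import SequenceMatcher

cells = [
    {"table": "t1", "column_type": "Method",  "label": "BERT-base"},
    {"table": "t2", "column_type": "Method",  "label": "BERT base"},
    {"table": "t3", "column_type": "Dataset", "label": "SQuAD"},
]

def similar(a, b, threshold=0.85):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Link cells from different tables with the same column type and similar labels.
same_as = [
    (x["label"], y["label"])
    for i, x in enumerate(cells)
    for y in cells[i + 1:]
    if x["table"] != y["table"]
    and x["column_type"] == y["column_type"]
    and similar(x["label"], y["label"])
]
print(same_as)  # [('BERT-base', 'BERT base')]
```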
... Alternatively, given that graphs are widely accepted as a convenient data model to represent real-world abstractions and their relationships [3], the community has proposed different solutions grounded in this formalism. However, as the popularity of wrangling systems grows, non-technical users face high entry barriers when interacting with them, as queries must be written in technical languages such as Datalog [4] or SPARQL [5]. Additionally, the vast number of heterogeneous and independent datasets available on the web poses several challenges for contemporary wrangling demands [6]. ...
... Pre: G_ψ is a rewritings graph. Post: V(G_ψ) is a set of source queries.
function COMBINEREWRITINGS(G_ψ)
    while E(G_ψ) ≠ ∅ do
        e ← CHOOSEEDGE(G_ψ)
        ...
Article
Modern data analysis applications require the ability to provide on-demand integration of data sources while offering a flexible and user-friendly query interface. Traditional techniques for answering queries using views, which focus on a rather static setting, fail to address such requirements. To overcome these issues, we propose a fully-fledged data integration approach based on graph-based constructs. The extensibility of graphs allows us to extend the traditional data integration framework with view definitions. Furthermore, we also propose a query language based on subgraphs. We tackle query answering via a query rewriting algorithm based on well-known algorithms for answering queries using views. We experimentally show that the proposed method yields good performance and does not introduce a significant overhead.
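The pseudocode fragment quoted earlier suggests a loop that repeatedly picks an edge of the rewritings graph and combines the two partial rewritings it connects, until no edges remain. The Python sketch below illustrates that shape under assumed data structures (nodes as sets of source relations) and an arbitrary edge-selection policy; it is not the paper's algorithm.

```python
# Sketch of a combine-rewritings loop: pick an edge, merge its endpoints,
# repeat until no edges remain. Data structures and merge rule are assumed.

# Nodes are partial rewritings (here, sets of source relations);
# edges indicate which pairs can be combined.
nodes = {"r1": {"S1"}, "r2": {"S2"}, "r3": {"S3"}}
edges = [("r1", "r2"), ("r2", "r3")]

def choose_edge(edges):
    return edges[0]  # any selection policy would do in this sketch

while edges:
    a, b = choose_edge(edges)
    merged = a + "+" + b
    nodes[merged] = nodes.pop(a) | nodes.pop(b)  # combine the two rewritings
    # Redirect remaining edges to the merged node and drop the used one.
    edges = [(merged if x in (a, b) else x, merged if y in (a, b) else y)
             for (x, y) in edges[1:] if {x, y} != {a, b}]

print(nodes)  # each remaining node is a combined source query
```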
... "Matcher" is the term used for the algorithm that is used to determine if two entities are similar enough to represent the same real-world entity. It also examines the possibility to combine multiple matches and use training data if necessary (for more detail, refer to [25,26]). Researchers are aware of the difficulty of detecting duplicates within incomplete data sets [12,18]. ...
Preprint
Full-text available
Duplicate records are a known problem within data sets, especially within databases of huge volumes. The accuracy of duplicate detection determines the efficiency of the duplicate removal process. Unfortunately, detecting duplicates becomes more challenging due to the presence of missing values within the records. This is because, during the clustering and matching process, missing values can cause records that are similar to be assigned to the wrong group, leaving duplicates undetected. In this paper, we present how duplicate detection can be improved, even when missing values are present within a data set, using our Duplicates Detection within the Incomplete Data set (DDID) method. We hypothetically add the missing values to the key attributes of two data sets under study, using an arbitrary pattern to simulate both complete and incomplete data sets. We analyze the results to evaluate the performance of duplicate detection when the Hot Deck method is used to compensate for the missing values in the key attributes. We hypothesize that using Hot Deck improves duplicate detection performance. The performance of DDID is compared with an earlier duplicate detection method (called DuDe) in terms of accuracy and speed. The findings of the experiment show that, even though the data sets are incomplete, DDID is capable of offering better accuracy and faster duplicate detection than the benchmark method DuDe. The results of this study contribute to duplicate detection under the constraint of incomplete data sets.