Figure - uploaded by Binh Vu
Performance comparison between different classifiers (# labeled cells = 20)


Source publication
Conference Paper
Error detection is one of the most important steps in data cleaning and usually requires extensive human interaction to ensure quality. Existing supervised methods for error detection require a significant amount of training data, while unsupervised methods solve the problem with fixed inductive biases that are usually hard to generalize. In...

Citations

... Similar to outlier detection approaches (see Sect. 6.4), the engineered features are generally designed to highlight the abnormal characteristics of data. A commonly used feature-engineering practice for error detection is to create features that measure how frequent a value is in a dataset (i.e., frequency-based features) (Visengeriyeva and Abedjan 2018; Neutatz et al. 2019; Heidari et al. 2019; Pham et al. 2021). For example, the authors in (Neutatz et al. 2019) measured the TF-IDF score of n-grams inside a cell to encode how common a cell's value is. ...
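The frequency-based idea above can be sketched in a few lines: score each cell by the average inverse document frequency of its character n-grams, so cells built from n-grams rarely seen elsewhere in the column (e.g., typos) stand out. This is a minimal illustration, not the feature set of Neutatz et al. (2019); the function and parameter names are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(value, n=2):
    """Split a cell value into overlapping character n-grams."""
    return [value[i:i + n] for i in range(len(value) - n + 1)]

def ngram_rarity_scores(column, n=2):
    """Score each cell by the average IDF of its character n-grams.
    Common n-grams (shared by many cells) pull the score down; rare
    n-grams (possible corruptions) push it up."""
    # document frequency: in how many cells does each n-gram occur?
    df = Counter()
    for value in column:
        for gram in set(char_ngrams(value, n)):
            df[gram] += 1
    num_cells = len(column)
    scores = []
    for value in column:
        grams = char_ngrams(value, n)
        if not grams:
            scores.append(0.0)
            continue
        idf = [math.log(num_cells / df[g]) for g in grams]
        scores.append(sum(idf) / len(idf))
    return scores
```

On a column like `["USA", "USA", "USA", "UXA"]`, the corrupted cell "UXA" is built entirely from n-grams that occur in only one cell, so it receives the highest score.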
... For example, Neutatz et al. (2019) used the co-occurrence of values among different attributes to facilitate the error detection process. Another category of engineered features focuses on the format of the values (i.e., format-based features) (Visengeriyeva and Abedjan 2018; Neutatz et al. 2019; Heidari et al. 2019; Pham et al. 2021). Corrupted cells might not follow the syntactic format that is expected for a feature. ...
... For example, a value that represents a name should not have numbers. Pham et al. (2021) replace all numbers and characters with symbols that are unique to them, thus focusing only on the shape of the data. For example, given the value "400$", the encoded format could be "nnns", where "n" represents numbers and "s", symbols. ...
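The format-masking step described above can be sketched as a simple character-class mapping. The source only specifies "n" for numbers and "s" for symbols; mapping letters to "a" here is an assumption, and the function name is hypothetical rather than taken from Pham et al. (2021).

```python
def format_mask(value):
    """Reduce a cell value to its 'shape': digits become 'n',
    letters 'a' (assumed class), everything else 's'."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("n")   # number
        elif ch.isalpha():
            out.append("a")   # letter (class name assumed)
        else:
            out.append("s")   # symbol
    return "".join(out)
```

Cells whose mask is rare within a column (e.g., "nnns" among values that otherwise mask to "nnn") are then candidates for format errors.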
Article
Machine Learning (ML) is integrated into a growing number of systems for various applications. Because the performance of an ML model is highly dependent on the quality of the data it has been trained on, there is a growing interest in approaches to detect and repair data errors (i.e., data cleaning). Researchers are also exploring how ML can be used for data cleaning; hence creating a dual relationship between ML and data cleaning. To the best of our knowledge, there is no study that comprehensively reviews this relationship. This paper’s objectives are twofold. First, it aims to summarize the latest approaches for data cleaning for ML and ML for data cleaning. Second, it provides future work recommendations. We conduct a systematic literature review of the papers published between 2016 and 2022 inclusively. We identify different types of data cleaning activities with and for ML: feature cleaning, label cleaning, entity matching, outlier detection, imputation, and holistic data cleaning. We summarize the content of 101 papers covering various data cleaning activities and provide 24 future work recommendations. Our review highlights many promising data cleaning techniques that can be further extended. We believe that our review of the literature will help the community develop better approaches to clean data.
... Table 6). Examples include vertex generation methods [28], [29], branch-and-bound methods [28], [30], [31], and holistic [19], [32] and probabilistic methods [33], [34], [35], [36]. However, one of the most elegant and easy-to-understand methods for solving the error localization problem is the set cover method, which was proposed by Fellegi & Holt for regular edit rules [12]. ...
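The set cover view of error localization mentioned above can be illustrated with a small greedy sketch: each violated edit rule implicates a set of fields, and we look for a small set of fields that touches every violated edit, so that changing only those fields can restore consistency. This is a greedy approximation for illustration, not the exact Fellegi-Holt procedure or the branch-and-bound variants cited; the field names are hypothetical.

```python
def greedy_field_cover(violated_edits):
    """Greedy set cover over fields: repeatedly pick the field that
    appears in the most still-uncovered violated edits, until every
    violated edit involves at least one chosen field.

    violated_edits: list of sets of field names, one per failing edit."""
    uncovered = [set(edit) for edit in violated_edits]
    chosen = []
    while uncovered:
        # count how many uncovered edits each field would resolve
        counts = {}
        for edit in uncovered:
            for field in edit:
                counts[field] = counts.get(field, 0) + 1
        best = max(counts, key=counts.get)
        chosen.append(best)
        uncovered = [edit for edit in uncovered if best not in edit]
    return chosen
```

For example, if the violated edits are `{age, marital_status}` and `{age, income}`, changing the single field `age` suffices to cover both.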
... We chose HoloClean for comparison because it is, to the best of our knowledge, one of the best repair engines in terms of error detection/correction based on data quality rules. Other well-performing, state-of-the-art repair engines, such as Raha/Baran [34], [35] and Spade [36], are not based on data quality rules and are therefore not taken into account. ...
Article
In this paper, we propose and study a type of tuple-level constraints that arises from the selection operator σ of relational algebra and that closely resembles the concept of tuple-level denial constraints. We call this type of constraints selection rules and study their concepts and properties in the setting of data consistency management. The main contribution of this paper is the study of rule implication with selection rules in order to solve the error localization problem by means of the set cover method. It turns out that rule implication can be applied more easily if the representation of selection rules is extended to allow gaps between attribute values. We show that the properties of selection rules allow us to improve the performance of rule implication. Evaluation of our approach compared to HoloClean on four real-world datasets shows promising results. First, repair with selection rules is often faster and consumes less memory than HoloClean, especially when the amount of work that rule implication has to do is limited. Second, in terms of precision and recall of error detection and correction, repair strategies with selection rules almost always outperform HoloClean.
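The denial-constraint flavor of selection rules can be illustrated with a minimal tuple-level check: model a rule as a conjunction of per-attribute predicates, and flag a tuple when all of them hold simultaneously (i.e., the rule forbids that combination). This sketch omits the paper's richer formalism (the gap representation and rule implication); the rule and attribute names are made up for illustration.

```python
def violates(row, rule):
    """Return True if the tuple `row` matches every predicate of the
    selection rule, i.e., exhibits the forbidden value combination."""
    return all(pred(row[attr]) for attr, pred in rule.items())

# Hypothetical rule: no tuple may have marital_status == "married"
# together with age < 16.
no_married_minors = {
    "age": lambda v: v < 16,
    "marital_status": lambda v: v == "married",
}
```

A repair engine would evaluate such rules over every tuple and hand the violated ones to error localization (e.g., the set cover method) to decide which fields to change.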