Figure - available from: Scientific Programming
MapReduce processing framework.

Source publication
Article
Data is an important source for knowledge discovery, but similar duplicate data not only increases database redundancy but also hampers subsequent data mining. Cleaning similar duplicate data helps improve work efficiency. Based on the complexity of the Chinese language and the bottleneck of the single ma...

Similar publications

Article
Poverty is not a taboo topic. It occurs in every country, and policymakers and governments worldwide struggle to reduce their poverty rates. However, existing ways of identifying the right impoverished groups to target with economic aid are often flawed because of multiple issues such as data transparenc...

Citations

Chapter
The number of data features is growing rapidly in the era of big data, posing challenges to both the security and the efficiency of feature queries. Most existing encryption-based retrieval approaches are limited by significant computational overhead and support only exact-match queries, so they may fail to handle incomplete keywords and misspellings in the query. To achieve both query efficiency and privacy preservation for large-scale data, this paper presents EQFF, an Efficient Query Method Using Feature Fingerprints. It converts variable-length features into fingerprints in the form of fixed-length vectors, which hides the semantic information and thereby protects query security. Based on the feature fingerprints, the authors further present corresponding exact and fuzzy query approaches, design an inverted index library, and propose a compressed storage mechanism to improve query efficiency. Extensive experiments are conducted on real-world datasets. The results show that EQFF uses only 6.4% of the memory required by the raw data, reduces the time cost from minutes to tens of milliseconds, and achieves an accuracy above 98%.
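The general idea of fixed-length fingerprints with an inverted index can be illustrated with a minimal Python sketch. This is not the EQFF algorithm: the abstract does not specify how fingerprints are built, so the hashed character trigrams, the 64-bit length, the Jaccard similarity score, and every name below (`fingerprint`, `FingerprintIndex`, `DIM`) are assumptions chosen for illustration only. Plain hashing as used here also offers no real security guarantee.

```python
import hashlib
from collections import defaultdict

DIM = 64  # assumed fingerprint length in bits (not taken from the paper)


def ngrams(text, n=3):
    """Character n-grams of a padded, lower-cased string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def fingerprint(text, dim=DIM):
    """Map a variable-length string to a fixed-length bit vector (stored as an int)."""
    bits = 0
    for gram in ngrams(text):
        digest = hashlib.sha256(gram.encode("utf-8")).digest()
        bits |= 1 << (int.from_bytes(digest[:4], "big") % dim)
    return bits


def set_positions(bits):
    """Indices of the bits that are set in the fingerprint."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


class FingerprintIndex:
    """Hypothetical inverted index: bit position -> ids of items setting that bit."""

    def __init__(self):
        self.fingerprints = {}
        self.postings = defaultdict(set)

    def add(self, item_id, text):
        fp = fingerprint(text)
        self.fingerprints[item_id] = fp
        for pos in set_positions(fp):
            self.postings[pos].add(item_id)

    def exact(self, text):
        """Ids whose fingerprint equals the query fingerprint."""
        fp = fingerprint(text)
        return sorted(i for i, f in self.fingerprints.items() if f == fp)

    def fuzzy(self, text, threshold=0.6):
        """Ids whose fingerprint has Jaccard similarity >= threshold with the query."""
        fp = fingerprint(text)
        candidates = set()
        for pos in set_positions(fp):
            candidates |= self.postings.get(pos, set())
        results = []
        for item_id in candidates:
            other = self.fingerprints[item_id]
            union = bin(fp | other).count("1")
            score = bin(fp & other).count("1") / union if union else 0.0
            if score >= threshold:
                results.append((item_id, score))
        return sorted(results, key=lambda r: -r[1])


index = FingerprintIndex()
index.add(1, "mapreduce")
index.add(2, "poverty")
print(index.exact("mapreduce"))
print(index.fuzzy("mapreduse", threshold=0.3))
```

Only the fingerprints are stored and compared, so the original strings never appear in the index. The inverted index narrows a fuzzy query to items that share at least one set bit with the query before any similarity score is computed. Note that different strings can collide on the same fingerprint, so an exact match here means equal fingerprints, not necessarily equal strings.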