Figure - uploaded by K. K. Mohbey
Uncertain Transactional Dataset.


Source publication
Article
Full-text available
Pattern mining is a fundamental data mining technique for discovering interesting correlations in a data set. There are several variations of pattern mining, such as frequent itemset mining, sequence mining, and high utility itemset mining. High utility itemset mining is an emerging data science task that aims to extract knowledge based on a domain ob...

Contexts in source publication

Context 1
... i_h}, each item i_q has an existential probability (Leung and Jiang, 2014) Pr(i_q, T_p), which reflects the probability that i_q is present in T_p, with 0 < Pr(i_q, T_p) ≤ 1. In Table 6, for example, the existential probability of item 'i_2' in transaction T_3 is 0.5. ...
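To make the expected-support reading of these probabilities concrete, the short Python sketch below computes expected support by summing, over the transactions that contain an itemset, the product of its items' existential probabilities. The toy dataset is invented for illustration (only Pr(i_2, T_3) = 0.5 is taken from the text), and independence of items within a transaction is assumed, as in the usual expected-support model.

```python
# Toy uncertain transactional dataset: each transaction maps an item to its
# existential probability Pr(item, T). All values except Pr(i2, T3) are invented.
uncertain_db = [
    {"i1": 0.9, "i2": 0.7},   # T1
    {"i1": 0.6, "i3": 0.8},   # T2
    {"i2": 0.5, "i3": 1.0},   # T3  (Pr(i2, T3) = 0.5, as in the text)
]

def expected_support(itemset, db):
    """Expected support: for every transaction containing the whole itemset,
    multiply the existential probabilities of its items (independence assumed),
    then sum these products over all transactions."""
    total = 0.0
    for t in db:
        if all(item in t for item in itemset):
            p = 1.0
            for item in itemset:
                p *= t[item]
            total += p
    return total

print(expected_support({"i2"}, uncertain_db))        # 0.7 + 0.5 = 1.2
print(expected_support({"i2", "i3"}, uncertain_db))  # 0.5 * 1.0 = 0.5
```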

Similar publications

Article
Full-text available
Pattern mining is one of the most essential tasks for extracting meaningful and useful information from unprocessed data. The work here aims to extract itemsets that represent homogeneity and consistency in the data. Several techniques have been developed in this regard; the growing interest in data has been a cause of executio...

Citations

... Data mining is the process of finding patterns or information in a given data set using appropriate techniques or methods [7]. Data mining has several functions, including association, classification, clustering, prediction, and estimation [8]. ...
Article
Diabetes mellitus is a chronic disease that affects the way the body regulates sugar (glucose). High blood sugar levels can lead to health complications including heart problems, eye disorders, nerve damage, and kidney and blood vessel disorders. Early detection of diabetes is therefore important and can be supported by data mining technology. Data mining offers various classification models that can be used to detect diabetes, including logistic regression, random forest, and AdaBoost. The comparison of the three algorithms aims to determine which is most appropriate for the classification of diabetes. From the results obtained, the random forest algorithm has the best performance in the classification of diabetes mellitus compared to the other algorithms.
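As a rough sketch of the kind of three-way comparison described in this abstract, the snippet below trains logistic regression, random forest, and AdaBoost with scikit-learn and compares them by cross-validated accuracy. The synthetic dataset, features, and hyperparameters are placeholders, not those used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a diabetes dataset (the study's real data is not shown here).
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
}

# 5-fold cross-validated accuracy as one simple way to compare the three models.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```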
... In this research, the basic concepts of association rules are briefly presented as the starting point for a detailed analysis and their application to the presented real data set. After selecting a suitable algorithm and applying it to find frequent itemsets [4] [12], the last step is to identify and visualize the association rules. In conclusion, the parameters used are support, the likelihood of a rule, and confidence, the degree of trust in a rule, both of which obtained high results [13]. ...
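Since this context leans on the definitions of support and confidence, a short self-contained illustration of how the two measures are computed may help; the transactions and the example rule below are invented and are not the paper's data set.

```python
# Invented transactions; each set holds the items bought together in one basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Support of the whole rule divided by the support of its antecedent."""
    return support(antecedent | consequent) / support(antecedent)

rule_lhs, rule_rhs = {"bread"}, {"milk"}
print("support    =", support(rule_lhs | rule_rhs))    # 3/5 = 0.6
print("confidence =", confidence(rule_lhs, rule_rhs))  # 0.6 / 0.8 = 0.75
```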
Article
Association rule mining is a data mining technique for finding associative rules between combinations of items. This research aims to apply an association rules algorithm to identify popular topping combinations in food orders. The goal is to help restaurant owners or food businesses understand their customers' preferences and optimize their menu offerings. Using data obtained from Kaggle, the association rules algorithm is applied to identify patterns or combinations of toppings that often appear together in orders. The results of this study show chocolate as a popular topping in orders. These findings can provide valuable insights for food business owners in structuring their menus and determining attractive offers for customers. This study also compared the Apriori, FP-Growth, and Eclat algorithms, and the best transaction rule found was a combination of dill & unicorn toppings with chocolate at 60% confidence. Overall, the Eclat algorithm provided the best performance in this study, with higher execution speed, thus providing insight into customer preferences regarding topping combinations in food orders. Despite the limitations of the data used in this study, it is expected to help business owners optimize their offerings, increase customer satisfaction, and improve their business performance.
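Eclat's speed advantage over Apriori and FP-Growth typically comes from its vertical data layout, where each item maps to the set of transaction IDs containing it and supports are obtained by intersecting these tid-sets. The sketch below illustrates that idea on invented orders; it is not the Kaggle data or the paper's implementation.

```python
# Toy orders (invented). The vertical layout maps each topping to the IDs of
# the orders containing it, so support counting becomes a tid-set intersection.
orders = {
    1: {"chocolate", "dill", "unicorn"},
    2: {"chocolate", "dill"},
    3: {"chocolate", "sprinkles"},
    4: {"dill", "unicorn"},
    5: {"chocolate", "unicorn"},
}

# Build the vertical representation: item -> set of order IDs (tid-set).
tidsets = {}
for oid, items in orders.items():
    for item in items:
        tidsets.setdefault(item, set()).add(oid)

def support_count(itemset):
    """Support count of an itemset = size of the intersection of its tid-sets."""
    return len(set.intersection(*(tidsets[i] for i in itemset)))

print(support_count({"dill", "unicorn"}))               # 2 (orders 1 and 4)
print(support_count({"dill", "unicorn", "chocolate"}))  # 1 (order 1)
# Confidence of {dill, unicorn} -> {chocolate} on this toy data: 1 / 2 = 0.5.
```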
... A survey of the distinct approaches to pattern mining in the big data field based on Hadoop and Spark parallel and distributed processing was conducted by Kumar and Mohbey [36]. It studied four types of itemset mining: parallel frequent itemset mining, high-utility itemset mining, sequential pattern mining, and frequent itemset mining in uncertain data (data obtained from sensors or from experimental observations in real-world applications). ...
Article
Full-text available
Data mining is the process used for extracting hidden patterns from large databases using a variety of techniques. For example, in supermarkets, we can discover the items that are often purchased together and that are hidden within the data. This helps make better decisions which improve the business outcomes. One of the techniques used to discover frequent patterns in large databases is frequent itemset mining (FIM), which is a part of association rule mining (ARM). There are different algorithms for mining frequent itemsets. One of the most common algorithms for this purpose is the Apriori algorithm, which deduces association rules between different objects that describe how these objects are related together. It can be used in different application areas like market basket analysis, students' course selection in e-learning platforms, stock management, and medical applications. Nowadays, there is a great explosion of data that will increase the computational time in the Apriori algorithm. Therefore, there is a necessity to run the data-intensive algorithms in a parallel-distributed environment to achieve a convenient performance. In this paper, optimization of the Apriori algorithm using the Spark-based cuckoo filter structure (ASCF) is introduced. ASCF succeeds in removing the candidate generation step from the Apriori algorithm to reduce computational complexity and avoid costly comparisons. It uses the cuckoo filter structure to prune the transactions by reducing the number of items in each transaction. The proposed algorithm is implemented on the Spark in-memory processing distributed environment to reduce processing time. ASCF offers a great improvement in performance over the other candidate algorithms based on Apriori, where it achieves a time of only 5.8% of the state-of-the-art approach on the retail dataset with a minimum support of 0.75%.
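The ASCF optimizations themselves (the cuckoo-filter pruning and the Spark implementation) are not reproduced here, but the plain level-wise Apriori loop they build on can be sketched briefly. The transactions and the minimum-support threshold below are invented.

```python
def apriori(transactions, min_support):
    """Plain level-wise Apriori: join frequent (k-1)-itemsets into k-item
    candidates, then keep the candidates whose support meets the threshold."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # Frequent 1-itemsets.
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support}
    frequent = set(current)
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Count step: keep candidates whose support reaches min_support.
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support}
        frequent |= current
        k += 1
    return frequent

print(apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], 0.5))
```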
... By mining book borrowing records with association rules and setting weights for the mined rules, they demonstrated significant improvements in recommendation outcomes. Reference [26] applied an improved FP-growth algorithm within university library recommendation systems to mine frequent product sets between customers and books, allowing for tailored book recommendations across various fields. Reference [27] focused on the specificity of university library book recommendation services, considering the hierarchical relationship between book attributes and temporal factors in mining association rules, aiming for targeted recommendations for customers of different majors and grades. ...
Article
Full-text available
Book recommendations are crucial in digital library transformation, enhancing service sophistication and customization. They allow readers to access books tailored to their specific interests. In this paper, we propose a novel heterogeneous network embedding approach for personalized book recommendations. Our model integrates both assessment and representation data within fields. Additionally, it uses a neural network architecture to refine traditional cross-field matrix factorization. By incorporating a nonlinear mapping function, our approach captures field disparities. Furthermore, it also embeds product attribute representations into cross-field recommendations as heterogeneous network embeddings. Consequently, it effectively exploits comprehensive representation data across fields, enhancing book recommendations. The experimental results show that our method achieves RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) metrics of no higher than 0.767 and 0.605, respectively. These metrics apply across various training set proportions and cold-start customer ratios in both general and customer cold-start scenarios. Compared to other advanced methods, our improvements in RMSE and MAE are not less than 1.01% and 1.13%, respectively. These findings confirm the superiority and robustness of our model in boosting recommendation performance and addressing cold-start issues effectively.
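For readers unfamiliar with the two reported metrics, the small self-contained snippet below shows how RMSE and MAE are computed from predicted and true ratings; the numbers are invented and are not taken from the experiments.

```python
import math

# Invented ratings: true values and a model's predictions for five books.
y_true = [4.0, 3.5, 5.0, 2.0, 4.5]
y_pred = [3.8, 3.9, 4.6, 2.4, 4.4]

n = len(y_true)
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n                # mean absolute error
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)  # root mean squared error

print(f"MAE  = {mae:.3f}")
print(f"RMSE = {rmse:.3f}")
```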
... The intensity of the problem is rising because decades of developments in information systems have resulted in the fast growth of transaction data, which is posing tremendous challenges to the exact FIM algorithms [16]. More efficient parallel algorithms were developed [17,18], but their scalability is limited by the size of the shared memory. The distributed algorithms, based on the MapReduce framework [19], are more scalable, but they also suffer from frequent I/O operations and communication overheads. ...
Article
Full-text available
Frequent itemset mining (FIM) is a highly resource-demanding data-mining task fundamental to numerous data-mining applications. Support calculation is a frequently performed computation-intensive operation of FIM algorithms, whereas storing transactional data is memory-intensive. FIM is even more resource-hungry for dense data than for sparse data. The rapidly growing size of datasets further exacerbates this situation and necessitates the design of out-of-the-box highly efficient solutions. This paper proposes a novel approach to frequent itemset mining for dense datasets. This approach, after the initial stage, does not use transactional data, which makes it memory efficient. It also replaces processing-intensive support calculations with efficient support predictions, which are probabilistic and need no transactional data. To predict the support of an itemset, it only needs the support of its subsets. However, this technique works only for itemsets of size three or higher. We also propose an FIM algorithm, ProbBF, which incorporates this technique. ProbBF discards transactional data after it uses it to calculate frequent one- and two-size itemsets. For itemsets of size k, where k ≥ 3, ProbBF uses the proposed probabilistic technique to predict their support. An itemset is considered frequent if its predicted support is greater than a given threshold. Our experiments show that ProbBF is efficient in both time and space against state-of-the-art FIM algorithms that use transactional data. The experiments also show that ProbBF can successfully generate the majority of the frequent itemsets on real-world datasets. Since ProbBF is probabilistic, some loss in quality is inevitable.
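The exact ProbBF estimator is not spelled out in this abstract, but the idea of predicting a 3-itemset's support from the supports of its subsets can be illustrated with a simple Kirkwood-style approximation. This is only one possible estimator under an independence-style assumption, not necessarily the one ProbBF uses, and the support values below are invented.

```python
def estimate_support_3(s_ab, s_ac, s_bc, s_a, s_b, s_c):
    """Kirkwood-style estimate of the support of {A, B, C} from the supports of
    its pairwise and singleton subsets (supports given as fractions of transactions).
    Illustrative only; not necessarily the estimator used by ProbBF."""
    return (s_ab * s_ac * s_bc) / (s_a * s_b * s_c)

# Invented subset supports.
s_a, s_b, s_c = 0.6, 0.5, 0.4
s_ab, s_ac, s_bc = 0.35, 0.30, 0.25

est = estimate_support_3(s_ab, s_ac, s_bc, s_a, s_b, s_c)
min_support = 0.2
print(f"estimated support of {{A, B, C}} = {est:.3f}")     # 0.219
print("frequent" if est >= min_support else "infrequent")  # frequent
```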
... In the age of big data, the manufacturing sector generates enormous amounts of data, much of which has an ultra-high dimension [34]. How to handle these ultra-high-dimension data, unlock their potential, and create a data flow model appropriate for the current manufacturing environment is a difficult topic [35]. ...
Article
In recent years, the fields of big data and machine learning have gained significant attention for their potential to revolutionize decision-making processes. The vast amounts of data generated by various sources can provide valuable insights to inform decisions across a range of domains, from business and finance to healthcare and social policy. Machine learning algorithms enable computers to learn from data and improve their performance over time, thereby enhancing their ability to make predictions and identify patterns. This article provides a comprehensive overview of how big data and machine learning can improve decision-making processes between 2017 and 2022. It covers key concepts and techniques involved in these tools, including data collection, data preprocessing, feature selection, model training, and evaluation. The article also discusses the potential benefits and limitations of these tools and explores the ethical and privacy concerns associated with their use. In particular, it highlights the need for transparency and fairness in decision-making algorithms and the importance of protecting individuals' privacy rights. The review concludes by highlighting future research opportunities and challenges in this rapidly evolving field, including the need for more robust and interpretable models, as well as the integration of human decision making with machine learning algorithms. Ultimately, this review aims to provide insights for researchers and practitioners seeking to leverage big data and machine learning to improve decision-making processes in various domains.
... HUPM has applications in a variety of industries, including marketing, click-stream analysis, biomedical technologies, and gene control [Kumar et al. 2022]. To address the combinatorial explosion problem, researchers have proposed methods like the two-phase Apriori-based approach for discovering High Utility itemsets (HUIMs) across different dataset scans [Wu et al. 2013]. ...
Article
Full-text available
High utility pattern mining is an analytical approach used to identify sets of items that exceed a specific threshold of utility values. Unlike traditional frequency-based analysis, this method considers user-specific factors such as the number of units purchased and their benefits. In recent years, the importance of making informed decisions based on utility patterns has grown significantly. While several utility-based frequent pattern extraction techniques have been proposed, they often face limitations in handling large datasets. To address this challenge, we propose an optimized method for improving the efficiency of Distributed Utility Itemset Mining for big data (IDUIM). This technique improves upon the Distributed Utility Itemset Mining (DUIM) algorithm by incorporating various refinements. IDUIM effectively mines itemsets of big datasets and provides useful insights as the basis for information management and nearly real-time decision-making systems. In an experimental investigation, the method is compared with other state-of-the-art algorithms such as DUIM, PHUI-Miner, and EFIM-Par. The results demonstrate that the IDUIM algorithm is more efficient and performs better than the other cutting-edge algorithms. Keywords: high utility itemsets, itemset mining, high utility pattern, parallel computing, big data.
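As background for the utility values mentioned above, the sketch below shows how an itemset's utility is typically computed in high utility itemset mining: for every transaction containing the whole itemset, sum quantity times unit profit over its items, then add the per-transaction utilities. The profits, quantities, and threshold are invented, and this is not the IDUIM or DUIM implementation.

```python
# Toy data (invented): unit profits and per-transaction purchase quantities.
unit_profit = {"a": 5, "b": 2, "c": 1}
transactions = [
    {"a": 2, "b": 1},          # T1: item -> quantity purchased
    {"b": 4, "c": 3},          # T2
    {"a": 1, "b": 2, "c": 5},  # T3
]

def utility(itemset, db, profit):
    """Total utility of an itemset: in every transaction containing all of its
    items, sum quantity * unit profit of those items, then add across transactions."""
    total = 0
    for t in db:
        if all(i in t for i in itemset):
            total += sum(t[i] * profit[i] for i in itemset)
    return total

min_util = 18  # invented utility threshold
for itemset in ({"a"}, {"a", "b"}, {"b", "c"}):
    u = utility(itemset, transactions, unit_profit)
    print(sorted(itemset), u, "high utility" if u >= min_util else "low utility")
```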
... However, RNNs have some problems, such as gradient explosion, and they cannot always converge to the optimal solution [22], [56]. Deep learning has been applied to stock price forecasting [23], [57,61]; when a comparison is made among machine learning, neural network, and deep learning models, deep learning shows more certainty, stronger explanation ability, and a more vigorous learning ability to adapt to new problems [24], [58][59][60][61][62][63]. ...
Article
Full-text available
Mining frequent patterns from voluminous datasets termed 'big data' and having inherent uncertainties poses a significant challenge. Minor changes carried out on the databases, such as addition, deletion, or modification of items, should not lead to scanning the whole database. Besides, a number of algorithms proposed to handle these issues are effective, but their mathematical basis and manner of deployment are complex. Keeping the above points in mind, we have proposed an approach which innovatively combines the Light Gradient Boosting Machine (LightGBM) and Long Short-Term Memory (LSTM) models serially to improve prediction accuracy. Here, LightGBM brings its tree-based learning algorithms optimized for speed and performance, while LSTM contributes its advanced sequence modeling capabilities, effectively resolving the vanishing gradient dilemma that often plagues recurrent networks. Our approach is applied to the healthcare sector in general and particularly to the early detection of breast cancer from a dataset obtained from Kaggle, yielding outstanding results as is evident from the scores: precision rates of 0.92 for predicted negatives and 0.93 for predicted positives, recall rates of 0.96 for negatives and 0.88 for positives, alongside F1-scores of 0.94 and 0.90, respectively. With a comprehensive accuracy of 0.93 across 188 samples, our model demonstrates a remarkable potential for early medical diagnosis, outperforming existing single-model solutions. The robustness of our approach is further validated by the consistency of performance across various metrics, highlighting its suitability for deployment in high-stakes domains where predictive accuracy is paramount.
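To make the reported scores easier to interpret, the snippet below recomputes precision, recall, and F1 from confusion-matrix counts. The counts are invented, chosen only to land near the positive-class figures quoted above, and do not come from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented counts for the positive class of a binary screening task.
tp, fp, fn = 88, 7, 12
p, r, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision = {p:.2f}, recall = {r:.2f}, F1 = {f1:.2f}")  # ~0.93, 0.88, 0.90
```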
... Asbern and Asha [16] explored different algorithms for FIM that operate on big data using the MapReduce paradigm. Kumar and Mohbey [17] investigated different parallel FIM algorithms that are executed in distributed environments. Different issues they identified in such algorithms include scalability, privacy, complex data types, load balancing, and gene regulation patterns. ...
Article
Full-text available
Due to the rapid growth of data from different sources in organizations, data that cannot be handled by traditional tools and techniques is known as big data and must be processed in a scalable fashion. Many existing frequent itemset mining algorithms have good performance but suffer from scalability problems, as they cannot exploit parallel processing power available locally or in cloud infrastructure. Since the big data and cloud ecosystem overcomes these barriers or limitations in computing resources, it is a natural choice to use distributed programming paradigms such as MapReduce. In this paper, we propose a novel algorithm known as Nodeset-based Fast and Scalable Frequent Itemset Mining (FSFIM) to extract frequent itemsets from big data. Here, a Pre-Order Coding (POC) tree is used to represent data and improve processing speed. The Nodeset is the underlying data structure, which is efficient in discovering frequent itemsets. FSFIM is found to be faster and more scalable in mining frequent itemsets. When compared with its predecessors such as Node-lists and N-lists, Nodesets save half of the memory as they need only either pre-order or post-order coding. Cloudera's Distribution of Hadoop (CDH), a MapReduce framework, is used for the empirical study. A prototype application is built to evaluate the performance of FSFIM. Experimental results revealed that FSFIM outperforms existing algorithms such as Mahout PFP, MLlib PFP, and BigFIM. FSFIM is more scalable and found to be an ideal candidate for real-time applications that mine frequent itemsets from big data.
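The Nodeset and POC-tree machinery of FSFIM is too involved to reproduce here, but the MapReduce-style counting pass that such distributed miners typically start from can be sketched with PySpark (a Spark runtime is assumed to be available; the study itself used Hadoop MapReduce on CDH). The transactions and threshold are invented, and this illustrates the paradigm, not the FSFIM algorithm.

```python
from pyspark import SparkContext

# A minimal, illustrative MapReduce-style pass: count item occurrences across
# transactions and keep the frequent 1-itemsets.
sc = SparkContext(appName="frequent-items-sketch")

transactions = sc.parallelize([
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c"],
])
min_count = 2

frequent_items = (
    transactions
    .flatMap(lambda t: [(item, 1) for item in t])  # map phase: emit (item, 1)
    .reduceByKey(lambda x, y: x + y)               # reduce phase: sum the counts
    .filter(lambda kv: kv[1] >= min_count)         # keep items meeting the threshold
    .collect()
)
print(frequent_items)  # e.g. [('a', 3), ('b', 3), ('c', 3)]
sc.stop()
```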
... Thus, CEP enables organizations to exploit the potential of real-time event data, permitting proactive decision-making, situational awareness, and intelligent automation [15,16]. Organizations may gain a competitive advantage, increase operational efficiency, and enhance customer experiences by fast processing and analyzing events, recognizing trends, and triggering appropriate responses [17,18]. ...
Article
Full-text available
CEP is a widely used technique for the reliable recognition of arbitrarily complex patterns in enormous data streams with great performance in real time. Real-time detection of crucial events and rapid response to them are the key goals of complex event processing. The performance of event processing systems can be improved by parallelizing CEP evaluation procedures. Utilizing CEP in parallel while deploying a multi-core or distributed environment is one of the most popular and widely recognized approaches to accomplish this goal. This paper demonstrates the ability to use an unusual parallelization strategy to effectively process complicated events over streams of data. This method depends on a dual-tier hybrid paradigm that combines several parallelism levels. Thread-level or task-level parallelism (TLP) and data-level parallelism (DLP) were combined in this research. Many threads or instruction sequences from the same application can run concurrently under the TLP paradigm. In the DLP paradigm, instructions from a single stream operate on several data streams at the same time. In our suggested model, there are four major stages: data mining, pre-processing, load shedding, and optimization. The first phase is online data mining, after which the data is materialized into a publicly available solution that combines a CEP engine with a library. Next, data pre-processing encompasses the efficient adaptation of the content or format of raw data from many, possibly diverse sources. Finally, parallelization approaches have been created to reduce CEP processing time. By providing these two types of parallelism, our proposed solution combines the benefits of DLP and TLP while addressing their constraints. The JAVA tool will be used to assess the suggested technique. The performance of the suggested technique is compared to that of other current approaches to determine the efficacy and efficiency of the proposed algorithm.
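A toy sketch of combining the two parallelism tiers described above: task-level parallelism (TLP) runs different pattern detectors concurrently, while data-level parallelism (DLP) splits the event stream into chunks for each detector. The events, detectors, and thread counts are invented, and Python threads are used only to show the structure; a real CEP engine would run on a multi-core or distributed runtime rather than a single interpreter.

```python
from concurrent.futures import ThreadPoolExecutor

# Fake event stream (invented): 1000 readings from three sensors.
events = [{"sensor": i % 3, "value": i} for i in range(1000)]

def detect_high(chunk):
    """Toy pattern 1: events whose value exceeds a threshold."""
    return [e for e in chunk if e["value"] > 900]

def detect_sensor0(chunk):
    """Toy pattern 2: events coming from sensor 0."""
    return [e for e in chunk if e["sensor"] == 0]

def run_detector(detector, stream, n_parts=4):
    """Data-level parallelism: split the stream into chunks, apply the same
    detector to each chunk concurrently, then merge the partial results."""
    chunks = [stream[i::n_parts] for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        parts = pool.map(detector, chunks)
    return [e for part in parts for e in part]

# Task-level parallelism: run the two different detectors concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    high, sensor0 = pool.map(lambda d: run_detector(d, events),
                             [detect_high, detect_sensor0])

print(len(high), len(sensor0))  # 99 and 334 matches on this fake stream
```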