Figure - uploaded by K. K. Mohbey
Uncertain Transactional Dataset.


Source publication
Article
Full-text available
Pattern mining is a fundamental data mining technique for discovering interesting correlations in a data set. There are several variations of pattern mining, such as frequent itemset mining, sequence mining, and high utility itemset mining. High utility itemset mining is an emerging data science task that aims to extract knowledge based on a domain ob...

Contexts in source publication

Context 1
... i_h}, each item i_q has an existential probability (Leung and Jiang, 2014) Pr(i_q, T_p), which reflects the probability that i_q is present in T_p, with 0 < Pr(i_q, T_p) ≤ 1. In Table 6, for example, the existential probability of item 'i_2' in transaction T_3 is 0.5. ...
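To make the expected-support reading of these probabilities concrete, the short Python sketch below computes expected support by summing, over the transactions that contain an itemset, the product of its items' existential probabilities. The toy dataset is invented for illustration (only Pr(i_2, T_3) = 0.5 is taken from the text), and independence of items within a transaction is assumed, as in the usual expected-support model.

```python
# Toy uncertain transactional dataset: each transaction maps an item to its
# existential probability Pr(item, T). All values except Pr(i2, T3) are invented.
uncertain_db = [
    {"i1": 0.9, "i2": 0.7},   # T1
    {"i1": 0.6, "i3": 0.8},   # T2
    {"i2": 0.5, "i3": 1.0},   # T3  (Pr(i2, T3) = 0.5, as in the text)
]

def expected_support(itemset, db):
    """Expected support: for every transaction containing the whole itemset,
    multiply the existential probabilities of its items (independence assumed),
    then sum these products over all transactions."""
    total = 0.0
    for t in db:
        if all(item in t for item in itemset):
            p = 1.0
            for item in itemset:
                p *= t[item]
            total += p
    return total

print(expected_support({"i2"}, uncertain_db))        # 0.7 + 0.5 = 1.2
print(expected_support({"i2", "i3"}, uncertain_db))  # 0.5 * 1.0 = 0.5
```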

Similar publications

Article
Full-text available
Pattern mining is one of the most essential tasks for extracting meaningful and useful information from unprocessed data. The work here aims to extract itemsets that represent homogeneity and consistency in the data. Several techniques have been developed in this regard; the growing interest in data has been a cause of executio...

Citations

... Data mining is the process of finding patterns or information in a given data set using appropriate techniques or methods [7]. Data mining has several functions, including association, classification, clustering, prediction, and estimation [8]. ...
Article
Diabetes mellitus is a chronic disease that affects the way the body regulates sugar (glucose). High blood sugar levels can lead to health complications including heart problems, eye disorders, nerve damage, and kidney and blood vessel disorders. Early detection of diabetes is therefore important and can be supported by data mining technology. Data mining offers various classification models that can be used to detect diabetes, including logistic regression, random forest, and AdaBoost. The comparison of the three algorithms aims to determine which is most appropriate for the classification of diabetes. From the results obtained, the random forest algorithm has the best performance in the classification of diabetes mellitus compared to the other algorithms.
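As a rough sketch of the kind of three-way comparison described in this abstract, the snippet below trains logistic regression, random forest, and AdaBoost with scikit-learn and compares them by cross-validated accuracy. The synthetic dataset, features, and hyperparameters are placeholders, not those used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a diabetes dataset (the study's real data is not shown here).
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
}

# 5-fold cross-validated accuracy as one simple way to compare the three models.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```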
... In this research, the basic concepts of association rules are briefly presented as the starting point for a detailed analysis and their application to the presented real data set. After selecting a suitable algorithm and applying it to find frequent itemsets [4] [12], the last step is to identify and visualize the association rules. In conclusion, the parameters used are support, the likelihood of a rule, and confidence, the degree of trust in a rule, both of which obtained high results [13]. ...
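Since this context leans on the definitions of support and confidence, a short self-contained illustration of how the two measures are computed may help; the transactions and the example rule below are invented and are not the paper's data set.

```python
# Invented transactions; each set holds the items bought together in one basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Support of the whole rule divided by the support of its antecedent."""
    return support(antecedent | consequent) / support(antecedent)

rule_lhs, rule_rhs = {"bread"}, {"milk"}
print("support    =", support(rule_lhs | rule_rhs))    # 3/5 = 0.6
print("confidence =", confidence(rule_lhs, rule_rhs))  # 0.6 / 0.8 = 0.75
```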
Article
Association rule mining is a data mining technique for finding associative rules between combinations of items. This research aims to apply an association rules algorithm to identify popular topping combinations in food orders. The goal is to help restaurant owners or food businesses understand their customers' preferences and optimize their menu offerings. Using data obtained from Kaggle, the association rules algorithm is applied to identify patterns or combinations of toppings that often appear together in orders. The results of this study show chocolate as a popular topping in orders. These findings can provide valuable insights for food business owners in structuring their menus and determining attractive offers for customers. This study also compared the Apriori, FP-Growth, and Eclat algorithms, and the best transaction rule found was a combination of dill & unicorn toppings with chocolate at 60% confidence. Overall, the Eclat algorithm provided the best performance in this study, with higher execution speed, thus providing insight into customer preferences regarding topping combinations in food orders. Despite the limitations of the data used in this study, it is expected to help business owners optimize their offerings, increase customer satisfaction, and improve their business performance.
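Eclat's speed advantage over Apriori and FP-Growth typically comes from its vertical data layout, where each item maps to the set of transaction IDs containing it and supports are obtained by intersecting these tid-sets. The sketch below illustrates that idea on invented orders; it is not the Kaggle data or the paper's implementation.

```python
# Toy orders (invented). The vertical layout maps each topping to the IDs of
# the orders containing it, so support counting becomes a tid-set intersection.
orders = {
    1: {"chocolate", "dill", "unicorn"},
    2: {"chocolate", "dill"},
    3: {"chocolate", "sprinkles"},
    4: {"dill", "unicorn"},
    5: {"chocolate", "unicorn"},
}

# Build the vertical representation: item -> set of order IDs (tid-set).
tidsets = {}
for oid, items in orders.items():
    for item in items:
        tidsets.setdefault(item, set()).add(oid)

def support_count(itemset):
    """Support count of an itemset = size of the intersection of its tid-sets."""
    return len(set.intersection(*(tidsets[i] for i in itemset)))

print(support_count({"dill", "unicorn"}))               # 2 (orders 1 and 4)
print(support_count({"dill", "unicorn", "chocolate"}))  # 1 (order 1)
# Confidence of {dill, unicorn} -> {chocolate} on this toy data: 1 / 2 = 0.5.
```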
... A survey of the distinct approaches to pattern mining in the big data field based on Hadoop and Spark parallel and distributed processing was conducted by Kumar and Mohbey [36]. It studied four types of itemset mining: parallel frequent itemset mining, high-utility itemset mining, sequential pattern mining, and frequent itemset mining in uncertain data (data obtained from sensors or from experimental observations in real-world applications). ...
Article
Full-text available
Data mining is the process used for extracting hidden patterns from large databases using a variety of techniques. For example, in supermarkets, we can discover the items that are often purchased together and that are hidden within the data. This helps make better decisions which improve the business outcomes. One of the techniques used to discover frequent patterns in large databases is frequent itemset mining (FIM), which is a part of association rule mining (ARM). There are different algorithms for mining frequent itemsets. One of the most common algorithms for this purpose is the Apriori algorithm, which deduces association rules between different objects that describe how these objects are related together. It can be used in different application areas like market basket analysis, students' course selection in e-learning platforms, stock management, and medical applications. Nowadays, there is a great explosion of data that will increase the computational time in the Apriori algorithm. Therefore, there is a necessity to run the data-intensive algorithms in a parallel-distributed environment to achieve a convenient performance. In this paper, optimization of the Apriori algorithm using the Spark-based cuckoo filter structure (ASCF) is introduced. ASCF succeeds in removing the candidate generation step from the Apriori algorithm to reduce computational complexity and avoid costly comparisons. It uses the cuckoo filter structure to prune the transactions by reducing the number of items in each transaction. The proposed algorithm is implemented on the Spark in-memory processing distributed environment to reduce processing time. ASCF offers a great improvement in performance over the other candidate algorithms based on Apriori, where it achieves a time of only 5.8% of the state-of-the-art approach on the retail dataset with a minimum support of 0.75%.
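The ASCF optimizations themselves (the cuckoo-filter pruning and the Spark implementation) are not reproduced here, but the plain level-wise Apriori loop they build on can be sketched briefly. The transactions and the minimum-support threshold below are invented.

```python
def apriori(transactions, min_support):
    """Plain level-wise Apriori: join frequent (k-1)-itemsets into k-item
    candidates, then keep the candidates whose support meets the threshold."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # Frequent 1-itemsets.
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support}
    frequent = set(current)
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Count step: keep candidates whose support reaches min_support.
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support}
        frequent |= current
        k += 1
    return frequent

print(apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], 0.5))
```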
... By mining book borrowing records with association rules and setting weights for the mined rules, they demonstrated significant improvements in recommendation outcomes. Reference [26] applied an improved FP-growth algorithm within university library recommendation systems to mine frequent product sets between customers and books, allowing for tailored book recommendations across various fields. Reference [27] focused on the specificity of university library book recommendation services, considering the hierarchical relationship between book attributes and temporal factors in mining association rules, aiming for targeted recommendations for customers of different majors and grades. ...
Article
Full-text available
Book recommendations are crucial in digital library transformation, enhancing service sophistication and customization. They allow readers to access books tailored to their specific interests. In this paper, we propose a novel heterogeneous network embedding approach for personalized book recommendations. Our model integrates both assessment and representation data within fields. Additionally, it uses a neural network architecture to refine traditional cross-field matrix factorization. By incorporating a nonlinear mapping function, our approach captures field disparities. Furthermore, it also embeds product attribute representations into cross-field recommendations as heterogeneous network embeddings. Consequently, it effectively exploits comprehensive representation data across fields, enhancing book recommendations. The experimental results show that our method achieves RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) metrics of no higher than 0.767 and 0.605, respectively. These metrics apply across various training set proportions and cold-start customer ratios in both general and customer cold-start scenarios. Compared to other advanced methods, our improvements in RMSE and MAE are not less than 1.01% and 1.13%, respectively. These findings confirm the superiority and robustness of our model in boosting recommendation performance and addressing cold-start issues effectively.
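For readers unfamiliar with the two reported metrics, the small self-contained snippet below shows how RMSE and MAE are computed from predicted and true ratings; the numbers are invented and are not taken from the experiments.

```python
import math

# Invented ratings: true values and a model's predictions for five books.
y_true = [4.0, 3.5, 5.0, 2.0, 4.5]
y_pred = [3.8, 3.9, 4.6, 2.4, 4.4]

n = len(y_true)
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n                # mean absolute error
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)  # root mean squared error

print(f"MAE  = {mae:.3f}")
print(f"RMSE = {rmse:.3f}")
```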
... The intensity of the problem is rising because decades of developments in information systems have resulted in the fast growth of transaction data, which is posing tremendous challenges to the exact FIM algorithms [16]. More efficient parallel algorithms were developed [17,18], but their scalability is limited by the size of the shared memory. The distributed algorithms, based on the MapReduce framework [19], are more scalable, but they also suffer from frequent I/O operations and communication overheads. ...
Article
Full-text available
Frequent itemset mining (FIM) is a highly resource-demanding data-mining task fundamental to numerous data-mining applications. Support calculation is a frequently performed computation-intensive operation of FIM algorithms, whereas storing transactional data is memory-intensive. FIM is even more resource-hungry for dense data than for sparse data. The rapidly growing size of datasets further exacerbates this situation and necessitates the design of out-of-the-box highly efficient solutions. This paper proposes a novel approach to frequent itemset mining for dense datasets. This approach, after the initial stage, does not use transactional data, which makes it memory efficient. It also replaces processing-intensive support calculations with efficient support predictions, which are probabilistic and need no transactional data. To predict the support of an itemset, it only needs the support of its subsets. However, this technique works only for itemsets of size three or higher. We also propose an FIM algorithm, ProbBF, which incorporates this technique. ProbBF discards transactional data after it uses it to calculate frequent one- and two-size itemsets. For itemsets of size k, where k ≥ 3, ProbBF uses the proposed probabilistic technique to predict their support. An itemset is considered frequent if its predicted support is greater than a given threshold. Our experiments show that ProbBF is efficient in both time and space against state-of-the-art FIM algorithms that use transactional data. The experiments also show that ProbBF can successfully generate the majority of the frequent itemsets on real-world datasets. Since ProbBF is probabilistic, some loss in quality is inevitable.
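The exact ProbBF estimator is not spelled out in this abstract, but the idea of predicting a 3-itemset's support from the supports of its subsets can be illustrated with a simple Kirkwood-style approximation. This is only one possible estimator under an independence-style assumption, not necessarily the one ProbBF uses, and the support values below are invented.

```python
def estimate_support_3(s_ab, s_ac, s_bc, s_a, s_b, s_c):
    """Kirkwood-style estimate of the support of {A, B, C} from the supports of
    its pairwise and singleton subsets (supports given as fractions of transactions).
    Illustrative only; not necessarily the estimator used by ProbBF."""
    return (s_ab * s_ac * s_bc) / (s_a * s_b * s_c)

# Invented subset supports.
s_a, s_b, s_c = 0.6, 0.5, 0.4
s_ab, s_ac, s_bc = 0.35, 0.30, 0.25

est = estimate_support_3(s_ab, s_ac, s_bc, s_a, s_b, s_c)
min_support = 0.2
print(f"estimated support of {{A, B, C}} = {est:.3f}")     # 0.219
print("frequent" if est >= min_support else "infrequent")  # frequent
```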
... In the age of big data, the manufacturing sector generates enormous amounts of data, much of which has an ultra-high dimension [34]. How to handle these ultra-high-dimension data, unlock their potential, and create a data flow model appropriate for the current manufacturing environment is a difficult topic [35]. ...
Article
In recent years, the fields of big data and machine learning have gained significant attention for their potential to revolutionize decision-making processes. The vast amounts of data generated by various sources can provide valuable insights to inform decisions across a range of domains, from business and finance to healthcare and social policy. Machine learning algorithms enable computers to learn from data and improve their performance over time, thereby enhancing their ability to make predictions and identify patterns. This article provides a comprehensive overview of how big data and machine learning can improve decision-making processes between 2017 and 2022. It covers key concepts and techniques involved in these tools, including data collection, data preprocessing, feature selection, model training, and evaluation. The article also discusses the potential benefits and limitations of these tools and explores the ethical and privacy concerns associated with their use. In particular, it highlights the need for transparency and fairness in decision-making algorithms and the importance of protecting individuals' privacy rights. The review concludes by highlighting future research opportunities and challenges in this rapidly evolving field, including the need for more robust and interpretable models, as well as the integration of human decision making with machine learning algorithms. Ultimately, this review aims to provide insights for researchers and practitioners seeking to leverage big data and machine learning to improve decision-making processes in various domains.
... HUPM has applications in a variety of industries, including marketing, click-stream analysis, biomedical technologies, and gene control [Kumar et al. 2022]. To address the combinatorial explosion problem, researchers have proposed methods like the two-phase Apriori-based approach for discovering High Utility itemsets (HUIMs) across different dataset scans [Wu et al. 2013]. ...
Article
Full-text available
High utility pattern mining is an analytical approach used to identify sets of items that exceed a specific threshold of utility values. Unlike traditional frequency-based analysis, this method considers user-specific factors such as the number of units purchased and their benefits. In recent years, the importance of making informed decisions based on utility patterns has grown significantly. While several utility-based frequent pattern extraction techniques have been proposed, they often face limitations in handling large datasets. To address this challenge, we propose an optimized method for improving the efficiency of Distributed Utility Itemset Mining for big data (IDUIM). This technique improves upon the Distributed Utility Itemset Mining (DUIM) algorithm by incorporating various refinements. IDUIM effectively mines itemsets of big datasets and provides useful insights as the basis for information management and nearly real-time decision-making systems. In an experimental investigation, the method is compared with other state-of-the-art algorithms such as DUIM, PHUI-Miner, and EFIM-Par. The results demonstrate that the IDUIM algorithm is more efficient and performs better than the other cutting-edge algorithms. Keywords: high utility itemsets, itemset mining, high utility pattern, parallel computing, big data.
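As background for the utility values mentioned above, the sketch below shows how an itemset's utility is typically computed in high utility itemset mining: for every transaction containing the whole itemset, sum quantity times unit profit over its items, then add the per-transaction utilities. The profits, quantities, and threshold are invented, and this is not the IDUIM or DUIM implementation.

```python
# Toy data (invented): unit profits and per-transaction purchase quantities.
unit_profit = {"a": 5, "b": 2, "c": 1}
transactions = [
    {"a": 2, "b": 1},          # T1: item -> quantity purchased
    {"b": 4, "c": 3},          # T2
    {"a": 1, "b": 2, "c": 5},  # T3
]

def utility(itemset, db, profit):
    """Total utility of an itemset: in every transaction containing all of its
    items, sum quantity * unit profit of those items, then add across transactions."""
    total = 0
    for t in db:
        if all(i in t for i in itemset):
            total += sum(t[i] * profit[i] for i in itemset)
    return total

min_util = 18  # invented utility threshold
for itemset in ({"a"}, {"a", "b"}, {"b", "c"}):
    u = utility(itemset, transactions, unit_profit)
    print(sorted(itemset), u, "high utility" if u >= min_util else "low utility")
```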
... However, RNNs have some problems, such as gradient explosion, and they cannot always converge to the optimal solution [22], [56]. Deep learning has been applied to stock price forecasting [23], [57,61]; when a comparison is made among machine learning, neural network, and deep learning models, deep learning shows more certainty, stronger explanation ability, and a more vigorous learning ability to adapt to new problems [24], [58][59][60][61][62][63]. ...
Article
Full-text available
Mining frequent patterns from voluminous datasets termed 'big data' and having inherent uncertainties poses a significant challenge. Minor changes carried out on the databases, such as addition, deletion, or modification of items, should not lead to scanning the whole database. Besides, a number of algorithms proposed to handle these issues are effective, but their mathematical basis and manner of deployment are complex. Keeping the above points in mind, we have proposed an approach which innovatively combines the Light Gradient Boosting Machine (LightGBM) and Long Short-Term Memory (LSTM) models serially to improve prediction accuracy. Here, LightGBM brings its tree-based learning algorithms optimized for speed and performance, while LSTM contributes its advanced sequence modeling capabilities, effectively resolving the vanishing gradient dilemma that often plagues recurrent networks. Our approach is applied to the healthcare sector in general and particularly to the early detection of breast cancer from a dataset obtained from Kaggle, yielding outstanding results as is evident from the scores: precision rates of 0.92 for predicted negatives and 0.93 for predicted positives, recall rates of 0.96 for negatives and 0.88 for positives, alongside F1-scores of 0.94 and 0.90, respectively. With a comprehensive accuracy of 0.93 across 188 samples, our model demonstrates a remarkable potential for early medical diagnosis, outperforming existing single-model solutions. The robustness of our approach is further validated by the consistency of performance across various metrics, highlighting its suitability for deployment in high-stakes domains where predictive accuracy is paramount.
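To make the reported scores easier to interpret, the snippet below recomputes precision, recall, and F1 from confusion-matrix counts. The counts are invented, chosen only to land near the positive-class figures quoted above, and do not come from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented counts for the positive class of a binary screening task.
tp, fp, fn = 88, 7, 12
p, r, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision = {p:.2f}, recall = {r:.2f}, F1 = {f1:.2f}")  # ~0.93, 0.88, 0.90
```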
... Asbern and Asha [16] explored different algorithms for FIM that operate on big data using the MapReduce paradigm. Kumar and Mohbey [17] investigated different parallel FIM algorithms that are executed in distributed environments. Different issues they identified in such algorithms include scalability, privacy, complex data types, load balancing, and gene regulation patterns. ...
Article
Full-text available
Due to the rapid growth of data from different sources in organizations, data that cannot be handled by traditional tools and techniques is known as big data and must be processed in a scalable fashion. Many existing frequent itemset mining algorithms have good performance but suffer from scalability problems, as they cannot exploit parallel processing power available locally or in cloud infrastructure. Since the big data and cloud ecosystem overcomes these barriers or limitations in computing resources, it is a natural choice to use distributed programming paradigms such as MapReduce. In this paper, we propose a novel algorithm known as Nodeset-based Fast and Scalable Frequent Itemset Mining (FSFIM) to extract frequent itemsets from big data. Here, a Pre-Order Coding (POC) tree is used to represent data and improve processing speed. The Nodeset is the underlying data structure, which is efficient in discovering frequent itemsets. FSFIM is found to be faster and more scalable in mining frequent itemsets. When compared with its predecessors such as Node-lists and N-lists, Nodesets save half of the memory as they need only either pre-order or post-order coding. Cloudera's Distribution of Hadoop (CDH), a MapReduce framework, is used for the empirical study. A prototype application is built to evaluate the performance of FSFIM. Experimental results revealed that FSFIM outperforms existing algorithms such as Mahout PFP, MLlib PFP, and BigFIM. FSFIM is more scalable and found to be an ideal candidate for real-time applications that mine frequent itemsets from big data.
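The Nodeset and POC-tree machinery of FSFIM is too involved to reproduce here, but the MapReduce-style counting pass that such distributed miners typically start from can be sketched with PySpark (a Spark runtime is assumed to be available; the study itself used Hadoop MapReduce on CDH). The transactions and threshold are invented, and this illustrates the paradigm, not the FSFIM algorithm.

```python
from pyspark import SparkContext

# A minimal, illustrative MapReduce-style pass: count item occurrences across
# transactions and keep the frequent 1-itemsets.
sc = SparkContext(appName="frequent-items-sketch")

transactions = sc.parallelize([
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c"],
])
min_count = 2

frequent_items = (
    transactions
    .flatMap(lambda t: [(item, 1) for item in t])  # map phase: emit (item, 1)
    .reduceByKey(lambda x, y: x + y)               # reduce phase: sum the counts
    .filter(lambda kv: kv[1] >= min_count)         # keep items meeting the threshold
    .collect()
)
print(frequent_items)  # e.g. [('a', 3), ('b', 3), ('c', 3)]
sc.stop()
```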
... Thus, CEP enables organizations to exploit the potential of real-time event data, permitting proactive decision-making, situational awareness, and intelligent automation [15,16]. Organizations may gain a competitive advantage, increase operational efficiency, and enhance customer experiences by fast processing and analyzing events, recognizing trends, and triggering appropriate responses [17,18]. ...
Article
Full-text available
CEP is a widely used technique for the reliable recognition of arbitrarily complex patterns in enormous data streams with great performance in real time. Real-time detection of crucial events and rapid response to them are the key goals of complex event processing. The performance of event processing systems can be improved by parallelizing CEP evaluation procedures. Utilizing CEP in parallel while deploying a multi-core or distributed environment is one of the most popular and widely recognized approaches to accomplish this goal. This paper demonstrates the ability to use an unusual parallelization strategy to effectively process complicated events over streams of data. This method depends on a dual-tier hybrid paradigm that combines several parallelism levels. Thread-level or task-level parallelism (TLP) and data-level parallelism (DLP) were combined in this research. Many threads or instruction sequences from the same application can run concurrently under the TLP paradigm. In the DLP paradigm, instructions from a single stream operate on several data streams at the same time. In our suggested model, there are four major stages: data mining, pre-processing, load shedding, and optimization. The first phase is online data mining, after which the data is materialized into a publicly available solution that combines a CEP engine with a library. Next, data pre-processing encompasses the efficient adaptation of the content or format of raw data from many, possibly diverse sources. Finally, parallelization approaches have been created to reduce CEP processing time. By providing these two types of parallelism, our proposed solution combines the benefits of DLP and TLP while addressing their constraints. The JAVA tool will be used to assess the suggested technique. The performance of the suggested technique is compared to that of other current approaches to determine the efficacy and efficiency of the proposed algorithm.
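A toy sketch of combining the two parallelism tiers described above: task-level parallelism (TLP) runs different pattern detectors concurrently, while data-level parallelism (DLP) splits the event stream into chunks for each detector. The events, detectors, and thread counts are invented, and Python threads are used only to show the structure; a real CEP engine would run on a multi-core or distributed runtime rather than a single interpreter.

```python
from concurrent.futures import ThreadPoolExecutor

# Fake event stream (invented): 1000 readings from three sensors.
events = [{"sensor": i % 3, "value": i} for i in range(1000)]

def detect_high(chunk):
    """Toy pattern 1: events whose value exceeds a threshold."""
    return [e for e in chunk if e["value"] > 900]

def detect_sensor0(chunk):
    """Toy pattern 2: events coming from sensor 0."""
    return [e for e in chunk if e["sensor"] == 0]

def run_detector(detector, stream, n_parts=4):
    """Data-level parallelism: split the stream into chunks, apply the same
    detector to each chunk concurrently, then merge the partial results."""
    chunks = [stream[i::n_parts] for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        parts = pool.map(detector, chunks)
    return [e for part in parts for e in part]

# Task-level parallelism: run the two different detectors concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    high, sensor0 = pool.map(lambda d: run_detector(d, events),
                             [detect_high, detect_sensor0])

print(len(high), len(sensor0))  # 99 and 334 matches on this fake stream
```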