Book

Data Mining: Concepts and Techniques

Authors: Jiawei Han, Micheline Kamber, Jian Pei

Abstract

This is the third edition of the premier professional reference on data mining, expanding and updating the market-leading previous edition. It was the first of its kind and remains the most popular, combining sound theory with truly practical applications to prepare students for real-world challenges in data mining. Like the first and second editions, Data Mining: Concepts and Techniques, 3rd Edition equips professionals with a sound understanding of data mining principles and teaches proven methods for knowledge discovery in large corporate databases. The previous editions established the book as the market leader for courses in data mining, data analytics, and knowledge discovery. Revisions incorporate input from instructors, changes in the field, and new and important topics such as data warehouse and data cube technology, mining stream data, mining social networks, and mining spatial, multimedia, and other complex data. The book begins with a conceptual introduction followed by comprehensive, state-of-the-art coverage of concepts and techniques. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. Wherever possible, the authors raise and answer questions of utility, feasibility, optimization, and scalability. -- A comprehensive, practical look at the concepts and techniques you need to get the most out of real business data. -- Updates that incorporate input from readers, changes in the field, and more material on statistics and machine learning. -- Scores of algorithms and implementation examples, all in easily understood pseudo-code and suitable for use in real-world, large-scale data mining projects. -- Complete classroom support for instructors as well as bonus content available at the companion website. A comprehensive and practical look at the concepts and techniques you need in the area of data mining and knowledge discovery.
... A recommendation system (RS) can be developed to overcome this problem. A recommendation system is a method for recommending relevant items to users from a large pool of information [1,2]. At present, it is widely applied in various fields to provide services to users, for example at www.amazon.com, ...
... The data is categorized into two types, namely implicit feedback and explicit feedback. Over the years, research related to explicit feedback (on a scale of 1-5) has been widely developed on movie, e-commerce, and other datasets. Recommendation systems using such datasets produce an accuracy of up to 80% [4,5]. ...
Article
Recommendation systems always involve huge volumes of data, which causes scalability issues that not only increase processing time but also reduce accuracy. In addition, the type of data used greatly affects the quality of the recommendations. In recommendation systems there are two common types of data, namely implicit (binary) ratings and explicit (scalar) ratings. Binary ratings produce lower accuracy when they are not handled properly. Thus, optimized K-Means+ clustering and user-based collaborative filtering are proposed in this research. The K-Means clustering is optimized by selecting the K value using the Davies-Bouldin Index (DBI) method. The experimental results show that this optimization of K produces better clustering than the Elbow Method. K-Means+ and User-Based Collaborative Filtering (UBCF) produce a precision of 8.6% and an f-measure of 7.2%, respectively. The proposed method was compared to the DBSCAN algorithm with UBCF and achieved better accuracy, with a 1% increase in precision. This result shows that K-Means+ with UBCF can handle implicit feedback datasets and improve precision.
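The abstract above selects the number of clusters for K-Means with the Davies-Bouldin Index rather than the Elbow Method. A minimal sketch of that selection step with scikit-learn (the synthetic data, the candidate range of K, and all parameter values are illustrative assumptions, not taken from the paper):

```python
# Sketch: pick K for K-Means by minimizing the Davies-Bouldin Index (lower is better).
# Assumes a numeric feature matrix X; the candidate K range is illustrative.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=500, centers=5, n_features=8, random_state=42)  # stand-in data

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)          # DBI for this K

best_k = min(scores, key=scores.get)                     # smallest DBI wins
print("DBI per K:", {k: round(v, 3) for k, v in scores.items()})
print("Selected K =", best_k)
```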
... Classification is an important problem in decision making. It has been studied extensively by the machine learning (ML) community as a possible solution to the knowledge acquisition or knowledge extraction problem (Jiawei 2002). An ML model that is effective and widely used is the decision tree (DT) model (Swain and Hauska 1977; Stiglic, Kocbek, Pernek, et al. 2012). ...
... The models are contrasted as follows: first, the baseline DT model (DT_baseline) is obtained by training a DT model with the CART algorithm and the Gini splitting metric (Grabmeier and Lambe 2007). The following hyper-parameters are optimized using the grid search method: tree depth ([2, 3, ..., 10]), maximum number of leaves ([5, 10, ..., 50]), and minimum samples for splitting a node ([2, 5, 8, 10, 15, 20, 25, 30]) (R. Liu, E. Liu, Yang, et al. 2006). The other five DT models were obtained by applying the following PP algorithms to the baseline model (DT_baseline): ...
Preprint
Decision trees (DTs) are among the most popular and efficient techniques in data mining. In the clinical domain specifically, DTs have been widely used thanks to their relatively easy interpretation and efficient computation time. However, some DT models produce a large tree structure which is difficult to understand and often leads to misclassification of data in the testing process. Therefore, a DT model that is a simple tree with high accuracy is the desired goal. Post-pruning (PP) algorithms have been introduced to reduce the complexity of the tree structure with a minor decrease in classification accuracy. We propose a new Boolean satisfiability (SAT) based PP algorithm (namely, the SAT-PP algorithm) which reduces the tree size while preserving the accuracy of the unpruned tree, since in medical-related tasks decreasing the model's performance is something we emphatically try to avoid: one may prefer an unpruned DT model to a pruned DT model with worse performance. To evaluate the proposed algorithm and other PP algorithms, we compare their performance in terms of model query response time and classification accuracy using three oncology data sets. The SAT-PP DT model obtained the same accuracy and F1 score as the DT model without PP while significantly reducing computation time (6.8%).
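The citation context above tunes a baseline CART tree by grid search over depth, maximum leaves, and minimum samples to split. A minimal sketch of that baseline step with scikit-learn (the dataset and scoring choice are illustrative assumptions; the SAT-based pruning itself is not reproduced here):

```python
# Sketch: baseline CART decision tree tuned by grid search over the hyper-parameter
# ranges quoted above (depth 2-10, max leaves 5-50, min samples to split a node).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)               # stand-in oncology-style data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

param_grid = {
    "max_depth": list(range(2, 11)),
    "max_leaf_nodes": list(range(5, 55, 5)),
    "min_samples_split": [2, 5, 8, 10, 15, 20, 25, 30],
}
search = GridSearchCV(DecisionTreeClassifier(criterion="gini", random_state=0),
                      param_grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)

print("Best hyper-parameters:", search.best_params_)
print("Held-out accuracy:", search.best_estimator_.score(X_te, y_te))
```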
... Data mining, or knowledge mining, refers to extracting or mining knowledge from large amounts of data [5]. The data may be either structured or unstructured. ...
... But to extract the hidden patterns, data mining algorithms are used. The data may be of any type, such as relational, transactional, object-oriented, temporal, spatial, text, multimedia, web, etc. [5]. Text is unstructured, nebulous, and convoluted to deal with. ...
Article
Full-text available
Tremendous growth of text document collections has led to an increased interest in developing various approaches. Analysis and evaluation of useful information or patterns from the existing piles of raw data are a must. To extract useful documents, information, and patterns from existing large amounts of unstructured retrospective corpora, various approaches such as Information Retrieval, Information Extraction, and Text Mining were introduced. Different natural language processing techniques have been developed to improve the accuracy of extracting patterns. While several text-handling approaches based on different features have been proposed in the past, there is no systematic study which discusses the similarities and distinctions of all these approaches. This paper presents an overview of these approaches: IR, IE, TM, and NLP. We discuss the methods and applications of existing approaches, and also the theories involved and possible future research.
... The automated rule discovery technique known as decision tree (Dtree) [39] analyzes and learns from training data, producing a series of branching decisions that classify the data based on the values of different feature attributes. ...
Article
Full-text available
In recent years, the informatization of the educational system has caused a substantial increase in educational data. Educational data mining can assist in identifying the factors influencing students’ performance. However, two challenges have arisen in the field of educational data mining: (1) how to handle the abundance of unlabeled data, and (2) how to identify the most crucial characteristics that impact student performance. In this paper, a semi-supervised feature selection framework is proposed to analyze the factors influencing student performance. The proposed method is semi-supervised, enabling the processing of a considerable amount of unlabeled data with only a few labeled instances. Additionally, by solving a feature selection matrix, the weights of each feature can be determined to rank their importance. Furthermore, various commonly used classifiers are employed to assess the performance of the proposed feature selection method. Extensive experiments demonstrate the superiority of the proposed semi-supervised feature selection approach. The experiments indicate that behavioral characteristics are significant for student performance, and the proposed method outperforms the state-of-the-art feature selection methods by approximately 3.9% when extracting the most important feature.
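The paper above solves a feature selection matrix; that specific formulation is not reproduced here. As a rough illustration of the same workflow (exploit unlabeled data, then rank features), here is a sketch using a different, standard combination of label spreading and mutual information in scikit-learn; the dataset and all parameter values are assumptions:

```python
# Sketch: a generic semi-supervised workflow, NOT the paper's matrix-based method.
# Unlabeled rows carry the label -1; LabelSpreading propagates labels, then features
# are ranked by mutual information with the propagated labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.semi_supervised import LabelSpreading

X, y = make_classification(n_samples=600, n_features=12, n_informative=4, random_state=1)
y_semi = y.copy()
rng = np.random.default_rng(1)
y_semi[rng.random(len(y)) < 0.8] = -1                     # keep only ~20% of the labels

model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_semi)
scores = mutual_info_classif(X, model.transduction_, random_state=1)

ranking = np.argsort(scores)[::-1]
print("Features ranked by importance:", ranking)
```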
... Among the three classifiers, Random Forest ranked first with 75.09% accuracy [3]. With the original 21 features, Logistic Regression performed well with 96% accuracy, and Random Forest performed very efficiently with 97% accuracy [4]. ...
Article
Full-text available
In recent years, more and more fields around the development and application of artificial intelligence have seen new improvements and breakthroughs. Among them, drug design and classification are among the most popular application areas and can provide the greatest help to society. According to statistics, the number of newly listed drugs in the world is decreasing year by year, while the risks and costs of drug development are increasing year by year, an opposite trend to that of newly listed drugs. In this context, the application of artificial intelligence technology provides a new idea and opportunity to solve the problem of drug design and classification with higher accuracy than manual work. This paper presents research based on using machine learning models to produce suitable outcomes for patients of different drug types, in order to reduce the workload of doctors in hospitals.
... The simplest form of oversampling is random oversampling, which involves repeating randomly selected minority class samples to balance the distribution of classes [52]. A variation of random oversampling is Synthetic Minority Over-sampling Technique (SMOTE), which creates synthetic samples from the minority class by interpolating between existing minority class samples [53]. 2) Undersampling: Undersampling is a technique that reduces the number of samples in the majority class. ...
Preprint
Full-text available
Industry 4.0, also known as the Fourth Industrial Revolution, is characterized by the incorporation of advanced manufacturing technologies such as the Internet of Things (IoT), Artificial Intelligence (AI), and automation. With the increasing adoption of Industry 4.0 technologies, it becomes crucial to implement effective security measures to safeguard these systems from cyber attacks. The development of intrusion detection systems (IDS) that can detect and respond to cyber threats in real-time is crucial for securing Industry 4.0 systems. This research topic seeks to investigate the various techniques and methodologies employed in developing IDS for Industry 4.0 systems, with a particular concentration on identifying the most effective solutions for protecting these systems from cyber attacks. In this study, we compared supervised and unsupervised intrusion detection algorithms. We utilized data collected from heterogeneous sources, including Telemetry datasets of IoT and Industrial Internet of Things (IIoT) sensors, Operating systems (OS) datasets of Windows 7 and 10, as well as Ubuntu 14 and 18 TLS and Network traffic datasets simulated by the School of Engineering and Information Technology (SEIT), UNSW Canberra @ the Australian Defence Force Academy (ADFA). The preliminary results of IDS accuracy are extremely encouraging on the selected data for this study (Windows OS and Ubuntu OS), which motivates the continuance of this line of inquiry using a variety of other data sources to formulate a general recommendation of IDS for Industry 4.0.
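The citation context above ([52], [53]) describes random oversampling and SMOTE for class imbalance. A minimal sketch of rebalancing an imbalanced dataset with SMOTE from the imbalanced-learn package (the synthetic data, class weights, and the downstream classifier are illustrative assumptions):

```python
# Sketch: rebalancing an imbalanced intrusion-detection-style dataset with SMOTE,
# then training a classifier on the resampled data.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
print("Before SMOTE:", Counter(y_tr))

X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)   # synthesize minority samples
print("After SMOTE:", Counter(y_res))

clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
print("Test accuracy:", clf.score(X_te, y_te))
```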
... Clustering is a powerful machine learning tool for detecting structure in datasets. Detecting, analyzing, and describing the natural clusters within a dataset is of fundamental importance to a number of fields [1][2][3][4]. These fields range from bioinformatics (identifying similar genes) and marketing (segmenting customers for market research) to the social sciences, psychology, biology, security, and computer vision and image processing in areas such as medicine and industry. ...
... Thus, one solution to avoid mislabeling or poor annotation in large datasets is to rely on text clustering. Cluster analysis is the unsupervised process of grouping data instances into relatively similar categories without understanding the group's structure or class labels [16]. ...
Preprint
Full-text available
The large quantities of information retrieved from communities, public data repositories, web pages, or data mining can be sparse and poorly classified. This work shows how to employ unsupervised classification algorithms such as K-means to classify user reviews into their closest category, forming a balanced data set. Moreover, we found that the text vectorization technique significantly impacts cluster formation when comparing TF-IDF and Word2Vec. A cluster could be mapped to a movie genre in 81.34% ± 20.48 of the cases when TF-IDF was applied, whereas Word2Vec only yielded 53.51% ± 24.1. In addition, we highlight the impact of stop-word removal. We found that pre-compiled lists are not the best method for removing stop-words before clustering, because there is much ambiguity, centroids are poorly separated, and only 57% of clusters could be matched to a movie genre. In contrast, our proposed approach achieved 94% accuracy. After analyzing the classifiers’ results, we observed a similar effect when splitting by stop-word removal method. Statistically significant changes were observed, especially in the precision metric and Jaccard scores for both classifiers, when using custom-generated stop lists rather than pre-compiled ones. Reclassifying sparse data using custom-generated stop lists is strongly recommended.
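A minimal sketch of the TF-IDF plus K-Means pipeline compared in the abstract above, using scikit-learn (the toy documents, the number of clusters, and the use of a pre-compiled English stop list are illustrative assumptions; the paper argues for custom-generated stop lists instead):

```python
# Sketch: clustering short review texts with TF-IDF vectors and K-Means.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "a tense thriller full of suspense",
    "hilarious comedy with great jokes",
    "romantic drama about two strangers",
    "suspenseful crime thriller",
    "light hearted comedy for the family",
    "a moving romance and drama",
]

vec = TfidfVectorizer(stop_words="english")        # pre-compiled stop list, for simplicity
X = vec.fit_transform(docs)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
for doc, label in zip(docs, km.labels_):
    print(label, doc)
```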
... In this work we have been limited to one hidden layer, which has proved sufficient after the many simulations carried out (Figure 3). The number of neurons in the hidden layer has generally been determined heuristically or by trial and error [17, 21, 22]. The input layer has n1 neurons that we call ne_k, 1 ≤ k ≤ n1; the hidden and output layers contain n2 and n3 neurons that we respectively call nc_j and ns_i, where 1 ≤ j ≤ n2 and 1 ≤ i ≤ n3. ...
Article
Full-text available
The present work is concerned with handwritten and printed numeral recognition based on an improved version of the characteristic loci (CL) method for extracting numeral features. After preprocessing the numeral image, the method divides the image into four equal parts and applies the traditional CL to each part. The recognition rate obtained by this method is improved, indicating that the extracted numeral features contain more detail. Numeral recognition is carried out in this work using k-nearest neighbors and multilayer perceptron techniques.
... When an unfamiliar data record is provided, the KNN classifier looks for similar data records in the pre-trained data set. The k-nearest classifier selects the closest similar record and assigns that record's class label to the tested record (Han & Micheline, 2005). For an unknown test record, a majority vote of its nearby neighbours is taken. ...
Article
Full-text available
Computer software plays a crucial role in the health sector, facilitating effective management of medical records, enhancing the delivery of services, and improving patient outcomes through advanced diagnostic equipment. Cardiovascular disease poses a significant challenge for the medical community in the contemporary era, emerging as the leading cause of mortality. The healthcare business collects substantial quantities of health data that, however, cannot be effectively utilised for informed decision-making. Data mining techniques are employed to analyse large datasets in order to extract valuable information from vast amounts of data. This study aims to provide motivation for the development of an intelligent classification system for heart disease, utilising data mining techniques with a smaller set of characteristics or attributes. The K-nearest neighbour and Fuzzy K-nearest neighbour classifier algorithms are employed in conjunction with evolutionary search and symmetric uncertainty attribute evaluator techniques to enhance the process of feature selection. The experimental findings demonstrate that each technique possesses distinct benefits in effectively meeting the objectives of cardiac disease detection with a high level of precision. The results collected indicate that the K-nearest neighbours (KNN) algorithm has demonstrated superior performance compared to the Fuzzy-KNN algorithm. The analysis further unveiled that K-nearest neighbours (KNN) regularly exhibited commendable accuracy while employing the symmetric uncertainty measure.
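A minimal sketch of KNN classification on a reduced attribute set, in the spirit of the study above; the dataset is a stand-in, and mutual information is used here in place of the symmetric uncertainty evaluator (an assumption, not the authors' exact setup):

```python
# Sketch: K-nearest-neighbour classification after univariate feature selection.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)                # stand-in for a heart-disease table
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = make_pipeline(StandardScaler(),
                    SelectKBest(mutual_info_classif, k=10),   # keep a smaller attribute set
                    KNeighborsClassifier(n_neighbors=5))
knn.fit(X_tr, y_tr)
print("KNN accuracy:", knn.score(X_te, y_te))
```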
... If we study the definition of the term data mining, then we can say data mining refers to extracting or "mining" knowledge from large amounts of data or databases [10]. The process of finding useful patterns or meaning in raw data has been called KDD [11]. ...
Article
Cancer is always listed among the most dangerous diseases in the world, and lung cancer is one of the most dangerous cancer types. The disease spreads through uncontrolled cell growth in the tissues of the lung. Early detection can save lives and improve the survivability of patients. In this paper we survey several aspects of data mining used for lung cancer prediction. Data mining is useful in lung cancer classification. We also survey aspects of the ant colony optimization (ACO) technique. Ant colony optimization helps in increasing or decreasing the disease prediction value. This study combines data mining and ant colony optimization techniques for appropriate rule generation and classification, which leads to accurate cancer classification. In addition, it provides a basic framework for further improvement in medical diagnosis.
... Precision is the degree of agreement between the information requested by the user and the answer given by the system, while recall is the percentage of the system's success in retrieving a piece of information [14]. The F1 score is the harmonic mean of recall and precision; this calculation is useful for knowing how precise and reliable the system's performance is in classifying the classes [15]. ...
Article
Full-text available
This research focuses on sentiment analysis of public tweets about the president's performance in handling COVID-19. The aim is to determine the influence of the dataset and the resampling model when building a sentiment analysis machine model that classifies the topic of the president's performance in handling COVID-19 into three sentiment classes: positive, negative, and neutral. Two datasets are used in this research: dataset A, a collection of 5,694 tweets taken from Twitter, and dataset B, formed by taking "parameter + independent word" from 1,015 tweets. The algorithm used in this research is the Support Vector Machine (SVM), used to build a machine learning model, together with the ROS (Random Over Sampler) and RUS (Random Under Sampler) resampling models to handle imbalanced data. The test results show that scenario 5 (dataset B + ROS) has the best performance, with an accuracy of 90.08% and a precision of 90.39%; the tests also show that scenario 5 is a machine learning model that does not overfit. This research successfully implemented a sentiment analysis machine that can categorize texts about the president's performance in handling COVID-19. Keywords: Sentiment Analysis, dataset, Support Vector Machine, ROS (Random Over Sampler), RUS (Random Under Sampler)
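The citation context above defines precision, recall, and the F1 score as the evaluation metrics. A minimal sketch of computing them with scikit-learn for a three-class sentiment task (the toy labels and macro averaging are illustrative assumptions):

```python
# Sketch: precision, recall, and F1 for a 3-class sentiment task.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = ["pos", "neg", "neu", "pos", "neg", "neu", "pos", "neg"]
y_pred = ["pos", "neg", "pos", "pos", "neu", "neu", "pos", "neg"]

print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall:   ", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F1:       ", f1_score(y_true, y_pred, average="macro", zero_division=0))
```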
... On another note, the Receiver Operating Characteristic (ROC) curve was used to compare the classification ability of the proposed algorithm with the other ML methods. The curve is the result of plotting sensitivity against 1 − specificity for each threshold value [34]. As shown in Fig. 4, the results obtained from the different ML methods and XGB are represented. ...
Article
Full-text available
Aim Nonalcoholic fatty liver disease (NAFLD) is a silent epidemic that has become the most common chronic liver disease worldwide. Nonalcoholic steatohepatitis (NASH) is an advanced stage of NAFLD, which is linked to a high risk of cirrhosis and hepatocellular carcinoma. The aim of this study is to develop a predictive model to identify the main risk factors associated with the progression of hepatic fibrosis in patients with NASH. Methods A database from a multicenter retrospective cross-sectional study was analyzed. A total of 215 patients with biopsy-proven NASH were collected. The NAFLD Activity Score and the Kleiner scoring system were used to diagnose and stage these patients. Noninvasive test (NIT) scores were added to identify which ones were most reliable for follow-up and for avoiding biopsy. For the analysis, different Machine Learning methods were implemented, with the eXtreme Gradient Boosting (XGB) system being the proposed algorithm for developing the predictive model. Results The most important variable in this predictive model was High-density lipoprotein (HDL) cholesterol, followed by systemic arterial hypertension and triglycerides (TG). The NAFLD Fibrosis Score (NFS) was the most reliable NIT. As for the proposed method, XGB obtained higher results than the second-best method, K-Nearest Neighbors, in terms of accuracy (95.05 vs. 90.42) and Area Under the Curve (0.95 vs. 0.91). Conclusions HDL cholesterol, systemic arterial hypertension, and TG were the most important risk factors for liver fibrosis progression in NASH patients. NFS is recommended for monitoring and decision making.
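The citation context above compares models by their ROC curves. A minimal sketch of building a ROC curve from predicted probabilities and comparing models by AUC with scikit-learn; the synthetic data is an assumption, and a gradient boosting classifier stands in for XGB to avoid an external dependency:

```python
# Sketch: ROC curve (TPR vs FPR, i.e. sensitivity vs 1 - specificity) and AUC comparison.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

for name, model in [("GBM", GradientBoostingClassifier(random_state=3)),
                    ("KNN", KNeighborsClassifier())]:
    probs = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    fpr, tpr, _ = roc_curve(y_te, probs)      # one point per decision threshold
    print(name, "AUC =", round(auc(fpr, tpr), 3))
```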
... This is also an indication of stress. The use of social networking sites reveals the state of one's mind and thinking [4]. Twitter and Facebook have a large number of users. ...
Article
Every year tens of millions of people suffer from depression and few of them get proper treatment on time. It is therefore crucial to detect human stress and relaxation automatically via social media on a timely basis, and to detect and manage stress before it becomes a severe problem. A huge number of informal messages are posted every day on social networking sites, blogs, and discussion forums. This paper describes an approach to detecting stress using information from social media networking sites such as Twitter. It presents a method to detect expressions of stress and relaxation on a Twitter dataset, i.e., sentiment analysis to find emotions or feelings about daily life. Sentiment analysis is the automatic extraction of sentiment-related information from text. Here the TensiStrength framework is used for sentiment strength detection on social networking sites, extracting sentiment strength from informal English text. TensiStrength is a system to detect the strength of stress and relaxation expressed in social media text messages. It uses a lexical approach and a set of rules to detect direct and indirect expressions of stress or relaxation, classifying both positive and negative emotions on a strength scale from -5 to +5. Stressed sentences from the conversation are considered and categorised into stress and relax classes. TensiStrength is robust and can be applied to a wide variety of different social web contexts; its effectiveness depends on the nature of the tweets. Human beings have an inborn capability to differentiate the multiple senses of an ambiguous word in a particular context, but a machine executes only according to its instructions. A major drawback of machine translation is word sense disambiguation (WSD): a single word can have multiple meanings or "senses." In the preprocessing stage, part-of-speech disambiguation is analysed, and the drawback of WSD is addressed in the proposed method by using unigrams, bigrams, and trigrams to give better results on ambiguous words. Here, SVM with N-grams gives the better result, with a precision of 65% and a recall of 67%. The main objective of this technique is to find the explicit and implicit amounts of stress and relaxation expressed in tweets. Keywords: Stress Detection, Data Mining, TensiStrength, Word Sense Disambiguation.
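A minimal sketch of the "SVM with N-grams" setup mentioned above, using unigram, bigram, and trigram TF-IDF features in scikit-learn; the toy tweets and labels are illustrative assumptions, and TensiStrength itself is not reproduced:

```python
# Sketch: an n-gram (unigram/bigram/trigram) SVM text classifier for stress vs relax.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = ["deadline tomorrow and nothing works",
          "finally on holiday, so relaxed",
          "exam stress is killing me",
          "calm evening walk by the lake"]
labels = ["stress", "relax", "stress", "relax"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)),   # uni-, bi-, and trigrams
                    LinearSVC())
clf.fit(tweets, labels)
print(clf.predict(["so much pressure at work today"]))
```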
Article
Exploratory data analysis can uncover interesting insights from data. Current methods utilize "interestingness measures" designed from the system designers' perspectives, thus inherently restricting the insights to their defined scope. These systems, consequently, may not adequately represent a broader range of user interests. Furthermore, most existing approaches that formulate an "interestingness measure" are rule-based, which makes them inevitably brittle and often requires a holistic re-design when new user needs are discovered. This paper presents a data-driven technique for deriving an "interestingness measure" that learns from annotated data. We further develop an innovative annotation algorithm that significantly reduces the annotation cost, and an insight synthesis algorithm based on the Markov Chain Monte Carlo method for efficient discovery of interesting insights. We consolidate these ideas into a system, DAISY. Our experimental outcomes and user studies demonstrate that DAISY can effectively discover a broad range of interesting insights, thereby substantially advancing the current state of the art.
Chapter
Many significant advancements in web technology have resulted in the fast expansion of the world wide web (WWW) and web development. WWW employs a variety of technologies to improve communication with internet users, yet user frustrations exist. Adding a new resource and distributing network traffic over one or more resources is one possible solution to this problem. Web caching is a popular strategy for reducing network traffic by storing websites closer to the client site. A proxy server is in charge of web caching, which operates as a middleman between the web server and the web client, reducing latency in page retrieval. This proxy-based web caching solution may be enhanced further to regulate web performance. As a result, this chapter focuses on how to improve the proxy-based web caching system. It optimises the speed of the proxy-based web caching system using web usage mining (WUM).
Chapter
Sorting software modules in order of defect count can help testers to focus on the modules with more defects. One of the most popular methods for sorting modules is generalized linear regression. However, our previous study showed the poor performance of these regression models, which might be caused by severe multicollinearity. Ridge regression (RR) can improve prediction performance for multicollinearity problems, and lasso regression (LAR) is a worthy competitor to RR. Therefore, we investigate both RR and LAR models for cross-version defect prediction. Cross-version defect prediction is an approximation of real applications: it constructs prediction models from a previous version of a project and predicts defects in the next version. We also propose a two-layer ensemble learning approach, TLEL, which leverages decision trees and ensemble learning to improve the performance of just-in-time defect prediction. To evaluate the performance of TLEL, we use two metrics, namely cost effectiveness and F1-score. We perform experiments on datasets from six large open source projects, namely Bugzilla, Columba, JDT, Platform, Mozilla, and PostgreSQL, containing a total of 137,417 changes. Unsupervised models do not require defect data to build the prediction models and hence incur a low building cost and have a wide application range. Consequently, it would be more desirable for practitioners to apply unsupervised models in effort-aware just-in-time (JIT) defect prediction if they can predict defect-inducing changes well. However, little is currently known about their prediction effectiveness in this context.
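A minimal sketch of the ridge and lasso regression setup for cross-version defect-count prediction described above; the synthetic "module metrics" with induced multicollinearity, the version split, and the alpha values are illustrative assumptions:

```python
# Sketch: train ridge/lasso on a previous release and score on the next release.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(7)
coef = rng.normal(size=20)                                  # shared "true" relationship

def make_version(n, noise, seed):
    r = np.random.default_rng(seed)
    X = r.normal(size=(n, 20))
    X[:, 1] = X[:, 0] + 0.01 * r.normal(size=n)             # induce multicollinearity
    y = X @ coef + noise * r.normal(size=n)
    return X, np.clip(y, 0, None)                           # defect counts are non-negative

X_prev, y_prev = make_version(300, 5.0, seed=1)             # previous release
X_next, y_next = make_version(300, 5.0, seed=2)             # next release

for name, model in [("Ridge", Ridge(alpha=1.0)), ("Lasso", Lasso(alpha=0.1))]:
    model.fit(X_prev, y_prev)
    print(name, "R^2 on next version:", round(model.score(X_next, y_next), 3))
```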
Article
At present, the national tourism sector has become a new star of national development. The sector's contribution to foreign exchange and employment is very significant for the country; in fact, by 2019 it was estimated to have surpassed the foreign exchange earnings of the palm oil (CPHAI) industry. In this context, the government must increase the growth of tourist visits to Indonesia. One area the government needs to pay attention to in developing the tourism sector is hotel accommodation. To improve hotel accommodation services, a service is needed that conveys information about the uniqueness of each hotel. The service to be developed uses a two-way relationship between customers and service providers. This two-way relationship is established by grouping hotel types based on Google review data. The main objective of this research is to analyze several methods suitable for classifying hotel uniqueness. The types of hotel uniqueness to be classified are hotels themed around nature, Europe, classic style, photography, and a homey atmosphere. The methods compared are the Support Vector Machine (SVM) method and the Naïve Bayes method. This research shows that the accuracy of Naïve Bayes is higher than that of SVM, at 75% versus 62.5%.
Article
Full-text available
Fault diagnosis is integral to maintenance practices, ensuring optimal machinery functionality. While traditional methods relied on human expertise, Intelligent Fault Diagnosis (IFD) techniques, propelled by Machine Learning (ML) advancements, now offer automated fault identification. Despite their efficiency, a research gap exists, emphasizing the need for methods providing not just reliable fault identification but also in-depth causal factor analysis. This research introduces a novel approach using an Extra Trees classification algorithm and feature selection to identify fault importance in manufacturing processes. Compared with SVM, neural networks, and other tree-based ML methods, the method enhances training and computational efficiency, achieving over 99% classification accuracy on the PHM 2021 dataset. Importantly, the algorithm enables researchers to analyze individual fault causes, addressing a critical research gap. The study provides guidelines for further research, aiming to refine the proposed strategy. This work contributes to advancing fault diagnosis methodologies, combining automation with comprehensive causal analysis, crucial for both academic and industrial applications.
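A minimal sketch of Extra Trees classification paired with per-feature importances, mirroring the idea above of coupling fault identification with causal-factor analysis; the synthetic data stands in for the PHM 2021 dataset and all parameters are illustrative:

```python
# Sketch: Extra Trees classifier plus feature-importance ranking.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=15, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("Accuracy:", clf.score(X_te, y_te))

order = np.argsort(clf.feature_importances_)[::-1][:5]
print("Top fault-related features:", order, clf.feature_importances_[order].round(3))
```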
Article
Full-text available
In recent years, various studies have been conducted on SVMs and their applications in different areas, and they have been developed significantly. SVM is one of the most robust classification and regression algorithms and plays a significant role in pattern recognition. However, SVM has not been developed significantly in some areas, such as large-scale datasets, unbalanced datasets, and multiclass classification. Efficient SVM training on large-scale datasets is of great importance in the big data era. However, as the number of samples increases, the time and memory required to train an SVM increase, making SVM impractical even for medium-sized problems. With the emergence of big data, this problem becomes more significant. This paper presents a novel distributed method for SVM training in which a very small subset of training samples is used for classification, which reduces the problem size and thus the required memory and computational resources. The solution of this reduced problem almost converges to that of the standard SVM. The method consists of three steps: first, detecting a subset of distributed training samples; second, creating local SVM models and obtaining partial vectors; and finally, combining the partial vectors to obtain the global vector and the final model. In addition, for datasets that suffer from an unbalanced number of samples and are biased toward the majority class, the proposed method balances the samples of the two classes, so it can also be used on unbalanced datasets. The empirical results show that the method is efficient for large-scale problems.
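A rough illustration of the idea of training an SVM on a much smaller subset of samples: fit local SVMs on data partitions, pool their support vectors, and retrain a global SVM on that reduced set. This is a simplified sketch with scikit-learn, not the authors' exact distributed algorithm, and the data, partition count, and kernel settings are assumptions:

```python
# Sketch: two-stage SVM training on a reduced sample set drawn from local models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=6000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

sv_X, sv_y = [], []
for chunk in np.array_split(np.arange(len(X_tr)), 4):      # 4 "distributed" partitions
    local = SVC(kernel="rbf", C=1.0).fit(X_tr[chunk], y_tr[chunk])
    sv_X.append(local.support_vectors_)                    # keep only support vectors
    sv_y.append(y_tr[chunk][local.support_])

X_small, y_small = np.vstack(sv_X), np.concatenate(sv_y)
global_svm = SVC(kernel="rbf", C=1.0).fit(X_small, y_small)

print("Training set reduced to", len(X_small), "of", len(X_tr), "samples")
print("Test accuracy:", round(global_svm.score(X_te, y_te), 3))
```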
Conference Paper
Full-text available
Educational systems nowadays are implemented online, and due to the variety of trades and subjects it becomes difficult for such systems to predict which areas of particular trades or subjects need to be improved. In simple words, an educational system can be a university, school, or college; these systems mainly involve students and their future endeavors. We aim to create a system that will train itself on previous students' admission data and recommend to the education system the areas that should be improved, so as to create new opportunities for students to successfully and easily choose trades that would benefit them in the future. Hence, a system that predicts and improves key areas of certain trades that need attention or improvement can serve as a recommendation as well.
Article
The automation of incoming invoices processing promises to yield vast efficiency improvements in accounting. Until a universal adoption of fully electronic invoice exchange formats has been achieved, machine learning can help bridge the adoption gaps in electronic invoicing by extracting structured information from unstructured invoice formats. Machine learning especially helps the processing of invoices of suppliers who only send invoices infrequently, as the models are able to capture the semantic and visual cues of invoices and generalize them to previously unknown invoice layouts. Since the population of invoices in many companies is skewed toward a few frequent suppliers and their layouts, this research examines the effects of training data taken from such populations on the predictive quality of different machine-learning approaches for the extraction of information from invoices. Comparing the different approaches, we find that they are affected to varying degrees by skewed layout populations: The accuracy gap between in-sample and out-of-sample layouts is much higher in the Chargrid and random forest models than in the LayoutLM transformer model, which also exhibits the best overall predictive quality. To arrive at this finding, we designed and implemented a research pipeline that pays special attention to the distribution of layouts in the splitting of data and the evaluation of the models.
Article
Intrusion detection systems (IDSs) analyze internet activities and traffic to detect potential attacks, thereby safeguarding computer systems. In this study, researchers focused on developing an advanced IDS that achieves high accuracy through the application of feature selection and ensemble learning methods. The utilization of the CIC-CSE-IDS2018 dataset for training and testing purposes adds relevance to the study. The study comprised two key stages, each contributing to its significance. In the first stage, the researchers reduced the dataset through strategic feature selection and carefully selected algorithms for ensemble learning. This process optimizes the IDS’s performance by selecting the most informative features and leveraging the strengths of different classifiers. In the second stage, the ensemble learning approach was implemented, resulting in a powerful model that combines the benefits of multiple algorithms. The results of the study demonstrate its impact on improving attack detection and reducing detection time. By applying techniques such as Spearman’s correlation analysis, recursive feature elimination (RFE), and chi-square test methods, the researchers identified key features that enhance the IDS’s performance. Furthermore, the comparison of different classifiers showcased the effectiveness of models such as extra trees, decision trees, and logistic regression. These models not only achieved high accuracy rates but also considered the practical aspect of execution time. The study’s overall significance lies in its contribution to advancing IDS capabilities and improving computer security. By adopting an ensemble learning approach and carefully selecting features and classifiers, the researchers created a model that outperforms individual classifier approaches. This model, with its high accuracy rate, further validates the effectiveness of ensemble learning in enhancing IDS performance. The findings of this study have the potential to drive future developments in intrusion detection systems and have a tangible impact on ensuring robust computer security in various domains.
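A minimal sketch of the two-stage idea described above: chi-square feature selection followed by a soft-voting ensemble of extra trees, a decision tree, and logistic regression. The synthetic data stands in for CIC-CSE-IDS2018, and the scaler, k value, and estimator settings are illustrative assumptions:

```python
# Sketch: chi-square feature selection feeding a voting ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=4000, n_features=40, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier([
    ("et", ExtraTreesClassifier(n_estimators=100, random_state=0)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
], voting="soft")

model = make_pipeline(MinMaxScaler(),                      # chi2 requires non-negative inputs
                      SelectKBest(chi2, k=15),
                      ensemble)
model.fit(X_tr, y_tr)
print("Ensemble accuracy:", round(model.score(X_te, y_te), 3))
```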
Conference Paper
Internet use has become ubiquitous worldwide. Online crime has grown alongside Internet activity. Cybersecurity has evolved rapidly to keep up with cyberspace’s rapid changes. Cyber security refers to a nation’s or company’s online protections. The term “cyber security” was barely known by the general public two decades ago. Cybersecurity is an issue that impacts not just individuals but also businesses and governments. Everything is now digital, and cybernetics makes use of tools from the cloud, mobile devices, and the Internet of Things. The issues of confidentiality, safety, and restitution all arise when discussing cyber attacks. To protect networks, computers, programmes, and data from being attacked, damaged, or accessed without permission, cyber security has been developed. With an eye on the future of next-generation networks, this article provides a high-level review of several ANN and DL techniques utilized for cybersecurity and Cybersecurity Capacity Building (CCB) strategies.
Article
Full-text available
Background To investigate the contribution of machine learning decision tree models applied to perfusion and spectroscopy MRI for multiclass classification of lymphomas, glioblastomas, and metastases, and then to bring out the underlying key pathophysiological processes involved in the hierarchization of the models' decision-making algorithms. Methods From 2013 to 2020, 180 consecutive patients with histopathologically proven lymphomas (n = 77), glioblastomas (n = 45), and metastases (n = 58) were included in the machine learning analysis after undergoing MRI. The perfusion parameters (rCBVmax, PSRmax) and spectroscopic concentration ratios (Lac/Cr, Cho/NAA, Cho/Cr, and Lip/Cr) were used to construct Classification and Regression Tree (CART) models for multiclass classification of these brain tumors. A 5-fold random cross validation was performed on the dataset. Results The decision tree model thus constructed successfully classified all 3 tumor types with a performance (AUC) of 0.98 for PCNSLs, 0.98 for GBMs, and 1.00 for METs. The model accuracy was 0.96 with an R-squared of 0.887. Five rules of classifier combinations were extracted, with predicted probabilities from 0.907 to 0.989 at the end nodes of the decision tree for tumor multiclass classification. In hierarchical order of importance, the root node (Cho/NAA) in the decision tree algorithm was primarily based on the proliferative, infiltrative, and neuronal destructive characteristics of the tumor; the internal node (PSRmax) on tumor tissue capillary permeability characteristics; and the end node (Lac/Cr or Cho/Cr) on tumor glycolytic energy metabolism (Warburg effect) or on membrane lipid tumor metabolism. Conclusion Our study shows a potential implementation of machine learning decision tree model algorithms based on a hierarchical, convenient, and personalized use of perfusion and spectroscopy MRI data for multiclass classification of these brain tumors.
Conference Paper
Cyber-physical power systems' reliance on cyberspace makes them vulnerable to cyber-attacks, particularly false data injection attacks (FDIAs), where the aim is to alter the state estimation (SE) results by changing meters' readings. Because of distribution systems' properties, such as lower measurement redundancy and varying loads, existing methods cannot be used to accurately detect and localize an FDIA. To fill these gaps and to deal with the rarity of FDIAs in distribution systems, we propose an ensemble of deep convolutional neural networks (CNNs) to detect and localize FDIAs in active balanced and unbalanced distribution systems. To this end, first, a dataset is created using different attack scenarios and the possible reconfiguration and renewable generation scenarios in the system. The records are in the form of WLS-generated voltage estimates of PQ buses with different balancing ratios between attacked and normal records. Then, these datasets are used to train different CNNs. The CNNs' outputs are merged using a multilayer perceptron network. Finally, FDIAs are detected and localized by the proposed ensemble model in balanced and unbalanced power distribution systems. Results of simulations on the IEEE 33-bus and modified IEEE 13-bus networks verify that the ensemble model can distinguish between normal and attacked records with great accuracy according to the area under the curve (AUC) criterion, thus giving operators a powerful tool to defend against FDIAs.
Article
Full-text available
In this note, we provide some results of a literature study related to one of the clustering methods, namely K-Means, with some modifications aimed at reducing computation time. The modification concerns the determination of the cluster centers by first applying principal component analysis (PCA); other researchers [4] proposed this method earlier, and the difference in this note lies in the preprocessing of the data before principal component analysis is carried out. A comparison of the accuracy of the clustering results is also given in this note.
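A minimal sketch of the PCA-before-K-Means idea discussed above, with standardization as the extra preprocessing step; the dataset, the number of components, and the number of clusters are illustrative assumptions:

```python
# Sketch: standardize, reduce with PCA, then cluster with K-Means.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

pipeline = make_pipeline(StandardScaler(),                 # preprocessing before PCA
                         PCA(n_components=2),
                         KMeans(n_clusters=3, n_init=10, random_state=0))
labels = pipeline.fit_predict(X)
print("Cluster sizes:", {c: int((labels == c).sum()) for c in set(labels)})
```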
Article
Full-text available
In this paper, we give an overview of our case-based reasoning program, HYPO, which operates in the field of trade secret law. We discuss key ingredients of case-based reasoning, in general, and the correspondence of these to elements of HYPO. We conclude with an extended example of HYPO working through a hypothetical trade secrets case, patterned after an actual case.
Article
Full-text available
The authors propose a new algorithm which builds a feedforward layered network in order to learn any Boolean function of N Boolean units. The number of layers and the number of hidden units in each layer are not prescribed in advance: they are outputs of the algorithm. It is an algorithm for growth of the network, which adds layers, and units inside a layer, at will until convergence. The convergence is guaranteed and numerical tests of this strategy look promising.
Article
Full-text available
norm (defined as the limit of an Lp norm as p approaches zero). In Monte Carlo simulations, both K-modes and the latent class procedures (e.g., Goodman 1974) performed with equal efficiency in recovering a known underlying cluster structure. However, K-modes is an order of magnitude faster than the latent class procedure and suffers from fewer problems of local optima. For data sets involving a large number of categorical variables, latent class procedures become computationally extremely slow and hence infeasible. We conjecture that, although in some cases latent class procedures might perform better than K-modes, K-modes could outperform latent class procedures in other cases. Hence, we recommend that these two approaches be used as "complementary" procedures in performing cluster analysis. We also present an empirical comparison of K-modes and latent class, where the former method prevails.
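A minimal sketch of K-modes clustering on purely categorical data using the third-party kmodes package (using this package is an assumption; it post-dates the paper). The toy data and parameters are illustrative:

```python
# Sketch: K-modes clustering of categorical records with the `kmodes` package.
import numpy as np
from kmodes.kmodes import KModes

X = np.array([
    ["red",   "small",  "yes"],
    ["red",   "small",  "no"],
    ["blue",  "large",  "yes"],
    ["blue",  "large",  "yes"],
    ["green", "medium", "no"],
    ["green", "medium", "no"],
])

km = KModes(n_clusters=3, init="Huang", n_init=5, random_state=0)
labels = km.fit_predict(X)
print("Cluster modes:", km.cluster_centroids_)
print("Labels:", labels)
```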
Article
Full-text available
In this paper we present SPADE, a new algorithm for fast discovery of Sequential Patterns. The existing solutions to this problem make repeated database scans, and use complex hash structures which have poor locality. SPADE utilizes combinatorial properties to decompose the original problem into smaller sub-problems, that can be independently solved in main-memory using efficient lattice search techniques, and using simple join operations. All sequences are discovered in only three database scans. Experiments show that SPADE outperforms the best previous algorithm by a factor of two, and by an order of magnitude with some pre-processed data. It also has linear scalability with respect to the number of input-sequences, and a number of other database parameters. Finally, we discuss how the results of sequence mining can be applied in a real application domain.
Conference Paper
Full-text available
Sequential pattern mining is an important data mining problem with broad applications. It is also a difficult problem since one may need to examine a combinatorially explosive number of possible subsequence patterns. Most of the previously developed sequential pattern mining methods follow the methodology of Apriori since the Apriori-based method may substantially reduce the number of combinations to be examined. However, Apriori still encounters problems when a sequence database is large and/or when sequential patterns to be mined are numerous and/or long. In this paper, we re-examine the sequential pattern mining problem and propose a novel, efficient sequential pattern mining method, called FreeSpan (i.e., Frequent pattern-projected Sequential pattern mining). The general idea of the method is to integrate the mining of frequent sequences with that of frequent patterns and use projected sequence databases to confine the search and the growth of subsequence fragments. FreeSpan mines the complete set of patterns but greatly reduces the efforts of candidate subsequence generation. Our performance study shows that FreeSpan examines a substantially smaller number of combinations of subsequences and runs considerably faster than the Apriori based GSP algorithm.
Conference Paper
Full-text available
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
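A minimal sketch of fitting LDA topics to a small corpus with scikit-learn's variational implementation; the toy documents, the topic count, and the use of a stop list are illustrative assumptions:

```python
# Sketch: LDA topic model over bag-of-words counts, printing top terms per topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stock markets fell sharply", "investors sold shares and bonds",
        "the striker scored a late goal", "the team won the league match"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:4]
    print(f"topic {k}:", [terms[i] for i in top])
```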
Conference Paper
Full-text available
)Jeffrey Scott VitterCenter for Geometric Computingand Department of Computer ScienceDuke UniversityDurham, NC 27708--0129 USAjsv@cs.duke.eduMin WangyCenter for Geometric Computingand Department of Computer ScienceDuke UniversityDurham, NC 27708--0129 USAminw@cs.duke.eduBala IyerDatabase Technology InstituteIBM Santa Teresa LaboratoryP.O. Box 49023San Jose, CA 95161 USAbalaiyer@vnet.ibm.comAbstractThere has recently been an explosion of interest in the analysisof...
Conference Paper
Full-text available
In most current applications of belief networks, domain knowledge is represented by a single belief network that applies to all problem instances in the domain. In more complex domains, problem-specific models must be constructed from a knowledge base encoding probabilistic relationships in the domain. Most work in knowledge-based model construction takes the rule as the basic unit of knowledge. We present a knowledge representation framework that permits the knowledge base designer to specify knowledge in larger semantically meaningful units which we call network fragments. Our framework provides for representation of asymmetric independence and canonical intercausal interaction. We discuss the combination of network fragments to form problem-specific models to reason about particular problem instances. The framework is illustrated using examples from the domain of military situation awareness.
Conference Paper
Full-text available
Constrained gradient analysis (similar to the "cubegrade" problem posed by Imielinski, et al. (9)) is to extract pairs of similar cell characteris- tics associated with big changes in measure in a data cube. Cells are considered similar if they are related by roll-up, drill-down, or 1-dimensional mutation operation. Constrained gradient queries are expressive, capable of capturing trends in data and answering "what-if" questions. To facilitate our discussion, we call one cell in a gradient pair probe cell and the other gradi- ent cell. An efficient algorithm is developed, which pushes constraints deep into the computa- tion process, finding all gradient-probe cell pairs in one pass. It explores bi-directional pruning between probe cells and gradient cells, utilizing transformed measures and dimensions. Moreover, it adopts a hyper-tree structure and an H-cubing method to compress data and maximize sharing of computation. Our performance study shows that this algorithm is efficient and scalable.
Conference Paper
Full-text available
In many applications from telephone fraud detection to network management, data arrives in a stream, and there is a need to maintain a variety of statistical summary information about a large number of customers in an online fashion. At present, such applications maintain basic aggregates such as running extrema values (MIN, MAX), averages, standard deviations, etc., that can be computed over data streams with limited space in a straightforward way. However, many applications require knowledge of more complex aggregates relating different attributes, so-called correlated aggregates. As an example, one might be interested in computing the percentage of international phone calls that are longer than the average duration of a domestic phone call. Exact computation of this aggregate requires multiple passes over the data stream, which is infeasible. We propose single-pass techniques for approximate computation of correlated aggregates over both landmark and sliding window views of a data stream of tuples, using a very limited amount of space. We consider both the case where the independent aggregate (average duration in the example above) is an extrema value and the case where it is an average value, with any standard aggregate as the dependent aggregate; these can be used as building blocks for more sophisticated aggregates. We present an extensive experimental study based on some real and a wide variety of synthetic data sets to demonstrate the accuracy of our techniques. We show that this effectiveness is explained by the fact that our techniques exploit monotonicity and convergence properties of aggregates over data streams.
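To make the example aggregate in the abstract concrete, here is a naive two-pass computation over an in-memory list of call records; the paper's contribution is doing this approximately in a single pass with bounded space, which is not reproduced here, and the toy records are assumptions:

```python
# Sketch: the correlated aggregate "share of international calls longer than the
# average domestic call duration", computed exactly with two passes.
calls = [
    {"type": "domestic", "duration": 120},
    {"type": "domestic", "duration": 300},
    {"type": "international", "duration": 90},
    {"type": "international", "duration": 400},
    {"type": "international", "duration": 250},
]

# Pass 1: independent aggregate (average domestic call duration).
domestic = [c["duration"] for c in calls if c["type"] == "domestic"]
avg_domestic = sum(domestic) / len(domestic)

# Pass 2: dependent aggregate (share of international calls exceeding that average).
intl = [c["duration"] for c in calls if c["type"] == "international"]
share = sum(d > avg_domestic for d in intl) / len(intl)
print(f"{share:.0%} of international calls exceed the average domestic duration ({avg_domestic:.0f}s)")
```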
Conference Paper
Full-text available
For many KDD applications, such as detecting criminal activities in E-commerce, finding the rare instances or outliers can be more interesting than finding the common patterns. Existing work in outlier detection regards being an outlier as a binary property. In this paper, we contend that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier. This degree is called the local outlier factor (LOF) of an object. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. We give a detailed formal analysis showing that LOF enjoys many desirable properties. Using real-world datasets, we demonstrate that LOF can be used to find outliers which appear to be meaningful, but can otherwise not be identified with existing approaches. Finally, a careful performance evaluation of our algorithm confirms that our approach of finding local outliers can be practical.
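A minimal sketch of computing local outlier factors with scikit-learn's implementation of LOF; the two-dimensional toy data and the neighbourhood size are illustrative assumptions:

```python
# Sketch: LOF scores for a dense cluster plus two isolated points (-1 marks outliers).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),            # dense normal cluster
               np.array([[8.0, 8.0], [9.0, -7.0]])])       # two isolated points

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                                 # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_                      # higher = more outlying

worst = np.argsort(scores)[::-1][:3]
print("Most outlying points:\n", X[worst].round(2), "\nLOF:", scores[worst].round(2))
```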
Conference Paper
Full-text available
Datacube queries compute simple aggregates at multiple granularities. In this paper we examine the more general and useful problem of computing a complex subquery involving multiple dependent aggregates at multiple granularities. We call such queries “multi-feature cubes.” An example is “Broken down by all combinations of month and customer, find the fraction of the total sales in 1996 of a particular item due to suppliers supplying within 10% of the minimum price (within the group), showing all subtotals across each dimension.” We classify multi-feature cubes based on the extent to which fine granularity results can be used to compute coarse granularity results; this classification includes distributive, algebraic and holistic multi-feature cubes. We provide syntactic sufficient conditions to determine when a multi-feature cube is either distributive or algebraic. This distinction is important because, as we show, existing datacube evaluation algorithms can be used to compute multi-feature cubes that are distributive or algebraic, without any increase in I/O complexity. We evaluate the CPU performance of computing multi-feature cubes using the datacube evaluation algorithm of Ross and Srivastava. Using a variety of synthetic, benchmark and real-world data sets, we demonstrate that the CPU cost of evaluating distributive multi-feature cubes is comparable to that of evaluating simple datacubes. We also show that a variety of holistic multi-feature cubes can be evaluated with a manageable overhead compared to the distributive case.
Conference Paper
Full-text available
In a telecommunication network, hundreds of millions of call detail records (CDRs) are generated daily. Applications such as tandem traffic analysis require the collection and mining of CDRs on a continuous basis. The data volumes and data flow rates pose serious scalability and performance challenges. This has motivated us to develop a scalable data-warehouse/OLAP framework, and based on this framework, tackle the issue of scaling the whole operation chain, including data cleansing, loading, maintenance, access and analysis. We introduce the notion of dynamic data warehousing for managing information at different aggregation levels with different life spans. We use OLAP servers, together with the associated multidimensional databases, as a computation platform for data caching, reduction and aggregation, in addition to data analysis. The framework supports parallel computation for scaling up data mining, and supports incremental OLAP for providing continuous data mining. A tandem traffic analysis engine is implemented on the proposed framework. In addition to the parallel and incremental computation architecture, we provide a set of application-specific optimization mechanisms for scaling performance. These mechanisms fit well into the above framework. Our experience demonstrates the practical value of the above framework in supporting an important class of telecommunication business intelligence applications.
Conference Paper
Full-text available
Despite the overwhelming amounts of multimedia data recently generated and the significance of such data, very few people have systematically investigated multimedia data mining. Building on our previous studies on content-based retrieval of visual artifacts, we study in this paper methods for mining content-based associations with recurrent items and with spatial relationships from large visual data repositories. A progressive resolution refinement approach is proposed in which frequent item-sets are first mined at rough resolution levels, and finer resolutions are then mined only on the candidate frequent item-sets derived from the rough levels. Such a multi-resolution mining strategy substantially reduces the overall data mining cost without loss of the quality and completeness of the results.
Article
Full-text available
Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Web usage mining consists of three phases, namely preprocessing, pattern discovery, and pattern analysis. This paper describes each of these phases in detail. Given its application potential, Web usage mining has seen a rapid increase in interest, from both the research and practice communities. This paper provides a detailed taxonomy of the work in this area, including research efforts as well as commercial offerings. An up-to-date survey of the existing work is also provided. Finally, a brief overview of the WebSIFT system as an example of a prototypical Web usage mining system is given.
Conference Paper
We discuss data mining based on association rules for two numeric attributes and one Boolean attribute. For example, in a database of bank customers, "Age" and "Balance" are two numeric attributes, and "CardLoan" is a Boolean attribute. Taking the pair (Age, Balance) as a point in two-dimensional space, we consider an association rule of the form ((Age, Balance) ∈ P) ⇒ (CardLoan = Yes), which implies that bank customers whose ages and balances fall in a planar region P tend to use a card loan with high probability. We consider two classes of regions, rectangles and admissible (i.e. connected and x-monotone) regions. For each class, we propose efficient algorithms for computing the regions that give optimal association rules for gain, support, and confidence, respectively. We have implemented the algorithms for admissible regions, and constructed a system for visualizing the rules.
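As a baseline illustration of the rectangle case (not the paper's efficient algorithms), one can simply enumerate discretized rectangles and keep the one maximizing confidence subject to a minimum support; the data and thresholds below are made up.

```python
# Brute-force baseline (not the paper's algorithms): search axis-aligned rectangles
# over discretized Age/Balance bins for the region P maximizing confidence of
# ((Age, Balance) in P) => (CardLoan = Yes), subject to a minimum support.
import itertools
import numpy as np

age = np.array([23, 35, 41, 52, 33, 47, 29, 60])
balance = np.array([1.2, 5.0, 7.5, 3.1, 6.8, 8.0, 0.9, 4.4])   # in thousands
card_loan = np.array([0, 1, 1, 0, 1, 1, 0, 0])                  # Boolean attribute

age_cuts = np.quantile(age, np.linspace(0, 1, 5))               # candidate boundaries
bal_cuts = np.quantile(balance, np.linspace(0, 1, 5))
min_support = 2

best = None
for (a_lo, a_hi), (b_lo, b_hi) in itertools.product(
        itertools.combinations(age_cuts, 2), itertools.combinations(bal_cuts, 2)):
    inside = (age >= a_lo) & (age <= a_hi) & (balance >= b_lo) & (balance <= b_hi)
    support = inside.sum()
    if support < min_support:
        continue
    confidence = card_loan[inside].mean()
    if best is None or confidence > best[0]:
        best = (confidence, support, (a_lo, a_hi), (b_lo, b_hi))

print("best rectangle:", best)
```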
Article
The ability to restructure a decision tree efficiently enables a variety of approaches to decision tree induction that would otherwise be prohibitively expensive. Two such approaches are described here, one being incremental tree induction (ITI), and the other being non-incremental tree induction using a measure of tree quality instead of test quality (DMTI). These approaches and several variants offer new computational and classifier characteristics that lend themselves to particular applications.
Article
We introduce the problem of mining generalized association rules. Given a large database of transactions, where each transaction consists of a set of items, and a taxonomy (is-a hierarchy) on the items, we find associations between items at any level of the taxonomy. For example, given a taxonomy that says that jackets is-a outerwear is-a clothes, we may infer a rule that "people who buy outerwear tend to buy shoes". This rule may hold even if rules that "people who buy jackets tend to buy shoes", and "people who buy clothes tend to buy shoes" do not hold. An obvious solution to the problem is to add all ancestors of each item in a transaction to the transaction, and then run any of the algorithms for mining association rules on these "extended transactions". However, this "Basic" algorithm is not very fast; we present two algorithms, Cumulate and EstMerge, which run 2 to 5 times faster than Basic (and more than 100 times faster on one real-life dataset). We also ...
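A small sketch of the "Basic" idea mentioned in the abstract, extending each transaction with the taxonomy ancestors of its items before counting itemset supports; the taxonomy and transactions are made up for illustration.

```python
# Sketch of the "Basic" approach: add all taxonomy ancestors of each item to its
# transaction, then count itemset supports as usual. Taxonomy and data are made up.
from collections import defaultdict
from itertools import combinations

parent = {"jacket": "outerwear", "outerwear": "clothes", "shoes": "footwear"}

def ancestors(item):
    out = []
    while item in parent:
        item = parent[item]
        out.append(item)
    return out

transactions = [{"jacket", "shoes"}, {"outerwear", "shoes"}, {"jacket"}]
extended = [t | {a for i in t for a in ancestors(i)} for t in transactions]

support = defaultdict(int)
for t in extended:
    for k in (1, 2):                          # count 1- and 2-itemsets only
        for itemset in combinations(sorted(t), k):
            support[itemset] += 1

# support of ("outerwear", "shoes") now covers both jacket+shoes and outerwear+shoes
print(support[("outerwear", "shoes")])
```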
Article
Classification is an important data mining problem. Given a training database of records, each tagged with a class label, the goal of classification is to build a concise model that can be used to predict the class label of future, unlabeled records. A very popular class of classifiers are decision trees. All current algorithms to construct decision trees, including all main-memory algorithms, make one scan over the training database per level of the tree. We introduce a new algorithm (BOAT) for decision tree construction that improves upon earlier algorithms in both performance and functionality. BOAT constructs several levels of the tree in only two scans over the training database, resulting in an average performance gain of 300% over previous work. The key to this performance improvement is a novel optimistic approach to tree construction in which we construct an initial tree using a small subset of the data and refine it to arrive at the final tree. We guarantee that any difference with respect to the "real" tree (i.e., the tree that would be constructed by examining all the data in a traditional way) is detected and corrected. The correction step occasionally requires us to make additional scans over subsets of the data; in practice this situation rarely arises and can be addressed with little added cost. Beyond offering faster tree construction, BOAT is the first scalable algorithm with the ability to incrementally update the tree with respect to both insertions and deletions over the dataset. This property is valuable in dynamic environments such as data warehouses, in which the training dataset changes over time. The BOAT update operation is much cheaper than completely re-building the tree, and the resulting tree is guaranteed to be identical to the tree that would be produced by a complete re-build.
Article
Several splitting criteria for binary classification trees are shown to be written as weighted sums of two values of divergence measures. This weighted sum approach is then used to form two families of splitting criteria. One of them contains the chi-squared and entropy criterion, the other contains the mean posterior improvement criterion. Both family members are shown to have the property of exclusive preference. Furthermore, the optimal splits based on the proposed families are studied. We find that the best splits depend on the parameters in the families. The results reveal interesting differences among various criteria. Examples are given to demonstrate the usefulness of both families.
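For readers who want the basic mechanics behind such criteria, the following generic sketch evaluates a binary split as a weighted sum of the two children's impurities, using the textbook entropy and Gini measures; it illustrates the weighted-sum form discussed above but is not the paper's new families.

```python
# Generic weighted-impurity split evaluation (entropy and Gini), the textbook
# special case of the weighted-sum form discussed above; not the paper's new families.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def gini(p):
    return 1.0 - (p ** 2).sum()

def split_score(y_left, y_right, impurity):
    def dist(y):
        _, counts = np.unique(y, return_counts=True)
        return counts / counts.sum()
    n_l, n_r = len(y_left), len(y_right)
    n = n_l + n_r
    # weighted sum of the two child impurities (lower is better)
    return (n_l / n) * impurity(dist(y_left)) + (n_r / n) * impurity(dist(y_right))

y_left = np.array([0, 0, 0, 1])
y_right = np.array([1, 1, 1, 0])
print(split_score(y_left, y_right, entropy), split_score(y_left, y_right, gini))
```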
Article
This paper presents an empirical comparison of three classification methods: neural networks, decision tree induction and linear discriminant analysis. The comparison is based on seven datasets with different characteristics, four being real, and three artificially created. Analysis of variance was used to detect any significant differences between the performance of the methods. There is also some discussion of the problems involved with using neural networks and, in particular, on overfitting of the training data. A comparison between two methods to prevent overfitting is presented: finding the most appropriate network size, and the use of an independent validation set to determine when to stop training the network.
Article
Two models, the aberrant innovation model and the aberrant observation model, are considered to characterize outliers in time series. The approach adopted here allows for a small probability α that any given observation is ‘bad’ and in this set-up the inference about the parameters of an autoregressive model is considered.
Article
We introduce the Iceberg-CUBE problem as a reformulation of the datacube (CUBE) problem. The Iceberg-CUBE problem is to compute only those group-by partitions with an aggregate value (e.g., count) above some minimum support threshold. The result of Iceberg-CUBE can be used (1) to answer group-by queries with a clause such as HAVING COUNT(*) >= X, where X is greater than the threshold, (2) for mining multidimensional association rules, and (3) to complement existing strategies for identifying interesting subsets of the CUBE for precomputation. We present a new algorithm (BUC) for Iceberg-CUBE computation. BUC builds the CUBE bottom-up; i.e., it builds the CUBE by starting from a group-by on a single attribute, then a group-by on a pair of attributes, then a group-by on three attributes, and so on. This is the opposite of all techniques proposed earlier for computing the CUBE, and has an important practical advantage: BUC avoids computing the larger group-bys that do not meet minimum support. The pruning in BUC is similar to the pruning in the Apriori algorithm for association rules, except that BUC trades some pruning for locality of reference and reduced memory requirements. BUC uses the same pruning strategy when computing sparse, complete CUBEs. We present a thorough performance evaluation over a broad range of workloads. Our evaluation demonstrates that (in contrast to earlier assumptions) minimizing the aggregations or the number of sorts is not the most important aspect of the sparse CUBE problem. The pruning in BUC, combined with an efficient sort method, enables BUC to outperform all previous algorithms for sparse CUBEs, even for computing entire CUBEs, and to dramatically improve Iceberg-CUBE computation.
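The following is a compact, purely illustrative Python rendering of the bottom-up recursion with minimum-support pruning in the spirit of BUC; the real algorithm sorts and partitions in place and manages I/O, none of which is shown, and the example rows are made up.

```python
# Illustrative bottom-up iceberg-cube computation with minimum-support pruning,
# in the spirit of BUC; real BUC sorts/partitions in place and manages I/O.
from collections import defaultdict

def buc(rows, n_dims, minsup, prefix=(), start=0, out=None):
    """Record the count of every group-by whose count >= minsup.
    rows: list of tuples; prefix: fixed (dimension, value) pairs of this group."""
    if out is None:
        out = {}
    if len(rows) < minsup:
        return out
    out[prefix] = len(rows)                       # aggregate for this group-by
    for d in range(start, n_dims):                # refine bottom-up on each dimension
        parts = defaultdict(list)
        for r in rows:
            parts[r[d]].append(r)
        for value, part in parts.items():
            # prune: a partition below minsup cannot yield any qualifying finer group
            if len(part) >= minsup:
                buc(part, n_dims, minsup, prefix + ((d, value),), d + 1, out)
    return out

rows = [("Jan", "A"), ("Jan", "A"), ("Jan", "B"), ("Feb", "A")]
print(buc(rows, n_dims=2, minsup=2))
```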
Article
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
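A simplified, hard-label variant of the EM loop described above can be sketched with scikit-learn's MultinomialNB; the real procedure keeps the posterior probabilities as soft weights, and the documents here are assumed to be already count-vectorized.

```python
# Simplified hard-label variant of the EM loop described above (real EM would keep
# the posteriors as soft weights). Assumes documents are already count-vectorized.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_lab, y_lab, X_unlab, n_iter=10):
    clf = MultinomialNB()
    clf.fit(X_lab, y_lab)                       # 1) train on labeled docs only
    for _ in range(n_iter):
        y_unlab = clf.predict(X_unlab)          # 2) (E-step) label the unlabeled docs
        X_all = np.vstack([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, y_unlab])
        new_clf = MultinomialNB().fit(X_all, y_all)   # 3) (M-step) retrain on everything
        if np.array_equal(new_clf.predict(X_unlab), y_unlab):
            return new_clf                      # converged: labels stopped changing
        clf = new_clf
    return clf
```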
Article
Mining for association rules in market basket data has proved a fruitful area of research. Measures such as conditional probability (confidence) and correlation have been used to infer rules of the form “the existence of item A implies the existence of item B.” However, such rules indicate only a statistical relationship between A and B. They do not specify the nature of the relationship: whether the presence of A causes the presence of B, or the converse, or some other attribute or phenomenon causes both to appear together. In applications, knowing such causal relationships is extremely useful for enhancing understanding and effecting change. While distinguishing causality from correlation is a truly difficult problem, recent work in statistics and Bayesian learning provide some avenues of attack. In these fields, the goal has generally been to learn complete causal models, which are essentially impossible to learn in large-scale data mining applications with a large number of variables. In this paper, we consider the problem of determining causal relationships, instead of mere associations, when mining market basket data. We identify some problems with the direct application of Bayesian learning ideas to mining large databases, concerning both the scalability of algorithms and the appropriateness of the statistical techniques, and introduce some initial ideas for dealing with these problems. We present experimental results from applying our algorithms on several large, real-world data sets. The results indicate that the approach proposed here is both computationally feasible and successful in identifying interesting causal structures. An interesting outcome is that it is perhaps easier to infer the lack of causality than to infer causality, information that is useful in preventing erroneous decision making.
Article
The use of k nearest neighbor (k-NN) and Parzen density estimates to obtain estimates of the Bayes error is investigated under limited design set conditions. By drawing analogies between the k-NN and Parzen procedures, new procedures are suggested, and experimental results are given which indicate that these procedures yield a significant improvement over the conventional k-NN and Parzen procedures. We show that, by varying the decision threshold, many of the biases associated with the k-NN or Parzen density estimates may be compensated, and successful error estimation may be performed in spite of these biases. Experimental results are given which demonstrate the effect of kernel size and shape (Parzen), the size of k (k-NN), and the number of samples in the design set.
Article
The CART concept induction algorithm recursively partitions the measurement space, displaying the resulting partitions as decision trees. Care, however, must be taken not to overfit the trees to the data, and CART employs cross-validation (cv) as the means by which an appropriately sized tree is selected. Although unbiased, cv estimates exhibit high variance, a troublesome characteristic, particularly for small learning sets. This paper describes Monte Carlo experiments which illustrate the effectiveness of the ·632 bootstrap as an alternative technique for tree selection and error estimation. In addition, a new incremental learning extension to CART is described.
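The .632 estimate combines the resubstitution error with the average out-of-bag error as 0.368 · err_resub + 0.632 · err_oob; the sketch below illustrates that calculation with a placeholder classifier and dataset, and is not the paper's Monte Carlo study.

```python
# Sketch of the .632 bootstrap error estimate:
#   err_632 = 0.368 * resubstitution_error + 0.632 * mean out-of-bag error.
# Classifier and dataset are placeholders.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n, B = len(y), 50

resub = 1.0 - DecisionTreeClassifier(random_state=0).fit(X, y).score(X, y)

oob_errors = []
for _ in range(B):
    boot = rng.integers(0, n, size=n)                 # sample n indices with replacement
    oob = np.setdiff1d(np.arange(n), boot)            # points never drawn in this sample
    if len(oob) == 0:
        continue
    clf = DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot])
    oob_errors.append(1.0 - clf.score(X[oob], y[oob]))

err_632 = 0.368 * resub + 0.632 * np.mean(oob_errors)
print(round(err_632, 3))
```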
Article
Neural networks have been successfully applied in a wide range of supervised and unsupervised learning applications. Neural-network methods are not commonly used for data-mining tasks, however, because they often produce incomprehensible models and require long training times. In this article, we describe neural-network learning algorithms that are able to produce comprehensible models, and that do not require excessive training times. Specifically, we discuss two classes of approaches for data mining with neural networks. The first type of approach, often called rule extraction, involves extracting symbolic models from trained neural networks. The second approach is to directly learn simple, easy-to-understand networks. We argue that, given the current state-of-the-art, neural-network methods deserve a place in the tool boxes of data-mining specialists.
Article
Multi-modal classification problems involve the recognition of patterns where the patterns associated with each class can come from disjoint regions in feature space. Traditional linear discriminant methods cannot cope with these problems. While a number of approaches exist for classifying patterns with multiple modes, decision trees and backpropagation neural networks represent leading algorithms with special capabilities for dealing with this problem class. This paper provides a comparison of decision trees with backpropagation neural networks for three distinct multi-modal problems: two from emitter classification and one from digit recognition. These real-world problems provide an interesting range of problem characteristics for our comparison: one emitter classification problem has few features and a large data set; and the other has many features and a small data set. Additionally, both emitter classification problems have real-valued features, while the digit recognition problem has binary-valued features. The results show that both methods produce comparable error rates but that direct application of either method will not necessarily produce the lowest error rate. In particular, we improve decision tree results with multi-variable splits and we improve backpropagation neural networks with feature selection and mode identification.
Article
Building Protos, a learning apprentice system for heuristic classification, has forced us to scrutinize the usefulness of inductive learning and deductive problem solving. While these inference methods have been widely studied in machine learning, their seductive elegance in artificial domains (e.g. mathematics) does not carry-over to natural domains (e.g. medicine). This paper briefly describes our rationale in the Protos system for relegating inductive learning and deductive problem solving to minor roles in support of retaining, indexing, and matching exemplars. The problems that arise from “lazy generalization” are described along with their solutions in Protos. Finally, an example of Pro tos in the domain of clinical audiology is discussed.
Article
Data mining can be regarded as a collection of methods for drawing inferences from data. The aims of data mining, and some of its methods, overlap with those of classical statistics. However, there are some philosophical and methodological differences. We examine these differences, and we describe three approaches to machine learning that have developed largely independently: classical statistics, Vapnik's statistical learning theory, and computational learning theory. Comparing these approaches, we conclude that statisticians and data miners can profit by studying each other's methods and using a judiciously chosen combination of them.
Article
Modelling a target attribute by other attributes in the data is perhaps the most traditional data mining task. When there are many attributes in the data, one needs to know which attribute(s) are relevant for modelling the target, either as a group or as the single feature most appropriate to select within the model construction process in progress. There are many approaches for selecting the attribute(s) in machine learning. We examine various important concepts and approaches that are used for this purpose and contrast their strengths. Discretization of numeric attributes is also discussed, as its use is prevalent in many modelling techniques.
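The two ideas in this abstract, ranking attributes by their relevance to the target and discretizing numeric attributes, can each be illustrated in a few lines with scikit-learn; the dataset and parameter choices below are placeholders, and mutual information is just one of the many relevance measures the survey covers.

```python
# Illustrating the two ideas above with scikit-learn: rank attributes by mutual
# information with the target, and discretize numeric attributes into bins.
# The dataset and parameters are placeholders.
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)

relevance = mutual_info_classif(X, y, random_state=0)   # one score per attribute
print("attribute relevance:", relevance.round(2))

disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
X_binned = disc.fit_transform(X)                        # each numeric attribute -> 3 bins
print("first discretized row:", X_binned[0])
```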
Conference Paper
Recent studies show that constraint pushing may substantially improve the performance of frequent pattern mining, and methods have been proposed to incorporate interesting constraints in frequent pattern mining. However, some popularly encountered constraints are still considered as "tough" constraints which cannot be pushed deep into the mining process. In this study, we extend our scope to those tough constraints and identify an interesting class, called convertible constraints, which can be pushed deep into frequent pattern mining. Then we categorize all the constraints into five classes and show that four of them can be integrated into the frequent pattern mining process. This covers most of the constraints popularly encountered and composed by SQL primitives. Moreover, a new constraint-based frequent pattern mining method, called constrained frequent pattern growth, or simply CFG, which integrates constraint pushing with a recently developed frequent pattern growth method, is developed. We show this integration opens more room on constraint pushing since finer constraint checking can be enforced on each projected database. Our performance study shows that the method is powerful and outperforms substantially the existing constrained frequent pattern mining algorithms.
Conference Paper
Sequential pattern mining, which finds the set of frequent subsequences in sequence databases, is an important data-mining task and has broad applications. Usually, sequence patterns are associated with different circumstances, and such circumstances form a multiple dimensional space. For example, customer purchase sequences are associated with region, time, customer group, and others. It is interesting and useful to mine sequential patterns associated with multi-dimensional information. In this paper, we propose the theme of multi-dimensional sequential pattern mining, which integrates the multidimensional analysis and sequential data mining. We also thoroughly explore efficient methods for multi-dimensional sequential pattern mining. We examine feasible combinations of efficient sequential pattern mining and multi-dimensional analysis methods, as well as develop uniform methods for high-performance mining. Extensive experiments show the advantages as well as limitations of these methods. Some recommendations on selecting proper method with respect to data set properties are drawn.
Conference Paper
In this paper, we address the issue of evaluating decision trees generated from training examples by a learning algorithm. We give a set of performance measures and show how some of them relate to others. We derive results suggesting that the number of leaves in a decision tree is the important measure to minimize. Minimizing this measure will, in a probabilistic sense, improve performance along the other measures. Notably it is expected to produce trees whose error rates are less likely to exceed some acceptable limit. The motivation for deriving such results is two-fold: 1. To better understand what constitutes a good measure of performance, and 2. To provide guidance when deciding which aspects of a decision tree generation algorithm should be changed in order to improve the quality of the decision trees it generates. The results presented in this paper can be used as a basis for a methodology for formally proving that one decision tree generation algorithm is better than another. This would provide a more satisfactory alternative to the current empirical evaluation method for comparing algorithms.
Conference Paper
We address the problem of selecting an attribute and some of its values for branching during the top-down generation of decision trees. We study the class of impurity measures, members of which are typically used in the literature for selecting attributes during decision tree generation (e.g. entropy in ID3, GID3*, and CART; Gini Index in CART). We argue that this class of measures is not particularly suitable for use in classification learning. We define a new class of measures, called C-SEP, that we argue is better suited for the purposes of class separation. A new measure from C-SEP is formulated and some of its desirable properties are shown. Finally, we demonstrate empirically that the new algorithm, O-BTree, that uses this measure indeed produces better decision trees than algorithms that use impurity measures.
Conference Paper
Applications of techniques from linear algebra to information retrieval and hypertext analysis are discussed. The focus is on linear algebraic methods that make use of eigenvectors and the singular value decomposition. A variety of ways in which methods from linear algebra can be brought to bear on problems in information retrieval and hypertext analysis is presented. The application of such techniques is made possible by the long-standing use of vector representations for documents in information retrieval, and by the deep connections that exist between the combinatorics of link structures and the eigenvectors of their adjacency matrices.
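One concrete instance of the eigenvector machinery referred to above is a hub/authority computation on a link graph in the style of HITS; the tiny power-iteration sketch below uses a made-up adjacency matrix and is meant only to show how link-structure eigenvectors arise.

```python
# Tiny power-iteration sketch of a HITS-style hub/authority computation on a
# made-up link graph; authorities converge to the principal eigenvector of A^T A,
# hubs to that of A A^T.
import numpy as np

A = np.array([[0, 1, 1, 0],     # A[i, j] = 1 if page i links to page j
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

hubs = np.ones(A.shape[0])
auths = np.ones(A.shape[0])
for _ in range(100):
    auths = A.T @ hubs              # a page is a good authority if good hubs link to it
    auths /= np.linalg.norm(auths)
    hubs = A @ auths                # a page is a good hub if it links to good authorities
    hubs /= np.linalg.norm(hubs)

print("authorities:", auths.round(3))
print("hubs:       ", hubs.round(3))
```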
Conference Paper
The problem of mining spatiotemporal patterns is finding sequences of events that occur frequently in spatiotemporal datasets. Spatiotemporal datasets store the evolution of objects over time. Examples include sequences of sensor images of a geographical region, data that describes the location and movement of individual objects over time, or data that describes the evolution of natural phenomena, such as forest coverage. The discovered patterns are sequences of events that occur most frequently. In this paper, we present DFS_MINE, a new algorithm for fast mining of frequent spatiotemporal patterns in environmental data. DFS_MINE, as its name suggests, uses a Depth-First-Search-like approach to the problem which allows very fast discovery of long sequential patterns. DFS_MINE performs database scans to discover frequent sequences rather than relying on information stored in main memory, which has the advantage that the amount of space required is minimal. Previous approaches utilize a Breadth-First-Search-like approach and are not efficient for discovering long frequent sequences. Moreover, they require storing in main memory all occurrences of each sequence in the database and, as a result, the amount of space needed is rather large. Experiments show that the I/O cost of the database scans is offset by the efficiency of the DFS-like approach that ensures fast discovery of long frequent patterns. DFS_MINE is also ideal for mining frequent spatiotemporal sequences with various spatial granularities. Spatial granularity refers to how fine or how general our view of the space we are examining is.
Conference Paper
This paper describes ID5, an incremental algorithm that produces decision trees similar to those built by Quinlan's ID3. The principal benefit of ID5 is that new training instances can be processed by revising the decision tree instead of building a new tree from scratch. ID3, ID4, and ID5 are compared theoretically and empirically.
Conference Paper
Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist prolific patterns and/or long patterns. In this study, we propose a novel frequent pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a highly condensed, much smaller data structure, which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent pattern mining methods.
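The following is a compact, non-optimized Python rendering of the FP-tree construction and recursive pattern-fragment growth described above; unlike a real implementation, it expands each conditional pattern base into an explicit transaction list purely to keep the sketch short, and the example database is made up.

```python
# Compact, non-optimized rendering of FP-tree construction and FP-growth.
# A real implementation would not expand conditional pattern bases into explicit
# transaction lists; that is done here only for brevity.
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_tree(transactions, minsup):
    freq = defaultdict(int)
    for t in transactions:
        for i in set(t):
            freq[i] += 1
    freq = {i: c for i, c in freq.items() if c >= minsup}
    root, header = Node(None, None), defaultdict(list)
    for t in transactions:
        node = root
        # insert frequent items in descending frequency order (ties broken by name)
        for i in sorted((i for i in set(t) if i in freq), key=lambda i: (-freq[i], i)):
            child = node.children.get(i)
            if child is None:
                child = Node(i, node)
                node.children[i] = child
                header[i].append(child)
            child.count += 1
            node = child
    return header, freq

def fp_growth(transactions, minsup, suffix=()):
    header, freq = build_tree(transactions, minsup)
    patterns = {}
    for item in freq:                                  # grow each frequent item
        pattern = suffix + (item,)
        patterns[pattern] = sum(n.count for n in header[item])
        cond = []                                      # conditional pattern base
        for n in header[item]:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond.extend([path] * n.count)
        patterns.update(fp_growth(cond, minsup, pattern))
    return patterns

db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c", "d"]]
print(fp_growth(db, minsup=3))
```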
Conference Paper
We investigate the problem of learning DNF concepts from examples using decision trees as a concept description language. Due to the replication problem, DNF concepts do not always have a concise decision tree description when the tests at the nodes are limited to the initial attributes. However, the representational complexity may be overcome by using high level attributes as tests. We present a novel algorithm that modifies the initial bias determined by the primitive attributes by adaptively enlarging the attribute set with high level attributes. We show empirically that this algorithm outperforms a standard decision tree algorithm for learning small random DNF with and without noise, when the examples are drawn from the uniform distribution.
Conference Paper
Probabilistic networks, which provide compact descriptions of complex stochastic relationships among several random variables, are rapidly becoming the tool of choice for uncertain reasoning in artificial intelligence. We show that networks with fixed structure containing hidden variables can be learned automatically from data using a gradient-descent mechanism similar to that used in neural networks. We also extend the method to networks with intensionally represented distributions, including networks with continuous variables and dynamic probabilistic networks. Because probabilistic networks provide explicit representations of causal structure, human experts can easily contribute prior knowledge to the training process, thereby significantly improving the learning rate. Adaptive probabilistic networks (APNs) may soon compete directly with neural networks as models in computational neuroscience as well as in industrial and financial applications.
Article
Rule induction can achieve orders of magnitude reduction in the volume of data descriptions. For example, we applied a commercial tool (IXL™) to a 1,819 record tropical storm database, yielding 161 rules. However, human comprehension of the discovered results may require further reduction. We present a rule refinement strategy, partly implemented in a Prolog program, that operationalizes “interestingness” into performance, simplicity, novelty, and significance. Applying the strategy to the induced rulebase yielded 10 “genuinely interesting” rules.
Article
Research in data mining and knowledge discovery in databases has mostly concentrated on developing good algorithms for various data mining tasks (see for example the recent proceedings of KDD conferences). Some parts of the research effort have gone to investigating the data mining process, user interface issues, database topics, or visualization (7). Relatively little has been published about the theoretical foundations of data mining. In this paper I present some possible theoretical approaches to data mining. The area is in its infancy, and there probably are more questions than answers in this paper. First of all one has to answer questions such as "Why look for a theory of data mining? Data mining is an applied area, why should we care about having a theory for it?" Probably the simplest answer is to recall the development of the area of relational databases. Databases existed already in the 1960s, but the field was considered to be a murky backwater of different applications without any clear structure and without any interesting theoretical issues. Codd's relational model was a nice and simple framework for specifying the structure of data and the operations to be performed on it. The mathematical elegance of the relational model made it possible to develop advanced methods of query optimization and transactions, and these in turn made efficient general purpose database management systems possible. The relational model is a clear example of how theory in computer science has transformed an area from a hodgepodge of unconnected methods to an interesting and understandable whole, and at the same time enabled an area of industry. Given that theory is useful, what would be the properties that a theoretical framework should satisfy in order that it could be
Article
Case-based reasoning (CBR) is an approach to problem solving based on the retrieval and adaptation of cases, or episodic descriptions of problems and their associated solutions. In theory, CBR can be considered as a five-step problem-solving process.