Performance comparison on Statlog dataset

Source publication
Preprint
Full-text available
Heart disease is the leading cause of morbidity and mortality in the world. It encompasses numerous conditions and symptoms. Diagnosing heart disease is difficult because there are many factors to analyze; moreover, the misclassification cost can be very high. In this paper, I first propose a cost-sensitive ensemble model to improve the...
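The preprint's full text is not reproduced here, but the idea the abstract names, penalising the costly error (missing a diseased patient) more heavily than a false alarm, can be sketched with ordinary class weights. The weight values and the synthetic 13-feature stand-in for the Statlog data below are illustrative assumptions, not the author's actual cost matrix or model.

```python
# Illustrative sketch only: cost-sensitive classification via class weights.
# The weights (1 for "healthy", 5 for "disease") are assumed for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for a heart-disease dataset (Statlog has 270 samples, 13 features).
X, y = make_classification(n_samples=270, n_features=13, weights=[0.56, 0.44],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# class_weight makes a missed positive case roughly 5x as costly as a false alarm.
clf = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 5})
clf.fit(X_tr, y_tr)
print(confusion_matrix(y_te, clf.predict(X_te)))
```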

Similar publications

Article
Full-text available
Recently, the healthcare industry has started generating a large volume of data. If hospitals can leverage these data, they could predict outcomes and provide better treatment at early stages and at low cost. Here, data analytics (DA) was used to make correct decisions through proper analysis and prediction. However, inappropriate data may...
Article
Full-text available
Metastatic cancers account for up to 90% of cancer-related deaths. The clear differentiation of metastatic cancers from primary cancers is crucial for cancer type identification and developing targeted treatment for each cancer type. DNA methylation patterns are suggested to be an intriguing target for cancer prediction and are also considered to b...
Conference Paper
Full-text available
At the moment, the most prevalent form of cancer diagnosed in women across the globe is breast cancer. It develops in the breast tissue and is one of the most frequent causes of women's death. This cancer can be cured if it is diagnosed at a preliminary stage. Malignant and benign are the two types of tumor found in breast cancer. Malignant tumors are...
Article
Full-text available
The correct classification of requirements has become an essential task within software engineering. This study presents a comparison among text feature extraction techniques and machine learning algorithms for the problem of requirements classification, to answer the two major questions "Which works best (Bag of Words (BoW) vs. Term Frequ...
Article
Full-text available
Breast cancer is one of the most common diseases among women, accounting for many deaths each year. Even though cancer can be treated and cured in its early stages, many patients are diagnosed at a late stage. Data mining is the method of finding or extracting information from massive databases or datasets, and it is a field of computer science wit...

Citations

... An issue arises when the magnitude of one feature surpasses that of others, resulting in its dominance over the remaining features. Consequently, raw data must be scaled to mitigate the influence of varying quantitative units [31], [32]. Normalization is a common method for rescaling feature values. ...
Article
Full-text available
Employee turnover poses a critical challenge that affects many organizations globally. Although advanced machine learning algorithms offer promising solutions for predicting turnover, their effectiveness in real-world scenarios is often limited because of their inability to fully utilize the relational structure within tabulated employee data. To address this gap, this study introduces a promising framework that converts traditional tabular employee data into a knowledge graph structure, harnessing the power of Graph Convolutional Networks (GCN) for more nuanced feature extraction. The proposed methodology extends beyond prediction and incorporates explainable artificial intelligence (XAI) techniques to unearth the pivotal factors influencing an employee’s decision to either remain with or depart from a particular organization. The empirical analysis was conducted using a comprehensive dataset from IBM that includes the records of 1,470 employees. We benchmarked the performance against five prevalent machine learning models and observed that our enhanced linear Support Vector Machine (L-SVM) model, combined with knowledge-graph-based features, achieved an impressive accuracy of 0.925. Moreover, the successful integration of XAI techniques for attribute evaluation sheds light on the significant impact of job environment, job satisfaction, and job involvement on turnover intentions. This study not only furthers the development of advanced predictive models for employee turnover but also provides organizations with actionable insights to strategically address and reduce turnover rates.
... In nature, the class imbalance problem arises when the number of instances belonging to one class overwhelms that of the other classes, which causes some traditional supervised learning algorithms to focus excessively on the majority class and depresses performance on the minority classes. Over the past two decades, a number of learning algorithms for addressing the imbalanced classification problem have been proposed; they can be roughly divided into the following four categories: sampling [13][14][15], cost-sensitive learning [16][17][18], decision threshold moving (DTM) [19][20][21][22], and ensemble learning [23][24][25][26]. Sampling can be regarded as a pre-processing technique for class imbalance learning, as it balances the data distribution of the different classes by either adding instances of the minority class or removing instances of the majority class. ...
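As a minimal illustration of the first category (sampling), the sketch below randomly oversamples a synthetic minority class until the two classes are balanced. The data sizes and feature count are assumptions made only for the example, not values taken from the cited works.

```python
# Minimal sketch of the "sampling" category: random oversampling of the
# minority class using scikit-learn's resample utility.
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X_maj = rng.randn(900, 5)          # majority-class instances (assumed sizes)
X_min = rng.randn(100, 5) + 2.0    # minority-class instances

# Duplicate minority instances (with replacement) until both classes match.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.hstack([np.zeros(len(X_maj)), np.ones(len(X_min_up))])
print(X_bal.shape, np.bincount(y_bal.astype(int)))  # (1800, 5) [900 900]
```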
Article
Full-text available
Class imbalance learning (CIL), which aims to address the performance degradation of traditional supervised learning algorithms under skewed data distributions, has become one of the research hotspots in machine learning, data mining, and artificial intelligence. As a post-processing CIL technique, decision threshold moving (DTM) has been verified to be an effective strategy for addressing the class imbalance problem. However, whether the threshold is designated randomly or optimally, the classification hyperplane can only be moved in parallel and cannot change its orientation, so its performance is restricted, especially on complex data with varying density. To further improve the existing DTM strategies, we propose an improved algorithm called CDTM that divides the majority training instances into multiple regions of different density and then conducts the DTM procedure on each region independently. Specifically, we adopt the well-known DBSCAN clustering algorithm to split the training set, as it adapts well to density variation. In the context of support vector machines (SVM) and extreme learning machines (ELM), we verified the effectiveness and superiority of the proposed CDTM algorithm. The experimental results on 40 benchmark class-imbalance datasets indicate that CDTM is superior to several other state-of-the-art DTM algorithms in terms of the G-mean performance metric.
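For readers unfamiliar with the baseline this abstract builds on, here is a minimal sketch of plain decision threshold moving (not the clustered CDTM variant): the positive class is predicted whenever its posterior probability exceeds a lowered cut-off. The dataset, the classifier, and the 0.3 threshold are illustrative assumptions.

```python
# Sketch of plain decision threshold moving (DTM): lower the cut-off so more
# instances are assigned to the minority (positive) class.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)[:, 1]
default_pred = (proba >= 0.5).astype(int)   # standard decision rule
moved_pred = (proba >= 0.3).astype(int)     # threshold moved toward the minority class

print("minority recall @0.5:", recall_score(y, default_pred))
print("minority recall @0.3:", recall_score(y, moved_pred))
```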
... A problem arises when one feature's magnitude is higher than the rest, as it will then naturally dominate other features. As a consequence, raw data should be scaled to fit classification algorithms and eliminate the impact of various quantitative units [28]. Therefore, in this research, the MinMaxScaler technique was used to rescale the features between 0 and 1. ...
... A problem arises when one feature's magnitude is higher than the rest, as it will then naturally dominate other features. As a consequence, raw data should be scaled to fit classification algorithms and eliminate the impact of various quantitative units [28]. Therefore, in this research, the MinMaxScaler technique was used to rescale the features between 0 and 1. The benefit of this technique is that it is robust to outliers as it uses statistical techniques that do not affect the variance of the data (Equation (1)) [9]. ...
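Equation (1) referred to above is presumably the standard min-max rescaling, x' = (x - x_min) / (x_max - x_min). A minimal sketch with scikit-learn's MinMaxScaler on an assumed toy matrix:

```python
# Sketch of min-max rescaling to [0, 1]: x' = (x - x_min) / (x_max - x_min).
# The toy feature matrix is an illustrative assumption.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 800.0]])

scaler = MinMaxScaler()          # defaults to the [0, 1] range
X_scaled = scaler.fit_transform(X)
print(X_scaled)                  # each column now spans exactly [0, 1]
```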
... According to the No Free Lunch theorem, no single model or algorithm can handle all classification problems [26,28]. Furthermore, each algorithm has its own advantages and disadvantages, as illustrated in Table 2 [16,[41][42][43][44]. Consequently, combining several algorithms offsets the weaknesses of any single one, such as overfitting. ...
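One common way to realise the combination this excerpt argues for is a voting ensemble; the sketch below averages the predicted probabilities of three dissimilar base learners. The specific learners and the synthetic data are assumptions for illustration, not the cited paper's configuration.

```python
# Sketch of combining several algorithms so that no single model's weaknesses
# dominate: a soft-voting ensemble over three different base learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=13, random_state=0)

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("nb", GaussianNB())],
    voting="soft",               # average the predicted class probabilities
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```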
Article
Full-text available
The negative effect of financial crimes on financial institutions has grown dramatically over the years. To detect crimes such as credit card fraud, several single and hybrid machine learning approaches have been used. However, these approaches have significant limitations, as different hybrid algorithms were not further investigated on a given dataset. This research proposes and investigates seven hybrid machine learning models to detect fraudulent activities with a real-world dataset. The developed hybrid models consisted of two phases: state-of-the-art machine learning algorithms were first used to detect credit card fraud; then, hybrid methods were constructed based on the best single algorithm from the first phase. Our findings indicate that the hybrid model AdaBoost + LGBM is the champion model, as it displayed the highest performance. Future studies should focus on different types of hybridization and algorithms in the credit card domain.
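The abstract does not specify how its two-phase hybrids are wired together; one plausible, purely illustrative reading is a stacking arrangement in which a strong phase-one learner is combined with a second boosting model. LightGBM is replaced by scikit-learn's GradientBoostingClassifier here only to keep the sketch dependency-free; all names and settings below are assumptions, not the paper's method.

```python
# Hedged sketch of a two-phase hybrid read as stacking: two boosting learners
# feed a simple meta-learner. Not the cited paper's actual construction.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Imbalanced synthetic data as a stand-in for a fraud dataset (assumed sizes).
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

hybrid = StackingClassifier(
    estimators=[("ada", AdaBoostClassifier(random_state=0)),
                ("gbm", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
)
print(cross_val_score(hybrid, X, y, cv=5, scoring="roc_auc").mean())
```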
Article
Full-text available
Heart disease (HD) is a major threat to human health, and the medical field generates vast amounts of data that doctors struggle to interpret and use effectively. Early prediction and classification of HD types are crucial for effective medical treatment. Researchers have found it important to use learning-based techniques from machine and deep learning, such as supervised models and deep neural networks, to develop automatic models for HD. These techniques have been used to simulate HD management and to extract important features from complex data sets. This survey examines various HD prediction models, classifying the learning-based techniques, datasets, and contexts used, and analyzing the performance metrics of each contribution. It also clarifies which method suits each type of HD. As data sets grow, researchers are increasingly utilizing these techniques to create more precise models. However, there is still much work to be done to improve the accuracy of HD predictions.